Anthony J. Pennings, PhD

WRITINGS ON DIGITAL STRATEGIES, ICT ECONOMICS, AND GLOBAL COMMUNICATIONS

Working Big Data – Hadoop and the Transformation of Data Processing

Posted on | February 15, 2013 | No Comments


One day Google downloaded the Internet, and wanted to play with it.

Well, that is my version of an admittedly mythologized origin story for what is now commonly called “Big Data.”

Early on, Google developed a number of new applications to manage a wide range of online services such as advertising, free email, blog publishing, and free search. Each required sophisticated telecommunications, storage, and analytical techniques to work and be profitable. In the wake of the dot-com and subsequent telecom crashes, Google started to buy up cheap fiber-optic lines from defunct companies like Enron and Global Crossing to increase connection and interconnection speeds. Google also created huge data centers to collect, store, and index this information. Their software success enabled them to become a major disruptor of the advertising and publishing industries and turned them into a major global corporation now making over US$50 billion a year in revenues. These innovations would also help drive the development of Big Data – the unprecedented use of massive amounts of information from a wide variety of sources to solve business and other problems.

Unable to buy the type of software it needed from any known vendor, Google developed its own solutions to fetch and manage the petabytes of information it was downloading from the World Wide Web on a regular basis. Like other Silicon Valley companies, Google drew on the competitive cluster’s rich sources of talent and ideas, including Stanford University. Other companies such as Teradata were also developing parallel-processing hardware and software for data centers, but Google was able to raise the investment capital to attract the talent to produce an extraordinary range of proprietary database technology. The Google File System (GFS) was created to distribute files reliably across its many inexpensive commodity server/storage systems. A program called Borg emerged as an automated way to distribute incoming workloads among its myriad machines, a process called “load balancing.” Bigtable scaled data management and storage to enormous sizes. Perhaps the most critical part of the software equation was MapReduce, a low-level, almost assembly-like framework that allowed Google to write applications that could take advantage of the large datasets distributed across its “cloud” of servers.[1] With these software solutions in place, Google began creating huge warehouse-sized data centers to collect, store, and index information.
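The MapReduce programming model mentioned above can be illustrated with a short word-count sketch. This is a toy, single-machine illustration of the map–shuffle–reduce pattern, not Google's implementation; the sample documents are invented:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(doc_id, text):
    """Map: emit an intermediate (key, value) pair for each word."""
    for word in text.lower().split():
        yield (word, 1)

def reduce_phase(word, counts):
    """Reduce: combine all values collected for one key."""
    return (word, sum(counts))

def map_reduce(documents):
    # Map step: run the mapper over every input record.
    intermediate = []
    for doc_id, text in documents.items():
        intermediate.extend(map_phase(doc_id, text))
    # Shuffle step: group intermediate pairs by key (the framework's job,
    # done across machines in a real cluster).
    intermediate.sort(key=itemgetter(0))
    # Reduce step: one reducer call per distinct key.
    return dict(
        reduce_phase(word, (count for _, count in pairs))
        for word, pairs in groupby(intermediate, key=itemgetter(0))
    )

docs = {"d1": "big data big ideas", "d2": "big clusters"}
print(map_reduce(docs))  # {'big': 3, 'clusters': 1, 'data': 1, 'ideas': 1}
```

The appeal of the model is that the mapper and reducer are the only parts an application programmer writes; the framework handles distribution, grouping, and fault tolerance.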

When Google published the conceptual basis for MapReduce in 2004, most database experts did not comprehend its implications, but not surprisingly, a few at Yahoo! were very curious. By then, the whole field of data management and processing was facing new challenges, particularly for those running data warehouses for hosting, search, and other applications. Data was growing exponentially; it was splintering into many different formats; data models and schemas were evolving; and, probably most challenging of all, data was becoming ever more useful and enticing to businesses and other organizations, including those in politics. While relational databases would continue to be used, a new framework for data processing was in the works. Locked in a competitive battle with Google, Yahoo! strove to catch up by developing its own parallel-processing power.[2]

Doug Cutting, who would later join Yahoo!, was also working on software that could “crawl” the Web for content and then organize it so it could be searched. Called Nutch, his software agent or “bot” tracked down URLs and selectively downloaded webpages from thousands of hosts, where they would be indexed by another program he created called Lucene. Nutch could “fetch” data and run on clusters of hundreds of distributed servers. Nutch and Lucene led to the development of Hadoop, which drew on the concepts that had been designed into Google’s MapReduce. With MapReduce providing the programming framework, Cutting separated the “data-parallel processing engine out of the Nutch crawler” to create Apache Hadoop, an open-source project intended to make it faster, easier, and cheaper to process and analyze large volumes of data.[3]
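The fetch-then-index pattern that Nutch and Lucene embodied can be sketched in a few lines. The “fetched pages” below are invented stand-ins for content a crawler would download over HTTP, and a real Lucene index is far richer than this, but the core structure is the inverted index: a map from each term to the documents that contain it.

```python
from collections import defaultdict

# Invented stand-in for pages a crawler like Nutch would fetch over HTTP.
FETCHED_PAGES = {
    "http://example.org/a": "hadoop processes big data",
    "http://example.org/b": "lucene indexes text for search",
    "http://example.org/c": "big data needs big clusters",
}

def build_inverted_index(pages):
    """Index step (Lucene's role): map each term to the URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index

def search(index, term):
    """Return the URLs whose text contains the query term."""
    return sorted(index.get(term.lower(), set()))

index = build_inverted_index(FETCHED_PAGES)
print(search(index, "big"))
# ['http://example.org/a', 'http://example.org/c']
```

Because lookup happens against the prebuilt index rather than the raw pages, queries stay fast no matter how many hosts the crawler has visited – the same reason search engines separate crawling from indexing.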

Amr Awadallah of Cloudera is one of the best spokesmen for Hadoop.

By 2007, Hadoop began to circulate as a new open-source software engine for Big Data initiatives. Built on Google’s and Yahoo!’s indexing and search technology, it was adopted by companies like Amazon, Facebook, Hulu, IBM, and the New York Times. Hadoop is, in a sense, a new type of operating system – directing workloads, performing queries, and conducting analyses – but at an unprecedented scale. It was designed to work across many low-cost storage/server systems: it manages files spread over those machines and also runs applications on top of those files. Hadoop made use of data from mobile devices, PCs, and the whole Internet of “things” such as cars, cash registers, and home environmental systems. Information from these grids of data collection increasingly became fodder for analysis and innovative value creation.
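The idea of one system spreading files across many servers and running the same work on each can be sketched as hash partitioning plus a scatter–gather query. This is a single-process toy with simulated nodes and invented sample records; a real Hadoop cluster distributes HDFS blocks across machines and schedules tasks on the machines that already hold the data:

```python
NUM_NODES = 4  # simulated storage/compute nodes

def partition(records, num_nodes=NUM_NODES):
    """Distribute records across nodes by hashing the key (HDFS-like spread)."""
    nodes = [[] for _ in range(num_nodes)]
    for key, value in records:
        nodes[hash(key) % num_nodes].append((key, value))
    return nodes

def scatter_gather(nodes, predicate):
    """Scatter: run the same filter on every node. Gather: merge the partials."""
    partials = [[rec for rec in node if predicate(rec)] for node in nodes]
    return [rec for part in partials for rec in part]

# Invented sample data: (device_id, reading) pairs from many sources.
records = [(f"device-{i}", i * 10) for i in range(12)]
nodes = partition(records)
hot = scatter_gather(nodes, lambda rec: rec[1] >= 80)
print(sorted(hot))
```

Each node only ever scans its own slice, so adding machines adds both storage capacity and query throughput – the economic shift the next paragraph describes.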

In retrospect, the rise of Big Data marked a major transition in the economics and technology of data. Traditional database systems saved information to archival media like magnetic tape, which made it expensive to retrieve and reuse. In their place came low-cost servers with central processing units that could run programs both within an individual server and across an array of them. Large data centers emerged with networked storage equipment that made it possible to perform operations across tens of thousands of distributed servers and produce immediate results. Hadoop and related software solutions were developed to run these data centers – managing, storing, and processing large datasets, including unstructured data such as video files from the larger world of the Internet. Big Data emerged from its infancy and began to farm the myriad mobile devices and other data-producing instruments for a wide range of new analytical and commercial purposes.


Notes

[1] Steven Levy’s career of ground-breaking research includes this article on Google’s top secret data centers.
[2] Amr Awadallah listed these concerns at Cloud 2012 in Honolulu, June 24.
[3] Quote from Mike Olsen, CEO of Cloudera.


© ALL RIGHTS RESERVED



Anthony J. Pennings, PhD, is Professor and Associate Chair of the Department of Technology and Society, State University of New York, Korea. Before joining SUNY, he taught at Hannam University in South Korea and from 2002-2012 was on the faculty of New York University. Previously, he taught at St. Edward's University in Austin, Texas, Marist College in New York, and Victoria University in New Zealand. He has also spent time as a Fellow at the East-West Center in Honolulu, Hawaii.




