2008-02-20

hadoop, part 1: approach

Recently I started putting together a collection of code examples based on the Hadoop open source project at Apache. These examples are available as open source on Google Code at:


This project is intended to explore applications of MapReduce programming techniques for handling large data sets within a distributed computing framework. It also presents additional Hadoop examples (currently based on version 0.15.3) to help other developers get past Hadoop's learning curve. You'll need to download and install Hadoop first.
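
To make the discussion concrete, here is the canonical "word count" job, roughly as it appears in the Hadoop tutorials of this era. This is a minimal sketch written against the older org.apache.hadoop.mapred API; a few details (generics on the Mapper/Reducer interfaces, the FileInputFormat path helpers) arrived in releases shortly after 0.15.3, so treat the exact signatures as illustrative rather than as a drop-in for that version.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

  // Mapper: emit (word, 1) for every token on each input line
  public static class MapClass extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        output.collect(word, ONE);
      }
    }
  }

  // Reducer: sum the counts for each word after the framework's sort/merge
  public static class ReduceClass extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(MapClass.class);
    conf.setCombinerClass(ReduceClass.class);  // local pre-aggregation on each mapper
    conf.setReducerClass(ReduceClass.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}

Packaged into a jar, it runs with something like: bin/hadoop jar wordcount.jar WordCount input-dir output-dir. The part worth staring at is not the arithmetic but the shape: the mapper emits (key, value) pairs, the framework sorts and groups them by key across the cluster, and the reducer folds each group down to a result.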

Yahoo! developers announced today that Hadoop is in production use supporting their search engine, running on more than 10,000 cores. Some have estimated that as approaching 10% of the size of Google's server farm running MapReduce. Given that Hadoop is still relatively "early code" as an open source project – in other words, waaay prior to its "release 1.0" – that's impressive!

Google provides an excellent set of lectures online introducing these topics, entitled "Cluster Computing and MapReduce". I highly recommend them. While you're studying up, be sure to read the Dynamo paper by Amazon.com CTO Werner Vogels.

Those emerging priorities are not so much based on a need to crunch the data faster, but rather on a need to scale to handle very large data sets reliably and cost-effectively – even if, statistically, the data quality is questionable, the third-party libraries are buggy, and some of the processors fail. Generally speaking, for large-scale problems that approach does end up crunching the data faster overall. For example, if you happen to be Last.fm or Rackspace.com, then even relatively mundane business needs like log file analysis become very large data problems ... and Hadoop is providing their solution.
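
To make the log-analysis case concrete, here is a hypothetical sketch of my own (not code from Last.fm or Rackspace) of a mapper that tallies requests per hour of day from Apache-style access logs. Paired with the same summing reducer as in word count, it boils terabytes of logs down to a small hourly histogram.

// (uses the same imports as the word-count listing above; add this alongside
//  MapClass/ReduceClass inside the WordCount class, purely as an illustration)
// Assumes common log format with the timestamp in square brackets, e.g.
// 127.0.0.1 - - [20/Feb/2008:14:31:02 -0800] "GET / HTTP/1.0" 200 2326
public static class LogHourMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable ONE = new IntWritable(1);
  private final Text hour = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    int open = line.indexOf('[');
    if (open < 0) {
      return;  // skip malformed lines rather than failing the whole job
    }
    // timestamp looks like 20/Feb/2008:14:31:02 -0800 ; the hour follows the first ':'
    int colon = line.indexOf(':', open);
    if (colon < 0 || colon + 3 > line.length()) {
      return;
    }
    hour.set(line.substring(colon + 1, colon + 3));  // e.g. "14"
    output.collect(hour, ONE);
  }
}

The point of the sketch is the attitude it encodes: bad records get skipped, the work is embarrassingly parallel across log files, and the framework handles distribution, sorting, and retries.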

One interesting point is that this style of coding and architecture may not quite fit the first-foot-forward techniques that a "pre-Millennial" crop of software developers learned and tend to practice. Bread-and-butter concepts such as relational databases, O/R mapping for persistence layers, model-view-controller, threading – each has its purposes, but they no longer provide the default patterns. Instead, in between the lines of whitepapers about Dynamo, MapReduce, Hadoop, etc., we catch glimpses of the architecture of IBM mainframe programming: hierarchical databases, ETL for data streaming, zSeries hardware with many cores, Shark, MQSeries, etc. The underlying technologies for those date back to the 1960s, and IBM has been refining them ever since, considering that they provide the IT foundations for finance, transportation, manufacturing, etc. Moreover, there are programmers who have quite literally logged four decades perfecting highly sophisticated techniques for handling large data sets with these system components.

Two start-ups ago, working mostly in banking, I learned much about this kind of technology – at times, brainstorming alongside senior IT architects at customer sites like Wachovia. I have a hunch that just about anyone who's spent quality time snuggling close with IBM's microcoded hardware sort algorithm would recognize the heavy lifting performed by the merge sort in the middle of MapReduce ... in a heartbeat.

What is even more interesting is how the core components rolled out so far in Amazon AWS bear a strong resemblance to those same mainframe programming techniques I mentioned above. Specifically, XML resembles a hierarchical data model (IMS), Hadoop resembles ETL, EC2 resembles virtualization on zSeries, S3 resembles Shark, SQS resembles MQSeries, etc. I could go on, but it's not necessary. What is necessary is to recognize that these mainframe technologies run the bulk of commercial computing and data storage – once you add up all the processing performed by banks, insurance, accounting, airlines, shipping, tax agencies, pharma, etc. Their corollary technologies, represented within Dynamo, MapReduce, and Hadoop, are being presented by large firms earning large revenues on the Internet: AMZN, GOOG, YHOO, etc. In other words, it is refreshing to see that real money is not chasing yet-more-web-apps as a business plan. That seems to indicate a good trend in the technology sector.

Okay, enough broad perspectives – let's cut to some examples. Over the next few parts in this series, I'd like to present some sample code and summary analysis for a handful of applications. Let's start by taking a look at data from the jyte.com cred API.
