sample src + data for getting started on hadoop

Monday, 19 July 2010... I had a wonderful opportunity to present at the Silicon Valley Cloud Computing meetup, on the topic "Getting Started on Hadoop".

The talk showed examples of Hadoop Streaming, based on Python scripts running on the AWS Elastic MapReduce service.

We started with a brief history of MapReduce, including the concepts leading up to the framework as well as open source projects and services which have followed. Then we stepped through the ubiquitous “WordCount” example (a “Hello World” for MapReduce), showing how Python and Hadoop Streaming make it simple to iterate and debug from a command line using Unix/Linux pipes.
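For readers who have not seen WordCount written for Hadoop Streaming, here is a minimal sketch of the idea. It is not the code from the talk: the function names `mapper`, `reducer`, and `run_local` are made up for illustration. Streaming passes tab-separated key/value lines between steps and sorts them by key before the reduce step, so the whole job can be imitated locally with a sort in between.

```python
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated "word<TAB>1" line per token (hypothetical helper)."""
    for line in lines:
        for word in line.strip().lower().split():
            yield f"{word}\t1"


def reducer(lines):
    """Sum the counts for each word.

    Assumes the input lines arrive sorted by key, which is what Hadoop
    provides between the map and reduce steps.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_local(lines):
    """Imitate `cat input | mapper | sort | reducer` in a single process."""
    return list(reducer(sorted(mapper(lines))))


for result in run_local(["the quick the", "quick fox"]):
    print(result)
```

In an actual streaming job, each function would live in its own script that reads `sys.stdin` and prints its output lines, which is what makes it easy to test the same scripts with Unix/Linux pipes before submitting them to a cluster.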

Source code is available on GitHub and, oddly enough, the slide deck got an editor's pick that week on SlideShare.

The focus of the talk was to show text mining of the infamous Enron Email Dataset, which Infochimps.com and CMU make available. In that context, the example code creates an inverted index of keywords found in the email dataset and begins to build a semantic lexicon of "neighbor" keyword relationships, plus some data visualization and social graph analysis using R and Gephi.
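To make the inverted index idea concrete, here is a rough sketch in the same map/sort/reduce style. It is an illustration rather than the code on GitHub, and it rests on several assumptions: each input line is taken to be a message id and the message text separated by a tab, and the stopword list, the token pattern, and the names `index_mapper`, `index_reducer`, and `build_index` are all hypothetical.

```python
import re
from itertools import groupby

# Illustrative stopword list only; a real job would use a much fuller one.
STOPWORDS = {"the", "and", "for", "with", "this", "that"}
# Assumption: keywords are runs of three or more ASCII letters.
TOKEN = re.compile(r"[a-z]{3,}")


def index_mapper(lines):
    """Emit "keyword<TAB>doc_id" once per distinct keyword in each message."""
    for line in lines:
        doc_id, _, text = line.rstrip("\n").partition("\t")
        for word in set(TOKEN.findall(text.lower())) - STOPWORDS:
            yield f"{word}\t{doc_id}"


def index_reducer(lines):
    """Collect the sorted, de-duplicated message ids for each keyword.

    Assumes the input lines arrive sorted by keyword.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        doc_ids = sorted({doc_id for _, doc_id in group})
        yield f"{word}\t{','.join(doc_ids)}"


def build_index(lines):
    """Run the map, sort, and reduce steps locally on a list of lines."""
    return list(index_reducer(sorted(index_mapper(lines))))


sample = [
    "m1\tPlease review the energy contract",
    "m2\tEnergy prices and the contract terms",
]
for entry in build_index(sample):
    print(entry)
```

Each output line maps one keyword to the messages that contain it, which is the kind of structure one could then mine for keywords that tend to appear together.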

Along with my presentation, Matthew Schumpert from Datameer gave a demo of their product, doing some similar kinds of text analysis.

Lots of people showed up, enough that the kind folks at Fenwick & West LLP grew concerned about running out of seats :) The audience asked several excellent questions and we had a lot of discussion after the talk. Todd Hoff wrote an article on High Scalability summarizing the talk and discussions, along with some great perspectives.

Admittedly, the Enron aspects of the talk were intended as somewhat of a teaser; my examples focused more on method than on results. I'd talked with several people who'd never seen how to write Python scripts for Hadoop Streaming, how to run Hadoop jobs on Elastic MapReduce, how to calculate some basic text analytics, or how to produce simple data visualizations. Even so, if you want to investigate the Enron data yourself, then check out the code, download the data, and run this on AWS. There were some fun surprises to be found among the analytics results, which may be good to publish as a follow-up talk.

Many thanks to SVCC and our organizer Sebastian Stadil, our venue host Fenwick & West LLP, and all who participated.