As some of my friends have recently noticed (from profiles on LinkedIn, etc.) I've got a new employer.
Of course, many folks have asked me, "How does one get hired by S.P.E.C.T.R.E., anyway?" Well, it's a long, involved process – lemme tell you. I actually got hired years ago, and have been moonlighting there on the sly. Hey, it happens. Silicon Valley was built out of that kind of "resourcefulness".
Anywhoo, I figured it was time to commit, go full-time, and moreover go public with my affiliation. My grandpa always said, if you're gonna swim a river right, you gotta jump in with both feet. Or something.
The thing is, even though our prime objective at S.P.E.C.T.R.E. is world domination, and the prevalence of evil over good, etc., we really have a lot of latitude in our daily work life. I mean, the corporate campus is awesome! Who cares about being at the Googleplex? We get to work inside a secret volcano base. With atomic-powered peoplemovers. Access not only to a ginormous cloud computing facility, but to all the banking IT in the world. Ninjas instead of silly security guards and Wakenhuts. Hell, we've even got Fem Bots in the break room!
But I digress.
It's not all guts and glory. We still have to follow process (which is killer) and meet our milestones. Those can be grueling, especially when that giant dude with the bad metal teeth starts lurking around right before deadline. The benefits plan is amazing though – free travel around the world, staying at some of the most fantastics resorts. Overall, serving evil a pretty good deal.
As some of my friends have recently noticed (from profiles on LinkedIn, etc.) I've got a new employer.
Following from Part 1 in the series, this article takes a look at an initial application in Hadoop, using a simplified PageRank algorithm with graph data from jyte.com to illustrate a gentle introduction to distributed computing. The really cool part is that this runs quickly on my MacBook laptop, the same Java code can run and scale without modification on Amazon AWS, on an IBM mainframe, and (with minor changes) in the clouds at Yahoo and Google. Source code is available as open source on Google Code at:
The social networking system jyte.com provides means for using OpenID as primary identification for participants – at least one is required for login. Participants then post "claims" on which others vote. They can also give "cred" to other participants, citing specific areas of credibility with tags. The contacts API provides interesting means for managing "social whitelisting".
Okay, that was a terribly dry, academic description, but the bottom line is that jyte is fun. I started using it shortly after its public launch in early 2007. You can find a lot more info on their wiki. Some of the jyte data gets published via their API page – essentially the "cred" graph. Others have been working with that data outside of jyte.com, notably Jim Snow's work on reputation metrics.
Looking at the "cred" graph, it follows a form suitable for link analysis such as the PageRank used by Google. In other words, the cred given by one jyte user to another resembles how web pages link to other web pages. Therefore a PageRank metric can be determined, as Wikipedia describes, "based upon quantity of inbound links as well as importance of the page providing the link." Note that this application uses a simplified PageRank, i.e., no damping term in the difference equation.
Taking the jyte cred graph, inverting its OpenID mapping index, then calculating a PageRank algorithm seemed like a reasonably-sized project. There are approximately 1K nodes in the graph (currently) which provides enough data to make the problem and its analysis interesting, but not so much to make it unwieldy. Sounds like a great first app for Hadoop.
The data as downloaded from jyte.com has a caveat: it has been GZIP'ed twice. Once you get it inflated and loaded into the HDFS distributed file system, as "input", the TSV text file line format looks like:
from_openid to_openid csv cred tags
Pass 1 in Hadoop is a job which during its "map" phase takes the downloaded data from a local file, and during the "reduce" phase it aggregates the one-to-many relation of OpenID URIs which give cred to other OpenID URIs. See in the Java source code the method configPass1() in the org.ceteri.JyteRank class, and also the inner classes MapFrom2To and RedFrom2To. The resulting "from2to" TSV text data line format looks like:
from_openid tsv list of to_openids
Pass 2 is a very simple job which initializes rank values for each OpenID URI in the graph – starting simply with a value of 1. We'll iterate to change that, later. See the method configPass2() and also the inner classes MapInit and RedInit. The resulting "prevrank" TSV text data line format looks like:
Pass 3 performs a join of "from2to" data with "prevrank" data, then expands that into individual terms in the PageRank summation, emitting a partial calculation for each OpenID URI. See the method configPass3() and also the inner classes MapExpand and RedExpand. The resulting "elemrank" TSV text data line format looks like:
Pass 4 aggregates the "elemrank" data to sum the terms for each OpenID URI in the graph, reducing them to a rank metric for each. See the method configPass4() and also the inner classes MapSum and RedSum. The resulting "thisrank" TSV text data line format looks like:
The trick with PageRank is that it uses a difference equation to represent the equilibrium values in a network, iterating through sources and sinks. So the calculation generally needs to iterate before it converges to a reasonably good approximation of the solution at equilibrium. In other words, after completing Pass 4, the code swaps "thisrank" in place of "prevrank" and then iterates on Pass 3 and Pass 4. Rinse, later, repeat – until the difference between two successive iterations produces a relatively small error.
Whereas Google allegedly runs their PageRank algorithm through thousands of iterations, running for several hours each day, I have a hunch that jyte.com does not run many iterations to determine their cred points. Stabbing blindly, I found that using 9 iterations seems to produce a close correlation to the actual values listed as jyte.com cred points. A linear regression can then be used to approximate the actual values:
My analysis here is quite rough, and the code is far from optimal. For example, I'm handling data in-between the map and reduce phases simply as TSV text. There could be much improvement by using subclassed InputFormat and OutputFormat definitions for my own data formats. Even so, the results look pretty good: not bad for one afternoon's work, eh?
I have big reservations about PageRank for web pages, since it assumes that an equilibrium state effectively models attention given to web pages. PageRank provides a mathematical model for a kind of "coin toss", where a user looking at a browser either chooses to navigate to some random URL (with probability 0.15) or they follow one of the links on the current page (with probability 0.85). In practice that implies how some important web pages will experience substantial latency before PageRank begins to "notice" them, since PageRank must wait for (a) other pages to link to the new page, and (b) their web crawlers to visit and analyze those changes. That process can take several days.
A more realistic model of human behavior might take into account that – contrary to Google PageRank assumptions – people do not navigate to some random URL because it is in their bookmarks or is their individual home page. That might have been the general case in 1998, but these days I get my fix of random URLs because of email, IM, RSS, etc., all pushing them out to me. In other words, there is an implied economy of attention (assuming I only look at one browser window at a time) which has drivers that are external to links in the World Wide Web. In lieu of using equilibrium models, the link analysis might consider some URLs as "pumps" and then take a far-from-equilibrium approach.
Certainly, a site such as CNN.com demonstrates that level of network effects, and news events function as attention drivers.
But I digress.
A simplified PageRank analysis based on an equilibrium model appears to work fine enough to approximate jyte.com cred points. It also illustrates how to program within the Hadoop framework. I have ulterior motives for getting jyte cred graph data into Hadoop – which we'll examine in the next part of this series.
Recently I started putting together a collection of code examples based on the Hadoop open source project at Apache. These examples are available as open source on Google Code at:
This project is intended to explore applications of MapReduce programming techniques for handling large data sets within a distributed computing framework. It also presents more Hadoop examples (currently based on version 0.15.3) to help other developers overcome its learning curve. You'll need to download and install Hadoop first.
Yahoo! developers announced today about production use of Hadoop supporting their search engine, which includes more than 10,000 cores. Some have estimated that as approaching 10% of the size of Google's server farm running MapReduce. Given that Hadoop is still relatively "early code" as an open source project – in other words waaay prior to its "release 1.0", and that's impressive!
Google provides an excellent set of lectures online introducing these topics, entitled "Cluster Computing and MapReduce". I highly recommend it. While you're studying up, be sure to read the Dynamo paper by Amazon.com CTO Werner Vogels. Between those two sources, one begins to get a clear vision established for a new mode of programming techniques and system architecture priorities.
Those emerging priorities are no so much based on a need to crunch the data faster, but rather to be able to scale to handle very large data reliably and cost-effectively – even if, statistically, the data quality is questionable, the third-party libraries are buggy, and some of the processors fail. Generally speaking, for large scale problems that approach does end up crunching the data faster overall. For example, if you happen to be Last.fm or Rackspace.com then even relatively mundane business needs like log file analysis become very large data problems ... and Hadoop is providing their solution.
One interesting point is that this style of coding and architecture may not quite fit the first-foot-forward techniques that a "pre-Millenial" crop of software developers learned and tend to practice. Bread and butter concepts such as relational databases, O/R mapping for persistence layers, model-view-controller, threading – each have their purposes, but not longer provide the default patterns. Instead, in-between the lines of whitepapers about Dynamo, MapReduce, Hadoop, etc., we catch glimpses of architecture of IBM mainframe programming, specifically hierarchical databases, ETL for data streaming, zSeries hardware with many cores, Shark, MQSeries, etc. The underlying technologies for those date back to the 1960s, and IBM has been refining them ever since, considering that they provide the IT foundations for finance, transportation, manufacturing, etc. Moreover there are programmers who have quite literally logged four decades perfecting highly sophisticated techniques for handling large data sets with these system components.
Working mostly in banking, two start-ups ago, I learned much about this kind of technology – at times, brainstorming alongside senior IT architects at customer sites like Wachovia. I have a hunch that just about anyone who's spent quality time snuggling close with IBM's microcoded hardware sort algorithm would recognize the heavy-lifting performed by the merge sort in the middle of MapReduce ... in a heartbeat.
What is even more interesting is how the core components rolled out so far in Amazon AWS bear strong resemblance to those same mainframe programming techniques I mentioned above. Specifically, XML resembles a hierarchical data model (IMS), Hadoop resembles ETL, EC2 resembles virtualization on zSeries, S3 resembles Shark, SMS resembles MQSeries, etc. I could go on, but it's not necessary. What is necessary is to recognize that these mainframe technologies run the bulk of commercial computing and data storage – once you add up all the processing performed by banks, insurance, accounting, airlines, shipping, tax agencies, pharma, etc. Their corollary technologies represented within Dynamo, MapReduce, and Hadoop are being presented by large firms earning large revenues based on the Internet: AMZN, GOOG, YHOO, etc. In other words, it is refreshing to see that real money is not chasing yet-more-web-apps as their respective business plans. That seems to indicate a good trend in the technology sector.
Okay, enough broad perspectives – let's cut to some examples. Over the next few parts in this series, I'd like to present some sample code and summary analysis for applications. Let's start by taking a look at data from the jyte.com cred API.
So here is a sample chart, with my respective ethnicity. Our family is a mixed bag, part which looks like a cross-section of early North America, and part looks like a cross-section of Central Asia, sampled right along the Silk Route. These things happen.
Meanwhile, happy Chinese New Year!