Paco Nathan: hadoop, part 2: jyte cred graph

Following from Part 1 in the series, this article takes a look at an initial application in Hadoop, using a simplified PageRank algorithm with graph data from jyte.com to illustrate a gentle introduction to distributed computing. The really cool part is that this runs quickly on my MacBook laptop, the same Java code can run and scale without modification on Amazon AWS, on an IBM mainframe, and (with minor changes) in the clouds at Yahoo and Google. Source code is available as open source on Google Code at:

http://code.google.com/p/ceteri-mapred/

The social networking system jyte.com provides means for using OpenID as primary identification for participants – at least one is required for login. Participants then post "claims" on which others vote. They can also give "cred" to other participants, citing specific areas of credibility with tags. The contacts API provides interesting means for managing "social whitelisting".

Okay, that was a terribly dry, academic description, but the bottom line is that jyte is fun. I started using it shortly after its public launch in early 2007. You can find a lot more info on their wiki. Some of the jyte data gets published via their API page – essentially the "cred" graph. Others have been working with that data outside of jyte.com, notably Jim Snow's work on reputation metrics.

Looking at the "cred" graph, it follows a form suitable for link analysis such as the PageRank used by Google. In other words, the cred given by one jyte user to another resembles how web pages link to other web pages. Therefore a PageRank metric can be determined, as Wikipedia describes, "based upon quantity of inbound links as well as importance of the page providing the link." Note that this application uses a simplified PageRank, i.e., no damping term in the difference equation.

Taking the jyte cred graph, inverting its OpenID mapping index, then calculating a PageRank algorithm seemed like a reasonably-sized project. There are approximately 1K nodes in the graph (currently) which provides enough data to make the problem and its analysis interesting, but not so much to make it unwieldy. Sounds like a great first app for Hadoop.

The data as downloaded from jyte.com has a caveat: it has been GZIP'ed twice. Once you get it inflated and loaded into the HDFS distributed file system, as "input", the TSV text file line format looks like:

from_openid to_openid csv cred tags

Pass 1 in Hadoop is a job which during its "map" phase takes the downloaded data from a local file, and during the "reduce" phase it aggregates the one-to-many relation of OpenID URIs which give cred to other OpenID URIs. See in the Java source code the method configPass1() in the org.ceteri.JyteRank class, and also the inner classes MapFrom2To and RedFrom2To. The resulting "from2to" TSV text data line format looks like:

from_openid tsv list of to_openids

Pass 2 is a very simple job which initializes rank values for each OpenID URI in the graph – starting simply with a value of 1. We'll iterate to change that, later. See the method configPass2() and also the inner classes MapInit and RedInit. The resulting "prevrank" TSV text data line format looks like:

from_openid rank

Pass 3 performs a join of "from2to" data with "prevrank" data, then expands that into individual terms in the PageRank summation, emitting a partial calculation for each OpenID URI. See the method configPass3() and also the inner classes MapExpand and RedExpand. The resulting "elemrank" TSV text data line format looks like:

to_openid rank_term

Pass 4 aggregates the "elemrank" data to sum the terms for each OpenID URI in the graph, reducing them to a rank metric for each. See the method configPass4() and also the inner classes MapSum and RedSum. The resulting "thisrank" TSV text data line format looks like:

to_openid rank

The trick with PageRank is that it uses a difference equation to represent the equilibrium values in a network, iterating through sources and sinks. So the calculation generally needs to iterate before it converges to a reasonably good approximation of the solution at equilibrium. In other words, after completing Pass 4, the code swaps "thisrank" in place of "prevrank" and then iterates on Pass 3 and Pass 4. Rinse, later, repeat – until the difference between two successive iterations produces a relatively small error.

source: Wikipedia.org

Whereas Google allegedly runs their PageRank algorithm through thousands of iterations, running for several hours each day, I have a hunch that jyte.com does not run many iterations to determine their cred points. Stabbing blindly, I found that using 9 iterations seems to produce a close correlation to the actual values listed as jyte.com cred points. A linear regression can then be used to approximate the actual values:

My analysis here is quite rough, and the code is far from optimal. For example, I'm handling data in-between the map and reduce phases simply as TSV text. There could be much improvement by using subclassed InputFormat and OutputFormat definitions for my own data formats. Even so, the results look pretty good: not bad for one afternoon's work, eh?

I have big reservations about PageRank for web pages, since it assumes that an equilibrium state effectively models attention given to web pages. PageRank provides a mathematical model for a kind of "coin toss", where a user looking at a browser either chooses to navigate to some random URL (with probability 0.15) or they follow one of the links on the current page (with probability 0.85). In practice that implies how some important web pages will experience substantial latency before PageRank begins to "notice" them, since PageRank must wait for (a) other pages to link to the new page, and (b) their web crawlers to visit and analyze those changes. That process can take several days.

A more realistic model of human behavior might take into account that – contrary to Google PageRank assumptions – people do not navigate to some random URL because it is in their bookmarks or is their individual home page. That might have been the general case in 1998, but these days I get my fix of random URLs because of email, IM, RSS, etc., all pushing them out to me. In other words, there is an implied economy of attention (assuming I only look at one browser window at a time) which has drivers that are external to links in the World Wide Web. In lieu of using equilibrium models, the link analysis might consider some URLs as "pumps" and then take a far-from-equilibrium approach.

Certainly, a site such as CNN.com demonstrates that level of network effects, and news events function as attention drivers.

But I digress.

A simplified PageRank analysis based on an equilibrium model appears to work fine enough to approximate jyte.com cred points. It also illustrates how to program within the Hadoop framework. I have ulterior motives for getting jyte cred graph data into Hadoop – which we'll examine in the next part of this series.

Paco Nathan

2008-02-20

hadoop, part 2: jyte cred graph

No comments:

about

archive

perspective

usage