The social networking system jyte.com uses OpenID as the primary identification for participants – at least one OpenID is required for login. Participants then post "claims" on which others vote. They can also give "cred" to other participants, citing specific areas of credibility with tags. The contacts API provides interesting means for managing "social whitelisting".
Okay, that was a terribly dry, academic description, but the bottom line is that jyte is fun. I started using it shortly after its public launch in early 2007. You can find a lot more info on their wiki. Some of the jyte data gets published via their API page – essentially the "cred" graph. Others have been working with that data outside of jyte.com, notably Jim Snow's work on reputation metrics.
Looking at the "cred" graph, it follows a form suitable for
link analysis such as the
PageRank used by Google. In other words, the cred given by one jyte user to another resembles how web pages link to other web pages. Therefore a PageRank metric can be determined, as Wikipedia describes,
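In my own notation, that simplified per-node update – with the damping term dropped – amounts to:

    PR_{k+1}(u) = \sum_{v \in B(u)} PR_k(v) / L(v)

where B(u) is the set of OpenIDs giving cred to u, and L(v) is the number of OpenIDs to which v gives cred.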
"based upon quantity of inbound links as well as importance of the page providing the link." Note that this application uses a simplified PageRank, i.e., no damping term in the difference equation.Taking the jyte cred graph, inverting its OpenID mapping index, then calculating a PageRank algorithm seemed like a reasonably-sized project. There are approximately 1K nodes in the graph (currently) which provides enough data to make the problem and its analysis interesting, but not so much to make it unwieldy. Sounds like a great first app for Hadoop.
The data as downloaded from jyte.com has a caveat: it has been GZIP'ed twice. Once you get it inflated and loaded into the HDFS distributed file system as "input", the TSV text file line format looks like:

    from_openid    to_openid    csv cred tags

Pass 1 in Hadoop is a job which, during its "map" phase, takes the downloaded data from a local file, and during its "reduce" phase aggregates the one-to-many relation of OpenID URIs which give cred to other OpenID URIs. See the method configPass1() in the org.ceteri.JyteRank class in the Java source code, and also the inner classes MapFrom2To and RedFrom2To. The resulting "from2to" TSV text data line format looks like:
    from_openid    tsv list of to_openids
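For illustration, here is a rough sketch of what those two inner classes might look like against the old org.apache.hadoop.mapred API – the names follow the article, but the bodies are my own guesses rather than the actual org.ceteri.JyteRank code:

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Pass 1: map each input line to (from_openid, to_openid), then reduce
    // by concatenating all of the to_openids for a given from_openid
    public static class MapFrom2To extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        // input line: from_openid \t to_openid \t csv cred tags
        String[] fields = value.toString().split("\t");
        if (fields.length >= 2) {
          output.collect(new Text(fields[0]), new Text(fields[1]));
        }
      }
    }

    public static class RedFrom2To extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
      public void reduce(Text key, Iterator<Text> values,
                         OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        // aggregate the one-to-many relation into a single TSV list
        StringBuilder sb = new StringBuilder();
        while (values.hasNext()) {
          if (sb.length() > 0) sb.append("\t");
          sb.append(values.next().toString());
        }
        output.collect(key, new Text(sb.toString()));
      }
    }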
Pass 2 is a very simple job which initializes rank values for each OpenID URI in the graph – starting simply with a value of 1. We'll iterate to change that, later. See the method configPass2() and also the inner classes MapInit and RedInit. The resulting "prevrank" TSV text data line format looks like:
    from_openid    rank
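A sketch of Pass 2 (same imports as above, and again only a guess at the real MapInit/RedInit) could be as small as emitting a constant rank of 1 per from_openid:

    public static class MapInit extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
      private static final Text ONE = new Text("1");
      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        // input line: from_openid \t tsv list of to_openids
        String from = value.toString().split("\t", 2)[0];
        output.collect(new Text(from), ONE);
      }
    }

    public static class RedInit extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
      public void reduce(Text key, Iterator<Text> values,
                         OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        // one "prevrank" record per OpenID URI, initialized to 1
        output.collect(key, new Text("1"));
      }
    }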
join of "
from2to" data with "
prevrank" data, then expands that into individual terms in the PageRank summation, emitting a partial calculation for each OpenID URI. See the method
configPass3() and also the inner classes
MapExpand and
RedExpand. The resulting "
elemrank" TSV text data line format looks like:
    to_openid    rank_term
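Sketching Pass 3 (same imports as above, plus java.util.ArrayList and java.util.List): here I'm assuming both the "from2to" and "prevrank" directories feed the same job, and that the reducer tells the two record types apart by whether the value parses as a number – the real MapExpand and RedExpand may well do this differently:

    public static class MapExpand extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        // both "from2to" and "prevrank" lines are keyed on from_openid;
        // pass the remainder of the line through for the reduce-side join
        String[] fields = value.toString().split("\t", 2);
        if (fields.length == 2) {
          output.collect(new Text(fields[0]), new Text(fields[1]));
        }
      }
    }

    public static class RedExpand extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
      public void reduce(Text key, Iterator<Text> values,
                         OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        double rank = 1.0;                       // default if no prevrank record
        List<String> links = new ArrayList<String>();
        while (values.hasNext()) {
          String v = values.next().toString();
          try {
            rank = Double.parseDouble(v);        // a "prevrank" record
          } catch (NumberFormatException e) {
            for (String to : v.split("\t")) {    // a "from2to" record
              links.add(to);
            }
          }
        }
        // emit one partial PageRank term per outbound cred link: rank / outdegree
        for (String to : links) {
          output.collect(new Text(to),
                         new Text(Double.toString(rank / links.size())));
        }
      }
    }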
elemrank" data to sum the terms for each OpenID URI in the graph, reducing them to a rank metric for each. See the method
configPass4() and also the inner classes
MapSum and
RedSum. The resulting "
thisrank" TSV text data line format looks like:
    to_openid    rank
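Pass 4 is then a plain summation; again a sketch, my own guess at the bodies of MapSum and RedSum:

    public static class MapSum extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        // input line: to_openid \t rank_term
        String[] fields = value.toString().split("\t");
        if (fields.length == 2) {
          output.collect(new Text(fields[0]), new Text(fields[1]));
        }
      }
    }

    public static class RedSum extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
      public void reduce(Text key, Iterator<Text> values,
                         OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        // sum the partial terms into a single rank metric per OpenID URI
        double sum = 0.0;
        while (values.hasNext()) {
          sum += Double.parseDouble(values.next().toString());
        }
        output.collect(key, new Text(Double.toString(sum)));
      }
    }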
The trick with PageRank is that it uses a difference equation to represent the equilibrium values in a network, iterating through sources and sinks. So the calculation generally needs to iterate before it converges to a reasonably good approximation of the solution at equilibrium. In other words, after completing Pass 4, the code swaps "thisrank" in place of "prevrank" and then iterates on Pass 3 and Pass 4. Rinse, lather, repeat – until the difference between two successive iterations produces a relatively small error.
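A driver loop for that iteration might look something like the following sketch – here I'm guessing that configPass3() and configPass4() return JobConf objects and that the working directories are simply named "prevrank" and "thisrank"; the actual JyteRank code may differ:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobClient;

    static final int MAX_ITERATIONS = 9;   // hypothetical cap; a real test would
                                           // compare successive iterations

    public void iterateRanks() throws IOException {
      FileSystem fs = FileSystem.get(new Configuration());
      for (int i = 0; i < MAX_ITERATIONS; i++) {
        JobClient.runJob(configPass3());   // Pass 3: expand into partial terms
        JobClient.runJob(configPass4());   // Pass 4: sum terms into "thisrank"
        // swap: "thisrank" becomes the "prevrank" input for the next round
        fs.delete(new Path("prevrank"), true);
        fs.rename(new Path("thisrank"), new Path("prevrank"));
      }
    }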
(Illustration source: Wikipedia.org)

Whereas Google allegedly runs their PageRank algorithm through thousands of iterations, running for several hours each day, I have a hunch that jyte.com does not run many iterations to determine their cred points. Stabbing blindly, I found that using 9 iterations seems to produce a close correlation to the actual values listed as jyte.com cred points. A linear regression can then be used to approximate the actual values.
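For what it's worth, the fit itself is just ordinary least squares over two parallel arrays – one of computed ranks, one of the observed cred points. The helper below is my own, not part of JyteRank:

    // ordinary least squares: fit cred ~= slope * rank + intercept
    static double[] fitLine(double[] rank, double[] cred) {
      int n = rank.length;
      double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
      for (int i = 0; i < n; i++) {
        sx += rank[i];
        sy += cred[i];
        sxx += rank[i] * rank[i];
        sxy += rank[i] * cred[i];
      }
      double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
      double intercept = (sy - slope * sx) / n;
      return new double[] { slope, intercept };
    }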
My analysis here is quite rough, and the code is far from optimal. For example, I'm handling data in between the map and reduce phases simply as TSV text. There could be much improvement by using subclassed InputFormat and OutputFormat definitions for my own data formats. Even so, the results look pretty good: not bad for one afternoon's work, eh?
I have big reservations about PageRank for web pages, since it assumes that an equilibrium state effectively models attention given to web pages. PageRank provides a mathematical model for a kind of "coin toss", where a user looking at a browser either chooses to navigate to some random URL (with probability 0.15) or follows one of the links on the current page (with probability 0.85). In practice that implies that some important web pages will experience substantial latency before PageRank begins to "notice" them, since PageRank must wait for (a) other pages to link to the new page, and (b) the web crawlers to visit and analyze those changes. That process can take several days.
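Written out, one standard form of that damped model (as Wikipedia gives it) is:

    PR(u) = (1 - d)/N + d \sum_{v \in B(u)} PR(v) / L(v),  with d = 0.85

where N is the total number of pages, B(u) the set of pages linking to u, and L(v) the number of outbound links on page v.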
A more realistic model of human behavior might take into account that – contrary to Google PageRank assumptions – people do not navigate to some random URL because it is in their bookmarks or is their individual home page. That might have been the general case in 1998, but these days I get my fix of random URLs because of email, IM, RSS, etc., all pushing them out to me. In other words, there is an implied economy of attention (assuming I only look at one browser window at a time) which has drivers that are external to links in the World Wide Web. In lieu of using equilibrium models, the link analysis might consider some URLs as "pumps" and then take a far-from-equilibrium approach.
Certainly, a site such as CNN.com demonstrates that level of network effects, and news events function as attention drivers.
But I digress.
A simplified PageRank analysis based on an equilibrium model appears to work well enough to approximate jyte.com cred points. It also illustrates how to program within the Hadoop framework. I have ulterior motives for getting jyte cred graph data into Hadoop – which we'll examine in the next part of this series.