2010-08-05
sample src + data for getting started on hadoop
Monday, 19 July 2010... I had a wonderful opportunity to present at the Silicon Valley Cloud Computing meetup, on the topic "Getting Started on Hadoop".
The talk showed examples of Hadoop Streaming, based on Python scripts
running on the AWS Elastic MapReduce service.
We started with a brief history of MapReduce, including the concepts leading up to the framework as well as open source projects and services which have followed. Then we stepped through the ubiquitous “WordCount” example (a “Hello World” for MapReduce), showing how Python and Hadoop Streaming make it simple to iterate and debug from a command line using Unix/Linux pipes.
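For reference, the streaming version of WordCount boils down to two tiny scripts plus a shell pipeline for local debugging. This is a minimal sketch of the pattern, not the exact scripts from the talk:

#!/usr/bin/env python
# wordcount_mapper.py -- emit a (word, 1) pair for every token on stdin
import sys
for line in sys.stdin:
    for word in line.strip().lower().split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# wordcount_reducer.py -- sum counts per word; Hadoop Streaming delivers input sorted by key
import sys
current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

# debug locally with Unix pipes before running on Elastic MapReduce:
#   cat sample.txt | python wordcount_mapper.py | sort | python wordcount_reducer.py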
Source code is available on GitHub and, oddly enough, the slide deck got an editor's pick that week on SlideShare.
The focus of the talk was to show text mining of the infamous Enron Email Dataset, which Infochimps.com and CMU make available. In that context, the example code creates an inverted index of keywords found in the email dataset, begins to build a semantic lexicon of "neighbor" keyword relationships, plus some data visualization and social graph analysis using R and Gephi.
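The inverted index follows the same streaming pattern. As a minimal sketch (not the actual repo code), assume each input line is a message id, a tab, then the message body; the mapper emits keyword/message-id pairs and the reducer collects the posting list:

# invert_mapper.py -- emit (keyword, message_id), assuming "msg_id <tab> body" input lines
import sys
for line in sys.stdin:
    msg_id, _, body = line.rstrip("\n").partition("\t")
    for token in set(body.lower().split()):
        print("%s\t%s" % (token, msg_id))

# invert_reducer.py -- collect message ids per keyword (input sorted by keyword)
import sys
current_key, postings = None, []
for line in sys.stdin:
    key, msg_id = line.rstrip("\n").split("\t", 1)
    if key != current_key and current_key is not None:
        print("%s\t%s" % (current_key, ",".join(postings)))
        postings = []
    current_key = key
    postings.append(msg_id)
if current_key is not None:
    print("%s\t%s" % (current_key, ",".join(postings)))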
Along with my presentation, Matthew Schumpert from Datameer gave a demo of their product, doing some similar kinds of text analysis.
Lots of people showed up, enough that the kind folks at Fenwick & West LLP grew concerned about running out of seats :) The audience asked several excellent questions and we had a lot of discussion after the talk. Todd Hoff wrote an article summarizing the talk and discussions, along with some great perspectives on High Scalability.
Admittedly, the Enron aspects of the talk were intended as somewhat of a teaser; my examples focused more on method than on results. I'd talked with several people who'd never seen how to write Python scripts for Hadoop Streaming, how to run Hadoop jobs on Elastic MapReduce, how to calculate some basic text analytics, or how to produce simple data visualizations. Even so, if you want to investigate the Enron data yourself, then check out the code, download the data, and run this on AWS. There were some fun surprises to be found among the analytics results, which may be good to publish as a follow-up talk.
Many thanks to SVCC and our organizer Sebastian Stadil, our venue host Fenwick & West LLP, and all who participated.
Posted by Paco Nathan at 08:09 | 0 comments | Tags: algorithms, aws, code, hadoop
2009-06-03
upcoming talks
I will be speaking at three events over the next two weeks. These talks will focus on how ShareThis leverages AWS and MapReduce, working with Cascading and AsterData as vendors, plus several open source projects including Bixo and Katta.
Tue 2009-06-09, 7:00pm-9:00pm
ScaleCamp
http://scalecamp.eventbrite.com/
Santa Clara Marriott
2700 Mission College Blvd
Santa Clara, CA 95054
Will discuss "How ShareThis mashes technologies in the cloud for Big Data analysis, leveraging AsterData nCluster, Cascading, Amazon Elastic MapReduce, and other AWS services in our system architecture."
This will be run in a Bar Camp format, along with DJ Patil, Ted Dunning, and other great speakers.
Wed 2009-06-10, 8:30am-8:30pm
Hadoop Summit 2009
http://developer.yahoo.com/events/hadoopsummit09/
Santa Clara Marriott
2700 Mission College Blvd
Santa Clara, CA 95054
Two talks: "Cascading for Data Insights at ShareThis: Syntax is for humans, API's are for software", as a developer talk, 2-3pm. Also, participating on the Amazon AWS panel discussion moderated by @jinman, about running Hadoop on EC2 and Elastic MapReduce, 6:00-6:30pm.
Last year's Hadoop Summit was a must-not-miss event, and I got lucky on timing and was able to tip off friends at AWS to check it out... Amazon sponsored the lunch, and even though I did have to miss that event, AWS sent me awesome t-shirt schwag which I treasure to this day :)
Tue 2009-06-16, 1:00pm-7:00pm
Amazon AWS Start-Up Project
http://aws.amazon.com/startupproject/
Plug & Play Tech Center
440 N Wolfe Rd
Sunnyvale, CA 94085
Panel discussion, along with Netflix, SmugMug, and other AWS customers, to "help business and technical decision makers from small start-ups to large enterprises learn how to be successful with Amazon Web Services."
Really looking forward to this. I was on the panel for AWS Start-Up Tour in Santa Monica in 2007. You may recall there were some fires... like, nearly all of LA and SD hillsides were up in flames. Even so, 250+ people braved their way through the smoke and ash to reach that night club on Santa Monica Blvd, and we had a great panel discussion. This year in Sunnyvale should be even better.
While I've got your attention... I've been spending a fair amount of time in Seattle and on the phone to Seattle, particularly at one reconditioned former medical center, and especially regarding the fine folks at Elastic MapReduce. While in Seattle, I got to hang out with 30+ other system architects, many incredible folks. M David, of course I'm talking about you! :) At this stage of my career, I rarely seem to get time to sit back with peers, and swap stories and ideas. Time well spent.
BTW, we've been working with EMR now for a while, and recently started using EMR in production. One of those cases where I could not post a blog entry then, but definitely wanted to tell everybody! Now that the news has been long public, I'm actually too busy using it commercially to blog ;)
See you on twitter @pacoid
2009-02-10
westward ho
Some friends of mine live in the Westport area of Kansas City, across the street from a marker which notes the Oregon Trail, the Santa Fe Trail, and the California Trail. For many people in North America during the 19th century, their resolve to migrate west was tested there.
So it is for us, with two little girls ready to travel. Our oldest is eager to live in hotels again, because they promise TVs, room service, and jacuzzi pools. She was done with KC as soon as she got her fill of fresh snow. Our youngest has been missing the ocean, especially the seals. She's asked, "When do we go home?" just about ever since we arrived in KC.
A few days ago, a woman from Kansas made some rather hostile remarks about how Californians feel about seals -- right after my little girl had finished declaring how much she loves seals and whales.
Unforgivably callous. Intolerably ignorant. Not entirely unexpected. In other words, par for the course. I'm sorry to have to feel that kind of resignation, but it is entirely realistic. I bit my tongue to avoid making an offhand retort about "fly over zone". After all, parts of the Midwest are highly recommended.
Two bright spots have made these past 10 months worthwhile...
One is my team in the Analytics department at Adknowledge -- a.k.a., "Team Mendota". Those people have been a joy and an honor to work among. In addition to several contributions to R, Hadoop, AWS, RightScale, and points in between, we've made big strides in methods for having statisticians and developers working side-by-side on "Big Data" projects -- with tangible results. Frankly, I would have left much sooner (supra: 3 yr old tugging daddy's heartstrings) if it had not been for how much I appreciate this team. Flo, Chris, Nathan, Chun, Bill, Shan, Vicky, Margi, David, Gaurav -- I wish you all very well. My regret is that I did not get to work with you long enough.
The other is our collective friends and families in the area, met through our kids -- mostly because of their school, Global Montessori Academy. My wife has served on the board there. The teachers and just everyone involved have been amazing. Not sure quite how that place emerged here inside of KC, and I'd be hard-pressed to believe that anything much like it exists between Chicago and Austin... but GMA and the people associated will be missed.
I found myself thinking, many many times while living here in KC, "How can I teach my daughters about the principles in which I believe, if I cannot live them here?" GMA actually does represent those principles. Highly recommended.
For everyone else, see you back on the Left Coast!
2008-09-27
cloud computing - methodology notes
A couple companies ago, one of my mentors -- Jack Olson, author of Data Quality -- taught us team leaders to follow a formula for sizing software development groups. Of course this is simply guidance, but it makes sense:
9:3:1 for dev/test/doc
In other words, a 3:1 ratio of developers to testers, and then a 9:1 ratio of developers to technical writers. Also figure in how a group that size (13) needs a manager/architect and some project management.
On the data quality side, Jack sent me out to talk with VPs at major firms -- people responsible for handling terabytes of customer data -- and showed me just how much the world does *not* rely on relational approaches for most data management.
Jack also put me on a plane to Russia, and put me in contact with a couple of firms -- in SPb and Moscow, respectively. I've gained close friends in Russia, as a result, and profound respect for the value of including Russian programmers on my teams.
A few years and a couple of companies later, I'm still building and leading software development teams, I'm still working with developers from Russia, I'm still staring at terabytes (moving up to petabytes), and I'm still learning about data quality issues.
One thing has changed, however. During the years in-between, Google demonstrated the value of building fault-tolerant, parallel processing frameworks atop commodity hardware -- e.g., MapReduce. Moreover, Hadoop has provided MapReduce capabilities as open source. Meanwhile, Amazon and now some other "cloud" vendors have provided cost-effective means for running thousands of cores and managing petabytes -- without having to sell one's soul to either IBM or Oracle. But I digress.
I started working with Amazon EC2 in August, 2006 -- fairly early in the scheme of things. A friend of mine had taken a job working with Amazon to document something new and strange, back in 2004. He'd asked me terribly odd questions, ostensibly seeking advice about potential use cases. When it came time to sign up, I jumped at the chance.
Over the past 2+ years I've learned a few tricks and have an update for Jack's tried-and-true ratios. Perhaps these ideas are old hat -- or merely scratching the surface or even out of whack -- at places which have been working with very large data sets and MapReduce techniques for years now. If so, my apologies in advance for thinking aloud. This just seems to capture a few truisms about engineering Big Data.
First off, let's talk about requirements. One thing you'll find with Big Data is that statistics govern the constraints of just about any given situation. For example, I've read articles by Googlers who describe using MapReduce to find simple stats to describe a *large* sample, and then build their software to satisfy those constraints. That's an example of statistics providing requirements. Especially if your sample is based on, say, 99% market share :)
In my department's practice, we have literally one row of cubes for our Statisticians, and across the department's center table there is another row of cubes for our Developers. We build recommendation systems and other kinds of predictive models at scale; when it comes to defining or refining algorithms that handle terabytes, we find that analytics make programming at scale possible. We follow a practice of having statisticians pull samples, visualize the data, run analysis on samples, develop models, find model parameters, etc., and then test their models on larger samples. Once we have selected good models and the parameters for them, those are handed over to the developers -- and to the business stakeholders and decision makers. Those models + parameters + plots become our requirements, and part of the basis for our acceptance tests.
Given the sheer lack of cost-effective tools for running statistics packages at scale (SAS is not an option, R can't handle it yet), we must work on analytics through samples. That may be changing, soon. But I digress.
Developers working at scale are a different commodity. My current employer has a habit of hiring mostly LAMP developers, people who've demonstrated proficiency with PHP and SQL. That's great for web apps, but hand the same people terabytes and they'll become lost. Or make a disastrous mess. I give a short quiz during every developer interview to ensure that our candidates understand how a DHT works, how to recognize a good situation for using MapReduce, and whether they've got any sense for working with open source projects. Are they even in the ballpark? Most candidates fail.
I certainly don't look for one-size-fits-all, either: some people are great working on system programming, which is crucial for Big Data, and others are more suited for working on algorithms. The latter is problematic, because algorithms at scale do not tend to work according to how logic might dictate. Precious little at scale works according to how logic might dictate. In fact, if I get a sense that a developer or candidate is too "left brained" and reliant on "logic", then I know already they'll be among the first to fail horribly with MapReduce. One cannot *think* their way through failure cases within a thousand cores running on terabytes. Data quality issues alone should be sufficient, but there are deeper forces involved. Anywho, one must *measure* their way through troubleshooting at scale -- which is another reason why statistics and statisticians need to be close at hand.
The trouble is that most commodity hardware is built to perform well within given tolerances. If you restart your laptop a hundred times, it will probably work fine; however, if you launch 100 virtual servers, some percentage will fail. Even among the ones that appear to run fine, there will be disk failures, network failures, memory glitches, etc., some of which do not become apparent until they've occurred many times over. So when you pump terabytes through a cluster of servers for hours and hours, some parts of your data are going to get mangled.
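To put rough numbers on that, assume (purely for illustration) a 1% chance that any single virtual server comes up bad; the odds of a completely clean 100-server launch are already worse than a coin flip:

# back-of-the-envelope failure odds, assuming a hypothetical 1% per-server failure rate
p = 0.01
n = 100
p_all_clean = (1 - p) ** n          # ~0.366
expected_failures = n * p           # 1.0
print("P(all %d servers clean) = %.3f" % (n, p_all_clean))
print("expected bad servers    = %.1f" % expected_failures)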
Which brings up another problem: unit tests run at scale don't provide a whole lot of meaning. Sure, it's great to have unit tests to validate that your latest modifications and check-ins pass tests with small samples of data. Great. Even so, minor changes in code, even code that passes its unit tests, can cascade into massive problems at scale. Testing at scale is problematic, because sometimes you simply cannot tell what went wrong unless you go chasing down several hundred different log files, each of which is many GB. Even if results get produced, with Big Data you may not know much about the performance of results until they've had time to go through A/B testing. Depending on your operation, that may take days to evaluate correctly.
One trick is to have statisticians acting in what would otherwise be a "QA" role -- poking and prodding the results of large MapReduce jobs.
One other trick is to find the people talking up the use of scrum, XP, etc., and relocate those people to another continent. I use iterative methodology -- drawn from a jumble of different sources, recombined for every new organization. However, when someone wants to sell me an "agile" product where unit tests adjudicate the project management, I show that person the door. Don't let it hit you on the way out. In terms of working with cloud-based architecture and Big Data, something new needs to be invented for methodology, something beyond "agile". I'll try to document what we've found to work, but I don't have answers beyond that.
Some key points:
Each statistician can prompt enough work for about three developers; it's a two-way street, because developers are generally helping analysts resolve system issues, pull or clean up data that's problematic, etc.
Statisticians work better in pairs or groups, not so well individually. In fact, I round up all the quantitative analysts in the company once each week for a "Secret Statisticians Meeting". We run the meeting like a grad seminar. Management and developers are not allowed, unless they have at least a degree in mathematics -- and even then they must prove their credentials by presenting and submitting to a review.
In any case, I try to assign a minimum of two statisticians to any one project, to get overlap of backgrounds and shared insights.
We use R; it's great for plots and visualizing data. I find that hiring statisticians who have the chops to produce good graphics and dashboards -- that's crucial. Stakeholders don't grasp an equation for how to derive variance, but they do grasp seeing their business realities depicted in a control chart or scatterplot. That communication is invaluable, on all sides. Hire statisticians who know how to leverage R to produce great graphics.
Back to the developers, some are vital as systems programmers -- for scripting, troubleshooting, measuring, tweaking, etc. We use Python mostly for that. You won't be able to manage cloud resources without that talent. Another issue is that your traditional roles for system administrators, DBAs, etc., are not going to help much when it comes to troubleshooting an algorithm running in parallel on a thousand cores. They will tend to make remarks like "Why don't you just reboot it?" As if. So you're going to have to rely on your own developers to run operations. Those are your system programmers.
Other developers work at the application layer. Probably not the same people, but you might get lucky. Don't count on it. Algorithm work requires that a programmer can speak statistics, and there aren't many people who cross both worlds *and* write production quality code for Hadoop.
Another problem is documentation. Finding a tech writer to document a team working on Big Data would entail finding an empath. Besides, the statisticians generate a lot of documents, and the developers (and their automated tools) are great at generating tons of text too. Precious little of it dovetails, however. Of course we use a wiki to capture our documentation -- it has versioning, collaborative features, etc. -- and we also use internal blogs for "staging" our documentation, writing drafts, logging notes, etc. Even with all those nifty collab tools, it still takes an editor to keep a bunch of different authors and their texts on track -- *not* a tech writer, but an actual editor. Find one. They tend to be inexpensive compared with developers, and hence more valuable than anybody but the recruiters. But I digress.
Back to those team ratios from Jack Olson... I have a hunch that the following works better in Big Data:
2:3:3:1 for stats/sys/app/edit
A team of 9 needs a team leader/manager -- which, as I understand, might be called a "TL/M" at GOOG, a "2PT" at AMZN, etc. I favor having that person be hands-on, and deeply involved with integration work. Integration gives one a birds-eye view of which individual contributors are ahead or behind. Granted, that'd be heresy at most large firms which consider themselves adept at software engineering. Even so I find it to be a trend among successful small firms.
One other point... Working with Big Data, and especially in the case of working with cloud computing, the biggest risk and the most complex part of the architecture is data loading. In most enterprise operations, data loading is taken quite seriously. If you're running dozens or hundreds (or thousands?) of servers in a Hadoop cluster, then take a tip from the large shops and get serious about how you manage your data. Have a data architect -- quite possibly, the aforementioned team leader/manager.
YMMV.
Posted by Paco Nathan at 22:09 | 1 comment
2008-08-26
hadoop in the cloud - patterns for automation
A very good use for Hadoop is to manage large "batch" jobs, such as log file analysis. A very good use for elastic server resources -- such as Amazon EC2, GoGrid, AppNexus, Flexiscale, etc. -- is to handle large resource needs which are periodic but not necessarily 24/7 requirements, such as large batch jobs.
Running a framework like Hadoop on elastic server resources seems natural. Fire up a few dozen servers, crunch the data, bring down the (relatively smaller) results, and you're done!
What also seems natural is the need to have important batch jobs automated. However, the mechanics of launching a Hadoop cluster on remote resources, monitoring for job completion, downloading results, etc., are far from being automated. At least, not from what we've been able to find in our research.
Corporate security policies come into play. For example, after a Hadoop cluster has been running for several hours to obtain results, your JobTracker out on EC2 may not be able to initiate a download back to your corporate data center. Policy restrictions may require that transfer to be initiated from the other side.
There may be several steps involved in running a large Hadoop job. Having a VPN running between the corporate data center and the remote resources would be ideal, and allow for simple file transfers. Great, if you can swing it.
Another problem is time: part of the cost-effectiveness of this approach is to run the elastic resources only as long as you need them -- in other words, while the MR jobs are running. That may make the security policies and server automation difficult to manage.
One approach would be to use message queues on either side: in the data center, or in the cloud. The scripts which launch Hadoop could then synchronize with processes on your premises via queues. A queue poller on the corporate data center side could initiate data transfers, for example.
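As a rough sketch of what that data-center-side poller might look like -- written here with today's boto3 client, against a hypothetical queue URL and a placeholder fetch_results() step, neither of which is part of any real setup:

# queue_poller.py -- minimal sketch of a poller running inside the corporate network
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/hadoop-job-events"  # hypothetical

def fetch_results(s3_path):
    # placeholder: start the download from inside the corporate network,
    # so no inbound connection from EC2 is ever required
    print("downloading results from %s" % s3_path)

while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        fetch_results(msg["Body"])   # assume the job's last step posted its S3 output path
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])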
I would be very interested to hear how others are approaching this issue.
Posted by Paco Nathan at 16:29 | 0 comments
2008-07-08
a methodology for cloud computing architecture - part 5
Diagram, redux.
Errata: fixed issues regarding 3Tera and CohesiveFT, both of which deserved better positioning in the stack than my cloudy thoughts had heretofore imagined.
Again, many thanks to vendors who are contacting me, and kudos in general to discussions on the "cloud-computing" group.
One thing jumps out from this diagram now: items 6 and 12 are placeholders. What vendors or projects might emerge to cover them? For example, might some enterprising soul rewrite portions of the Scalr project so that it can handle APIs for other vendors?
Posted by Paco Nathan at 17:22 | 0 comments
2008-07-07
a methodology for cloud computing architecture - part 4
[Ed: the diagram shown below has been updated; see previous diagram and the discussion update.]
The diagram won't go away -
This time around, I'd like to take a few notes from both the Globus VWS and the GoGrid / Flexiscale discussions on the "cloud-computing" email list...
Suppose you just happened to have a bunch of hardware, and wanted to leverage that as a grid – in other words, as "disposable infrastructure" to use the phrasing from 3Tera, or perhaps "managed virtualization". From what I've read in the respective web sites, there are a few ways to approach that. 3Tera has AppLogic, and also there is the Eucalyptus project which exposes a layer looking like EC2.
At a bit higher level of abstraction, there are offerings from CohesiveFT and Enomalism which provide similar "grid" atop your metal, but also manage your images for your cloud-based apps on EC2. In other words, those allow for portability of cloud-based apps – whether on your own metal or on metal leased from AWS.
Slicing this stack diagonally in another direction, there are RightScale and Scalr, which manage scaling your cloud-based app on EC2 – and monitor it. The former in particular provides excellent features for scripting and monitoring apps. Flexiscale provides similar features for apps run on their cloud (in the UK – hope I'm getting those facts straight).
Traveling up the stack on the batch side, Globus provides open source software to run jobs on your own metal, or on EC2 – or on scientific grids, if you qualify.
Meanwhile, GigaSpaces has announced joint projects with both RightScale (for EC2) and CohesiveFT (for your own metal + EC2) to run spaces for SBA – in other words, virtualized middleware. That appears to take the integration of cloud portability and cloud management to another level.
The diagram still needs work along the Hadoop / HDFS portion of the stack – which was its original intention. Obviously there is no distributed file system when talking about Globus, Condor, etc.
Overall, it's quite nice to see this evolution among vendors and product offerings. Many thanks to the vendors who've been taking me on guided tours of their wares.
Posted by Paco Nathan at 20:25 | 2 comments
2008-07-05
a methodology for cloud computing architecture - part 3
[Ed: the diagram shown below has been updated; see previous diagram and the discussion update.]
One more iteration on that diagram.
This time we'll represent some structured approaches running atop a Hadoop cluster. Parallel R being one that addresses our department's needs the most directly...
FWIW, I'm more interested in what can be accomplished with batch jobs and non-relational database approaches -- on Linux-based commodity hardware. Probably an occupational hazard for working with very large data sets and fast turn-around time requirements, where fault-tolerance seems to be what cripples very large-scale relational approaches, in practice. In other words, if your solution to handling very large data requires an SQL query, I might not take the call. At least not for analytics.
Having said that, we still have a ton of transaction / web-based front-end work, which was written for LAMP and still fits relational better.
Posted by Paco Nathan at 20:06 | 0 comments
2008-06-30
a methodology for cloud computing architecture - part 2
[Ed: the diagram shown below has been updated; see previous diagram and the discussion update.]
Thanks to many people who gave their feedback – publicly and privately – about the diagram in my previous post. Apologies to RightScale and others for inverting their placement. I definitely want to recommend John Willis for an excellent overview of the vendors in this space.
Will keep attempting to articulate this until we get it right :)
- I really wanted to avoid virtualization. However, it looks like that's a foregone conclusion.
- I prefer to place the cloud layer atop the grid layer. Several good resources disagree on that; the two terms flip-flop.
- Our "3x" sidebar about TCO and AWS was quite interesting. Would like to hear more from others about how they estimate that metric.
Mostly, I'm interested in software solutions which can place a "cloud manager" layer atop our own equipment – so that we could pull apps from one cloud to another depending on needs. Labeled "5" in the new diagram.
Comments are welcomed.
Posted by Paco Nathan at 21:08 | 1 comment
2008-06-29
a methodology for cloud computing architecture
[Ed: the diagram shown below has been updated; see previous diagram and the discussion update.]
I got asked to join a system architecture review recently. The subject was a plan for data center redundancy. The organization had one data center, and it included some relatively big iron:
- a large storage system appliance, with high capex (its name starts with an "N", I'll let you guess)
- a large RDBMS with expensive licensing (its name starts with an "O", I'll let you guess)
- a large data warehouse appliance, with GINORMOUS capex, expensive maintenance fees, and far too much downtime (its name includes a "Z", I'll let you guess)
- plus several hundred Linux servers on commodity hardware running a popular virtualization product (its name includes an "M" and a "W", I'll let you guess)
Trouble is, most of those Ginormous capexii species of products – particularly the noisome Ginormous capexii opexii subspecies – were already heavily overloaded by the organization. That makes the "50% capacity" load balancing story hard to believe. Otherwise it would have been a good approach, circa 2005. These days there are better ways to approach the problem...

After reviewing the applications it became clear that the data management needed help – probably before any data centers got replicated. One immediate issue concerned unwinding dependencies on their big iron data management, notably the large RDBMS and DW appliance. Both were abused at the application level. One should use an RDBMS as, say, a backing store for web services that require transactions (not as a log aggregator), or for reporting. One should use a DW for storing operational data which isn't quite ready to archive (not for cluster analysis).
We followed an approach of peeling single applications out of that tangled mess first. Some could run on MySQL. Others were periodic batch jobs (read: large merge sorts) which did not require complex joins, and could be refactored as Hadoop jobs more efficiently. Perhaps the more transactional parts would stay in the RDBMS. By the end of the review, we'd tossed the DW appliance, cut RDBMS load substantially, and started to view the filer appliance as a secondary storage grid.
Another approach I suggested was to think of cloud computing as the "duplicating 50% capacity" argument taken to the next level. Extending our "peel one off" approach, determine which applications could run on a cloud – without much concern about "where" the cloud ran other than the fact that we could move required data to it. Certainly most of the Hadoop jobs fit that description. So did many of the MySQL-based transactions. Those clouds could migrate to different grids, depending on time, costs, or failover emergencies.
One major cost is data loading across grids. However, the organization gains by duplicating its most critical data onto different grids and has a business requirement to do so.
In terms of grids, the organization had been using AWS for some work, but had fallen into a common trap of thinking "Us (data center) versus Them (AWS)", arguing about TCO, vendor lock-in, etc. Sysadmins calculated a ratio of approximately 3x cost at AWS – if you don't take scalability into consideration. In other words, if a server needs to run more than 8 hours per day, it starts looking cheaper to run on your own premises than to run it on EC2.
I agree with the ratio; it's strikingly similar to the 3x markup you find buying a bottle of good wine at any decent restaurant. However, scalability is a crucial matter. Scaling up as a business grows (or down, as it contracts) is the vital lesson of Internet-based system architecture. Also, the capacity to scale up failover resources rapidly in the event of an emergency (data center lands under an F4 tornado) is much more compelling than TCO.
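To make the ratio concrete with made-up numbers (not actual 2008 prices): if an owned box amortizes to roughly one third of the EC2 hourly rate when it runs 24/7, the break-even lands right around 8 hours of use per day:

# break-even sketch with illustrative numbers only
ec2_hourly = 0.30                    # assumed on-demand rate, $/hour
owned_daily = 0.10 * 24              # owned box amortized to ~1/3 that hourly rate, running 24/7
breakeven_hours = owned_daily / ec2_hourly
print("break-even: %.1f hours/day" % breakeven_hours)   # 8.0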
In the diagram shown above, I try to show that AWS is a vital resource, but not the only vendor involved. Far from that, AWS provides services for accessing really valuable metal; they tend to stay away from software, when it comes to cloud computing. Plenty of other players have begun to crowd that market already.
I also try to separate application requirements into batch and online categories. Some data crunching work can be scheduled as batch jobs. Other applications require a 24/7 presence but also need to scale up or down based on demand.
What the diagram doesn't show are points where message queues fit – such as Amazon's SQS or the open source AMQP implementations. That would require more of a cross-section perspective, and, well, may come later if I get around to a 3D view. Frankly, I'd rather redo the illustration inside Transmutable, because that level of documentation for architectures belongs in real architectural tools such as 3D spaces. But I digress.
To me, there are three aspects of the landscape which have the most interesting emerging products and vendors currently:
- data management as service-in-the-cloud, particularly the non-relational databases (read: BigTable envy)
- job control as service-in-the-cloud, particularly Hadoop and Condor (read: MapReduce+GFS envy, but strikingly familiar to mainframe issues)
- cloud service providing the cloud (where I'd be investing, if I could right now)
To abstract and summarize some of the methodology points we developed during this system architecture review:
- Peel off the applications individually, to detangle the appliance mess (use case analysis).
- Categorize applications as batch, online, heavy transactional, or reporting – where the former two indicate likely cloud apps.
- Think of cloud computing as a way to load balance your application demands across different grids of available resources.
- Slide the clouds across different grids depending on costs, scheduling needs, or failover capacity.
- Take the hit to replicate critical data across different grids, to have it ready for a cutover within minutes; that's less expensive than buying insurance.
- Run your own multiple data centers as internal grids, but have additional grid resources ready for handling elastic demands (which you already have, in quantity).
- Reassure your DBAs and sysadmins that their roles are not diminished due to cloud computing, and instead become more interesting – with hopefully a few major headaches resolved.
My diagram is a first pass, drafted the night after our kids refused to sleep because of a thunderstorm... subject to many corrections and at least a few revisions. The vendors shown fit approximately, circa mid-2008. Obviously some offer overlapping services into other areas. Most vendors mentioned are linked in my public bookmarks: http://del.icio.us/ceteri/grid
Comments are encouraged. I haven't seen this kind of "vendor landscape" diagram elsewhere yet. If you spot one in the wild, please let us know.
Posted by Paco Nathan at 08:22 | 6 comments
2008-03-12
hadoop, part 3: graph-based nlp
I've been evaluating graph-based methods for language analytics. At my previous employer, we used statistical parsing and non-parametric methods as a basis for semantic analysis of texts. In other words, use a web crawler to grab some text content from a web page, run it through a natural language parser (after a wee bit o' clean-up), then use the resulting parse trees to extract meaning from the text. Typical stages required for parsing include segmentation, tokenizing, tagging, chunking, and possibly some post-processing after all that. Our use case was to identify noun phrases in the text, which is called NP chunking. Baseline metrics run at about 80% for success using that kind of approach.
More recent work on graph-based methods allows for improved results with considerably faster throughput rates. We did not apply this approach at HeadCase, but my subsequent tests have shown better than an order of magnitude performance gain, in terms of CPU time, when running side-by-side tests on the same processor and data set. This approach, based on link analysis within a graph, eliminates the need for chunking. Instead it iterates on a difference equation, which tends to converge rather quickly (depending on the size of the graph).
As an example, I took input text from an encyclopedia article about Java Island found online. It has 3 paragraphs, 16 sentences, 331 words, 2053 bytes. Using a popular, open source NLP library, a parser-based approach for identifying noun phrases takes an average of 8341 milliseconds on my MacBook laptop, while the same task using the same libraries with a graph-based approach takes an average of 257 milliseconds. That's roughly a 32x speed-up.
To be fair, some researchers have used tagging followed by regular expressions to perform fairly good NP chunking. That works, quickly, but not so well. I needed better results.
The following table illustrates the success of the two approaches, compared with the top 27 key phrases selected by a human editor:

While the graph-based approach has a higher error rate, in terms of recognizing actual noun phrases, it results in comparable F-measure precision rates – with single-digit relative error between the two approaches. That is accounted for by the fact that each step in a stochastic parser introduces some error and ambiguity because of variants, but a relatively simple use of dynamic programming can be applied to prune variants and reduce error overall. In contrast, a graph-based analysis tends to identify better high-value targets for key phrases – and more importantly, it tends to rank true positives higher than the noise, which is not something that a parser-based approach can provide. If there is post-processing involved for contextual analysis, the noun phrases produced by a graph-based approach appear superior – and substantially more performant.
Another point to consider, when analyzing large amounts of text, is that graph-based analysis is relatively simple and efficient to implement using a map-reduce framework such as Hadoop.
The trick is… how to represent a stream of morphemes in natural language as a graph.
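Without giving away the punchline, here's one way to picture the idea in Python: treat each token as a node, link tokens that co-occur within a small sliding window, then iterate a simple rank update until it settles. The whitespace tokenizer, the window size, and the undamped update are all simplifications for illustration; this is not the actual code.

# sketch: rank tokens by iterating over a co-occurrence graph
from collections import defaultdict

def rank_tokens(text, window=3, iterations=20):
    tokens = text.lower().split()
    edges = defaultdict(set)
    for i, tok in enumerate(tokens):
        for other in tokens[i + 1:i + window]:   # link tokens within the sliding window
            if other != tok:
                edges[tok].add(other)
                edges[other].add(tok)
    rank = {tok: 1.0 for tok in edges}
    for _ in range(iterations):                  # iterate the difference equation until it settles
        new_rank = {}
        for tok in edges:
            new_rank[tok] = sum(rank[nbr] / len(edges[nbr]) for nbr in edges[tok])
        rank = new_rank
    return sorted(rank, key=rank.get, reverse=True)

print(rank_tokens("the island of java lies between sumatra and bali ...")[:5])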
Posted by Paco Nathan at 19:17 | 0 comments | Tags: algorithms, grid, hadoop, linguistics
2008-02-20
hadoop, part 2: jyte cred graph
Following from Part 1 in the series, this article takes a look at an initial application in Hadoop, using a simplified PageRank algorithm with graph data from jyte.com to illustrate a gentle introduction to distributed computing. The really cool part is that this runs quickly on my MacBook laptop, the same Java code can run and scale without modification on Amazon AWS, on an IBM mainframe, and (with minor changes) in the clouds at Yahoo and Google. Source code is available as open source on Google Code at:
The social networking system jyte.com provides means for using OpenID as primary identification for participants – at least one is required for login. Participants then post "claims" on which others vote. They can also give "cred" to other participants, citing specific areas of credibility with tags. The contacts API provides interesting means for managing "social whitelisting".
Okay, that was a terribly dry, academic description, but the bottom line is that jyte is fun. I started using it shortly after its public launch in early 2007. You can find a lot more info on their wiki. Some of the jyte data gets published via their API page – essentially the "cred" graph. Others have been working with that data outside of jyte.com, notably Jim Snow's work on reputation metrics.
Looking at the "cred" graph, it follows a form suitable for link analysis such as the PageRank used by Google. In other words, the cred given by one jyte user to another resembles how web pages link to other web pages. Therefore a PageRank metric can be determined, as Wikipedia describes, "based upon quantity of inbound links as well as importance of the page providing the link." Note that this application uses a simplified PageRank, i.e., no damping term in the difference equation.
Taking the jyte cred graph, inverting its OpenID mapping index, then calculating a PageRank algorithm seemed like a reasonably-sized project. There are approximately 1K nodes in the graph (currently) which provides enough data to make the problem and its analysis interesting, but not so much to make it unwieldy. Sounds like a great first app for Hadoop.
The data as downloaded from jyte.com has a caveat: it has been GZIP'ed twice. Once you get it inflated and loaded into the HDFS distributed file system, as "input", the TSV text file line format looks like:
from_openid to_openid csv cred tags
Pass 1 in Hadoop is a job which during its "map" phase takes the downloaded data from a local file, and during the "reduce" phase it aggregates the one-to-many relation of OpenID URIs which give cred to other OpenID URIs. See in the Java source code the method configPass1() in the org.ceteri.JyteRank class, and also the inner classes MapFrom2To and RedFrom2To. The resulting "from2to" TSV text data line format looks like:
from_openid tsv list of to_openids
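The actual implementation is Java (the MapFrom2To and RedFrom2To classes), but the same pass reads naturally as a Hadoop Streaming sketch in Python; the script names below are made up, and the input is the raw TSV line format shown earlier.

# pass1_mapper.py -- emit (from_openid, to_openid), dropping the cred tags column
import sys
for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) >= 2:
        print("%s\t%s" % (fields[0], fields[1]))

# pass1_reducer.py -- aggregate each from_openid's outbound list (input sorted by key)
import sys
current, targets = None, []
for line in sys.stdin:
    frm, to = line.rstrip("\n").split("\t", 1)
    if frm != current and current is not None:
        print("%s\t%s" % (current, "\t".join(targets)))
        targets = []
    current = frm
    targets.append(to)
if current is not None:
    print("%s\t%s" % (current, "\t".join(targets)))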
Pass 2 is a very simple job which initializes rank values for each OpenID URI in the graph – starting simply with a value of 1. We'll iterate to change that, later. See the method configPass2() and also the inner classes MapInit and RedInit. The resulting "prevrank" TSV text data line format looks like:
from_openid rank
Pass 3 performs a join of "from2to" data with "prevrank" data, then expands that into individual terms in the PageRank summation, emitting a partial calculation for each OpenID URI. See the method configPass3() and also the inner classes MapExpand and RedExpand. The resulting "elemrank" TSV text data line format looks like:
to_openid rank_term
Pass 4 aggregates the "elemrank" data to sum the terms for each OpenID URI in the graph, reducing them to a rank metric for each. See the method configPass4() and also the inner classes MapSum and RedSum. The resulting "thisrank" TSV text data line format looks like:
to_openid rank
The trick with PageRank is that it uses a difference equation to represent the equilibrium values in a network, iterating through sources and sinks. So the calculation generally needs to iterate before it converges to a reasonably good approximation of the solution at equilibrium. In other words, after completing Pass 4, the code swaps "thisrank" in place of "prevrank" and then iterates on Pass 3 and Pass 4. Rinse, lather, repeat – until the difference between two successive iterations produces a relatively small error.
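A compact way to see Passes 2 through 4 without spinning up a cluster is a local Python sketch of the same simplified, undamped update; the toy cred graph at the bottom is invented, and this illustrates the iteration rather than reproducing the Java job:

# local sketch of the simplified (no damping) PageRank iteration behind Passes 2-4
from collections import defaultdict

def simple_pagerank(from2to, iterations=9):
    rank = {node: 1.0 for node in from2to}               # Pass 2: initialize every rank to 1
    for _ in range(iterations):
        elem = defaultdict(float)
        for frm, targets in from2to.items():             # Pass 3: expand into partial rank terms
            for to in targets:
                elem[to] += rank.get(frm, 0.0) / len(targets)
        rank = dict(elem)                                 # Pass 4: sum the terms into "thisrank"
    return rank

cred = {"alice": ["bob", "carol"], "bob": ["carol"], "carol": ["alice"]}
print(simple_pagerank(cred))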
Whereas Google allegedly runs their PageRank algorithm through thousands of iterations, running for several hours each day, I have a hunch that jyte.com does not run many iterations to determine their cred points. Stabbing blindly, I found that using 9 iterations seems to produce a close correlation to the actual values listed as jyte.com cred points. A linear regression can then be used to approximate the actual values:

My analysis here is quite rough, and the code is far from optimal. For example, I'm handling data in-between the map and reduce phases simply as TSV text. There could be much improvement by using subclassed InputFormat and OutputFormat definitions for my own data formats. Even so, the results look pretty good: not bad for one afternoon's work, eh?
I have big reservations about PageRank for web pages, since it assumes that an equilibrium state effectively models attention given to web pages. PageRank provides a mathematical model for a kind of "coin toss", where a user looking at a browser either chooses to navigate to some random URL (with probability 0.15) or they follow one of the links on the current page (with probability 0.85). In practice that implies how some important web pages will experience substantial latency before PageRank begins to "notice" them, since PageRank must wait for (a) other pages to link to the new page, and (b) their web crawlers to visit and analyze those changes. That process can take several days.
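For reference, one common formulation of the damped update that encodes that coin toss, with N being the total number of pages, is: rank(u) = 0.15/N + 0.85 * sum( rank(v) / outdegree(v) ), summed over the pages v that link to u. Dropping the 0.15/N term and the 0.85 factor gives the simplified version used above for the cred graph.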
A more realistic model of human behavior might take into account that – contrary to Google PageRank assumptions – people no longer navigate to some random URL merely because it is in their bookmarks or is their individual home page. That might have been the general case in 1998, but these days I get my fix of random URLs because of email, IM, RSS, etc., all pushing them out to me. In other words, there is an implied economy of attention (assuming I only look at one browser window at a time) which has drivers that are external to links in the World Wide Web. In lieu of using equilibrium models, the link analysis might consider some URLs as "pumps" and then take a far-from-equilibrium approach.
Certainly, a site such as CNN.com demonstrates that level of network effects, and news events function as attention drivers.
But I digress.
A simplified PageRank analysis based on an equilibrium model appears to work fine enough to approximate jyte.com cred points. It also illustrates how to program within the Hadoop framework. I have ulterior motives for getting jyte cred graph data into Hadoop – which we'll examine in the next part of this series.
Posted by Paco Nathan at 11:35 | 0 comments | Tags: algorithms, autopoiesis, digerati, grid, hadoop