election visualizations

I use the following tools to watch the 2008 US presidential election take shape. These tools show great applications of geographic analytics, prediction markets, data visualization, etc.

My one wish would be to see an aggregate view of them as a cartogram.


ceteris paribus and the basted egg heuristic

Years ago, I had a job working as a short-order cook in a reasonably nice restaurant. One thing you learn quickly in that kind of work is what you can cook simply, reliably, while handling N other orders at the same time. Perhaps this falls into the general category of load-balancing or speculative execution, but some orders work better than others.

An example of a simple order is a "basted egg". One egg, some water and some oil in a small covered pan, burner set low, for a few minutes. I happen to like eating those for breakfast. Even though they almost never appear on a breakfast menu, it's one of the simplest things that a short-order cook can be asked to make -- aside from toast.

When I'm checking out a new little cafe for a quick breakfast, oftentimes I'll ask for a basted egg. That alone provides amazing insights to the structure and vitality of the restaurant... I've found that most wait staff and short-order cooks who are any good at preparing breakfast will jump at the chance to earn 3x food cost + their 20% tip by serving the simplest thing possible. (assuming that wait staff splits tips with cooks, which they should). Generally they do quite well, I enjoy breakfast, and pay more than 20% tip. Done.

However, using this metric, I've found an heuristic: restaurants on the brink of failure -- and there are many, almost always, given the nature of the business -- tend to fumble on something even as simple as a basted egg. I've found that almost every restaurant which told me "Sorry, no basted eggs here" has closed within the following 2 months. In fact, I've got a running calculation for a binomial confidence interval on that heuristic test converging near 1.0 as a best point estimate. Which is to say, pretty damn certain. Generally, I'd attribute that kind of failure to lack of insight or communication by the restaurant management, or simply that they'd hired untrained wait staff and kitchen staff. Generally the former condition is more opaque, and thus more likely as a risk to the restaurant.

Years ago -- and not nearly as many years ago as the cooking gigs, but close -- I started working in the technology industry. My first full-time job in Silicon Valley was in 1983... though I'd been mentored about the industry by my uncle (who was deep in it) going back to the late 1960s. In other words, I've been learning about what works and what doesn't in the tech industry for approximately 40 years, with the past 25 years of it hands-on. (Yikes, I'm that old now?!)

During that time, I've developed and tuned a robust heuristic for measuring the vitality of tech organizations -- not so different from my basted egg test in restaurants. Simply put, when I walk into a meeting, I ask myself three questions about everyone in the room. I won't disclose those questions here, but they are listed on my resume, which is public. I'll leave that research as an exercise for the interested reader. Anywhoo, every participant in the meeting gets scored on a scale of 0-3 points, whether they like it or not. Here's how I evaluate scores on an individual level:

3. great to work with, nurture a long-term relationship
2. short-term only, watch carefully for cracks in the foundation
1. limit dialogue to what is required
0. migrate in opposite direction!

Okay, I'll be the first to say that may sound trite, but hey it works. Even when you're encountering people who may not be the sharpest minds in the industry, if they score highly, consistently on that scale then ceteris paribus they'll likely do well in the industry.

I recall -- vividly -- one restaurant in central Austin, where I asked for a basted egg... the manager came out, and, without asking, placed his arm around my shoulders while explaning to me, "Son, there's no such thing as a basted egg." As one might imagine, I got up and left rather suddenly, and more adroitly put, there was no such thing as that idiot's employment 4 weeks later.

On the organizational level, based on the past few decades of experience watching tech companies struggle to be viable -- struggle on par with that found in the restaurant biz -- on one hand, if a tech firm has plenty of people who score 3's consistently, they've got my attention and quite likely could get my commitment too. On the other hand, if they've got people who score 1's or 0's consistently then, just like that ill-fated breakfast nook in Austin with the smarmy manager, get up and leave quickly!


cloud computing - methodology notes

A couple companies ago, one of my mentors -- Jack Olson, author of Data Quality -- taught us team leaders to follow a formula for sizing software development groups. Of course this is simply a guidance, but it makes sense:

9:3:1 for dev/test/doc

In other words, a 3:1 ratio of developers to testers, and then a 9:1 ratio of developers to technical writers. Also figure in how a group that size (13) needs a manager/architect and some project management.

On the data quality side, Jack sent me out to talk with VPs at major firms -- people responsible for handling terabytes of customer data -- and showed me just how much the world does *not* rely on relational approaches for most data management.

Jack also put me on a plane to Russia, and put me in contact with a couple of firms -- in SPb and Moscow, respectively. I've gained close friends in Russia, as a result, and profound respect for the value of including Russian programmers on my teams.

A few years and a couple of companies later, I'm still building and leading software development teams, I'm still working with developers from Russia, I'm still staring at terabytes (moving up to petabytes), and I'm still learning about data quality issues.

One thing has changed, however. During the years in-between, Google demonstrated the value of building fault-tolerant, parallel processing frameworks atop commodity hardware -- e.g, MapReduce. Moreover, Hadoop has provided MapReduce capabilities as open source. Meanwhile, Amazon and now some other "cloud" vendors have provided cost-effective means for running thousands of cores and managing petabytes -- without having to sell one's soul to either IBM or Oracle. By I digress.

I started working with Amazon EC2 in August, 2006 -- fairly early in the scheme of things. A friend of mine had taken a job working with Amazon to document something new and strange, back in 2004. He'd asked me terribly odd questions, ostensibly seeking advice about potential use cases. When it came time to sign up, I jumped at the chance.

Over the past 2+ years I've learned a few tricks and have an update for Jack's tried-and-true ratios. Perhaps these ideas are old hat -- or merely scratching the surface or even out of whack -- at places which have been working with very large data sets and MapReduce techniques for years now. If so, my apologies in advance for thinking aloud. This just seems to capture a few truisms about engineering Big Data.

First off, let's talk about requirements. One thing you'll find with Big Data is that statistics govern the constraints of just about any given situation. For example, I've read articles by Googlers who describe using MapReduce to find simple stats to describe a *large* sample, and then build their software to satisfy those constraints. That's an example of statistics providing requirements. Especially if your sample is based on, say, 99% market share :)

In my department's practice, we have literally one row of cubes for our Statisticians, and across the department's center table there is another row of cubes for our Developers. We build recommendation systems and other kinds of predictive models at scale; when it comes to defining or refining algorithms that handle terabytes, we find that analytics make programming at scale possible. We follow a practice of having statisticians pull samples, visualize the data, run analysis on samples, develop models, find model parameters, etc., and then test their models on larger samples. Once we have selected good models and the parameters for them, those are handed over to the developers -- and to the business stakeholders and decisions makers. Those models + parameters + plots become our requirements, and part of the basis for our acceptance tests.

Given the sheer lack of cost-effective tools for running statistics packages at scale (SAS is not an option, R can't handle it yet) then we must work on analytics through samples. That may be changing, soon. But I digress.

Developers working at scale are a different commodity. My current employer has a habit of hiring mostly LAMP developers, people who've demonstrated proficiency with PHP and SQL. That's great for web apps, but hand the same people terabytes and they'll become lost. Or make a disastrous mess. I give a short quiz during every developer interview to insure that our candidates understand how a DHT works, how to recognize a good situation for using MapReduce, and whether they've got any sense for working with open source projects. Are they even in the ballpark? Most candidates fail.

I certainly don't look for one-size-fits all, either: some people are great working on system programming, which is crucial for Big Data, and others are more suited for working on algorithms. The latter is problematic, because algorithms at scale do not tend to work according to how logic might dictate. Precious little at scale works according to how logic might dictate. In fact, if I get a sense that a developer or candidate is too "left brained" and reliant on "logic", then I know already they'll be among the first to fail horribly with MapReduce. One cannot *think* their way through failure cases within a thousand cores running on terabytes. Data quality issues alone should be sufficient, but there are deeper forces involved. Anywho, one must *measure* their way through troubleshooting at scale -- which is another reason why statistics and statisticians need to be close at hand.

The trouble is that most commodity hardware is built to perform well within given sets of tolerance. If you restart your laptop a hundred times, it will probably work fine; however if you launch 100 virtual servers, some percentage will fail. Even among the ones that appear to run fine, there will be disk failures, network failures, memory glitches, etc., some of which do not become apparent until they've occurred many times over. So when you pump terabytes through a cluster of servers for hours and hours, some parts of your data is going to get mangled.

Which brings up another problem: unit tests run at scale don't provide a whole lot of meaning. Sure, it's great to have unit tests to validate that your latest modification and check-ins pass tests with small samples of data. Great. Even so, minor changes in code, even code that passes its unit test, can cascade into massive problems at scale. Testing at scale is problematic, because sometimes you simply cannot tell what went wrong unless you go chasing down several hundred different log files, each of which is many Gb. Even if results get produced, with Big Data you may not know much about the performance of results until they've had time to go through A/B testing. Depending on your operation, that may take days to evaluate correctly.

One trick is to have statisticians acting in what would otherwise be a "QA" role -- poking and prodding the results of large MapReduce jobs.

One other trick is to find the people talking up the use of scrum, XP, etc., and relocate those people to another continent. I use iterative methodology -- from a jumble of different sources, recombined for every new organization. However, when someone wants to sell me a "agile" product where unit tests adjudicate the project management, I show that person the door. Don't let it hit you on the way out. In terms of working with cloud-based architecture and Big Data, something new needs to be invented for methodology, something beyond "agile". I'll try to document what we've found to work, but I don't have answers beyond that.

Some key points:

Each statistician can prompt enough work for about three developers; it's a two-way street, because developers are generally helping analysts resolve system issues, pull or clean up data that's problematic, etc.

Statisticians work better in pairs or groups, not so well individually. In fact, I round up all the quantitative analysts in the company once each week for a "Secret Statisticians Meeting". We run the meeting like a grad seminar. Management and developers are not allowed, unless they have at least a degree in mathematics -- and even then they must prove their credentials by presenting and submitting to a review.

In any case, I try to assign a minimum of two statisticians to any one project, to get overlap of backgrounds and shared insights.

We use R, it's great for plots and visualizing data. I find that hiring statisticians who have the chops to produce good graphics and dashboards -- that's crucial. Stakeholders don't grasp an equation for how to derive variance, but they do grasp seeing their business realities depicted in a control chart or scatterplot. That communication is invaluable, on all sides. Hire statisticians who know how leverage R to produce great graphics.

Back to the developers, some are vital as systems programmers -- for scripting, troubleshooting, measuring, tweaking, etc. We use Python mostly for that. You won't be able to manage cloud resources without that talent. Another issue is that your traditional roles for system administrators, DBAs, etc., are not going to help much when it comes to troubleshooting an algorithm running in parallel on a thousand cores. They will tend to make remarks like "Why don't you just reboot it?" As if. So you're going to have to rely on your own developers to run operations. Those are your system programmers.

Other developers work at the application layer. Probably not the same people, but you might get lucky. Don't count on it. Algorithm work requires that a programmer can speak statistics, and there aren't many people who cross both worlds *and* write production quality code for Hadoop.

Another problem is documentation. Finding a tech writer to document a team working on Big Data would entail finding an empath. Besides, the statisticians generate a lot of documents, and the developers (and their automated tools) are great at generating tons of text too. Precious little of it dovetails, however. Of course we use a wiki to capture our documentation -- it has versioning, collaborative features, etc. -- and we also use internal blogs for "staging" our documentation, writing drafts, logging notes, etc. Even with all those nifty collab tools, it still takes an editor to keep a bunch of different authors and their texts on track -- *not* a tech writer, but an actual editor. Find one. They tend to be inexpensive compared with developers, and hence more valuable than anybody but the recruiters. But I digress.

Back to those team ratios from Jack Olson... I have a hunch that the following works better in Big Data:
2:3:3:1 for stats/sys/app/edit

A team of 9 needs a team leader/manager -- which, as I understand, might be called a "TL/M" at GOOG, a "2PT" at AMZN, etc. I favor having that person be hands-on, and deeply involved with integration work. Integration gives one a birds-eye view of which individual contributors are ahead or behind. Granted, that'd be heresy at most large firms which consider themselves adept at software engineering. Even so I find it to be a trend among successful small firms.

One other point... Working with Big Data, and especially in the case of working with cloud computing... the biggest risk involved and the most complex part of the architecture involves data loading. In most enterprise operations, data loading is taken quite seriously. If you're running dozens or hundreds (or thousands?) of servers in a Hadoop cluster, then take a tip from the large shops and get serious about how you manage your data. Have a data architect -- quite possibly, the aforementioned team leader/manager.



hadoop in the cloud - patterns for automation

A very good use for Hadoop is to manage large "batch" jobs, such as log file analysis. A very good use for elastic server resources -- such as Amazon EC2, GoGrid, AppNexus, Flexiscale, etc. -- is to handle large resource needs which are periodic but not necessarily 24/7 requirements, such as large batch jobs.

Running a framework like Hadoop on elastic server resources seems natural. Fire up a few dozen servers, crunch the data, bring down the (relatively smaller) results, and you're done!

What also seems natural is the need to have important batch jobs automated. However, the mechanics of launching a Hadoop cluster on remote resources, monitoring for job completing, downloading results, etc., are far from being automated. At least, not from what we've been able to find in our research.

Corporate security policies come into play. For example, after a Hadoop cluster running for several hours to obtain results, your JobTracker out on EC2 may not be able to initiate a download back to your corporate data center. Policy restrictions may require that to be initiated from the other side.

There may be several steps involved in running a large Hadoop job. Having a VPN running between the corporate data center and the remote resources would be ideal, and allow for simple file transfers. Great, if you can swing it.

Another problem is time: part of the cost-effectiveness of this approach is to run the elastic resources only as long as you need them -- in other words, while the MR jobs are running. That may make the security policies and server automation difficult to manage.

One approach would be to use message queues on either side: in the data center, or in the cloud. The scripts which launch Hadoop could then synchronize with processes on your premises via queues. A queue poller on the corporate data center side could initiate data transfers, for example.

Would be very interested to hear how others are approach this issue.


shared memory: humdog

News came today that a long-time friend had died over the weekend: Carmen Hermosillo, aka "Humdog". I am very sad to hear. She is dearly missed.

Humdog generally kept Skype running on her computer all the time, which now shows "last seen on 2008-08-12 11:18 GMT-7". Someone must have turned off her computer just then.

Thinking back, it must have been about 1991 when I first "met" Carmen online, on The Well when I first joined as part of the editorial staff at bOING-bOING. Writers enjoy heady discussions, and Carmen had a knack for turning discussions on their head. She also had the most incredible ways of steering a group online.

More recently, I hired Carmen as a researcher for my team at HeadCase. Perfect for the job, she thrived in that role. There was one point, when the company was first getting started in 2006... Humdog tried to get each of the managers to read a particular article. Everyone replied, "Wow, great, but I'm really too busy now on a certain business planning document, blah blah, will catch up later, etc." As it turned out, that article broke news about virtual worlds which turned our first draft business plan inside out, but in a good way. Some of us finally listened to Carmen, in time to make necessary changes.

Oddly enough, Carmen and I never met in person until the final HeadCase team meetings in July 2008, held in San Luis Obispo. My hometown provided a beautiful, relaxed setting -- our first and last meeting in the physical world.

During most of the 1990s, I was privileged to lead a rather scruffy gang of media explorers called FringeWare. Humdog had a definitive role on our magazine's masthead and an exuberant voice in our online forums.

Among those who formed the core of FringeWare, Humdog becomes the first of us to go to our ancestors.


a methodology for cloud computing architecture - part 5

Diagram, redux.

Errata: fixed issues regarding 3Tera and CohesiveFT, both of which deserved better positioning in the stack than my cloudy thoughts had heretofore imagined.

Again, many thanks to vendors who are contacting me, and kudos in general to discussions on the "cloud-computing" group.

One thing jumps out from this diagram now: items 6 and 12 are placeholders. What vendors or projects might emerge to cover them? For example, might some enterprising soul rewrite portions of the Scalr project so that it can handle APIs for other vendors?


a methodology for cloud computing architecture - part 4

[Ed: the diagram shown below has been updated; see previous diagram and the discussion update.]

The diagram won't go away -

This time around, I'd like to take a few notes from both the Globus VWS and the GoGrid / Flexiscale discussions on the "cloud-computing" email list...

Suppose you just happened to have a bunch of hardware, and wanted to leverage that as a grid – in other words, as "disposable infrastructure" to use the phrasing from 3Tera, or perhaps "managed virtualization". From what I've read in the respective web sites, there are a few ways to approach that. 3Tera has AppLogic, and also there is the Eucalyptus project which exposes a layer looking like EC2.

At a bit higher level of abstraction, there are offerings from CohesiveFT and Enomalism which provide similar "grid" atop your metal, but also manage your images for your cloud-based apps on EC2. In other words, those allow for portability of cloud-based apps – whether on your own metal or on metal leased from AWS.

Slicing this stack diagonally in another direction, there are RightScale and Scalr which manage scaling your cloud-based app on EC2 – and monitor it. The former in particular provides excellent features for scripting and monitoring apps. Flexiscale provides similar features for apps run on their cloud (in UK – hope I'm getting those facts straight).

Traveling up the stack on the batch side, Globus provides open source software to run jobs on your own metal, or on EC2 – or on scientific grids, if you qualify.

Meanwhile, GigaSpaces has announced joint projects with both RightScale (for EC2) and CohesiveFT (for your own metal + EC2) to run spaces for SBA – in other words, virtualized middleware. That appears to take the integration of cloud portability and cloud management to another level.

The diagram still needs work along the Hadoop / HDFS portion of the stack – which was its original intention. Obviously there is no distributed file system when talking about Globus, Condor, etc.

Overall, it's quite nice to see this evolution among vendors and product offerings. Many thanks to the vendors who've been taking me on guided tours of their wares.


a methodology for cloud computing architecture - part 3

[Ed: the diagram shown below has been updated; see previous diagram and the discussion update.]

One more iteration on that diagram.

This time we'll represent some structured approaches running atop a Hadoop cluster. Parallel R being one that addresses our department's needs the most directly...

FWIW, I'm more interested in what can be accomplished with batch jobs and non-relational database approaches -- on Linux-based commodity hardware. Probably an occupational hazard for working with very large data sets and fast turn-around time requirements, where fault-tolerance seems to be what cripples very large-scale relational approaches, in practice. In other words, if your solution to handling very large data requires an SQL query, I might not take the call. At least not for analytics.

Having said that, we still have a ton of transaction / web-based front-end work, which was written for LAMP and still fits relational better.


a methodology for cloud computing architecture - part 2

[Ed: the diagram shown below has been updated; see previous diagram and the discussion update.]

Thanks to many people who gave their feedback – publicly and privately – about the diagram in my previous post. Apologies to RightScale and others for inverting their placement. I definitely want to recommend John Willis for an excellent overview of the vendors in this space.

Will keep attempting to articulate this until we get it right :)

  • I really wanted to avoid virtualization. However, it looks like that's a foregone conclusion.
  • I prefer to place the cloud layer atop the grid layer. Several good resources disagree on that; the two terms flip-flop.
  • Our "3x" sidebar about TCO and AWS was quite interesting. Would like to hear more from others about how they estimate that metric.
Another note: our firm is decidedly in the camp of needing more than one cloud. For a variety of reasons, some applications must live in our own data centers. Others are great for running in AWS. Perhaps others could run on additional grids – secondary services are compelling, even if not the most competitive yet on costs. In other words, probably not as replacements for AWS, but to augment it.

Mostly, I'm interested in software solutions which can place a "cloud manager" layer atop our own equipment – so that we could pull apps from one cloud to another depending on needs. Labeled "5" in the new diagram.

Comments are welcomed.


a methodology for cloud computing architecture

[Ed: the diagram shown below has been updated; see previous diagram and the discussion update.]

I got asked to join to a system architecture review recently. The subject was a plan for data center redundancy. The organization had one data center, and included some relatively big iron:

  • a large storage system appliance, with high capex (its name starts with an "N", let you guess)
  • a large RDBMS with expensive licensing (its name starts with an "O", let you guess)
  • a large data warehouse appliance, with GINORMOUS capex, expensive maintenance fees, and far too much downtime (its name includes a "Z", let you guess)
  • plus several hundred Linux servers on commodity hardware running a popular virtualization product (its name includes an "M" and a "W", let you guess)
The plan was to duplicate those resources in another data center, then run both at 50% capacity while replicating data across them. In the event of an outage at one data center, the other would be ready to handle the full load.

Trouble is, most of those Ginormous capexii species of products – particularly the noisome Ginormous capexii opexii subspecies – were already heavily overloaded by the organization. That makes the "50% capacity" load balancing story hard to believe. Otherwise it would have been a good approach, circa 2005. These days there are better ways to approach the problem...

After reviewing the applications it became clear that the data management needed help – probably before any data centers got replicated. One immediate issue concerned unwinding dependencies on their big iron data management, notably the large RDBMS and DW appliance. Both were abused at the application level. One should use an RDBMS as, say, a backing store for web services that require transactions (not as a log aggregator), or for reporting. One should use a DW for storing operational data which isn't quite ready to archive (not for cluster analysis).

We followed an approach: peel off single applications out of that tangled mess first. Some could run on MySQL. Others were periodic batch jobs (read: large merge sorts) which did not require complex joins, and could be refactored as Hadoop jobs more efficiently. Perhaps the more transactional parts would stay in the RDBMS. By the end of the review, we'd tossed the DW appliance, cut RDBMS load substantially, and started to view the filer appliance as a secondary storage grid.

Another approach I suggested was to think of cloud computing as the "duplicating 50% capacity" argument taken to the next level. Extending our "peel one off" approach, determine which applications could run on a cloud – without much concern about "where" the cloud ran other than the fact that we could move required data to it. Certainly most of the Hadoop jobs fit that description. So did many of the MySQL-based transactions. Those clouds could migrate to different grids, depending on time, costs, or failover emergencies.

One major cost is data loading across grids. However, the organization gains by duplicating its most critical data onto different grids and has a business requirement to do so.

In terms of grids, the organization had been using AWS for some work, but had fallen into a common trap of thinking "Us (data center) versus Them (AWS)", arguing about TCO, vendor lock-in, etc. Sysadmins calculated a ratio of approximately 3x cost at AWS – if you don't take scalability into consideration. In other words, if a server needs to run more than 8 hours per day, it starts looking cheaper to run on your own premises than to run it on EC2.

I agree with the ratio; it's strikingly similar to the 3x markup you find buying a bottle of good wine at any decent restaurant. However, scalability is a crucial matter. Scaling up as a business grows (or down, as it contracts) is the vital lesson of Internet-based system architecture. Also, the capacity to scale up failover resources rapidly in the event of an emergency (data center lands under an F4 tornado) is much more compelling than TCO.

In the diagram shown above, I try to show that AWS is a vital resource, but not the only vendor involved. Far from that, AWS provides services for accessing really valuable metal; they tend to stay away from software, when it comes to cloud computing. Plenty of other players have begun to crowd that market already.

I also try to separate application requirements into batch and online categories. Some data crunching work can be scheduled as batch jobs. Other applications require a 24/7 presence but also need to scale up or down based on demand.

What the diagram doesn't show are points where message queues fit – such as Amazon's SQS or the open source AMPQ implementations. That would require more of a cross-section perspective, and, well, may come later if I get around to a 3D view. Frankly, I'd rather redo the illustration inside Transmutable, because that level of documentation for architectures belongs in real architectural tools such as 3D spaces. But I digress.

To me, there are three aspects of the landscape which have the most interesting emerging products and vendors currently:
  • data management as service-in-the-cloud, particularly the non-relational databases (read: BigTable envy)
  • job control as service-in-the-cloud, particularly Hadoop and Condor (read: MapReduce+GFS envy, but strikingly familiar to mainframe issues)
  • cloud service providing the cloud (where I'd be investing, if I could right now)

A Methodology

To abstract and summarize some of the methodology points we developed during this system architecture review:
  1. Peel off the applications individually, to detangle the appliance mess (use case analysis).
  2. Categorize applications as batch, online, heavy transactional, or reporting – where the former two indicate likely cloud apps.
  3. Think of cloud computing as a way to load balance your application demands across different grids of available resources.
  4. Slide the clouds across different grids depending on costs, scheduling needs, or failover capacity.
  5. Take the hit to replicate critical data across different grids, to have it ready for a cutover within minutes; that's less expensive than buying insurance.
  6. Run your own multiple data centers as internal grids, but have additional grid resources ready for handling elastic demands (which you already have, in quantity).
  7. Reassure your DBAs and sysadmins that their roles are not diminished due to cloud computing, and instead become more interesting – with hopefully a few major headaches resolved.
Meanwhile, keep an eye on the vendors for non-relational data management. It's poised to become suddenly quite interesting.

My diagram is a first pass, drafted the night after our kids refused to sleep because of a thunderstorm... subject to many corrections at least a few revisions. The vendors shown on fit approximately, circa mid-2008. Obviously some offer overlapping services into other areas. Most vendors mentioned are linked in my public bookmark: http://del.icio.us/ceteri/grid

Comments are encouraged. Haven't seen this kind of "vendor landscape" diagram elsewhere yet. If you spot one in the wild elsewheres, please let us know.


hadoop, part 3: graph-based nlp

I've been evaluating graph-based methods for language analytics. At my previous employer, we used statistical parsing and non-parametric methods as a basis for semantic analysis of texts. In other words, use a web crawler to grab some text content from a web page, run it through a natural language parser (after a wee bit o' clean-up), then use the resulting parse trees to extract meaning from the text. Typical stages required for parsing include segmentation, tokenizing, tagging, chunking, and possibly some post-processing after all that. Our use case was to identify noun phrases in the text, which is called NP chunking. Baseline metrics run at about 80% for success using that kind of approach.

More recent work on graph-based methods allows for improved results with considerably faster throughput rates. We did not apply this approach at HeadCase, but my tests subsequently have shown better than an order of magnitude performance gain, in terms of CPU time, when running side-by-side tests on the same processor and data set. This approach, based on link analysis within a graph, eliminates the need for chunking. Instead it iterates on a difference equation, which tends to converge rather quickly (depending on the size of the graph).

source: wikipedia.org

As an example, I took input text from an encyclopedia article about Java Island found online. It has 3 paragraphs, 16 sentences, 331 words, 2053 bytes. Using a popular, open source NLP library, a parser-based approach for identifying noun phrases takes an average of 8341 milliseconds on my MacBook laptop, while the same task using the same libraries with a graph-based approach takes an average of 257 milliseconds. That's a factor 32x speed-up.

To be fair, some researchers have used tagging followed by regular expressions to perform fairly good NP chunking. That works, quickly, but not so well. I needed better results.

The following table illustrates the success of the two approaches, compared with the top 27 key phrases selected by a human editor:

While the graph-based approach has a higher error rate, in terms of recognizing actual noun phrases, it results in comparable F-measure precision rates – with single digital relative error between the two approaches. That is accounted for by the fact that each step in a stochastic parser introduces some error and ambiguity because of variants, but a relatively simple use of dynamic programming can be applied to prune variants and reduce error overall. In contrast, a graph-based analysis tends to identify better high-value targets for key phrases – and more importantly, it tends to rank true positives higher than the noise, which is not something that a parser-based approach can provide. If there is post-processing involved for contextual analysis, the noun phrases produced by a graph-based approach appear superior – and substantially more performant.

Another point to consider, when analyzing large amounts of text, is that graph-based analysis is relatively simple and efficient to implement using a map-reduce framework such as Hadoop.

The trick is… how to represent a stream of morphemes in natural language as a graph.


first day at the new job

As some of my friends have recently noticed (from profiles on LinkedIn, etc.) I've got a new employer.

Of course, many folks have asked me, "How does one get hired by S.P.E.C.T.R.E., anyway?" Well, it's a long, involved process – lemme tell you. I actually got hired years ago, and have been moonlighting there on the sly. Hey, it happens. Silicon Valley was built out of that kind of "resourcefulness".

Anywhoo, I figured it was time to commit, go full-time, and moreover go public with my affiliation. My grandpa always said, if you're gonna swim a river right, you gotta jump in with both feet. Or something.

The thing is, even though our prime objective at S.P.E.C.T.R.E. is world domination, and the prevalence of evil over good, etc., we really have a lot of latitude in our daily work life. I mean, the corporate campus is awesome! Who cares about being at the Googleplex? We get to work inside a secret volcano base. With atomic-powered peoplemovers. Access not only to a ginormous cloud computing facility, but to all the banking IT in the world. Ninjas instead of silly security guards and Wakenhuts. Hell, we've even got Fem Bots in the break room!

But I digress.

It's not all guts and glory. We still have to follow process (which is killer) and meet our milestones. Those can be grueling, especially when that giant dude with the bad metal teeth starts lurking around right before deadline. The benefits plan is amazing though – free travel around the world, staying at some of the most fantastics resorts. Overall, serving evil a pretty good deal.


hadoop, part 2: jyte cred graph

Following from Part 1 in the series, this article takes a look at an initial application in Hadoop, using a simplified PageRank algorithm with graph data from jyte.com to illustrate a gentle introduction to distributed computing. The really cool part is that this runs quickly on my MacBook laptop, the same Java code can run and scale without modification on Amazon AWS, on an IBM mainframe, and (with minor changes) in the clouds at Yahoo and Google. Source code is available as open source on Google Code at:


The social networking system jyte.com provides means for using OpenID as primary identification for participants – at least one is required for login. Participants then post "claims" on which others vote. They can also give "cred" to other participants, citing specific areas of credibility with tags. The contacts API provides interesting means for managing "social whitelisting".

Okay, that was a terribly dry, academic description, but the bottom line is that jyte is fun. I started using it shortly after its public launch in early 2007. You can find a lot more info on their wiki. Some of the jyte data gets published via their API page – essentially the "cred" graph. Others have been working with that data outside of jyte.com, notably Jim Snow's work on reputation metrics.

Looking at the "cred" graph, it follows a form suitable for link analysis such as the PageRank used by Google. In other words, the cred given by one jyte user to another resembles how web pages link to other web pages. Therefore a PageRank metric can be determined, as Wikipedia describes, "based upon quantity of inbound links as well as importance of the page providing the link." Note that this application uses a simplified PageRank, i.e., no damping term in the difference equation.

Taking the jyte cred graph, inverting its OpenID mapping index, then calculating a PageRank algorithm seemed like a reasonably-sized project. There are approximately 1K nodes in the graph (currently) which provides enough data to make the problem and its analysis interesting, but not so much to make it unwieldy. Sounds like a great first app for Hadoop.

The data as downloaded from jyte.com has a caveat: it has been GZIP'ed twice. Once you get it inflated and loaded into the HDFS distributed file system, as "input", the TSV text file line format looks like:

from_openid to_openid csv cred tags

Pass 1 in Hadoop is a job which during its "map" phase takes the downloaded data from a local file, and during the "reduce" phase it aggregates the one-to-many relation of OpenID URIs which give cred to other OpenID URIs. See in the Java source code the method configPass1() in the org.ceteri.JyteRank class, and also the inner classes MapFrom2To and RedFrom2To. The resulting "from2to" TSV text data line format looks like:

from_openid tsv list of to_openids

Pass 2 is a very simple job which initializes rank values for each OpenID URI in the graph – starting simply with a value of 1. We'll iterate to change that, later. See the method configPass2() and also the inner classes MapInit and RedInit. The resulting "prevrank" TSV text data line format looks like:

from_openid rank

Pass 3 performs a join of "from2to" data with "prevrank" data, then expands that into individual terms in the PageRank summation, emitting a partial calculation for each OpenID URI. See the method configPass3() and also the inner classes MapExpand and RedExpand. The resulting "elemrank" TSV text data line format looks like:

to_openid rank_term

Pass 4 aggregates the "elemrank" data to sum the terms for each OpenID URI in the graph, reducing them to a rank metric for each. See the method configPass4() and also the inner classes MapSum and RedSum. The resulting "thisrank" TSV text data line format looks like:

to_openid rank

The trick with PageRank is that it uses a difference equation to represent the equilibrium values in a network, iterating through sources and sinks. So the calculation generally needs to iterate before it converges to a reasonably good approximation of the solution at equilibrium. In other words, after completing Pass 4, the code swaps "thisrank" in place of "prevrank" and then iterates on Pass 3 and Pass 4. Rinse, later, repeat – until the difference between two successive iterations produces a relatively small error.

source: Wikipedia.org

Whereas Google allegedly runs their PageRank algorithm through thousands of iterations, running for several hours each day, I have a hunch that jyte.com does not run many iterations to determine their cred points. Stabbing blindly, I found that using 9 iterations seems to produce a close correlation to the actual values listed as jyte.com cred points. A linear regression can then be used to approximate the actual values:

My analysis here is quite rough, and the code is far from optimal. For example, I'm handling data in-between the map and reduce phases simply as TSV text. There could be much improvement by using subclassed InputFormat and OutputFormat definitions for my own data formats. Even so, the results look pretty good: not bad for one afternoon's work, eh?

I have big reservations about PageRank for web pages, since it assumes that an equilibrium state effectively models attention given to web pages. PageRank provides a mathematical model for a kind of "coin toss", where a user looking at a browser either chooses to navigate to some random URL (with probability 0.15) or they follow one of the links on the current page (with probability 0.85). In practice that implies how some important web pages will experience substantial latency before PageRank begins to "notice" them, since PageRank must wait for (a) other pages to link to the new page, and (b) their web crawlers to visit and analyze those changes. That process can take several days.

A more realistic model of human behavior might take into account that – contrary to Google PageRank assumptions – people do not navigate to some random URL because it is in their bookmarks or is their individual home page. That might have been the general case in 1998, but these days I get my fix of random URLs because of email, IM, RSS, etc., all pushing them out to me. In other words, there is an implied economy of attention (assuming I only look at one browser window at a time) which has drivers that are external to links in the World Wide Web. In lieu of using equilibrium models, the link analysis might consider some URLs as "pumps" and then take a far-from-equilibrium approach.

Certainly, a site such as CNN.com demonstrates that level of network effects, and news events function as attention drivers.

But I digress.

A simplified PageRank analysis based on an equilibrium model appears to work fine enough to approximate jyte.com cred points. It also illustrates how to program within the Hadoop framework. I have ulterior motives for getting jyte cred graph data into Hadoop – which we'll examine in the next part of this series.

hadoop, part 1: approach

Recently I started putting together a collection of code examples based on the Hadoop open source project at Apache. These examples are available as open source on Google Code at:

This project is intended to explore applications of MapReduce programming techniques for handling large data sets within a distributed computing framework. It also presents more Hadoop examples (currently based on version 0.15.3) to help other developers overcome its learning curve. You'll need to download and install Hadoop first.

source: Apache.org

Yahoo! developers announced today about production use of Hadoop supporting their search engine, which includes more than 10,000 cores. Some have estimated that as approaching 10% of the size of Google's server farm running MapReduce. Given that Hadoop is still relatively "early code" as an open source project – in other words waaay prior to its "release 1.0", and that's impressive!

Google provides an excellent set of lectures online introducing these topics, entitled "Cluster Computing and MapReduce". I highly recommend it. While you're studying up, be sure to read the Dynamo paper by Amazon.com CTO
source: Google.com

Those emerging priorities are no so much based on a need to crunch the data faster, but rather to be able to scale to handle very large data reliably and cost-effectively – even if, statistically, the data quality is questionable, the third-party libraries are buggy, and some of the processors fail. Generally speaking, for large scale problems that approach does end up crunching the data faster overall. For example, if you happen to be Last.fm or Rackspace.com then even relatively mundane business needs like log file analysis become very large data problems ... and Hadoop is providing their solution.

One interesting point is that this style of coding and architecture may not quite fit the first-foot-forward techniques that a "pre-Millenial" crop of software developers learned and tend to practice. Bread and butter concepts such as relational databases, O/R mapping for persistence layers, model-view-controller, threading – each have their purposes, but not longer provide the default patterns. Instead, in-between the lines of whitepapers about Dynamo, MapReduce, Hadoop, etc., we catch glimpses of architecture of IBM mainframe programming, specifically hierarchical databases, ETL for data streaming, zSeries hardware with many cores, Shark, MQSeries, etc. The underlying technologies for those date back to the 1960s, and IBM has been refining them ever since, considering that they provide the IT foundations for finance, transportation, manufacturing, etc. Moreover there are programmers who have quite literally logged four decades perfecting highly sophisticated techniques for handling large data sets with these system components.

Working mostly in banking, two start-ups ago, I learned much about this kind of technology – at times, brainstorming alongside senior IT architects at customer sites like Wachovia. I have a hunch that just about anyone who's spent quality time snuggling close with IBM's microcoded hardware sort algorithm would recognize the heavy-lifting performed by the merge sort in the middle of MapReduce ... in a heartbeat.

What is even more interesting is how the core components rolled out so far in Amazon AWS bear strong resemblance to those same mainframe programming techniques I mentioned above. Specifically, XML resembles a hierarchical data model (IMS), Hadoop resembles ETL, EC2 resembles virtualization on zSeries, S3 resembles Shark, SMS resembles MQSeries, etc. I could go on, but it's not necessary. What is necessary is to recognize that these mainframe technologies run the bulk of commercial computing and data storage – once you add up all the processing performed by banks, insurance, accounting, airlines, shipping, tax agencies, pharma, etc. Their corollary technologies represented within Dynamo, MapReduce, and Hadoop are being presented by large firms earning large revenues based on the Internet: AMZN, GOOG, YHOO, etc. In other words, it is refreshing to see that real money is not chasing yet-more-web-apps as their respective business plans. That seems to indicate a good trend in the technology sector.

Okay, enough broad perspectives – let's cut to some examples. Over the next few parts in this series, I'd like to present some sample code and summary analysis for applications. Let's start by taking a look at data from the jyte.com cred API.


a sample chart

Been wanting to try out the recently released Google Chart API. It looks great for implementing things on web pages like Tufte's sparklines.

So here is a sample chart, with my respective ethnicity. Our family is a mixed bag, part which looks like a cross-section of early North America, and part looks like a cross-section of Central Asia, sampled right along the Silk Route. These things happen.

Anywhoo, it took all of about 2 minutes to build a first chart and have it labeled and colored the way that I wanted. All that plus the data is URL-encoded. Nice work, GOOG.

Meanwhile, happy Chinese New Year!