A couple of companies ago, one of my mentors -- Jack Olson, author of Data Quality -- taught us team leaders to follow a formula for sizing software development groups. Of course this is only a guideline, but it makes sense:
9:3:1 for dev/test/doc
In other words, a 3:1 ratio of developers to testers, and then a 9:1 ratio of developers to technical writers. Also factor in that a group of that size (13) needs a manager/architect and some project management.
On the data quality side, Jack sent me out to talk with VPs at major firms -- people responsible for handling terabytes of customer data -- and showed me just how much the world does *not* rely on relational approaches for most data management.
Jack also put me on a plane to Russia, and put me in contact with a couple of firms -- one in St. Petersburg and one in Moscow. I've gained close friends in Russia as a result, and profound respect for the value of including Russian programmers on my teams.
A few years and a couple of companies later, I'm still building and leading software development teams, I'm still working with developers from Russia, I'm still staring at terabytes (moving up to petabytes), and I'm still learning about data quality issues.
One thing has changed, however. During the years in between, Google demonstrated the value of building fault-tolerant, parallel processing frameworks atop commodity hardware -- e.g., MapReduce. Moreover, Hadoop has provided MapReduce capabilities as open source. Meanwhile, Amazon and now some other "cloud" vendors have provided cost-effective means for running thousands of cores and managing petabytes -- without having to sell one's soul to either IBM or Oracle. But I digress.
I started working with Amazon EC2 in August, 2006 -- fairly early in the scheme of things. A friend of mine had taken a job working with Amazon to document something new and strange, back in 2004. He'd asked me terribly odd questions, ostensibly seeking advice about potential use cases. When it came time to sign up, I jumped at the chance.
Over the past 2+ years I've learned a few tricks and have an update for Jack's tried-and-true ratios. Perhaps these ideas are old hat -- or merely scratching the surface or even out of whack -- at places which have been working with very large data sets and MapReduce techniques for years now. If so, my apologies in advance for thinking aloud. This just seems to capture a few truisms about engineering Big Data.
First off, let's talk about requirements. One thing you'll find with Big Data is that statistics govern the constraints of just about any given situation. For example, I've read articles by Googlers who describe using MapReduce to find simple stats to describe a *large* sample, and then build their software to satisfy those constraints. That's an example of statistics providing requirements. Especially if your sample is based on, say, 99% market share :)
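As a rough illustration of "simple stats" computed MapReduce-style, here is a minimal, single-process sketch. It is not how Google or Hadoop implement anything; the function names and the tiny in-memory `chunks` are made up for the example. The idea is that each mapper reduces its chunk to a small partial (count, sum, sum of squares), and partials merge by plain addition, so the reduce step can run in any order.

```python
from functools import reduce
import math


def map_partial(chunk):
    """Map step: summarize one chunk as (count, sum, sum of squares)."""
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))


def merge(a, b):
    """Reduce step: partials combine by element-wise addition."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def summarize(chunks):
    """Return (count, mean, population standard deviation) over all chunks."""
    n, s, ss = reduce(merge, map(map_partial, chunks), (0, 0.0, 0.0))
    mean = s / n
    variance = ss / n - mean * mean  # population variance
    return n, mean, math.sqrt(max(variance, 0.0))


# Hypothetical data: three "splits" of one data set.
chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
count, mean, stddev = summarize(chunks)
print(count, mean, round(stddev, 4))
```

The sum-of-squares formula is the simplest thing that merges cleanly, but it can lose precision on very large or badly scaled data, so treat it as a starting point only.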
In my department's practice, we have literally one row of cubes for our Statisticians, and across the department's center table there is another row of cubes for our Developers. We build recommendation systems and other kinds of predictive models at scale; when it comes to defining or refining algorithms that handle terabytes, we find that analytics make programming at scale possible. We follow a practice of having statisticians pull samples, visualize the data, run analysis on samples, develop models, find model parameters, etc., and then test their models on larger samples. Once we have selected good models and the parameters for them, those are handed over to the developers -- and to the business stakeholders and decision makers. Those models + parameters + plots become our requirements, and part of the basis for our acceptance tests.
Given the sheer lack of cost-effective tools for running statistics packages at scale (SAS is not an option, R can't handle it yet), we must work on analytics through samples. That may be changing soon. But I digress.
Developers working at scale are a different commodity. My current employer has a habit of hiring mostly LAMP developers, people who've demonstrated proficiency with PHP and SQL. That's great for web apps, but hand the same people terabytes and they'll become lost. Or make a disastrous mess. I give a short quiz during every developer interview to ensure that our candidates understand how a DHT works, how to recognize a good situation for using MapReduce, and whether they've got any sense for working with open source projects. Are they even in the ballpark? Most candidates fail.
I certainly don't look for one-size-fits-all, either: some people are great working on system programming, which is crucial for Big Data, and others are more suited for working on algorithms. The latter is problematic, because algorithms at scale do not tend to work according to how logic might dictate. Precious little at scale works according to how logic might dictate. In fact, if I get a sense that a developer or candidate is too "left brained" and reliant on "logic", then I know already they'll be among the first to fail horribly with MapReduce. One cannot *think* one's way through failure cases within a thousand cores running on terabytes. Data quality issues alone should be sufficient, but there are deeper forces involved. Anywho, one must *measure* one's way through troubleshooting at scale -- which is another reason why statistics and statisticians need to be close at hand.
The trouble is that most commodity hardware is built to perform well within a given set of tolerances. If you restart your laptop a hundred times, it will probably work fine; however, if you launch 100 virtual servers, some percentage will fail. Even among the ones that appear to run fine, there will be disk failures, network failures, memory glitches, etc., some of which do not become apparent until they've occurred many times over. So when you pump terabytes through a cluster of servers for hours and hours, some parts of your data are going to get mangled.
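To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes each server fails independently with the same small probability; the 1% figure is a made-up placeholder, not a measurement from EC2 or anywhere else, and the function names are mine.

```python
def p_any_failure(n_servers, p_fail):
    """Probability that at least one of n independent servers fails."""
    return 1.0 - (1.0 - p_fail) ** n_servers


def expected_failures(n_servers, p_fail):
    """Expected number of failed servers under the same assumption."""
    return n_servers * p_fail


# Hypothetical per-server failure probability for one job run.
P_FAIL = 0.01

for n in (1, 100, 1000):
    print(n, round(p_any_failure(n, P_FAIL), 4), expected_failures(n, P_FAIL))
```

Even with a per-server rate that sounds negligible, "at least one failure" becomes the likely outcome somewhere around a hundred servers, which is why failure handling has to be a design assumption and not an exception path.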
Which brings up another problem: unit tests run at scale don't provide a whole lot of meaning. Sure, it's great to have unit tests to validate that your latest modifications and check-ins pass tests with small samples of data. Great. Even so, minor changes in code, even code that passes its unit tests, can cascade into massive problems at scale. Testing at scale is problematic, because sometimes you simply cannot tell what went wrong unless you go chasing down several hundred different log files, each of which is many GB. Even if results get produced, with Big Data you may not know much about the performance of results until they've had time to go through A/B testing. Depending on your operation, that may take days to evaluate correctly.
One trick is to have statisticians acting in what would otherwise be a "QA" role -- poking and prodding the results of large MapReduce jobs.
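One simple form that poking and prodding can take is comparing a summary of the job's output against a baseline the statisticians derived from a trusted sample. The sketch below is only an illustration of the idea, not our actual QA process; the names, the data, and the 3-standard-error threshold are all assumptions.

```python
import math
import statistics


def baseline(sample):
    """Mean and sample standard deviation from a trusted sample."""
    return statistics.mean(sample), statistics.stdev(sample)


def looks_off(job_values, base_mean, base_std, z_limit=3.0):
    """Flag a job whose output mean sits more than z_limit standard
    errors from the baseline mean. Assumes base_std > 0."""
    std_err = base_std / math.sqrt(len(job_values))
    z = (statistics.mean(job_values) - base_mean) / std_err
    return abs(z) > z_limit, z


# Hypothetical numbers standing in for a metric sampled from job output.
trusted_sample = [9.0, 10.0, 11.0, 10.0, 9.0, 11.0]
base_mean, base_std = baseline(trusted_sample)

good_output = [10.0, 9.5, 10.5, 10.0]
bad_output = [14.0, 15.0, 13.0, 14.0]

print(looks_off(good_output, base_mean, base_std))
print(looks_off(bad_output, base_mean, base_std))
```

A check like this won't say what went wrong, but it does say which jobs deserve a closer look before anyone opens several hundred log files.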
One other trick is to find the people talking up the use of scrum, XP, etc., and relocate those people to another continent. I use iterative methodology -- from a jumble of different sources, recombined for every new organization. However, when someone wants to sell me an "agile" product where unit tests adjudicate the project management, I show that person the door. Don't let it hit you on the way out. In terms of working with cloud-based architecture and Big Data, something new needs to be invented for methodology, something beyond "agile". I'll try to document what we've found to work, but I don't have answers beyond that.
Some key points:
Each statistician can prompt enough work for about three developers; it's a two-way street, because developers are generally helping analysts resolve system issues, pull or clean up data that's problematic, etc.
Statisticians work better in pairs or groups, not so well individually. In fact, I round up all the quantitative analysts in the company once each week for a "Secret Statisticians Meeting". We run the meeting like a grad seminar. Management and developers are not allowed, unless they have at least a degree in mathematics -- and even then they must prove their credentials by presenting and submitting to a review.
In any case, I try to assign a minimum of two statisticians to any one project, to get overlap of backgrounds and shared insights.
We use R; it's great for plots and visualizing data. I find it crucial to hire statisticians who have the chops to produce good graphics and dashboards. Stakeholders don't grasp an equation for how to derive variance, but they do grasp seeing their business realities depicted in a control chart or scatterplot. That communication is invaluable, on all sides. Hire statisticians who know how to leverage R to produce great graphics.
Back to the developers, some are vital as systems programmers -- for scripting, troubleshooting, measuring, tweaking, etc. We use Python mostly for that. You won't be able to manage cloud resources without that talent. Another issue is that your traditional roles for system administrators, DBAs, etc., are not going to help much when it comes to troubleshooting an algorithm running in parallel on a thousand cores. They will tend to make remarks like "Why don't you just reboot it?" As if. So you're going to have to rely on your own developers to run operations. Those are your system programmers.
Other developers work at the application layer. Probably not the same people, but you might get lucky. Don't count on it. Algorithm work requires that a programmer can speak statistics, and there aren't many people who cross both worlds *and* write production quality code for Hadoop.
Another problem is documentation. Finding a tech writer to document a team working on Big Data would entail finding an empath. Besides, the statisticians generate a lot of documents, and the developers (and their automated tools) are great at generating tons of text too. Precious little of it dovetails, however. Of course we use a wiki to capture our documentation -- it has versioning, collaborative features, etc. -- and we also use internal blogs for "staging" our documentation, writing drafts, logging notes, etc. Even with all those nifty collab tools, it still takes an editor to keep a bunch of different authors and their texts on track -- *not* a tech writer, but an actual editor. Find one. They tend to be inexpensive compared with developers, and hence more valuable than anybody but the recruiters. But I digress.
Back to those team ratios from Jack Olson... I have a hunch that the following works better in Big Data:
2:3:3:1 for stats/sys/app/edit
A team of 9 needs a team leader/manager -- which, as I understand, might be called a "TL/M" at GOOG, a "2PT" at AMZN, etc. I favor having that person be hands-on, and deeply involved with integration work. Integration gives one a bird's-eye view of which individual contributors are ahead or behind. Granted, that'd be heresy at most large firms which consider themselves adept at software engineering. Even so, I find it to be a trend among successful small firms.
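For anyone who wants to check the arithmetic, here is a small sketch that totals both ratios and confirms the new one is consistent with the "one statistician per three developers" rule of thumb. The dictionary keys simply mirror the labels used above, and the `headcount` helper is hypothetical.

```python
def headcount(ratio, groups=1):
    """Scale a role ratio up by a whole number of groups."""
    return {role: n * groups for role, n in ratio.items()}


classic = {"dev": 9, "test": 3, "doc": 1}                # Jack's 9:3:1
big_data = {"stats": 2, "sys": 3, "app": 3, "edit": 1}   # 2:3:3:1

classic_total = sum(classic.values())    # 13, before the manager/architect
big_data_total = sum(big_data.values())  # 9, before the team leader/manager

# Developers (systems + application) per statistician.
devs_per_statistician = (big_data["sys"] + big_data["app"]) / big_data["stats"]

print(classic_total, big_data_total, devs_per_statistician)
print(headcount(big_data, groups=2))
```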
One other point... When working with Big Data, and especially with cloud computing, the biggest risk and the most complex part of the architecture is data loading. In most enterprise operations, data loading is taken quite seriously. If you're running dozens or hundreds (or thousands?) of servers in a Hadoop cluster, then take a tip from the large shops and get serious about how you manage your data. Have a data architect -- quite possibly, the aforementioned team leader/manager.