hadoop in the cloud - patterns for automation

A very good use for Hadoop is to manage large "batch" jobs, such as log file analysis. A very good use for elastic server resources -- such as Amazon EC2, GoGrid, AppNexus, Flexiscale, etc. -- is to handle large resource needs which are periodic but not necessarily 24/7 requirements, such as large batch jobs.

Running a framework like Hadoop on elastic server resources seems natural. Fire up a few dozen servers, crunch the data, bring down the (relatively smaller) results, and you're done!

What also seems natural is the need to have important batch jobs automated. However, the mechanics of launching a Hadoop cluster on remote resources, monitoring for job completion, downloading results, etc., are far from being automated. At least, not from what we've been able to find in our research.

Corporate security policies come into play. For example, after a Hadoop cluster has run for several hours to obtain results, your JobTracker out on EC2 may not be able to initiate a download back to your corporate data center. Policy restrictions may require the transfer to be initiated from the other side.

There may be several steps involved in running a large Hadoop job. Having a VPN running between the corporate data center and the remote resources would be ideal, and would allow for simple file transfers. Great, if you can swing it.

Another problem is time: part of the cost-effectiveness of this approach is to run the elastic resources only as long as you need them -- in other words, while the MR jobs are running. That may make the security policies and server automation difficult to manage.

One approach would be to use message queues on either side: in the data center, or in the cloud. The scripts which launch Hadoop could then synchronize with processes on your premises via queues. A queue poller on the corporate data center side could initiate data transfers, for example.
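To make the queue idea concrete, here is a minimal sketch of the two halves. It is not a working deployment: Python's in-process queue.Queue stands in for whatever hosted message queue both sides can reach, and the names announce_job_done, poll_once, and fetch, along with the message fields, are all hypothetical. The point is only that the cloud side posts a message while the data center side decides when to pull.

```python
import json
import queue


def announce_job_done(q, job_id, result_path):
    """Cloud side: the launch script posts a message once the MR job finishes."""
    message = {"event": "job_done", "job_id": job_id, "result_path": result_path}
    q.put(json.dumps(message))


def poll_once(q, fetch, timeout=1.0):
    """Data center side: wait for one message and pull results if a job finished.

    `fetch` is a hypothetical callable that performs the transfer, so the
    download is always initiated from inside the corporate network.
    Returns the job id that was handled, or None if nothing was fetched.
    """
    try:
        raw = q.get(timeout=timeout)
    except queue.Empty:
        return None  # nothing yet; the caller can poll again later
    message = json.loads(raw)
    if message.get("event") != "job_done":
        return None  # ignore messages this poller does not understand
    fetch(message["result_path"])
    return message["job_id"]


if __name__ == "__main__":
    # queue.Queue is an assumption standing in for a real hosted queue.
    q = queue.Queue()
    announce_job_done(q, "log-analysis-001", "/results/part-00000")
    fetched = []
    print(poll_once(q, fetched.append), fetched)
```

In practice the poller would run in a loop or from cron on the data center side, and the same message could also tell a second script that it is safe to shut down the cluster once the transfer completes.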

We would be very interested to hear how others are approaching this issue.


shared memory: humdog

News came today that a long-time friend had died over the weekend: Carmen Hermosillo, aka "Humdog". I am very sad to hear. She is dearly missed.

Humdog generally kept Skype running on her computer all the time, which now shows "last seen on 2008-08-12 11:18 GMT-7". Someone must have turned off her computer just then.

Thinking back, it must have been about 1991 when I first "met" Carmen online, on The Well when I first joined as part of the editorial staff at bOING-bOING. Writers enjoy heady discussions, and Carmen had a knack for turning discussions on their head. She also had the most incredible ways of steering a group online.

More recently, I hired Carmen as a researcher for my team at HeadCase. Perfect for the job, she thrived in that role. There was one point, when the company was first getting started in 2006... Humdog tried to get each of the managers to read a particular article. Everyone replied, "Wow, great, but I'm really too busy now on a certain business planning document, blah blah, will catch up later, etc." As it turned out, that article broke news about virtual worlds which turned our first draft business plan inside out, but in a good way. Some of us finally listened to Carmen, in time to make necessary changes.

Oddly enough, Carmen and I never met in person until the final HeadCase team meetings in July 2008, held in San Luis Obispo. My hometown provided a beautiful, relaxed setting -- our first and last meeting in the physical world.

During most of the 1990s, I was privileged to lead a rather scruffy gang of media explorers called FringeWare. Humdog had a definitive role on our magazine's masthead and an exuberant voice in our online forums.

Among those who formed the core of FringeWare, Humdog becomes the first of us to go to our ancestors.