Newsletter Updates for May 2013

Lots of talks, lots of conferences, lots of articles. Here are the latest updates about scheduled events, along with pointers toward some of the best content that I've been studying lately.

Sign up for PXN Newsletter: "Data Workflows" http://liber118.com/pxn/

The O'Reilly book is due in print June 22 – just in time for Hadoop Summit. An online "Rough Cuts" version should be available before then. Many people have asked whether there will be a Kindle version. Yes. Yes, there will be! Many, many thanks to our technical reviewers for all the excellent feedback and suggestions. Also, speaking of EPUBs, the Liber 118 novel has been available in a Kindle version for a nice long while, so please check it out :)

Events are listed below for a week in Austin, then CityCamp and Hadoop Summit here in SV, followed by OSCON in PDX. Should be a fun summer! These talks have ample doses of Big Data frameworks, Data Science use cases, Machine Learning at scale, Open Data apps, etc. For a sample, check out a recent article about the Pattern open source project in the Software Developers Journal. For another sample, check out this recent Airbnb Tech Talk. Functional programming as a way to tackle Big Data projects has resonated well lately in these talks – at Stanford and CMU as well. Also, the chef at Airbnb graciously celebrated the event with a special dinner menu featuring "Paco's Tacos".

I like to include a few "extras" in these newsletters, and a few in particular have been on my mind a lot lately. The first is an excellent talk by John Wilkes at the 2011 Google Americas Faculty Summit, discussing the Borg and Omega projects which Google uses to manage clustered resources in their data centers. Wilkes presents the notion of "surety" as a first-class resource alongside CPU, RAM, I/O, etc., introducing a subtle but powerful change to our accepted notion of Von Neumann architecture. This has several interesting implications for those of us who build large-scale distributed apps. We'll be talking more about that in Austin.

Another recommendation is the excellent paper by Jimmy Lin, "MapReduce is Good Enough?", based on his analysis of machine learning apps at scale during a sabbatical at Twitter. On the one hand, there is ample criticism that Hadoop is not quite suitable for many important kinds of algorithms. On the other hand, as Professor Lin points out, much of our code implementing algorithms has been inherited from 3+ decades of expressing logic based on FORTRAN loops. Seriously. I'm not one to believe that Hadoop won't be replaced (reasonably soon), but we really need to replace algorithm libraries with better, updated code. As the paper shows, that can lead to more effective implementations for streaming anyway.

In terms of putting related insights into practice, check out the talks on SlideShare by David Gleich, especially about "tall and skinny" QR matrix factorization, as well as excellent explanations of the math behind Google, etc., for multi-armed bandit and other machine learning in practice. Chris Severs at LinkedIn has an excellent implementation for Gleich's TSQR.

Last but not least, and putting most of the above into practice: Pete Skomoroch gave an excellent talk, "Skills, Reputation, and Search", about data products at LinkedIn. You could search far and wide and not find a better discussion of how to take a problem of raw, unstructured Big Data from the stage of blank whiteboard to world-class app in a matter of months.

One final note: I've been preparing a workshop about all of the above – typically as a full-day course, very hands-on. We'll try the first in Austin, then take it elsewhere. If you have a city or venue to suggest, please let me know: @pacoid


Upcoming events…

Big Data SF Bay Area presents:
Mon, May 20, 2013  6:30 PM - 9:00 PM (PDT)
Lilly Mac's
187 S Murphy Ave, Sunnyvale, CA 94086 

Hands-on Introduction to Data Science -- a full-day workshop with PXN
Wed, May 29, 2013  8:30 AM - 5:30 PM (CDT)
AT&T Conference Center
1900 University Ave, Austin, TX 78705

City of Palo Alto:
a talk by Paco Nathan and Diego May 
Sat, Jun 1, 2013 11:00 AM - 7:00 PM (PDT)
downtown Palo Alto, CA

Hadoop Summit:
Wed, Jun 26, 2013  5:05 PM - 5:55 PM (PDT)
San Jose Convention Center
150 W San Carlos, San Jose, CA 95110

O'Reilly Media OSCON:
Thu, Jul 25, 2013  5:00 PM (PDT)
Oregon Convention Center
777 NE Martin Luther King, Jr. Blvd., Portland, OR 97232