Newsletter Updates for July 2013
Lots of talks, lots of conferences, lots of writing. Here are my latest updates about scheduled events, along with pointers to some of the best content that I've been studying lately.
Got to speak at Hadoop Summit last month, about the Pattern project. Great audience, lots of discussion about deploying predictive models at scale on Apache Hadoop clusters. BTW, one of the most compelling talks at Hadoop Summit this year was by Kevin Coogan, founder of AmalgaMood in DC. Kevin discussed their technology that leverages social signals and Open Data in predictive analytics when financial markets are not responding to what might otherwise be called analysts' consensus. Chaos, in other words. The Q&A for that talk in particular was enlightening and compelling: what a great way to scurry VCs up to the audience microphone.
We had several other excellent events recently: Seattle, Santa Clara, Los Angeles -- many thanks to hosts Surf Incubator, White Pages, and Factual. Plus, we had some private brown bags at LinkedIn, MapR, and other firms. My takeaway: great to meet many amazing people, a wealth of talent, and overall so much dedication to learning about this field. To that point, there's lots of opportunity in Data Science roles, and along with that a big need for people who are adept at working across disciplines. People need enough programming background to leverage distributed systems, which enable the compelling use cases. People also need enough quantitative background to leverage the math required for high ROI apps at scale. I find that many people attending the workshops have expertise in one field and want to augment it with the other -- which is ideal for learning from each other.
Upcoming events: Portland, Austin, Chicago. Nike is helping to sponsor the Portland meetup, at Widmers -- so we won't have far to go for beers afterwards. And during. See the calendar at http://liber118.com/pxn/ for more details. BTW, if you want a 20% discount for OSCON, please use the discount code OS13FOS. If you have a city or venue to suggest for upcoming workshops and talks, please let me know @pacoid.
Toward a general thesis for Cluster Computing...
In other news, the Mesos open source project has graduated into a top-level Apache project. I recently took a position as Chief Scientist at a new company related to that project, called Mesosphere, in San Francisco.
There's a general thesis emerging, namely that we run large-scale apps based on cluster computing because the data has become too big to fit on one computer. Or, for that matter, because the apps have become too complex to be handled by one computer, one person, one model. We require multi-disciplinary teams, leveraging cluster computing. Three areas of technology innovation get applied: Big Data, Data Science, and Cloud Computing. At a high level, applications generally leverage an abstraction layer, such as Cascading. At a low level, the more advanced organizations are leveraging cluster schedulers such as Mesos -- and, arguably, YARN is coming along too. For an excellent overview, see the Wired article by Cade Metz, Return of the Borg: How Twitter Rebuilt Google's Secret Weapon.
I foresee a general trend of smarter clusters, leading into higher ROI on Big Data apps. On the one hand, multi-tenancy in clusters helps balance the utilization curves and cut costs. On the other hand, reducing the "wire tax" of moving data products from batch clusters to web app clusters will help enable new areas of algorithm development, by reducing critical latencies. Companies such as Twitter and Airbnb have both built their tech stacks using these components, Cascading and Mesos. I spot a trend.
My intent is to show sample apps that leverage both layers. Also, one advantage of Mesos is that it manages resources for many different kinds of frameworks and apps: Apache Hadoop, Spark, MPI, Memcached, Nginx, Redis, Ruby on Rails, Python, etc. That's perfect for Big Data use cases that blend multiple frameworks. Stay tuned.
Drilling down into the math...
My lectures tend to emphasize a division between the rigor and formalisms of statistical theory versus the relatively ad-hoc praxis of what we categorize as machine learning. At top schools, grad students in one of those fields receive high salaries and VC funding straight out of school, while grad students in the other... not so much. That's a shame, because mission-critical apps at scale rely on both disciplines. Machine learning allows you to make billion-dollar mistakes, while statistics helps you avoid billion-dollar mistakes. Take a look at any good search engine team and you'll see how both disciplines become necessary in practice. Together.
Another point is that machine learning approaches are a subcategory within optimization theory. As I'm researching industry use cases for Data Science, Big Data, Cloud Computing, etc., it becomes clear that more emphasis on optimization is crucial for long-term industry evolution. I've been hanging around seminars at Stanford's Systems Optimization Lab, and the math innovations presented there, along with their applications, have huge implications for industry. As a case in point, John Deere probably won't be building a Facebook competitor any time soon; however, they must tackle hard problems at enormous scale in optimization. Mathematicians are responding to that demand. Given that 40% of the world's population works directly in agriculture, plus the urgency of global climate change, etc., I tend to find Deere's domain more compelling than yet-another social network, ad network, social game, etc.
Enough soap box. Instead I'd like to recommend an excellent resource in this area: Rob Zinkov's blog series Convex Optimized at http://zinkov.com/
My current homework, thanks to Rob, focuses on the Alternating Direction Method of Multipliers (ADMM). This builds on the previous theme of exploiting sparsity and matrix factorization atop Hadoop. It also addresses the need for more emphasis on general approaches in optimization theory, and less on the nuances of machine learning algorithms. Much study will be required before my sample apps begin to emerge. Meanwhile, here's to broadcast and gather as a more interesting pair of verbs than map and reduce.
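To make the broadcast and gather idea a bit more concrete, here's a toy sketch of consensus ADMM in its scaled form, simulated in one Python process. It is not one of the sample apps mentioned above and it doesn't touch Hadoop or MPI. The data shards, the function names, the penalty rho, and the iteration count are all made up for illustration. Each "worker" fits a single slope x to its own shard by least squares; every round the coordinator broadcasts the shared estimate z, then gathers the local answers and averages them.

```python
# Toy consensus ADMM (scaled form), simulated in a single process.
# Problem: minimize the sum over all points of 0.5 * (a * x - b)**2 for a
# scalar x, with the points split across shards.
# All names and data below are hypothetical, chosen only for illustration.

SHARDS = [
    [(1.0, 2.0), (2.0, 4.1)],
    [(3.0, 5.9), (4.0, 8.2)],
    [(5.0, 9.8)],
]


def local_update(shard, z, u, rho):
    """Worker step: minimize the local loss plus (rho / 2) * (x - z + u)**2."""
    sum_aa = sum(a * a for a, _ in shard)
    sum_ab = sum(a * b for a, b in shard)
    # Setting the derivative to zero gives a closed-form scalar solution.
    return (sum_ab + rho * (z - u)) / (sum_aa + rho)


def consensus_admm(shards, rho=10.0, iters=200):
    """Return the consensus estimate z after a fixed number of rounds."""
    n = len(shards)
    z = 0.0          # shared estimate held by the coordinator
    u = [0.0] * n    # one scaled dual variable per worker
    for _ in range(iters):
        # Broadcast z: each worker solves its own small problem.
        x = [local_update(shard, z, u_i, rho) for shard, u_i in zip(shards, u)]
        # Gather: average the local answers plus duals into a new z.
        z = sum(x_i + u_i for x_i, u_i in zip(x, u)) / n
        # Each worker's dual accumulates its disagreement with z.
        u = [u_i + x_i - z for x_i, u_i in zip(x, u)]
    return z


def closed_form(shards):
    """Least-squares slope computed on all points at once, for comparison."""
    points = [p for shard in shards for p in shard]
    return sum(a * b for a, b in points) / sum(a * a for a, _ in points)
```

Calling consensus_admm(SHARDS) should land very close to closed_form(SHARDS). The point of the exercise is that the workers only ever exchange a single number per round, never the raw data.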
Some well-known Big Data vendors are struggling to market Apache Hadoop as an OS. It's not really. Hadoop may be a distributed file system plus some distributed computing, but calling it an operating system would be a great way to fail a CS midterm. More to the point, MapReduce as an abstraction is multiple layers removed from the needs of actual workloads, plus it's already 11+ years old. A better focus for these vendors might be to engage Professor Boyd, et al., to build distributed computing frameworks for commodity hardware based on ADMM principles, which support a wide range of commercial ML problems directly.
Doubtful that will happen. For example, that might require (gasp) supporting MPI features! Some engineer would need to prioritize studying math (gasp) in lieu of lobbying for 150 new commits on an Apache project! People could eventually recognize commercially interesting problems in Enterprise IT which are (gasp) not readily expressed as SQL! No, that won't happen. The Global 1000 is far too busy tooling up on Hadoop. Instead, I'd bet money on Anaconda, Spark, Titan, GraphLab, etc., leap-frogging the Hadoop-centric segment of the industry -- once the world beyond Silicon Valley wakes up to the realities of math emerging circa 2010. YMMV.
BTW, I must apologize, but it's become impossible to keep pace with email. While traveling, I find that Twitter works best for what must get said quickly. Semipublicly. I'll check Twitter often -- but email infrequently.
Many thanks,
Paco