Been quite an interesting past month or so: DC, Austin, SF, Ann Arbor, Atlanta, Seattle… with hopefully much learned from those travels, plus many excellent events and introductions.
Meanwhile, I learned much from this gem, Therbligs for data science: A nuts and bolts framework for accelerating data work, by Abe Gong. Looking forward to seeing more about Therbligs from Abe. Definitely tune in to Welcome to Intelligence Matters, a new series by O’Reilly exploring current issues in AI, with Beau Cronin as lead correspondent. Another recommended gem is Genomics Crash Course for Data Engineers by Allen Day, which sits at the intersection of Genomics and Big Data, an area where I have seen an uptick in interest recently.
Just Enough Math
Allen and I have been working to complete our new O’Reilly book, Just Enough Math. The video is in post-production now, and the book is half through second drafts – we are closing in! Some of that material will be previewed in the upcoming workshop Machine Learning for Managers:
O’Reilly will host a free one-hour webcast, Computational Thinking, Just Enough Math on Wed, Jun 4, 10:00am–11:00am (Pacific). Please join me there. The webcast will help publicize a tutorial based on Just Enough Math at OSCON in Portland on Sun, 20 Jul, 9:00am-noon. As a special offer, use the code PACOID to get a 20% discount on OSCON registration. Our tutorial will preview a very new thing at O’Reilly: converting book+video content into interactive tutorials using Docker + IPython Notebook + Vagrant + Git for a cloud-based next-generation content platform.
Speaking of Docker, one of the more interesting start-ups that I have run across recently is Resin, using Docker and Git to containerize+push apps on IoT devices running embedded Linux. Brilliant work.
Photo: UCB Initiation Ritual: cousins circa 1968, near Atascadero
In other news, I am thrilled to announce a partnership with Databricks, where I’ve been working to help develop an instructional program that introduces Apache Spark. As you can see in the photo above, the ceremonial ritual for teaming up with UC Berkeley is a bit arduous, but well worth it. Yes, you heard correctly … a Stanford alum saying “Go Bears!”
Our first course in the series is Databricks Hands-on Intro to Apache Spark, an introduction for developers working in Python, Java, and Scala. We have several of these workshops scheduled:
Spark is approaching the 1.0 release at Apache, with new support for SQL. Overall, one of the best presentations that I’ve seen recently about it was Spark at Twitter by Sriram Krishnan, Engineering Manager for Data Platform at Twitter.
The agenda was posted recently for Spark Summit 2014, in SF on 30 Jun - 1 Jul. As another special offer, use the code Paco2014 to get a 15% discount on Spark Summit registration. Highly recommended, and I hope to see you there.
Speaking of BDAS and the Berkeley Stack… there have been lots of developments in the Apache Mesos world. One of the best talks ever about Mesos was Improving Resource Efficiency with Apache Mesos by Christina Delimitrou, a case study about Quasar usage at Twitter. Also check out Mesos Elastically Scalable Operations, Simplified by Niklas Nielsen and Adam Bordelon, presented recently at ApacheCon 2014.
The other big news is that #MesosCon, the first Mesos conference, will be held in Chicago on Aug 21. Definitely see you there! Companies interested in sponsoring the conference – please inquire.
I’ve created a new workshop called Cluster Compute App Integrations about building end-to-end apps for Big Data. The workshop leverages Mesos based on the https://elastic.mesosphere.io/ service in the cloud, along with Spark, KNIME, etc. Hint: this involves teams competing, and it is turning out to be quite a popular course. We have upcoming dates lined up:
Agriculture + Data
Did you know that agriculture provides a livelihood for 40% of the world’s population? Or that agriculture consumes 70% of the world’s freshwater in aggregate? That figure is expected to reach 89% by 2050. Or have you heard that Havana grows 75% of its own food based on urban agriculture?
Last month I wrote an O’Reilly Strata article, Ag+Data, about those topics and more. The article introduces a whitepaper, Agriculture + Data: Outlook 2Q14, that we recently produced at The Data Guild to explore these issues in greater depth. Many thanks to Bill Worzel, Brad Martin, and others who helped on that!
Recently I gave a keynote talk at the Genetic Programming in Theory and Practice conference, which is hosted each year at U Michigan by The Center for the Study of Complex Systems. They are the experts in GP; I was merely there to add a few perspectives about machine learning and Big Data. What a wonderful conference. Got to speak at length with Lee Spector at UMass Amherst and Hampshire College. Lee and his grad students have been working with a Clojure-based language called Push, in which evolutionary programs are expressed.
What kinds of optimization problems respond to evolutionary pressure? Definitely not the kinds that one typically finds solved by machine learning. That is where GP approaches come in. In general, there was a lot of discussion about symbolic regression as a general rubric, plus some exceptionally interesting work on the use of Pareto optimal fronts for model archives (which I’ll be adding to my ML bag o’ tricks). In particular, great work from Theresa Kotanchek and Mark Kotanchek at Evolved Analytics. Their software effectively leverages Pareto optimality to select exemplars when models diverge, which I find to be a fascinating alternative to what other disciplines might attempt to resolve through sampling. Brilliant work.
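For readers new to the idea, here is a minimal sketch of a Pareto front over a model archive. It is only an illustration of the general concept, not the Evolved Analytics implementation: the `Model` record, the two objectives (complexity and error, both lower-is-better), and the sample values are all assumptions made up for the example.

```python
from typing import List, NamedTuple


class Model(NamedTuple):
    name: str          # hypothetical label for a candidate model
    complexity: float  # assumed objective, e.g. expression size (lower is better)
    error: float       # assumed objective, e.g. validation error (lower is better)


def dominates(a: Model, b: Model) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one."""
    return (a.complexity <= b.complexity and a.error <= b.error
            and (a.complexity < b.complexity or a.error < b.error))


def pareto_front(models: List[Model]) -> List[Model]:
    """Keep only the models that no other model in the archive dominates."""
    return [m for m in models
            if not any(dominates(other, m) for other in models)]


# Made-up archive of candidate models, purely for illustration.
archive = [
    Model("linear", 3, 0.40),
    Model("quadratic", 5, 0.25),
    Model("bloated", 9, 0.30),    # dominated by "quadratic": simpler and more accurate
    Model("deep_tree", 12, 0.10),
]

front = pareto_front(archive)
print([m.name for m in front])
```

Each model left on the front is a different trade-off between simplicity and accuracy, so none of them can be discarded without giving something up. The pairwise check is quadratic in the archive size, which is fine for a sketch but would want a smarter algorithm for large archives.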
Also got to talk with Bill Tozier, author of Answer Factories: The Engineering of Useful Surprises, and viewed some astounding work in HeuristicLab, an interactive framework from HEAL. Think: evolutionary IDE. Another excellent tip was to check out Modeling global temperature changes with genetic programming by Karolina Stanislawska, Krzysztof Krawiec, Zbigniew Kundzewicz.
✽ ✽ ✽
I haven’t yet had a chance to mention Atlanta, but I really appreciated meeting many wonderful folks there. You’ll be hearing more about upcoming Atlanta plans soon! Also, there are workshops and meetup talks planned now for: NYC, SV/SF, Austin, Chicago. Next up after my current week in Seattle comes Hadoop Summit, on 3–5 Jun in San Jose. Hope to see you there!