2014-03-02

Newsletter Updates for March 2014

Strata SC 2014 was a busy time indeed. I’m grateful to have had the opportunity to introduce speakers for several excellent presentations – in addition to presenting about Apache Mesos and meeting with many interesting people who were attending the conf. The keynotes this time were diverse, including brilliant and inspiring words from Geoffrey Moore, Rodney Mullen, and the capstone talk The Future Isn’t What It Used To Be by James Burke – what people were calling “The missing episode of Connections.”

Among the sessions at Strata, my favorite talk by far was Spreadsheets: The dark matter of Big Data by Felienne Hermans, professor at Delft and founder of Infotron. If we are going to address the matter of data in large quantities and especially key learnings about data in business, spreadsheets are the place to start and Felienne is the brilliant leader of our exploration! Functional programming, graph queries, metadata modeling over time, etc., go check out her work.

Also at the top of my picks from Strata: Probabilistic Programming: Why, What, How, When by Beau Cronin, a recovering computational neuroscientist and big data skeptic debonair. To paraphrase: business data is heterogeneous and structured… the data for every domain is heterogeneous. I’ve seen the best minds of my generation destroyed by madness, dragging themselves through quagmires of large LCD screens filled with IntelliJ debuggers and database command lines, yearning to fit real-world data into their preferred “deterministic” tools. March on, do not tarry, to go study this work. You may glimpse why Salesforce acquired a small AI start-up out of MIT to become its skunkworks.

My third top pick from Strata is clearly Algebra for Analytics by the ineffable talent driving Twitter’s insights at scale, Oscar Boykin. Money quote from this gem? Because #Monoids: “lack of associativity increases latency exponentially”. That talk alone was worth the price of admission, if you get the implications. If not, well, there are still plenty of job reqs open for J2EE, somewhere.

While we are discussing the subject, I urge you to make time to view Add ALL the Things: Abstract Algebra Meets Analytics by Avi Bryant, co-author of Scalding. If I may attempt to place this into context, your team could pour all their resources into developing precious source code, schema, unit tests, etc., but in practice your data probably will not fit what they have anticipated – especially when you encounter the Balrog of low-latency use cases. Partitioning, missing values, max range overflows, etc., did you really expect us (and the BoD) to believe that your developers can anticipate the complexities of data at scale? Fine, but why bother with all this abstract algebra mishmash, you ask? To wit:
  • grouping does not matter (associative)
  • ordering does not matter (commutative)
  • zeros get ignored (identity)
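To make those three properties concrete, here is a minimal Python sketch of my own (not taken from either talk, and not Algebird or Scalding code). It assumes partial word counts held in `collections.Counter`; the `combine` function, the `EMPTY` value, and the sample partitions are all hypothetical names and data chosen for illustration.

```python
from collections import Counter
from functools import reduce


def combine(a: Counter, b: Counter) -> Counter:
    """Merge two partial counts; this plays the role of the monoid 'plus'."""
    return a + b


# The identity element: merging it with anything changes nothing.
EMPTY = Counter()

# Hypothetical partial results, e.g. one per partition of a larger data set.
partitions = [
    Counter({"mesos": 2, "titan": 1}),
    Counter({"titan": 3}),
    Counter(),  # an empty partition (the "zero")
    Counter({"mesos": 1, "cascading": 4}),
]

# Fold left to right, as a single machine would.
left_to_right = reduce(combine, partitions, EMPTY)

# Associative: group the merges differently, as a tree of workers would.
regrouped = combine(
    combine(partitions[0], partitions[1]),
    combine(partitions[2], partitions[3]),
)

# Commutative: process the partitions in a different order.
reordered = reduce(combine, reversed(partitions), EMPTY)

# All three strategies agree, and the empty partition had no effect.
assert left_to_right == regrouped == reordered
```

Because the merge is associative and commutative, a scheduler is free to combine partial results in whatever grouping and order happens to be available, which is the practical payoff for work at scale.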
One more choice quote from Oscar: “Hash, don’t sample.” Seriously, after reviewing those presentations listed above, if a truly huge bright light bulb does not suddenly click ON about data+business and why in 2014 we really must stop reiterating the COBOL experience in Java (let alone making intellectually-encumbered hagiographies about “Hadoop as an Operating System”) … then, well, it’s time to step away from the screen+keyboard.

"Dune Builder" from Beyond The Human Eye

On to other illuminations

Had a wonderful time in Seattle in late January. Many thanks to all who attended the workshops, meetup talks, drink-ups, office hours, etc. In particular, we had the first public meetup @Twitter Seattle – thanks to many efforts by Jake Mannix. I gave a new talk, Data Workflows for Machine Learning, that began as an update to the Enterprise Data Workflows with Cascading book. I wanted to expand the analysis of abstraction layers out through many different open source frameworks. The results in the talk develop a “scorecard” to compare/contrast features among the different frameworks in the study. Standing. Room. Only. We spent hours afterwards discussing these topics and more. Judging from SlideShare velocity, this has turned out to be my most popular talk in the past three years. Apparently, people want to talk more about data workflows. Who knew?

On the subject of Apache Mesos, check out the recent paper Quasar: Resource-Efficient and QoS-Aware Cluster Management by Christina Delimitrou and Christos Kozyrakis @Stanford. I enjoyed an opportunity to have a few beers and convo with Christos a few months ago to discuss this work. The space is evolving rapidly! Also, Ben, Flo, and I really enjoyed presenting the Mesos tutorial at Strata. We got great audience feedback from that, and we look forward to doing more of those tutorials!

On the subject of Titan, check out the release of Dendrite – an integration of Titan, Faunus, GraphLab, Jung, AngularJS, and more. Money quote from Lab41: “It turns out that much of the world, both physical and virtual, can be represented as a graph.”

On the subject of Cascading, Concurrent recently released Driven, a monitoring/troubleshooting webapp for Cascading – think: New Relic for Big Data apps. Great to see this new capability reaching the datacenters, and I am grateful to have been involved in the product’s inception. I can tell you that once your team deploys a mission-critical app on a large Hadoop cluster, well, the fun has merely begun! Hint: you need better tooling to troubleshoot edge cases at scale – or your team will be camped out (like our teams were, multiple times) under their desks for the next few weeks, while SVPs keep calling for progress updates.

✽ ✽ ✽

More about machine learning in particular…

I enjoyed the opportunity to catch the talk From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews by Julian McAuley, one of Jure Leskovec’s post-docs @Stanford. You may have already seen Julian’s fascinating work about machine learning and beer preferences in videos or articles. I highly recommend a good read through Julian’s preso (linked above) for the state of the art in time-based recommendation systems.

As a follow-up to my earlier link about quantum algorithms and Minecraft, here’s another interesting video about Google + NASA + D-Wave working on Quantum AI projects.

Data Science Weekly recently ran an interview, Machine Learning => Energy Efficiency, with Kari Hensien, Sr Director Product Development at Optimum Energy, and my colleague Cameron Turner, Data Scientist at The Data Guild. Money quote, in answer to the question “What was the first data set you remember working with?”: “Honestly, I was a girl scout and the cookie season was upon us. I found myself trying to figure out how many houses I would need to stop at in order to sell enough boxes to get the Rubik’s Cube.”

Also, check out a new release – a free report/mini-book Practical Machine Learning: Innovations in Recommendation by Ted Dunning and Ellen Friedman. Pragmatic, timely, and approachable for all levels of expertise.

✽ ✽ ✽

Lately, with the drought in California, my thoughts have been turning more and more toward Agriculture and how data insights at scale and IoT issues apply to very real-world problems. Namely, the livelihood of 40% of the world’s population, in what amounts to $15T of annual GDP globally. It's hard to imagine a topic of research that matters more. An interesting perspective is presented in the article How NASA, Cisco, and a tricked-out planetary skin could make the world a safer place. From my research, the connectivity for IoT sensors is a major failing point currently – along with privacy/security, interoperability among analytics platforms downstream, etc. I would be very interested to hear your experiences related to Ag data at scale.

I’ll leave you with this video about Gravity Glue, which I found amazingly inspirational.