First, I have a webcast on O’Reilly later this week: Getting Started Running Apache Spark on Apache Mesos. That will be Friday, January 24, at 10am PT.
Next, we have several workshops coming up: Bay Area in late Feb, DC in mid-Mar, then Boston in late Apr. The new Intro to Machine Learning workshop got great reviews, especially with the new “Just Enough Math” segment. The workshop focuses on understanding the business use cases for Machine Learning: it begins with an overview of the history of ML, how many different fields contributed, and an explanation of the wide variety of terminology… then looks at team process, evaluating models in production, etc. Plus, a general rubric for how to approach building ML-based apps. Oh, we take a good look at comparing algorithms, too :) Most of the code examples are brief, based on R and Python.
There’s another workshop getting added, called Cluster Compute Integrations. We’ll look at a variety of distributed frameworks, how to build apps which integrate them: Spark, Storm, Hadoop, Cassandra, Titan, etc. More specifics about that in a couple weeks – but the emphasis is hands-on experience running these frameworks on AWS clusters.
Data Day Texas 2014 was held in Austin, Texas on Jan 11th. It was great to get to catch up with many friends in Austin. We had twice as many people attending as the year before, and a great set of speakers. Alex Moundalexis @Cloudera published a great set of photos, and Eventifier covered the conf too. Here’s a review of some of the talks, along with a collection of slide decks from the conference.
Clearly, Josh Wills @Cloudera stole the show. Check out the slides from his From the Lab to the Factory talk, and for more details check his recent preso video on InfoQ.
Another popular talk, Introduction to KNIME Data Mining Software by Michael Berthold, was reported as worth the price of admission.
Mining Social Web APIs with IPython Notebook by Matthew Russell @Digital Reasoning was an excellent tutorial. Check out his books on O’Reilly Media. Matthew also teamed up with Steve Kramer @Paragon Science for Got Chaos? Extracting Business Intelligence from Email with Natural Language Processing and Dynamic Graph Analysis. Check out slides 42 onward for the math regarding finite-time Lyapunov exponents (FTLEs) for nonlinear time-series analysis – in this case, analyzing the Enron email data set. Recommended.
The Road to Summingbird: Stream Processing at (Every) Scale by Sam Ritchie @Paddleboard was another of my top picks. Sam was one of the authors of Summingbird at Twitter and recently relocated to Boulder. They’re lucky: the Big Data expertise in the Front Range just jumped exponentially.
Algorithmic Music Recommendations at Spotify by Chris Johnson @Spotify featured some of the more hard-core data science of the conference. It was fun to see Spotify users/fans learning the details of how its recommendations work under the hood.
If The Singularity Arrives, Will It Be By Design Or Evolution? by Bill Worzel @Evolution Enterprises covered Genetic Programming based on combinators. For the functional programming fans out there, check this out.
Developing Real-Time Data Pipelines with Apache Kafka by Joe Stein @AllThingsHadoop was another popular talk. Check out the related material, Launching Kafka with Apache Mesos, and in general check out Joe’s All Things Hadoop blog+podcast!
I did not get to catch all of the talks, but I've tried to collect as many of the slide decks as have been posted: Polyglot Persistence at Parse by Charity Majors @Facebook, Measure All The Things! by Gary Dusbabek @Rackspace, Message Architectures in Distributed Systems by Eric Lubow @SimpleReach, and A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal Greenplum Database by Srivatsan Ramanujam @Pivotal Labs.
Meanwhile, I tried to tie together the many themes of DDTx 2014, and the themes of Big Data circa 2014 in general, with an umbrella keynote entitled The Big Picture.
That's the update for now. See you at Strata, if not sooner in Seattle!