Newsletter Updates for September 2014
Highly recommended, Oct 2: an O’Reilly Media webcast, Spark 1.1 and Beyond, by Patrick Wendell and Ben Lorica, two people who have much to share about where Apache Spark is heading.
My favorite conference in a long while was the Spark Tutorial hosted by Prof. Reza Zadeh @ Stanford ICME – home of world-leading innovation for machine learning at scale. The tutorial featured lectures on Spark Streaming, MLlib, GraphX, etc., from lead committers. Great to be working at Stanford again (if only for a few days this summer) and wonderful to meet many people who participated. Here’s an excellent set of notes. For Stanford affiliates, Prof. Zadeh has an upcoming course CME 323: Distributed Algorithms and Optimization with related content explored in much more detail.
We will hold another Spark Tutorial at UMD in College Park, Maryland on Oct 20–22, hosted by Prof. Jimmy Lin. That event sold out quickly, as did the one at Stanford – so we’ll do more! More about that in a bit.
The Quad @ Stanford University
Another great conference this summer was the inaugural MesosCon 2014 in Chicago last month. Twitter kindly recorded all the sessions. In particular, Ben Hindman’s keynote hints toward cross-datacenter features on the horizon. My talk was about Spark on Mesos, and a related blog post shows a few simple steps to launch a Spark cluster on Mesosphere’s free-tier service atop Google Cloud Platform.
Mesosphere partnered with Google’s Omega team for a killer demo involving Kubernetes and Mesos, showing cluster failover/migration across datacenters in CA and NY. Sounds simple, but the implications are vast. The other killer demo, from eBay, featured YARN on Mesos – with ultimately no code mods required, just an additional JAR file plus some config settings. Check out related slides and video. Ginormous implications for that one, thanks eBay!
Sparky-the-Bear sez: ignite your data
Big news for me this summer was joining Databricks as Director of Community Evangelism. New business cards. Lotsa new tshirts. I’m thrilled to become part of this renowned team, delighted to be out in the field amidst the exponential growth of Spark production use cases.
KDnuggets ran a story recently about our Spark news… and there’s a lot. To quote the Gartner report Hype Cycle for Advanced Analytics and Data Science 2014: “Databricks is providing certification, training and evangelism that mirror the early Hadoop model.” Of course AMPLab + Databricks have been running Spark training sessions for years. I’ve joined to lead this program, and our team is busy delivering:
- Spark Camp @ O’Reilly Strata and Spark Summit events
- professional workshops worldwide
- Spark Tutorial events held at major universities
- a new Spark developer certification by Databricks + O’Reilly Media
- free online resources for learning Spark
- plus a few other items that we won’t reveal just yet…
Databricks and O’Reilly Media partnered to launch Developer Certification for Apache Spark http://oreilly.com/go/sparkcert – a brand spanking new program that leverages the amazing Spark experts @ Databricks + the incomparable editorial team @ O’Reilly Media:
import scala.math.{exp, log}
// exp(log(x) * 2) is just x squared, computed the long way
val results = sc.parallelize(world_class).map(x => exp(log(x) * 2))
results.sum()
So my second O’Reilly book turned out to be a video + Docker image, while the third became a cert exam :) This formal exam takes < 90 minutes: expect multiple-choice questions based on small blocks of code in Python, Java, Scala. Questions test for a range of developer knowledge across Spark Core plus Spark SQL, Streaming, MLlib, GraphX, and typical use cases. We’re establishing the industry standard for measuring and validating technical expertise in Spark.
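For a feel of what a "small block of code" can look like in Scala, here is a purely illustrative sketch. It is not an actual exam question. It assumes a local Spark install on the classpath; the object name `WordCountSketch`, the helper `wordCounts`, and the sample strings are all made up for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits, needed on Spark 1.x

// Hypothetical example object, not part of Spark or of the exam
object WordCountSketch {

  // Counts words across the given lines and returns the totals to the driver
  def wordCounts(sc: SparkContext, lines: Seq[String]): Map[String, Int] = {
    // Transformations only describe the computation; nothing runs yet
    val counts = sc.parallelize(lines)
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _) // sums within each partition before shuffling

    // collect() is an action: this is where the job actually executes
    counts.collect().toMap
  }

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption so the sketch runs without a cluster
    val conf = new SparkConf().setAppName("WordCountSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      wordCounts(sc, Seq("spark on mesos", "spark on yarn")).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The design choice worth noticing is `reduceByKey` rather than grouping all values per key first: it combines values on each partition before the shuffle, so less data crosses the network.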
How to prep for this exam? Don’t worry, it doesn’t require extensive Scala knowledge; however, some familiarity with the Scala code examples shown in the Spark docs would help lots. Mostly, we’re testing whether you understand the Spark execution model and RDDs, and whether you know how to leverage functional programming to get the most out of your cluster, i.e., avoid common bottlenecks. That understanding also helps refute some of the, ahem, FUD that’s been circulating about MapReduce vs. Spark. You are probably good to go if you:
- are comfortable coding the advanced exercises in Spark Camp
- read the Apache Spark user email list regularly and could field 80% of the newbie questions
- have mastered the material released so far in Learning Spark
- have taken at least two of our Spark professional workshops
Alternatively, we’re looking for volunteers. The certification exam will preview on Oct 16 at Strata NY and we need volunteers to evaluate it. You’ll get deep discounts on the Spark developer certificate. Plus, it’s an excellent way to score ginormous brownie points with both Databricks and O’Reilly Media, along with conf coupons, outstanding nerd cred, etc. Become an essential part of the Spark developer community building the next generation of Big Data apps. Let me know. I’ve heard that T. O’Reilly and I. Stoica have authorized us to buy NY gourmet pizza + top-shelf beers for all volunteers (at least let’s start the rumor).
Meanwhile, stay up to date with the latest advances and training in Spark, and help prep for the certification exam. Workshop materials are authored by Databricks, and we’ve trained and certified these instructors. Upcoming training for Spark will be held in SF, DC, London, Paris, Barcelona, Stockholm, and Dublin:
- Oct 10, London via Big Data Partnership
- Oct 15, NYC via Spark Camp @ Strata NY (SOLD OUT)
- Oct 22, Paris via Scala.IO
- Oct 27, SF via Big Data TechCon
- Nov 14-ish, Stockholm @ SICS [TBD]
- Nov 19, Barcelona via O’Reilly Strata EU
- Dec 5, London via Big Data Partnership
I look forward to the EU trip, but I regret not arriving in time for Scala.IO – amazing talks lined up this year. Also looking forward to Big Data TechCon, and in particular I recommend The Hitchhiker’s Guide to Machine Learning with Python and @ApacheSpark by Krishna Sankar.
BTW, keep your eyes peeled for more material (courses, talks, videos, webcasts, etc.) about architectural design patterns that leverage Spark together with other popular frameworks, such as Cassandra and Kafka. Our team has been working closely with DataStax and others to bring you solutions that go far, far Beyond Hadoop. For those who weren’t watching closely: an emerging tech stack that integrates Spark, Cassandra, Kafka, ElasticSearch, etc., recently pulled in a quarter billion in VC financing.
Just Enough Math
The Just Enough Math material is progressing well… Similar to OSCON, we’ll have a tutorial at Strata NY on Wed, Oct 15 at 1:30pm, expecting 100+ people this time. There’s also a public Docker image now, plus more work with O’Reilly on this project. We needed more Mesos + Docker foo to make progress on that infrastructure.
Hopefully, we’ll have an upcoming series of lectures too!
3D Printer Room @ Singularity University
The return of the fellowships
It was an honor to present at Singularity University this summer, along with a workshop at Insight Data Engineering Fellows Program. Looking forward to visiting Zipfian Academy soon too.
We have bunches and gobs o’ regional confs and meetups scheduled:
- Oct 1: Chicago Big Data Everywhere
- Oct 2: Boulder/Denver Spark Meetup @ Datalogix
- Oct 7: SF, The Spark SQL Optimizer and External Data Sources API @ Deloitte
- Oct 8: Seattle, Deep Dive into Spark, Tachyon, and Mesos Internals @ Expedia
- Oct 15: NYC Spark Committer Night @ Bloomberg
- Oct 21: Maryland: Apache Spark in Four Parts @ Raytheon
- Nov 14-ish: Stockholm Big Data Meetup [TBD]
- Nov 17: Madrid, Big Data Spain @ Kinépolis
- Nov 18: LA, Getting Started with Spark and Scala @ DirecTV
Also mark your calendars for:
- Austin, Jan 10: Data Day Texas
- San Jose, Feb 18–20: Strata CA
- NYC, Mar 18–19: Spark Summit East
- SF, Jun 15–17: Spark Summit 2015
Ag+Data
Continuing on the prior theme of Ag+Data, James Hamilton (Amazon) recently wrote an intriguing blog post, Data Center Cooling Done Differently, about a new kind of colocation: datacenters and desalinization. Desalinization at scale seems inevitable here in California – perhaps taking a cue from successes in Australia, etc. FWIW, I prepared a VC pitch for a related venture in 2008, but pulled back after initial feedback. Remember: always go with your gut!
I thoroughly enjoyed this gem about “Organic Ready” non-GMO seeds… Here's to gametophytic incompatibility in large doses. Also check Water’s Edge for an interesting special report on rising sea levels. Big Data comes in handy for contending with the crises that global warming brings. Three items to check out from low Earth orbit: The Satellite, Spaceknow, OmniEarth. Just in case we fry the biosphere before we can get a semi-permanent backup archived on Luna or Mars… one dreads the thought, but artificial photosynthesis is becoming more of a reality. I say “dread” because that idea recalls a vision of Trantor or perhaps Silent Running.
While we’re talking about remote sensing, I should also mention a follow-up study on the data point about GE’s 12 exabytes/day from turbine sensors on commercial flights: 2000x faster detection of rare critical failure modes. Here's to those early successes turning into a trendline for IoT.
Misc.
A few pointers to notable work by friends and family: Film Theory and Chatbots by Robby Garner; Don Webb: Writing the Science Fiction Novel @ UCLA Extension; Eisoptrophobia by Akira Rabelais; AlaVoidDistribution by William Barker.
Then I’ll leave you with something haunting and epic: NASA Space Sounds.
That's the update for now. See you in NY, DC, EU on the event horizon!