Newsletter Updates for September 2014

Highly recommended, Oct 2: an O’Reilly Media webcast, Spark 1.1 and Beyond, by Patrick Wendell and Ben Lorica – two people who have much to share about where Apache Spark is heading.

My favorite conference in a long while was the Spark Tutorial hosted by Prof. Reza Zadeh @ Stanford ICME – home of world-leading innovation for machine learning at scale. The tutorial featured lectures on Spark Streaming, MLlib, GraphX, etc., from lead committers. Great to be working at Stanford again (if only for a few days this summer) and wonderful to meet many people who participated. Here’s an excellent set of notes. For Stanford affiliates, Prof. Zadeh has an upcoming course CME 323: Distributed Algorithms and Optimization with related content explored in much more detail.

We will hold another Spark Tutorial at UMD in College Park, Maryland on Oct 20–22, hosted by Prof. Jimmy Lin. That event sold out quickly, as did the one at Stanford – so we’ll do more! More about that in a bit.

The Quad @ Stanford University
Another great conference this summer was the inaugural MesosCon 2014 in Chicago last month. Twitter kindly recorded all the sessions. In particular, Ben Hindman’s keynote hints toward cross-datacenter features on the horizon. My talk was about Spark on Mesos, and a related blog post shows a few simple steps to launch a Spark cluster on Mesosphere’s free-tier service atop Google Cloud Platform.

Mesosphere partnered with Google’s Omega team for a killer demo involving Kubernetes and Mesos, showing cluster failover/migration across datacenters in CA and NY. Sounds simple, but the implications are vast. The other killer demo, from eBay, featured YARN on Mesos – ultimately with no code mods required, just an additional JAR file plus some config settings. Check out the related slides and video. Ginormous implications for that one – thanks, eBay!

Sparky-the-Bear sez: ignite your data

Big news for me this summer was joining Databricks as Director of Community Evangelism. New business cards. Lotsa new t-shirts. I’m thrilled to become part of this renowned team, and delighted to be out in the field amid the exponential growth of Spark production use cases.

KDnuggets ran a story recently about our Spark news… and there’s a lot. To quote the Gartner report Hype Cycle for Advanced Analytics and Data Science 2014: “Databricks is providing certification, training and evangelism that mirror the early Hadoop model.” Of course AMPLab + Databricks have been running Spark training sessions for years. I’ve joined to lead this program, and our team is busy delivering:
Databricks and O’Reilly Media partnered to launch Developer Certification for Apache Spark http://oreilly.com/go/sparkcert – a brand spanking new program that leverages the amazing Spark experts @ Databricks + the incomparable editorial team @ O’Reilly Media:

val results = sc.parallelize(world_class).map(x => exp(log(x) * 2))

So my second O’Reilly book turned out to be a video + Docker image, while the third became a cert exam :) This formal exam takes < 90 minutes: expect multiple-choice questions based on small blocks of code in Python, Java, Scala. Questions test for a range of developer knowledge across Spark Core plus Spark SQL, Streaming, MLlib, GraphX, and typical use cases. We’re establishing the industry standard for measuring and validating technical expertise in Spark.
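For a flavor of the small code blocks involved (a hypothetical illustration, not an actual exam item), the bread-and-butter pattern such questions exercise is map-then-reduce-by-key. Here’s that logic in plain Python, simulated without a cluster so it runs anywhere:

```python
from collections import Counter

# "map" stage: emit one (word, 1) pair per word
words = "to be or not to be".split()
pairs = [(w, 1) for w in words]

# "reduce-by-key" stage: sum the counts per word
counts = Counter()
for word, n in pairs:
    counts[word] += n

print(dict(counts))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In Spark proper, the same shape is rdd.map(word => (word, 1)).reduceByKey(_ + _), with the reduction running in parallel across partitions.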

How to prep for this exam? Don’t worry, it doesn’t require extensive Scala knowledge; however, some familiarity with the Scala code examples shown in the Spark docs would help lots. Mostly, we’re testing to see whether you understand the Spark execution model and RDDs, and how to leverage functional programming to get the most out of your cluster – i.e., how to avoid common bottlenecks and refute some of the, ahem, FUD that’s been circulating about MapReduce vs. Spark. You are probably good to go if you:
Alternatively, we’re looking for volunteers. The certification exam will preview on Oct 16 at Strata NY, and we need volunteers to evaluate it. You’ll get deep discounts on the Spark developer certification. Plus, it’s an excellent way to score ginormous brownie points with both Databricks and O’Reilly Media, along with conf coupons, outstanding nerd cred, etc. Become an essential part of the Spark developer community building the next generation of Big Data apps. Let me know. I’ve heard that T. O’Reilly and I. Stoica have authorized us to buy NY gourmet pizza + top-shelf beers for all volunteers (at least let’s start the rumor).

Meanwhile, stay up to date with the latest advances and training in Spark, and help prep for the certification exam. Workshop materials are authored by Databricks, and we’ve trained and certified these instructors. Upcoming training for Spark will be held in SF, DC, London, Paris, Barcelona, Stockholm, and Dublin:
I look forward to the EU trip, but I regret not arriving in time for Scala.IO – amazing talks lined up this year. Also looking forward to Big Data TechCon, and in particular I recommend The Hitchhiker’s Guide to Machine Learning with Python and @ApacheSpark by Krishna Sankar.

BTW, keep your eyes peeled for more material (courses, talks, videos, webcasts, etc.) about architectural design patterns that leverage Spark together with other popular frameworks, such as Cassandra and Kafka. Our team has been working closely with DataStax and others to bring you solutions that go far, far Beyond Hadoop. For those who weren’t watching closely: an emerging tech stack that integrates Spark, Cassandra, Kafka, ElasticSearch, etc., recently pulled in a quarter-billion dollars in VC financing.

Just Enough Math

The Just Enough Math material is progressing well… Similar to OSCON, we’ll have a tutorial at Strata NY on Wed, Oct 15 at 1:30pm, expecting 100+ people this time. There’s also a public Docker image now, plus more work with O’Reilly on this project. We needed more Mesos + Docker fu to make progress on that infrastructure.

Hopefully, we’ll have an upcoming series of lectures too!

3D Printer Room @ Singularity University

The return of the fellowships

It was an honor to present at Singularity University this summer, along with a workshop at Insight Data Engineering Fellows Program. Looking forward to visiting Zipfian Academy soon too.
We have bunches and gobs o’ regional confs and meetups scheduled:
Also mark your calendars for:


Continuing on the prior theme of Ag+Data, James Hamilton (Amazon) wrote an intriguing blog post recently, Data Center Cooling Done Differently, about a new kind of colocation: datacenters and desalination. Desalination at scale seems inevitable here in California – perhaps taking a cue from successes in Australia, etc. FWIW, I prepared a VC pitch for a related venture in 2008, but pulled back after initial feedback. Remember: always go with your gut!

I thoroughly enjoyed this gem about “Organic Ready” non-GMO seeds… Here's to gametophytic incompatibility in large doses. Also check Water’s Edge for an interesting special report on rising sea levels. Big Data comes in handy for contending with these crises related to global warming. Three items to check out from low Earth orbit: The Satellite, Spaceknow, and OmniEarth. Just in case we fry the biosphere before we can get a semi-permanent backup archived on Luna or Mars… one dreads the thought, but artificial photosynthesis is becoming more of a reality. I say “dread” because that idea recalls a vision of Trantor or perhaps Silent Running.

While we’re talking about remote sensing, I should also mention a follow-up study on the earlier data point about GE collecting 12 exabytes/day from turbine sensors on commercial flights: 2000x faster detection of rare critical failure modes. Here's to those early successes turning into a trendline for IoT.


A few pointers to notable work by friends and family: Film Theory and Chatbots by Robby Garner; Don Webb: Writing the Science Fiction Novel @ UCLA Extension; Eisoptrophobia by Akira Rabelais; AlaVoidDistribution by William Barker. 

Then I’ll leave you with something haunting and epic: NASA Space Sounds.

That's the update for now. See you in NY, DC, EU on the event horizon!


Data & Analytics Fellowship - O'Reilly Strata conf

Amplify Partners Data & Analytics Fellowship — designed for engineers, analysts, students, and anyone else passionate about data science, analytics, data-driven apps, and data infrastructure. The fellowship includes full conference registration, airfare, and hotel accommodation to attend the Strata NY conference, Oct 15-17 in NYC.

Fellows will be invited to join Amplify Partners along with a select group at a private dinner during the event, as well as for selected gatherings and Amplify Partners events ongoing throughout the year.

Applications are due Sep 30


Spark atop Mesos on Google Cloud Platform

When we run Databricks training for Apache Spark, we generally emphasize how to launch a Spark shell on a laptop and work from that basis. It’s a great way to get started using Spark. It’s also the preferred approach for developing apps, for many people who use Spark in production.

To wit, once you have an application running correctly on your laptop with a relatively small data set, then move your app to a cluster and work with data at scale. Many organizations provide Hadoop clusters on which it is quite simple to launch a Spark app.
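As a sketch of that laptop-to-cluster workflow (the file names and master URL here are illustrative, not from any particular deployment):

```shell
# develop and debug locally first, against a small sample
spark-submit --master "local[4]" my_app.py data/sample.txt

# then submit the same app, unchanged, to a cluster for data at scale
spark-submit --master mesos://<master-ip>:5050 my_app.py hdfs:///data/full/
```

The only change between the two runs is the --master URL; the application code stays the same.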

If you don’t have a cluster available already, another good approach is to leverage cloud computing. One quick and easy way to launch a Spark cluster in the cloud is to run atop Apache Mesos on the Google Compute Engine cloud service. This is both simple for a beginner to get started, and robust at scale for production use. #no #hadoop #needed

The following five steps show how to launch, use, and monitor Spark running on a Mesos cluster on the Google Cloud Platform. Well, more like seven steps – if you include a brief wait time while the VMs launch, plus your little happy dance while celebrating at the end.

Step 1: Set up your GCP account

Set up an account on Google Cloud Platform by going to https://console.developers.google.com/project and creating a project. Let’s use a project called spark-launch-00 for this example. Once that is created, be sure to click on the Billing link and arrange your payment details.

Step 2: Launch a Mesosphere cluster

Next, check out the free-tier service Mesosphere for Google Cloud Platform by going to https://google.mesosphere.io/ and launching a cluster. That requires a login from your Google account. Then click the +New Cluster button to get started. You will be prompted to choose a configuration for your Mesosphere cluster. For this example, click on the Select Development button to run a four-node cluster.

Then you need to provide both a public SSH key and a GCP project identifier to launch a cluster with this service. If you need help on the former, Mesosphere provides a tutorial for how to generate SSH keys. Copy and paste your SSH public key into the input field and click the Create button.
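If you haven’t generated a key pair before, the standard OpenSSH tool handles it (the file name here is just a suggestion):

```shell
# generate a new RSA key pair; use a real passphrase in practice
ssh-keygen -t rsa -b 4096 -N "" -f "$HOME/.ssh/mesosphere_gcp"

# print the public half, ready to paste into the Mesosphere form
cat "$HOME/.ssh/mesosphere_gcp.pub"
```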

Google Cloud Console

Next, go to the Google Cloud Console in your browser and click on the Projects link. Find the GCP project identifier in the second column of the table that lists your projects. In this example, we’ll use spark-launch-00 for that identifier. Copy and paste that string into the input field and click the Next button.

Now it’s time to launch your cluster. Click the shiny purple Launch Cluster button. NB: do not click the shiny red History Eraser button.

The price tag for this development configuration will run you a walloping total of approximately US$0.56 per hour. Mesosphere charges absolutely nada on top of the cost of the VMs. Depending on how long you run the example below, it should cost much less than the price of a respectable chai latte. You’re welcome.


It will take a few minutes for those VMs to launch and get configured. You can use this precious time to meditate for some serious non-thinking, or catch up on YouTube videos. Or something.

Within a mere matter of minutes, you should receive a delightful email message from Mesosphere, indicating that your new cluster is ready to roll. Or, if you’re impatient, or OCD, or something, then just keep refreshing either the GCP console or the Mesosphere cluster console. Or refresh both, if you must. In any case, you should see the VMs updating.

Step 3: The Master and The Margarita

Check your Mesosphere cluster console in the browser, and scroll down to the Topology section. There should be one VM listed under the Master section, with both internal and external IP addresses shown for it. Copy the internal IP address for the Mesos master, and make a note of its external IP address.

Mesosphere Cluster Console

Next, you need to log in through SSH to the Mesos master. You could use the OpenVPN configuration through the Mesosphere console – which is great for production use, but it adds a learning curve for those who are just getting started. It’s much simpler to log in through the GCP console:
  1. click on the spark-launch-00 project link
  2. click on the Compute section
  3. click on the Compute Engine subsection
  4. click on the VM instances subsection
Then find your Mesos master, based on its external IP address. In this example, the external IP address was for the master. You’ll need to change that to whatever your master’s external IP address happens to be… Anywho, click on the SSH button for the master to launch a terminal window in your browser.

Once the SSH login completes, you must change to the jclouds user:
sudo bash
su - jclouds
Next, set an environment variable to point to your Mesos master, substituting the internal IP address you copied earlier:
export MESOS_MASTER=<internal IP of the master>
Now let’s download a binary distribution for Apache Spark. This example uses the latest 1.0.2 production release:
wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2-bin-hadoop2.tgz
tar xzvf spark-1.0.2-bin-hadoop2.tgz
Great. We need to configure just a few variables…
cd spark-1.0.2-bin-hadoop2/conf

cp spark-env.sh.template spark-env.sh
echo "export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so" >> spark-env.sh
echo "export SPARK_EXECUTOR_URI=http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2-bin-hadoop2.tgz" >> spark-env.sh
cp spark-defaults.conf.template spark-defaults.conf
echo "spark.mesos.coarse=true" >> spark-defaults.conf

cd ..
Bokay, ready to roll. Launch a Spark shell that points to the Mesos master:
./bin/spark-shell --master mesos://$MESOS_MASTER:5050
Once the Spark shell launches, you should see a scala> prompt. ¡Bueno!

Step 4: Run a simple Spark app

Next, let’s run a simple Spark app to verify your cluster operation. Copy and paste the following two-liner at the scala> prompt:
val data = 1 to 10000
sc.parallelize(data).sum()
That code will do three things:
  1. parallelize ten thousand numbers across your cluster as an RDD
  2. sum them together
  3. print the result on your console
The result should be a ginormous number beginning with 5 followed by zeros and another 5 in its midst. Specifically, for the OCD math geeks in the audience, the value needs to be the same as (10000*10001)/2, at least in the northern hemisphere.
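For the skeptics, here’s that closed form checked in plain Python – no cluster required:

```python
# Gauss's formula: 1 + 2 + ... + n == n * (n + 1) / 2
n = 10000
total = sum(range(1, n + 1))
print(total)  # 50005000
assert total == n * (n + 1) // 2
```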

Step 5: Welcome to the Panopticon

For extra goodness, check out the Spark and Mesos consoles while your Mesos cluster is running Spark apps. First, open some TCP ports by adding a firewall rule… Go back to the GCP console window in your browser and locate the Mesos master, then click on any link in the Network column. Next, under the Firewall rules section, click on the Create new link:
  1. give it a name, e.g., spark-console
  2. copy and paste the following into the Protocols and Ports input field: tcp:4040,5050
  3. click on the shiny blue Create button
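If you’d rather script that step, the Cloud SDK can create an equivalent rule from the command line – a hedged sketch: the flag names follow later gcloud releases, and the network name assumes the default:

```shell
# open the Spark (4040) and Mesos (5050) console ports on the default network
gcloud compute firewall-rules create spark-console \
  --network default \
  --allow tcp:4040,tcp:5050
```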
That new firewall rule will take a few moments to propagate across your cluster, but you should see its notification updating on that web page. Once you see that the rule is in place, browse to the Spark and Mesos consoles using the Mesos master’s external IP address. In this example, the external IP address was for the master.

For the Spark console, open port 4040 at the master’s external IP address in your browser. For the Mesos console, open port 5050. Of course, you’ll need to substitute the external IP address for your cluster.

Then click through these consoles to see how the cluster resources are getting used. The Spark docs give more details about monitoring.

Finally, after a good gawk through the monitoring consoles, you’ll probably want to shutdown the cluster. Go back to the Mesosphere cluster window and click on the shiny red Destroy cluster button… the jolly, candy-like button.


Congrats, you have just run a Spark app on a Mesos cluster, based on the Mesosphere free-tier service, which you launched on Google Cloud Platform. That’s quite an accomplishment! Do you feel all DevOps-ish suddenly?

It’s time to celebrate, ergo the Margarita.