Newsletter Updates for December 2014

Chicago, Boulder, NYC, DC, SF, Stanford, London, Stockholm, Madrid, Barcelona, Amsterdam, Dulles, Baltimore, LA. The range of speaking events and business travel over the past quarter almost bewilders, but I’m grateful to get to meet many interesting people and learn about new projects. 

Also feeling grateful to enjoy some quiet time at home with family over the holidays, and I wish very happy holidays to you and yours.

Conference Summaries

Strata NY set a new record with about 450 people attending Spark Camp. There was a spare room, plus an hour break in the fray, so we held an impromptu “Ask Us Anything” about Spark – that has turned into a new kind of open source ritual at Strata confs, especially for handling the more advanced audience questions. Also, Bloomberg kindly hosted a large Spark Committer Night meetup event, their largest to-date.

Manhattan, from NY Water Taxi at Port Imperial
Throughout many conferences and meetup events over the past few months, one demo in particular stood out. David Jonker and Rob Harper from Oculus Info in Toronto gave a talk about Aperture Tiles at Strata NY. Last talk of the show, and quite arguably the best. This open source framework, partly built atop Spark, provides interactive data exploration with continuous zooming on large scale datasets. Highly recommended.

The week after Strata conf in NYC, some of our team found our way slightly south to the University of Maryland, where we got to teach alongside the renowned Jimmy Lin. The week included a Spark Tutorial on campus, plus the initial meeting of the Apache Spark Maryland meetup. Much fun, and we look forward to returning to UMD again soon.

Arriving back to the Bay Area just in time, I caught the launch of the new GalvanizeU program in downtown SF. One challenge that particular evening was getting scheduled to speak head-to-head with the final game in the World Series. That keynote, Data Science in Future Tense, examined some of the near-past and near-future of the field – hopefully indicating some non-intuitive directions. 

GalvanizeU is located next to the Transbay Center, just a few blocks away from the new Databricks office. They provide a hands-on graduate program in Data Science, in an urban setting and working closely with industry partners. Galvanize started in Boulder and is also expanding soon into Seattle. We’re thrilled about our new neighbors.

Home just long enough to take the kiddos trick-or-treating and attend GCPLive, then on to Europe… During a brief visit in the UK, I got to present about the latest in Spark Streaming at the London Spark Meetup: Tiny Batches in the wine (a callback to Don Ho, for those who were born more recently – ideal for getting your luau on). Then on to Stockholm with gracious hosting by Spotify, Ericsson, and SICS.

Good times @ Big Data Spain, Madrid

Madrid came next, for the annual Big Data Spain conf. Noticing a joke painted on the side of a jet at the airport, I had a hunch immediately that Madrid would be lots of fun. I was not disappointed. Our hosts at Paradigma Tecnológico and Stratio presented an amazing conference, one of my favorites in a long, long time. I was fortunate to give a keynote talk, alongside many other excellent talks, such as from friends at Cloudera and Google BigQuery. I highly recommend Big Data Spain. More about Stratio in a bit…

The beach at El Poblenou, Barcelona
Taking a train from Madrid to Barcelona, admittedly I was missing the former, but Barcelona is a wonderful place. Imagine yourself in Santa Barbara, except that the city is 50 times larger, thousands of years older, and packed full o’ amazing culture. Strata EU was located at a conference center right next to the beach. We held the first official Spark developer certificate exam, plus a large Spark Camp event (25% of the conference attended), a meetup at UPC, and a second iteration of our “Ask Us Anything” about Spark.

Locavore feasting in Catalunya
Business travel Spark-style does not allow much downtime. Effectively one day off during two full weeks in EU. Fortunately that just happened to be during a weekend in Barcelona, the day after Strata concluded. I rented an Airbnb condo near the beach in El Poblenou, then wandered busy Rambla markets, through the crowd surrounding a busker string trio, gathering items to make a small feast. Only in Catalunya.

Amstel River in Amsterdam
A quick stop in Amsterdam, with a very fun talk hosted at eBay with hours of Q&A, then back home. Long enough for a family Thanksgiving feast, then off to DC, Baltimore, and LA. Excellent events and good friends met along the way, particularly the Los Angeles Apache Spark meetup hosted by Rubicon Project. Much appreciated.


The curiously named Likelihood T. Prior noted on Twitter: “Spark spark spark spark, spark spark spark spark. #Strataconf synopsis complete.” Some went as far as to begin calling “Strata + Hadoop World” by a new name, “Strata + Spark World”. I like the sound of that.

To help keep track of this rocket ride, I’ve begun curating an ongoing list http://goo.gl/2YqJZK of the talks, workshops, etc., related to Spark worldwide. Please let me know if you have events to add.
Speaking of events, recently we began to increase the cadence for Bay Area Spark meetup events. These talks get live-streamed, with the archives published on the Apache Spark channel on YouTube. Databricks also recently announced Spark Packages, a community index of packages. The site had to be moved shortly after its launch, due to overwhelming popularity. Good stuff on both the video channel and package repo.

So much news about Spark has happened in the past few months. I’d like to summarize with a few gems collected along the way…
Not least of these items, the Databricks team broke YHOO’s previous world record for the Daytona GraySort contest. That tied for the 100 TB sort on AWS, using 1/10 the number of servers and running 3x faster than YHOO Hadoop clusters. #justsayin


Part of my job involves the curriculum for Spark instruction. Our big news recently is that edX and the University of California will be offering two new MOOCs about Spark, sponsored by Databricks.

The first is Introduction to Big Data with Apache Spark by Prof. Anthony Joseph at UC Berkeley. This comprehensive introduction to Spark, as well as Big Data, is based entirely on Python programming and aimed at developing Data Science skills. This course begins on 2015–02–23.

The second is Scalable Machine Learning by Prof. Ameet Talwalkar at UCLA. This hands-on course focuses on distributed machine learning at scale, based on examples using open data, also in Python. This course begins on 2015–04–14.

Note that some taking Spark MOOCs will have the option to use Databricks Cloud free student accounts. Similarly, we will be integrating use of DBC free accounts into our other Spark training events.


Several years ago, I was fortunate to work for a CEO who understood how to leverage a distributed workplace. I studied the management practices involved, and in particular have grown to appreciate ROWE greatly. These practices seem all too rare among early-stage tech start-ups in Silicon Valley. However, a few tech firms (DataStax and Typesafe come to mind) have embraced distributed workplace models. Frankly, correlations between effective approaches to gender equality and practices such as ROWE should be on every VC’s radar.

With respect to workplace practices – effective or otherwise – two recent articles caught my attention:
Great words of wisdom about two of the worst anti-patterns for successful tech organizations. The most telling part is the “canary in a coal mine” effect: to watch and see who becomes the most offended by these points. Egregious (sometimes outright hostile) use of email, chat, meetings, etc., and the fallacy of “crunch mode” stand as two of my top determinants for evaluating a company. Right alongside “we provide free snacks and meals” vs. “we offer reasonable health care plans” – which somehow turn out to be at odds in far too many start-ups.

BTW, really looking forward to catching Chad speak at GOTO Chicago next May.

Just Enough Math

The Just Enough Math material continues to evolve… Allen and I gave a tutorial at Strata NY, working closely with O’Reilly Media to export content to IPython Notebook within a Docker container for participants to run in the cloud. Rackspace provided the hosting, which in turn was an alpha test for their Nature magazine IPython interactive demo. Welcome to the future of publishing.

Andrew Odewahn and I entered a version of this for the Boston instance of Docker Global Hack Day #2 – frankly, Andrew did like 99.9999% of the work on that one :) Meanwhile, speaking of the future of publishing, JEM provides an example in the new Publishing Workflows for Jupyter by Andrew Odewahn, Kyle Kelley, and Rune Madsen.

Beyond publishing, we do have some math to suggest… Two papers caught my attention recently:
Oh, and riffing off the “Quantum Algorithms on the Moon” meme from JEM, note that NASA, Google and USRA establish Quantum Computing Research Collaboration such that 20% of computing time will be provided to the university community. In case you have some large data set that’s just screaming to get crunched on a D-Wave. Like you do.


Other big news was the Google Cloud Platform Live conference in SF on Nov 4. The message from #GCPLive was largely about containers… in short, the notion of “The datacenter IS the computer” going mainstream. To paraphrase one comment during the conf: “Customers get locked into host-based patterns, so they struggle with intertwined systems.” Well said. Definitely looking forward to the new GKE service based on Kubernetes.

Other big news was awaiting in London. Namely, the team behind Weave. Recall that the JEM tutorial had been an alpha test for the IPython + Docker + Rackspace + Nature magazine thing? We learned a truism the hard way, with minutes to go before the event started: Docker does little to resolve crucial issues outside of the containers. Enter Weave, handling difficult matters outside the container, such as networking and crypto. Check their blog for tasty insights, e.g., Automated provisioning of multi-cloud weave network with Terraform. Highly recommended.

Speaking of Docker, I really enjoyed this talk by Adrian Cockcroft @DockerCon: State of the Art in Microservices. Especially slides #8–19, product development process.

Speaking of Microservices, here’s a good overview: The Strengths and Weaknesses of Microservices by Abel Avram on InfoQ.


Continuing on the Ag+Data front, check out the excellent article GeoTrellis Adapts to Climate Change and Spark about how Climate Change analytics drove Spark adoption at Azavea. They integrated Spark and Accumulo to support fast computation of climate impact metrics for DoE, which should be included in the 0.10 release of GeoTrellis.

NYT ran an interactive analysis/visualization, Flooding Risk From Climate Change, Country by Country, which perhaps helps explain Silicon Valley rumors about Google building ferry ports at corporate campuses along SF Bay.

I’m a big fan of Danielle Nierenberg @FoodTank in Chicago. A recent article, How Vegetables Can Save the World, is brief, accessible, and quite to the point. More of that on FoodTank.
Meanwhile, considering the many challenges ahead in Ag worldwide, I’m curious whether some programmable matter could become useful on farms to leverage data? Sort of an asymptote for IoT.

Upcoming Events

Many interesting conferences and other events are planned for the months ahead. Please do check the http://goo.gl/2YqJZK listings. In particular, mark your calendars for:
O'Reilly studio in Sebastopol, for new "Intro Spark" video


I’ll leave you with something fun and something epic.

First, the fun – though it’s quite epic in a way: LumiGeek. “We make Arduino shields for LEDs, audio-reactive drivers, and custom solutions for architectural and artistic endeavors.” Check their installation at the new Galvanize Cafe in SF, and look about carefully for a subtle case of anamorphosis.

Second, the epic – if you haven’t seen it yet, it’s well worth four gorgeous minutes of video: Wanderers by Erik Wernquist, narrated by Carl Sagan. Money quote @1:45: “Herman Melville in Moby Dick spoke for wanderers in all epochs and meridians…”

That's the update for now. See you in Austin, San Jose, and NYC on the event horizon!


Newsletter Updates for September 2014

Highly recommended, Oct 2: an O’Reilly Media webcast Spark 1.1 and Beyond by Patrick Wendell and Ben Lorica. Two people who have much to share about where Apache Spark is heading.

My favorite conference in a long while was the Spark Tutorial hosted by Prof. Reza Zadeh @ Stanford ICME – home of world-leading innovation for machine learning at scale. The tutorial featured lectures on Spark Streaming, MLlib, GraphX, etc., from lead committers. Great to be working at Stanford again (if only for a few days this summer) and wonderful to meet many people who participated. Here’s an excellent set of notes. For Stanford affiliates, Prof. Zadeh has an upcoming course CME 323: Distributed Algorithms and Optimization with related content explored in much more detail.

We will hold another Spark Tutorial at UMD in College Park, Maryland on Oct 20–22, hosted by Prof. Jimmy Lin. That event sold out quickly, as did the one at Stanford – so we’ll do more! More about that in a bit.

The Quad @ Stanford University
Another great conference this summer was the inaugural MesosCon 2014 in Chicago last month. Twitter kindly recorded all the sessions. In particular, Ben Hindman’s keynote hints toward cross-datacenter features on the horizon. My talk was about Spark on Mesos, and a related blog post shows a few simple steps to launch a Spark cluster on Mesosphere’s free-tier service atop Google Cloud Platform.

Mesosphere partnered with Google’s Omega team for a killer demo involving Kubernetes and Mesos, showing cluster failover/migration across datacenters in CA and NY. Sounds simple, but the implications are vast. The other killer demo, from eBay, featured YARN on Mesos – with ultimately no code mods required, just an additional JAR file plus some config settings. Check out related slides and video. Ginormous implications for that one, thanks eBay!

Sparky-the-Bear sez: ignite your data

Big news for me this summer was joining Databricks as Director of Community Evangelism. New business cards. Lotsa new tshirts. I’m thrilled to become part of this renowned team, delighted to be out in the field amidst the exponential growth of Spark production use cases.

KDnuggets ran a story recently about our Spark news… and there’s a lot. To quote the Gartner report Hype Cycle for Advanced Analytics and Data Science 2014: “Databricks is providing certification, training and evangelism that mirror the early Hadoop model.” Of course AMPLab + Databricks have been running Spark training sessions for years. I’ve joined to lead this program, and our team is busy delivering:
Databricks and O’Reilly Media partnered to launch Developer Certification for Apache Spark http://oreilly.com/go/sparkcert – a brand spanking new program that leverages the amazing Spark experts @ Databricks + the incomparable editorial team @ O’Reilly Media:

val results = sc.parallelize(world_class).map(x => exp(log(x) * 2))

So my second O’Reilly book turned out to be a video + Docker image, while the third became a cert exam :) This formal exam takes < 90 minutes: expect multiple-choice questions based on small blocks of code in Python, Java, Scala. Questions test for a range of developer knowledge across Spark Core plus Spark SQL, Streaming, MLlib, GraphX, and typical use cases. We’re establishing the industry standard for measuring and validating technical expertise in Spark.

How to prep for this exam? Don’t worry, it doesn’t require extensive Scala knowledge; however, some familiarity with Scala code examples shown in the Spark docs would help lots. Mostly, we’re testing to see if you understand the Spark execution model, RDDs, how to leverage functional programming to get the most out of your cluster, i.e., avoid common bottlenecks, refute some of the, ahem, FUD that’s been circulating about MapReduce vs. Spark. You are probably good to go if you:
Alternatively, we’re looking for volunteers. The certificate exam will preview on Oct 16 at Strata NY and we need volunteers to evaluate the exam. You’ll get deep discounts on the Spark developer certificate. Plus, it’s an excellent way to score ginormous brownie points with both Databricks and O’Reilly Media, along with conf coupons, outstanding nerd cred, etc. Become an essential part of the Spark developer community building the next-generation of Big Data apps. Let me know. I’ve heard that T. O’Reilly and I. Stoica have authorized us to buy NY gourmet pizza + top-shelf beers for all volunteers (at least let’s start the rumor).

Meanwhile, stay up to date with the latest advances and training in Spark, and help prep for the certification exam. Workshop materials are authored by Databricks, and we’ve trained and certified these instructors. Upcoming training for Spark will be held in SF, DC, London, Paris, Barcelona, Stockholm, and Dublin:
I look forward to the EU trip, but I regret not arriving in time for Scala.IO – amazing talks lined up this year. Also looking forward to Big Data TechCon, and in particular I recommend The Hitchhiker’s Guide to Machine Learning with Python and @ApacheSpark by Krishna Sankar.

BTW, keep your eyes peeled for more material (courses, talks, videos, webcasts, etc.) about architectural design patterns that leverage Spark together with other popular frameworks, such as Cassandra and Kafka. Our team has been working closely with DataStax and others to bring you solutions that go far, far Beyond Hadoop. For those who weren’t watching closely: an emerging tech stack that integrates Spark, Cassandra, Kafka, ElasticSearch, etc., recently pulled in a 1/4 billion in VC financing.

Just Enough Math

The Just Enough Math material is progressing well… Similar to OSCON, we’ll have a tutorial at Strata NY on Wed, Oct 15 at 1:30pm, expecting 100+ people this time. There’s also a public Docker image now, plus more work with O’Reilly on this project. We needed more Mesos + Docker foo to make progress on that infrastructure.

Hopefully, we’ll have an upcoming series of lectures too!

3D Printer Room @ Singularity University

The return of the fellowships

It was an honor to present at Singularity University this summer, along with a workshop at Insight Data Engineering Fellows Program. Looking forward to visiting Zipfian Academy soon too.
We have bunches and gobs o’ regional confs and meetups scheduled:
Also mark your calendars for:


Continuing on the prior theme of Ag+Data, James Hamilton (Amazon) wrote an intriguing blog post recently, Data Center Cooling Done Differently about a new kind of collocation: datacenters and desalinization. Desalinization at scale seems inevitable here in California – perhaps taking a cue from successes in Australia, etc. FWIW, I prepared a VC pitch for a related venture in 2008, but pulled back after initial feedback. Remember: always go with your gut!

I thoroughly enjoyed this gem about “Organic Ready” non-GMO seeds… Here's to gametophytic incompatibility in large doses. Also check Water’s Edge for an interesting special report on rising sea levels. Big Data comes in handy for contending with these crises related to global warming issues. Three items to check out from low Earth orbit: The Satellite, Spaceknow, and OmniEarth. Just in case we fry the biosphere before we can get a semi-permanent backup archived on Luna or Mars… one dreads the thought, but artificial photosynthesis is becoming more of a reality. I say “dread” because that idea recalls a vision of Trantor or perhaps Silent Running.

While we’re talking about remote sensing, I should also mention a follow-up study on the data point about GE 12 exabytes/day from turbine sensors on commercial flights: 2000x faster detection of rare critical failure modes. Here's to those early successes turning into a trendline for IoT.


A few pointers to notable work by friends and family: Film Theory and Chatbots by Robby Garner; Don Webb: Writing the Science Fiction Novel @ UCLA Extension; Eisoptrophobia by Akira Rabelais; AlaVoidDistribution by William Barker. 

Then I’ll leave you with something haunting and epic: NASA Space Sounds.

That's the update for now. See you in NY, DC, EU on the event horizon!


Data & Analytics Fellowship - O'Reilly Strata conf

Amplify Partners Data & Analytics Fellowship — designed for engineers, analysts, students, and anyone else passionate about data science, analytics, data-driven apps, and data infrastructure. The fellowship includes full conference registration, airfare, and hotel accommodation to attend the Strata NY conference, Oct 15-17 in NYC.

Fellows will be invited to join Amplify Partners along with a select group at a private dinner during the event, as well as for selected gatherings and Amplify Partners events ongoing throughout the year.

Applications are due Sep 30


Spark atop Mesos on Google Cloud Platform

When we run Databricks training for Apache Spark, we generally emphasize how to launch a Spark shell on a laptop and work from that basis. It’s a great way to get started using Spark. It’s also the preferred approach for developing apps, for many people who use Spark in production.

To wit, once you have an application running correctly on your laptop with a relatively small data set, then move your app to a cluster and work with data at scale. Many organizations provide Hadoop clusters on which it is quite simple to launch a Spark app.

If you don’t have a cluster available already, another good approach is to leverage cloud computing. One quick and easy way to launch a Spark cluster in the cloud is to run atop Apache Mesos on the Google Compute Engine cloud service. This is both simple for a beginner to get started, and robust at scale for production use. #no #hadoop #needed

The following five steps show how to launch, use, and monitor Spark running on a Mesos cluster on the Google Cloud Platform. Well, more like seven steps – if you include a brief wait time while the VMs launch, plus your little happy dance while celebrating at the end.

Step 1: Set up your GCP account

Set up an account on Google Cloud Platform by going to https://console.developers.google.com/project and creating a project. Let’s use a project called spark-launch-00 for this example. Once that is created, be sure to click on the Billing link and arrange your payment details.

Step 2: Launch a Mesosphere cluster

Next, check out the free-tier service Mesosphere for Google Cloud Platform by going to https://google.mesosphere.io/ and launching a cluster. That requires a login from your Google account. Then click the +New Cluster button to get started. You will be prompted to choose a configuration for your Mesosphere cluster. For this example click on the Select Development button to run a four-node cluster.

Then you need to provide both a public SSH key and a GCP project identifier to launch a cluster with this service. If you need help on the former, Mesosphere provides a tutorial for how to generate SSH keys. Copy and paste your SSH public key into the input field and click the Create button.

Google Cloud Console

Next, go to the Google Cloud Console in your browser and click on the Projects link. Find the GCP project identifier in the second column of the table that lists your projects. In this example, we’ll use spark-launch-00 for that identifier. Copy and paste that string into the input field and click the Next button.

Now it’s time to launch your cluster. Click the shiny purple Launch Cluster button. NB: do not click the shiny red History Eraser button.

The price tag for this development configuration will run you a walloping total of approximately US$0.56 per hour. Mesosphere charges absolutely nada on top of the cost of the VMs. Depending on how long you run the example below, it should cost much less than the price of a respectable chai latte. You’re welcome.


It will take a few minutes for those VMs to launch and get configured. You can use this precious time to meditate for some serious non-thinking, or catch up on YouTube videos. Or something.

Within a mere matter of minutes, you should receive a delightful email message from Mesosphere, indicating that your new cluster is ready to roll. Or, if you’re impatient, or OCD, or something, then just keep refreshing either the GCP console or the Mesosphere cluster console. Or refresh both, if you must. In any case, you should see the VMs updating.

Step 3: The Master and The Margarita

Check your Mesosphere cluster console in the browser, and scroll down to the Topology section. There should be one VM listed under the Master section, with both internal and external IP addresses shown for it. Copy the internal IP address for the Mesos master, and make a note about its external IP address.

Mesosphere Cluster Console

Next, you need to log in through SSH to the Mesos master. You could use the OpenVPN configuration through the Mesosphere console – which is great for production use, but has a bit more of a learning curve for those who are just getting started. It’s much simpler to log in through the GCP console:
  1. click on the spark-launch-00 project link
  2. click on the Compute section
  3. click on the Compute Engine subsection
  4. click on the VM instances subsection
Then find your Mesos master, based on its external IP address. In this example, the external IP address was for the master. You’ll need to change that to whatever your master’s external IP address happens to be… Anywho, click on the SSH button for the master to launch a terminal window in your browser.

Once the SSH login completes, you must change to the jclouds user:
sudo bash
su - jclouds
Next, set an environment variable to point to your Mesos master. In this example, the internal IP address was for the master:
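Something along these lines should do. This is only a sketch: the variable name MESOS_MASTER matches what the spark-shell command further down expects, while 10.0.0.1 is a made-up placeholder, so swap in the internal IP address shown for your own master in the Topology section.

```shell
# Hypothetical placeholder: replace with the internal IP address of YOUR Mesos master.
MASTER_INTERNAL_IP="10.0.0.1"

# Export it under the name that the later spark-shell command references,
# i.e. --master mesos://$MESOS_MASTER:5050
export MESOS_MASTER="$MASTER_INTERNAL_IP"

# Echo the resulting master URL as a quick sanity check.
echo "Mesos master URL: mesos://$MESOS_MASTER:5050"
```

If the echoed URL shows an empty host, the variable was not set in the jclouds user's shell and the Spark shell will not find the master.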
Now let’s download a binary distribution for Apache Spark. This example uses the latest 1.0.2 production release:
wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2-bin-hadoop2.tgz
tar xzvf spark-1.0.2-bin-hadoop2.tgz
Great. We need to configure just a few variables…
cd spark-1.0.2-bin-hadoop2/conf

cp spark-env.sh.template spark-env.sh
echo "export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so" >> spark-env.sh
echo "export SPARK_EXECUTOR_URI=http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2-bin-hadoop2.tgz" >> spark-env.sh
cp spark-defaults.conf.template spark-defaults.conf
echo "spark.mesos.coarse=true" >> spark-defaults.conf

cd ..
Bokay, ready to roll. Launch a Spark shell that points to the Mesos master:
./bin/spark-shell --master mesos://$MESOS_MASTER:5050
Once the Spark shell launches, you should see a scala> prompt. ¡Bueno!

Step 4: Run a simple Spark app

Next, let’s run a simple Spark app to verify your cluster operation. Copy and paste the following two-liner at the scala> prompt:
val data = 1 to 10000
sc.parallelize(data).sum  // assumed second line, reconstructed from the three steps listed below
That code will do three things:
  1. parallelize ten thousand numbers across your cluster as an RDD
  2. sum them together
  3. print the result on your console
The result should be a ginormous number beginning with 5 followed by zeros and another 5 in its midst. Specifically, for the OCD math geeks in the audience, the value needs to be the same as (10000*10001)/2, at least in the northern hemisphere.
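For anyone who wants to double-check that value without a cluster, here is a small shell sketch (not part of the Spark run itself) that compares the closed-form expression with a brute-force loop over the same range. The variable names are just for illustration.

```shell
# Closed form for 1 + 2 + ... + n, with n = 10000 as in the Spark example.
n=10000
closed_form=$(( n * (n + 1) / 2 ))

# Brute-force sum over the same range, for comparison.
total=0
i=1
while [ "$i" -le "$n" ]; do
  total=$(( total + i ))
  i=$(( i + 1 ))
done

echo "closed form: $closed_form"
echo "loop total:  $total"
```

Both lines should print 50005000, which is the number to look for in the Spark shell output.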

Step 5: Welcome to the Panopticon

For extra goodness, check out the Spark and Mesos consoles while your Mesos cluster is running Spark apps. First, open some TCP ports by adding a firewall rule… Go back to the GCP console window in your browser and locate the Mesos master, then click on any link in the Network column. Next, under the Firewall rules section, click on the Create new link:
  1. give it a name, e.g., spark-console
  2. copy and paste the following into the Protocols and Ports input field: tcp:4040,5050
  3. click on the shiny blue Create button
That new firewall rule will take a few moments to propagate across your cluster, but you should see its notification updating on that web page. Once you see that the rule is in place, browse to the Spark and Mesos consoles using the Mesos master’s external IP address. In this example, the external IP address was for the master.

For the Spark console, open in your browser. For the Mesos console, open in your browser. Of course, you’ll need to substitute the external IP address for your cluster.
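As a rough sketch of what those two addresses look like: the ports come from the firewall rule created above (4040 is the default for the Spark application console, 5050 for the Mesos master console), and 203.0.113.10 is a made-up, documentation-only placeholder standing in for your master's external IP address.

```shell
# Hypothetical placeholder: replace with the external IP address of YOUR Mesos master.
MASTER_EXTERNAL_IP="203.0.113.10"

# Ports match the firewall rule tcp:4040,5050 from the previous step.
SPARK_CONSOLE_URL="http://${MASTER_EXTERNAL_IP}:4040"
MESOS_CONSOLE_URL="http://${MASTER_EXTERNAL_IP}:5050"

echo "Spark console: $SPARK_CONSOLE_URL"
echo "Mesos console: $MESOS_CONSOLE_URL"
```

Note that the Spark console on port 4040 only responds while a Spark shell or app is actually running on the master.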

Then click through these consoles to see how the cluster resources are getting used. The Spark docs give more details about monitoring.

Finally, after a good gawk through the monitoring consoles, you’ll probably want to shut down the cluster. Go back to the Mesosphere cluster window and click on the shiny red Destroy cluster button… the jolly, candy-like button.


Congrats, you have just run a Spark app on a Mesos cluster, based on the Mesosphere free-tier service, which you launched on Google Cloud Platform. That’s quite an accomplishment! Do you feel all DevOps-ish suddenly?

It’s time to celebrate, ergo the Margarita.


NdGT pseudoscience

I'm thinking that Neil deGrasse Tyson, an astrophysicist, making broad claims about genomics, soil science, agronomics, etc., with obvious political outcomes, is sufficiently similar to William Shockley, a physicist who invented the transistor, making broad claims about population genetics, with obvious political outcomes. Both represent failures for science.

The notion that an observation over "tens of thousands of years" is scientifically valid for sweeping pronouncements about complex biological processes (e.g., our digestive tracts, topsoil ecosystems, beehive population dynamics, etc.) that have evolved over many millions of years -- that represents pseudoscience on the part of NdGT. Leading agronomists (who are not employed by Monsanto) such as at The Land Institute present vastly different opinions on the subject.

Moreover, many of the arguments against GMOs are based on the political process and business outcomes, not the science per se. For example, why should a transnational corporation spend many millions of dollars to prevent a state government from enacting reasonable laws -- or from even allowing voters to voice opinion? That's just over the labeling... Clearly, the GMO issues *aren't* so much about the scientific aspects as they are about the commercial aspects.

Last time that I checked, NdGT was not qualified to act as an attorney. Nor should he be giving legal advice to voters. Which is what the subtext indicates.

Huge points off for NdGT in my book. Meanwhile, that guy gets funded by somebody. What are the political linkages and business agendas for his funders?


Newsletter Updates for July 2014

Two aspects about leveraging machine learning are largely under-represented in the lit, especially when it comes to production use cases: feature engineering and the comparative evaluation of multiple modeling approaches. To that point, check out “Streamlining feature engineering: Researchers and startups are building tools that enable feature discovery” by Ben Lorica. The article mentions Spark Beyond, which “finds deep patterns in your data.” I was lucky to get a demo of Spark Beyond earlier this year and talk with the principals – and highly recommend taking a good look at their wares. Between the ongoing advances in deep learning and symbolic regression, a direction seems to be emerging … that perhaps one of the more difficult parts of machine learning workflows, namely the feature engineering aspects, could become more automated.

For another great article, check out Including Men in the Conversation About Women by Scarlett Sieber. Among my biggest peeves about Silicon Valley are the “brogrammer” lopsided demographics, and the gender bias which is quite real and nearly epidemic. Our data science teams have generally been quite mixed; why can’t engineering teams in general leave the 19th century behind, let alone stop being so hostile? Not naming names, but two of the SV firms in which I’ve worked in the past five years are both well known and well poised for harassment lawsuits. Taking a stand against that nonsense as an engineering manager is a great way to catch hell, which I’ve gladly engaged before. Another related pet peeve is that one of the same firms was actively pressuring their engineering interns to quit university degree programs. As a behavior for an engineering manager, I find that highly unethical. Some of those who are engaged in these practices know quite well who I’m talking about.

Spark Summit

The big, BIG news last month was … (wait for it) … Spark Summit. All of the speaker videos have been posted – those are probably the single-best resource for learning about Apache Spark. Of course, the big surprise at the conf was the announcement of Databricks Cloud. If you missed the conf, you can watch Ali Ghodsi’s spectacular demo which kicks in at about the 14:40 time marker.

Spark Summit keynote practice, T-15 hours
One surprise learning from the conf was that one product line from SAP generates more annual revenue than all of the other Big Data vendors (HW, Cloudera, etc.) combined. Other pleasant surprises included: Flambo, a Clojure DSL for Spark; and Thunder, for large-scale neural data analysis, which shows some excellent integration of PySpark, SciPy, scikit-learn, etc.

Our training sessions at Spark Summit set some kind of new records. In particular, check out the advanced material for great lectures there. Those who attended the conf received a free ebook preview for the upcoming Learning Spark: Lightning-Fast Big Data Analytics by Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia; O’Reilly Media (2014).

Also, I got to host the Research track of session talks at Spark Summit, which was a real treat. We had a special #geo break-out session following the Geotrellis talk by Rob Emanuele. We will hopefully be expanding that focus in future confs. There were so many other great talks that it’s hard to pick favorites. Even so, I’ll be studying up about two in particular: Quadratic Programing Solver for Non-negative Matrix Factorization with Spark by Debasish Das, Santanu Das; and Distributed Reinforcement Learning for Electricity Market Bidding with Spark by Vijay Srinivas Agneeswaran, Vishnuteja Nanduri. The latter seems almost ideal for integration with recent work on genetic programming.

Stay tuned for the next Spark Summit, which will be held in NYC in early 2015.


I’ve just returned from OSCON 2014. What an excellent conference! Check out the content recently posted online: keynotes, photos, speaker slides.

Of course, this event was carefully timed to overlap with the Oregon Brewers Festival. Top two picks: Double Latte by Sierra Nevada Brewing Co.; and Lorenzini Blood Orange Double IPA by Maui Brewing Company. Many thanks to Erin Rasmussen for suggesting OBF!

Blood Orange IPA
Back at OSCON… one of my favorite Ignite talks was What Science Fiction Can Teach Us About Building Communities by Dawn Foster. Another favorite, speaking of #geo, was a preso/proposal for Open Aerial Map by Kate Chapman.

During the conf, Andy Oram did a video interview where we discussed perspectives and current projects: Ag+Data, Industrial Internet, sketch algorithms, Apache Spark, etc. Andy was the very first editor I worked with at O’Reilly Media, ten years ago. He’s a much better interviewer than I am an interviewee, so I enjoyed learning much through our work together. Also fun to work again with the amazing video team.
"With great power comes some data, plus wrinkled shirts"
The tutorial for Just Enough Math had 50+ people attending, and we got to evaluate an intermediate stage of a new tutorial software platform. For that, I needed to get a bunch of USB drives from Amazon, but the order/delivery #failed. At the last minute our 10 y.o. daughter and I made an emergency run to Fry’s Electronics (she was eager to observe ground zero for nerdliness) … but the only 4Gb flash drives that they had left in stock were Marvel Universe comix characters. Arriving back home, our 9 y.o. daughter was aghast that adults would be receiving comix figures in a lecture :)

The Data Workflows for Machine Learning talk received lots of great responses – as did earlier versions during meetups in Seattle and SF. It became one of the “top-shared” slide decks featured on the SlideShare home page. Perhaps that needs to be turned into a mini-book?

new book kiosk
As my last-o’-the-day book signing was winding down, after almost everyone had left the convention center for “nearby locations of beer taps”, a friend mentioned “Hey, look there’s another pile of books – these look different.” So a few lucky latecomers got signed copies of the galley drafts for our new book Just Enough Math, which probably still won’t be released for months – this rev is quite rough :) Oddly enough, the first person to read it looked up and said, “Where are the other O’Reilly books about math?” Indeed.

Sketchy Things

Speaking of Just Enough Math, we’ve put up a companion site for the video+book+tutorial at http://justenoughmath.com/ to provide additional resources and related links:
  • set up a Python programming environment on your laptop
  • code+data files for examples in the video+book
  • “gists” that show expected results for the examples
  • links to external resources that get referenced
  • recommended books and videos for further study
  • monthly newsletter sign-up
The tutorial at OSCON previewed a new chapter recently added about sketch algorithms, following from notes at an excellent Foo Camp session led by Avi Bryant. I will be focussing on Spark Streaming use cases for Strata EU in Barcelona this fall, particularly where approximation techniques (think: examples of monoids in action) can leverage both Spark and Cassandra. If you have examples to share of Spark Streaming production use cases in general, I’m eager to build case studies to publish in Radar. Meanwhile, for a great resource about sketch algorithms, check out the archives of the AK Data Science Summit – Streaming and Sketching from last summer.
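For a taste of what “monoids in action” means here, a minimal Python sketch of a Count-Min Sketch follows. It is an illustration under stated assumptions, not code from the book or the tutorial: the class name, the default width and depth, and the MD5-based hashing are all hypothetical choices made for readability. The point is that two sketches of the same shape merge by element-wise addition, with an empty sketch as the identity, which is what lets each streaming batch or partition be summarized independently and combined later.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter (illustrative; names and defaults are assumptions)."""

    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One deterministic hash per row, derived by salting the item with the row number.
        digest = hashlib.md5(f"{row}:{item}".encode("utf-8")).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a cell, so the minimum across rows
        # never falls below the true count.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

    def merge(self, other):
        # The monoid operation: element-wise addition of same-shaped tables.
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must have the same shape to merge")
        merged = CountMinSketch(self.width, self.depth)
        for row in range(self.depth):
            merged.table[row] = [a + b for a, b in
                                 zip(self.table[row], other.table[row])]
        return merged


# Two micro-batches summarized independently, then combined.
batch_a = CountMinSketch()
for word in ["spark", "cassandra", "spark"]:
    batch_a.add(word)

batch_b = CountMinSketch()
for word in ["spark", "mesos"]:
    batch_b.add(word)

merged = batch_a.merge(batch_b)
print(merged.estimate("spark"))  # at least 3, the true count across both batches
```

Because merge is associative and an empty sketch acts as the identity, the order in which batches get combined does not matter, and that is the property a distributed or streaming reduce relies on.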

Card-Carrying Green

A friend recently brought up the topic of navigating questions about extinction and climate change for preschoolers… I’m getting those too; however, in my experience the questions become much better formulated after an additional 5–6 years or so. As a parent, as a human, it kills me to see all the ginormous FUD spewing from the political lobbies for the coal industry, fracking, Monsanto, GM, etc. How about giving ample air time and consideration for some points from the other side?

First off, I’ve mentioned it before but it bears repeating: The Land Institute is a phenomenally excellent resource for understanding some of the insanity and pure tragedy of contemporary agricultural practices, particularly when it comes to monocultures, annuals, hybrids, let alone unnecessary tillage. To paraphrase Wes Jackson, “The plowshare has destroyed more options for future generations than the sword.” On a related note, I’ll also point to an excellent article by Michael Pollan, written as the foreword to Grass, Soil, Hope: A Journey through Carbon Country by Courtney White. Moreover, check out The Solutions Project. That latter site has more substance than perhaps its web-design polish indicates: it’s about the work by Mark Jacobson, et al., on how to power the planet via renewables now while mitigating hurricane damages, etc. One would think that the reinsurance revenues alone would justify a significant investment. In any case, these three links point to the fact that any emerging “dialog of despair” about global warming, etc., is purely FUD. Much can and will be done.

Phylo, the trading card game
I’m particularly grateful to be associated with O’Reilly Media, which provided OSCON attendees with a nice treat in their schwag bags: Phylo, a trading card game. Its gameplay emphasizes endangered species, climate change, food chains, and other environmental pressures. “Phylo is a project that began as a reaction to the following nugget of information: Kids know more about Pokemon creatures than they do about real creatures. We think there’s something wrong with that. Apparently, so do many others.”

In a related development, check out Nerds Without Borders: “We are looking for all sorts of people to help: Engineers, Scientists, Writers, Artists, Dreamers, Activists, Organizers, Fundraisers, Financiers, etc…” Starting with use of IoT sensors and cell phone networks to protect sea turtle hatchlings. Good stuff.

Looking Ahead

Another fun follow-up from Foo Camp and OSCON: getting to talk with Scott Jenson about his work on The Physical Web at Google. Check out his preso, Why Mobile Apps Must Die. The big idea is a kind of “micro-DNS” for low-cost digital tagging of physical items that can be accessed by mobile devices. No app installs required.

In other news, Trafodion was recently released as open source by HP. The name is based on the Welsh word for “transaction”. If you recall Tandem Computers and NonStop, this product line has a long history of tech innovations – for highly reliable, highly optimized real-time SQL at scale. My uncle retired from Tandem, and lately I’ve spent time with the Trafodion team and am quite impressed. This release brings an interesting new level of Enterprise robustness to real-time transactions+analysis atop Linux+Hadoop. One to watch.

Another to watch closely is The Distributed Developer Stack Field Guide by Andrew Odewahn, Courtney Nash, Mike Loukides, et al. This is a GitHub-based book from O’Reilly. If you see any points in there that need editing, embellishing, etc., then two words: pull request, for the win.

In terms of upcoming events, registration is now open for Data Day Texas 2015, and I’m really looking forward to that. Will be teaching Spark at Scala by the Bay in SF on Aug 8–9, speaking at #MesosCon in Chicago on Aug 21, followed by another Spark course in Chicago on Aug 25.


I’ll close with a look back to a 1990 Documentary about Cyberpunk. That provides a good summary of what we were up to in the early 1990s with Mondo 2000, bOING-bOING, FringeWare, WiReD, The WELL, Turkey City, etc. Tim’s monologue around 15:30-ff is hilarious – both because of his ever-optimistic “There will be mass democracy in the streets” miss, and how much it contrasts with just about every other major point coming true within 25 years. Warning: gratuitous F242 clips, throughout. Time marker 27:11 shows what I was doing as a vendor at many, many raves… Meanwhile, check out a recent bOING-bOING article Alien Autopsy: William Barker on Schwa, two decades later for some of the more astute counterpoint about what was really going on, then and now.

That's the update for now. See you in Chicago, with San Diego on the event horizon!


Newsletter Updates for June 2014

Been quite an interesting month: NYC, SJ/SF, bookended by Hadoop Summit and Spark Summit, with Foo Camp in the midst… much learned, and many excellent introductions.

If you haven’t seen it, this is a gem: Seeing Spaces by Bret Victor, as an evolution of the “Maker Spaces” concept. Another top recommend is A Short History of and Introduction to Deep Learning by John Kaufhold. Money quote: “Learn, don’t engineer feature representations.” Check this review by Mary Galvin at Data Community DC.

For another great source of inspired writings, follow the Matthew Hunt posts on LinkedIn. In this episode of delightfully unexpected connections, Matthew leads us on a path among Pink Floyd, moon cheese, gnome-like cretins, and unlikely heroes for a tale of two Burkes.

Just Enough Math

The video for Just Enough Math has been on sale for the past month. O’Reilly has a preview video on YouTube, if you’d like to check out a sample. Meanwhile…

I need your help: this Just Enough Math project would greatly benefit from your reviews. Even if you don’t purchase the full video, check the preview and the free sections. We’re eager to hear your feedback, and especially your reviews!

Here’s the thing: on the one hand, if you’re the kind of person who enjoys reading math papers as a fond pastime, this material is probably not for you. There are plenty of other videos in the world, and so many brain teasers, so little time. On the other hand, if you find that math papers tend to be almost entirely devoid of context (which, frankly, many are) and you took math through Algebra 2, and you enjoy seeing some examples, learning some history, etc., then you’ll probably benefit from this video.

There are quite a number of great resources at O’Reilly and other publishers for those who want a deep-dive in any particular area of advanced math applied for Big Data … and the point of the Just Enough Math project is to serve almost like a “hyperlink document” (e.g., old school web pages circa early 1990s) for those other books, videos, websites, etc., along with providing history and case study examples as context.

We’ll be presenting a tutorial based on Just Enough Math at OSCON. Plus, there’s a super-secret discount code for 20% off registration: PACOID

In the Bay Area, we’ve recently launched a Just Enough Math: Machine Learning for Execs and Entrepreneurs meetup. Looking forward to more events through that. Submitted as evidence, check out “How Not to Be Wrong”: What the literary world can learn from math by Laura Miller in Salon.


Some interesting insights about Apache Mesos surfaced in the recent 2014 community survey. And at this point, the list of firms adopting Mesos no longer fits in my browser window. To find out more, check out MesosCon scheduled in August in Chicago. I’m looking forward to talks from John Wilkes and several other experts, and meanwhile will present about Apache Spark running on Mesos. In related news, recently I gave a talk at the Mesos NYC Meetup sponsored by the kind folks at Shutterstock. If you’re in the area check out an intro Mesos talk on 7/17 by Joe Stein at Bloomberg.

✽ ✽ ✽

On a recent camping adventure in Sebastopol, I was grateful to learn about lots of new technologies. One of the more interesting finds was Unbounded Robotics, and I enjoyed a chat with Melonee Wise, CEO. These actually are the droids you’re looking for. Meanwhile, O’Reilly Media is looking for editors, especially in the Data practice area. Got Edit? Join the team!

morning walk in Ceres Community Garden, w/ O'Reilly Media in bkgd

In terms of other interesting technologies… I’ve been hearing memes rumbling about “Big Data is a myth” or “Where are the IoT apps?” Here, that’s where. The part of Nokia that didn’t sell off to Microsoft is handling some of the most interesting fusion of data exhaust that I’ve seen. Case in point, check out Jams, game theory, and equations: the science of traffic for a view of really big data analyzed in real-time. If you’ve attempted to drive anywhere in, say, DC or Austin or Silicon Valley anytime recently during commute times … this is a problem. Money quote: “Then we start to look at the car’s sensors. We start to know the weather before the weather authorities do, because we can see which cars have their windscreen wipers or their headlights switched on.” Orders of magnitude larger than your favorite social network or ad exchange.

Meanwhile, my favorite IoT app so far is clearly this: sharks tweet as they approach the shore of Western Australia. Would be great to see more technology applications like that!

Minecraft camp

Speaking of Foo and other camps, I’ve got two kiddos currently in iD Tech Camps, learning Minecraft and Scratch, respectively. These courses tour around the US and are highly recommended. We could learn much from their teaching approach, to benefit professional workshops for adults as well.

To follow up on the Minecraft + Quantum theme from previous posts, here’s a good video of Seth Lloyd explaining Quantum Machine Learning. Why does this seem to call back to the Real Genius movie?

Ag + Data

O’Reilly Strata recently carried a story about how Farm data could be worth billions, related to the Ag + Data post on O’Reilly Radar. Much is happening in Ag data and other consumers of remote sensing products – particularly with respect to recent changes in satellite regulations. However, my favorite recent Ag story is about the Purdue Improved Cowpea Storage (PICS) bag. Brilliant work.

Overall, much of the interesting Ag+Data tech seems to be coming from (or through) Chile… and a new phrase has emerged: Chilecon Valley.

Friends in the News

Congratulations go out to Robby Garner, competing with the JFRED Chat Server in Turing2014: 60th anniversary year of Alan Turing’s untimely death. Many years ago, Robby and I worked on a primordial version of JFRED. That played “customer service agent” for the FringeWare online bookstore. Circa 1998 we ran the bots on BBC “Tomorrow’s World” for a live televised Turing Test, which is some of the most fun I've ever had in network engineering. More recently, Hubot-based chatbots are being deployed for devops and other engineering teams, such as the Shep chatbot used by engineering at O’Reilly.

Also, check out the new Big Data Analytics Beyond Hadoop: Real-Time Applications with Storm, Spark, and More Hadoop Alternatives by Vijay Srinivas Agneeswaran. This is a deep-dive into design patterns and frameworks for large-scale analytics beyond Hadoop.

Got to meet lots of people interested in using Spark at the recent Hadoop Summit in San Jose. One of the Community Choice Awards at the conference went to “Demo: Building a Unified Data Pipeline in Apache Spark” by Aaron Davidson from Databricks. Eager to see the slides published for that. Also at Hadoop Summit, Xiangrui Meng gave an excellent talk about MLlib – the tech roadmap and integrations, with particular emphasis on how to leverage sparsity in your data.

Meanwhile, friends at Zementis have recently released PMML support for Python, with a project called Py2PMML. In particular, there’s integration for scikit-learn. I wonder how long before PySpark + MLlib joins that list?

Joaquim on the Moon

As many of you know, given enough beers I become fond of talking about dropping large complex arrays of sophisticated equipment into the polar dark craters on the Moon. In a recent convo over drinks with people who calculate the costs of such an operation for a living, we surfaced an interesting price tag for that kind of venture: approximately $15B. In terms of how much the US spends on the Department of Defense, that’s about 8 days’ worth. Think about it. Who wants a term sheet? Meanwhile, the subject got me thinking of Kubrick films, particularly the 2001: A Space Odyssey set production, an engineering feat in itself.

That's the update for now. See you in PDX and ATX, with Chicago and San Diego on the event horizon!