Newsletter Updates for December 2014
Chicago, Boulder, NYC, DC, SF, Stanford, London, Stockholm, Madrid, Barcelona, Amsterdam, Dulles, Baltimore, LA. The range of speaking events and business travel over the past quarter almost bewilders, but I’m grateful to get to meet many interesting people and learn about new projects.
Also feeling grateful to enjoy some quiet time at home with family over the holidays, and I wish very happy holidays to you and yours.
Conference Summaries
Strata NY set a new record with about 450 people attending Spark Camp. There was a spare room, plus an hour break in the fray, so we held an impromptu “Ask Us Anything” about Spark – that has turned into a new kind of open source ritual at Strata confs, especially for handling the more advanced audience questions. Also, Bloomberg kindly hosted a large Spark Committer Night meetup event, their largest to date.
Manhattan, from NY Water Taxi at Port Imperial
Throughout many conferences and meetup events over the past few months, one demo in particular stood out. David Jonker and Rob Harper from Oculus Info in Toronto gave a talk about Aperture Tiles at Strata NY. Last talk of the show, and quite arguably the best. This open source framework, partly built atop Spark, provides interactive data exploration with continuous zooming on large scale datasets. Highly recommended.
The week after Strata conf in NYC, some of our team found our way slightly south to the University of Maryland, where we got to teach alongside the renowned Jimmy Lin. The week included a Spark Tutorial on campus, plus the initial meeting of the Apache Spark Maryland meetup. Much fun, and we look forward to returning to UMD again soon.
Arriving back to the Bay Area just in time, I caught the launch of the new GalvanizeU program in downtown SF. One challenge that particular evening was getting scheduled to speak head-to-head with the final game in the World Series. That keynote, Data Science in Future Tense, examined some of the near-past and near-future of the field – hopefully indicating some non-intuitive directions.
GalvanizeU is located next to the Transbay Center, just a few blocks away from the new Databricks office. They provide a hands-on graduate program in Data Science, in an urban setting and working closely with industry partners. Galvanize started in Boulder and is also expanding soon into Seattle. We’re thrilled about our new neighbors.
Home just long enough to take the kiddos trick-or-treating and attend GCPLive, then on to Europe… During a brief visit in the UK, I got to present about the latest in Spark Streaming at the London Spark Meetup: Tiny Batches in the wine (a callback to Don Ho, for those who were born more recently – ideal for getting your luau on). Then on to Stockholm with gracious hosting by Spotify, Ericsson, and SICS.
Madrid came next, for the annual Big Data Spain conf. Noticing a joke painted on the side of a jet at the airport, I had a hunch immediately that Madrid would be lots of fun. I was not disappointed. Our hosts at Paradigma Tecnológico and Stratio presented an amazing conference, one of my favorites in a long, long time. I was fortunate to give a keynote talk, alongside many other excellent talks, such as from friends at Cloudera and Google BigQuery. I highly recommend Big Data Spain. More about Stratio in a bit…
The beach at El Poblenou, Barcelona
Taking a train from Madrid to Barcelona, admittedly I was missing the former, but Barcelona is a wonderful place. Imagine yourself in Santa Barbara, except that the city is 50 times larger, thousands of years older, and packed full o’ amazing culture. Strata EU was located at a conference center right next to the beach. We held the first official Spark developer certificate exam, plus a large Spark Camp event (25% of the conference attended), a meetup at UPC, and a second iteration of our “Ask Us Anything” about Spark.
Locavore feasting in Catalunya
Business travel Spark-style does not allow much downtime. Effectively one day off during two full weeks in EU. Fortunately that just happened to be during a weekend in Barcelona, the day after Strata concluded. I rented an Airbnb condo near the beach in El Poblenou, then wandered busy Rambla markets, through the crowd surrounding a busker string trio, gathering items to make a small feast. Only in Catalunya.
Amstel River in Amsterdam
A quick stop in Amsterdam, with a very fun talk hosted at eBay with hours of Q&A, then back home. Long enough for a family Thanksgiving feast, then off to DC, Baltimore, and LA. Excellent events and good friends met along the way, particularly the Los Angeles Apache Spark meetup hosted by Rubicon Project. Much appreciated.
Spark
The curiously named Likelihood T. Prior noted on Twitter: Spark spark spark spark, spark spark spark spark. #Strataconf synopsis complete. Some went as far as to begin calling “Strata + Hadoop World” by a new name, “Strata + Spark World”. I like the sound of that.
To help keep track of this rocket ride, I’ve begun curating an ongoing list http://goo.gl/2YqJZK of the talks, workshops, etc., related to Spark worldwide. Please let me know if you have events to add.
Speaking of events, recently we began to increase the cadence for Bay Area Spark meetup events. These talks get live-streamed, with the archives published on the Apache Spark channel on YouTube. Databricks also recently announced Spark Packages, a community index of packages. The site had to be moved shortly after its launch, due to overwhelming popularity. Good stuff on both the video channel and package repo.
So much news about Spark has happened in the past few months. I’d like to summarize with a few gems collected along the way…
- Stratio has distinguished itself as arguably one of the most sophisticated use cases for Spark Streaming, with a CEP engine layered atop plus substantial other integrations
- Virdata published the highly recommended Tuning Spark Streaming for Throughput, along with a shared blog post about their Spark use cases
- Large use cases for Spark Streaming became public, notably Netflix and Pearson
- DataStax published an excellent tutorial, Interactive Advanced Analytics with DSE and Spark MLlib based on use of the Spark Cassandra Connector
- Elasticsearch has provided Spark SQL integration since 2.1 – check their developer guide for a great tutorial
- Michael Noll published a detailed tutorial about Integrating Kafka and Spark Streaming: Code Examples and State of the Game
Not least of these items, the Databricks team broke YHOO’s previous world record for the Daytona GraySort contest. That tied for the 100 TB sort on AWS, using 1/10 the number of servers and running 3x faster than YHOO Hadoop clusters. #justsayin
MOOCs
Part of my job involves the curriculum for Spark instruction. Our big news recently is that edX and the University of California will be offering two new MOOCs about Spark, sponsored by Databricks.
The first is Introduction to Big Data with Apache Spark by Prof. Anthony Joseph at UC Berkeley. This comprehensive introduction to Spark, as well as Big Data, is based entirely on Python programming and aimed at developing Data Science skills. This course begins on 2015-02-23.
The second is Scalable Machine Learning by Prof. Ameet Talwalkar at UCLA. This hands-on course focuses on distributed machine learning at scale, based on examples using open data, also in Python. This course begins on 2015-04-14.
Note that some taking Spark MOOCs will have the option to use Databricks Cloud free student accounts. Similarly, we will be integrating use of DBC free accounts into our other Spark training events.
Workplace
Several years ago, I was fortunate to work for a CEO who understood how to leverage a distributed workplace. I studied the management practices involved, and in particular have grown to appreciate ROWE greatly. These practices seem all too rare among early-stage tech start-ups in Silicon Valley. However, a few tech firms (DataStax and Typesafe come to mind) have embraced distributed workplace models. Frankly, correlations between effective approaches to gender equality and practices such as ROWE should be on every VC’s radar.
With respect to workplace practices – effective or otherwise – two recent articles caught my attention:
- Stop Wasting Everyone’s Time: Meetings and Emails Kill Hours, but You Can Identify the Worst Offenders on WSJ
- Killing the Crunch Mode Antipattern by Chad Fowler
Great words of wisdom about two of the worst anti-patterns for successful tech organizations. The most telling part is the “canary in a coal mine” effect: to watch and see who becomes the most offended by these points. Egregious (sometimes outright hostile) use of email, chat, meetings, etc., and the fallacy of “crunch mode” stand as two of my top determinants for evaluating a company. Right alongside “we provide free snacks and meals” vs. “we offer reasonable health care plans” – which somehow turn out to be at odds in far too many start-ups.
BTW, really looking forward to catching Chad speak at GOTO Chicago next May.
Just Enough Math
The Just Enough Math material continues to evolve… Allen and I gave a tutorial at Strata NY, working closely with O’Reilly Media to export content to IPython Notebook within a Docker container for participants to run in the cloud. Rackspace provided the hosting, which in turn was an alpha test for their Nature magazine IPython interactive demo. Welcome to the future of publishing.
Andrew Odewahn and I entered a version of this for the Boston instance of Docker Global Hack Day #2 – frankly, Andrew did like 99.9999% of the work on that one :) Meanwhile, speaking of the future of publishing, JEM provides an example in the new Publishing Workflows for Jupyter by Andrew Odewahn, Kyle Kelley, and Rune Madsen.
Beyond publishing, we do have some math to suggest… Two papers caught my attention recently:
- Getting the Most Out of Ensemble Selection by Rich Caruana, Art Munson, Alexandru Niculescu-Mizil @Cornell – turns out that RMS is a pretty good proxy metric…
- The Thermodynamics of High Frequency Markets by Kevin Thomas Webster @Princeton – lots of unexpected in that gem
Oh, and riffing off the “Quantum Algorithms on the Moon” meme from JEM, note that NASA, Google and USRA establish Quantum Computing Research Collaboration such that 20% of computing time will be provided to the university community. In case you have some large data set that’s just screaming to get crunched on a D-Wave. Like you do.
Mesos
BenH wrote in O’Reilly Radar recently, Why the data center needs an operating system: It’s time for applications — not servers — to rule the data center.
Other big news was the Google Cloud Platform Live conference in SF on Nov 1. The message from #GCPLive was largely about containers… in short, the notion of The datacenter IS the computer going mainstream. To paraphrase one comment during the conf: “Customers get locked into host-based patterns, so they struggle with intertwined systems.” Well said. Definitely looking forward to the new GKE service based on Kubernetes.
More big news awaited in London: namely, the team behind Weave. Recall that the JEM tutorial had been an alpha test for the IPython + Docker + Rackspace + Nature magazine thing? We learned a truism the hard way, with minutes to go before the event started: Docker does little to resolve crucial issues outside of the containers. Enter Weave, handling difficult matters outside the container, such as networking and crypto. Check their blog for tasty insights, e.g., Automated provisioning of multi-cloud weave network with Terraform. Highly recommended.
Speaking of Docker, I really enjoyed this talk by Adrian Cockcroft @DockerCon: State of the Art in Microservices. Especially slides #8–19, product development process.
Speaking of Microservices, here’s a good overview: The Strengths and Weaknesses of Microservices by Abel Avram on InfoQ.
Ag+Data
Continuing on the Ag+Data front, check out the excellent article GeoTrellis Adapts to Climate Change and Spark about how Climate Change analytics drove Spark adoption at Azavea. They integrated Spark and Accumulo to support fast computation of climate impact metrics for DoE, which should be included in the 0.10 release of GeoTrellis.
NYT ran an interactive analysis/visualization, Flooding Risk From Climate Change, Country by Country, which perhaps helps explain Silicon Valley rumors about Google building ferry ports at corporate campuses along SF Bay.
I’m a big fan of Danielle Nierenberg @FoodTank in Chicago. A recent article, How Vegetables Can Save the World, is brief, accessible, and quite to the point. More of that on FoodTank.
Meanwhile, considering the many challenges ahead in Ag worldwide, I’m curious whether some programmable matter could become useful on farms to leverage data? Sort of an asymptote for IoT.
Upcoming Events
Many interesting conferences and other events are planned for the months ahead. Please do check the http://goo.gl/2YqJZK listings. In particular, mark your calendars for:
- Austin, Jan 10: Data Day Texas
- San Jose, Feb 18–20: Strata CA
- NYC, Mar 18–19: Spark Summit East
- SF, Jun 15–17: Spark Summit 2015
O'Reilly studio in Sebastopol, for new "Intro Spark" video
Misc.
I’ll leave you with something fun and something epic.
First, the fun – though it’s quite epic in a way: LumiGeek. In their words: “We make Arduino shields for LEDs, audio-reactive drivers, and custom solutions for architectural and artistic endeavors.” Check their installation at the new Galvanize Cafe in SF, and look about carefully for a subtle case of anamorphosis.
Second, the epic – if you haven’t seen it yet, it’s well worth four gorgeous minutes of video: Wanderers by Erik Wernquist, narrated by Carl Sagan. Money quote @1:45: “Herman Melville in Moby Dick spoke for wanderers in all epochs and meridians…”
That's the update for now. See you in Austin, San Jose, and NYC on the event horizon!