2013-08-26

Newsletter Updates for August 2013

Lots of talks, lots of conferences, lots of writing. Here are the latest updates about scheduled events, along with pointers to some of the best content that I've been studying lately.

YARN has a Docker-sized hole. Brace yourselves: in this post, some friends at [ list the Hadoop vendors, other than MapR ] may get upset. Good friends all, I wish them the best. However, some things must be said. More about that in a moment. (And if I don't surface before Nowruz, please notify my family.)

It's been a busy month: Portland, Austin, Chicago. While we often go out for beers following a talk, Aaron Betik @Nike had the brilliant insight to hold our PDX meetup inside a brewery. Three birds, one stone. Many thanks to Aaron, Nike, Thomas Lockney @JanRain, and Todd Johnson @Urban Airship for putting together that event. And to everyone at O'Reilly Media for such a great OSCON!

Girish Kalathagiri and I wrote a paper about ensemble modeling for the first-ever PMML Workshop at KDD 2013 in Chicago. Fun to be on the ground floor, and I look forward to this expanding into a full conference. Many thanks to Bob Grossman, Walt Wells, et al., for their efforts arranging the workshop.

Mike Zeller from Zementis described a big spike in industry usage over the past year: PMML has become a pervasive standard for describing analytics workflows. I was particularly impressed with solutions from Augustus for Python, and with the recent updates to Knime – both presented at our workshop.

In the broader scope of Enterprise data workflows, I've been learning much about the Python stack from Continuum Analytics – Anaconda, Wakari, etc. – plus excellent work on IPython Notebook, Pandas, Scikit-Learn, and related projects. Augustus (from Open Data Group) fits well within that context to provide PMML for model creation and model scoring – especially given Continuum's support for compiling and optimizing apps, and their attention to low-latency use cases.

Similarly, Knime has a brilliant commercial integration in the context of ParAccel by Actian. I'm particularly impressed with Actian's use of Knime and Eclipse to provide a UI for Enterprise data workflows: great UX plus a path to operationalizing. I got to spend some time recently with the teams at Continuum and Actian. Highly recommended.
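To make that concrete, here's a minimal sketch of what PMML buys you: a trivial regression model captured as a vendor-neutral XML document, then scored using nothing but the Python standard library. The model document, field names, and coefficients below are invented for illustration – in practice you'd export and score models through tools such as Augustus, Zementis, or Knime.

    import xml.etree.ElementTree as ET

    # Hand-written PMML document, invented for illustration: y = 1.5 + 2.0*x
    # Production models would be exported by tools such as Augustus or Knime.
    PMML_DOC = """
    <PMML xmlns="http://www.dmg.org/PMML-4_1" version="4.1">
      <DataDictionary numberOfFields="2">
        <DataField name="x" optype="continuous" dataType="double"/>
        <DataField name="y" optype="continuous" dataType="double"/>
      </DataDictionary>
      <RegressionModel functionName="regression">
        <MiningSchema>
          <MiningField name="x"/>
          <MiningField name="y" usageType="predicted"/>
        </MiningSchema>
        <RegressionTable intercept="1.5">
          <NumericPredictor name="x" coefficient="2.0"/>
        </RegressionTable>
      </RegressionModel>
    </PMML>
    """

    NS = {"pmml": "http://www.dmg.org/PMML-4_1"}

    def score(doc, inputs):
        # evaluate a single-table RegressionModel: intercept + sum(coef * value)
        table = ET.fromstring(doc).find(".//pmml:RegressionTable", NS)
        y = float(table.get("intercept"))
        for term in table.findall("pmml:NumericPredictor", NS):
            y += float(term.get("coefficient")) * inputs[term.get("name")]
        return y

    print(score(PMML_DOC, {"x": 3.0}))    # prints 7.5

The point is that the model travels as a document: model creation and model scoring can live in entirely different stacks, which is exactly the interop story above.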

So, the takeaway is, if you're working with an analytics stack that does not incorporate PMML support as a core feature, run. It's time to switch.

–––

In Austin, I got to meet with directors of the new Business Analytics program at the UT/Austin McCombs School of Business. Brilliant – even a glance at their syllabus and schedule shows the kind of aggressively fast-paced program that I crave, and that the industry craves for data scientists, in quantity. I'm looking forward to visiting their program again soon.

In Chicago, I got to meet with several of the current fellows in the Data Science for Social Good program. Those were the most substantive discussions about Open Data and its applications in practice that I've had since Los Angeles. There's a new conference emerging in the midst of those discussions. Looking forward to the final projects from this summer's DSSG fellowship!


[image: cluster computing, à la friends @Hazelcast]

Meanwhile, we have "Intro to Data Science" workshops and talks coming up in Denver, SF, NYC. We'll book-end each trip with office hours and on-site demos for Mesos.

Speaking of the workshops, here are a few great new titles related to that material, which I recommend:

Analyzing the Analyzers, by Harlan Harris, Sean Murphy, Marck Vaisman; O'Reilly (2013): including an excellent analysis of the skills, experience, and viewpoints of data practitioners in industry.

Mondrian in Action, by William D. Back, Nicholas Goodman, Julian Hyde; Manning (2013): beyond the Pentaho analytics, this team has some of the most comprehensive insights available about Enterprise SQL usage.

Storm Real-Time Processing Cookbook, by Quinton Anderson; Packt (2013): including clear, concise sample apps that integrate the kinds of frameworks we use in production (Storm, Kafka, Cascalog, etc.) and not mere code snippets shown in isolation.

–––

One of the main themes in my workshops and lectures is that Apache Hadoop is almost never used in isolation. However, that perspective has been taken to task by the architects of Hadoop:

"Hadoop is the kernel of a distributed operating system, and all the other components around the kernel are now arriving on the stage."

"DistributedShell is the new WordCount."

Check out Arun's slide deck there. Seriously, Hadoop as a service endpoint? Wow, that's like Enterprise JavaBeans redux. A kinder, gentler version of EJB. Or something. The notion of having to write 40 lines of Java to execute a Bash command line – now that's impressive! Just wow. I'd pay good money to be in the room when Cloudera and Hortonworks sales teams attempt to pitch this flavor of JVM nonsense to IT execs at Schwab. Their reality bubble leaves me wondering: have Doug or Arun even looked at a modern kernel – even bothered to notice Linux kernel commits – since their Java coding days back in 2006?

Because I'm quite certain that the people making Docker have. I'm quite certain that the people making OpenVZ have, too. Moreover, my friends who work in Ops for their profession are generally well aware of Docker and OpenVZ. However, YARN? Not so much…
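For contrast, here's the entire "application" in question – a hedged sketch, assuming a local Docker install with the stock ubuntu image already pulled:

    # Launching a Bash command inside an isolated container via Docker's CLI.
    # Assumes Docker is installed and the "ubuntu" image has been pulled.
    # Contrast with the ~40 lines of YARN client code mentioned above.
    import subprocess

    subprocess.check_call(["docker", "run", "ubuntu", "bash", "-c", "echo hello"])

One line of glue, and the kernel's own isolation primitives do the heavy lifting.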


Let's consider a historical trajectory:
  • COBOL, circa 1959… DoD accountants trying to tell Ops how to do their job
  • EJB, circa 1999… IBM/Sun/Oracle attorneys trying to tell Ops how to do their job
  • OpenStack, circa 2010… NASA app developers trying to tell Ops how to do their job
Notice a trend? And now, for the latest contender:
  • YARN, circa 2013… Yahoo! data engineers trying to tell Ops how to do their job
In my experience with real companies – companies with substantial revenue, that is, not the Silicon Valley definition – anyone outside of Ops who tries to tell them how to do their job had better have a CxO in their title. Preferably with a vowel in between. In other words, keep the dilettantes the #&!% out of the Ops pit, to keep your employer from going out of business.

Meanwhile, what's the commercial reality of Big Data today? Well, let's compare and contrast a couple of big players in the space:

Actian is profitable, with a base of 10K customers and north of $150MM in annualized revenue. BTW, Amazon AWS Redshift is built on ParAccel technology – I'm a big fan, how about you? It runs great in production at scale. Love the economics of that.

Hortonworks recently took a B round, with $70MM in funding to date. The company currently sells training and support; it's not clear how long it will take to become profitable. Oh, and their CTO recently left the building.

Hmmm… May need to check with some of my b-school friends about how to compare those fundamentals. Meanwhile, here are a few notes from Google about their experiences and benchmarks working with Linux kernels and very large-scale distributed computing.

–––

My first full-time job as an engineer was at a start-up in Sunnyvale in the summers of 1983-84. Our team ported Unix to a 32-bit minicomputer, and I wrote the sort package. If you've shopped at Ace Hardware or Pep Boys, your transactions probably went through our code. I'm grateful for that experience: I learned a bit about operating systems by helping implement part of a popular embedded commercial distro.

At the time, many companies were totally absorbed in COBOL – still a good living back then. Those shops seemed oblivious to the changes underway: Unix minicomputers on the high end, workstations in the mid-range, PCs on the low end (e.g., Apple II running VisiCalc). These would soon wipe out COBOL programming like a forest fire raging through a stand of dry pines.

About that time, I attended a seminar by one of my CS profs and his colleagues, describing a start-up based on work at the Stanford University Network, aka SUN, which had commercialized a new network-enabled workstation. They considered that work transformational. Read: disruptive. They were right. That same year, Steve Jobs gave a standing-room-only lecture on campus about what he'd learned from Xerox PARC – this was the era of the Lisa, just prior to the Macintosh. He considered that work transformational. He was right. Meanwhile, a guy named Larry Ellison was busy hiring the bulk of our CS grads for his company, Oracle. You know the rest of the story. We could see big changes ahead circa 1983-84.

Over time, I've noticed how disruptive changes in computer technology tend to happen faster at the hardware and OS level, while popular programming languages struggle to keep pace. Legacy frameworks encounter even more difficulty. I recall lectures at Stanford CS teaching pretty much the opposite: that it's simpler to evolve new technology at the language and application layer. While that notion may hold in academia, it overlooks how, in industry, programming languages become tied to culture – and culture has inertia.

That leaves us where we are today with Java vis-à-vis Big Data. For better or for worse, YHOO circa 2006 made a big bet on Java as the principal language for Big Data frameworks. YHOO circa 2006 didn't last, but the frameworks that emerged from it persist. These days the Big Data vendors are thoroughly occupied selling Apache Hadoop to the Global 1000 as a glorious path into a shiny, data-imbued future – in other words, recreating the Global 1000 in the image of YHOO circa 2006. Some excellent work came out of Yahoo! in the mid-to-late 2000s, and I greatly admire my friends who were there and made it happen. Even so, Hadoop is based on work at GOOG circa 2002 – now a few generations behind. Living (literally) with GOOG in my backyard, my neighbors who work there smirk whenever the word "Hadoop" gets mentioned.

When I see talented people who have their heads stuck inside IntelliJ, who cannot think outside of a Java API, it seems sad. It reminds me of those poor souls circa 1983 poring over COBOL punch cards and teletype output. Sure, there's excellent software written in Java – java.util.concurrent comes to mind immediately.

I'm obviously quite a fan of JVM-based functional programming languages such as Clojure and Scala. However, when the "thought leaders" go around talking about Hadoop as an operating system, re-defining HA/low-latency service endpoints to be based on Hadoop and Java – it's COBOL all over again. Hold on tight.


Linux is an operating system. Unix is an operating system that had sophisticated features even earlier – though arguably, as of the Linux 3.x kernels, that playing field has leveled. Windows is also an operating system, albeit geared toward different usage. When people try to pitch "Hadoop as an operating system", they are trying to sell you snake oil.

The only thing that nonsense will buy the IT industry is even more of a guaranteed revenue stream for people building zero-day exploits in Beijing. Let's just suppose, hypothetically, that you run part of IT at Morningstar or Schwab. Imagine somebody trying to pitch you on JVM for low-latency services and cluster management. Are you going to bet your EVP's bonus on snake oil? Didn't think so.

The lesson is that enormous changes are afoot in terms of multi-core processors, large memory spaces, etc. These changes have a huge impact on algorithm work for handling data at scale – without going bankrupt. Hadoop emerged in a day when spinny disks were king, multi-core was rare, and large memory spaces were expensive; that world is gone. Meanwhile, modern kernels have kept pace with those industry changes.
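To make that concrete, here's a minimal sketch – Python standard library only, synthetic data – of the kind of grouped aggregation that once justified a disk-bound MapReduce pass, fanned out instead across the cores of a single box:

    from collections import Counter
    from multiprocessing import Pool
    from os import cpu_count
    from random import choice, randint

    def partial_sum(rows):
        # "map" side: aggregate one in-memory shard on one core
        acc = Counter()
        for key, value in rows:
            acc[key] += value
        return acc

    if __name__ == "__main__":
        # synthetic data, standing in for a working set that now fits in RAM
        keys = ["us-east", "us-west", "eu", "apac"]
        rows = [(choice(keys), randint(1, 100)) for _ in range(1000000)]

        n = cpu_count() or 4
        shards = [rows[i::n] for i in range(n)]
        with Pool(n) as pool:    # one worker process per core
            partials = pool.map(partial_sum, shards)

        # "reduce" side: merge the per-core partial sums
        print(sum(partials, Counter()))

No job tracker, no HDFS round-trips – the cores and the memory bus do the work.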

So, what is my point? Trying to resolve OS issues in the application layer is almost always a recipe for disaster. Caveat emptor.

I give Hadoop three years before it gets displaced. The lesson of Spark, in my analysis, is that rewriting Hadoop to be 100x better isn't hugely difficult, given the available building blocks for data center computing, based on the modern kernel. Meanwhile, the prognosis for Hadoop? Three years. On the outside.
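As a rough illustration of how thin that layer has become, here's the canonical word count in Spark's Python API – a minimal sketch, assuming a local Spark installation; the input path is hypothetical:

    # The canonical Hadoop word count, in a few lines of PySpark.
    # Assumes a local Spark install; "input.txt" is a hypothetical path.
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "wordcount")
    counts = (sc.textFile("input.txt")
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
    print(counts.takeOrdered(10, key=lambda kv: -kv[1]))
    sc.stop()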

Many thanks,

Paco