SV Synopsis: Fundamentalism in Technology

I am grateful for perspectives gained because our family lives in Silicon Valley. Many options here to work at novel ventures, and on fascinating projects… Opportunities to drop by Stanford or Berkeley for some remarkable guest lecture by a visiting expert… The wonders of an almost perpetual Maker Faire as one walks through the neighborhood on any given evening… Tech camps that our daughters can attend locally, as they wish… And, generally speaking, the lack of any real need to engage in ridiculous commutes.

As an open source evangelist and as an investor, I've felt grateful to learn from a veritable parade of interesting projects. However, I am troubled by the incidence of a particular problem. Far too often one runs headlong into what I could characterize as a close approximation of cocaine-fueled misogynistic narcissism. The condition is subtle, but systemic here. Even recently, I have witnessed this up close – along with the regrettably pervasive and predictable non-reactions to it. Increasingly, zero tolerance appears to be the only effective response. Or perhaps the tech industry percolates out elsewhere, far from SF and its inertia?

Without mentioning names, two well-known billionaire-club investors in Silicon Valley personify this character sketch. Evidence of panspermia ad absurdum festers in the "cultures" that they promote. Personal jihads seemingly to self-perpetuate their fundamentalist ideals.

A nagging question lingers… Why work alongside an ilk of people with whom I would never encourage my daughters to mingle? Granted, I believe quite strongly in the need to talk with just about everyone, to keep dialogue open, to reject the notion of "enemy". Even so, there are absolutes. Practical realities of livelihood aside, as a parent what kind of examples do my professional actions and affiliations set?

In addition, a question that investors ask over and over when considering whether to fund a new company is "Will the team scale?" Any measure of the toxins described above almost guarantees that the answer will in practice be "No."

That represents a dirty little secret. There is an amazing level of demand for tech talent. It's not exactly because these companies are raging commercial successes; most early-stage ventures by definition are not. It's because few people who are capable of making good judgements are willing to compromise their futures to work for ineffective caricatures. Many start-ups encounter difficulties in scaling their team. Or – more likely over time – they encounter high attrition rates.

While I have in the past focused for several years on the same project, lately I don't stay long in most early-stage firms, generally moving on after an organization demonstrates its nature. To paraphrase Lady Grantham from Downton Abbey, there is a point at which malice ceases to be amusing. On the one hand, that's a terrible way to leverage stock option packages. On the other hand, arguably I have pursued a portfolio career strategy. That approach has helped me build an amazing network. Long-term benefits of my network have far surpassed the potential upside of my aggregate stock options. Therein dwells an important lesson about Silicon Valley.


Newsletter Updates for February 2015

Not so much travel recently – Austin was my only trip this quarter so far. We’ve been heads-down reworking instructional materials to highlight what you can do with cloud-based notebooks. To learn more about that, check out the new Databricks newsletter.

Snow near Cold Springs, California

Meanwhile, my family gets to enjoy some time this weekend in a cabin near Yosemite, during an increasingly rare event here: lots of snow! Recommend: we always try to drop by our favorite mile-high restaurant, Mia’s, for excellent Italian cooking in the mountains and even homemade limoncello.


Of course, one of the other big reasons for keeping close to home lately was our biggest event of the year, Strata + Hadoop World in San Jose. Here’s a link for the published speaker slides and videos, along with an excellent summary of the Hardcore Data Science day by Ben Lorica.

About 325 people attended our Spark Camp tutorial. Oddly enough, that’s the same ratio of total conference attendees that we had at Spark Camp in NYC last fall. I also got to host the new Spark in Action track. One eye-opener in our track was the Tencent talk, where LianHui Wang presented about their experiences running an 8000-node Spark cluster in production. So much for FUD claims that Spark doesn’t scale ;) When asked how Tencent can build substantially larger clusters than what YHOO has reported, LianHui replied wryly, “They do not speak Chinese.”

StackOverflow analysis of Spark by Donnie Berkholz @RedMonk

One of the other Strata talks that I really wanted to catch: Tensor Methods for Large-scale Unsupervised Learning: Applications to Topic and Community Modeling by Animashree Anandkumar @UC Irvine. For more details, check out her video.

In particular note the experimental results at the 42:46 mark, along with slides for a related talk. There is even more background in the recent papers: Guaranteed Non-Orthogonal Tensor Decomposition via Alternating Rank–1 Updates; and Tensor decompositions for learning latent variable models.

The gist of this effort is about using graph moments, assuming priors which then help make tensor decomposition tractable. This material will flex your advanced math agility as it flies through linear algebra, graph theory, statistics, and optimization for some startling implications. While the immediate research is about latent variables for community detection (think: Facebook) these techniques have implications for a much broader range of industry optimization problems. Note that the outcomes are in contrast to work by Jure Leskovec, et al., @Stanford. Another excellent Spark-related talk at Strata that referenced work with tensors was Hadoop as a Platform for Genomics by Allen Day @MapR.

Looking Ahead

Why tensors? Recall from 18 months ago, “I give Hadoop three years before it gets displaced.” At the time that prediction drew some flak. Now that we’re halfway to the predicted time, note that during the past three Strata + Hadoop World conferences there have been numerous remarks to rename it Strata + Spark World. However, the general insight drives a bit deeper…

My question here is, “What is the business case for developing custom apps atop a Hadoop platform?” When I examine industry use cases for Big Data frameworks, there are a few general categories:
  1. ETL
  2. data warehouse replacements
  3. data exploration and reporting
  4. analytics in depth, leading toward streaming
The first category is relatively well-understood, leading toward general purpose solutions. On the start-up side of the spectrum there are great solutions emerging such as ETLeap, Alation, and arguably examples such as Epic in medical data exchange. On the established side of Enterprise IT, incumbents such as Informatica have been aggressively partnering and expanding the scope of their integration. That raises the question of whether firms would continue to build rather than buy.

The second and third categories are the devil-you-know, as continuations of DW and BI respectively. SiliconAngle had a good article recently along these lines, The cheat sheet to following Big Data’s money trail by Suzanne Kattau.

My hunch is that in terms of the second category, Cloudera, Hortonworks, etc., will be forced to pivot toward vertical applications sooner rather than later to sustain their growth, and will likely buy up smaller analytics vendors along the way. That puts them on a collision course with incumbents Oracle, IBM, Teradata, SAS, etc., where both ends of the spectrum race toward resembling each other. In other words, the DW king is dead, long live the DW king. Expect either some contractions or M&A activity as a result. Not much news there.

The third category, effectively a BI displacement, gets a bit more interesting. I gave a keynote talk at Data Day Texas in Austin in January, A New Year in Data Science: ML Unpaused. The gist is that two aspects of the BI displacement – effectively, the dev-centric software engineering (aka “data engineering”) approach and the statistics detour of the past two centuries – are losing steam and lacked sufficient depth to begin with. Machine learning in the 1980s meant something much broader than what gets represented by the current crop of analytics vendors; check out my preso for more details. To cut to the chase, also check an excellent talk The Thorn in the Side of Big Data: too few artists by Christopher Ré @Stanford. See a related article I’ll Be Back: The Return of Artificial Intelligence by Jack Clark @BloombergBusiness.

Stanford Y2E2 at sunset

I have a hunch that cloud-based notebooks will eat the lunch of oh-so-many dev-centric approaches and second-generation BI tools. That strips away from the intrinsic value of Hortonworks, Cloudera, etc. Meanwhile it pushes value toward those firms which are closest to domain experts, with key examples such as Enlitic, Idibon, Oculus Info, Spaceknow, etc.

The fourth category has a large market in industry in general. In my opinion, going forward its upside will be realized less so among the “data-centric” usual suspects of ad tech, fin tech, e-commerce, social networks, security… rather more so within the more traditional sectors of energy, transportation, manufacturing, agriculture, etc. Sensor data is a major driver, whether we are talking about embedded sensors or layers of remote sensing or for that matter the volumes of data in genomics work. These use cases tend toward streaming. Fine-grained resource management in clusters is core to this: not so much due to the data rates as it is due to needs for elastic computing capacity and service architectures – in other words, latency and robustness become key. Streaming applications have lots of moving parts and represent a hard problem in computer science in general. On the one hand, the organizational costs of using a YARN cluster to address those kinds of needs prove to be rather upside down, while on the other hand we see a rise in Mesos deployments, e.g., Virdata, Atigeo, Stratio, etc.

My hunch is that the emerging stack for sophisticated analytics and optimization needs will look significantly less like Cloudera or Hortonworks, and more like an integration of...

Typesafe is another vendor that is clearly addressing this demand. However, that speaks to the infrastructure not the science, and this is where the focus on tensors comes back into the picture…

Within the 2–3 year horizon, I expect to see reasonably good open source projects for cost-effective and scalable methods for low-rank tensor factorization. It’s likely this will involve some probabilistic techniques and lead toward online algorithms, i.e., for streaming. So far there haven’t been good off-the-shelf solutions for tensor factorization. However, a general case approach that could scale-out on commodity hardware would be a significant game-changer, with the potential to sublate a wide range of contemporary work in algorithms.
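To make the idea of low-rank tensor factorization a bit more concrete, here is a minimal sketch in plain Python of alternating rank-1 updates on a tiny 3-way tensor. It is only an illustration, not the algorithm from the papers mentioned earlier and certainly not a scalable method; the function names (outer3, normalize, rank1_als) and the toy data are assumptions made up for this example.

```python
import math
import random


def outer3(weight, a, b, c):
    """Build the rank-1 tensor weight * (a x b x c) as nested lists."""
    return [[[weight * x * y * z for z in c] for y in b] for x in a]


def normalize(v):
    """Return the unit vector along v together with the norm of v."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v], norm


def rank1_als(T, iters=25, seed=0):
    """Fit T ~ weight * (a x b x c) by updating one factor at a time."""
    rng = random.Random(seed)
    I, J, K = len(T), len(T[0]), len(T[0][0])
    # Hypothetical initialization: strictly positive random vectors.
    a, _ = normalize([rng.random() + 0.1 for _ in range(I)])
    b, _ = normalize([rng.random() + 0.1 for _ in range(J)])
    c, _ = normalize([rng.random() + 0.1 for _ in range(K)])
    weight = 0.0
    for _ in range(iters):
        # Contract T against two fixed factors to update the third.
        a, _ = normalize([sum(T[i][j][k] * b[j] * c[k]
                              for j in range(J) for k in range(K))
                          for i in range(I)])
        b, _ = normalize([sum(T[i][j][k] * a[i] * c[k]
                              for i in range(I) for k in range(K))
                          for j in range(J)])
        c, weight = normalize([sum(T[i][j][k] * a[i] * b[j]
                                   for i in range(I) for j in range(J))
                               for k in range(K)])
    return weight, a, b, c


# Toy data (an assumption): an exactly rank-1 tensor with weight 5.
T = outer3(5.0, [0.6, 0.8], [0.6, 0.0, 0.8], [0.8, 0.6])
weight, a, b, c = rank1_als(T)
print(round(weight, 6), [round(x, 6) for x in a])
```

Real data is noisy and of higher rank, so a single component like this is only a starting point. Deflation, restarts, and the probabilistic or online variants are where the open questions sit.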

Within a similar timeline, I expect to see relatively dramatic improvements in networking technology, i.e., within the datacenter. Taken together those two events would signal the availability of relatively more general purpose solutions in contrast to the many one-offs in analytics that are currently bread-and-butter for Hadoop app developers. It could also erode the valuation for the many machine learning library vendors. Consequently, I’m watching this area closely as the sea change evolves. 

My prediction about Hadoop was on target, so let’s see how this new prediction unfolds.


We’ve had the Apache Spark developer certificate available online for several weeks now. Congrats to the recipient of certificate number 1.1.0 - 0001, François Garillot @Typesafe. While I cannot release exact numbers, the success rate for people taking the exam is in the mid-90s percent. It pays to have hands-on experience developing Spark apps, and this talk provides some great test prep examples. We’ll work toward certifications that are more specialized toward systems engineering and data science.

First Spark certificate goes to François Garillot!

Recently, Reynold Xin presented about the new DataFrames support in Spark, bringing parity with similar abstractions in Python and R. This capability will be introduced but disabled by default in Spark 1.3, but will become center-stage in later releases. In terms of workflows, it represents a higher-level abstraction than RDDs; however, there are still RDDs underneath and many applications will continue to focus at that layer. Meanwhile, Matei’s thesis has been translated into Chinese. Hopefully that represents the beginning of a trend.

Also check out the events worldwide listings and archived talks on the YouTube channel for Apache Spark.


So much effort these days seems to be spent on achieving #Inbox40 … I have a hunch that use of email for business must be rethought. Soon. And perhaps abandoned? I am not convinced that productivity tools such as Yammer, Asana, Slack, etc., provide any long-term solutions, since they still tend to focus people too much on screens and keyboards.

Pescadero Beach, office for an afternoon on the way from our company retreat

FWIW, among my daughters’ peer group, they are way more Internet-savvy than #millenials and have already dumped email as #deadmedia … They use Instagram, Minecraft, and Skype as collaboration tools – each of which is at least partly owned by MSFT, for those who are keeping track. However, they concede that they’d likely use Twitter for business if they needed it. Consequently, I greatly appreciate when people use my public timeline on Twitter to communicate. At this point, I delete most private messages aside from Gmail: Twitter DMs, LinkedIn mail, etc., and Gmail messages are N-deep before they will get read.

Just Enough Math

Apparently the Foobartendr drink-by-drone-delivery service in Just Enough Math wasn’t so cray-cray after all ;) Recently the Washington Post reported about a restaurant delivering drinks via drones indoors.

Another interesting bit of tech news is in Quantum Information Processing: Are We There Yet? by Daniel Lidar @USC: niobium processors, Chimera graphs, and much more fun. To wit, this video discusses how to solve Ising Hamiltonians with quantum annealing, i.e., for complex graph problems. Gosh, wonder if that could be handy for tensor factorization? Check around the 36:48 mark, where Prof. Lidar discusses how ground state success probability distributions for DWave are inconsistent with thermal annealer (classical / unimodal) results, but consistent with simulated quantum annealer (bimodal). As far as I can follow the discussion, this rules out classical models, but is not definitive proof yet. Also, how well will it scale?
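For anyone who has not met Ising Hamiltonians before, a minimal sketch may help. It assumes the common convention H(s) = -sum J_ij s_i s_j - sum h_i s_i with spins s_i in {-1, +1} (sign conventions vary), and it finds the ground state by classical brute force, which is only feasible for a handful of spins. It says nothing about how the D-Wave hardware works; the names ising_energy and ground_state, plus the triangle example, are assumptions for illustration.

```python
from itertools import product


def ising_energy(spins, couplings, fields):
    """Energy under the assumed convention H = -sum J_ij s_i s_j - sum h_i s_i."""
    pair_term = sum(J * spins[i] * spins[j] for (i, j), J in couplings.items())
    field_term = sum(h * s for h, s in zip(fields, spins))
    return -pair_term - field_term


def ground_state(couplings, fields):
    """Enumerate all 2^n spin assignments and keep the lowest-energy one."""
    n = len(fields)
    best = min(product((-1, 1), repeat=n),
               key=lambda s: ising_energy(s, couplings, fields))
    return best, ising_energy(best, couplings, fields)


# Toy problem (an assumption): a ferromagnetic triangle, small field on spin 0.
couplings = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 1.0}
fields = [0.5, 0.0, 0.0]
spins, energy = ground_state(couplings, fields)
print(spins, energy)
```

A graph problem gets encoded in the couplings and fields; the exhaustive loop is exactly the part that annealers, classical or quantum, try to avoid.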

Upcoming Events

Many interesting conferences and other events are planned for the months ahead… Please check the http://goo.gl/2YqJZK listings. In particular, mark your calendars for:
Meanwhile we’re busy preparing for Spark Summit East next month in NYC on Mar 18–19. Please join us, and to help with that here’s a 20% discount code SSPACO20 for registration.

Also, make plans for MesosCon 2015, Aug 20–21 in Seattle.


Just under the wire: for what it’s worth, I barely squeaked into the Top 30 People in Big Data and Analytics and also recently joined the academic advisory board for the GalvanizeU graduate program in data science. Grateful for both of those.

Whenever I go to write a newsletter, I’m concerned that there won’t be enough content collected yet. Invariably, there are too many links to share. Here are some that caught my attention recently…

The Africa soil map shows the changing nature of soil across the continent, as “an essential reference to a non-renewable resource that is fundamental for life on this planet.” A vital lesson to all, for there are no jobs on a dead planet. Establishing a bar here, I wish we had comparable analysis for North America.

Perhaps one of the more jaw-dropping research results recently: photonic radiative cooling by Shanhui Fan, et al., @Stanford. More than simply an enormous increase in the capability for buildings to reflect sunlight efficiently, this provides a way to beam internal heat out into space without warming the atmosphere: “What we’ve done is to create a way that should allow us to use the coldness of the universe as a heat sink during the day.”

Another interesting development is the US Digital Service: “The United States Digital Service is transforming how the federal government works for the American people. And we need you.” That emerges along with DJ Patil becoming US Chief Data Scientist.

Following that, I’ll leave you with something fun and something epic. First, a limerick detector, based on the GitHub repo Nantucket. Second, words of wisdom from Vint Cerf: Forgotten Century.

That's the update for now. See you in NYC, Boulder, São Paulo, Boston, London, A Coruña, and Chicago on the event horizon!


Newsletter Updates for December 2014

Chicago, Boulder, NYC, DC, SF, Stanford, London, Stockholm, Madrid, Barcelona, Amsterdam, Dulles, Baltimore, LA. The range of speaking events and business travel over the past quarter almost bewilders, but I’m grateful to get to meet many interesting people and learn about new projects. 

Also feeling grateful to enjoy some quiet time at home with family over the holidays, and I wish very happy holidays to you and yours.

Conference Summaries

Strata NY set a new record with about 450 people attending Spark Camp. There was a spare room, plus an hour break in the fray, so we held an impromptu “Ask Us Anything” about Spark – that has turned into a new kind of open source ritual at Strata confs, especially for handling the more advanced audience questions. Also, Bloomberg kindly hosted a large Spark Committer Night meetup event, their largest to date.

Manhattan, from NY Water Taxi at Port Imperial
Throughout many conferences and meetup events over the past few months, one demo in particular stood out. David Jonker and Rob Harper from Oculus Info in Toronto gave a talk about Aperture Tiles at Strata NY. Last talk of the show, and quite arguably the best. This open source framework, partly built atop Spark, provides interactive data exploration with continuous zooming on large scale datasets. Highly recommended.

The week after Strata conf in NYC, some of our team found our way slightly south to the University of Maryland, where we got to teach alongside the renowned Jimmy Lin. The week included a Spark Tutorial on campus, plus the initial meeting of the Apache Spark Maryland meetup. Much fun, and we look forward to returning to UMD again soon.

Arriving back to the Bay Area just in time, I caught the launch of the new GalvanizeU program in downtown SF. One challenge that particular evening was getting scheduled to speak head-to-head with the final game in the World Series. That keynote, Data Science in Future Tense, examined some of the near-past and near-future of the field – hopefully indicating some non-intuitive directions. 

GalvanizeU is located next to the Transbay Center, just a few blocks away from the new Databricks office. They provide a hands-on graduate program in Data Science, in an urban setting and working closely with industry partners. Galvanize started in Boulder and is also expanding soon into Seattle. We’re thrilled about our new neighbors.

Home just long enough to take the kiddos trick-or-treating and attend GCPLive, then on to Europe… During a brief visit in the UK, I got to present about the latest in Spark Streaming at the London Spark Meetup: Tiny Batches in the wine (a callback to Don Ho, for those who were born more recently – ideal for getting your luau on). Then on to Stockholm with gracious hosting by Spotify, Ericsson, and SICS.

Good times @ Big Data Spain, Madrid

Madrid came next, for the annual Big Data Spain conf. Noticing a joke painted on the side of a jet at the airport, I had a hunch immediately that Madrid would be lots of fun. I was not disappointed. Our hosts at Paradigma Tecnológico and Stratio presented an amazing conference, one of my favorites in a long, long time. I was fortunate to give a keynote talk, alongside many other excellent talks, such as from friends at Cloudera and Google BigQuery. I highly recommend Big Data Spain. More about Stratio in a bit…

The beach at El Poblenou, Barcelona
Taking a train from Madrid to Barcelona, admittedly I was missing the former, but Barcelona is a wonderful place. Imagine yourself in Santa Barbara, except that the city is 50 times larger, thousands of years older, and packed full o’ amazing culture. Strata EU was located at a conference center right next to the beach. We held the first official Spark developer certificate exam, plus a large Spark Camp event (25% of the conference attended), a meetup at UPC, and a second iteration of our “Ask Us Anything” about Spark.

Locavore feasting in Catalunya
Business travel Spark-style does not allow much downtime. Effectively one day off during two full weeks in EU. Fortunately that just happened to be during a weekend in Barcelona, the day after Strata concluded. I rented an Airbnb condo near the beach in El Poblenou, then wandered busy Rambla markets, through the crowd surrounding a busker string trio, gathering items to make a small feast. Only in Catalunya.

Amstel River in Amsterdam
A quick stop in Amsterdam, with a very fun talk hosted at eBay with hours of Q&A, then back home. Long enough for a family Thanksgiving feast, then off to DC, Baltimore, and LA. Excellent events and good friends met along the way, particularly the Los Angeles Apache Spark meetup hosted by Rubicon Project. Much appreciated.


The curiously named Likelihood T. Prior noted on Twitter: “Spark spark spark spark, spark spark spark spark. #Strataconf synopsis complete.” Some went as far as to begin calling “Strata + Hadoop World” by a new name, “Strata + Spark World”. I like the sound of that.

To help keep track of this rocket ride, I’ve begun curating an ongoing list http://goo.gl/2YqJZK of the talks, workshops, etc., related to Spark worldwide. Please let me know if you have events to add.
Speaking of events, recently we began to increase the cadence for Bay Area Spark meetup events. These talks get live-streamed, with the archives published on the Apache Spark channel on YouTube. Databricks also recently announced Spark Packages, a community index of packages. The site had to be moved shortly after its launch, due to overwhelming popularity. Good stuff on both the video channel and package repo.

So much news about Spark has happened in the past few months. I’d like to summarize with a few gems collected along the way…
Not least of these items, the Databricks team broke YHOO’s previous world record for the Daytona GraySort contest. That tied for the 100 TB sort on AWS, using 1/10 the number of servers and running 3x faster than YHOO Hadoop clusters. #justsayin


Part of my job involves the curriculum for Spark instruction. Our big news recently is that edX and the University of California will be offering two new MOOCs about Spark, sponsored by Databricks.

The first is Introduction to Big Data with Apache Spark by Prof. Anthony Joseph at UC Berkeley. This comprehensive introduction to Spark, as well as Big Data, is based entirely on Python programming and aimed at developing Data Science skills. This course begins on 2015–02–23.

The second is Scalable Machine Learning by Prof. Ameet Talwalkar at UCLA. This hands-on course focuses on distributed machine learning at scale, based on examples using open data, also in Python. This course begins on 2015–04–14.

Note that some taking Spark MOOCs will have the option to use Databricks Cloud free student accounts. Similarly, we will be integrating use of DBC free accounts into our other Spark training events.


Several years ago, I was fortunate to work for a CEO who understood how to leverage a distributed workplace. I studied the management practices involved, and in particular have grown to appreciate ROWE greatly. These practices seem all too rare among early-stage tech start-ups in Silicon Valley. However, a few tech firms (DataStax and Typesafe come to mind) have embraced distributed workplace models. Frankly, correlations between effective approaches to gender equality and practices such as ROWE should be on every VC’s radar.

With respect to workplace practices – effective or otherwise – two recent articles caught my attention:
Great words of wisdom about two of the worst anti-patterns for successful tech organizations. The most telling part is the “canary in a coal mine” effect: to watch and see who becomes the most offended by these points. Egregious (sometimes outright hostile) use of email, chat, meetings, etc., and the fallacy of “crunch mode” stand as two of my top determinants for evaluating a company. Right alongside “we provide free snacks and meals” vs. “we offer reasonable health care plans” – which somehow turn out to be at odds in far too many start-ups.

BTW, really looking forward to catching Chad speak at GOTO Chicago next May.

Just Enough Math

The Just Enough Math material continues to evolve… Allen and I gave a tutorial at Strata NY, working closely with O’Reilly Media to export content to IPython Notebook within a Docker container for participants to run in the cloud. Rackspace provided the hosting, which in turn was an alpha test for their Nature magazine IPython interactive demo. Welcome to the future of publishing.

Andrew Odewahn and I entered a version of this for the Boston instance of Docker Global Hack Day #2 – frankly, Andrew did like 99.9999% of the work on that one :) Meanwhile, speaking of the future of publishing, JEM provides an example in the new Publishing Workflows for Jupyter by Andrew Odewahn, Kyle Kelley, and Rune Madsen.

Beyond publishing, we do have some math to suggest… Two papers caught my attention recently:
Oh, and riffing off the “Quantum Algorithms on the Moon” meme from JEM, note that NASA, Google and USRA establish Quantum Computing Research Collaboration such that 20% of computing time will be provided to the university community. In case you have some large data set that’s just screaming to get crunched on a D-Wave. Like you do.


Other big news was the Google Cloud Platform Live conference in SF on Nov 1. The message from #GCPLive was largely about containers… in short, the notion of The datacenter IS the computer going mainstream. To paraphrase one comment during the conf: “Customers get locked into host-based patterns, so they struggle with intertwined systems.” Well said. Definitely looking forward to the new GKE service based on Kubernetes.

Other big news was awaiting in London. Namely, the team behind Weave. Recall that the JEM tutorial had been an alpha test for the IPython + Docker + Rackspace + Nature magazine thing? We learned a truism the hard way, with minutes to go before the event started: Docker does little to resolve crucial issues outside of the containers. Enter Weave, handling difficult matters outside the container, such as networking and crypto. Check their blog for tasty insights, e.g., Automated provisioning of multi-cloud weave network with Terraform. Highly recommended.

Speaking of Docker, I really enjoyed this talk by Adrian Cockcroft @DockerCon: State of the Art in Microservices. Especially slides #8–19, product development process.

Speaking of Microservices, here’s a good overview: The Strengths and Weaknesses of Microservices by Abel Avram on InfoQ.


Continuing on the Ag+Data front, check out the excellent article GeoTrellis Adapts to Climate Change and Spark about how Climate Change analytics drove Spark adoption at Azavea. They integrated Spark and Accumulo to support fast computation of climate impact metrics for DoE, which should be included in the 0.10 release of GeoTrellis.

NYT ran an interactive analysis/visualization, Flooding Risk From Climate Change, Country by Country, which perhaps helps explain Silicon Valley rumors about Google building ferry ports at corporate campuses along SF Bay.

I’m a big fan of Danielle Nierenberg @FoodTank in Chicago. A recent article, How Vegetables Can Save the World, is brief, accessible, and quite to the point. More of that on FoodTank.
Meanwhile, considering the many challenges ahead in Ag worldwide, I’m curious whether some programmable matter could become useful on farms to leverage data? Sort of an asymptote for IoT.

Upcoming Events

Many interesting conferences and other events are planned for the months ahead. Please do check the http://goo.gl/2YqJZK listings. In particular, mark your calendars for:
O'Reilly studio in Sebastopol, for new "Intro Spark" video


I’ll leave you with something fun and something epic.

First, the fun – though it’s quite epic in a way: LumiGeek. “We make Arduino shields for LEDs, audio-reactive drivers, and custom solutions for architectural and artistic endeavors.” Check their installation at the new Galvanize Cafe in SF, and look about carefully for a subtle case of anamorphosis.

Second, the epic – if you haven’t seen it yet, it’s well worth four gorgeous minutes of video: Wanderers by Erik Wernquist, narrated by Carl Sagan. Money quote @1:45: “Herman Melville in Moby Dick spoke for wanderers in all epochs and meridians…”

That's the update for now. See you in Austin, San Jose, and NYC on the event horizon!