Not so much travel recently – Austin was my only trip this quarter so far. We’ve been heads-down reworking instructional materials to highlight what you can do with cloud-based notebooks. To learn more about that, check out the new Databricks newsletter.
Snow near Cold Springs, California
Meanwhile, my family gets to enjoy some time this weekend in a cabin near Yosemite, during an increasingly rare event here: lots of snow! Recommendation: we always try to drop by our favorite mile-high restaurant, Mia’s, for excellent Italian cooking in the mountains and even homemade limoncello.
Strata
About 325 people attended our Spark Camp tutorial. Oddly enough, that’s the same ratio of total conference attendees that we had at Spark Camp in NYC last fall. I also got to host the new Spark in Action track. One eye-opener in our track was the Tencent talk, where LianHui Wang presented about their experiences running an 8000 node Spark cluster in production. So much for FUD claims that Spark doesn’t scale ;) When asked how Tencent can build substantially larger clusters than what YHOO has reported, LianHui replied wryly, “They do not speak Chinese.”
The gist of this effort is about using graph moments, assuming priors which then help make tensor decomposition tractable. This material will flex your advanced math agility as it flies through linear algebra, graph theory, statistics, and optimization for some startling implications. While the immediate research is about latent variables for community detection (think: Facebook), these techniques have implications for a much broader range of industry optimization problems. Note that the outcomes are in contrast to work by Jure Leskovec, et al., @Stanford. Another excellent Spark-related talk at Strata that referenced work with tensors was Hadoop as a Platform for Genomics by Allen Day @MapR.
Looking Ahead
My question here is, “What is the business case for developing custom apps atop a Hadoop platform?” When I examine industry use cases for Big Data frameworks, there are a few general categories:
- ETL
- data warehouse replacements
- data exploration and reporting
- analytics in depth, leading toward streaming
The first category is relatively well-understood, leading toward general purpose solutions. On the start-up side of the spectrum there are great solutions emerging such as ETLeap, Alation, and arguably examples such as Epic in medical data exchange. On the established side of Enterprise IT, incumbents such as Informatica have been aggressively partnering and expanding the scope of their integration. That raises the question of whether firms will continue to build rather than buy.
My hunch is that in terms of the second category, Cloudera, Hortonworks, etc., will be forced to pivot toward vertical applications sooner rather than later to sustain their growth, and will likely buy up smaller analytics vendors along the way. That puts them on a collision course with incumbents Oracle, IBM, Teradata, SAS, etc., where both ends of the spectrum race toward resembling each other. In other words, the DW king is dead, long live the DW king. Expect either some contractions or M&A activity as a result. Not much news there.
The third category, effectively a BI displacement, gets a bit more interesting. I gave a keynote talk at Data Day Texas in Austin in January, A New Year in Data Science: ML Unpaused. The gist is that two aspects of the BI displacement – effectively, the dev-centric software engineering (aka “data engineering”) approach and the statistics detour of the past two centuries – are losing steam and lacked sufficient depth to begin with. Machine learning in the 1980s meant something much broader than what gets represented by the current crop of analytics vendors; check out my preso for more details. To cut to the chase, also check an excellent talk The Thorn in the Side of Big Data: too few artists by Christopher Ré @Stanford. See a related article I’ll Be Back: The Return of Artificial Intelligence by Jack Clark @BloombergBusiness.
Stanford Y2E2 at sunset
I have a hunch that cloud-based notebooks will eat the lunch of oh-so-many dev-centric approaches and second-generation BI tools. That strips away some of the intrinsic value of Hortonworks, Cloudera, etc. Meanwhile it pushes value toward those firms which are closest to domain experts, with key examples such as Enlitic, Idibon, Oculus Info, Spaceknow, etc.
The fourth category has a large market in industry in general. In my opinion, going forward its upside will be realized less so among the “data-centric” usual suspects of ad tech, fin tech, e-commerce, social networks, security… rather more so within the more traditional sectors of energy, transportation, manufacturing, agriculture, etc. Sensor data is a major driver, whether we are talking about embedded sensors or layers of remote sensing or for that matter the volumes of data in genomics work. These use cases tend toward streaming. Fine-grained resource management in clusters is core to this: not so much due to the data rates as it is due to needs for elastic computing capacity and service architectures – in other words, latency and robustness become key. Streaming applications have lots of moving parts and represent a hard problem in computer science in general. On the one hand, the organizational costs of using a YARN cluster to address those kinds of needs prove to be rather upside down, while on the other hand we see a rise in Mesos deployments, e.g., Virdata, Atigeo, Stratio, etc.
My hunch is that the emerging stack for sophisticated analytics and optimization needs will look significantly less like Cloudera or Hortonworks, and more like an integration of...
Typesafe is another vendor that is clearly addressing this demand. However, that speaks to the infrastructure not the science, and this is where the focus on tensors comes back into the picture…
Within the 2–3 year horizon, I expect to see reasonably good open source projects for cost-effective and scalable methods for low-rank tensor factorization. It’s likely this will involve some probabilistic techniques and lead toward online algorithms, i.e., for streaming. So far there haven’t been good off-the-shelf solutions for tensor factorization. However, a general case approach that could scale out on commodity hardware would be a significant game-changer, with the potential to sublate a wide range of contemporary work in algorithms.
Within a similar timeline, I expect to see relatively dramatic improvements in networking technology, i.e., within the datacenter. Taken together those two events would signal the availability of relatively more general purpose solutions in contrast to the many one-offs in analytics that are currently bread-and-butter for Hadoop app developers. It could also erode the valuation for the many machine learning library vendors. Consequently, I’m watching this area closely as the sea change evolves.
My prediction about Hadoop was on target, so let’s see how this new prediction unfolds.
Spark
We’ve had the Apache Spark developer certificate available online for several weeks now. Congrats to the recipient of certificate number 1.1.0 - 0001, François Garillot @Typesafe. While I cannot release exact numbers, the success rate for people taking the exam is in the mid-90s percent. It pays to have hands-on experience developing Spark apps, and this talk provides some great test prep examples. We’ll work toward certifications that are more specialized toward systems engineering and data science.
Recently, Reynold Xin presented the new DataFrames support in Spark, bringing parity with similar abstractions in Python and R. This capability will be introduced but disabled by default in Spark 1.3, but will become center-stage in later releases. In terms of workflows, it represents a higher-level abstraction than RDDs; however, there are still RDDs underneath and many applications will continue to focus at that layer. Meanwhile, Matei’s thesis has been translated into Chinese. Hopefully that represents the beginning of a trend.
Workplace
So much effort these days seems to be spent on achieving #Inbox40 … I have a hunch that use of email for business must be rethought. Soon. And perhaps abandoned? I am not convinced that productivity tools such as Yammer, Asana, Slack, etc., provide any long-term solutions, since they still tend to focus people too much on screens and keyboards.
FWIW, among my daughters’ peer group, they are way more Internet-savvy than #millenials and have already dumped email as #deadmedia … They use Instagram, Minecraft, and Skype as collaboration tools – each of which is at least partly owned by MSFT, for those who are keeping track. However, they concede that they’d likely use Twitter for business if they needed it. Consequently, I greatly appreciate when people use my public timeline on Twitter to communicate. At this point, I delete most private messages aside from Gmail: Twitter DMs, LinkedIn mail, etc., and Gmail messages are N-deep before they will get read.
Just Enough Math
Another interesting bit of tech news is in Quantum Information Processing: Are We There Yet? by Daniel Lidar @USC: niobium processors, Chimera graphs, and much more fun. To wit, this video discusses how to solve Ising Hamiltonians with quantum annealing, i.e., for complex graph problems. Gosh, wonder if that could be handy for tensor factorization? Check around the 36:48 mark, where Prof. Lidar discusses how ground state success probability distributions for D-Wave are inconsistent with thermal annealer (classical / unimodal) results, but consistent with simulated quantum annealer (bimodal) results. As far as I can follow the discussion, this rules out classical models, but is not definitive proof yet. Also, how well will it scale?
Upcoming Events
Many interesting conferences and other events are planned for the months ahead… Please check the http://goo.gl/2YqJZK listings. In particular, mark your calendars for:
- QCon São Paulo, Mar 23–28, São Paulo
- CU Boulder + BBBT, Apr 23–24, Boulder
- Big Data TechCon, Apr 26–28, Boston
- Next.ML, Apr 27, Boston
- Strata EU, May 5–7, London
- GOTO Chicago, May 11–14, Chicago
Meanwhile we’re busy preparing for Spark Summit East next month in NYC on Mar 18–19. Please join us, and to help with that here’s a 20% discount code SSPACO20 for registration.
Misc.
Whenever I go to write a newsletter, I’m concerned that there won’t be enough content collected yet. Invariably, there are too many links to share. Here are some that caught my attention recently…
The Africa soil map shows the changing nature of soil across the continent, as “an essential reference to a non-renewable resource that is fundamental for life on this planet.” A vital lesson to all, for there are no jobs on a dead planet. Establishing a bar here, I wish we had comparable analysis for North America.
Perhaps one of the more jaw-dropping research results recently: photonic radiative cooling by Shanhui Fan, et al., @Stanford. More than simply an enormous increase in the capability for buildings to reflect sunlight efficiently, this provides a way to beam internal heat out into space without warming the atmosphere: “What we’ve done is to create a way that should allow us to use the coldness of the universe as a heat sink during the day.”
That's the update for now. See you in NYC, Boulder, São Paulo, Boston, London, A Coruña, and Chicago on the event horizon!