Four Quick Links for July

For those who live in the Pacific Northwest – or may be heading in that direction soon – I'll be doing a few talks in Portland and Seattle next week:

Portland, Tue Jul 21, 13:30–17:00

The Apache Spark tutorial at OSCON presents a hands-on introduction to Spark, with deep-dives into important components, SparkR and Data Sources API. The half-day tutorial considers examples from several Huawei case studies – production Spark deployments at scale for Telco use cases. I’ll be teaching this along with Haichuan Wang, Jacky Li, and Vimal Das Kammath V from Huawei.

Also: my newly updated video, Introduction to Apache Spark, will be featured as the Video of the Week during OSCON. This features new updates for DataFrames.

Portland, Thu Jul 23, 10:40–11:20

Microservices, containers, and machine learning provides a deep-dive into a project called Exsto that we’re using to explore the structure and dynamics of open source developer communities. It incorporates natural language processing, graph algorithms, etc., and leverages DataFrames and GraphX in Spark. We’ll explore the Apache Spark developer community as a case study.

Seattle, Fri Jul 24, 18:30–21:30

Eleven Almost-Truisms About Data will be a keynote at a launch party for the new GalvanizeU program in Seattle. Almost a dozen almost-truisms about Data that almost everyone should consider carefully as they embark on a journey into Data Science. There are a number of preconceptions about working with data at scale where the realities beg to differ. This talk estimates that number to be at least eleven, though probably much larger. Let’s consider some of the less-intuitive directions in which this field is heading, along with likely consequences and corollaries – especially for those who are just now beginning to study the technologies, the processes, and the people involved.

Seattle, Sun Jul 26, 14:15–14:55

NLP and text analytics at scale with PySpark and notebooks at PyData Seattle will go into more detail about the PySpark components of the Exsto pipeline. I’m also super excited about the keynote by Lorena Barba. We’re leveraging Project Jupyter for O’Reilly Learning and I’m really looking forward to talking with lots of people who are working on Jupyter for Education.

See you in Portland and Seattle!


Newsletter Updates for May 2015

If you haven’t already been listening in to the O’Reilly Data Show Podcast hosted by Ben Lorica, then by all means do not walk, run to check it out! Episode linked above is with Anima Anandkumar @UC Irvine, recommended previously here, re: tensor analysis. Similarly, recent collaboration among David Gleich @Purdue, Austin Benson @Stanford, and Lek-Heng Lim @ UChicago, using tensor analysis to resolve hard problems in higher dimensional Markov chains, resulted in “Spacey Random Walks”. Punchline on slide #17. I detect a trend…


Three months and so much travel since my previous post: to paraphrase Ricardo Alberto Fernando Ricardo y de Acha, I’ve got some ’splaining to do :)

Highlights include:

Really enjoyed the track chair gig for Data Science. My two favorite talks, both highly recommended:
That was a busy conf, indeed! You can tell since no time was carved off for sacred pilgrimages to Wassail NYC cider bar, Mast Brothers cold brew chocolate, or Blue Hill Farm. Instead my flight left the following day for…

My first time to Brazil. I’m hugely impressed by the developer community there in SP. Most inspirational quote from the conf: “A team is like a symphony, not a factory.” by Randy Shoup.
Great sessions on Spark, Docker, and much much more. Fortunately, we did have time to sample the local cuisine… e.g., Italian food made with tropical ingredients, or my favorite meal of the week, Pirarucu steamed in banana leaves, with a local favorite recipe for pumpkin soup. Then back to the US, before one could even say “alfajores” – assisted on a course at Stanford, then back to NYC…

Pirarucu – Brasil a gosto, São Paulo

Organized by Jeremy Freeman and crew. Truly excellent: top neuroscience researchers in the world gather for a hackathon (i.e., coding together). Perhaps a few savvy Finance people lean in too, eagerly drooling over results that apply for their high-dimensional, time-series, non-linear correlations work as well… in what VCs insist on calling an “ecosystem”. Or something. See these excellent notes. My favorites:
  • Michael Dewar : streamtools real-time analytics from NY Times
  • Olga Botvinnik : flotilla - Py package for iterative machine learning analyses
  • Eiman Azim on motor circuit function … with actual electronic circuits to emulate what Columbia discovered through ablation studies about cerebellum connections
  • Brendan Lake teaching computers to scribble characters like humans, if you want some really interesting use cases for Deep Learning
We worked together the following day, between tutorials, to build an online platform for submitting algorithms to run against standard neuroscience data sets. This hackathon literally was research. If you want to understand more about Big Data being used expertly in life sciences, attend CodeNeuro!

Lower East Side, Manhattan – from New Museum roof

Many thanks to Marilyn Waldman, Claudia Imhoff, and crew for a fantastic Spark tutorial at CU Leeds Business School! Followed by excellent convo via webcast with several hundred of the top BI analysts in the world. Then on to Boston for…

Matei and I were multiplexing between these two conferences so much, UberX in oscillation overthruster mode, that we didn’t even see each other. Even so, lots of great Spark talks in Boston that week! Meanwhile, kind hotel staff redirected me toward The Barking Crab for dinner, and a good friend introduced L.A. Burdick. Also, there was lots of excellent hard cider in the area. Along with all that Big Data conf talk stuff. Then a red-eye flight took off for Europe…

Boston Seaport, by water taxi

Many thanks to all the work by Amparo Alonso-Betanzos, David Martínez-Rego, and colleagues for organizing the Spark tutorial at A Coruña. I’d never visited Galicia before, a place with lots of rain and people with red hair playing bagpipes amidst rolling green hills (no en UK – pero en España) … a place where there are software companies next to world championship surfing competitions and albariño vineyards (no en Santa Cruz, California – pero en España) … a place where they speak a language close to Portuguese – not so unfamiliar, right after São Paulo! Excellent other talks, along with a Spark tutorial by Juantomás García. What incredible people, Computer Science excellence at the university, and oh such good seafood. My keynote was broadcast on Spanish television – that’s a first! Use of runways in this corner of Spain is quite abbreviated, so we catapulted next for…

Frente a la Torre de Hercules

See a good summary of the conf online. Spark Camp this time had 12% of the conference attending, whereas before we’d been trending steady at just over 8%. Many thanks to all who participated. Great to meet so many people enthusiastic about using Apache Spark! Also got to host the Hadoop and Beyond track. Which, oddly enough, was mostly about Spark. My two favorite talks:
Plenty of hard ciders sampled while in London. Whenever visiting near West London, I make a point to drop by Princess Victoria. Also, got to see some friends at UCL and the Barclays incubator program. Then back to the US via customs in my former part-time home city, Vancouver…


Dean Wampler, et al., have been hosting great Spark events in the Windy City. Always a treat to visit. We had a good turnout for the Spark tutorial at GOTO, one of my favorite software conferences. Walking alongside the Tribune Tower after my tutorial, I noticed that its walls contain rocks from other famous buildings all over the world. With labels, like an inverted museum. Check it out when you visit. There are also rumors of a cider bar being built, ahem, soon. Then a flight back to the Bay Area…

Serendipity. Got invited by a dear friend, Donna Kidwell @Webstudent to join this conf about the future of education, a collaboration among Future Learning Lab, H-STAR, EdCast, etc. Many thanks to Oddgeir Tveiten and crew.

Most discussions focused on experiences with MOOCs – from highly successful examples, e.g., Intro to Robotics by Peter Corke @QUT, or Trust Academy @Salesforce by Masha Sedova. However, the overall themes transcended MOOCs, asking the question of what comes “After Gutenberg”, how peer evaluation is transforming education at scale, and … Peter Norvig’s point that when so much learning material is available (e.g., via Google) the problem becomes a matter of how do you get people to want to interact with it? Generally, the social context of learning becomes key.

Along similar lines, Michael Shanks stressed that – in contrast to traditional academe that tends to decontextualize learning – more contemporary advances are focusing on how to contextualize, locally. That’s part of the essence of interdisciplinary work, e.g., Data Science. Also, delighted to meet Keith Devlin (our new neighbor) with amazing work in math education using Minecraft, etc. (sound familiar?) And was very fortunate to meet teaching superhero David Conover, who uses game design to teach topics like IoT in an at-risk high school in Austin. Brilliant.

Of course, we’ll be inviting the whole lot to propose talks for Strata! Speaking of advances in learning platforms, check out the recent beta site and related article Embracing Jupyter Notebooks at O’Reilly by Andrew Odewahn. Par example, Data visualization with Seaborn … that is integrating use of IPython/Jupyter, Docker, Thebe, etc. Brilliant++.

Data Science

Another recommend: a new series of ML-related interviews by David Beyer, beginning with my friend and colleague Reza Zadeh @Stanford – on the evolution of ML, deep learning, Stanford ICME, and Apache Spark.

If you haven’t seen the news, Nature banned use of p-values … #finally  Note that Fisher did not intend for p-values to be (ab)used that way. So I consider these tests to be truly excellent, for identifying intellectual limits. Related: The Nine Circles of Scientific Hell.

On the subject of pseudoscience – appalling to see recent and ongoing unscientific gaffes by people who should know better, e.g., Neil deGrasse Tyson about GMOs. In great contrast, antidotally, I’d point toward this gem – and please read it at least twice: Die, selfish gene, die by David Dobbs. So glad to see Dawkins getting served … #finally

Neil, Richard: just because you’re published, doesn’t mean you’ve become superior thinkers. Please keep your day jobs, respectively #thankyouverymuch

Speaking of ’splaining to do… another gem is Visualization Explanations @Setosa. Extra points if you grok the callback in their name.

If some of these items are what you’ve been talking about recently, you may just be a Data Science instructor … or interested in becoming one? Check out Become a Data Science instructor @ Galvanize (Seattle, SF, Boulder)


Also recommended: an excellent introduction in Spanish to Scala and Apache Spark by Isra Gaytan. The Latin World has been busting a move lately on Spark, #justsayin

A Coruña – from Playa de Oza

Adatao published an excellent article about how they anticipated the inflection point for Spark adoption. Check out the cost curves. Brilliant.

Big news from AMPLab: Keystone.ML is released as open source, to make the process of constructing complicated machine learning pipelines easier. Good stuff from the eponymous Evan Sparks, et al.

I’ve got some graph analytics talks coming up… and was excited to see a streaming/incremental SSSP impl for Spark.
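For readers who have not met the idea, here is a minimal single-machine sketch of what "incremental" SSSP means. It is not the linked Spark implementation, and the class name `IncrementalSSSP` and its methods are hypothetical. It assumes a directed graph with non-negative weights that only ever gains edges: when a new edge arrives, distances are repaired only for the nodes that the edge actually improves, rather than recomputing from the source.

```python
import heapq
import math


class IncrementalSSSP:
    """Single-source shortest paths maintained under edge insertions.

    Illustrative sketch only: assumes non-negative weights and no deletions.
    """

    def __init__(self, source):
        self.source = source
        self.adj = {}                  # node -> list of (neighbor, weight)
        self.dist = {source: 0.0}      # best known distance from the source

    def distance(self, node):
        """Current shortest distance, or infinity if the node is unreachable."""
        return self.dist.get(node, math.inf)

    def add_edge(self, u, v, weight):
        """Insert directed edge u -> v, then repair only the affected distances."""
        self.adj.setdefault(u, []).append((v, weight))
        self.adj.setdefault(v, [])
        candidate = self.distance(u) + weight
        if candidate >= self.distance(v):
            return                     # the new edge improves nothing
        self.dist[v] = candidate
        heap = [(candidate, v)]
        while heap:                    # Dijkstra-style propagation from v only
            d, node = heapq.heappop(heap)
            if d > self.dist[node]:
                continue               # stale heap entry
            for nxt, w in self.adj[node]:
                if d + w < self.distance(nxt):
                    self.dist[nxt] = d + w
                    heapq.heappush(heap, (d + w, nxt))
```

A streaming version would feed `add_edge` from an edge stream; handling deletions or distributing the repair step across a cluster is where the real work lies.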

Also, speaking of the Apache Spark Developer Certificate, we’ve got another new neighbor: ORM + DataStax partner on C* cert. Collect all three!

Meanwhile, the big BIG news is Spark Summit 2015 coming up next month in SF. Use the discount code SparkSummitPC25 for 25% off registration. Not retroactive, but nice try :) Followed by Spark Summit EU in Amsterdam, this autumn. Spark it up!


Solid Conference is coming up again soon! Highly recommended. As an appetizer, check out this excellent article by Cameron Turner @The Data Guild: Caltrain Quantified - An Exploration in IoT, about the trains which we could hear in our previous backyard every morning starting at about zero-dark-thirty. Now that former backyard has become the new GoogleX building, and SciFi tech experiments compete with the trains for attention.

For IoT in practice, I’m totally stoked to see: Surfers on acid… What an excellent application. And, culturally not far off that mark, here’s an interesting take on marine plastic: Net+Positiva.

Ag + Data

I’ve really been enjoying Biocoder News in quarterly installments, some of my favorite new articles in the world. Period.

On that note, I’m thoroughly ecstatic to announce that I’m moving to O’Reilly Media full-time. Even so, I'll stay involved with Spark and Databricks, assisting on Spark Summit, etc. We’re moving the family to a tiny farm, an old apple orchard that really needs some tending. Perfect as a research station for Ag+Data.

The Tiny Farm – redwoods 30m tall, planted 65 yrs ago by previous owner

In highly related news, check out How to Grow a Forest Really, Really Fast, about fantastic work by Shubhendu Sharma. I’m eager to try this out.

One of the top intellects of the early 21st century, Paul Stamets, had some excellent coverage: He Holds The Patent That Could DESTROY Monsanto And Change The World!  See also: BioMason  and Ecovative Fungi, FTW – and mycorrhiza in particular, as Mohamed Hijri explains quite succinctly.

Ag-related tech approaches in SV have become largely derailed by asinine priorities dictated by Monsanto – more about taking over hedge funds on commodity trading globally, than about feeding anyone. Perhaps the best analysis that I’ve read recently – and certainly one of the best books that I’ve read recently – is the highly recommended The Third Plate by Dan Barber. I learned about that via Gastropod – where Cynthia Graber and Nicola Twilley consider food through the lens of science and history. Brilliant.

There’s been a terrible drought / water crisis in Brazil – largely exacerbated by transnational corporate interests. This was weighing on my mind as our flight landed in São Paulo. I got to speak with friends there who are working on Ag+Data analysis, very good to see.

As predicted: Finance is driving California water into the dust… take a moment to consider the jump in almond production versus the temporally co-located jump in variance for snow pack levels. That’s the tip of the iceberg for the near-term shape of major political battles brewing in California. To wit, some of our local mafia have become known under the more apt monicker of Oligarch Valley. While Fox News, et al., promotes the Israeli approach of desalinization at scale, many people who can actually think for themselves question the impact of that approach, and recognize what an utter environmental disaster it could produce. This is not an area of judgment where one gets to call #oops as an excuse, regardless of which side the local mafia may be taking.

Industry Insights

“Software eats the world” is a catchphrase used by A16z. While I slightly agree with the title from this Datanami article, How Machine Learning Is Eating the Software World, its conclusions are pretty much the opposite of what we’ve observed with Apache Spark use cases in the field. Don't get me wrong – Reynold is a good friend, and IMO one of the most talented people working in distributed systems today. However, I have a hunch that the reporter munged the line.

Two key reasons why organizations adopt cloud-based notebooks are (1) to reduce their need for DevOps people to run clusters; and (2) to reduce the need for programmers to assist business people with queries for insights from Big Data. Done and done. In other words, domain experts trump all in Data Science applications, while application developers (in relatively large supply, but relatively expensive) and expert systems engineers (in relatively short supply, extremely expensive) both become less of an existential bottleneck for new ventures. I’ll let you do the math on that one.

Some of the themes that I’ve been researching and illustrating over recent years include: Functional Programming for Big Data, Approximation Algorithms, Tensor Factorization, etc. Recognize that each of these points toward less emphasis on developers leveraging APIs, and meanwhile more emphasis on domain experts leveraging simple-to-use frameworks. That’s the bottom line of Apache Spark. Meanwhile, I have no doubt that A16z will continue to rake in loads of money – some of their partners are well-connected billionaires – just perhaps not as a consequence of their thesis. That ship is already sailing. Off, perhaps, toward the oh-not-so Great Pacific Garbage Patch.


Speaking of VCs in SV… TechCrunch analysis recently found that female founders nearly doubled in 5 years. Par example, check out the recent Women in Data: Their Work and Achievements.

Meanwhile, I thoroughly enjoyed this gem by Karin Rubin: How women are conquering SP500… My feelings about the overall ethics of algorithmic trading are arguably mixed. However, if it’s going to happen, why not guide it based on diversity, since that demonstrates a #winning strategy?

Fun Stuff, friends in the news…

Check out Lumo Interactive Projector by Meghan Athavale and crew. It’s an interactive floor projector, transforming a floor into games that kiddos can design themselves.

Also, this bit Our Coming Robot Overlords about David, Amanda, and Zeno Hanson – friends back in Texas.

And, what William Barker called one of his most honest interviews, ever.

Upcoming Events

Will just leave you with…

This article. Wonderful, on so many levels.


SV Synopsis: Fundamentalism in Technology

I am grateful for perspectives gained because our family lives in Silicon Valley. Many options here to work at novel ventures, and on fascinating projects… Opportunities to drop by Stanford or Berkeley for some remarkable guest lecture by a visiting expert… The wonders of an almost perpetual Maker Faire as one walks through the neighborhood on any given evening… Tech camps that our daughters can attend locally, as they wish… And, generally speaking, the lack of any real need to engage in ridiculous commutes.

As an open source evangelist and as an investor, I've felt grateful to learn from a veritable parade of interesting projects. However, I am troubled by the incidence of a particular problem. Far too often one runs headlong into what I could characterize as a close approximation of cocaine-fueled misogynistic narcissism. The condition is subtle, but systemic here. Even recently, I have witnessed this up close – along with the regrettably pervasive and predictable non-reactions to it. Increasingly, zero tolerance appears to be the only effective response. Or perhaps the tech industry percolates out elsewhere, far from SF and its inertia?

Without mentioning names, two well-known billionaire-club investors in Silicon Valley personify this character sketch. Evidence of panspermia ad absurdum festers in the "cultures" that they promote. Personal jihads seemingly to self-perpetuate their fundamentalist ideals.

A nagging question lingers… Why work alongside an ilk of people with whom I would never encourage my daughters to mingle? Granted, I believe quite strongly in the need to talk with just about everyone, to keep dialogue open, to reject the notion of "enemy". Even so, there are absolutes. Practical realities of livelihood aside, as a parent what kind of examples do my professional actions and affiliations set?

In addition, a question that investors ask over and over when considering whether to fund a new company is "Will the team scale?" Any measure of the toxins described above almost guarantees that the answer will in practice be "No."

That represents a dirty little secret. There is an amazing level of demand for tech talent. It's not exactly because these companies are raging commercial successes; most early-stage ventures by definition are not. It's because few people who are capable of making good judgements are willing to compromise their futures to work for ineffective caricatures. Many start-ups encounter difficulties in scaling their team. Or – more likely over time – they encounter high attrition rates.

While I have in the past focused for several years on the same project, lately I don't stay long in most early-stage firms, generally moving on after an organization demonstrates its nature. To paraphrase Lady Grantham from Downton Abbey, there is a point at which malice ceases to be amusing. On the one hand, that's a terrible way to leverage stock option packages. On the other hand, arguably I have pursued a portfolio career strategy. That approach has helped me build an amazing network. Long-term benefits of my network have far surpassed the potential upside of my aggregate stock options. Therein dwells an important lesson about Silicon Valley.


Newsletter Updates for February 2015

Not so much travel recently – Austin was my only trip this quarter so far. We’ve been heads-down reworking instructional materials to highlight what you can do with cloud-based notebooks. To learn more about that, check out the new Databricks newsletter.

Snow near Cold Springs, California

Meanwhile, my family gets to enjoy some time this weekend in a cabin near Yosemite, during an increasingly rare event here: lots of snow! Recommend: we always try to drop by our favorite mile-high restaurant, Mia’s, for excellent Italian cooking in the mountains and even homemade limoncello.


Of course, one of the other big reasons for keeping close to home lately was our biggest event of the year, Strata + Hadoop World in San Jose. Here’s a link for the published speaker slides and videos, along with an excellent summary of the Hardcore Data Science day by Ben Lorica.

About 325 people attended our Spark Camp tutorial. Oddly enough, that’s the same ratio of total conference attendees that we had at Spark Camp in NYC last fall. I also got to host the new Spark in Action track. One eye-opener in our track was the Tencent talk, where LianHui Wang presented about their experiences running an 8000 node Spark cluster in production. So much for FUD claims that Spark doesn’t scale ;) When asked how Tencent can build substantially larger clusters than what YHOO has reported, LianHui replied wryly, “They do not speak Chinese.”

StackOverflow analysis of Spark by Donnie Berkholz @RedMonk

One of the other Strata talks that I really wanted to catch: Tensor Methods for Large-scale Unsupervised Learning: Applications to Topic and Community Modeling by Animashree Anandkumar @UC Irvine. For more details, check out her video.

In particular note the experimental results at the 42:46 mark, along with slides for a related talk. There is even more background in the recent papers: Guaranteed Non-Orthogonal Tensor Decomposition via Alternating Rank–1 Updates; and Tensor decompositions for learning latent variable models.

The gist of this effort is about using graph moments, assuming priors which then help make tensor decomposition tractable. This material will flex your advanced math agility as it flies through linear algebra, graph theory, statistics, and optimization for some startling implications. While the immediate research is about latent variables for community detection (think: Facebook) these techniques have implications on a much broader range of industry optimization problems. Note that the outcomes are in contrast to work by Jure Leskovec, et al., @Stanford. Another excellent Spark-related talk at Strata that referenced work with tensors was Hadoop as a Platform for Genomics by Allen Day @MapR.

Looking Ahead

Why tensors? Recall from 18 months ago, “I give Hadoop three years before it gets displaced.” At the time that prediction drew some flak. Now that we’re halfway to the predicted time, note that during the past three Strata + Hadoop World conferences there have been numerous remarks to rename it Strata + Spark World. However, the general insight drives a bit deeper…

My question here is, “What is the business case for developing custom apps atop a Hadoop platform?” When I examine industry use cases for Big Data frameworks, there are a few general categories:
  1. ETL
  2. data warehouse replacements
  3. data exploration and reporting
  4. analytics in depth, leading toward streaming
The first category is relatively well-understood, leading toward general purpose solutions. On the start-up side of the spectrum there are great solutions emerging such as ETLeap, Alation, and arguably examples such as Epic in medical data exchange. On the established side of Enterprise IT, incumbents such as Informatica have been aggressively partnering and expanding the scope of their integration. That begs the question of whether firms would continue to build rather than buy?

The second and third categories are the devil-you-know, as continuations of DW and BI respectively. SiliconAngle had a good article recently along these lines, The cheat sheet to following Big Data’s money trail by Suzanne Kattau.

My hunch is that in terms of the second category, Cloudera, Hortonworks, etc., will be forced to pivot toward vertical applications sooner than later to sustain their growth, and will likely buy up smaller analytics vendors along the way. That puts them on a collision course with incumbents Oracle, IBM, Teradata, SAS, etc., where both ends of the spectrum race toward resembling each other. In other words, the DW king is dead, long live the DW king. Expect either some contractions or M&A activity as a result. Not much news there.

The third category, effectively a BI displacement, gets a bit more interesting. I gave a keynote talk at Data Day Texas in Austin in January, A New Year in Data Science: ML Unpaused. The gist is that two aspects of the BI displacement – effectively, the dev-centric software engineering (aka “data engineering”) approach and the statistics detour of the past two centuries – are losing steam and lacked sufficient depth to begin with. Machine learning in the 1980s meant something much broader than what gets represented by the current crop of analytics vendors; check out my preso for more details. To cut to the chase, also check an excellent talk The Thorn in the Side of Big Data: too few artists by Christopher Ré @Stanford. See a related article I’ll Be Back: The Return of Artificial Intelligence by Jack Clark @BloombergBusiness.

Stanford Y2E2 at sunset

I have a hunch that cloud-based notebooks will eat the lunch of oh-so-many dev-centric approaches and second-generation BI tools. That strips away from the intrinsic value of Hortonworks, Cloudera, etc. Meanwhile it pushes value toward those firms which are closest to domain experts, with key examples such as Enlitic, Idibon, Oculus Info, Spaceknow, etc.

The fourth category has a large market in industry in general. In my opinion, going forward its upside will be realized less so among the “data-centric” usual suspects of ad tech, fin tech, e-commerce, social networks, security… rather more so within the more traditional sectors of energy, transportation, manufacturing, agriculture, etc. Sensor data is a major driver, whether we are talking about embedded sensors or layers of remote sensing or for that matter the volumes of data in genomics work. These use cases tend toward streaming. Fine-grained resource management in clusters is core to this: not so much due to the data rates as it is due to needs for elastic computing capacity and service architectures – in other words, latency and robustness become key. Streaming applications have lots of moving parts and represent a hard problem in computer science in general. On the one hand, the organizational costs of using a YARN cluster to address those kinds of needs prove to be rather upside down, while on the other hand we see a rise in Mesos deployments, e.g., Virdata, Atigeo, Stratio, etc.

My hunch is that the emerging stack for sophisticated analytics and optimization needs will look significantly less like Cloudera or Hortonworks, and more like an integration of...

Typesafe is another vendor that is clearly addressing this demand. However, that speaks to the infrastructure not the science, and this is where the focus on tensors comes back into the picture…

Within the 2–3 year horizon, I expect to see reasonably good open source projects for cost-effective and scalable methods for low-rank tensor factorization. It’s likely this will involve some probabilistic techniques and lead toward online algorithms, i.e., for streaming. So far there haven’t been good off-the-shelf solutions for tensor factorization. However, a general case approach that could scale-out on commodity hardware would be a significant game-changer, with the potential to sublate a wide range of contemporary work in algorithms.
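To make "low-rank tensor factorization" a bit more concrete, here is a toy sketch of its smallest building block: fitting a single rank-1 component to a 3-way tensor by alternating updates (often called higher-order power iteration). This is not the method from the papers cited above, and it is nowhere near the scalable, probabilistic, or online approaches anticipated here. The function names `rank1_tensor`, `normalize`, and `best_rank1` are hypothetical, and the code assumes a small dense tensor stored as nested lists.

```python
import math
import random


def rank1_tensor(weight, a, b, c):
    """Build the 3-way tensor weight * (a outer b outer c) as nested lists."""
    return [[[weight * x * y * z for z in c] for y in b] for x in a]


def normalize(v):
    """Scale a vector to unit Euclidean length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def best_rank1(tensor, iterations=50, seed=0):
    """Fit weight * (a outer b outer c) to `tensor` by alternating updates."""
    rng = random.Random(seed)
    n_i, n_j, n_k = len(tensor), len(tensor[0]), len(tensor[0][0])
    # Random positive starting guesses for two of the three factors.
    b = normalize([rng.random() + 0.1 for _ in range(n_j)])
    c = normalize([rng.random() + 0.1 for _ in range(n_k)])
    for _ in range(iterations):
        # Update each factor by contracting the tensor with the other two.
        a = normalize([sum(tensor[i][j][k] * b[j] * c[k]
                           for j in range(n_j) for k in range(n_k))
                       for i in range(n_i)])
        b = normalize([sum(tensor[i][j][k] * a[i] * c[k]
                           for i in range(n_i) for k in range(n_k))
                       for j in range(n_j)])
        c = normalize([sum(tensor[i][j][k] * a[i] * b[j]
                           for i in range(n_i) for j in range(n_j))
                       for k in range(n_k)])
    # The weight is the tensor fully contracted with the three unit factors.
    weight = sum(tensor[i][j][k] * a[i] * b[j] * c[k]
                 for i in range(n_i) for j in range(n_j) for k in range(n_k))
    return weight, a, b, c
```

A full factorization would extract several such components and cope with noise. Every update here touches every entry of the tensor, which is exactly the cost a scale-out approach would need to avoid.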

Within a similar timeline, I expect to see relatively dramatic improvements in networking technology, i.e., within the datacenter. Taken together those two events would signal the availability of relatively more general purpose solutions in contrast to the many one-offs in analytics that are currently bread-and-butter for Hadoop app developers. It could also erode the valuation for the many machine learning library vendors. Consequently, I’m watching this area closely as the sea change evolves. 

My prediction about Hadoop was on target, so let’s see how this new prediction unfolds.


We’ve had the Apache Spark developer certificate available online for several weeks now. Congrats to the recipient of certificate number 1.1.0 - 0001, François Garrilot @Typesafe. While I cannot release exact numbers, the success rate for people taking the exam is in the mid 90’s percent. It pays to have hands-on experience developing Spark apps, and this talk provides some great test prep examples. We’ll work toward certifications that are more specialized toward systems engineering and data science.

First Spark certificate goes to François Garrilot!

Recently, Reynold Xin presented about the new DataFrames support in Spark, bringing parity with similar abstractions in Python and R. This capability will be introduced but disabled by default in Spark 1.3, but will become center-stage in later releases. In terms of workflows, it represents a higher-level abstraction than RDDs; however, there are still RDDs underneath and many applications will continue to focus at that layer. Meanwhile, Matei’s thesis has been translated into Chinese. Hopefully that represents the beginning of a trend.

Also check out the events worldwide listings and archived talks on the YouTube channel for Apache Spark.


So much effort these days seems to be spent on achieving #Inbox40 … I have a hunch that use of email for business must be rethought. Soon. And perhaps abandoned? I am not convinced that productivity tools such as Yammer, Asana, Slack, etc., provide any long-term solutions, since they still tend to focus people too much on screens and keyboards.

Pescadero Beach, office for an afternoon on the way from our company retreat

FWIW, among my daughters’ peer group, they are way more Internet-savvy than #millenials and have already dumped email as #deadmedia … They use Instagram, Minecraft, and Skype as collaboration tools – each of which is at least partly owned by MSFT, for those who are keeping track. However, they concede that they’d likely use Twitter for business if they needed it. Consequently, I greatly appreciate when people use my public timeline on Twitter to communicate. At this point, I delete most private messages aside from Gmail: Twitter DMs, LinkedIn mail, etc., and Gmail messages are N-deep before they will get read.

Just Enough Math

Apparently the Foobartendr drink-by-drone-delivery service in Just Enough Math wasn’t so cray-cray after all ;) Recently the Washington Post reported about a restaurant delivering drinks via drones indoors.

Another interesting bit of tech news is in Quantum Information Processing: Are We There Yet? by Daniel Lidar @USC: niobium processors, Chimera graphs, and much more fun. To wit, this video discusses how to solve Ising Hamiltonians with quantum annealing, i.e., for complex graph problems. Gosh, wonder if that could be handy for tensor factorization? Check around the 36:48 mark, where Prof. Lidar discusses how ground state success probability distributions for DWave are inconsistent with thermal annealer (classical / unimodal) results, but consistent with simulated quantum annealer (bimodal). As far as I can follow the discussion, this rules out classical models, but is not definitive proof yet. Also, how well will it scale?
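For anyone unfamiliar with the setup, here is a small classical sketch of what "solving an Ising Hamiltonian" means: find the spin assignment that minimizes an energy defined over a graph of couplings. The code below is plain thermal simulated annealing on an ordinary CPU, roughly the classical baseline being compared against. It is not quantum annealing and has nothing to do with how DWave hardware works. The function names, the cooling schedule, and the parameter defaults are all illustrative assumptions.

```python
import itertools
import math
import random


def ising_energy(spins, couplings):
    """Energy H = -sum over edges (a, b) of J_ab * s_a * s_b, spins in {-1, +1}."""
    return -sum(j * spins[a] * spins[b] for (a, b), j in couplings.items())


def brute_force_ground_energy(n, couplings):
    """Exhaustively find the minimum energy; only feasible for tiny n."""
    return min(ising_energy(s, couplings)
               for s in itertools.product((-1, 1), repeat=n))


def simulated_annealing(n, couplings, steps=5000, t_start=5.0, t_end=0.05, seed=0):
    """Classical thermal annealing: single-spin flips, geometric cooling."""
    rng = random.Random(seed)
    spins = [rng.choice((-1, 1)) for _ in range(n)]
    energy = ising_energy(spins, couplings)
    best_spins, best_energy = list(spins), energy
    for step in range(steps):
        temperature = t_start * (t_end / t_start) ** (step / (steps - 1))
        i = rng.randrange(n)
        spins[i] = -spins[i]                      # propose flipping one spin
        new_energy = ising_energy(spins, couplings)
        delta = new_energy - energy
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            energy = new_energy                   # accept the move
            if energy < best_energy:
                best_spins, best_energy = list(spins), energy
        else:
            spins[i] = -spins[i]                  # reject: undo the flip
    return best_spins, best_energy
```

On a tiny problem, such as a ring of eight ferromagnetically coupled spins, the result can be checked against `brute_force_ground_energy`. The interesting question in the talk is what happens on instances far too large for that check.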

Upcoming Events

Many interesting conferences and other events are planned for the months ahead… Please check the http://goo.gl/2YqJZK listings. In particular, mark your calendars for:
Meanwhile we’re busy preparing for Spark Summit East next month in NYC on Mar 18–19. Please join us, and to help with that here’s a 20% discount code SSPACO20 for registration.

Also, make plans for MesosCon 2015, Aug 20–21 in Seattle.


Just under the wire: for what it’s worth, I barely squeaked into the Top 30 People in Big Data and Analytics and also recently joined the academic advisory board for the GalvanizeU graduate program in data science. Grateful for both of those.

Whenever I go to write a newsletter, I’m concerned that there won’t be enough content collected yet. Invariably, there are too many links to share. Here are some that caught my attention recently…

The Africa soil map shows the changing nature of soil across the continent, as “an essential reference to a non-renewable resource that is fundamental for life on this planet.” A vital lesson to all, for there are no jobs on a dead planet. Establishing a bar here, I wish we had comparable analysis for North America.

Perhaps one of the more jaw-dropping research results recently: photonic radiative cooling by Shanhui Fan, et al., @Stanford. More than simply an enormous increase in the capability for buildings to reflect sunlight efficiently, this provides a way to beam internal heat out into space without warming the atmosphere: “What we’ve done is to create a way that should allow us to use the coldness of the universe as a heat sink during the day.”

Another interesting development is the US Digital Service: “The United States Digital Service is transforming how the federal government works for the American people. And we need you.” That emerges along with DJ Patil becoming US Chief Data Scientist.

Following that, I’ll leave you with something fun and something epic. First, a limerick detector, based on the GitHub repo Nantucket. Second, words of wisdom from Vint Cerf: Forgotten Century.

That's the update for now. See you in NYC, Boulder, São Paulo, Boston, London, A Coruña, and Chicago on the event horizon!