Newsletter Updates for July 2014

Two aspects about leveraging machine learning are largely under-represented in the lit, especially when it comes to production use cases: feature engineering and the comparative evaluation of multiple modeling approaches. To that point, check out “Streamlining feature engineering: Researchers and startups are building tools that enable feature discovery” by Ben Lorica. The article mentions Spark Beyond, which “finds deep patterns in your data.” I was lucky to get a demo of Spark Beyond earlier this year and talk with the principals – and highly recommend taking a good look at their wares. Between the ongoing advances in deep learning and symbolic regression, a direction seems to be emerging … that perhaps one of the more difficult parts of machine learning workflows, namely the feature engineering aspects, could become more automated.

For another great article, check out Including Men in the Conversation About Women by Scarlett Sieber. Among my biggest peeves about Silicon Valley are the “brogrammer” lopsided demographics, and the gender bias which is quite real and nearly epidemic. Our data science teams have generally been quite mixed, why can’t engineering teams in general leave the 19th century behind, let alone stop being so hostile? Not naming names, but two of the SV firms in which I’ve worked in the past five years are both well known and well poised for harassment lawsuits. Taking a stand against that nonsense as an engineering manager is a great way to catch hell, which I’ve gladly engaged before. Another related pet peeve is where one of the same firms was actively pressuring their engineering interns to quit university degree programs. As a behavior for an engineering manager, I find that highly unethical. Some of those who are engaged in these practices know quite well who I’m talking about.

Spark Summit

The big, BIG news last month was … (wait for it) … Spark Summit. All of the speaker videos have been posted – those are probably the single-best resource for learning about Apache Spark. Of course, the big surprise at the conf was the announcement of Databricks Cloud. If you missed the conf, you can watch Ali Ghodsi’s spectacular demo which kicks in at about the 14:40 time marker.

Spark Summit keynote practice, T-15 hours
One surprise learning from the conf was that one product line from SAP generates more annual revenue than all of the other Big Data vendors (HW, Cloudera, etc.) combined. Other pleasant surprises included: Flambo, a Clojure DSL for Spark; and Thunder, for large-scale neural data analysis, which shows some excellent integration of PySpark, SciPy, scikit-learn, etc.

Our training sessions at Spark Summit set some kind of new records. In particular, check out the advanced material for great lectures there. Those who attended the conf received a free ebook preview for the upcoming Learning Spark: Lightning-Fast Big Data Analytics by Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia; O’Reilly Media (2014).

Also, I got to host the Research track of session talks at Spark Summit, which was a real treat. We had a special #geo break-out session following the Geotrellis talk by Rob Emanuele. We will hopefully be expanding that focus in future confs. There were so many other great talks that it’s hard to pick favorites. Even so, I’ll be studying up about two in particular: Quadratic Programing Solver for Non-negative Matrix Factorization with Spark by Debasish Das, Santanu Das; and Distributed Reinforcement Learning for Electricity Market Bidding with Spark by Vijay Srinivas Agneeswaran, Vishnuteja Nanduri. The latter seems almost ideal for integration with recent work on genetic programming.

Stay tuned for the next Spark Summit, which will be held in NYC in early 2015.


I’ve just returned from OSCON 2014. What an excellent conference! Check out the content recently posted online: keynotes, photos, speaker slides.

Of course, this event was carefully timed to overlap with the Oregon Brewers Festival. Top two picks: Double Latte by Sierra Nevada Brewing Co.; and Lorenzini Blood Orange Double IPA by Maui Brewing Company. Many thanks to Erin Rasmussen for suggesting OBF!

Blood Orange IPA
Back at OSCON… one of my favorite Ignite talks was What Science Fiction Can Teach Us About Building Communities by Dawn Foster. Another favorite, speaking of #geo, was a preso/proposal for Open Aerial Map by Kate Chapman.

During the conf, Andy Orem did a video interview where we discussed perspectives and current projects: Ag+Data, Industrial Internet, sketch algorithms, Apache Spark, etc. Andy was the very first editor I worked with at O’Reilly Media, ten years ago. He’s a much better interviewer than I am an interviewee, so I enjoyed learning much through our work together. Also fun to work again with the amazing video team.
"With great power comes some data, plus wrinkled shirts"
The tutorial for Just Enough Math had 50+ people attending, and we got to evaluate an intermediate stage of a new tutorial software platform. For that, I needed to get a bunch of USB drives from Amazon, but the order/delivery #failed. At the last minute our 10 y.o. daughter and I made an emergency run to Fry’s Electronics (she was eager to observe ground zero for nerdliness) … but the only 4GB flash drives that they had left in stock were Marvel Universe comix characters. Arriving back home, our 9 y.o. daughter was aghast that adults would be receiving comix figures in a lecture :)

The Data Workflows for Machine Learning talk received lots of great responses – as did earlier versions during meetups in Seattle and SF. It became one of the “top-shared” slide decks featured on the SlideShare home page. Perhaps that needs to be turned into a mini-book?

new book kiosk
As my last-o’-the-day book signing was winding down, after almost everyone had left the convention center for “nearby locations of beer taps”, a friend mentioned “Hey, look there’s another pile of books – these look different.” So a few lucky latecomers got signed copies of the galley drafts for our new book Just Enough Math, which probably still won’t be released for months – this rev is quite rough :) Oddly enough, the first person to read it looked up and said, “Where are the other O’Reilly books about math?” Indeed.

Sketchy Things

Speaking of Just Enough Math, we’ve put up a companion site for the video+book+tutorial at http://justenoughmath.com/ to provide additional resources and related links:
  • set up a Python programming env on your laptop
  • code+data files for examples in the video+book
  • “gists” that show expected results for the examples
  • links to external resources that get referenced
  • recommended books and videos for further study
  • monthly newsletter sign-up
The tutorial at OSCON previewed a new chapter recently added about sketch algorithms, following from notes at an excellent Foo Camp session led by Avi Bryant. I will be focussing on Spark Streaming use cases for Strata EU in Barcelona this fall, particularly where approximation techniques (think: examples of monoids in action) can leverage both Spark and Cassandra. If you have examples to share of Spark Streaming production use cases in general, I’m eager to build case studies to publish in Radar. Meanwhile, for a great resource about sketch algorithms, check out the archives of the AK Data Science Summit – Streaming and Sketching from last summer.
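To make the “monoids in action” hint concrete, here is a minimal sketch in Python of one well-known sketch algorithm, a Count-Min Sketch, written so that merging is an associative operation with an empty sketch as its identity. It is only an illustration under assumptions: the class name, the `sketch_of` helper, the table sizes, and the sample batches are all hypothetical, and none of it is taken from the Just Enough Math material or from any Spark or Cassandra API.

```python
import hashlib
from functools import reduce


class CountMinSketch:
    """Approximate frequency counter; estimates never fall below the true count."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # Derive one hash per row by salting the item with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Collisions can only inflate a cell, so the minimum is the best guess.
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.depth))

    def merge(self, other):
        """Monoid operation: cell-wise addition of two same-shaped sketches."""
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must have the same shape to merge")
        merged = CountMinSketch(self.width, self.depth)
        for row in range(self.depth):
            merged.table[row] = [a + b for a, b in
                                 zip(self.table[row], other.table[row])]
        return merged


def sketch_of(items):
    """Hypothetical helper: summarize one micro-batch of a stream."""
    sketch = CountMinSketch()
    for item in items:
        sketch.add(item)
    return sketch


# Hypothetical micro-batches, as a streaming job might see them.
batches = [["spark", "mesos", "spark"], ["spark", "cassandra"], ["mesos"]]
sketches = [sketch_of(batch) for batch in batches]

# An empty sketch is the identity element, so a plain fold combines the batches.
total = reduce(CountMinSketch.merge, sketches, CountMinSketch())
```

Because the merge is associative, per-batch sketches can be combined in any grouping (across partitions, or across time windows held in a store) and still give the same summary as one pass over all of the data.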

Card-Carrying Green

A friend recently brought up the topic of navigating questions about extinction and climate change for preschoolers… I’m getting those too; however, in my experience the questions become much better formulated after an additional 5–6 years or so. As a parent, as a human, it kills me to see all the ginormous FUD spewing from the political lobbies for the coal industry, fracking, Monsanto, GM, etc. How about giving ample air time and consideration for some points from the other side?

First off, I’ve mentioned it before but it bears repeating: The Land Institute is a phenomenally excellent resource for understanding some of the insanity and pure tragedy of contemporary agricultural practices, particularly when it comes to monocultures, annuals, hybrids, let alone unnecessary tillage. To paraphrase Wes Jackson, “The plowshare has destroyed more options for future generations than the sword.” On a related note, I’ll also point to an excellent article by Michael Pollan, as a foreword to Grass, Soil, Hope: A Journey through Carbon Country by Courtney White. Moreover, check out The Solutions Project. That latter site has more substance than perhaps its web-design polish indicates: it’s about the work by Mark Jacobson, et al., on how to power the planet via renewables now while mitigating hurricane damages, etc. One would think that the reinsurance revenues alone would justify a significant investment. In any case, these three links point to the fact that any emerging “dialog of despair” about global warming, etc., is purely FUD. Much can and will be done.

Phylo, the trading card game
I’m particularly grateful to be associated with O’Reilly Media, which provided OSCON attendees with a nice treat in their schwag bags: Phylo, a trading card game. Its gameplay emphasizes endangered species, climate change, food chains, and other environmental pressures. “Phylo is a project that began as a reaction to the following nugget of information: Kids know more about Pokemon creatures than they do about real creatures. We think there’s something wrong with that. Apparently, so do many others.”

In a related development, check out Nerds Without Borders: “We are looking for all sorts of people to help: Engineers, Scientists, Writers, Artists, Dreamers, Activists, Organizers, Fundraisers, Financiers, etc…” Starting with use of IoT sensors and cell phone networks to protect sea turtle hatchlings. Good stuff.

Looking Ahead

Another fun follow-up from Foo Camp and OSCON: getting to talk with Scott Jenson about his work on The Physical Web at Google. Check out his preso, Why Mobile Apps Must Die. The big idea is a kind of “micro-DNS” for low-cost digital tagging of physical items that can be accessed by mobile devices. No app installs required.

In other news, Trafodion was recently released as open source by HP. The name is based on the Welsh word for “transaction”. If you recall Tandem Computers and NonStop, this product line has a long history of tech innovations – for highly reliable, highly optimized real-time SQL at scale. My uncle retired from Tandem, and lately I’ve spent time with the Trafodion team and am quite impressed. This release brings an interesting new level of Enterprise robustness to real-time transactions+analysis atop Linux+Hadoop. One to watch.

Another to watch closely is The Distributed Developer Stack Field Guide by Andrew Odewahn, Courtney Nash, Mike Loukides, et al. This is a GitHub-based book from O’Reilly. If you see any points in there that need editing, embellishing, etc., then two words: pull request, for the win.

In terms of upcoming events, registration is now open for Data Day Texas 2015, and I’m really looking forward to that. Will be teaching Spark at Scala by the Bay in SF on Aug 8–9, speaking at #MesosCon in Chicago on Aug 21, followed by another Spark course in Chicago on Aug 25.


I’ll close with a look back to a 1990 Documentary about Cyberpunk. That provides a good summary of what we were up to in the early 1990s with Mondo 2000, bOING-bOING, FringeWare, WiReD, The WELL, Turkey City, etc. Tim’s monologue around 15:30-ff is hilarious – both because of his ever-optimistic “There will be mass democracy in the streets” miss, and how much it contrasts with just about every other major point coming true within 25 years. Warning: gratuitous F242 clips, throughout. Time marker 27:11 shows what I was doing as a vendor at many, many raves… Meanwhile, check out a recent bOING-bOING article Alien Autopsy: William Barker on Schwa, two decades later for some of the more astute counterpoint about what was really going on, then and now.

That's the update for now. See you in Chicago with San Diego on the event horizon!


Newsletter Updates for June 2014

Been quite an interesting month: NYC, SJ/SF, bookended by Hadoop Summit and Spark Summit, with Foo Camp in the midst… much learned, and many excellent introductions.

If you haven’t seen it, this is a gem: Seeing Spaces by Bret Victor, as an evolution of the “Maker Spaces” concept. Another top recommend is A Short History of and Introduction to Deep Learning by John Kaufhold. Money quote: “Learn, don’t engineer feature representations.” Check this review by Mary Galvin at Data Community DC.

For another great source of inspired writings, follow the Matthew Hunt posts on LinkedIn. In this episode of delightfully unexpected connections, Matthew leads us on a path among Pink Floyd, moon cheese, gnome-like cretins, and unlikely heroes for a tale of two Burkes.

   Just Enough Math

The video for Just Enough Math has been on sale for the past month. O’Reilly has a preview video on YouTube, if you’d like to check out a sample. Meanwhile…

I need your help: this Just Enough Math project would greatly benefit from your reviews. Even if you don’t purchase the full video, check the preview and the free sections. We’re eager to hear your feedback, and especially your reviews!

Here’s the thing: on the one hand, if you’re the kind of person who enjoys reading math papers as a fond pastime, this material is probably not for you. There are plenty of other videos in the world, and so many brain teasers, so little time. On the other hand, if you find that math papers tend to be almost entirely devoid of context (which, frankly, many are) and you took math through Algebra 2, and you enjoy seeing some examples, learning some history, etc., then you’ll probably benefit from this video.

There are quite a number of great resources at O’Reilly and other publishers for those who want a deep-dive in any particular area of advanced math applied for Big Data … and the point of the Just Enough Math project is to serve almost like a “hyperlink document” (e.g., old school web pages circa early 1990s) for those other books, videos, websites, etc., along with providing history and case study examples as context.

We’ll be presenting a tutorial based on Just Enough Math at OSCON. Plus, there’s a super-secret discount code for 20% off registration: PACOID

In the Bay Area, we’ve recently launched a Just Enough Math: Machine Learning for Execs and Entrepreneurs meetup. Looking forward to more events through that. Submitted as evidence, check out “How Not to Be Wrong”: What the literary world can learn from math by Laura Miller in Salon.


Some interesting insights about Apache Mesos surfaced in the recent 2014 community survey. And at this point, the list of firms adopting Mesos no longer fits in my browser window. To find out more, check out MesosCon scheduled in August in Chicago. I’m looking forward to talks from John Wilkes and several other experts, and meanwhile will present about Apache Spark running on Mesos. In related news, recently I gave a talk at the Mesos NYC Meetup sponsored by the kind folks at Shutterstock. If you’re in the area check out an intro Mesos talk on 7/17 by Joe Stein at Bloomberg.

✽ ✽ ✽

On a recent camping adventure in Sebastopol, I was grateful to learn about lots of new technologies. One of the more interesting finds was Unbounded Robotics, and I enjoyed a chat with Melonee Wise, CEO. These actually are the droids you’re looking for. Meanwhile, O’Reilly Media is looking for editors, especially in the Data practice area. Got Edit? Join the team!

morning walk in Ceres Community Garden, w/ O'Reilly Media in bkgd

In terms of other interesting technologies… I’ve been hearing memes rumbling about “Big Data is a myth” or “Where are the IoT apps?” Here, that’s where. The part of Nokia that didn’t sell off to Microsoft is handling some of the most interesting fusion of data exhaust that I’ve seen. Case in point, check out Jams, game theory, and equations: the science of traffic for a view of really big data analyzed in real-time. If you’ve attempted to drive anywhere in, say, DC or Austin or Silicon Valley anytime recently during commute times … this is a problem. Money quote: “Then we start to look at the car’s sensors. We start to know the weather before the weather authorities do, because we can see which cars have their windscreen wipers or their headlights switched on.” Orders of magnitude larger than your favorite social network or ad exchange.

Meanwhile, my favorite IoT app so far is clearly this: sharks tweet as they approach the shore of Western Australia. Would be great to see more technology applications like that!

   Minecraft camp

Speaking of Foo and other camps, I’ve got two kiddos currently in iD Tech Camps – learning Minecraft and Scratch, respectively. These courses tour around the US and are highly recommended. We could learn much from their teaching approach, to benefit professional workshops for adults as well.

To follow up on the Minecraft + Quantum theme from previous posts, here’s a good video of Seth Lloyd explaining Quantum Machine Learning. Why does this seem to call back to the Real Genius movie?

   Ag + Data

O’Reilly Strata recently carried a story about how Farm data could be worth billions, related to the Ag + Data post on O’Reilly Radar. Much is happening in Ag data and other consumers of remote sensing products – particularly with respect to recent changes in satellite regulations. However, my favorite recent Ag story is about the Purdue Improved Cowpea Storage (PICS) bag. Brilliant work.

Overall, much of the interesting Ag+Data tech seems to be coming from (or through) Chile… and a new phrase has emerged: Chilecon Valley.

   Friends in the News

Congratulations go out to Robby Garner, competing with the JFRED Chat Server in Turing2014: 60th anniversary year of Alan Turing’s untimely death. Many years ago, Robby and I worked on a primordial version of JFRED. That played “customer service agent” for the FringeWare online bookstore. Circa 1998 we ran the bots on BBC “Tomorrow’s World” for a live televised Turing Test, which is some of the most fun I've ever had in network engineering. More recently, Hubot-based chatbots are being deployed for devops and other engineering teams, such as the Shep chatbot used by engineering at O’Reilly.

Also, check out the new Big Data Analytics Beyond Hadoop: Real-Time Applications with Storm, Spark, and More Hadoop Alternatives by Vijay Srinivas Agneeswaran. This is a deep-dive into design patterns and frameworks for large-scale analytics beyond Hadoop.

Got to meet lots of people interested in using Spark at the recent Hadoop Summit in San Jose. One of the Community Choice Awards at the conference went to “Demo: Building a Unified Data Pipeline in Apache Spark” by Aaron Davidson from Databricks. Eager to see the slides published for that. Also at Hadoop Summit, Xiangrui Meng gave an excellent talk about MLlib – the tech roadmap and integrations, and especially emphasizing how to leverage sparsity in your data.

Meanwhile, friends at Zementis have recently released PMML support for Python, with a project called Py2PMML. In particular, there’s integration for scikit-learn. I wonder how long before PySpark + MLlib joins that list?

   Joaquim on the Moon

As many of you know, given enough beers I become fond of talking about dropping large complex arrays of sophisticated equipment into the polar dark craters on the Moon. In recent convo over drinks with people who calculate the costs of such an operation for a living, we surfaced an interesting price tag for that kind of venture: approximately $15B. In terms of how much the US spends on the Department of Defense, that’s about 8 days’ worth. Think about it. Who wants a term sheet? Meanwhile, the subject got me thinking of Kubrick films, particularly the 2001: A Space Odyssey set production, an engineering feat in itself.

That's the update for now. See you in PDX and ATX, with Chicago and San Diego on the event horizon!


Newsletter Updates for May 2014

Been quite an interesting past month or so: DC, Austin, SF, Ann Arbor, Atlanta, Seattle… with hopefully much learned from those travels, plus many excellent events and introductions.

Meanwhile, I learned much from this gem, Therbligs for data science: A nuts and bolts framework for accelerating data work, by Abe Gong. Looking forward to seeing more about Therbligs from Abe. Definitely tune in to Welcome to Intelligence Matters, a new series by O’Reilly exploring current issues in AI, with Beau Cronin as lead correspondent. Another recommended gem is Genomics Crash Course for Data Engineers by Allen Day – that's at the intersection of Genomics and Big Data, for which I have seen an uptick recently.

Just Enough Math

Allen and I have been working to complete our new O’Reilly book, Just Enough Math. The video is in post-production now, and the book is halfway through second drafts – we are closing in! Some of that material will be previewed in the upcoming workshop Machine Learning for Managers:
O’Reilly will host a free one-hour webcast, Computational Thinking, Just Enough Math on Wed, Jun 4, 10:00am–11:00am (Pacific). Please join me there. The webcast will help publicize a tutorial based on Just Enough Math at OSCON in Portland on Sun, 20 Jul, 9:00am-noon. As a special offer, use the code PACOID to get a 20% discount on OSCON registration. Our tutorial will preview a very new thing at O’Reilly: converting book+video content into interactive tutorials using Docker + IPython Notebook + Vagrant + Git for a cloud-based next-generation content platform.

Speaking of Docker, one of the more interesting start-ups that I have run across recently is Resin, using Docker and Git to containerize+push apps on IoT devices running embedded Linux. Brilliant work.

UCB Initiation Ritual: cousins circa 1968, near Atascadero

In other news, I am thrilled to announce a partnership with Databricks, where I’ve been working to help develop an instructional program that introduces Apache Spark. As you can see in the photo above, the ceremonial ritual for teaming up with UC Berkeley is a bit arduous, but well worth it. Yes, you heard correctly … a Stanford alum saying “Go Bears!”

Our first course in the series is Databricks Hands-on Intro to Apache Spark, an introduction for developers working in Python, Java, and Scala. We have several of these workshops scheduled:
Spark is approaching the 1.0 release at Apache, with new support for SQL. Overall, one of the best presentations that I’ve seen recently about it was Spark at Twitter by Sriram Krishnan, Engineering Manager for Data Platform at Twitter.

The agenda was posted recently for Spark Summit 2014, in SF on 30 Jun - 1 Jul. As another special offer, use the code Paco2014 to get a 15% discount on Spark Summit registration. Highly recommended, and I hope to see you there.

Mesos Updates

Speaking of BDAS and the Berkeley Stack… there have been lots of developments in the Apache Mesos world. One of the best talks ever about Mesos was Improving Resource Efficiency with Apache Mesos by Christina Delimitrou, a case study about Quasar usage at Twitter. Also check out Mesos Elastically Scalable Operations, Simplified by Niklas Nielsen and Adam Bordelon, presented recently at ApacheCon 2014.

The other big news is that #MesosCon, the first Mesos conference, will be held in Chicago on Aug 21. Definitely see you there! Companies interested in sponsoring the conference – please inquire.
I’ve created a new workshop called Cluster Compute App Integrations about building end-to-end apps for Big Data. The workshop leverages Mesos based on the https://elastic.mesosphere.io/ service in the cloud, along with Spark, KNIME, etc. Hint: this involves teams competing, and it is turning out to be quite a popular course. We have upcoming dates lined up:

Agriculture + Data

Did you know that agriculture provides a livelihood for 40% of the world’s population? Or that agriculture consumes 70% of the world’s freshwater in aggregate? That figure is expected to reach 89% by 2050. Or have you heard that Havana grows 75% of its own food based on urban agriculture?
Last month I wrote an O’Reilly Strata article, Ag+Data, about those topics and more. The article introduces a whitepaper, Agriculture + Data: Outlook 2Q14, that we recently published at The Data Guild to explore these issues in greater depth. Many thanks to Bill Worzel, Brad Martin, and others who helped on that!

Evolutionary Algorithms

Recently I gave a keynote talk at the Genetic Programming in Theory and Practice conference, which is hosted each year at U Michigan by The Center for the Study of Complex Systems. They are the experts in GP; I was merely there to add a few perspectives about machine learning and Big Data. What a wonderful conference. Got to speak at length with Lee Spector at UMass Amherst and Hampshire College. Lee and his grad students have been working with a Clojure-based language called Push, in which evolutionary programs are expressed.

What kinds of optimization problems respond to evolutionary pressure? Definitely not the kinds that one typically finds solved by machine learning. That is where GP approaches come in. In general, there was a lot of discussion about symbolic regression as a general rubric, also some exceptionally interesting work on use of Pareto optimal fronts for model archives (which I’ll be adding to my ML bag o’ tricks). In particular, great work from Theresa Kotanchek and Mark Kotanchek at Evolved Analytics. Their software effectively leverages Pareto optimality to select exemplars when models diverge, which I find to be a fascinating alternative to what other disciplines might attempt to resolve through sampling. Brilliant work.
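For anyone unfamiliar with Pareto fronts, a small sketch in Python may help: given candidate models scored on two objectives (say, error and complexity, both lower-is-better), the front keeps every model that no other model beats on both counts. This is only a toy illustration of the general idea, not the method used in the Evolved Analytics software; the `Model` type, the two objectives, and the sample scores are all made up for the example.

```python
from typing import List, NamedTuple


class Model(NamedTuple):
    """Hypothetical archive entry: a model scored on two objectives."""
    name: str
    error: float      # lower is better
    complexity: int   # lower is better


def dominates(a: Model, b: Model) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a.error <= b.error and a.complexity <= b.complexity
    strictly_better = a.error < b.error or a.complexity < b.complexity
    return no_worse and strictly_better


def pareto_front(models: List[Model]) -> List[Model]:
    """Keep only the models that no other model dominates."""
    return [m for m in models
            if not any(dominates(other, m) for other in models)]


# Made-up scores, purely for illustration.
candidates = [
    Model("linear", 0.30, 2),
    Model("quadratic", 0.18, 5),
    Model("bloated", 0.18, 9),     # same error as "quadratic", but more complex
    Model("deep_tree", 0.12, 14),
    Model("noisy", 0.35, 6),       # worse than "linear" on both objectives
]

front = pareto_front(candidates)
print([m.name for m in front])
```

The surviving models form a trade-off curve rather than a single winner, which is what makes the front a natural archive: each entry is the best available at its own level of complexity.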

Also got to talk with Bill Tozier, author of Answer Factories: The Engineering of Useful Surprises, and viewed some astounding work in HeuristicLab, an interactive framework from HEAL. Think: evolutionary IDE. Another excellent tip was to check out Modeling global temperature changes with genetic programming by Karolina Stanislawska, Krzysztof Krawiec, Zbigniew Kundzewicz.

✽ ✽ ✽

Didn’t get to mention yet about Atlanta, but I really appreciated meeting many wonderful folks there. You’ll be hearing more about upcoming Atlanta plans soon! Also, there are workshops and meetup talks planned now for: NYC, SV/SF, Austin, Chicago. Next up after my current week in Seattle comes Hadoop Summit, on 3–5 Jun in San Jose. Hope to see you there!

-alaVoid Distribution

Misc. Inspiration

In closing… Those who have known me for, well, for the past 20-odd years or so will be familiar with the following: a 21st century artist named William Barker, formerly of the acclaimed Schwa Corporation, has a new endeavor called -AlaVoid Distribution. Definitely check out his new shop on Etsy.