Two aspects of leveraging machine learning are largely under-represented in the literature, especially when it comes to production use cases: feature engineering and the comparative evaluation of multiple modeling approaches. To that point, check out “Streamlining feature engineering: Researchers and startups are building tools that enable feature discovery” by Ben Lorica. The article mentions Spark Beyond, which “finds deep patterns in your data.” I was lucky to get a demo of Spark Beyond earlier this year and talk with the principals – and highly recommend taking a good look at their wares. Between the ongoing advances in deep learning and symbolic regression, a direction seems to be emerging … that perhaps one of the more difficult parts of machine learning workflows, namely the feature engineering aspects, could become more automated.
For another great article, check out Including Men in the Conversation About Women by Scarlett Sieber. Among my biggest peeves about Silicon Valley are the “brogrammer” lopsided demographics, and the gender bias which is quite real and nearly epidemic. Our data science teams have generally been quite mixed, so why can’t engineering teams in general leave the 19th century behind, let alone stop being so hostile? Not naming names, but two of the SV firms in which I’ve worked in the past five years are both well known and well poised for harassment lawsuits. Taking a stand against that nonsense as an engineering manager is a great way to catch hell, which I’ve gladly engaged before. Another related pet peeve is that one of those same firms was actively pressuring its engineering interns to quit university degree programs. As a behavior for an engineering manager, I find that highly unethical. Some of those who are engaged in these practices know quite well who I’m talking about.
The big, BIG news last month was … (wait for it) … Spark Summit. All of the speaker videos have been posted – those are probably the single-best resource for learning about Apache Spark. Of course, the big surprise at the conf was the announcement of Databricks Cloud. If you missed the conf, you can watch Ali Ghodsi’s spectacular demo which kicks in at about the 14:40 time marker.
|Spark Summit keynote practice, T-15 hours|
One surprising takeaway from the conf was that one product line from SAP generates more annual revenue than all of the other Big Data vendors (HW, Cloudera, etc.) combined. Other pleasant surprises included: Flambo, a Clojure DSL for Spark; and Thunder, for large-scale neural data analysis, which shows some excellent integration of PySpark, SciPy, scikit-learn, etc.
Our training sessions at Spark Summit set some kind of new records. In particular, check out the advanced material for great lectures there. Those who attended the conf received a free ebook preview for the upcoming Learning Spark: Lightning-Fast Big Data Analytics by Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia; O’Reilly Media (2014).
Also, I got to host the Research track of session talks at Spark Summit, which was a real treat. We had a special #geo break-out session following the Geotrellis talk by Rob Emanuele. We will hopefully be expanding that focus in future confs. There were so many other great talks that it’s hard to pick favorites. Even so, I’ll be studying up about two in particular: Quadratic Programing Solver for Non-negative Matrix Factorization with Spark by Debasish Das, Santanu Das; and Distributed Reinforcement Learning for Electricity Market Bidding with Spark by Vijay Srinivas Agneeswaran, Vishnuteja Nanduri. The latter seems almost ideal for integration with recent work on genetic programming.
Stay tuned for the next Spark Summit, which will be held in NYC in early 2015.
I’ve just returned from OSCON 2014. What an excellent conference! Check out the content recently posted online: keynotes, photos, speaker slides.
Of course, this event was carefully timed to overlap with the Oregon Brewers Festival. Top two picks: Double Latte by Sierra Nevada Brewing Co.; and Lorenzini Blood Orange Double IPA by Maui Brewing Company. Many thanks to Erin Rasmussen for suggesting OBF!
|Blood Orange IPA|
Back at OSCON… one of my favorite Ignite talks was What Science Fiction Can Teach Us About Building Communities by Dawn Foster. Another favorite, speaking of #geo, was a preso/proposal for Open Aerial Map by Kate Chapman.
During the conf, Andy Oram did a video interview where we discussed perspectives and current projects: Ag+Data, Industrial Internet, sketch algorithms, Apache Spark, etc. Andy was the very first editor I worked with at O’Reilly Media, ten years ago. He’s a much better interviewer than I am an interviewee, so I enjoyed learning much through our work together. It was also fun to work again with the amazing video team.
|"With great power comes some data, plus wrinkled shirts"|
The tutorial for Just Enough Math had 50+ people attending, and we got to evaluate an intermediate stage of a new tutorial software platform. For that, I needed to get a bunch of USB drives from Amazon, but the order/delivery #failed. At the last minute our 10 y.o. daughter and I made an emergency run to Fry’s Electronics (she was eager to observe ground zero for nerdliness) … but the only 4 GB flash drives that they had left in stock were Marvel Universe comix characters. Arriving back home, our 9 y.o. daughter was aghast that adults would be receiving comix figures in a lecture :)
The Data Workflows for Machine Learning talk received lots of great responses – as did earlier versions during meetups in Seattle and SF. It became one of the “top-shared” slide decks featured on the SlideShare home page. Perhaps that needs to be turned into a mini-book?
|new book kiosk|
As my last-o’-the-day book signing was winding down, after almost everyone had left the convention center for “nearby locations of beer taps”, a friend mentioned “Hey, look there’s another pile of books – these look different.” So a few lucky latecomers got signed copies of the galley drafts for our new book Just Enough Math, which probably still won’t be released for months – this rev is quite rough :) Oddly enough, the first person to read it looked up and said, “Where are the other O’Reilly books about math?” Indeed.
Speaking of Just Enough Math, we’ve put up a companion site for the video+book+tutorial at http://justenoughmath.com/ to provide additional resources and related links:
- set up a Python programming env on your laptop
- code+data files for examples in the video+book
- “gists” that show expected results for the examples
- links to external resources that get referenced
- recommended books and videos for further study
- monthly newsletter sign-up
The tutorial at OSCON previewed a new chapter recently added about sketch algorithms, following from notes at an excellent Foo Camp session led by Avi Bryant. I will be focussing on Spark Streaming use cases for Strata EU in Barcelona this fall, particularly where approximation techniques (think: examples of monoids in action) can leverage both Spark and Cassandra. If you have examples to share of Spark Streaming production use cases in general, I’m eager to build case studies to publish in Radar. Meanwhile, for a great resource about sketch algorithms, check out the archives of the AK Data Science Summit – Streaming and Sketching from last summer.
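To make the "monoids in action" point a bit more concrete, here is a minimal sketch in plain Python, not taken from the book or from any Spark API. It assumes a Count-Min sketch as the approximation technique; the class name, the `sketch_batch` helper, the table dimensions, and the sample batches are all hypothetical, chosen only for illustration. The idea it shows: each micro-batch gets summarized independently, and the summaries combine through an associative `merge` with an empty sketch as the identity, which is what lets the work be spread across a cluster.

```python
import hashlib
from functools import reduce


class CountMinSketch:
    """Approximate frequency counter with a fixed memory footprint."""

    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        # One hash per row, derived by salting the item with the row index.
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item}".encode("utf-8")).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item, count=1):
        for row, col in self._columns(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum over rows
        # is the tightest available estimate (never below the true count).
        return min(self.table[row][col] for row, col in self._columns(item))

    def merge(self, other):
        # The monoid "plus": cell-wise addition of same-shaped sketches.
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketch dimensions must match")
        merged = CountMinSketch(self.width, self.depth)
        merged.table = [
            [a + b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(self.table, other.table)
        ]
        return merged


def sketch_batch(items, width=1000, depth=5):
    """Summarize one batch of items as a sketch (hypothetical helper)."""
    sketch = CountMinSketch(width, depth)
    for item in items:
        sketch.add(item)
    return sketch


# Three hypothetical micro-batches, as a streaming job might see them.
batches = [
    ["spark", "cassandra", "spark"],
    ["spark", "mesos"],
    ["cassandra", "spark", "scala"],
]

# Fold the per-batch sketches together, starting from the empty sketch
# (the monoid identity).
total = reduce(
    lambda a, b: a.merge(b),
    (sketch_batch(batch) for batch in batches),
    CountMinSketch(),
)
print(total.estimate("spark"))
```

Because `merge` is associative, it does not matter how the batches get grouped or which worker combines which pair: the merged table is identical to the one built from the whole stream in a single pass.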
A friend recently brought up the topic of navigating questions about extinction and climate change for preschoolers… I’m getting those too; however, in my experience the questions become much better formulated after an additional 5–6 years or so. As a parent, as a human, it kills me to see all the ginormous FUD spewing from the political lobbies for the coal industry, fracking, Monsanto, GM, etc. How about giving ample air time and consideration for some points from the other side?
First off, I’ve mentioned it before but it bears repeating: The Land Institute is a phenomenally excellent resource for understanding some of the insanity and pure tragedy of contemporary agricultural practices, particularly when it comes to monocultures, annuals, hybrids, let alone unnecessary tillage. To paraphrase Wes Jackson, “The plowshare has destroyed more options for future generations than the sword.” On a related note, I’ll also point to an excellent article by Michael Pollan, as a foreword to Grass, Soil, Hope: A Journey through Carbon Country by Courtney White. Moreover, check out The Solutions Project. That latter site has more substance than perhaps its web-design polish indicates: it’s about the work by Mark Jacobson, et al., on how to power the planet via renewables now while mitigating hurricane damages, etc. One would think that the reinsurance revenues alone would justify a significant investment. In any case, these three links point to the fact that any emerging “dialog of despair” about global warming, etc., is purely FUD. Much can and will be done.
|Phylo, the trading card game|
I’m particularly grateful to be associated with O’Reilly Media, which provided OSCON attendees with a nice treat in their schwag bags: Phylo, a trading card game. Its gameplay emphasizes endangered species, climate change, food chains, and other environmental pressures. “Phylo is a project that began as a reaction to the following nugget of information: Kids know more about Pokemon creatures than they do about real creatures. We think there’s something wrong with that. Apparently, so do many others.”
In a related development, check out Nerds Without Borders: “We are looking for all sorts of people to help: Engineers, Scientists, Writers, Artists, Dreamers, Activists, Organizers, Fundraisers, Financiers, etc…” Starting with use of IoT sensors and cell phone networks to protect sea turtle hatchlings. Good stuff.
Another fun follow-up from Foo Camp and OSCON: getting to talk with Scott Jenson about his work on The Physical Web at Google. Check out his preso, Why Mobile Apps Must Die. The big idea is a kind of “micro-DNS” for low-cost digital tagging of physical items that can be accessed by mobile devices. No app installs required.
In other news, Trafodion was recently released as open source by HP. The name is based on the Welsh word for “transaction”. If you recall Tandem Computers and NonStop, this product line has a long history of tech innovations – for highly reliable, highly optimized real-time SQL at scale. My uncle retired from Tandem, and lately I’ve spent time with the Trafodion team and am quite impressed. This release brings an interesting new level of Enterprise robustness to real-time transactions+analysis atop Linux+Hadoop. One to watch.
Another to watch closely is The Distributed Developer Stack Field Guide by Andrew Odewahn, Courtney Nash, Mike Loukides, et al. This is a GitHub-based book from O’Reilly. If you see any points in there that need editing, embellishing, etc., then two words: pull request, for the win.
In terms of upcoming events, registration is now open for Data Day Texas 2015, and I’m really looking forward to that. Will be teaching Spark at Scala by the Bay in SF on Aug 8–9, speaking at #MesosCon in Chicago on Aug 21, followed by another Spark course in Chicago on Aug 25.
I’ll close with a look back to a 1990 Documentary about Cyberpunk. That provides a good summary of what we were up to in the early 1990s with Mondo 2000, bOING-bOING, FringeWare, WiReD, The WELL, Turkey City, etc. Tim’s monologue around 15:30-ff is hilarious – both because of his ever-optimistic “There will be mass democracy in the streets” miss, and how much it contrasts with just about every other major point coming true within 25 years. Warning: gratuitous F242 clips, throughout. Time marker 27:11 shows what I was doing as a vendor at many, many raves… Meanwhile, check out a recent bOING-bOING article Alien Autopsy: William Barker on Schwa, two decades later for some of the more astute counterpoint about what was really going on, then and now.
That's the update for now. See you in Chicago with San Diego on the event horizon!