Newsletter Updates for June 2013

Lots of talks, lots of conferences, lots of writing. Here are my latest updates about scheduled events, along with pointers to some of the best content that I've been studying lately.

First, the Austin trip was awesome. Many thanks to all who attended the events -- and especially to those who participated in the workshop, for excellent feedback. We've learned much from your suggestions, which we'll put to use in the upcoming workshops.

Hadoop Summit is next week. I'm looking forward to catching up with friends who will be in town or (for those already in town) stepping out of the cubicles :) Come check out our talk about the Pattern project for PMML in Cascading. Other events are listed below for meetups and workshops in Seattle, Santa Clara, and Los Angeles, followed by OSCON in PDX. Please tell your friends in those cities. I also have an "official" home page now with a newsletter sign-up, event calendar, links, etc., at: http://liber118.com/pxn/

An update on the O'Reilly book about Cascading: it will now be in print on July 22 -- just in time for OSCON. Speaking of which, we should really plan for a meetup, drinkup, birds-of-a-feather, or something in PDX.

In other news, Cascalog also has a new home page at http://cascalog.org/ -- Paul, Soren, Sam, Bruce, et al., nice work! This is a much-needed resource for the developer community.

Now for a few "extras" in the newsletter... I got to attend a Stanford talk earlier this month by Tim Davis, from University of Florida. For the second time in one week, I heard the phrase: "There's no such thing as RAM anymore." That will become a theme for the architecture of algorithms on distributed systems. Following up on the sparse matrix techniques mentioned in my previous newsletter, this lecture was about Sparse Cholesky update/downdate, LU factorization, QR factorization, software architectures based on GPUs for HPC parallel processing, etc. In other words, even if his name doesn't quite ring a bell, you've probably used his software daily: core libraries for linear algebra optimization in R, Matlab, Mathematica, plus some of the core algorithms for Google Street View, 3D Earth, as well as many of the Verilog vendors. Professor Davis curates an extensive collection of sparse matrices, which I highly recommend. Not only are these visualizations beautiful, but the examples represent important edge cases for sparse matrix factorization, used to evaluate new work on algorithms. Also note the museum exhibit coming up in October.

Tim Davis / University of Florida: Sparse Matrix Collection

Speaking of sparse matrix factorization and "No such thing as RAM," I thoroughly enjoyed a day at National Instruments in Austin, giving a talk about Big Data trends, and also learning about machine learning techniques at microsecond speeds on FPGAs. There is a form of convergence afoot in the industry, between the technology pyrotechnics of NI and other firms working on sensor arrays for the "Internet of Things", and what we've been doing with large-scale cluster computing. Check out what NI has to say about the practice of Big Analog Data™ -- I have yet to see people not drop their jaws reading those stats. Oddly enough, some of the most important techniques for machine learning algorithms at microsecond speeds have familiar cousins at petabyte scale, so I have a hunch there are many opportunities ahead based on this area of convergence.

Speaking of "Internet of Things" and real-world data, one of the most astounding projects that I've encountered in a long while is Protei. This is a truly innovative data platform: a multi-hull drone sailboat, built much like an eel, which changes how we clean up marine oil spills, recover "islands" of plastic waste, collect vital data from the radioactive waters off the coast of Fukushima, etc. For another amazing innovation, Paragon Science received well-deserved press in the article, "Doctors and Social Oncology: The MDs most mentioned by their peers (breast cancer edition)". I got to speak with Dr. Steve Kramer in Austin, and I'm quite impressed by the capabilities of this technology for complex graph analysis and visualization.

In other news, I was grateful to attend the recent collaboration among Facebook, Twitter, and LinkedIn for the #Analytics@WebScale conference. We saw Facebook's first public announcement of Presto, a new approach to handling ad-hoc queries at very large scale, which is now displacing Hive. Congrats to Martin, David, and teammates on the Presto project -- looking forward to this work being released as open source later this year! And in a new twist on the term "cloud computing", Facebook data is getting so large and so complex that they've even begun to experience weather conditions within their data centers :)

Finally, I learned a lot from the GOTO Chicago conference last month. Nathan Marz gave a talk on "Runaway complexity in Big Data systems... and a plan to stop it", with a video released. Dean Wampler (from the new firm Concurrent Thought) and Amanda Laucher discuss functional programming in another video from the conference. Good stuff!

If you have a city or venue to suggest for upcoming workshops and talks, please let me know @pacoid




new "Official" home page

Here's a quick note to state that my new "Official" home page has been updated and now lives at http://liber118.com/pxn/

Check there for news about upcoming meetups, talks, and our world tour of "Intro to Data Science" hands-on workshops. Plus, more links to interesting happenings related to large-scale data.