2014-04-07

Connected Devices Fellowship - O'Reilly Solid conf

I'm advising Amplify Partners and they've launched a Connected Devices Fellowship that includes conference registration, airfare, and accommodations to attend the new O’Reilly Solid conference on May 21-22 in SF.

The fellowship is designed for engineers, students, researchers, et al., who are passionate about infrastructure for IoT and connected devices:
http://www.amplifypartners.com/fellowships/amplify-partners-connected-devices-fellowship/

This is an amazing new conference coming up, and an excellent opportunity. If you know anyone who'd be interested, please pass it along! The deadline is coming up quickly: applications are due April 11.

2014-04-02

Newsletter Updates for April 2014

If you have not seen Data Science Folk Knowledge by Krishna Sankar, go take a look: it is packed full o’ gems about Machine Learning.

Following up on more follow-ups from Strata SC 2014, I’d like to point to an excellent article: 5 Steps to Thinking Like a Designer in Machine Learning by Kevin Dalias:
We’ve all heard the saying that a data scientist is a cross between a statistician, domain expert, and machine learning hacker, but in today’s landscape, that falls short. A good data scientist needs to be all of the above and also a great designer.

Great points made there. Thanks to a teaching fellowship as a grad student many years ago, I got to stick around a couple extra terms and take a Design Communications program. That program evolved substantially, and of course there’s now Stanford d-school too. I'm grateful for those experiences and (taking a cue from Kevin Dalias) perhaps some formal exposure to design helped shape my later career path. Most certainly we emphasize design thinking at The Data Guild among our core values.

Speaking of Stanford, a recent report entitled Researcher reveals how “Computer Geeks” replaced “Computer Girls” pegged a ginormous criticism I have had about Silicon Valley:
This stereotype of the lone male computer whiz is self-perpetuating, and it keeps the computer field overwhelmingly male. Not only do hiring managers tend to favor male applicants, but women are less likely to pursue careers in a field where they feel they won’t fit in … as late as the 1960s many people perceived computer programming as a natural career choice for savvy young women.

As a parent of two girls who have dived into Minecraft eagerly along with their friends, I’m hopeful for a more balanced future. Perhaps some encouraging messaging is a good step toward that. Wired recently reported In a First, Women Outnumber Men in Berkeley Computer Science Course. Some refer to the Smithsonian report as counterpoint. Nonetheless, the male bias in tech is unquestionable... and that lone wolf thing was always a really terrible idea.

O'Reilly Video Studio, Sebastopol – meerkat crossing

Just Enough Math

Got to spend much of last week in Sebastopol working with the remarkably talented folks at O’Reilly Video. Allen Day and I have been busy writing our new book, Just Enough Math – and are now developing a video and much more to go along with that. Elevator pitch: advanced math for business people, to understand how to leverage OSS frameworks for Big Data. We pick up from prereqs of: High School Algebra 2, some Python, plus the experience in business to recognize why you need to leverage data. Let's explore how.

For example, you've probably heard much about graph query engines and perhaps read about graph use cases … how much graph theory did you get exposed to in school? Given that so many people stop at calculus, graph theory is perhaps rare among b-school topics. Would you feel comfortable working from a use case – in the sense of an HBR-styled business framework – leveraging large-scale graphs to build a high-ROI app? We think the answer is “Yes.”

Each morsel of advanced math gets introduced through a clear business use case, some historical context, lots of illustrations, and small snippets of Python code that you can cut&paste. In addition to integrating text + video + code, we are leveraging an instructional rubric called Computational Thinking. Stay tuned! Meanwhile, we’re scheduled for a Just Enough Math tutorial at OSCON in Portland the week of July 20th.

Compelling Projects

While I’m out teaching workshops around the country, I enjoy many opportunities to meet amazing people and hear about their projects. I’d like to highlight in particular Mike West in Austin, and his recent post People Analytics Junto (public community):
We are a loosely connected group pressing forward, for the benefit of humanity, on the following topics: 1.) Exploring innovative ways data can be used to solve people related problems or make better people related decisions in organizations. 2.) Seeking understanding of organizations, and people in organizations, through Behavioral Science - Sociology, Labor Economics, Social Psychology, Psychology & Operations Research… 3.) Applying Big Data, Machine Learning, Artificial Intelligence, and related disciplines to Human Resources.

I am fascinated by use of Big Data and machine learning for HR. Having worked closely with HR in several organizations, having hired lots of people into Data Science and Engineering roles… I struggle to point to instances when we really leveraged data much – other than calculating salary targets for new hires or attrition rates.

The point is not to automate HR. Rather, the point is that most organizations spend most of their revenue on people (which makes sense), so why not invest in data insights there?

Do You Need or Want to become a Data Scientist?

KDnuggets recently ran Part 3 of 3 in my email Q&A interview ... aimed at candid career advice to people wanting to move into Data Scientist roles. Arguably a bit over the top, and in reaction to being exasperated by "Read this and begin calling yourself a Data Scientist" puff pieces. Many thanks to Anmol, Gregory, et al., @KDnuggets. This came after part 1 and part 2 about Apache Mesos following Strata. I'd like to publish some snippets here – not about what I said, but about what people began discussing.

A comment by Data Science London:
++1 “Product Mgmt. in SV is almost antithetically opposed to effective use of data” … Data Products nuke mgmt layers
Another comment by Data Science Retreat:
“Actual work in Data Science entails having to speak truth to power (not fun, but the essence of the role)”
Several criticisms were much appreciated, and hopefully that helped spur more dialog... Daniel Tunkelang:
@paix120 @BecomingDataSci I agree that @pacoid is overcompensating a bit to counter the proliferation of be-a-data-scientist-quick programs.
Data Science Renee:
.@dtunkelang @paix120 @pacoid hm could be. Just seems to take an unnecessarily discouraging tone.
Followed soon after by great perspectives which are recommended reading.

Plus some wise words stated much more succinctly than I could... Gregory Primosch:
a #datascientist is not a magical unicorn http://goo.gl/ytHzT4  
Andrew Musselman:
@GPrimosch @pacoid I started saying instead of hiring unicorns you should hire horses and narwhals
All of that discourse was illuminating to see. Well said, much better than I did. And, as Charlie Greenbacker pointed out:
It’s also a great rebuttal to all the articles claiming “data science” will soon be automated. #WishfulThinking
Indeed. Not to be overly flippant, but my hunch is that Data Scientist roles will become fully automated at about the same time as HR professionals and BoD meetings.

Meanwhile, IMHO some of the wisest words on this subject come from Nick Kolegraff, Dir Data Science @Rackspace: Do you need a data scientist?

Open Source Updates

There’s an excellent Mesos community update summary on the Apache Mesos site, along with Cassandra on Mesos integration, and a new task load simulator framework for cluster performance analysis. I also recommend an excellent preso from Claudiu Barbura @Atigeo about tying together Mesos, Spark, Cassandra, etc. These are building blocks for datacenter computing.

I’ve been working with Apache Spark lots more lately, particularly diving into PySpark. Check out the recommended Spark SQL: Manipulating Structured Data Using Spark … those are Spark workflows, not Hive fronted by Spark. Very nice work. Also, make plans for Spark Summit 2014.

Other recommended conferences with recent announcements:
Backyard in bloom – site of a new Google campus

Upcoming Events

Lots of plans to be out on the road during April/May this year. I hope to get to talk with you there! Here’s a summary of upcoming meetups and workshops, including new material. We will have drinkups plus office hours in most of these cities – probably adding more meetup talks too:

Misc. Inspiration

Speaking of Big Data apps, here’s a good one: simulations show that in the context of hurricanes Sandy, Isaac, and Katrina, wind farms disrupt the outer rotation winds so much that the storms do not even have enough energy to destroy the turbines – let alone to cause as much damage after landfall. That would effectively reduce category 5 storms to category 2 on the Saffir–Simpson scale. Similarly, storm surge decreased substantially in simulations – up to 79% for Katrina.

Consider that estimates for protective seawalls run in the $10-40B range, per city installation… seems like reinsurance companies could start underwriting turbine farms to cut their costs massively, not to mention generating electricity. I've heard Prof. Mark Jacobson present, and his work in general is highly recommended.

I'll leave you with one of the more interesting kinds of archival data that I’ve seen in a long while… Years by Bartholomäus Traubeck, a record player that plays tree rings.

That's the update for now. See you in DC, Austin, Ann Arbor, and Atlanta, with Seattle and NYC on the event horizon!

2014-03-02

Newsletter Updates for March 2014

Strata SC 2014 was a busy time indeed. I’m grateful to have had the opportunity to introduce speakers for several excellent presentations – in addition to presenting about Apache Mesos and meeting with many interesting people who were attending the conf. The keynotes this time were diverse, including brilliant and inspiring words from Geoffrey Moore, Rodney Mullen, and the capstone talk The Future Isn’t What It Used To Be by James Burke – what people were calling “The missing episode of Connections.”

Among the sessions at Strata, my favorite talk by far was Spreadsheets: The dark matter of Big Data by Felienne Hermans, professor at Delft and founder of Infotron. If we are going to address the matter of data in large quantities and especially key learnings about data in business, spreadsheets are the place to start and Felienne is the brilliant leader of our exploration! Functional programming, graph queries, metadata modeling over time, etc., go check out her work.

Also at the top of my picks from Strata: Probabilistic Programming: Why, What, How, When by Beau Cronin, a recovering computational neuroscientist and big data skeptic debonaire. To paraphrase: business data is heterogeneous and structured… the data for every domain is heterogeneous. I’ve seen the best minds of my generation destroyed by madness, dragging themselves through quagmires of large LCD screens filled with IntelliJ debuggers and database command lines, yearning to fit real-world data into their preferred “deterministic” tools. March on, do not tarry, to go study this work. You may glimpse why Salesforce acquired a small AI start-up out of MIT to become its skunkworks.

My third top pick from Strata is clearly Algebra for Analytics by the ineffable talent driving Twitter’s insights at scale, Oscar Boykin. Money quote from this gem? Because #Monoids: “lack of associativity increases latency exponentially”. That talk was worth the price of admission alone, if you get the implications. If not, well, there are still plenty of job reqs open for J2EE, somewhere.

While we are discussing the subject, I urge you to make time to view Add ALL the Things: Abstract Algebra Meets Analytics by Avi Bryant, co-author of Scalding. If I may attempt to place this into context, your team could pour all their resources into developing precious source code, schema, unit tests, etc., but in practice your data probably will not fit what they have anticipated – especially when you encounter the Balrog of low-latency use cases. Partitioning, missing values, max range overflows, etc., did you really expect us (and the BoD) to believe that your developers can anticipate the complexities of data at scale? Fine, but why bother with all this abstract algebra mishmash, you ask? To wit:
  • grouping does not matter (associative)
  • ordering does not matter (commutative)
  • zeros get ignored (identity)
One more choice quote from Oscar: “Hash, don’t sample.” Seriously, after reviewing those presentations listed above, if a truly huge bright light bulb does not suddenly click ON about data+business and why in 2014 we really must stop reiterating the COBOL experience in Java (let alone making intellectually-encumbered hagiographies about “Hadoop as an Operating System”) … then, well, it’s time to step away from the screen+keyboard.
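
To make the three properties listed above a bit more concrete, here is a minimal sketch in plain Python. It is not taken from either talk, and the names `ZERO`, `combine`, and `summarize` are my own illustrative assumptions. The idea: represent each partial result as a (count, total) pair, so that partitions of the data can be summarized independently and merged in any grouping or order, with empty partitions contributing nothing.

```python
from functools import reduce

# Hypothetical names for illustration only (not from Algebird or the talks).
# Each partial aggregate is a (count, total) pair.
ZERO = (0, 0)  # identity: combining with it changes nothing


def combine(a, b):
    """Merge two partial aggregates; this merge is associative and commutative."""
    return (a[0] + b[0], a[1] + b[1])


def summarize(values):
    """Fold one partition of raw numbers into a single partial aggregate."""
    return reduce(combine, ((1, v) for v in values), ZERO)


data = [3, 1, 4, 1, 5, 9, 2, 6]

# One pass over all of the data.
whole = summarize(data)

# The same data in uneven partitions (one of them empty), merged in reverse order.
parts = [summarize(data[:3]), summarize([]), summarize(data[3:])]
merged = reduce(combine, reversed(parts), ZERO)

count, total = merged
mean = total / count
print(whole, merged, mean)
```

Because the merge step does not care about grouping, ordering, or empty inputs, the partitions could just as well live on different machines or arrive late, which is roughly the point those talks make about aggregation at scale.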

"Dune Builder" from Beyond The Human Eye

On to other illuminations

Had a wonderful time in Seattle in late January. Many thanks to all who attended the workshops, meetup talks, drink-ups, office hours, etc. In particular, we had the first public meetup @Twitter Seattle – thanks to many efforts by Jake Mannix. I gave a new talk, Data Workflows for Machine Learning, that began as an update to the Enterprise Data Workflows with Cascading book. I wanted to expand the analysis of abstraction layers out through many different open source frameworks. The results in the talk develop a “scorecard” to compare/contrast features among the different frameworks in the study. Standing. Room. Only. We spent hours afterwards discussing these topics and more. Judging from SlideShare velocity, this has turned out to be my most popular talk in the past three years. Apparently, people want to talk more about data workflows. Who knew?

On the subject of Apache Mesos, check out the recent paper Quasar: Resource-Efficient and QoS-Aware Cluster Management by Christina Delimitrou and Christos Kozyrakis @Stanford. I enjoyed an opportunity to have a few beers and convo with Christos a few months ago to discuss this work. The space is evolving rapidly! Also, Ben, Flo, and I really enjoyed presenting the Mesos tutorial at Strata, great audience feedback from that – and we look forward to doing more of those tutorials!

On the subject of Titan, check out the release of Dendrite – an integration of Titan, Faunus, GraphLab, Jung, AngularJS, and more. Money quote from Lab 41: “It turns out that much of the world, both physical and virtual, can be represented as a graph.”

On the subject of Cascading, Concurrent recently released Driven, a monitoring/troubleshooting webapp for Cascading – think: New Relic for Big Data apps. Great to see this new capability reaching the datacenters, and I am grateful to have been involved in the product’s inception. I can tell you that once your team deploys a mission-critical app on a large Hadoop cluster, well the fun has merely begun! Hint: you need better tooling to troubleshoot edge cases at scale – or your team will be camped out (like our teams were, multiple times) under their desks for the next few weeks, while SVPs keep calling for progress updates.

✽ ✽ ✽

More about machine learning in particular…

I enjoyed the opportunity to catch the talk From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews by Julian McAuley, one of Jure Leskovec’s post-docs @Stanford. You may have already seen Julian’s fascinating work about machine learning and beer preferences in videos or articles. I highly recommend a good read through Julian’s preso (linked above) for the state of the art in time-based recommendation systems.

As a follow-up to my earlier link about quantum algorithms and Minecraft, here’s another interesting video about Google + NASA + D-Wave working on Quantum AI projects.

Data Science Weekly recently ran an interview Machine Learning => Energy Efficiency with Kari Hensien, Sr Director Product Development at Optimum Energy, and my colleague Cameron Turner, Data Scientist at The Data Guild. Money quote, in answer to a question of what was the first data set you remember working with? “Honestly, I was a girl scout and the cookie season was upon us. I found myself trying to figure out how many houses I would need to stop at in order to sell enough boxes to get the Rubik’s Cube.”

Also, check out a new release – a free report/mini-book Practical Machine Learning: Innovations in Recommendation by Ted Dunning and Ellen Friedman. Pragmatic, timely, and approachable for all levels of expertise.

✽ ✽ ✽

Lately with the drought in California, my thoughts have been turning more and more toward Agriculture and how data insights at scale and IoT issues apply to very real-world problems. Namely, the livelihood for 40% of the world’s population, in what amounts to $15T of annual GDP globally. It's hard to imagine a topic of research that matters more. An interesting perspective is presented in the article How NASA, Cisco, and a tricked-out planetary skin could make the world a safer place. From my research, the connectivity for IoT sensors is a major failing point currently – along with privacy/security, interoperability among analytics platforms downstream, etc. I would be very interested to hear your experiences related to Ag data at scale.

I’ll leave you with this video about Gravity Glue which I found amazingly inspirational.