Paco Nathan: 2014-04

If you have not seen Data Science Folk Knowledge by Krishna Sankar, that is packed full o’ gems about Machine Learning.

Following up on more follow-ups from Strata SC 2014, I’d like to point to an excellent article: 5 Steps to Thinking Like a Designer in Machine Learning by Kevin Dalias:

We’ve all heard the saying that a data scientist is a cross between a statistician, domain expert, and machine learning hacker, but in today’s landscape, that falls short. A good data scientist needs to be all of the above and also a great designer.

Great points made there. Thanks to a teaching fellowship as a grad student many years ago, I got to stick around a couple extra terms and take a Design Communications program. That program evolved substantially, and of course there’s now Stanford d-school too. I'm grateful for those experiences and (taking a cue from Kevin Dalias) perhaps some formal exposure to design helped shape my later career path. Most certainly we emphasize design thinking at The Data Guild among our core values.

Speaking of Stanford, a recent report entitled Researcher reveals how “Computer Geeks” replaced “Computer Girls” pegged a ginormous criticism I have had about Silicon Valley:

This stereotype of the lone male computer whiz is self-perpetuating, and it keeps the computer field overwhelmingly male. Not only do hiring managers tend to favor male applicants, but women are less likely to pursue careers in a field where they feel they won’t fit in … as late as the 1960s many people perceived computer programming as a natural career choice for savvy young women.

As a parent of two girls who have dived into Minecraft eagerly along with their friends, I’m hopeful for a more balanced future. Perhaps some encouraging messaging is a good step toward that. Wired recently reported In a First, Women Outnumber Men in Berkeley Computer Science Course. Some refer to the Smithsonian report as counterpoint. Nonetheless, the male bias in tech is unquestionable... and that lone wolf thing was always a really terrible idea.

O'Reilly Video Studio, Sebastopol – meerkat crossing

Just Enough Math

Got to spend much of last week in Sebastopol working with the remarkably talented folks at O’Reilly Video. Allen Day and I have been busy writing our new book, Just Enough Math – and are now developing a video and much more to go along with that. Elevator pitch: advanced math for business people, to understand how to leverage OSS frameworks for Big Data. We pick up from these prereqs: High School Algebra 2, some Python, plus the experience in business to recognize why you need to leverage data. Let's explore how.

For example, you've probably heard much about graph query engines and perhaps read about graph use cases … how much graph theory did you get exposed to in school? Given that so many people stop at calculus, graph theory is perhaps rare among b-school topics. Would you feel comfortable working from a use case – in the sense of an HBR-styled business framework – leveraging large-scale graphs to build a high-ROI app? We think the answer is “Yes.”

Each morsel of advanced math gets introduced through a clear business use case, some historical context, lots of illustrations, and small snippets of Python code that you can cut&paste. In addition to integrating text + video + code, we are leveraging an instructional rubric called Computational Thinking. Stay tuned! Meanwhile, we’re scheduled for a Just Enough Math tutorial at OSCON in Portland the week of July 20th.

Compelling Projects

While I’m out teaching workshops around the country, I enjoy many opportunities to meet amazing people and hear about their projects. I’d like to highlight in particular Mike West in Austin, and his recent post People Analytics Junto (public community):

We are a loosely connected group pressing forward, for the benefit of humanity, on the following topics: 1.) Exploring innovative ways data can be used to solve people related problems or make better people related decisions in organizations. 2.) Seeking understanding of organizations, and people in organizations, through Behavioral Science - Sociology, Labor Economics, Social Psychology, Psychology & Operations Research… 3.) Applying Big Data, Machine Learning, Artificial Intelligence, and related disciplines to Human Resources.

I am fascinated by use of Big Data and machine learning for HR. Having worked closely with HR in several organizations, having hired lots of people into Data Science and Engineering roles… I struggle to point to instances when we really leveraged data much – other than calculating salary targets for new hires or attrition rates.

The point is not to automate HR. Rather, the point is that most organizations spend most of their revenue on people (which makes sense), so why not invest in data insights there?
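As a small illustration of the kind of baseline HR metric mentioned above, here is a minimal Python sketch of an attrition rate. It assumes one common definition (separations during a period divided by the average headcount over that period); the post does not specify a formula, and the function name and numbers are hypothetical.

```python
def attrition_rate(separations, headcount_start, headcount_end):
    """Attrition rate for a period, under an assumed definition:
    separations divided by the average of start and end headcount."""
    average_headcount = (headcount_start + headcount_end) / 2
    if average_headcount <= 0:
        raise ValueError("average headcount must be positive")
    return separations / average_headcount


# Hypothetical numbers, for illustration only.
rate = attrition_rate(separations=12, headcount_start=190, headcount_end=210)
print(f"{rate:.1%}")
```

A number like this is only a starting point; the more interesting questions are about what drives it.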

Do You Need or Want to become a Data Scientist?

KDnuggets recently ran Part 3 of 3 in my email Q&A interview ... aimed at candid career advice to people wanting to move into Data Scientist roles. Arguably a bit over the top, and in reaction to being exasperated by "Read this and begin calling yourself a Data Scientist" puff pieces. Many thanks to Anmol, Gregory, et al., @KDnuggets. This came after part 1 and part 2 about Apache Mesos following Strata. I'd like to publish some snippets here – not about what I said, but about what people began discussing.

A comment by Data Science London:

++1 “Product Mgmt. in SV is almost antithetically opposed to effective use of data” … Data Products nuke mgmt layers

Another comment by Data Science Retreat:

“Actual work in Data Science entails having to speak truth to power (not fun, but the essence of the role)”

Several criticisms were much appreciated, and hopefully that helped spur more dialog... Daniel Tunkelang:

@paix120 @BecomingDataSci I agree that @pacoid is overcompensating a bit to counter the proliferation of be-a-data-scientist-quick programs.

Data Science Renee:

.@dtunkelang @paix120 @pacoid hm could be. Just seems to take an unnecessarily discouraging tone.

Followed soon after by great perspectives which are recommended reading.

Plus some wise words stated much more succinctly than I could... Gregory Primosch:

a #datascientist is not a magical unicorn http://goo.gl/ytHzT4

Andrew Musselman:

@GPrimosch @pacoid I started saying instead of hiring unicorns you should hire horses and narwhals

All of that discourse was illuminating to see. Well said, much better than I did. And, as Charlie Greenbacker pointed out:

It’s also a great rebuttal to all the articles claiming “data science” will soon be automated. #WishfulThinking

Indeed. Not to be overly flippant, but my hunch is that Data Scientist roles will become fully automated at about the same time as HR professionals and BoD meetings.

Meanwhile, IMHO some of the wisest words on this subject come from Nick Kolegraff, Dir Data Science @Rackspace: Do you need a data scientist?

Open Source Updates

There’s an excellent Mesos community update summary on the Apache Mesos site, along with Cassandra on Mesos integration, and a new task load simulator framework for cluster performance analysis. I also recommend an excellent preso from Claudiu Barbura @Atigeo about tying together Mesos, Spark, Cassandra, etc. These are building blocks for datacenter computing.

I’ve been working with Apache Spark lots more lately, particularly diving into PySpark. Check out the recommended Spark SQL: Manipulating Structured Data Using Spark … those are Spark workflows, not Hive fronted by Spark. Very nice work. Also, make plans for Spark Summit 2014.

Other recommended conferences with recent announcements:

O’Reilly Solid, SF in May
MMDS 2014, Stanford in June
O’Reilly Strata, NYC in October

Backyard in bloom – site of a new Google campus

Upcoming Events

Lots of plans to be out on the road during April/May this year. I hope to get to talk with you there! Here’s a summary of upcoming meetups and workshops, including new material. We will have drinkups plus office hours in most of these cities – probably adding more meetup talks too:

San Francisco

Data Workflows for Machine Learning
Wed, Apr 9, 6:30pm–8:30pm (Pacific)
Climate Corp, 201 3rd St #1100, San Francisco, CA 94103

Washington, DC

Hands-on Intro to Data Science
Mon, Apr 14, 8:30am–4:30pm (Eastern)
MicroTek, 1101 Vermont Ave NW #700, Washington, DC 20005
Hands-on Intro to Machine Learning
Tue, Apr 15, 8:30am–4:30pm (Eastern)
MicroTek, 1101 Vermont Ave NW #700, Washington, DC 20005
Deep Dive on Apache Mesos
Tue, Apr 15, 6:30pm (Eastern)
AddThis, 1595 Spring Hill Rd #300, Vienna, VA 22182

Austin

Hands-on Intro to Machine Learning
Thu, Apr 24, 8:30am–4:30pm (Central)
AT&T Conf Center, 1900 University Ave, Austin, TX 78705
Cluster Compute App Integrations
Fri, Apr 25, 8:30am–4:30pm (Central)
AT&T Conf Center, 1900 University Ave, Austin, TX 78705

Atlanta

Big Data Week ATL (keynote talk)
Sat, May 10, noon–4:00pm (Eastern)
GA Tech Research Institute Conf Center, 250 14th St NW, Atlanta, GA 30361
Hands-on Intro to Data Science
Mon, May 12, 8:30am–4:30pm (Eastern)
MicroTek, Northpark Building 400 #194, 1000 Abernathy Rd, Atlanta, GA 30328
Hands-on Intro to Machine Learning
Tue, May 13, 8:30am–4:30pm (Eastern)
MicroTek, Northpark Building 400 #194, 1000 Abernathy Rd, Atlanta, GA 30328

Misc. Inspiration

Speaking of Big Data apps, here’s a good one: simulations show that in the context of hurricanes Sandy, Isaac, and Katrina, wind farms disrupt the outer rotation winds so much that the storms do not even have enough energy to destroy the turbines – let alone inflict their usual damage after landfall. That would effectively reduce category 5 storms to category 2 on the Saffir–Simpson scale. Similarly, storm surge decreased substantially in simulations – up to 79% for Katrina.

Consider that estimates for protective seawalls run in the $10-40B range, per city installation… seems like reinsurance companies could start underwriting turbine farms to cut their costs massively, not to mention generating electricity. I've heard Prof. Mark Jacobson present, and his work in general is highly recommended.

I'll leave you with one of the more interesting kinds of archival data that I’ve seen in a long while… Years by Bartholomäus Traubeck, a record player that plays tree rings.

That's the update for now. See you in DC, Austin, Ann Arbor, and Atlanta, with Seattle and NYC on the event horizon!

Paco Nathan

2014-04-02

Newsletter Updates for April 2014
