
2011-02-06

big data meets big quandary

This past week was the initial Strata conference here in Silicon Valley. While I did get invites to preliminary planning, I did not get to attend the conf itself, due to focus on other important responsibilities last week. On one hand, I missed the opportunity to pay $1095 to hear Greenplum sales reps babble, or witness O'Reilly seek to define a field which many of us have worked hard to define already :) On the other hand, I missed some truly choice talks by friends at LinkedIn and other firms – highly recommended you give those a good look.

Fortunately, I did get to see Tim O'Reilly on a panel about "What It Means To Be Open In A Data-Driven World" at the Creative Commons salon hosted recently at LinkedIn. Great talks there about data, too. However, they missed an essential point: the power in data generally derives from the openness of its analysis, since few people will have the time or capability to examine raw data in detail.

For example, there's been much crowing and clucking recently about WikiLeaks. While I'd call the impact on the (mostly borked) global media significant, I have not had time to read the thousands of documents released. I did have time to dive into some of the thinking behind WikiLeaks, which was good. However, for analysis of the released material I tend to look to trusted sources such as Al Jazeera English or Stratfor. Probably another good illustration is to look at how FiveThirtyEight has transformed the process of data analysis used in political campaigns. In a sea of data, delegation and trust become facts of life. Account for it.

It is the openness of analysis which provides a tipping point for the impact of "open data".

The flip side poses an essential quandary about openness in a data-driven world: good analysis, the kind which has significant impact, often includes some element of "speaking truth to power". Perhaps not quite with the connotation which Quakers originally had for that statement, but close enough. As a data scientist, one often gets placed into a position of learning key aspects about a business long before its company officers do, and quite long before its board does.

Many analysts have gone through the motions of producing a much-anticipated study, only to have the leadership of the organization shred it and ask for no further discussion on the topic. When, for example, an executive has been vocal about estimating metrics for a business, and the actual figures don't align with stated estimates, many execs will squelch the actuals. That's another fact of life, which I've seen happen over the past several years.

In contrast, I've been working for the past several months at a firm which relies on a continuous deployment process, where openness and transparency are core values. My instructions as a data scientist were to forward any interesting findings from data analysis to the whole company. That's a significantly different approach from "speaking truth to power". Moreover, this core value placed on transparency is one of the fundamental requirements for continuous deployment. Take a look through some of the published approaches at IMVU, Etsy, Flickr, Wealthfront, etc.

One important point which comes to mind is that continuous deployment in many ways represents a successor to "agile" methodologies. I've interviewed 100+ engineering candidates over the past few years, most of whom mentioned "agile" to some extent. So far, I'm counting at least four different pronunciations for the term, which implies that some had merely read about it, not actually used the practices in a team context. Probably half of those candidates wouldn't recognize an Agile Manifesto if it bit them on the toe.

However, I cannot once recall a candidate using the terms "openness" or "transparency" as a requirement for their "agileness". To me that represents yet another data point in a trend signifying that continuous deployment represents a substantial departure; it forces structural changes in the company culture.

Furthermore, the principles of continuous deployment depend on good data-driven process. That's a point which, so far, O'Reilly has seemed to miss in their quest for authority about all things data :) For what it's worth, there's another aspect of data analytics, over and beyond the kind of "Lean Startup" principles articulated about continuous deployment … I've been working for nearly three years to articulate the "QR" methodology used on 3 of my past 4 analytics teams, and now look forward to defining that further in the context of my new analytics team.

Let's talk more about this. Here's a proposal for an ArchCamp unconference on the topic. Please vote, to help make that forum happen!
