2011-02-06

big data meets big quandary

This past week was the initial Strata conference here in Silicon Valley. While I did get invites to preliminary planning, I did not get to attend the conf itself, due to focus on other important responsibilities last week. On one hand, I missed the opportunity to pay $1095 to hear Greenplum sales reps babble, or witness O'Reilly seek to define a field which many of us have worked hard to define already :) On the other hand, I missed some truly choice talks by friends at LinkedIn and other firms – highly recommended you give those a good look.

Fortunately, I did get to see Tim O'Reilly on a panel about "What It Means To Be Open In A Data-Driven World" at the Creative Commons salon hosted recently at LinkedIn. Great talks there about data, too. However, they missed an essential point: the power in data generally derives from the openness of its analysis, since few people will have time or capability to examine raw data in detail.

For example, there's been much crowing and clucking recently about WikiLeaks. While I'd call the impact on the (mostly borked) global media significant, I have not had time to read the thousands of documents released. I did have time to dive into some of their thinking behind WikiLeaks, which was good. However, for analysis of various released material I tend to look to trusted sources such as Al Jazeera English or Stratfor. Probably another good illustration is to look at how FiveThirtyEight has transformed the process of data analysis used in political campaigns. In a sea of data, delegation and trust become facts of life. Account for it.

It is the openness of analysis which provides a tipping point for the impact of "open data".

The flip side is that an essential quandary about openness in a data-driven world is that analysis – good analysis, that which has significant impact – often includes some element of "speaking truth to power". Perhaps not quite with the connotation which Quakers originally had for that statement, but close enough. As a data scientist, one often gets placed into a position of learning key aspects about a business long before its company officers do, and quite long before its board does.

Many analysts have gone through the motions of producing a much-anticipated study, only to have the leadership of the organization shred it and ask for no further discussion on the topic. When, for example, an executive has been vocal about estimating metrics for a business, and the actual figures don't align with stated estimates, many execs will squelch the actuals. That's another fact of life, which I've seen happen over the past several years.

In contrast, I've been working for the past several months at a firm which relies on a continuous deployment process, where openness and transparency are core values. My instructions as a data scientist were to forward any interesting findings from data analysis to the whole company. That's a significantly different approach from "speaking truth to power". Moreover, this core value placed on transparency is one of the fundamental requirements for continuous deployment. Take a look through some of the published approaches at IMVU, Etsy, Flickr, Wealthfront, etc.

One important point which comes to mind is that continuous deployment in many ways represents a successor to "agile" methodologies. I've interviewed 100+ engineering candidates over the past few years, most of whom mentioned "agile" to some extent. So far, I'm counting at least four different pronunciations for the term, which implies that some had merely read about it, not actually used the practices in a team context. Probably half of those candidates wouldn't recognize an Agile Manifesto if it bit them on the toe.

However, I cannot once recall a candidate using the terms "openness" or "transparency" as a requirement for their "agileness". To me that represents yet another data point in a trend signifying that continuous deployment represents a substantial departure; it forces structural changes in the company culture.

Furthermore, the principles of continuous deployment depend on good data-driven process. That's a point which, so far, O'Reilly has seemed to miss in their quest for authority about all things data :) For what it's worth, there's another aspect of data analytics, over and beyond the kind of "Lean Startup" principles articulated about continuous deployment … I've been working for nearly three years to articulate the "QR" methodology used on 3 of my past 4 analytics teams, and now look forward to defining that further in the context of my new analytics team.

Let's talk more about this. Here's a proposal for an ArchCamp unconference on the topic. Please vote, to help make that forum happen!

2010-08-05

sample src + data for getting started on hadoop

Monday, 19 July 2010... I had a wonderful opportunity to present at the Silicon Valley Cloud Computing meetup, on the topic "Getting Started on Hadoop".

The talk showed examples of Hadoop Streaming, based on Python scripts running on the AWS Elastic MapReduce service.

We started with a brief history of MapReduce, including the concepts leading up to the framework as well as open source projects and services which have followed. Then we stepped through the ubiquitous “WordCount” example (a “Hello World” for MapReduce), showing how Python and Hadoop Streaming make it simple to iterate and debug from a command line using Unix/Linux pipes.
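
For readers who have not seen that pattern, here is a minimal sketch of a WordCount mapper and reducer in the Hadoop Streaming style. It is not the code from the talk; the file name wordcount.py, the function names, and the "map"/"reduce" command-line arguments are all assumptions made for illustration. The only contract it relies on is the usual Streaming one: read lines on stdin, write tab-separated key/value records on stdout, and expect the reducer's input to arrive sorted by key.

```python
import sys
from itertools import groupby


def map_words(lines):
    """Mapper: emit one 'word<TAB>1' record per whitespace-separated token."""
    for line in lines:
        for word in line.strip().lower().split():
            yield f"{word}\t1"


def reduce_counts(sorted_records):
    """Reducer: sum the counts for each run of identical keys.

    Assumes the records arrive sorted by key, as they would after the
    shuffle/sort phase, or after a plain `sort` in a local pipe.
    """
    pairs = (rec.rstrip("\n").split("\t", 1) for rec in sorted_records)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Hypothetical command-line wiring: runs only when "map" or "reduce" is
# passed explicitly as the first argument.
if __name__ == "__main__" and sys.argv[1:2] in (["map"], ["reduce"]):
    step = map_words if sys.argv[1] == "map" else reduce_counts
    for record in step(sys.stdin):
        print(record)
```

Saved as wordcount.py (again, a made-up name), the same script can be debugged locally with ordinary pipes before it goes anywhere near a cluster: `cat input.txt | python wordcount.py map | sort | python wordcount.py reduce`. The `sort` in the middle stands in for the shuffle phase that Hadoop would perform between the two steps.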

Source code is available on GitHub and, oddly enough, the slide deck got an editor's pick that week on SlideShare.


The focus of the talk was to show text mining of the infamous Enron Email Dataset, which Infochimps.com and CMU make available. In that context, the example code creates an inverted index of keywords found in the email dataset, begins to build a semantic lexicon of "neighbor" keyword relationships, plus some data visualization and social graph analysis using R and Gephi.
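
To make the inverted index idea concrete, here is a small single-machine sketch. It is not the code from the GitHub repo: the sample messages, the tiny stopword list, and the function names are all made up for illustration, and treating "neighbor" keywords as words that co-occur in the same message is my simplifying assumption here, not necessarily how the talk defined them.

```python
import re
from collections import Counter, defaultdict
from itertools import combinations

TOKEN = re.compile(r"[a-z]+")
# Hypothetical, deliberately tiny stopword list
STOPWORDS = frozenset({"the", "a", "to", "of", "and"})


def keywords(text):
    """Return the set of lowercase keywords in a message, minus stopwords."""
    return {w for w in TOKEN.findall(text.lower()) if w not in STOPWORDS}


def build_inverted_index(messages):
    """Map each keyword to the set of message ids in which it appears."""
    index = defaultdict(set)
    for msg_id, text in messages.items():
        for word in keywords(text):
            index[word].add(msg_id)
    return dict(index)


def cooccurrence(messages):
    """Count how many messages each pair of keywords shares.

    Assumption: co-occurrence within one message is a stand-in for a
    'neighbor' relationship between keywords.
    """
    pairs = Counter()
    for text in messages.values():
        pairs.update(combinations(sorted(keywords(text)), 2))
    return pairs


# Made-up sample data, not drawn from the Enron dataset
messages = {
    "m1": "Schedule the energy trading call",
    "m2": "Energy trading report attached",
}
index = build_inverted_index(messages)
neighbors = cooccurrence(messages)
```

On a real corpus the same two steps map naturally onto MapReduce: the mapper emits (keyword, message id) records and the reducer collects the ids for each keyword.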


Along with my presentation, Matthew Schumpert from Datameer gave a demo of their product, doing some similar kinds of text analysis.

Lots of people showed up, enough that the kind folks at Fenwick & West LLP grew concerned about running out of seats :) The audience asked several excellent questions and we had a lot of discussion after the talk. Todd Hoff wrote an article summarizing the talk and discussions, along with some great perspectives on High Scalability.

Admittedly, the Enron aspects of the talk were intended as somewhat of a teaser; my examples focused more on method than on results. I'd talked with several people who'd never seen how to write Python scripts for Hadoop Streaming, how to run Hadoop jobs on Elastic MapReduce, how to calculate some basic text analytics or produce simple data visualizations. Even so, if you want to investigate the Enron data yourself, then check out the code, download the data, and run this on AWS. There were some fun surprises to be found among the analytics results, which may be good to publish as a follow-up talk.

Many thanks to SVCC and our organizer Sebastian Stadil, our venue host Fenwick & West LLP, and all who participated.

2010-07-15

tech start-up self-eval

There have been some excellent discussions on Quora recently about how to structure start-ups, how to create the right kind of workplace cultures, best practices, anti-patterns, etc. All the better since Quora draws such an experienced, articulate audience, and makes highly effective use of their [topics, questions, follows, answers, comments, votes] format in a social network context. About that data? Drool. But I digress...

Anywhoo, I've been out talking with clients and other firms lately, plus comparing notes at meetups, conferences, workshops... having interviewed a small army of engineer and data scientist candidates for the past three employers... having seen the inside of several VC portfolios over the past five years, often with responsibilities to assist on due dili (triple that number, if you count being on the receiving end of due dili)... building teams for a range of early-stage through SMB...

Got a bird's-eye view of Silicon Valley: seeing what seems to work, what doesn't seem to work. All of which is terribly colored by my own bias – but hey, that bias is based on history in the industry. I joined as an FTE at a Silicon Valley tech start-up for the first time in 1983. Triad Systems, for what it's worth, followed by a gig at IBM – great contrasts. Call me opinionated, most certainly – that's one of the reasons why clients tend to call in the first place.

Here's my checklist of questions which I run through when evaluating a tech start-up. I'll keep this list updated and published in GDocs. Note that these question pairs are not arranged in any particular order; the spreadsheet will randomize order periodically, to help reduce bias in answers.

As a self-evaluation test, read each question pair, then tally the score for your start-up venture by giving one point for each answer in the left column (labeled "pattern").



For those who score 20+, congrats! Let's meet for beer. I'm buying the first round.

If your firm's score is less than 20, consider seeking help. There are equivalents for twelve-step programs for companies with substance abuse problems. A road toward recovery exists; you can get on that path.

If your firm's score is less than 10, don't bother trying to be competitive in Silicon Valley. However, you might consider a full relocation to Kansas. Or something.

Arguably, some of these question pairs represent a forced "either-or" scenario. Deal with it. This exercise is intended to force you to grapple with some difficult questions and controversial issues. Coffee is not for complainers.