Lots of talks, lots of conferences, lots of articles. Here are the latest updates about scheduled events, along with pointers to some of the best content I've been studying lately.
[ Sign up for PXN Newsletter: "Data Workflows" http://bit.ly/pxnnews ]
The O'Reilly book is due in print June 22 – just in time for Hadoop Summit. An online "Rough Cuts" version should be available before then. Many people have asked whether there will be a Kindle version. Yes. Yes, there will be! Many, many thanks to our technical reviewers for all the excellent feedback and suggestions. Also, speaking of EPUBs, the Liber 118 novel has been available in a Kindle version for a nice long while, so please check it out :)
Events are listed below for a week in Austin, then CityCamp and Hadoop Summit here in SV, followed by OSCON in PDX. Should be a fun summer! These talks have ample doses of Big Data frameworks, Data Science use cases, Machine Learning at scale, Open Data apps, etc. For a sample, check out a recent article about the Pattern open source project in the Software Developers Journal. For another sample, check out this recent Airbnb Tech Talk. Functional programming as a way to tackle Big Data projects has resonated well lately in these talks – at Stanford and CMU as well. Also, the chef at Airbnb graciously celebrated the event with a special dinner menu featuring "Paco's Tacos":
First recommendation: a talk from the 2011 Google Americas Faculty Summit, discussing the Borg and Omega projects which Google uses to manage clustered resources in its data centers. Wilkes presents the notion of "surety" as a first-class resource alongside CPU, RAM, I/O, etc., introducing a subtle but powerful change to our accepted notion of Von Neumann architecture. This has several interesting implications for those of us who build large-scale distributed apps. We'll be talking more about that in Austin. Another recommendation is the excellent paper by Jimmy Lin, "MapReduce is Good Enough?", based on his analysis of machine learning apps at scale during a sabbatical at Twitter. On the one hand, there is ample criticism that Hadoop is not quite suitable for many important kinds of algorithms. On the other hand, as Professor Lin points out, much of our code implementing algorithms has been inherited from 3+ decades of expressing logic based on FORTRAN loops. Seriously. I do believe that Hadoop will be replaced (reasonably soon), but even so we really need to replace our algorithm libraries with better, updated code. As the paper shows, that can lead to more effective implementations for streaming anyway. In terms of putting related insights into practice, check out the talks on SlideShare by David Gleich, especially about "tall and skinny" QR matrix factorization, as well as excellent explanations of the math behind Google, etc., for multi-armed bandits and other machine learning in practice. Chris Severs at LinkedIn has an excellent implementation of Gleich's TSQR. Last but not least, and putting most of the above into practice: Pete Skomoroch gave an excellent talk, "Skills, Reputation, and Search", about data products at LinkedIn. You could search far and wide and not find a better discussion of how to take a problem of raw, unstructured Big Data from blank whiteboard to world-class app in a matter of months.
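To give a flavor of why TSQR maps so neatly onto MapReduce: the factorization runs in two stages – a local QR on each row block (the "map" step), then a QR of the stacked local R factors (the "reduce" step). Here's a minimal single-process sketch in Python/NumPy; the function name and block partitioning are mine, purely illustrative, not taken from Gleich's or Severs' code:

```python
import numpy as np

def tsqr_r(blocks):
    """Compute the R factor of a tall-and-skinny matrix given as row blocks.

    Stage 1 ("map"): factor each row block locally.
    Stage 2 ("reduce"): factor the vertically stacked local R factors.
    """
    local_rs = [np.linalg.qr(block)[1] for block in blocks]
    _, r = np.linalg.qr(np.vstack(local_rs))
    return r

# Tall-and-skinny example: 10,000 rows, 4 columns, split into 10 row blocks
a = np.random.rand(10_000, 4)
blocks = np.vsplit(a, 10)
r = tsqr_r(blocks)

# Sanity check: since A = QR, we must have R'R == A'A (R is unique up to row signs)
assert np.allclose(r.T @ r, a.T @ a)
```

The point of the two-stage trick is that only the tiny R factors (n × n, where n is the number of columns) ever cross the network, which is exactly the kind of communication pattern MapReduce rewards.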
One final note: I've been preparing a workshop about all of the above – typically as a full-day course, very hands-on. We'll try the first in Austin, then take it elsewhere. If you have a city or venue to suggest, please let me know @pacoid
Big Data SF Bay Area presents:
Drinks and Data - Hadoop Happy Hour (PXN co-hosting)
Mon, May 20, 2013 6:30 PM - 9:00 PM (PDT)
187 S Murphy Ave, Sunnyvale, CA 94086
Hands-on Introduction to Data Science -- a full-day workshop with PXN
Wed, May 29, 2013 8:30 AM - 5:30 PM (CDT)
AT&T Conference Center
1900 University Ave, Austin, TX 78705
City of Palo Alto:
a talk by Paco Nathan and Diego May
Sat, Jun 1, 2013 11:00 AM - 7:00 PM (PDT)
downtown Palo Alto, CA
Hadoop Summit:
Wed, Jun 26, 5:05 PM - 5:55 PM (PDT)
San Jose Convention Center
150 W San Carlos, San Jose, CA 95110
O'Reilly Media OSCON:
Thu, Jul 25, 2013 5:00 PM (PDT)
Oregon Convention Center
777 NE Martin Luther King, Jr. Blvd., Portland, OR 97232
I've got a series of blog posts going as a "gentle introduction" to Cascading, over at "Cascading for the Impatient". This includes sample code in a GitHub repo, and progresses from a distributed file copy, to yet-another Word Count, to a full TF-IDF implementation in MapReduce, complete with TDD features.
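For readers who want the gist of TF-IDF before diving into the Cascading version, here is the same computation in a few lines of plain Python. To be clear, this is my own simplified sketch (naive tokenless input, no smoothing), not the weighting or workflow used in the blog series:

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: tf-idf score} dict per doc."""
    n = len(docs)
    # document frequency: how many docs contain each term
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    # tf-idf = (term frequency within doc) * log(N / document frequency)
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: (count / len(doc)) * math.log(n / df[t])
                       for t, count in tf.items()})
    return scores

docs = [["rain", "spain", "plain"], ["rain", "rain", "dog"]]
scores = tf_idf(docs)
# "rain" appears in every document, so its idf (hence its score) is zero
assert scores[0]["rain"] == 0.0
assert scores[1]["dog"] > 0
```

The Cascading version in the series does the same joins and aggregations as pipes and taps over MapReduce, which is where it gets interesting at scale.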
Related to this, see also some recent content:
- "Intro to Data Science for Enterprise Big Data" (which made the front page of SlideShare)
- "Cascading for the Impatient" lightning talk slides
- "Multitool" (Bash scripting for MapReduce -- recently updated)
- "Sample Recommender" for stock picks in Twitter tweets
- "City of Palo Alto Open Data"
I will also be speaking at some upcoming conferences, all listed on my Lanyrd profile:
- Splunk conference in Las Vegas, 9/13
- Cloud Con Expo in SF, 10/3
- ACM Data Mining, 10/13
- SpringOne in DC, 10/17
- DC Data Science, 10/17
...and for more info about Cascading, please join us at the new Cascading Meetup group.
Geoffrey Moore opened his keynote at Hadoop Summit 2012 and promptly dropped the line: “You will remember this moment years from now.”
After a disappointing set of “sales pitch” keynotes on the first day of the conference (thanks, Yahoo! — but you knew that already), many people attending seemed to roll their eyes at yet another keynote talk this morning. Surprise!
I was grateful to hear Geoffrey Moore trash Advertising as an industry at risk. If I may paraphrase: permanently caught between bleeding edge and dinosaurs, yet irreparably dependent on a broken business model. [FWIW, the last three VCs on whom I’ve used that line looked back at me like I was some kind of alien slime-mold.]
By the middle of his talk, Moore put up a slide with a half-dozen bullet points. The slide listed some of the most disruptive technologies on which businesses — Main Street, in his terms — would come to rely in the early 21st century. Those include: collab filters, behavioral targeting, predictive analytics, fraud detection, time series, etc., etc. Outside of the intelligence community and the hedge funds, the significance of these technologies is not well understood yet. Word. Up. Bitches.
Moore’s “Final Thoughts” slide really hit home. He talked about data access patterns (system of record vs. log file usage vs. real-time analytics vs. etc.) and how those access patterns create feedback loops within an organization. Moore claimed this was core DNA for Google, Amazon, etc., which all major businesses must now embrace. Or else. [That's about 95% overlap with a slide I made for (insert recent past employer) during a 2011Q1 pivot. Two pivots later, I left without any particular next gig in mind — clearly needing to get involved with a different business team. Shortly before their CEO got, um, an "opportunity" to find work elsewhere. But I digress.]
So here’s a fun exercise for the interested reader: Pull up a 10-year chart for the S&P 500. Add to that CBS. Right… Add to that Barnes & Noble. Bokay… Add to that Wal-Mart. Got a few bumps, some upturns… nothing to write home about.
Now add Google. Now add Amazon. Now add Apple. One might argue that I’m cherry-picking examples; however, one must understand those three in particular to grasp the trajectory of how Data modifies Companies.
Think about it. Imagine rolling the clock back about 13 years, just a few years before that huge financial sea change got going. Think about perceptions at the time of Apple, Amazon, Google. Most of the mainstream buzz that I heard or read in 1999 was largely disparaging about those three. They didn’t make sense to the average joe, and that was a problem. I will contend that what made sense to a handful of computer science grad students, but not to the average joe, was considered a problem for Main Street. A multi-billion dollar existential problem for some, as it turned out.
At the time, it seemed like Apple would never get past the overwhelming popularity of Dell and Microsoft. Amazon didn’t have a way to justify its enormous P/E ratio — and was probably fluff in the long run. Google was considered interesting, but a little strange, with no clear path toward revenue.
Now think about what happened to the music industry, the mobile industry, the … well, I could go on, but Apple disrupted the pants off lots of established players. Entire industries were taken down by one company. Then consider what happened to retail. One word, a verb according to Geoffrey Moore: Amazon. Think about what happened to advertising. Googled, and not in a nice way either. Amazon and Google took off in 1997Q4 and 1998Q1 respectively, with Big Data projects which became enormous cash cows: Amazon’s recommender system (plus cloud infrastructure), and Google’s search+ads (plus cloud infrastructure). Arguably, those two are the reasons we were having a "Hadoop" conference. Apple perhaps seems less in category; however, Apple leveraged mountains of consumer data (plus cloud infrastructure) to drive its smartphones, App Store, etc.
Imagine the kinds of conversations that must have been occurring in the board rooms of CBS, Motorola, Barnes & Noble, Wal-Mart, etc., etc. Gone, gone, gone. Three relative underdogs became giants, tipping almost everyone else’s apple carts. (pun intended) Those three firms understood the power of leveraging their data and the urgency of real-time analytics, etc. Their competitors, mostly, did not. Just look at those stock charts.
According to Moore, that was the tip of the iceberg. Most of the Global 1000 is now on notice. Over the next decade we’ll see monumental failures. Winners and losers, as always, but the magnitude of the losers may be unexpected.
Moore’s central point in the keynote — since this was a Hadoop conference — was that the Hadoop tech stack and business ecosystem is maybe a year ahead of the proverbial “crossing the chasm” moment. Ergo his lead line.
Notably, enormous cultural changes of the 1990s and early 2000s have percolated through personal expectations among those coming up in the ranks. That’s happened more notably and with more impact outside the US than within it. He pointed to the “digitization of culture”, where access has become nearly universal, where broadband created emotional dimensions (Facebook, Pinterest, etc.), where mobile makes the experience ubiquitous regardless of socio-economic position.
Meanwhile, the corporate culture of how to “get stuff done” within enterprise has not kept up. There’s no Facebook for enterprise, no YouTube for enterprise, etc. [Well, actually, there are — and they are each headquartered within a bike ride of my home near the Mountain View / Palo Alto border — but you haven’t heard about them. Yet.]
Meanwhile, Facebook-esque consumer Internet companies of the world are too caught up in their own weirdly distorted realities to solve the larger business problems. Business problems where the solutions will inevitably derive from the social networks’ innovations. Oops.
In Moore’s vaunted opinion, those conditions won’t hold much longer. There will be winners. There will be losers. Big ones.
Meanwhile, for people of my ilk, Moore smiled and predicted: “This should provide at least a decade of entertainment for everyone present.” Fundamental business reasons are simple: enormous change ahead but precious few who are trained and experienced to navigate it.
My first key take-away is based on the observation last year that Enterprise giants bumbled into Hadoop Summit 2011 in a huge and awkward way. Oddly, the logo is an elephant, #justsayin
In contrast, this year was really smooth, completely professional, far too expensive … but almost all about data infrastructure in a world where nobody wants to utter the word “Oracle”.
Mind you, neither of the two main “enterprise” keynote speakers from last year still has the same job. #justsayin
Also, notable Hadoop practitioners were noticeably absent. In fact, most of the cast and crew of Strata seemed to be missing. A particularly popular social network has been burning the midnight oil to make Hadoop perform backflips — they, like, gave a couple of talks and seemed to vanish.
Let me put this in other words: several hundred million dollars have been invested by VCs (and angels) to recreate an industry in the image of Redhat and Yahoo!
Wow, did anybody think that would be a particularly good idea? No, but it’s the herd mentality in practice. Even after the 5th beer I’d still recognize that strategy as not particularly wise. Feels like when you talk with an ex-convict, and they drop a line “Yeah, I made some poor choices long ago…”
My hunch is those data infrastructure plays are mostly tax write-offs (for the “early adopter” part of Geoffrey’s famous curve) at this point.
Moore underscored how real payouts come when key verticals catch fire — with serious domain expertise leveraged. LinkedIn perhaps got close, but now it almost feels like a spamming broadcast system for HR and BD departments. We’ll see “Big Data” killer apps which mean something to lots of people. Beyond the GOOG+AMZN+AAPL tip o’ the iceberg. They will come from people who have sophisticated backgrounds in Stats + ORSA + distributed systems + functional programming + DevOps, people who can also communicate well with actual business leaders. Not those employed by some halfwit B-school grad who’s posturing as the next Steve Jobs, when in reality he drinks bad beer at a lame, faux-hipster sports bar while watching cable television. Or something. Dude, hop on your fixed-gear bike and standstill/pedal your sleeve tattoos out of here.
Translated: the proverbial ignite moment, that spark of innovation, is not going to come from the likes of a Cloudera or a Platfora or a (banal noun)-(o|e)ra… But it’s going to come, probably not many moons away. It will be in apps.
the sound of disruption
Thirty years ago, I went into a field called “math science”, i.e. how to build predictive analytics as software apps. Stanford — the Statistics department chairman, Bradley Efron, in particular — had put together an interdisciplinary degree which combined math, statistics, operations research, programming, engineering, etc. At the time, most of my peers in the program went on to become insurance actuaries. I went instead to do graduate work in machine learning and distributed systems.
For nearly two decades, most employers couldn’t have cared less about any quantitative background. They wanted C++ software engineers working all day on APIs from Sun or Microsoft or Oracle. Or they wanted managers. Then, in about 2000, came the sea change.
Right about the same time as ticker symbols for Apple and Google and Amazon were strolling up to their respective launchpads, some people began to look at my resume and ask a different line of questions.
I’ll always remember the first: a microchip vendor — one which makes electronics for several products you’ve purchased — was getting squeezed by Intel and their silicon compiler vendor. Critical features were being deprecated, specifically to put this second-tier player out of business. The company was on notice. They had to find a proverbial needle in a haystack: out of tens of thousands of circuit designs, they had to identify the 1% which would no longer be licensed — then redesign those. Quickly.
An internal team at the company had tried, but given up. Too much data for their techniques, it would’ve taken years to resolve. The company hired an electronics consulting firm in Austin, and engineers went to work, but gave up as well. Too much data, not enough signal. I got called in, as a “Whatever, just see if you can get anywhere” last-ditch effort. About 20 lines of Perl and one relatively simple equation later, I dumped my results into a scatterplot.
One of the lead circuit designers picked up my plot off the laser printer and began laughing. Loudly. The whole office heard him.
His manager grew annoyed: “What?! Why are you laughing?!”
Engineer: “He found it.”
When I turned in my invoice, the manager glared. “Look,” he said in a growl, “Just go somewhere for about three weeks. Bill us the whole time. Then come back and turn that in.”
My brows furrowed; this was a high-dollar rate for 2000.
“If you don’t pad that damn invoice…” he paused, “You’ll make both us and our customer look like complete fools. Piss a lot of people off.”
That’s the sound of Disruption.
More than a decade later, the summary graf of my resume reads like bullet points from Geoffrey Moore’s slide. Collab filters, anti-fraud classifiers, predictive analytics, etc. Even in the past few years, when HR people have read that resume, several looked up with a frown, said they thought that kind of work was better suited for business analysts — yadda, yadda, yadda, keep following the herd: you put the “botch” in “beotch”.
At a time when lots of businesses (start-ups as well as enterprise) are starving because they cannot hire Data Scientists, I’ve been busy building teams. Teams which delivered $MM results. I’ve hired about thirty people onto Data Science teams within the past few years — at a time when many start-ups would feel lucky to hire one. #justsayin
Recently I read one of the most imbecilic essays ever, via Forbes/Quora: “What Would Be The Global Impact If Wal-Mart Abruptly Shut Down?” Essentially, a hagiography stating that Wal-Mart is too big to fail, that the consequences for the US economy and the global economy would be catastrophic. Translated: may require an enormous bailout, soon.
[In case you hadn’t guessed, I just threw up a little bit in my mouth.]
What. A. Fucking. Moron. The reality is that Wal-Mart hasn’t been doing so well over the past decade. Not if you peel back enough layers of PR. Not since they tried to bamboozle the LA city council. And failed. Moreover, folks at Amazon couldn’t care less which Senators or SecState/former-first-lady the execs in Arkansas have in pocket. Bezos has positioned Amazon to take over 150% of Wal-Mart’s business the picosecond after Bentonville implodes. Sears and Target have reinvented themselves specifically for that very instant. So long, good riddance. Remember the point about the Global 1000 on notice? About the importance of business fundamentals?
sears, a.k.a. that web site which kinda looks like amazon
A third keynote talk that day was by the Sears CTO, Phillip Shelley. I had packed up my laptop and backpack, and was getting ready to walk out of the auditorium. After his first few sentences, I put my stuff back down and started taking notes.
Dr. Shelley mentioned how Sears started as a mail-order business a century ago, though more recently got completely kicked by another “catalog” called Amazon. Now they are leveraging Hadoop + R + Linux/Xen private cloud (srsly, this is from the Sears CTO?!?) to reinvent their business with 100x more detail on regional pricing models. Literally calculating personal pricing discounts for individuals, multiple times per day, specifically for mobile.
Sears: core algorithm moved from [6000 lines of COBOL on a mainframe with a 3.5-hr batch window] to [50 lines of Hadoop app on Linux with an 8-min batch window], while reducing TCO for enterprise IT by two orders of magnitude. So much success that they’ve spun it out as a new business line called MetaScale.
Brilliant strategery by Sears. Some seriously high-powered Data Science talent walked out of that auditorium musing how they wished their VP of Engineering were half as progressive as Sears. Srsly? Um, that’s called a PR coup. [Literally at the same moment, Wal-Mart had HR droids spamming the audience with whispers and rumors of lucrative salaries. Gak.]
emergence of the confidence economy
What’s the deal? It’s about confidence. Those giants in the Global 1000 which Geoffrey Moore says are on notice? They got that way by believing that business is largely about who barks the loudest, barks the longest, and cuts the most deals under the table. The proverbial alpha male in a wolf pack.
Wal-Mart would be a prime example, in my opinion. Their business is predicated on fundamentals which simply do not hold. Misplaced confidence. Thanks to people like Hillary Clinton, Wal-Mart has gained much influence on the House and Senate floors and the halls of the State Department and the UN assembly. In other words, so long as we manage to keep fuel costs artificially low, Wal-Mart’s market valuation will keep growing. So long as we believe that bullying vendors, conducting intelligence operations against the rest of your ecosystem, etc. — that these kinds of practices are ethical and sound in the long-run, then Wal-Mart will keep growing. Bullshit. Go look at that stock chart again. Wal-Mart is about tall white guys in dark suits, acting like complete pricks, destroying and plundering anything they can get their grimy paws on. Richard Gere in Pretty Woman, before he gets Julia Roberts. And not much more than that. On notice.
Moore is pointing out, in my opinion, that the issue at hand is about uncertainty. The point of establishing a corporate charter was always to externalize risk and perpetuate wealth for shareholders. That was true four centuries ago, when the first transnational was established, and has been true ever since. The modus of that mechanism is a process called sublation. The train wreck for sublation is uncertainty. In an environment where uncertainty holds sway, having real-time analytics from petabytes of customer data wins out over having a Senator in pocket. Any day of the week. The antidote for uncertainty is confidence. While there had been a regime of an “Attention Economy” extant for the past two decades or so, we’re now entering a new regime of the “Confidence Economy”.
Here’s the deal: people like me, those of us in Moore’s audience, have been the “secret sauce” fueling the rise of Amazon, Google, Apple, etc. We use techniques which are mostly not well understood outside of Langley and the hedge funds. The tools of contemporary corporate assassins. Guys in suits who act like pricks in lieu of practicing business fundamentals — those guys are our targets. The modus is Disruption. If you have an MBA or a CxO title and not much else to back it up, I put food on my family’s table by being a sniper paid to hunt you. Lots of *great* food. And some of the best wines available. I shake the tension out of my hands, correct for wind and distance, draw a bead, take a deep breath, squeeze the trigger. Kill shot.
The challenges faced by Data Scientists are daunting. On the one hand, the skills are scarce: most mathematicians lack enough solid engineering to create killer apps; conversely, most engineers lack enough math to make any headway on the business data; and most business analysts lack enough of either to be worth hiring. Data Scientists provide all three areas of expertise: the engineering and the math and the business insights to contend with mountainous torrents of data, and move the needle. On the other hand, Data Scientists must also speak truth to power. In any given business, there will be winners and losers. Executives, people accustomed to their own power, taken down. Meanwhile, we Data Scientists come prancing into a business, we do our magic, and consequently we point out which executives are bullshit and must be “executed”.
The reason why I’ve built Data teams at a time when others are starving is simple: confidence. Sure, I’ve logged three decades of machine learning, statistical modeling, data management, distributed computing, etc. When I talk with a grad student about their work, I can tell them in 25 words or less what they need to do on their first day at work to become regarded as a great asset to the team. They already know the techniques, but crave confidence. Into the trenches, fresh-out, having to speak truth to power. They’ll be placed into some faltering business unit, run some detailed analysis, and point out that the VP who’s been arguing loudly was completely wrong for the last N years and his/her ego cost the company several $MM. You can bet that those execs will return fire. However, a person like me is confident that we can get a kill shot. I show new folks how to draw a bead and squeeze the trigger. I’ve been doing it for a long while, and will be for a long while more.
Snipers have an eerily pragmatic sense of confidence. And, by the way, that’s a peculiarly difficult job. Praise goes out to the men and women who serve their countries in uniform — when the cause is just. [FWIW, before tackling the challenges of data+science, I wore a military uniform and carried a rifle. Sniper training has become invaluable.]
#1: Enterprise suffers because so many people in the corporate leadership ranks (or rather, amongst those clawing and scrambling to make their way into the corporate leadership ranks) consider themselves to be a different caste — if not a different species altogether — from the rest of us who do not have a salaried position with a transnational. In a “Let them eat cake” world fraught with trillion-dollar bailouts, that’s not a particularly good way to future-proof. Moreover, this is why VCs are vital… to demolish that kind of hubris via constructive Disruption. #justsayin Word. Up. Bitches.
#2: Enterprise tooling, which is now mostly dependent on JVM-based apps, suffers because it has embraced “Convention over Configuration” … CoC has its place. I can imagine that it’s an excellent idea for heart surgeons to have a standard toolset, with scalpels in the exact same positions, etc. CoC is not a particularly good way to manage complexity and uncertainty, because it simply displaces major problems into the build system. Ultimately, it fails too much and impedes spin-up. Which, I believe, represents an enormous, ticking time bomb in Enterprise. Here’s a challenge: Show me a metric for the median period it takes in your business for a newly hired engineer to push code changes into production use adopted by at least 80% of your customer base. Now show me a metric for the median period it takes in your business between the point where a product manager identifies a needed feature and a newly hired engineer is ready for spin-up. From those, I’ll make a prediction for how well your business will survive the “on notice” condition which Geoffrey Moore described. Better clues for navigating complexity and uncertainty can be found in the works of Ilya Prigogine or Stephen Wolfram. To wit, functional programming is more likely to address complexity, real complexity, and also more likely to attract top talent. CoC, not so much. Perhaps your enterprise business addresses Main Street instead of Early Adopters... Recall that Google and Amazon and Apple crossed the chasm by recruiting armies from grad students — at a time when most other people erred on the side of average joes. Remember the point about real-time analytics? It counts for training your people, then retraining, and retraining, constantly — to grapple with uncertainty. Kill shot. Global 1000.
#3: MapReduce will be unrecognizable within three years. Hadoop Summit will become something quite different after Hadoop bifurcates and gets sublated into Something Else. For example, it would not be difficult to use the Simple Workflow Service from Amazon AWS to implement MapReduce using the core part of Cascading… a different kind of MapReduce, which is not constrained by JVMs… which could scale much more gracefully and robustly… which could out-perform Google infrastructure and avoid attempting to re-create the industry in the image of Yahoo! At which point, one could deploy functional programming blocks at enterprise scale, without having to rely on the morass of enterprise build tools. Hmmm… may need to get a term sheet for that one.
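To make the “functional programming blocks” point concrete: the MapReduce contract itself is just a map, a shuffle (group-by-key), and a reduce, expressible in a few lines of any functional-flavored language with no JVM in sight. A toy, single-process sketch – function names are mine, purely illustrative, and obviously this elides everything a real workflow service would handle (partitioning, fault tolerance, distribution):

```python
from collections import defaultdict

def map_reduce(inputs, mapper, reducer):
    """mapper: item -> iterable of (key, value); reducer: (key, values) -> result."""
    # map phase: emit (key, value) pairs from each input item
    mapped = (kv for item in inputs for kv in mapper(item))
    # shuffle phase: group values by key
    groups = defaultdict(list)
    for k, v in mapped:
        groups[k].append(v)
    # reduce phase: fold each group down to one result
    return {k: reducer(k, vs) for k, vs in groups.items()}

# Yet another Word Count
lines = ["the quick brown fox", "the lazy dog"]
counts = map_reduce(lines,
                    mapper=lambda line: [(w, 1) for w in line.split()],
                    reducer=lambda k, vs: sum(vs))
assert counts["the"] == 2
```

The mapper and reducer are pure functions over immutable pairs, which is exactly why this pattern deploys as composable blocks rather than as a monolith of build-system conventions.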
Geoffrey Moore, we may have a few answers for your questions.