2013-12-22

Newsletter Updates for December 2013

Hard to believe it’s been since AMPCamp 3 in August that I’ve had an editor buffer open, collecting notes to write up… New Year’s Resolutions include writing a newsletter on a monthly basis!

AMPCamp 3 was a big success: over 200 people attended a two-day marathon of hands-on work with the Berkeley Stack. Spark Summit doubled that attendance, and the upcoming Spark Summit 2014 this summer is expected to double again. We had Mesos talks at each, and got to meet lots of interesting people in the community.

Travels this summer and fall took me through Boulder, Chicago, and NYC, then back to the Bay Area just in time for the Mesos Townhall. Dave Lester wrote up an excellent summary of the #Townhall @Twitter.

Hands down, my favorite event was the Titan workshop in Santa Clara by Matthias Broechler. Our course sold out – even amid the impossible commutes of the recent BART strike. Companies sent engineering teams for a deep dive into Titan and graph queries at scale. I’ve never learned so much about both HBase and Cassandra in any other forum: Matthias led a tour de force through systems engineering insights. The magic lies in Titan’s vertex and edge representations and its efficient denormalization at scale. Given that you probably have data in HBase or Cassandra already, and that your use cases are most likely graphs anyway, the Titan abstraction atop provides an excellent optimization. Highly recommended.

Strata

Strata NY in October was huge. The conf has grown so much that it must change venues next year – but what a rich environment. I couldn’t walk 5 meters without getting pulled into an incredible discussion. Bob Grossman’s talk Adversarial Analytics 101 was a big eye-opener, articulating what many of us have grappled with for years without having the language to convey those issues and concerns to stakeholders.

A talk from Spotify reinforced an oft-repeated theme: some of the most costly issues with Hadoop at scale involve mismatches between the Java code and the underlying OS. For example, Spotify traced catastrophic NameNode failures – jobs which ran fine for everybody except one particular user – to Hadoop’s naive use of Linux group IDs. In other words… wait for it… Hadoop is not an OS! Datascope Analytics and IDEO teamed up on Data Design for Chicago Energy – don’t miss the part about their Curry Crawl :) Also, MapR hosted a cocktail party in the penthouse that was an international who’s who of Big Data.

I look forward to being involved with Strata confs – Beau Cronin and I will be co-hosts for the Hadoop and Beyond track at Strata SC in Feb. Flo Leibert and I will be teaching a 3-hour Mesos tutorial. Secret handshakes for discount codes are in effect :) Also, Strata EU next year will be held in Barcelona. Looking forward to that!

Upstairs, Downstairs

Criticism heard at many conferences involves an “upstairs, downstairs” effect. Upstairs: speakers give interesting talks, perhaps about use cases similar to what you need, but often about obscure cases that are far afield from your needs. Downstairs: vendors push products that address the “nuts and bolts” of cluster computing, and emphasize vague nuances about how they differentiate. Not so much about use cases. People from Enterprise IT organizations walk away shaking their heads: rocket science upstairs, 150 lookalike Big Data vendors downstairs… perhaps the best conversations are to be found on the escalators in-between. Okay, we need to emphasize use cases in conference content – but what about the content of company experiences?

For the past 4–5 years or so, we’ve been watching this evolve. Many companies scramble to adopt a Big Data practice based on Hadoop – their “journey”, to co-opt the marketing lingo du jour – without intended use cases. One might call this Hadoop-as-a-Goal. The WSJ recently ran an excellent article by Scott Adams, creator of Dilbert, which one can paraphrase as “Goals are for losers.” Adams points out that establishing “goals” is a fast way to set yourself up for failure, while cultivating a “system” is a recipe for success. I couldn’t agree more. It cuts to the heart of the problem with Hadoop-as-a-Goal.
Let me ask: Do you have a system in place to leverage your data at scale? Or do you have a strategy based on what the vendors are providing, punctuated with a goal of converting those tools into a practice?

Stated another way, linear thinkers tend to set goals and flourish during a boom, when markets are steady and heading up. In contrast, I have a hunch that we are in the midst of experiencing years of significant disruptions, per market sector. Much of the disruption is fueled by leveraging data at scale. Non-linear thinkers tend to flourish in those conditions. Leading firms, e.g., Google and Amazon, have consistently flourished during down markets, thriving with approaches (systems) which seemed counter-intuitive to prevailing wisdom (goals).

I get to meet lots of people asking for advice about Big Data, and get to compare notes within lots of different organizations. One observation holds: those who prefer to talk about their favored tools generally don’t get far in this game, while those who talk about their use cases (over tooling) are generally the ones succeeding. That observation dovetails with the criticisms about conferences. Focus on having a system, not a set of goals.

Intro to Machine Learning

At the recent Big Data TechCon I did a short course called “Intro to Machine Learning”. Many thanks for the feedback and thoughtful comments. We’re expanding this into a full-day ML workshop through Global Data Geeks – first in Austin on Jan 10, followed soon after in Seattle on Jan 27.

Part of the content for this new ML workshop focuses on history. The origins of Machine Learning date back to the 1920s with the advent of first-order cybernetics, and we follow the emerging themes into contemporary times. A few kind souls came up after the talk, mentioning how that history stuff seemed like a waste of time at first, but then became quite interesting. It really helps establish context and see a bigger picture for an arguably complex topic.

Part of the content focuses on business use cases and team process. There are plenty of people doing excellent Machine Learning MOOCs: Andrew Ng, Pedro Domingos, et al. I recommend those. However, this new ML workshop attempts to complement what those MOOCs tend to miss: how to put the algorithms to work in a business context. Have a system, not a set of goals.

Monoids, Building Blocks, and Exelixi

Sometimes, other authors beat me to the punch by articulating what I’ve been struggling to put into words. Sometimes they do a fantastic job of that. One recent case is the highly recommended article The Log: What every software engineer should know about real-time data’s unifying abstraction by Jay Kreps on the LinkedIn Engineering blog. I’ll call attention to the section titled “Part Four: System Building”, about building blocks for distributed systems. Jay and I perhaps disagree about Java as a priority, but I’ll bank on his words there in just about every other aspect. Jeff Dean @Google gave an excellent talk at ACM recently on a related theme, called Taming Latency Variability and Scaling Deep Learning. Also very highly recommended!

The building blocks have emerged. Once upon a time, there was an exceptionally popular OS called System/360, and business ran on that. IBM pwned the world, so to speak. Eventually, Unix emerged as an extremely powerful contender – and why? Because Unix provided building blocks. I’ll contend that if the statement “Hadoop is an OS” has any basis in reality, then it’s more analogous to System/360 – at least for now. However, the “Unix phenomenon” equivalent for distributed systems is emerging now, and that will likely be a matter of open source building blocks, not a monolithic platform.

I’ll be focusing on this “Building Blocks” theme in upcoming meetup talks, workshops, interviews with notable people in the field – plus at Data Day Texas on Jan 11, in an O’Reilly webcast on Jan 24 about running Spark on Mesos, and at my Strata SC talk about Mesos as an SDK for building distributed frameworks. See you there.

Another recent case of an excellent author beating me to the punch is the paper Monoidify! Monoids as a Design Principle for Efficient MapReduce Algorithms by Jimmy Lin at U Maryland and @Twitter. Highly recommended!

This paper provides the punchline for MapReduce. One of the odd aspects of teaching MapReduce to large numbers of people is watching their reactions to the canonical WordCount example … Most people think of counting “words” in SQL terms – in other words, using GROUP BY and COUNT() to solve the problem quite simply. In that context the MapReduce example at first looks totally bassackwards. Even so, the SQL approach doesn’t scale well, while the MapReduce approach does. What the canonical example fails to clarify is that it represents a gross oversimplification of something which is truly elegant and powerful: a barrier pattern applied to using monoids. Once a person begins to understand how all the WordCount.emit(token, 1) quackery of a MapReduce example is really just a good use of monoids, and that the “map” and “reduce” phases are examples of barriers, then this whole business of parallel processing on a cluster begins to click!
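Since that point lands better in code than in prose, here’s a minimal sketch of my own (not from Jimmy’s paper) using Python’s Counter, which happens to form a monoid: the empty Counter is the identity element, and + is an associative combine. That associativity is exactly what lets per-mapper partial counts get merged in any grouping, in any order, on any node, behind the barrier:

```python
from collections import Counter
from functools import reduce

def mapper(text):
    # the WordCount.emit(token, 1) step, pre-summed locally (a "combiner")
    return Counter(text.split())

def wordcount(documents):
    partials = map(mapper, documents)                       # "map" phase
    return reduce(lambda a, b: a + b, partials, Counter())  # "reduce" barrier

print(wordcount(["to be or not to be", "be here now"]))
# -> 3 x 'be', 2 x 'to', 1 each of 'or', 'not', 'here', 'now'
```

Swap Counter for any other monoid – sums, maxes, sets, sketches – and the same map/barrier/reduce shape still applies.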


To paraphrase Jimmy, “Okay, what have we done to make this particular algorithm work? The answer is that we’ve created a monoid out of the intermediate value!” Truer words were seldom spoken.

Putting on my O’Reilly author hat, I’ve really wanted to show a simple example of programming with Apache Mesos. Python seemed like the best context, since there are lots of PyData use cases but not many distributed framework SDKs for Python yet – though, keep an eye on Blaze. So I wrote up a Mesos tutorial about this recently as a GitHub project, partly for Strata content and partly for industry use cases. The project is called Exelixi, named after the Greek word for “progress” or “evolution”. By default, Exelixi runs genetic algorithms at scale, atop a Mesos cluster. In the examples, we show Exelixi launching atop an Elastic Mesos cluster on Amazon AWS.

Even so, Exelixi is not tied to genetic algorithms or genetic programming. It can run distributed workflows on a Mesos cluster by substituting one Python class at the command line. The distributed framework implements a barrier pattern, orchestrated across the cluster using REST endpoints. Consequently, it is capable of performing MapReduce plus a much larger class of distributed computing that requires a bit of messaging – which Hadoop and Spark don’t handle well. Python provides quite a rich set of packages for scientific computing, so I’m hoping this new open source project gets some use at scale. Moreover, it provides an instructional tool: showing how to write a distributed framework in roughly 300 lines of Python. You’ll even find a monoid used as a way to optimize GAs and GPs at scale. I could go on about the use of gevent coroutines for high-performance synchronization, about the use of a hash ring, how Exelixi leverages SHA–3 digests and a Trie data structure – but I’ll save that for some upcoming meetup talks.
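For a feel of the barrier pattern in miniature – a toy, in-process sketch of mine, nothing like Exelixi’s actual REST-based orchestration across a Mesos cluster – consider N shard workers each evolving a generation, where none may begin the next generation until all have finished the current one:

```python
import threading
from random import random

N_SHARDS = 4
barrier = threading.Barrier(N_SHARDS)  # all shards arrive before any proceeds
best = [0.0] * N_SHARDS

def shard_worker(i, generations=3):
    pop = [random() for _ in range(10)]  # toy per-shard population
    for gen in range(generations):
        # "evolve": keep the top half, refill with fresh candidates
        pop = sorted(pop, reverse=True)[:5] + [random() for _ in range(5)]
        best[i] = max(pop)
        barrier.wait()  # the barrier: generation gen+1 starts only after
                        # every shard has completed generation gen

threads = [threading.Thread(target=shard_worker, args=(i,)) for i in range(N_SHARDS)]
for t in threads: t.start()
for t in threads: t.join()
print(best)
```

In Exelixi the equivalent synchronization point is reached via HTTP calls among nodes rather than an in-process threading.Barrier, but the control flow has the shape shown here.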

The implications of building blocks for distributed systems? Not unlike the notion of interchangeable parts – and gosh, that was a game-changer for industry, speaking of “factory patterns”. Brace yourselves for much the same as we “retool” the industrial plant across all sectors to leverage several orders of magnitude higher data rates.

There are further implications… If you look at how Twitter OSS has retooled its cluster computing based on open source projects such as Mesos and Summingbird – again, with lots of those monoids wandering about – an interesting hypothesis emerges. Note that much of the “heavy lifting” in ML boils down to the cost of running lots and lots of optimization via stochastic gradient descent. Google, LinkedIn, Facebook, and others fit that description. Looking a few more years ahead, one of the promises of quantum computing is the ability to knock down huge gradient descent problems in constant time. An engineering focus on leveraging monoids would certainly pay a high ROI in that scenario.
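To make the monoid/gradient connection concrete, here’s a toy sketch (mine – not Twitter’s or Summingbird’s code): per-shard gradient contributions for least squares form a monoid under elementwise addition – the identity is the zero vector, the combine is associative – so partial sums can merge in any order, on any topology, before a descent step gets applied. Strictly speaking this shows batch gradient descent over shards; in the SGD setting the same combine applies per minibatch:

```python
import numpy as np

def shard_gradient(w, X, y):
    # least-squares gradient contribution from one data shard
    return X.T @ (X @ w - y)

def descent_step(w, shards, lr=0.1):
    # monoidal combine: sum() merges per-shard partials associatively,
    # starting from the identity element (zero)
    g = sum(shard_gradient(w, X, y) for X, y in shards)
    n = sum(len(y) for _, y in shards)
    return w - lr * g / n

rng = np.random.default_rng(0)
shards = [(rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(4)]
w = np.zeros(3)
for _ in range(200):
    w = descent_step(w, shards)
```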

Minecraft

Ah, but quantum computing … how far off is that? It’s still a matter of research, true. However, one might take a hint from a recent twist… The Google Quantum AI team released a Minecraft mod called qCraft, with the intent of identifying non-linear thinkers who are adept at manipulating quantum problems – and identifying them early. Those of you who have kiddos probably understand that 10-year-olds play Minecraft. So let’s do the math … some subset of these avid Minecraft players will become Google AI interns within less than a decade.

Talk about “Have a system”, indeed.



I’ll bet that the event horizon is more like 5 years. Which means that 5+ years from now, I’ll be hiring engineers and scientists who’ve logged 5+ years of experience with graph databases, monoids, abstract algebra, sparse matrices, datacenter computing, probabilistic programming, etc. And Minecraft :) I will be passing on engineers who’ve spent the past 5+ years rolling up log files in Apache Pig. #justsaying

While we’re on the topic, check out my friend Mansour’s plugin for Minecraft on GitHub. The plugin visualizes the results of large-scale machine learning analysis inside Minecraft :) This turns the process into a kind of game. Hint: Mansour and I both have kiddos in that age range.

Events

I’ve been working with Global Data Geeks to interview notable people in the field. Most recently, we had excellent discussions with Davin Potts, an expert in computer vision and Python-based supercomputing, and also with Michael Berthold, the lead on KNIME. On deck, we’ll have another interview coming soon with Brad Martin, an expert in commodity hardware for extremely secure cryptographic work for consumer use.
Also, keep a watch for updates about other events coming up, including a Twitter OSS conf this Spring.


On that note, I am available for consulting. Also for weddings and parties. And I wish you and yours very Happy Holidays. See you in the New Year!

2013-09-13

Denver trip, Sep 24-27 2013

Got a Denver trip coming up this month, Sep 24-27, 2013 … we'll be talking lots about Apache Mesos and Cascading / Cascalog / Scalding, with demos and some O'Reilly books to give away. If you're in the area, please come by some of these events!


Boulder/Denver BigData Meetup: Cluster Computing with Cascading and Mesos
Wed 9/25 6:00pm



Hands-on Intro to Data Science
http://intro-data-science-broomfield.eventbrite.com/
Thu 9/26 8:30am - 4:30pm

Omni Interlocken Resort
500 Interlocken Blvd, Broomfield, CO 80021


Drinks and Data
http://drinks-and-data-broomfield.eventbrite.com/
Thu 9/26 5:30pm - 8:30pm

The Tap Room
500 Interlocken Blvd, Broomfield, CO 80021


The two Broomfield events are hosted at the same venue as GlueCon. The meetup in Westminster is just a few miles away.

I'll be staying in Boulder, and will have informal office hours on Wed and perhaps Fri morning at some coffeehouse in downtown Boulder – to schedule, please message me @pacoid on Twitter.

Turn your data center into one big computer

2013-08-26

Newsletter Updates for August 2013

Lots of talks, lots of conferences, lots of writing. Here are the latest updates about scheduled events, along with pointers to some of the best content that I've been studying lately.

YARN has a Docker-sized hole. Brace yourselves, in this post some friends at [ list the Hadoop vendors, other than MapR ] may get upset. Good friends all, I wish them the best. However, some things must be said. More about that in a moment. (And if I don't surface before Nowruz, please notify my family.)

It's been a busy month: Portland, Austin, Chicago. While we often go out for beers following a talk, Aaron Betik @Nike had the brilliant insight to hold our PDX meetup inside a brewery. Three birds, one stone. Many thanks to Aaron, Nike, Thomas Lockney @JanRain, Todd Johnson @Urban Airship for putting together that event. And to everyone at O'Reilly Media for such a great OSCON!

Girish Kalathagiri and I wrote a paper about ensemble modeling for the first-ever PMML Workshop at KDD 2013 in Chicago. Fun to be on the ground floor, and I look forward to this expanding into a full conference. Many thanks to Bob Grossman, Walt Wells, et al., for their efforts arranging the workshop.

Mike Zeller from Zementis described a big spike in industry usage over the past year. PMML has become a pervasive standard for describing analytics workflows. I was particularly impressed with solutions from Augustus for Python, and with the recent updates to Knime – both presented at our workshop. In the broader scope of Enterprise data workflows, I've been learning much about the Python stack from Continuum Analytics – Anaconda, Wakari, etc. – plus excellent work on IPython Notebook, Pandas, Scikit-Learn, and related projects. Augustus (from Open Data Group) fits well within that context to provide PMML for model creation and model scoring – especially given Continuum's support for compiling and optimizing apps, and their attention to low-latency use cases. Similarly, Knime has a brilliant commercial integration in the context of ParAccel by Actian. I'm particularly impressed with Actian's use of Knime and Eclipse to provide a UI for Enterprise data workflows: great UX plus operationalizing. I got to spend some time recently with the teams at Continuum and Actian. Highly recommended.

So, the takeaway is, if you're working with an analytics stack that does not incorporate PMML support as a core feature, run. It's time to switch.

–––

In Austin, I got to meet with directors of the new Business Analytics program at the UT/Austin McCombs School of Business. Brilliant – a mere glance at their syllabus and schedule showed this program promotes the kind of aggressively fast-paced intellect which I crave. Which the industry craves for data scientists, too – in quantity. I'm looking forward to visiting their program again soon.

In Chicago, I got to meet with several of the current fellows in the Data Science for Social Good program – the most substantive discussions I've had since Los Angeles about Open Data and its applications in practice. There's a new conference emerging in the midst of those discussions. Looking forward to final projects from this summer's DSSG fellowship!


(cluster computing, à la friends @Hazelcast)

Meanwhile, we have "Intro to Data Science" workshops and talks coming up in Denver, SF, NYC. We'll book-end each trip with office hours and on-site demos for Mesos.

Speaking of the workshops, a few great new titles related to that material, which I recommend:

Analyzing the Analyzers, by Harlan Harris, Sean Murphy, Marck Vaisman; O'Reilly (2013): including an excellent analysis of the skills, experience, and viewpoints of data practitioners in industry.

Mondrian in Action, by William D. Back, Nicholas Goodman, Julian Hyde; Manning (2013): beyond the Pentaho analytics, this team has some of the most comprehensive insights available about Enterprise SQL usage.

Storm Real-Time Processing Cookbook, by Quinton Anderson; Packt (2013): including clear, concise sample apps that integrate the kinds of frameworks we use in production (Storm, Kafka, Cascalog, etc.) and not mere code snippets shown in isolation.

–––

One of the main themes in my workshops and lectures is that Apache Hadoop is almost never used in isolation. However, that perspective has been taken to task by the architects of Hadoop:

"Hadoop is the kernel of a distributed operating system, and all the other components around the kernel are now arriving on the stage."

"DistributedShell is the new WordCount."

Check out Arun's slide deck there. Seriously, Hadoop as a service endpoint? Wow, that's like Enterprise Java Beans redux. A kinder, gentler version of EJB. Or something. The notion of having to write 40 lines of Java to execute a Bash command line – now that's impressive! Just wow. I'd pay good money to be in the room when Cloudera and Hortonworks sales teams attempt to pitch this flavor of JVM nonsense to IT execs at Schwab. Their reality bubble leaves me wondering: have Doug or Arun even looked at a modern kernel, i.e., even bothered to notice Linux kernel commits, since their Java coding days back in 2006?

Because I'm quite certain that the people making Docker have. I'm quite certain that the people making OpenVZ have, too. Moreover, my friends who work in Ops for their profession are generally well aware of Docker and OpenVZ. However, YARN? Not so much…


Let's consider a historical trajectory:
  • COBOL, circa 1959… DoD accountants trying to tell Ops how to do their job
  • EJB, circa 1999… IBM/Sun/Oracle attorneys trying to tell Ops how to do their job
  • OpenStack, circa 2010… NASA app developers trying to tell Ops how to do their job
Notice a trend? And now, for the latest contender:
  • YARN, circa 2013… Yahoo! data engineers trying to tell Ops how to do their job
In my experience with real companies – companies with substantial amounts of revenue, that is, not the Silicon Valley definitions – anyone outside of Ops trying to tell them how to do their job had better have a CxO in their title. Preferably with a vowel in-between. In other words, keep the dilettantes the #&!% out of the Ops pit, to keep your employer from going out of business.

Meanwhile, what's the commercial reality of Big Data today? Well let's compare and contrast a couple big players in the space:

Actian is profitable, with a base of 10K customers, and north of $150MM annualized revenue. BTW, I'm a big fan of Amazon AWS Redshift, how about you? Runs great in production at scale. Love the economics of that.

Hortonworks recently took a B round, with $70MM in funding to-date. The company currently sells training and support; not clear how long it will take to become profitable. Oh, and their CTO recently left the building.

Hmmm… May need to check with some of my b-school friends about how to compare those fundamentals. Meanwhile, here are a few notes from Google about their experiences and benchmarks working with Linux kernels and very large-scale distributed computing.

–––

My first full-time job as an engineer was at a start-up in Sunnyvale in the summers of 1983-84. Our team ported Unix to a 32-bit minicomputer, and I wrote the sort package. If you've shopped at Ace Hardware or Pep Boys, your transactions probably went through our code. I'm grateful for that experience and learned a bit about operating systems by helping implement part of a popular embedded commercial distro.

At the time, many companies were totally absorbed in COBOL – still a good living back then. Those shops seemed oblivious to the changes underway: Unix minicomputers on the high end, workstations in the mid-range, PCs on the low end (e.g., Apple II running VisiCalc). These would soon wipe out COBOL programming like a forest fire raging through a stand of dry pines.

About that time I attended a seminar by one of my CS profs and his colleagues, describing a start-up based on work at the Stanford University Network, aka SUN, which had commercialized a new network-enabled minicomputer. They considered that work transformational. Read: disruptive. They were right. That same year, Steve Jobs gave a standing-room-only lecture on campus about what he'd learned from Xerox PARC – this was the Lisa, just prior to the Macintosh. He considered that work transformational. He was right. Meanwhile, a guy named Larry Ellison was busy hiring the bulk of our CS grads for his company, Oracle. You know the rest of the story. We could see big changes ahead circa 1983-84.

Over time, I've noticed how disruptive changes in computer technology tend to happen faster at the hardware and OS level, while the popular programming languages struggle to keep pace. Legacy frameworks encounter even more difficulty. I recall lectures at Stanford CS teaching pretty much the opposite: that it's simpler to evolve new technology at the language and application layer. While that notion may hold power in academia, it misses how programming languages become tied to culture in industry – and culture has inertia.

That leaves us where we are today with Java vis-a-vis Big Data. For better or for worse, YHOO circa 2006 made a big bet on Java as the principal language for Big Data frameworks. YHOO circa 2006 didn't last, but the frameworks that emerged from it persist. These days the Big Data vendors are thoroughly occupied selling Apache Hadoop to the Global 1000 as a glorious path into a shiny, data-imbued future. In other words, recreating the Global 1000 in the image of YHOO circa 2006. Some excellent work came out of Yahoo! from the mid-/late- 2000s, and I greatly admire my friends who were there and made that happen. Even so, Hadoop is based on work at GOOG circa 2002 – now a few generations behind. Living (literally) with GOOG in my backyard, my neighbors who work there smirk whenever the word "Hadoop" gets mentioned.

When I see talented people who have their heads stuck inside IntelliJ, who cannot think outside of a Java API, it seems sad. It reminds me of those poor souls circa 1983 poring over COBOL punch cards and teletype output. Sure, there's excellent software written in Java – java.util.concurrent comes to mind immediately.

I'm obviously quite a fan of JVM-based functional programming languages such as Clojure and Scala. However, when the "thought leaders" go around talking about Hadoop as an operating system, re-defining HA/low-latency service endpoints to be based on Hadoop and Java – it's COBOL all over again. Hold on tight.


Linux is an operating system. Unix is an operating system which had sophisticated features even sooner – though arguably, as of the Linux 3.x kernels, the playing field has become more level. Windows is also an operating system, albeit geared toward different usage. When people try to pitch "Hadoop as an operating system", they are trying to sell you snake oil.

The only thing that nonsense will buy the IT industry is even more of a guaranteed revenue stream for people building zero-day exploits in Beijing. Let's just suppose, hypothetically, that you run part of IT at Morningstar or Schwab. Imagine somebody trying to pitch you on JVM for low-latency services and cluster management. Are you going to bet your EVP's bonus on snake oil? Didn't think so.

The lesson is that enormous changes are afoot in terms of multi-core processors, large memory spaces, etc. These changes have huge impact on algorithm work for handling data at scale – without going bankrupt. Hadoop emerged in a day when spinny disks were king, multi-core was rare, and large memory spaces were expensive: that world is gone. Meanwhile, the modern kernels have kept pace with those industry changes. 

So, what is my point? Trying to resolve OS issues in the application layer is almost always a recipe for disaster. Caveat emptor.

I give Hadoop three years before it gets displaced. The lesson of Spark, in my analysis, is that rewriting Hadoop to be 100x better isn't hugely difficult, given the available building blocks for data center computing, based on the modern kernel. Meanwhile, the prognosis for Hadoop? Three years. On the outside.

Many thanks,

Paco

2013-07-22

Newsletter Updates for July 2013

Lots of talks, lots of conferences, lots of writing. Here are my latest updates about scheduled events, along with pointers to some of the best content that I've been studying lately.

Got to speak at Hadoop Summit last month, about the Pattern project. Great audience, lots of discussion about deploying predictive models at scale on Apache Hadoop clusters. BTW, one of the most compelling talks at Hadoop Summit this year was by Kevin Coogan, founder of AmalgaMood in DC. Kevin discussed their technology that leverages social signals and Open Data in predictive analytics when financial markets are not responding to what might otherwise be called analysts' consensus. Chaos, in other words. The Q&A for that talk in particular was enlightening and compelling: what a great way to scurry VCs up to the audience microphone.

We had several other excellent events recently: Seattle, Santa Clara, Los Angeles -- many thanks to hosts Surf Incubator, White Pages, and Factual. Plus, we had some private brown bags at LinkedIn, MapR, and other firms. My takeaway: great to meet many amazing people, a wealth of talent, and overall so much dedication to learning about this field.

To that point, there's lots of opportunity in Data Science roles, and along with that a big need for people who are adept at working across disciplines. People need enough programming background to leverage distributed systems, which enable the compelling use cases. People also need enough quantitative background to leverage the math required for high-ROI apps at scale. I find that many people attending the workshops have expertise in one field and want to augment with the other -- which is ideal for learning from each other.

Upcoming events: Portland, Austin, Chicago. Nike is helping to sponsor the Portland meetup, at Widmers -- so we won't have far to go for beers afterwards. And during. See the calendar at http://liber118.com/pxn/ for more details. BTW, if you want a 20% discount for OSCON, please use OS13FOS as a discount code. If you have a city or venue to suggest for upcoming workshops and talks, please let me know @pacoid

Toward a general thesis for Cluster Computing...

In other news, the Mesos open source project has graduated into a top-level Apache project. I recently took a position as Chief Scientist at a new company related to that project, called Mesosphere, in San Francisco.

There's a general thesis emerging, namely that we run large-scale apps based on cluster computing, because the data has become too big to fit on one computer anymore -- or, for that matter, the apps have become too complex to be handled by one computer, one person, one model. We require multi-disciplinary teams, leveraging cluster computing. Three areas of technology innovation get applied: Big Data, Data Science, and Cloud Computing. At a high level, applications generally leverage an abstraction layer, such as Cascading. At a low level, the more advanced organizations are leveraging cluster schedulers such as Mesos -- and, arguably, YARN is coming along too. For an excellent overview, see the Wired article by Cade Metz, Return of the Borg: How Twitter Rebuilt Google's Secret Weapon.

I foresee a general trend of smarter clusters, leading into higher ROI on Big Data apps. On the one hand, multi-tenancy in clusters helps balance the utilization curves and cut costs. On the other hand, reducing the "wire tax" of moving data products from batch clusters to web app clusters will help enable new areas of algorithm development, by reducing critical latencies. Companies such as Twitter and Airbnb have both built their tech stacks using these components, Cascading and Mesos. I spot a trend.

My intent is to show sample apps that leverage both layers. Also, one advantage of Mesos is that it manages resources for many different kinds of frameworks and apps: Apache Hadoop, Spark, MPI, Memcached, Nginx, Redis, Ruby on Rails, Python, etc. That's perfect for Big Data use cases that blend multiple frameworks. Stay tuned.

Drilling down into the math...

My lectures tend to emphasize a division between the rigor and formalisms of statistical theory versus the relatively ad-hoc praxis of what we categorize as machine learning. At top schools, grad students in one of those fields receive high salaries and VC funding straight out of school, while grad students in the other... not so much. That's a shame, because mission-critical apps at scale rely on both disciplines. Machine learning allows you to make billion-dollar mistakes, while statistics help you avoid billion-dollar mistakes. Take a look at any good search engine team and you'll see how both disciplines become necessary in practice. Together.



Another point is that machine learning approaches are a subcategory within optimization theory. As I research industry use cases for Data Science, Big Data, Cloud Computing, etc., it becomes clear that more emphasis on optimization is crucial for long-term industry evolution. Hanging around seminars at Stanford's Systems Optimization Lab, I've seen math innovations whose applications have huge implications for industry. As a case in point, John Deere probably won't be building a Facebook competitor any time soon; however, they must tackle hard optimization problems at enormous scale. Mathematicians are responding to that demand. Given that 40% of the world's population works directly in agriculture, plus the urgency of global climate change, etc., I tend to find Deere's domain more compelling than yet-another social network, ad network, social game, etc.

Enough soap box. Instead, I'd like to recommend an excellent resource in this area: Rob Zinkov's blog series Convex Optimized at http://zinkov.com/

My current homework, thanks to Rob, focuses on the Alternating Direction Method of Multipliers (ADMM). This builds on the previous theme of exploiting sparsity and matrix factorization atop Hadoop. It also addresses the need for more emphasis on general approaches in optimization theory, and less on the nuances of machine learning algorithms. Much study will be required before my sample apps begin to emerge. Meanwhile, here's to broadcast and gather as a more interesting pair of verbs than map and reduce.
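For a taste of what ADMM looks like, here's a minimal NumPy sketch of the classic lasso formulation from the Boyd et al. paper -- minimize (1/2)||Ax - b||^2 + lambda*||x||_1 -- in my own toy, single-machine form, not the distributed consensus variant:

```python
import numpy as np

def soft_threshold(v, k):
    # proximal operator of the L1 norm
    return np.sign(v) * np.maximum(np.abs(v) - k, 0.0)

def admm_lasso(A, b, lam=0.1, rho=1.0, iters=100):
    n = A.shape[1]
    x = z = u = np.zeros(n)
    AtA_rhoI_inv = np.linalg.inv(A.T @ A + rho * np.eye(n))
    Atb = A.T @ b
    for _ in range(iters):
        x = AtA_rhoI_inv @ (Atb + rho * (z - u))  # x-update: ridge-like solve
        z = soft_threshold(x + u, lam / rho)      # z-update: shrinkage
        u = u + x - z                             # dual variable update
    return z
```

The appeal for cluster computing: the x-update splits across data shards, while the z and u updates are precisely the broadcast and gather verbs mentioned above.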

Some well-known Big Data vendors are struggling to market Apache Hadoop as an OS. It isn't one, really. Hadoop may be a distributed file system plus some distributed computing, but calling it an operating system would be a great way to fail a CS midterm. More to the point, MapReduce as an abstraction is multiple layers removed from the needs of actual workloads, plus it's already 11+ years old. A better focus for these vendors might be to engage Professor Boyd, et al., to build distributed computing frameworks for commodity hardware based on ADMM principles, which support a wide range of commercial ML problems directly.

Doubtful that will happen. For example, that might require (gasp) supporting MPI features! Some engineer would need to prioritize studying math (gasp) in lieu of lobbying for 150 new commits on an Apache project! People could eventually recognize commercially interesting problems in Enterprise IT which are (gasp) not readily expressed as SQL! No, that won't happen. The Global 1000 is far too busy tooling up on Hadoop. Instead, I'd bet money on Anaconda, Spark, Titan, GraphLab, etc., leap-frogging the Hadoop-centric segment of the industry -- once the world beyond Silicon Valley wakes up to the realities of math emerging circa 2010. YMMV.


BTW, I must apologize, but it's become impossible to keep pace with email. While traveling, I find that Twitter works best for what must get said quickly. Semipublicly. I'll check Twitter often -- but email infrequently.

Many thanks,

Paco

2013-06-18

Newsletter Updates for June 2013

Lots of talks, lots of conferences, lots of writing. Here are my latest updates about scheduled events, along with pointers to some of the best content that I've been studying lately.

First, the Austin trip was awesome. Many thanks to all who attended the events -- and especially to those who participated in the workshop for excellent feedback. We've learned much from your suggestions, to use in the upcoming workshops.

Hadoop Summit is next week. I'm looking forward to catching up with friends who will be in town or (for those already in town) stepping out of the cubicles :) Come check out our talk about the Pattern project for PMML in Cascading. Other events are listed below for meetups and workshops in Seattle, Santa Clara, and Los Angeles, followed by OSCON in PDX. Please tell your friends in those cities. I also have an "official" home page now with a newsletter sign-up, event calendar, links, etc., at: http://liber118.com/pxn/

An update on the O'Reilly book about Cascading: it will now be in print  July 22 -- just in time for OSCON. Speaking of which, we should really plan for a meetup, drinkup, birds-of-a-feather, or something in PDX.

In other news, Cascalog also has a new home page at http://cascalog.org/  Paul, Soren, Sam, Bruce, et al., nice work! This is a much needed resource for the developer community.

Now for a few "extras" in the newsletter... I got to attend a Stanford talk earlier this month by Tim Davis, from the University of Florida. For the second time in one week, I heard the phrase: "There's no such thing as RAM anymore." That will become a theme for the architecture of algorithms on distributed systems. Following up on the sparse matrix techniques mentioned in my previous newsletter, this lecture was about sparse Cholesky update/downdate, LU factorization, QR factorization, software architectures based on GPUs for HPC parallel processing, etc. In other words, even if his name doesn't quite ring a bell, you've probably used his software daily: core libraries for linear algebra optimization in R, Matlab, Mathematica, plus some of the core algorithms for Google Street View, 3D Earth, as well as many of the Verilog vendors. Professor Davis curates an extensive collection of sparse matrices, which I highly recommend. Not only are these visualizations beautiful, but the examples represent important edge cases for sparse matrix factorization, used to evaluate new work on algorithms. Also note the museum exhibit coming up in October.

Tim Davis / University of Florida: Sparse Matrix Collection
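As a pointer for anyone who wants to poke at sparse factorization from Python, here's a tiny sketch of mine using SciPy -- which wraps SuperLU for sparse LU; Davis's own UMFPACK and CHOLMOD are available through other wrappers:

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import splu

# a classic sparse system: the tridiagonal 1-D Poisson matrix
n = 1000
A = sparse.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

lu = splu(A)       # sparse LU factorization
x = lu.solve(b)
print(np.allclose(A @ x, b))  # True
```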

Speaking of sparse matrix factorization and "No such thing as RAM," I thoroughly enjoyed a day at National Instruments in Austin, giving a talk about Big Data trends, and also learning about machine learning techniques at microsecond speeds on FPGAs. There is a form of convergence afoot in the industry, between the technology pyrotechnics of NI and other firms working on sensor arrays for the "Internet of Things", and what we've been doing with large-scale cluster computing. Check out what NI has to say about the practice of Big Analog Data™ -- I have yet to see people not drop their jaws reading those stats. Oddly enough, some of the most important techniques for machine learning algorithms at microsecond speeds have familiar cousins at petabyte scale, so I have a hunch there are many opportunities ahead based on this area of convergence.

Speaking of "Internet of Things" and real-world data, one of the most astounding projects that I've encountered in a long while is Protei. This is a truly innovative data platform: a multi-hull drone sail boat, built much like an eel, which changes how we clean up marine oil spills, recover "islands" of plastic waste, collect vital data from the radioactive waters off the coast from Fukushima, etc.  For another amazing innovation, Paragon Science received well-deserved press in the article, "Doctors and Social Oncology: The MDs most mentioned by their peers (breast cancer edition)".  I got to speak with Dr. Steve Kramer in Austin, and I'm quite impressed by capabilities of this technology for complex graph analysis and visualization.

In other news, I was grateful to attend the recent collaboration among Facebook, Twitter, and LinkedIn for the #Analytics@WebScale conference. We saw Facebook's first public announcement of Presto, a new approach to handling ad-hoc queries at very large scale, which is now displacing Hive. Congrats to Martin, David, and teammates on the Presto project -- looking forward to this work being released as open source later this year! And in a new twist on the term "cloud computing", Facebook data is getting so large, so complex, that they've even begun to experience weather conditions within their data centers :)

Finally, I learned a lot from the GOTO Chicago conference last month. Nathan Marz gave a talk on "Runaway complexity in Big Data systems... and a plan to stop it", with a video released. Dean Wampler (from the new firm Concurrent Thought) and Amanda Laucher discuss functional programming in another video from the conference. Good stuff!

If you have a city or venue to suggest for upcoming workshops and talks, please let me know @pacoid

Thanks!

Paco

2013-06-11

new "Official" home page

Here's a quick note to state that my new "Official" home page has been updated and now lives at http://liber118.com/pxn/

Check there for news about upcoming meetups, talks, and our world tour of "Intro to Data Science" hands-on workshops.  Plus, more links to interesting happenings related to large-scale data.

2013-05-06

Newsletter Updates for May 2013


Lots of talks, lots conferences, lots of articles. Here are the latest updates about scheduled events, along with pointers toward some of the best content that I've been studying lately.

[ Sign up for PXN Newsletter: "Data Workflows" http://liber118.com/pxn/ ]

The O'Reilly book is due in print June 22 – just in time for Hadoop Summit. An online "Rough Cuts" version should be available before then. Many people have asked if there will be a Kindle version? Yes. Yes, there will be! Many, many thanks to our technical reviewers for all the excellent feedback and suggestions. Also, speaking of EPUBs, the Liber 118 novel has been available in a Kindle version for a nice long while – please check it out :)

Events are listed below for a week in Austin, then CityCamp and Hadoop Summit here in SV, followed by OSCON in PDX. Should be a fun summer! These talks have ample doses of Big Data frameworks, Data Science use cases, Machine Learning at scale, Open Data apps, etc. For a sample, check out a recent article about the Pattern open source project in the Software Developers Journal. For another sample, check out this recent Airbnb Tech Talk. Functional programming as a way to tackle Big Data projects has resonated well lately in these talks – at Stanford and CMU as well. Also, the chef at Airbnb graciously celebrated the event with a special dinner menu featuring "Paco's Tacos".

I like to include a few "extras" in these newsletters. A few in particular have been on my mind a lot lately.

The first is an excellent talk by John Wilkes at the 2011 Google Americas Faculty Summit, discussing the Borg and Omega projects which Google uses to manage clustered resources in their data centers. Wilkes presents the notion of "surety" as a first-class resource alongside CPU, RAM, I/O, etc., introducing a subtle but powerful change to our accepted notion of Von Neumann architecture. This has several interesting implications for those of us who build large-scale distributed apps. We'll be talking more about that in Austin.

Another recommendation is the excellent paper by Jimmy Lin, "MapReduce is Good Enough?", based on his analysis of machine learning apps at scale during a sabbatical at Twitter. On the one hand, there is ample criticism that Hadoop is not quite suitable for many important kinds of algorithms. On the other hand, as Professor Lin points out, much of our code implementing algorithms has been inherited from 3+ decades of expressing logic based on FORTRAN loops. Seriously. I'm not one to believe that Hadoop won't be replaced (reasonably soon) but we really need to replace algorithm libraries with better, updated code. As the paper shows, that can lead to more effective implementations for streaming anyway.

In terms of putting related insights into practice, check out the talks on SlideShare by David Gleich, especially about "tall and skinny" QR matrix factorization, as well as excellent explanations of the math behind Google, etc., for multi-armed bandits and other machine learning in practice. Chris Severs at LinkedIn has an excellent implementation of Gleich's TSQR.

Last but not least, and putting most of the above into practice: Pete Skomoroch gave an excellent talk "Skills, Reputation, and Search" about data products at LinkedIn. You could search far and wide to try to find a better discussion of how to take a problem of raw, unstructured Big Data from the stage of blank whiteboard to world-class app in a matter of months.
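If "tall and skinny" QR sounds abstract, here's a minimal single-machine sketch of the idea – my own toy version in NumPy, whereas Gleich's and Severs' implementations target MapReduce and Scalding: factor each row-block independently, then factor the small stack of R's, so the tall matrix never needs to sit in one place:

```python
import numpy as np

def tsqr(blocks):
    # local QR on each tall row-block (these could run on separate nodes)
    qs, rs = zip(*[np.linalg.qr(b) for b in blocks])
    # QR of the small stacked R factors -- the only "global" step
    q2, r = np.linalg.qr(np.vstack(rs))
    n = rs[0].shape[0]
    # back out the full Q by combining local and global factors
    q = np.vstack([qi @ q2[i*n:(i+1)*n] for i, qi in enumerate(qs)])
    return q, r

blocks = [np.random.rand(1000, 4) for _ in range(8)]  # 8000 x 4, "tall and skinny"
q, r = tsqr(blocks)
print(np.allclose(q @ r, np.vstack(blocks)))  # True
```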

One final note: been preparing a workshop about all of the above – typically as a full-day course, very hands-on. We'll try the first in Austin, then take it elsewhere. If you have a city or venue to suggest, please let me know @pacoid

Thanks!
Paco

---
Upcoming events…

Big Data SF Bay Area presents:
Mon, May 20, 2013  6:30 PM - 9:00 PM (PDT)
Lilly Mac's
187 S Murphy Ave, Sunnyvale, CA 94086 

GeekAustin:
Hands-on Introduction to Data Science -- a full-day workshop with PXN
Wed, May 29, 2013  8:30 AM - 5:30 PM (CDT)
AT&T Conference Center
1900 University Ave, Austin, TX 78705

City of Palo Alto:
a talk by Paco Nathan and Diego May 
Sat, Jun 1, 2013 11:00 AM - 7:00 PM (PDT)
downtown Palo Alto, CA

Hadoop Summit:
Wed, Jun 26, 5:05 PM - 5:55 PM (PDT)
San Jose Convention Center
150 W San Carlos, San Jose, CA 95110

O'Reilly Media OSCON:
Thu, Jul 25, 2013  5:00 PM (PDT)
Oregon Convention Center
777 NE Martin Luther King, Jr. Blvd., Portland, OR 97232