2.26.2014

Data sidekicks and design patterns for data science workflows - Part I

This is a refactored version of my Strata talk, updated based on reactions, feedback, and conversations at the conference. The original got good reviews, but I think you'll agree that this is even better. The original slides are here (low-res) and here (high-res). No speaking notes, I'm afraid, but video should be available eventually.

Data science has two parents...
Professional data science is largely enabled by two powerful, underlying technologies: big data* and machine learning**. Both of these are mature disciplines in their own right, and each has its own way of measuring success. Big data measures success in operations per second: computations performed, pages served, lines of data processed. Machine learning measures success in prediction accuracy and its cousins: precision, recall, RMSE, and the like.


[Image caption: To win at big data, make these squiggly lines go down.]

[Image caption: To win at machine learning, make these squiggly lines go up.]
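
For a concrete reference point, here's a minimal Python sketch of the metrics named above. It's illustrative only: the function names and sample numbers are mine, and the formulas are the standard textbook definitions, not anything specific to the talk.

    # Illustrative implementations of standard ML success metrics.
    # All sample values below are hypothetical.
    import math

    def precision(tp, fp):
        # Of everything we predicted positive, what fraction was right?
        return tp / (tp + fp)

    def recall(tp, fn):
        # Of everything actually positive, what fraction did we catch?
        return tp / (tp + fn)

    def rmse(predicted, actual):
        # Root-mean-square error between predictions and observations.
        return math.sqrt(
            sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
        )

    print(precision(tp=80, fp=20))                    # 0.8
    print(recall(tp=80, fn=40))                       # 0.666...
    print(rmse([2.5, 0.0, 2.1], [3.0, -0.5, 2.0]))    # ~0.412

Big data's metrics (operations per second, pages served, lines processed) come straight off the monitoring dashboards, so they need no sketch.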
So data science has two parent disciplines, each with its own success metrics. Which parent should data science take after? Should success in data science be judged by computational performance, or prediction accuracy?

Building cool stuff with data
My answer: neither. Instead, data scientists should be judged against their track record of building cool stuff. "Cool stuff" means analytic insights, better-informed decisions, data visualizations, new user experiences, useful back-end tooling, etc. In previous writing, I've called these things "data products": systems that create better internal and external experiences by transforming data in non-trivial ways.


[Image caption: Building cool stuff with legos is almost as cool as building cool stuff with data.]
Practically speaking, we do this already.
  • All of the major data science bootcamps (e.g. Insight Fellows, Zipfian Academy) are structured to guide participants toward building something cool.
  • So are the prominent Data-for-Good programs (e.g. DataKind, Data Science for Social Good).
  • Building cool stuff is important for hiring, as evidenced by the ubiquitous interview question, "Tell me about something you built at a previous job, grad school, or as a side project."
  • Most of the high-profile members of the data science community are people with histories of building cool stuff---Wes McKinney (aka "pandas") and Hadley Wickham (aka "ggplot2") are great examples.
  • The Facebook, LinkedIn, and Twitter data clans are hubs in the community for the same reason: they are groups of people with track records of building demonstrably cool products.

Bottom line: data scientists are a community of hackers. We admire people who invent, create, and share. And we admire people who do these things fast.

Cool things per 100 days
Because of that, I'd like to make the case for a different criterion for evaluating success in data science: cool things built per 100 days.

Why cool things per 100 days? Because, in my experience, most professional data scientists are capable of building something cool*** at least every couple of weeks. Not always---sometimes you can bang out really great code in a weekend, and sometimes it's worth investing a month in a crucial infrastructure migration---but in most cases, the timeline for new data products is measured on a short scale of weeks. A hundred days is long enough to go through this cycle several times---enough to build a fair sample of work. No wonder DJ Patil recommends a 100-day window for evaluating new data science hires.

[Image caption: This is the universal symbol for "data product."]

Let's pause here to answer some potential concerns. First, on analytics-focused teams, the timescale for building cool things is usually shorter: hours for small analyses; a day or two for medium stuff; a week-ish for deep dives. The difference reflects the fact that production data pipelines and UX take more time to build, no matter how lightweight you make them. If you work in more of a traditional BI/analytics environment, you'll need to adjust your stopwatch accordingly.

Second, "cool things per hundred days" raises some important questions. "What about improvements and maintenance to existing data products?" "How big does a change need to be to count as a cool thing?" "Not all data products are equally valuable, right?" "Under what conditions does internal infrastructure count as a cool thing?" These are all valid questions, important for making tactical decisions. We can discuss them later. For the moment, I want to continue the discussion without getting bogged down in these details.

Last, I freely acknowledge some fuzziness here. ("How can we measure 'cool'?") The fuzziness is a direct consequence of framing data science in terms of value ("coolness"), rather than a directly measurable metric, such as prediction accuracy or writes per second. From an engineering perspective, that fuzziness can be frustrating, but that's the price of relevance to a broader, non-technical audience. As we'll see, there's still enough precision in this framework to lead to some useful insights.

Okay, that's enough for one long-form blog post. Most of this is pretty basic, but it's important background for the rest of the talk---and the framing seemed new to most of the Strata crowd. Next time, I'll go into more detail on data science workflows. Looking forward to your feedback!

*By "big data," I mean the theory and code that support massive, distributed information processing systems. That's a mouthful, so I'll just say "big data" and assume you know what I mean.

**I'd also include statistics (e.g. econometrics, psychometrics, and Bayesian modeling) and certain kinds of applied math (e.g. network theory) alongside machine learning.

***@astrobiased points out that not every "cool thing" is a fully baked, user-facing data product. Sometimes it's internal plumbing that opens the door for new queries and analysis. Smart data scientists usually figure out ways to build a series of smallish cool things that progressively unlock bigger cool things.
