3.25.2014

Therbligs for data science: A nuts and bolts framework for accelerating data work

This is a refactored version of my Strata talk on Data Sidekicks. I shared similar content through a BrightTalk webinar last week.

To recap the argument so far: data science uses tools from machine learning and big data, but it's more than the sum of its parts. Most data products---and data scientists---can't be evaluated against the purely technical standards of prediction accuracy and computational performance. Instead, successful data science is defined on a human value scale and a human time scale. Data scientists can best measure their success against the metric of "cool things built per hundred days."

Think building blocks, not steps

If the key to success in data science is building more cool things per hundred days, then we need to understand where our time goes as data scientists. I've seen many breakdowns of the data science workflow, but most are just silly. I cite, without malice, three recent examples that came across my twitter feed: Exhibit A, Exhibit B, Exhibit C. Protip: if a data science article includes a number in the title ("Seven steps to better analytics"), it's probably safe to skip it. Data science and listicles do not mix.


Even thoughtful descriptions of data science workflows (e.g., this blog post and research agenda, and this extremely insightful presentation) tend to be too rigid and too technical. By "too rigid," I mean they tend to follow fixed steps (usually in a sequence or a cycle), as if every data product followed the same process. My data workflows don't look like that. They look like the diagrams below. Non-linear, asymmetric, different every time.

Real data workflow example #1: Tasks and dependencies for the last year of my dissertation. It was a lot of work.
Real data workflow example #2: The data pipeline for Paco Nathan's "find a shady spot to sit in Palo Alto" app. This is a data architecture, rather than a task list. Still, since each segment of the pipeline requires design, munging, building, and testing, it gives a sense of the complexity of the workflow needed to build the app.
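
To make "non-linear, asymmetric" concrete: a workflow like either of these is a dependency graph, not a checklist. Here is a toy sketch of representing and ordering such a graph, using Python's standard-library graphlib; the task names are invented for illustration, not taken from either example above.

    from graphlib import TopologicalSorter

    # A toy dependency graph: each task maps to the set of tasks it depends on.
    workflow = {
        "scope project with client": set(),
        "obtain data": {"scope project with client"},
        "profile columns and types": {"obtain data"},
        "clean outliers and bad rows": {"profile columns and types"},
        "brainstorm hypotheses": {"scope project with client"},
        "extract features": {"clean outliers and bad rows", "brainstorm hypotheses"},
        "train model": {"extract features"},
        "share prototype with client": {"train model"},
    }

    # Print one valid execution order. Note the graph is a DAG with branches
    # and joins, not a straight line of steps.
    for task in TopologicalSorter(workflow).static_order():
        print(task)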


By "too technical," I mean that these descriptions leave out essential and time-consuming activities just because these activities don't involve writing code or querying databases. Example: at the Data for Good panel at Strata, there was an extended discussion about the right way to identify and scope data projects with non-profit organizations. Rayid Ghani talked about spending months of his time in these conversations, and how good scoping is absolutely essential for successful collaboration. Most descriptions of data science workflows don't account for this time.

Another example: my workflow often includes scheduled "clipboard time": solitary time in a quiet corner, puzzling out the right questions to ask, or mapping out the right architecture for a data system. (In collaboration, whiteboard time accomplishes similar things.) This takes a moderate amount of time. It is also extremely productive. Any reasonable accounting of data science workflows should include this kind of activity.

Rather than straitjacketing the data science process into a specific sequence of steps, it's much more productive to think in terms of the types of activities that occupy our time. We should seek out common operations---the things that we do over and over again---and give them names. Once we do so, we can start to recognize the real patterns among them.

This kind of thinking is different from the purely technical logic of big data and machine learning, because it takes into account the social environment in which data science happens. It's the kind of thinking that led the Trifacta team to build their very well-received tools for data munging. You can see it in Jay Kreps's rationale for logs as the unifying abstraction for data science, and in Paco Nathan's advocacy for workflow abstractions and test-driven data science in Hadoop.

Therbligs for data science

Back in the 1920s, early industrialists did a similar thing in factory settings. They used extremely detailed time-and-motion studies to isolate and optimize the actions that made workers productive. One famous list included 17 types of actions, named therbligs ("Therblig" is the inventor's last name, Gilbreth, spelled backwards with the 'th' transposed):

...Suppose a man goes into a bathroom and shaves. We'll assume that his face is all lathered and that he is ready to pick up his razor. He knows where the razor is, but first he must locate it with his eye. That is "search", the first Therblig. His eye finds it and comes to rest -- that's "find", the second Therblig. Third comes "select", the process of sliding the razor prior to the fourth Therblig, "grasp." Fifth is "transport loaded," bringing the razor up to his face, and sixth is "position," getting the razor set on his face. There are eleven other Therbligs -- the last one is "think"!
(from Cheaper by the Dozen)

A time-and-motion study of an assembly line task. Researchers attached lights to the worker's hands and head, and used long-exposure photography to isolate motions and eliminate wasted steps. 

Therblig notation. 17 elemental operations carried out by assembly-line workers.

Without taking the analogy too far, I am proposing that we do a similar thing for data science, on a time scale of hours and days. To build more cool things per hundred days, we need to isolate the actions that take up our time. We need to study them with an eye to efficiency. We need to build tools to accelerate some of them, and arrange our work environments so that we can stop doing others.

Here's an incomplete list of therbligs for data science. I made this list up by reviewing my calendar for the last couple weeks---I'm sure there are many more therbligs to add.

  1. ETL: Replicate a data set from one location to another
  2. Data munging (I): Figure out the columns and data types for a new data source
  3. Data munging (II): Figure out the keys and structure for a new data source
  4. Data munging (III): Figure out the column values, ranges, etc. for a new data source
  5. Data cleaning: Search for outliers, bad values, malformed rows, etc. in a new data source (a minimal profiling sketch for items 2-5 follows this list)
  6. Construct, test, and execute a query or long-running job
  7. Extract features for a statistical model
  8. Train a statistical model
  9. Fiddle with the training parameters for a model
  10. Compare alternative models to see how they fit data (a model-comparison sketch follows this list)
  11. Replicate an algorithm from one package (say, R) in another (say, Python)
  12. Configure a production environment
  13. Write documentation
  14. Brainstorm questions, analyses, and hypotheses for a data set
  15. Explore a new data set
  16. Meet with a client to scope out a project
  17. Meet with a client to advise on a decision
  18. Meet with a client to share a prototype
  19. Meet with a client or partner to try to obtain data
  20. Design an experiment
  21. ...
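
Several of these are concrete enough to sketch in code. For the munging and cleaning therbligs (2 through 5), a first pass over a new data source might look like the following; this is a minimal sketch in pandas, and the file name and the z-score outlier rule are placeholders, not a prescription.

    import pandas as pd

    # Therbligs 2-4: figure out columns, types, candidate keys, and value ranges.
    df = pd.read_csv("new_data_source.csv")  # placeholder file name
    print(df.dtypes)                         # column names and inferred types
    print(df.nunique())                      # columns where n_unique == len(df) are candidate keys
    print(df.describe(include="all"))        # value ranges, counts, and unique counts

    # Therblig 5: search for missing values, bad values, and outliers.
    print(df.isna().sum())                   # missing values per column
    numeric = df.select_dtypes("number")
    z = (numeric - numeric.mean()) / numeric.std()
    print((z.abs() > 3).sum())               # crude outlier count per numeric column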
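
For the modeling therbligs (8 through 10), training a model, fiddling with its parameters, and comparing alternatives are each a few lines in scikit-learn. In this sketch the synthetic data stands in for the output of therblig 7 (feature extraction); the models and parameter values are arbitrary examples.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Stand-in for therblig 7: synthetic features and labels.
    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    # Therbligs 8-10: train models, vary their parameters, and compare fits.
    models = {
        "logistic regression": LogisticRegression(max_iter=1000),
        "random forest, 100 trees": RandomForestClassifier(n_estimators=100, random_state=0),
        "random forest, 500 trees": RandomForestClassifier(n_estimators=500, random_state=0),
    }
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
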
To my mind, this is about the right level of specificity for data science workflows. I've done each of these therbligs many times. When I map out daily to-do lists and dependencies for projects, these are the task divisions that I use naturally. If one of these tasks becomes a pain point, I can imagine ways to fix it. Fledgling data scientists who want to improve their "wax on, wax off" can focus on these skills one at a time---and mentors can provide coaching.

At this level of granularity, I suspect there are a few dozen common therbligs---and fuel for a lot of interesting conversations. I'm convinced that focusing on the therbligs of data science is a very productive way to approach many of today's pressing questions about the practice of data science ("What infrastructure does my team need?" "How should data science be positioned within my organization?" "What skills do I need to become a data scientist?" etc.). I'm looking forward to these conversations.

Next time, I plan to build more on the idea of therbligs for data science. We're working back around towards Data Sidekicks---a data science design pattern that shows up whenever you include a "curation" step in your workflow.

By the way, I've had some questions about when this blog is going to get back to the idea of increasing scientific ROI. Two answers: first, patience. We'll get there. Second: this is it. Very often, "data science" is science on a deadline. All this stuff about workflows is describing how the scientific explosion feels from the inside. 

2 comments:

  1. It is an interesting article. It just looks like the URL links in the main body of the blog post are not working. From looking at the source code, it seems the HTML link tags got lost during cross-publishing.

    1. Oops. Thanks for catching that. I blame Evernote---it seems to drop links some of the time. Should be fixed now.
