2.05.2014

Wanted: Good examples of Data Sidekicks

I'm working on my Strata talk for next week ("The Sidekick Pattern: Using Small Data to Increase the Value of Big Data"). Here's the abstract:
Creating value from big, messy data sets can be a daunting task. The session introduces the Sidekick Pattern: using small, curated data to increase the value of Big Data. Drawing on lessons from data science for Jawbone’s UP fitness tracker, we will see how smart selection of data sidekicks can accelerate analysis, solve cold start problems, and simplify complicated data pipelines.
You should (*cough*) come and see it. It'll be fun.

In the meantime, I'm looking for more examples of data sidekicks. For all you smart data people out there, what examples come to mind?

Here are some examples to get things started.

1. Pretty much all sentiment analysis, and most machine learning 
Every data scientist should know and recognize this pattern: train a classifier on a small set of labeled documents, then apply it to a corpus of any size you like. This is the automated magic of machine learning. It allows us to transfer the categories within a small sidekick dataset to a dataset of any size.
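
To make the pattern concrete, here's a minimal sketch of a tiny Naive Bayes classifier trained on four labeled snippets and then applied to unlabeled text. Everything in it is an assumption for illustration: the documents are made up, the function names are mine, and a real project would use a proper library and far more labels.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """Fit a multinomial Naive Bayes model on (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in labeled_docs:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab

def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token in vocab:  # words never seen in training are ignored
                # Add-one smoothing so unseen (word, label) pairs are not zero.
                score += math.log(
                    (word_counts[label][token] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# The sidekick: a small, hand-labeled data set (made-up examples).
sidekick = [
    ("love this great tracker", "pos"),
    ("great battery love it", "pos"),
    ("terrible strap broke", "neg"),
    ("hate the terrible app", "neg"),
]
model = train_naive_bayes(sidekick)

# The "big" data: any number of unlabeled documents.
big_corpus = ["I love it", "the app is terrible", "great tracker"]
labels = [classify(doc, model) for doc in big_corpus]
print(labels)  # ['pos', 'neg', 'pos']
```

The corpus at the bottom could be three documents or three billion; the labeling effort stays the size of the sidekick.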

2. CrowdFlower's "Gold Units"
On crowdsourcing platforms, it's crucial (and very difficult) to weed out responses from spammers. One way to do it is to mix in a small number of "gold standard" tasks where the correct answer is already known. I used a similar technique in my dissertation; CrowdFlower has made a business out of it.

Unlike the first example, this kind of data sidekick doesn't give us categories directly; instead, it allows us to infer credibility.
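
Here's a rough sketch of how gold units can be turned into a credibility filter. This is not CrowdFlower's actual algorithm (I don't know their internals); the worker names, tasks, answers, and the 0.7 threshold are all hypothetical.

```python
from collections import Counter, defaultdict

# The sidekick: a handful of tasks whose correct answers are already known.
gold = {"t1": "cat", "t2": "dog"}

# (worker, task, answer) triples; all values are made up for illustration.
responses = [
    ("alice", "t1", "cat"), ("alice", "t2", "dog"), ("alice", "t3", "cat"),
    ("bob", "t1", "cat"), ("bob", "t2", "dog"), ("bob", "t3", "cat"),
    ("spam", "t1", "dog"), ("spam", "t2", "cat"), ("spam", "t3", "dog"),
]

def worker_accuracy(responses, gold):
    """Share of gold tasks each worker answered correctly."""
    hits, seen = Counter(), Counter()
    for worker, task, answer in responses:
        if task in gold:
            seen[worker] += 1
            hits[worker] += answer == gold[task]
    return {worker: hits[worker] / seen[worker] for worker in seen}

def trusted_majority(responses, gold, min_accuracy=0.7):
    """Majority vote on non-gold tasks, counting only credible workers."""
    accuracy = worker_accuracy(responses, gold)
    votes = defaultdict(Counter)
    for worker, task, answer in responses:
        if task not in gold and accuracy.get(worker, 0.0) >= min_accuracy:
            votes[task][answer] += 1
    return {task: c.most_common(1)[0][0] for task, c in votes.items()}

consensus = trusted_majority(responses, gold)
print(worker_accuracy(responses, gold))  # {'alice': 1.0, 'bob': 1.0, 'spam': 0.0}
print(consensus)                         # {'t3': 'cat'}
```

Two gold tasks are enough here to expose the spammer, so their vote on the real task never gets counted.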

3. Bridge cases in psychometric scaling models
This is a great variation on the theme, but it's pretty technical. Psychometric scaling models place things on a common scale. SAT scores are the most famous example; DW-NOMINATE scores, another nifty application, place U.S. politicians on a scale from liberal to conservative, based on their recorded votes. In the (somewhat rare) case where we want to align scales across two populations---say U.S. Senators and U.S. Representatives---a small set of bridge cases (e.g. House members who became senators, or vice versa) is required to line them up properly. In this case, the data sidekick is the mapping of IDs in one data set to IDs in the other. ("Yes, Congressman John Smith of Ohio District 7 in 1990 is the same person as Senator John Smith of Ohio in 2002.")

This kind of sidekick isn't about categories or credibility; it's about alignment---enabling fair comparisons.
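
A heavily simplified sketch of the idea: if a few people have scores on both scales, fit a straight line through those bridge cases and use it to move everyone else onto the common scale. Real scaling models such as DW-NOMINATE do not work this way (they estimate positions jointly from votes); the names and scores below are invented.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical scores on two separately estimated scales.
house_scores = {"smith": -1.0, "jones": 0.0, "lee": 1.0, "garcia": 0.5}
senate_scores = {"smith": -0.5, "jones": 0.0, "lee": 0.5, "brown": 0.2}

# The sidekick: people known to appear in both data sets.
bridges = ["smith", "jones", "lee"]

slope, intercept = fit_line(
    [house_scores[b] for b in bridges],
    [senate_scores[b] for b in bridges],
)

# Express every House score on the Senate scale.
aligned = {name: slope * s + intercept for name, s in house_scores.items()}
print(aligned["garcia"])  # 0.25, now comparable to brown's 0.2
```

Without the three bridge names there would be no way to tell that a 0.5 on one scale corresponds to a 0.25 on the other.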

4. Instrumental variables in multilevel models
This type of sidekick is so technical that I'm not even going to try to explain it fully*, but here's a potential application: reverse causation is rampant in the study of education, because kids' social and family backgrounds influence schools as much as or more than the schools influence kids. If you wanted to separate out the causal arrows, you could make some headway by modeling the system at multiple levels (say, students within classrooms, within schools), and then looking for exogenous variables at different levels (say, schools that missed full or half days because of snow). A small dataset about snow days could allow you to make strong causal inferences about the impact of teachers and schools on individual students' test scores.

In this case, the sidekick gives us causation; a clean natural experiment in a small data set allows us to unpack the cause-and-effect structure of a much larger one.
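
For the curious, here's the simplest single-level version of the instrument idea (it ignores the multilevel structure entirely). The numbers are a toy data set I constructed so that the true effect of x on y is exactly 2; z stands in for something exogenous like a snow-day indicator, and u for an unobserved confounder like family background. None of it is real data.

```python
def covariance(a, b):
    """Population covariance of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)

# Toy numbers, constructed so the true effect of x on y is exactly 2.
z = [0, 0, 1, 1]      # instrument: shifts x, unrelated to u
u = [-1, 1, -1, 1]    # confounder: affects both x and y, never observed
x = [zi + ui for zi, ui in zip(z, u)]          # exposure
y = [2 * xi + 3 * ui for xi, ui in zip(x, u)]  # outcome

# Regressing y on x directly mixes the causal effect with the confounder.
naive_slope = covariance(x, y) / covariance(x, x)

# The instrumental-variable (Wald) estimate uses only the variation in x
# that is driven by z.
iv_slope = covariance(z, y) / covariance(z, x)

print(round(naive_slope, 2))  # 4.4, badly biased
print(iv_slope)               # 2.0, the effect built into the toy data
```

The instrument column is the sidekick: one extra variable that recovers the effect the naive regression gets wrong.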

*NB: Andrew Gelman has a whole chapter on causal inference in multilevel Bayesian models. If you don't want to go down the HLM rabbit hole, you can think of regression discontinuity as the simplest case of a multilevel instrument.



These are fine examples of the Sidekick Pattern---in each case, a small sidekick catalyzes new meaning in a larger data set---but they're much more technical than I'd prefer.

What else can you think of? I'd love to have more fun examples to share at Strata. I'll give shout-outs for good examples during the talk.

Here's what I'm looking for:
  • Practical applications backed by intuitive stories would be awesome.
  • Specific examples of the first type (sentiment analysis/machine learning) would be great.
  • Near misses ("This isn't exactly a Sidekick, but…") are fine.
  • I'm especially interested in cases where a Data Sidekick provided a quick substitute for a more labor-intensive, heavyweight alternative.
Thanks for sharing your ideas! Here's a picture of physicists talking about atomic bombs:


