2.05.2014

Wanted: Good examples of Data Sidekicks

I'm working on my Strata talk for next week ("The Sidekick Pattern: Using Small Data to Increase the Value of Big Data"). Here's the abstract:
Creating value from big, messy data sets can be a daunting task. The session introduces the Sidekick Pattern: using small, curated data to increase the value of Big Data. Drawing on lessons from data science for Jawbone’s UP fitness tracker, we will see how smart selection of data sidekicks can accelerate analysis, solve cold start problems, and simplify complicated data pipelines.
You should (*cough*) come and see it. It'll be fun.

In the meantime, I'm looking for more examples of data sidekicks. For all you smart data people out there, what examples come to mind?

Here are some examples to get things started.

1. Pretty much all sentiment analysis, and most machine learning 
Every data scientist should know and recognize this pattern: train a classifier on a small set of labeled documents, then apply it to a corpus of any size you like. This is the automated magic of machine learning. It allows us to transfer the categories from a small sidekick dataset to a much bigger one.
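Here's a minimal sketch of that pattern, assuming a toy naive Bayes classifier written from scratch. The labeled documents, the function names, and the two-document "big corpus" are all made up for illustration; a real project would more likely reach for an existing library.

```python
import math
from collections import Counter, defaultdict

# Hypothetical sidekick: a handful of hand-labeled documents.
LABELED = [
    ("love this tracker great battery", "pos"),
    ("great sleep insights love it", "pos"),
    ("band broke terrible support", "neg"),
    ("terrible sync hate the app", "neg"),
]

def train(labeled):
    """Count words per label for a multinomial naive Bayes model."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labeled:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def classify(text, model):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.split():
            if word in vocab:  # skip words the sidekick never saw
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(LABELED)
# Stand-in for the big dataset; it could just as well be millions of documents.
big_corpus = ["love the battery", "hate the terrible band"]
labels = [classify(doc, model) for doc in big_corpus]
```

The four labeled documents are the sidekick; everything in `big_corpus` gets its category from them.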

2. CrowdFlower's "Gold Units"
On crowdsourcing platforms, it's crucial (and very difficult) to weed out responses from spammers. One way to do it is to mix in a small number of "gold standard" tasks where the correct answer is already known. I used a similar technique in my dissertation; CrowdFlower has made a business out of it.

Unlike the first example, this kind of data sidekick doesn't give us categories directly; instead, it allows us to infer credibility.
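To make the idea concrete, here's a rough sketch of gold-unit filtering. This is not CrowdFlower's actual method; the worker names, tasks, answers, and the 0.6 accuracy cutoff are all assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical gold units: task id -> answer we already know is correct.
GOLD = {"t1": "cat", "t2": "dog", "t3": "cat"}

# Hypothetical responses: (worker, task, answer). t4 and t5 have no known answer.
RESPONSES = [
    ("alice", "t1", "cat"), ("alice", "t2", "dog"), ("alice", "t3", "cat"),
    ("alice", "t4", "dog"), ("alice", "t5", "cat"),
    ("bob", "t1", "cat"), ("bob", "t2", "dog"), ("bob", "t3", "dog"),
    ("bob", "t4", "dog"), ("bob", "t5", "cat"),
    ("spam", "t1", "dog"), ("spam", "t2", "cat"), ("spam", "t3", "dog"),
    ("spam", "t4", "cat"), ("spam", "t5", "dog"),
]

def gold_accuracy(responses, gold):
    """Score each worker on the gold tasks only."""
    hits, seen = defaultdict(int), defaultdict(int)
    for worker, task, answer in responses:
        if task in gold:
            seen[worker] += 1
            if answer == gold[task]:
                hits[worker] += 1
    return {worker: hits[worker] / seen[worker] for worker in seen}

def trusted_answers(responses, gold, min_accuracy=0.6):
    """Keep non-gold answers from workers who did well enough on gold tasks."""
    accuracy = gold_accuracy(responses, gold)
    return [
        (worker, task, answer)
        for worker, task, answer in responses
        if task not in gold and accuracy.get(worker, 0.0) >= min_accuracy
    ]

accuracy = gold_accuracy(RESPONSES, GOLD)
kept = trusted_answers(RESPONSES, GOLD)
```

Three gold tasks are enough here to drop every answer from the worker who got all of them wrong.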

3. Bridge cases in psychometric scaling models
This is a great variation on the theme, but it's pretty technical. Psychometric scaling models place things on a common scale. SAT scores are the most famous example; DW-NOMINATE scores, another nifty application, classify U.S. politicians from liberal to conservative, based on their recorded votes. In the (somewhat rare) case where we want to align scales across two populations---say U.S. Senators and U.S. Congressmen---a small set of bridge cases (e.g. congressmen who became senators, or vice versa) is required to line them up properly. In this case, the data sidekick is the mapping of IDs in one data set to IDs in the other. ("Yes, Congressman John Smith of Ohio District 7 in 1990 is the same person as Senator John Smith of Ohio in 2002.")

This kind of sidekick isn't about categories or credibility; it's about alignment---enabling fair comparisons.
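A simplified sketch of the alignment step: fit a straight line through the bridge cases, then use it to move everyone else onto the common scale. The linear rescaling is my simplifying assumption, not how DW-NOMINATE itself works, and the IDs and scores below are invented.

```python
# Hypothetical scores on two separately estimated scales.
HOUSE = {"h1": -1.0, "h2": 0.0, "h3": 1.0, "h4": 0.5}
SENATE = {"s1": -1.0, "s2": 1.0, "s3": 3.0}

# The sidekick: pairs of IDs that refer to the same person.
BRIDGES = [("h1", "s1"), ("h2", "s2"), ("h3", "s3")]

def fit_linear_map(house, senate, bridges):
    """Least-squares line mapping House scores onto the Senate scale."""
    xs = [house[h] for h, _ in bridges]
    ys = [senate[s] for _, s in bridges]
    n = len(bridges)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_linear_map(HOUSE, SENATE, BRIDGES)

# Every House member, bridge case or not, can now be compared with senators.
house_on_senate_scale = {h: intercept + slope * x for h, x in HOUSE.items()}
```

Note that "h4" never served in the Senate, yet the three bridge cases are enough to place that score on the Senate scale.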

4. Instrumental variables in multilevel models
This type of sidekick is so technical that I'm not even going to try to explain it fully*, but here's a potential application. Reverse causation is rampant in the study of education, because kids' social and family backgrounds influence schools as much as or more than the schools influence kids. If you wanted to separate out the causal arrows, you could make some headway by modeling the system at multiple levels (say, students within classrooms, within schools), and then looking for exogenous variables at different levels (say, schools that missed full or half days because of snow). A small dataset about snow days could allow you to make strong causal inferences about the impact of teachers and schools on individual students' test scores.

In this case, the sidekick gives us causation; a clean natural experiment in a small data set allows us to unpack the cause-and-effect structure of a much larger one.
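For the curious, here's the simplest instrumental-variable calculation I can think of, the Wald estimator, applied to the snow-day idea. It ignores the multilevel structure entirely, it assumes snow closures affect scores only through lost instruction days, and every number in it is made up.

```python
# Made-up toy numbers, one row per school:
# (had_snow_closures, instruction_days, mean_test_score)
SCHOOLS = [
    (0, 180, 70.0),
    (0, 180, 74.0),
    (1, 176, 68.0),
    (1, 174, 66.0),
]

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def wald_estimate(rows):
    """IV (Wald) estimate of test-score points per instruction day.

    Divides the score gap between snow and no-snow schools by the
    gap in instruction days between the same two groups.
    """
    snow = [r for r in rows if r[0] == 1]
    clear = [r for r in rows if r[0] == 0]
    score_gap = mean(r[2] for r in snow) - mean(r[2] for r in clear)
    days_gap = mean(r[1] for r in snow) - mean(r[1] for r in clear)
    return score_gap / days_gap

effect_per_day = wald_estimate(SCHOOLS)
```

The snow flag is the sidekick: a single small column that nobody chose, which is what lets the ratio be read as cause and effect under the assumptions above.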

*NB: Andrew Gelman has a whole chapter on causal inference in multilevel Bayesian models. If you don't want to go down the HLM rabbit hole, you can think of regression discontinuity as the simplest case of a multilevel instrument.



These are fine examples of the Sidekick Pattern---in each case, a small sidekick catalyzes new meaning in a larger data set---but they're much more technical than I'd prefer.

What else can you think of? I'd love to have more fun examples to share at Strata. I'll give shout-outs for good examples during the talk.

Here's what I'm looking for:
  • Practical applications backed by intuitive stories would be awesome.
  • Specific examples of the first type (sentiment analysis/machine learning) would be great.
  • Near misses ("This isn't exactly a Sidekick, but…") are fine.
  • I'm especially interested in cases where a Data Sidekick provided a quick substitute for a more labor-intensive, heavyweight alternative.
Thanks for sharing your ideas! Here's a picture of physicists talking about atomic bombs:


