Definition wanted: What is a data product?
(Spoiler alert: not much. Skip down to see my sparkling, incisive, soon-to-be-in-the-OED definition.)
First, I found a bunch of stodgy-looking sites selling bundles of data, and calling them "data products." These companies have clearly put a lot of work into their SEO.
There are also a bunch of data-sciency pages that mention "data products" in passing, but without a clear definition.
In desperation, I turned to Quora and Wikipedia. Neither one has an answer yet.
Bottom line: no one is on record with a good definition for "data product." This vacuum must be filled.
What is a data product?
Here's my definition, based on a lot of watercooler talk and a healthy dose of personal experience:
A data product is a system that creates better user experiences
by transforming data in non-trivial ways.
Good? Yes? No? It seems to cover all the bases, without boiling down to "a thing kinda like software engineering except more data-y."
Let's kick the tires a little more.
What is a data system?
In terms of parts and plumbing, a typical system supporting a data product has three components:
(1) an intake system that collects and stores raw data,
(2) a pipeline that transforms and refines raw data into more useful formats, and
(3) an endpoint that adjusts user experience based on the data.
This pattern should look pretty familiar to anyone with experience working with data.
(Figure caption: Yes, I drew this myself. Beautiful, isn't it?)
Think of (1) a water tank or reservoir, (2) pipes and plumbing, and (3) a set of appliances and fixtures, and you won't be far wrong. Each of these pieces can take many different shapes, and things can get complicated when many data products overlap, but this overall architecture seems to be a common denominator.
In many cases---especially in larger organizations with more specialized roles---the data team will mainly work on (2): pipelines to refine and transform data. But the data product as a whole only works if (1) and (3) are in place as well. Leave them out and you can't create a better experience, and therefore you can't create any value.
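To make the tank, pipes, and fixtures picture concrete, here is a minimal sketch of the three components in Python. Everything in it is made up for illustration: the `Intake` class, the `pipeline` and `endpoint` functions, and the toy "viewed together" events are assumptions of mine, not a description of any real system.

```python
from collections import Counter, defaultdict

# Hypothetical names and toy data throughout; this only shows the three-part shape.


class Intake:
    """(1) Collects and stores raw data (here, an in-memory list of events)."""

    def __init__(self):
        self.events = []

    def record(self, user, item):
        self.events.append((user, item))


def pipeline(events):
    """(2) Refines raw events into a more useful format:
    for each item, counts of other items viewed by the same users."""
    items_by_user = defaultdict(set)
    for user, item in events:
        items_by_user[user].add(item)

    co_views = defaultdict(Counter)
    for items in items_by_user.values():
        for a in items:
            for b in items:
                if a != b:
                    co_views[a][b] += 1
    return co_views


def endpoint(co_views, current_item, k=2):
    """(3) Adjusts the user experience: suggests up to k related items."""
    counts = co_views.get(current_item, {})
    # Most co-viewed first; ties broken alphabetically so output is stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:k]]


intake = Intake()
for user, item in [("ann", "tea"), ("ann", "mug"), ("bob", "tea"),
                   ("bob", "mug"), ("bob", "kettle"), ("cat", "tea")]:
    intake.record(user, item)

suggestions = endpoint(pipeline(intake.events), "tea")
print(suggestions)
```

Note that the middle function is the only part that looks like "data science," but it has nothing to do without the intake and nothing to show for itself without the endpoint.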
Who is a user?
I'm deliberately stretching the definition of "users" to include internal stakeholders. Giving executive advice on a strategic decision? The execs are users. Helping the product design team explore customer profiles? The designers are users. If this were a business textbook, I would rewrite the clause as, "experiences for internal and external users," but that feels too clunky for a blog.
In my experience, most data scientists are comfortable with the idea that they may be called on to build products that are external-facing (e.g. a new feature in an app) or internal-facing (a dashboard describing customer engagement). The tools and methods for these kinds of products are very similar, so it makes sense to group them together.
Decisions are a special case of experiences
Much of the hubbub about data science focuses on data-driven decision making. I completely agree that data analysis can cast light on big strategic decisions, and can accelerate, fine-tune, and sometimes automate routine decisions. Experiences of the "make better decisions" and "answer new questions" variety are definitely valuable.
But data-driven decisions aren't the only kinds of experiences enabled by data products. Think of Google's first big data product: improved search, powered by the PageRank algorithm. Or LinkedIn's "people you may know" feature. These data products created new/better experiences that were several steps removed from the making of business decisions. But they were powered by data, and created value measurable in millions of dollars.
A lot of people seem to miss this aspect of data science, especially people coming from BI and OR backgrounds. But in my opinion, it's silly to define "data product" in a way that excludes experiences that aren't business decisions. There's a lot more to life than conference rooms, KPIs and strategic initiatives.
What does "non-trivial" mean?
"Non-trivial" means that the transformation requires special expertise of some kind: statistical modeling to deal with noisy data, NLP to process text, massively distributed computation (e.g. MapReduce) to crunch really big data, clever instrumentation and randomization to conduct experiments and infer causality.
If you can accomplish the data transformation using two lines of SQL, it's probably trivial.
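As a rough illustration of where the line might fall, the sketch below compares a plain per-item average (the kind of thing a short SQL `GROUP BY` would give you) with a simple shrinkage estimate that pulls items with few ratings toward the overall mean. The ratings, the function names, and the `prior_weight` value are all invented for this example, and shrinkage is just one small instance of "statistical modeling to deal with noisy data," arguably still on the simple end.

```python
from statistics import mean

# Made-up ratings: item A has a single perfect score, item B has many good ones.
ratings = {
    "A": [5.0],
    "B": [5.0, 5.0, 5.0, 5.0, 4.0, 5.0, 5.0, 5.0, 5.0, 5.0],
    "C": [3.0, 3.0, 4.0, 3.0, 2.0, 3.0, 3.0, 4.0],
}


def raw_means(ratings):
    """Trivial transformation: average rating per item."""
    return {item: mean(vals) for item, vals in ratings.items()}


def shrunk_means(ratings, prior_weight=5.0):
    """Less trivial: blend each item's ratings with the global mean.

    prior_weight (an assumed tuning value) acts like that many
    extra ratings at the global mean, so sparse items move the most.
    """
    all_vals = [v for vals in ratings.values() for v in vals]
    global_mean = mean(all_vals)
    return {
        item: (sum(vals) + prior_weight * global_mean) / (len(vals) + prior_weight)
        for item, vals in ratings.items()
    }


raw = raw_means(ratings)
shrunk = shrunk_means(ratings)

best_raw = max(raw, key=raw.get)
best_shrunk = max(shrunk, key=shrunk.get)
print(best_raw, best_shrunk)
```

With this toy data the raw average ranks the single-rating item first, while the shrunk estimate prefers the item with many consistently high ratings. Knowing that the first ranking is untrustworthy, and how much to discount it, is the expertise part.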
Why do we need a definition for data products?
Maybe I should have answered this question first, but it feels like a good place to end. I'm convinced that we need a shared vocabulary for talking about data products, because we're going to be building a lot of them.
The new field of data science is made up of a lot of smart people coming from a lot of different backgrounds. As we get together, we're finding common elements in the ways we think about data and the scientific method. But because we come from so many different disciplines, we often have different jargon for making sense of these things. "Data products" are a starting point -- a term that's general enough to cover most of the things we build, and specific enough to let us start homing in on the details.
*Does anybody ever actually say "part and parcel?"