Data science means shopping and plucking, not just cooking.
Many people think that data science works like this:
But that’s not the whole picture, not even close.
Unless your data pipeline is quite mature, your data is probably more like this.
Unstructured, uncleaned. Still very messy.
Your whole data set is probably more like this:
It contains some of the key ingredients, but not all of them.
I can’t make cookies out of mustard. And I can’t make them out of just chocolate chips and vanilla, either.
Most of the time, building great data products requires shopping and plucking, not just cooking. You can’t cook a great meal until your fridge is stocked with the right ingredients.
Bottom line: your "secret sauce" isn’t an algorithm. It’s a combination of data cleaning, processing, and curation, plus a judicious choice of the right algorithms.
(Even if you have the right ingredients, you can’t boil your way to good cookies.)
That means you want to work with data scientists who understand the whole process of shopping, plucking, and cooking good data products. If you hire analysts or machine learning specialists who don't know how to pluck and shop, you're going to either (1) get stuck baking mustard cookies, or (2) put a heavy burden on your engineering team to grab and process new data. (1) is yucky. (2) is very slow.
It also means that you don't want to constrain your data scientists to only use the ingredients you already have in your kitchen. You should expect a good data scientist to improve your options by looking for more ways to bring in more data. ("Hm. No eggs. Before we go any further, we're going to need some eggs." "These cookies are okay, but they'd be much better with a dash of cinnamon.")
Practically speaking, "more ways to bring in data" includes things like:
- additional instrumentation within your app/website
- mashups with public data sources
- feedback mechanisms within your app/website (e.g. additional profile fields)
- hand-curated data sets to clean and normalize large data feeds
- merging in additional sources of user feedback (e.g. customer support tickets)
- user surveys or interviews
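To make one of those options concrete, here is a minimal sketch of "merging in additional sources of user feedback": it attaches a support-ticket count to each user record. Everything in it is an assumption made up for illustration (the field names `user_id`, `plan`, and `subject`, the sample records, and the helper `merge_ticket_counts`). A real pipeline would read from your own user store and ticketing system.

```python
from collections import Counter


def merge_ticket_counts(users, tickets):
    """Left-join a per-user support-ticket count onto user records.

    Users with no tickets get a count of 0; tickets that belong to
    unknown users are ignored.
    """
    counts = Counter(t["user_id"] for t in tickets)
    return [{**u, "ticket_count": counts.get(u["user_id"], 0)} for u in users]


# Hypothetical sample data, invented for this example.
users = [
    {"user_id": 1, "plan": "free"},
    {"user_id": 2, "plan": "pro"},
]
tickets = [
    {"user_id": 2, "subject": "billing"},
    {"user_id": 2, "subject": "login"},
    {"user_id": 3, "subject": "ticket from an unknown user"},
]

enriched = merge_ticket_counts(users, tickets)
for row in enriched:
    print(row)
```

The join is deliberately a left join on the user list: every user you already have stays in the bowl, and the new ingredient is simply added where it exists.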
In conclusion, three cheers for cookies!
PS: I’m not saying you should wait for all the perfect ingredients to begin. Great data science usually involves smart sequencing—rapidly learning which data streams add the most value, and developing the systems to gather and process them effectively. Make sugar cookies for now, and add the chocolate chips as soon as you can get them.
PPS: Peter Norvig says that “more data usually beats better algorithms.” I’m not disagreeing. Instead, I’m pointing out that at any given point in the life cycle of a data product, your volume of data is more or less fixed. Great data science is about working within that constraint, creating useful data products with the tools and ingredients that are close to hand, and bootstrapping yourself up to the next level.
PPPS: There’s another layer to this conversation: developing the tools (and culture) to enable rapid exploration and deployment of data products. It’s a bit like making sure your kitchen is equipped with a food processor, not just a microwave. But I think this metaphor is already strained enough, so we’ll save that conversation for another day.
Boiling cauldron: http://www.halloweenclipart.com/halloween_clipart_images/witches_brew_a_bubbling_boiling_cauldron_of_evil_spells_and_witchcraft_0515-0809-1516-0952_SMU.jpg