12.07.2013

You might be a data scientist if...


As I meet up-and-coming data scientists, I've realized that we share a surprising number of very specific experiences.  Here's a list of these data science rites of passage, in no particular order.

1. Word count in MapReduce.
2. Write a script to send yourself an email.
3. Get emotionally involved in a debate about statistical software (e.g. R vs. Python) or graphing libraries.
4. Mess up a git repo by accidentally committing a very large data file.
5. Scrape a website (e.g. eBay, Amazon, IMDB, Wikipedia) to answer a personal question.
6. Read a math, stats, or programming book while riding public transportation (train, plane, bus, etc.).
7. Bang your head on a timestamp conversion problem for two hours or more.
8. Train a text classifier, probably using books from Project Gutenberg or movie reviews.
9. Start writing a poker bot.  (Bonus points for actually finishing.)
10. Fill up a piece of paper with times and percentages to estimate when a long-running job will finish.
11. Enter a Kaggle contest.
12. Get back a batch of really bad results from MTurk.
13. Set up a dummy account with a web service solely for the purpose of collecting data.
14. Read a math, stats, or programming book in bed.
15. Write a regular expression to avoid a couple dozen copy-pastes.

Probably no one has done all of them (scavenger hunt, anyone?), but they're still common enough that you could grab a handful and train a pretty effective Naive Bayes classifier.
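For the curious, here's a rough sketch of what that classifier might look like: a tiny Bernoulli Naive Bayes with Laplace smoothing, written in plain Python. The four features and the six labeled "people" are entirely made up for illustration, and the function names `train_bernoulli_nb` and `predict` are my own, not from any library.

```python
import math

def train_bernoulli_nb(rows, labels, alpha=1.0):
    """Fit a Bernoulli Naive Bayes model on binary feature vectors."""
    n_features = len(rows[0])
    model = {}
    for cls in set(labels):
        members = [r for r, y in zip(rows, labels) if y == cls]
        # Laplace-smoothed probability that each feature is 1 within this class.
        probs = [
            (sum(r[j] for r in members) + alpha) / (len(members) + 2 * alpha)
            for j in range(n_features)
        ]
        # Store the log prior alongside the per-feature probabilities.
        model[cls] = (math.log(len(members) / len(rows)), probs)
    return model

def predict(model, row):
    """Return the class with the highest log posterior for one feature vector."""
    def score(cls):
        log_prior, probs = model[cls]
        return log_prior + sum(
            math.log(p if x else 1.0 - p) for x, p in zip(row, probs)
        )
    return max(model, key=score)

# Hypothetical training data (invented for this sketch).
# Columns: [MapReduce word count, huge file in git, timestamp headache, Kaggle]
X = [
    [1, 1, 1, 1],
    [1, 0, 1, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
]
# True means "data scientist" in this toy labeling.
y = [True, True, True, False, False, False]

model = train_bernoulli_nb(X, y)
print(predict(model, [1, 1, 1, 1]))
print(predict(model, [0, 0, 0, 0]))
```

With real survey answers you would want far more rows than this, and probably all 15 features.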

What other features would you add to this model?

7 comments:

  1. I've done quite a lot of these. However, I will never participate in #3 (getting emotionally involved in a debate about statistical software [e.g. R vs. Python] or graphing libraries) or any of the other holy wars in computing. Perhaps it is because I extensively use both R and Python as well as many graphing libraries (they all have pros and cons), but I believe it's beneficial to know many tools inside and out - you want to have flexibility when working with new collaborators. Inclusivity elevates all!

    1. I'm not saying it's a good thing, but it sure happens a lot.

  2. "Fill up a piece of paper with times and percentages to estimate when a long-running job will finish" oh my god yes. Great post!

  3. 3. (Dispense with all rationality and) get emotionally involved in a debate about the ethics of using various "available" datasets.

  4. Thanks for sharing your points. Data science is deep knowledge discovery through data inference and exploration. This discipline often involves using mathematical and algorithmic techniques to solve some of the most analytically complex business problems, leveraging troves of raw information to uncover the hidden insight that lies beneath the surface. It centers around evidence-based analytical rigor and building robust decision capabilities. Ultimately, data science matters because it enables companies to operate and strategize more intelligently. It is all about adding substantial enterprise value by learning from data.
    Thanks! https://intellipaat.com/
