Check Yo Self: 5 Things You Should Know About Data Science (Author Note)

The Data Scientist. Thanks to @SeriousRon for the image.

Hey folks. I’m going to take a brief break from the narrative to talk to you directly about data science as a discipline. There’s a lot of noise floating around about how data scientists are the sexy saviors of the world. Well, we’re not. At least, anyone who’s ever seen my hairy, white thighs knows I’m not sexy. I won’t speak for the rest of my data science colleagues, but few look like this:

When trying to be sexy, we look more like this:

So whether you actually are a so-called data scientist or are looking to hire one or become one, let’s have a little quiet time for reflection. Having been in this field for a while, allow me to pontificate uncontrollably:

1) You are not the most important function of your organization. This is a toughie. I do love me some me.

But consider the airline industry. They’ve been doing big data-ish analytics for a long while. For example, they’ve been engaging in revenue management and yield management since before you were a twinkle in yo’ mama’s eye. They’ve used all their data to squeeze that last nickel out of you for that seat you can barely fit in.  It’s a huge win for mathematics.

But you know what? The most important part of their business is flying. The products and services an organization sells matter more than the big data models that tack on pennies to those dollars. Your goals should be things like using data to facilitate better targeting, forecasting, pricing, decision-making, reporting, compliance, etc. In other words, working with the rest of your organization to do better, not to do data science for its own sake.

2) Leave the complexity at the door. A long time ago in a galaxy somewhere near Cambridge, MA, I helped build a supply chain optimization model for a Fortune 500 company. The model used the tangents of an integral over a truncated normal distribution as a function of the mean as linear upper bounds for embedding a probabilistic model inside a Mixed-Integer Program. Sound complicated? It was. And it worked. So long as we baby-sat the fuck out of it and fed it a loving diet of accurate standard deviations of demand forecasts.

The model was dead on arrival, because it was too complex. The same thing happened to the Netflix prize winners. Your model is not the goal; your job is not a Kaggle competition (unless you work at Kaggle or something). Sustainable, repeatable business improvement is the goal, and the ongoing effort needed to use your complex big data product can be at odds with this goal if you’re not sensible in your design. If that means using a regular mixed-integer program instead of a probabilistic one, do that. Because this isn’t about proving how smart you are. Do you want to be the only one who can run your model? That’s the most boring kind of job security I can imagine.

3) Much of what you do is marketing. Big Data is a hot topic right now, and company leaders are dying to show off how they’re using their data. Especially if the company is going public or being acquired.  There’s money to be made off of doing cool shit with data. What do you think LinkedIn’s InMaps product was? Did you play with it? Maybe. Was it useful? Errrm, not really. Was it cool? Fuck yeah. Did it impress investors and launch DJ Patil’s career? Ah, there you go.

Go ahead and do the marketing. It’s good for your business. But don’t forget that it’s just you being all “sexy data scientist” again. In certain cases, it may be your responsibility to steer the organization back toward something that delivers real value but uses less Gephi.

Tip: If you’re making Gephi graphs out of tweets, you’re probably doing more data science marketing than data science analytics. And stop it. Please. I can’t take any more. To paraphrase the Gospel of Mark, what does it gain a man to have graphs of tweets and do jack for analysis with them? I thought we were doing science, not our best impression of something that belongs in MoMA.

4) The tools are wagging the analytics right now. I can get a little hadoop added to my latte at Starbucks now. hive, pig, mongodb, riak, redis. Rub a little cassandra on it and lay down on the couchdb for an hour. The money to be made is in selling tools and services around the tools. Doing the actual nitty-gritty of big data analytics, that’s secondary. So everyone’s talking up their tool set, and you may have a great setup, but it’s not about the tools or how big your dataset is, it’s about how you use them. I’ve seen better analytics done with ten megs of data in a pivot table than what most seem to be doing with their petabytes of unstructured garbage.

It’s great to have a shit-ton of data, but find a way to use it to actually make money. Not draw graphs of tweets. Draw graphs of emails instead ;-)

5) Data scientist is a poor term. Communication and creativity are most important. Whoever chose the term data scientist has downplayed what’s most important about this job. A data scientist needs to be someone who can bridge the gap between complex analytics on large data sets and the dreams of company leadership. A data scientist needs to be creative about identifying ways that data can solve company problems. And if the data’s not collected yet to solve a problem? They need to figure out how to get it. Here’s a breakdown of how I spend my time:

-1/3 talking with others and figuring out how we can use our data to solve problems

-1/3 up to my elbows in uggo data cleaning and prepping to solve a problem

-1/3 hunting down data that’s logged in some strange way, plying developers with drinks to get the data moved into our big data store in the appropriate way for modeling

-.00000001% actual training of models.

This is why Kaggle is bullshit. It’s like focusing on the cherry on your milkshake. Who gives two shits about the cherry? You need the rest of the milkshake.

  • Matt Gershoff

    Right on – this is actually a great post. I would add that the data analyst role (data scientist if you are under 40 yro) is to also help assess the marginal value of data – so think about its costs and the expected returns from having/using it.

    • http://twitter.com/John4man John Foreman

      Absolutely, this is especially important for tech start-ups whose transactional data is of seemingly less value. For a hotel chain or a big box store, the data used in modeling has been primarily purchase and pricing data, but if I’m a website and I’m trying to eke out value from transactions like “did a person visit this page? for how long? with what user agent?” that’s like having Walmart data on what aisle a person walked down.

      Sometimes it’s tough to quantify how valuable this data is unfortunately until after you’ve already mined it to find some hidden result.

  • http://twitter.com/marie_wallace Marie Wallace

    Classic post! While I’m not a data scientist, I’ve worked in the unstructured analytics (nlp) space for more than a decade. We’ve been one of the greatest beneficiaries of the bigdata movement with the downside that we’ve started to believe too much of our own hype. So it’s just fantastic to see such open, tongue-in-cheek, no-punches-pulled self-reflection :)

    • http://twitter.com/John4man John Foreman

      Glad you liked the post! It came out of reading all this stuff in the HBR and similar publications about data science, and then reflecting on how my role over the past decade meshes with the hype.

      Concerning your work in NLP (which is essential to the whole big data movement), I’ve been lamenting that I probably can’t address it on this blog. While the plot could motivate it (pulling keywords out of a document dump stolen from the DEA?), I’d have to majorly leave Excel to address it. NLP IMHO has more CS in it than math, even though good tokenization often requires AI (max entropy) models. Sigh. I’d probably have to go to Python / NLTK, which I think would be too much of a stretch for the core audience here.

      • http://twitter.com/marie_wallace Marie Wallace

        I tend to see NLP as both tangential and yet core to big data analytics and I guess data science (more tangentially). The main reason for this hypothesis is not necessarily that NLP uses the same core maths (although NLP is becoming increasingly more about statistics) but rather because many analytics challenges today require integration of unstructured data. Having done quite a bit of modelling over the years, I often see a disconnect between the models being used to “manipulate unstructured content so it’s ingestible into a mathematical algorithm” (i.e. analyze/convert it to structured/semi-structured data). However all that being said, NLP is a very different topic, no argument there.

        As an aside, I had a wee moan of my own a while ago complaining about the fact that you “data scientists” were the new “rockstars” and that us content analytics folks were the unsung heroes ;-) Bit tongue in cheek, of course!

        http://allthingsanalytics.com/2012/03/04/data-scientist-too-narrow-a-definition/

        • http://twitter.com/John4man John Foreman

          Thanks for the link! It’s funny, I think every analytics professional I’ve talked to (including myself…I talk to myself a lot, especially on the subway while pushing a shopping cart) who’s done analytics prior to about 2 years ago feels that somehow this new term of “data scientist” has not fully captured the techniques and practitioners that have existed out there for some time.

          I was once at an INFORMS panel where a bunch of OR and stats PhDs griped about the term for an hour. What it left out, why “scientist” wasn’t a good term (they actually were scientists after all!), how they’ve been doing this stuff for decades, etc. I imagine the dust will settle in a couple of years, and we’ll be better able to reflect on just what all this hubbub was about.

          For my part, I keep griping about how mathematical programming has been left out of the discussion, but I think that’s because the scale of problems that a MIP solver can handle is smaller than “web scale” (whatever that means).

  • zyxo

    Fantastic post! As a data miner I’m only happy when my marketer’s campaign is a success. And each model, simple or complex, has to fit into our automated scoring system.

  • Joe

    A very good read

  • http://twitter.com/exmathematician Jack Golding

    Thanks for this post John, really good read for someone about to enter the field!

  • http://twitter.com/data_nerd Carla Gentry

    WOW, 1st honest piece I have read on Data Science. I’ve been a data nerd for 15+ years and no one EVER thought what I did was sexy, but let my data be late and OMG, you’d think the world had ended. I spent more time cleaning up crap from inexperienced programmers and data entry people (data collection) – Sexy no, hard work and a lot of it boring as hell or I had to stand on my head and spin three times to find a workaround -YES!

  • Stephen Hardy

    Data scientist seems like a bit of a tautology. What does a scientist test their hypothesis with if it isn’t data?

    • john4man

      I agree that the term is rather silly. In general, I prefer the term analytics to data science. I imagine it comes from the idea that many data scientists have skills in teasing out insight from very large transactional datasets.

  • Rome

    awesome post man i fucking love this blog. exactly what i needed. “So everyone’s talking up their tool set, and you may have a great setup, but it’s not about the tools or how big your dataset is, it’s about how you use them. I’ve seen better analytics done with ten megs of data in a pivot table than what most seem to be doing with their petabytes of unstructured garbage.” is exactly what I was thinking. Glad you said it to confirm it to me.