Hey folks. I’m going to take a brief break from the narrative to talk to you directly about data science as a discipline. There’s a lot of noise floating around about how data scientists are the sexy saviors of the world. Well, we’re not. At least, anyone who’s ever seen my hairy, white thighs knows I’m not sexy. I won’t speak for the rest of my data science colleagues, but few look like this:
When trying to be sexy, we look more like this:
So whether you actually are a so-called data scientist or are looking to hire one or become one, let’s have a little quiet time for reflection. Having been in this field for a while, allow me to pontificate uncontrollably:
1) You are not the most important function of your organization. This is a toughie. I do love me some me.
But consider the airline industry. They’ve been doing big data-ish analytics for a long while. For example, they’ve been engaging in revenue management and yield management since before you were a twinkle in yo’ mama’s eye. They’ve used all their data to squeeze that last nickel out of you for that seat you can barely fit in. It’s a huge win for mathematics.
But you know what? The most important part of their business is flying. The products and services an organization sells matter more than the big data models that tack on pennies to those dollars. Your goals should be things like using data to facilitate better targeting, forecasting, pricing, decision-making, reporting, compliance, etc. In other words, work with the rest of your organization to do better; don’t do data science for its own sake.
2) Leave the complexity at the door. A long time ago in a galaxy somewhere near Cambridge, MA, I helped build a supply chain optimization model for a Fortune 500 company. The model used the tangents of an integral over a truncated normal distribution as a function of the mean as linear upper bounds for embedding a probabilistic model inside a Mixed-Integer Program. Sound complicated? It was. And it worked. So long as we baby-sat the fuck out of it and fed it a loving diet of accurate standard deviations of demand forecasts.
The model was dead on arrival, because it was too complex. The same thing happened to the Netflix Prize winners: Netflix never put the full winning ensemble into production. Your model is not the goal; your job is not a Kaggle competition (unless you work at Kaggle or something). Sustainable, repeatable business improvement is the goal, and the ongoing effort it takes to keep a complex big data product running can be at odds with this goal if you’re not sensible in your design. If that means using a regular mixed-integer program instead of a probabilistic one, do that. Because this isn’t about proving how smart you are. Do you want to be the only one who can run your model? That’s the most boring kind of job security I can imagine.
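For the curious, here is a rough sketch of the tangent trick mentioned above. It is not the actual model from that project. As an assumption for illustration, it uses expected sales E[min(D, q)] under a plain (untruncated) normal demand D with mean mu, which is concave in mu, so every tangent line sits on or above the curve. The names `expected_sales`, `tangent_at`, and `upper_bound`, and the numbers `SIGMA` and `Q`, are all hypothetical.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_sales(mu, sigma, q):
    """E[min(D, q)] for D ~ Normal(mu, sigma^2). Concave in mu."""
    z = (q - mu) / sigma
    # Expected demand in excess of q: sigma * (pdf(z) - z * P(Z > z)).
    expected_excess = sigma * (normal_pdf(z) - z * (1.0 - normal_cdf(z)))
    return mu - expected_excess

def tangent_at(mu0, sigma, q):
    """Return (intercept, slope) of the tangent line at mu0."""
    slope = normal_cdf((q - mu0) / sigma)  # d/dmu E[min(D, q)] = P(D < q)
    intercept = expected_sales(mu0, sigma, q) - slope * mu0
    return intercept, slope

def upper_bound(mu, tangents):
    """Smallest tangent value at mu: a piecewise-linear upper bound."""
    return min(a + b * mu for a, b in tangents)

# Assumed illustration values, not from the original project.
SIGMA, Q = 20.0, 100.0
tangents = [tangent_at(m, SIGMA, Q) for m in range(40, 161, 20)]

# Inside a mixed-integer program, each tangent would become one linear
# constraint of the form y <= intercept + slope * mu.
for mu in (50, 90, 130):
    print(mu, round(expected_sales(mu, SIGMA, Q), 3),
          round(upper_bound(mu, tangents), 3))
```

Even in toy form you can see the catch: every slope and intercept depends on `SIGMA`, so the bounds are only as good as the standard deviations you feed them.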
3) Much of what you do is marketing. Big Data is a hot topic right now, and company leaders are dying to show off how they’re using their data. Especially if the company is going public or being acquired. There’s money to be made off of doing cool shit with data. What do you think LinkedIn’s InMaps product was? Did you play with it? Maybe. Was it useful? Errrm, not really. Was it cool? Fuck yeah. Did it impress investors and launch DJ Patil’s career? Ah, there you go.
Go ahead and do the marketing. It’s good for your business. But don’t forget that it’s just you being all “sexy data scientist” again. In certain cases, it may be your responsibility to steer the organization back toward something that delivers real value but uses less Gephi.
Tip: If you’re making Gephi graphs out of tweets, you’re probably doing more data science marketing than data science analytics. And stop it. Please. I can’t take any more. To paraphrase the Gospel of Mark, what does it gain a man to have graphs of tweets and do jack for analysis with them? I thought we were doing science, not our best impression of something that belongs in MoMA.
4) The tools are wagging the analytics right now. I can get a little hadoop added to my latte at Starbucks now. hive, pig, mongodb, riak, redis. Rub a little cassandra on it and lie down on the couchdb for an hour. The money to be made is in selling tools and services around the tools. Doing the actual nitty-gritty of big data analytics, that’s secondary. So everyone’s talking up their tool set, and you may have a great setup, but it’s not about the tools or how big your dataset is. It’s about how you use them. I’ve seen better analytics done with ten megs of data in a pivot table than what most seem to be doing with their petabytes of unstructured garbage.
It’s great to have a shit-ton of data. Now find a way to use it to actually make money. Not draw graphs of tweets. Draw graphs of emails instead.
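To make the pivot-table point concrete, here is a minimal sketch of the kind of summary a spreadsheet pivot table gives you, using nothing but the Python standard library. The `orders` rows and the `pivot_sum` helper are made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows of (region, product, revenue); made-up illustration data.
orders = [
    ("east", "widgets", 120.0),
    ("east", "gadgets", 80.0),
    ("west", "widgets", 200.0),
    ("east", "widgets", 30.0),
    ("west", "gadgets", 50.0),
]

def pivot_sum(rows):
    """Sum the value for each (row key, column key) pair, pivot-table style."""
    table = defaultdict(lambda: defaultdict(float))
    for row_key, col_key, value in rows:
        table[row_key][col_key] += value
    # Convert to plain dicts so the result prints and compares cleanly.
    return {r: dict(cols) for r, cols in table.items()}

pivot = pivot_sum(orders)
for region in sorted(pivot):
    print(region, pivot[region])
```

No cluster required. If a table like this answers the business question, the petabytes can wait.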
5) Data scientist is a poor term. Communication and creativity are most important. Whoever chose the term data scientist has downplayed what’s most important about this job. A data scientist needs to be someone who can bridge the gap between complex analytics on large data sets and the dreams of company leadership. A data scientist needs to be creative about identifying ways that data can solve company problems. And if the data’s not collected yet to solve a problem? They need to figure out how to get it. Here’s a breakdown of how I spend my time:
-1/3 talking with others and figuring out how we can use our data to solve problems
-1/3 up to my elbows in uggo data cleaning and prepping to solve a problem
-1/3 hunting down data that’s logged in some strange way, plying developers with drinks to get the data moved into our big data store in the appropriate way for modeling
-.00000001% actual training of models.
This is why Kaggle is bullshit. It’s like focusing on the cherry on your milkshake. Who gives two shits about the cherry? You need the rest of the milkshake.