“You lied to me, Alex,” said Victor. He spoke quietly but with an undeniable hint of anger.

My mind immediately went to the worst. He knew about my work with the feds. Victor knew he was getting screwed.

“What’d I lie about?” I asked, thinking about which door I should make a run for if things got ugly. There were bodyguards posted outside the house, and I may look like Prefontaine, but I run like Phoebe Buffay. If things were about to go down, I was likely dead. Even if I did make it out of the house, I’d been blindfolded on the way here, so I had no idea which way to head for safety.

“The forecast you made for my January meth demand. It was wrong!” he said.

He brought his voice up to a boom with the word “wrong,” and it echoed quickly throughout the empty living room of the McMansion where we sat. Victor’s appearance was unkempt, and his usual cool tone had been replaced with a frazzled edge.

Victor was feeling pressure; Agent Bestroff was up to something, and I could guarantee it didn’t have anything to do with meth demand in Flint, Michigan. This was about something else, but forecast accuracy was what he was going to take his frustration out on.

I decided to push back a little on him.

“Of course the forecast was wrong,” I said.

Victor raised an eyebrow, “What do you mean, ‘of course?’”

“The only thing you can ever guarantee with a forecast is that it *will be* wrong. That’s why forecasting is often called ‘organized ignorance.’ It’s tough to predict the future, especially things that might be impacted by unpredictable forces. Like drug addicts. They’re inherently unpredictable. As is the economy and the weather,” I said.

“How off was the forecast, exactly?” I added.

He shook his head and answered, “You told me I would sell 72.5 kilos. I only sold 68 kilos.”

I nodded, “Yeah, that’ll happen with a forecast.”

“Well then why bother? If I can’t trust it, what’s the point?” asked Victor.

“Hey, it’s better than nothing right? If you didn’t have a seasonally adjusted forecast, then you might just be planning off your gut instead,” I answered.

Victor sighed, “It’s not good enough. I want a worst case scenario to plan for. That way I don’t have excess drugs sitting on shelves waiting for the cops to find.”

“Ah,” I answered, “Now a ‘worst case scenario’ is something we can do. We can create something called a **prediction interval** around the forecast. Similar to a confidence interval (but nitpickingly different!), we can put a 95% prediction interval around the forecast so that you’ll have a likely band of where the demand will fall.”

“And we can do that with the forecast we’ve already made?” asked Victor.

“Yes and no,” I answered, “We can do it with the Holt-Winters forecast and a bit of Monte Carlo simulation, but we’ll need to use some slightly different but equivalent level, trend, and seasonality update equations to make it happen.”

“What do you mean?” asked Victor.

“Well, there’s this alternative way of doing an exponential smoothing forecast, Holt-Winters included, called the **error correction form**. The numbers come out the same, but they’re written by forecasting from each period one step into the next, denoting the error between actual demand and that one-step forecast, and correcting the level, trend, and seasonality using that error.”

“And what does that method get us?” asked Victor.

“It will allow us to quantify the distribution of the error when doing a one-step forecast. We can use that distribution to **simulate** future forecast errors, and from those simulated errors we can actually back out an entire demand scenario. By generating a thousand or so demand scenarios based on random errors, we can come up with worst and best case scenarios.”

“OK,” said Victor, “So where do we start?”

He stood from the table where he sat and walked into the empty kitchen of the house. From an almost barren cabinet in the kitchen he pulled out a gigantic tub of kettlecorn and popped it open.

“You want any?” he asked.

“Uh, do I want any kettlecorn?” I asked, “No, Victor, I’m good…what is this house? Why was I blindfolded on the way here? And why does the place seem cleared out except for a random tin of kettlecorn?”

Victor permitted himself a brief chuckle between scowls.

“Recently I’ve had some security issues,” he said, “You wouldn’t know anything about that would you?”

“What kind of issues?” I asked.

“Forget it. My security problems always have a way of working themselves out,” he said and chomped down hard on a fistful of kettlecorn, “Let’s get back to the problem at hand.”

I was, for better or worse, growing used to the violent implications in Victor’s phrasing. I wished briefly that someone would just arrest the guy, so I could go back to my normal life. But then would they want me to testify? Oh lordy. That thought scared the mess out of me. He’d just whack me from prison or something.

I pushed that thought away and looked at Victor, “Can you pull up the spreadsheet we did the demand forecast in?”

He pulled out his laptop and threw the sheet open.

“OK, so if you’ll remember, last time we finished with this ‘HW Exp Smoothing’ sheet and a 12 month forecast,” I said, “What I’d like to do is make a new sheet in the workbook called ‘HW Error Correction’ in which we’ll paste row 1 from the smoothing sheet. We can clear out the MAPE value. We’re not going to need that. Also, let’s paste in the time series data from the previous tab and the initial level, trend, and seasonal values in columns D:F.”

This gave us a sheet that resembled the one in the figure below:

“Now, to this sheet I’m going to add another couple of calculations in columns G and H. In G, let’s put the one-step ahead forecast from our initial values into January 2008,” I said, “And if you’ll remember from last time how to make such a forecast, for one period forward it’s just the sum of the previous level and the previous trend, times the last relevant seasonal adjustment, which is F4 in our sheet.”

“So that’s just (D15+E15)*F4?” Victor asked.

“Yep, and we’re going to throw that forecast into G16,” I said, “and then in column H, we can calculate the one-step error between the forecast in G16 and the actual demand in C16.”

This gave the figure shown below:

“OK, now how do we roll through time like we did using the previous equations?” asked Victor.

“We need to update the level, trend, and seasonality values and drag everything through all the months until we’re caught up,” I said, “So we’re going to use some update equations that are equivalent to the originals but rely on our one-step error calculation. First, we’ll update the level as just the previous level plus the previous trend plus the level smoothing factor *times* the one-step error divided by the appropriate seasonal factor to deseasonalize the error.”

I threw the calculation into the sheet as seen below:

Victor nodded, “So the higher the smoothing factor is, the more of the one-step error we adjust for when we update the level.”

“Exactly,” I said, “And the trend update works almost exactly the same way. It’s the previous trend plus the trend smoothing factor times the level smoothing factor times the deseasonalized error. You can think of it as the trend incorporating *some* of the error the level incorporated to adjust just the trend component.”

I put that calculation into the sheet as well:

“And note how those two values, 50.51 and 1.15, are the *exact* same values we got using the other update equations. All we’re doing is writing everything in a way that’s going to be handy to us later on,” I said.

Victor nodded, “So then how do we update the seasonality?”

“It’s very similar to the trend update, although kind of a mirror image. The seasonality update is just the previous relevant seasonality factor plus the seasonal smoothing factor times the error *not incorporated* into the level, divided through by the sum of the previous level and trend to put it on a seasonal multiplier scale,” I said.

Victor looked slightly confused.

“To put it differently, it’s the seasonal factor from 12 months ago plus the seasonal smoothing factor times 1 minus the level smoothing factor times the error divided by the previous level and trend,” I said, “It’ll make more sense in the spreadsheet.”

And so I threw the calculation into the workbook as shown below:

“And now, since we’ve used absolute references on the smoothing parameters,” I added, “We can just drag values D16:H16 down through the last month of demand.”

Dragging the calculations down, the sheet filled.
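(For readers who think better in code than in cell references, here is a minimal Python sketch of the three error-correction updates just described. The names `alpha`, `gamma`, and `delta` for the level, trend, and seasonal smoothing factors are my own labels, not anything from the spreadsheet, and the function names are hypothetical.)

```python
def hw_error_correction_step(level, trend, seasonal, actual, alpha, gamma, delta):
    """One error-correction update of multiplicative Holt-Winters.

    `seasonal` is the seasonal factor from one full season (12 months) ago.
    alpha, gamma, delta are assumed names for the level, trend, and
    seasonal smoothing factors.
    """
    forecast = (level + trend) * seasonal      # one-step forecast (column G)
    error = actual - forecast                  # one-step error (column H)
    # Level: previous level + trend, plus a share of the deseasonalized error.
    new_level = level + trend + alpha * error / seasonal
    # Trend: takes a share of the error the level just absorbed.
    new_trend = trend + gamma * alpha * error / seasonal
    # Seasonality: takes a share of the error NOT absorbed by the level.
    new_seasonal = seasonal + delta * (1 - alpha) * error / (level + trend)
    return forecast, error, new_level, new_trend, new_seasonal


def roll_through_history(demand, level, trend, seasonals, alpha, gamma, delta):
    """Apply the update month by month, collecting the one-step errors.

    `seasonals` starts as the initial seasonal factors, oldest first, so
    seasonals[t] is always the factor from one season before month t.
    """
    seasonals = list(seasonals)
    errors = []
    for t, actual in enumerate(demand):
        _, error, level, trend, new_seasonal = hw_error_correction_step(
            level, trend, seasonals[t], actual, alpha, gamma, delta
        )
        seasonals.append(new_seasonal)
        errors.append(error)
    return level, trend, seasonals, errors
```

Calling `roll_through_history` on the historical demand plays the role of dragging row 16 down the sheet: the returned `errors` list corresponds to column H.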

“Here’s the cool part though,” I said, tapping column H on the screen, “We’ve now isolated this one-step ahead error column, and assuming our forecast is unbiased, we can calculate the standard deviation of the error, also called the **standard error,** and use that bell curve to simulate future errors.”

“And what is the standard deviation of our errors?” asked Victor.

“Well, up top in cell I1 let’s calculate the sum of the squared error (SSE) as =SUMPRODUCT(H16:H75,H16:H75) and then the standard error in this context is customarily calculated as the square root of the SSE divided by the number of data points minus the number of smoothing parameters. In our case that’s 60 minus 3,” I said and added the calculations to the top of the sheet:
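(The same standard error calculation can be sketched in a few lines of Python. The function name is hypothetical; the default of 3 parameters mirrors the "60 minus 3" in the dialogue.)

```python
import math


def standard_error(errors, n_params=3):
    """Square root of SSE divided by (number of errors - smoothing parameters)."""
    sse = sum(e * e for e in errors)           # like SUMPRODUCT(H16:H75, H16:H75)
    return math.sqrt(sse / (len(errors) - n_params))
```

With 60 one-step errors and three smoothing parameters, the divisor is 57.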

“OK, so we’ve got a standard error that’s a little over 5, meaning roughly 68% of one-step errors are within 5 and a quarter kilos of the forecast,” I said, “which, by the way, means that my January forecast error wasn’t all that bad.”

Victor nodded, “Fine, but I still didn’t know that.”

“True,” I said, “So what I’d like to now do is add the months in 2013 to the bottom of the sheet here and simulate future forecast error values by drawing them randomly from a normal distribution with mean 0 and standard error 5.24. To do that we use the formula =NORMINV(RAND(),0,K$1).”

I tossed the formula in cell H76 for January 2013 and dragged it down, yielding the sheet below:

“And now that we have our error, what can we do?” asked Victor.

“This is where the analytics gets pretty badass,” I said, “We can drag our calculations in D through G down through the future months just as if the demand had happened. And then using the error plus the one-step forecast, we can back out what the simulated demand is for that future month.”

I pulled the formulas in D:G down through row 87 and then backed out demand in C76 as G76+H76 and dragged that down as well, giving the sheet in the image below:

“All right!” I said, clapping my hands together, “That’s one possible future demand scenario.”
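(Here is a rough Python sketch of generating one such scenario, assuming the same error-correction updates as in the sheet. The parameter names `alpha`, `gamma`, and `delta` and the function name are my own; `random.gauss` stands in for the NORMINV(RAND(), ...) draw.)

```python
import random


def simulate_scenario(level, trend, last_seasonals, std_error,
                      alpha, gamma, delta, horizon=12, rng=random):
    """Simulate one future demand path by drawing random one-step errors.

    `last_seasonals` holds the most recent season of seasonal factors,
    oldest first. `rng` can be a seeded random.Random for repeatability.
    """
    seasonals = list(last_seasonals)
    demand = []
    for t in range(horizon):
        s = seasonals[t]                       # factor from one season back
        base = level + trend
        forecast = base * s                    # one-step forecast
        error = rng.gauss(0.0, std_error)      # simulated forecast error
        demand.append(forecast + error)        # back out the simulated demand
        # Roll the state forward as if that demand had actually happened.
        level = base + alpha * error / s
        trend = trend + gamma * alpha * error / s
        seasonals.append(s + delta * (1 - alpha) * error / base)
    return demand
```

Each call returns a different twelve-month path, which is the code equivalent of the random numbers refreshing in the sheet.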

“And we need to find out what’s the worst case of these futures?” asked Victor.

“Yeah,” I nodded, “We need to generate a ton of these scenarios and discover how they spread out.”

“So how do we generate multiple scenarios?” asked Victor.

“Copy paste,” I said smiling, “Well, not quite, but sort of…lemme show you. Down a bit on the sheet on row 93 I’m going to write ‘Demand Scenarios 2013’ and then below that I’m going to paste the months of the year going across the columns. Then below that I’m going to copy 2013’s simulated demand, and I’m going to paste-special the values **transposed** into row 95.”

That ended up looking like this:

“And when you paste a scenario, the random numbers above update themselves,” said Victor.

“Exactly,” I answered.

“So you can just copy paste another scenario,” he said.

“Right, but honestly, that takes too long, so instead we’re going to record a macro to do it for us,” I said. (Some of you may need to show the developer tab on the ribbon in Excel to do this step.)

I navigated to the record macro button on the developer tab of the Excel ribbon. And pressing the record button, I named the macro and assigned it a shortcut key for fast access later:

Pressing OK, I did the following steps:

- Inserted a new, blank row 95
- Copied the 2013 simulated demand data in column C
- Did a paste-special values transposed onto my new blank row 95
- Pressed stop on the macro recording

“And now that we have our macro, we just need to press the shortcut key a lot to generate scenarios. Actually, you can just hold the buttons down and it’ll run through it on its own,” I said and pressed option + command + z on Victor’s Mac for a few minutes until I had just over 1000 scenarios.

Then on row 89, I calculated the 97.5th percentile of each of the simulated columns, on row 90, I pasted the transposed forecast values from the previous tab, and on row 91, I pulled the 2.5th percentile, giving me the following:
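(A small Python sketch of the percentile step, assuming the scenarios are stored as a list of rows with one value per month. The interpolation rule below is the common "inclusive" one, which is how I understand Excel's PERCENTILE to behave; treat that as an assumption. Function names are hypothetical.)

```python
def percentile(values, p):
    """Percentile with linear interpolation between ranks, p in [0, 1]."""
    ordered = sorted(values)
    pos = p * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def interval_bounds(scenarios, lower=0.025, upper=0.975):
    """Per-month lower and upper bounds across all simulated scenarios."""
    months = list(zip(*scenarios))             # transpose: one tuple per month
    lows = [percentile(m, lower) for m in months]
    highs = [percentile(m, upper) for m in months]
    return lows, highs
```

Feeding it a thousand or so simulated rows gives the two outer bands; the forecast itself sits between them.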

“So here we now have our forecast and upper and lower 95% interval bounds on it that we’ve found by simulating the forecast error,” I said, “So you can be quite confident that whatever demand you experience over the next year will fall inside these bounds.”

Victor smiled wide for the first time that day, “Great!”

“And we can highlight these three rows, select the area chart in Excel, and then format the bottom series with a white fill in order to get what folks in the forecasting business call a **fan chart**,” I said and inserted the chart for him.

“And note how the error compounds over time. Also, due to the multiplicative nature of the forecast, the absolute width of the interval increases in high seasonal demand periods,” I added.

“Wow, that’s nice,” said Victor.

It was my turn to smile, “Consider your ignorance about the future quantified.”

Credit goes to Hyndman for developing the state equations for Holt-Winters that make this whole simulation approach possible. As part of that work, he also derived closed-form calculations for the prediction intervals, but I’m extremely partial to this Monte Carlo approach because it’s so intuitive (and intuition is really what exponential smoothing is all about).

Hyndman wrote all the good forecasting packages in R, and I highly recommend checking out his blog as well as his (unfinished?) online textbook. Personally, I find Bowerman’s 2004 textbook to be the best for learning a lot of this stuff.

**Speaking and cheating on this blog**

Here’s a blog post I wrote over at the Strata blog: Your analytics talent pool is not made up of misanthropes. And a fun one from the MailChimp blog here. And I’ve been speaking. For instance, there’s this video of me speaking at Strata on how MailChimp is using data to do awesome things:

Upcoming other talks:

6/20-21 Big Data Summit in Toronto

10/6 INFORMS Annual Meeting

**Hacker News debacle of 2013**

You may have missed it, but Analytics Made Skeezy made it to #1 on Hacker News. For all of a minute. Then the site immediately went down. Then HostGator made sure it stayed down for a couple weeks. So I’ve been struggling to get the AMS house in order. I think today we’re in a slightly better spot than we were. My apologies.

**The Book!!!!!**

I’ve been writing an analytics book!

It’s an ongoing effort that was inspired by this blog, actually.

It’s called *Data Smart: Using Data Science to Transform Information into Insight*. Yeah, yeah, I know. It’s no “Analytics Made Skeezy,” but the publisher (Wiley) would only meet me so far on that. I think they were afraid that people might review the book negatively on Amazon due to the crime element.

So here are the details:

- The book, like this blog, will be in Excel (2007, 2010, or 2011 for Mac)
- Unlike this blog, it will not be a narrative. It’ll be guided learning, but I maintain a large presence in the book and promise all sorts of cheesy groaning humor
- The book is written with greater care than this blog. These blog posts take me a couple of days from start to finish. There are errors on the blog. Indeed, I think I mess up the metric system not once but twice! The chapters in the book on the other hand take up to a month to prepare. I go through techniques thoroughly. There are about 4x as many figures, and I try to call out differences in software where applicable.
- Topics covered are:

- Linear and nonlinear programming including Big M constraints, linearizing multiplied variables, and simulation optimization
- K-means clustering, K-medians clustering, the silhouette, and asymmetric distance calculations
- Graph modularity maximization for clustering (in Excel and in Gephi…this is the only implementation of divisive clustering through modularity maximization that I know of, especially in Excel. Fun stuff.)
- Linear regression and the use of a logistic link function for “AI”
- Ensemble AI models and node purity calculations — we’ll implement bagged decision stumps in Excel to give a flavour for how it works
- Forecasting using SES, Holt’s Trend, and Holt-Winters. R-squared, the F test, autocorrelation, and prediction intervals with Monte Carlo sim are also covered in this chapter. Hell, we’ll even make a correlogram complete with dashed critical value lines. Sweet!
- Naive Bayes (in Excel???) — Haven’t finished this chapter yet.
- Outlier detection techniques including LOF
- Multi-criteria decision analysis techniques

- The book will come with some ridiculously thorough spreadsheets for download that put this blog to shame. And the examples worked will be from other business topics than the drug trade.

Now, this book is not for academics. It’s for folks who feel a little left behind by this whole data science thing and want to learn some techniques in a familiar environment without having to simultaneously learn to code. For production applications, I still think a lot of this should be implemented in OPL, R, etc., but those environments don’t facilitate learning like Excel does. You can’t learn modularity maximization in Gephi — you can just learn to press the button. We’re going to dig deep and actually learn this stuff.

Why? Because if you really *know* the techniques (even if you forget a little) then your ability to identify opportunities to use them in a business context increases. And you’re not gun-shy; you’ve got some confidence.

Anyway, sign up for the AMS newsletter, and I’ll be updating everyone as we get nearer to release. Alternatively, I’ll be posting book-related info at john-foreman.com

Hugs,

John

“We have a problem, Alex,” said Victor. We sat together on a bench outside the Great American Scream Machine at Six Flags. I was beginning to suspect that Victor was picking these bizarre meeting locations on purpose. Certainly no one had stuck a hidden mic under this bench, although plenty of folks had stuck their gum there.

The evening air smelled of sweet garbage, like a rotten apple core. And the rickety roller coaster clacked up the first hill in the distance. Victor watched it through slit, thoughtful eyes. I’d never seen him like this. He had a scruffy beard, and he fidgeted slightly.

“What kind of problem?” I asked nervously. I knew the normal Victor was a dangerous man; I assumed the paranoid Victor wouldn’t be much better.

He turned and looked at me, “There’s a fox in my henhouse.”

“What do you mean?” I asked. Someone on the roller coaster shrieked as it tumbled down a hill.

“I have a source within the DEA,” he said, eyeing me carefully as he spoke, “And this source has told me that there is a Judas waiting to betray me.”

My throat was beginning to close, and my heart leapt.

“Who is it?” I asked, faking a calm curiosity.

“A coke supplier is all I know. They’ve set a trap for me and some of my competition. But I don’t know which supplier.” He shook his head. “Which is terrible because there’s a shortage right now from all the gang violence south of the border, and I’m having to make deals with stranger and stranger suppliers just to meet demand.”

I breathed a sigh of relief. I wasn’t the traitor. At least not this time. But the fact that the DEA had a mole didn’t bode well for me. I made a note to warn Agent Bestroff.

“So do you have any clues?” I asked.

“I have a short list,” said Victor, “A list of all the coke suppliers offering up their goods to me right now who’ve passed basic vetting. But I have no idea which one has been turned.”

I nodded, “You got any data on these guys?”

“Yeah,” he said, “I’ve got some data. Simple checks we run, supply chain info, price, and availability data, but honestly, I already looked through it, and I didn’t see anything funny.”

He opened a spreadsheet that looked like this:

“Here they are,” he said with a shrug, “All 249 of them. I looked at their prices, the quantities they offered…we’ve got small and large players, but no one’s an outlier.”

I smiled, “That’s an interesting word you use, ‘outlier.’ How’d you check to see if you had any outliers?”

“I looked on the list for anyone named ‘Judas,’” he said with a laugh, “No, I computed the means and standard deviations for the price and quantity columns, but no values seemed too far from the others. And I made graphs of the numeric data, but it all looks fine.”

He pulled up a graph of the available coke each supplier had for sale:

“You can see from the jump in the graph that we’ve got about 80% small time suppliers and 20% or so larger suppliers,” said Victor, “But a minimum of 500 kg for sale is not extraordinary nor is a max of 5000 kg. These values are all expected.”

“And the same is true for the other numeric columns here?” I asked. Victor nodded yes.

“So we’re bumping up against a couple of issues in intrusion detection,” I said, “The first problem is that you’re looking at each column individually for an outlying variable, but maybe this fox is hiding right in the middle of your henhouse, and only when we examine all the data holistically will we find him. And by all the data, I mean this text data too.”

I pointed to all the categorical information in the spreadsheet, the ‘Y’s and ‘N’s and shipping methods.

“You said there were a couple issues here. What’s the other one?” Victor asked.

“Well, taking means and standard deviations is terrible for finding outliers here, even if we could do the analysis column by column, because as we can see from the jump in the graph, your data is pretty oddly distributed. Furthermore, when you take a mean or a standard deviation, those values actually *include* any outlying values in them. You should be using more robust statistics of spread and centrality.”

“What do you propose then?” asked Victor.

I thought for a moment, “Well, if there is an intruder in this data, a ‘wolf in sheep’s clothing’ as they say, then the first thing we should search for are people who may not necessarily be outliers globally on any one value but who nevertheless look unlike the suppliers nearest them. They’re trying to fit in with some subgroup of suppliers here, ‘putting on the sheep’s clothing,’ but they’re failing.”

“But let’s back up a minute. Can you explain all the columns in this sheet to me?” I asked.

“Of course,” said Victor. He distractedly smiled as he watched a boy walk by carrying an over-stuffed bear he’d won. The uncomfortable bench we sat on was starting to hurt my butt, and it irritated me that I’d had to fork over $30 just to meet a drug dealer inside an amusement park when the parking lot would have sufficed.

Victor turned his eyes back to the sheet.

“So the ‘Operate Own Transpo’ column is pretty straightforward,” he said, “Some suppliers contract out to drug runners, while others own and operate that vertical. Usually the larger guys will operate their own running operations.”

“The column next to it is the smuggling method they prefer to use,” he continued, “Personally, I don’t like to deal with someone who uses drug mules unless I have to. Too messy, too many things go wrong, and the quantities they can smuggle are small. The new narco subs are the most reliable, but even a car across the Canadian border through the Chippewa Indian Reservation isn’t a bad risk.”

“And what does ‘Counterintel?’ mean?” I asked.

“Do they have any counterintelligence they can bring to the table?” he said, “As in, do they have any tipsters in law enforcement? Do they have inside information that tips the scales in favor of successful transfers of product? Once again, the larger organizations are more likely to have this.”

“Then why not go with the larger guys every time?” I asked.

“Well,” said Victor, smiling, “Economies of scale don’t trickle down to pricing in the drug trade per se. The large guys are less likely to lose a load, get caught, get you busted, poison your customers, but they charge a hefty premium for their higher level of service. Much like shopping at the nice grocery store here in town, Publix I think it’s called, you pay extra for the nice experience. On top of that, I need the small guys to meet my total demand across the U.S.”

“Who are Jimbo, Stevie, and Margalo?” I asked, pointing to the next columns in the spreadsheet.

Victor laughed, “Strange, right? They’re actually my competition in some major cities, but we pool information on suppliers. If one of them can vouch for a supplier, that’s better than nothing.”

Victor held a finger up, “But you can’t always trust it. You never know when they might get together and burn you. If one of them were to get busted, they could start feeding bad intel to the rest of the group.”

I nodded, “And the ‘Criminal Record’ column is whether the supplier has a criminal record?”

“Yes, their leadership,” he said, “Believe it or not, that’s a plus in this line of business. People who seem too clean often are.”

“What’s ‘need to meet?’” I asked.

“The exchange, money for product, can either go down in a meet or a drop. If you do a meet, the supplier is less likely to get burned because the bags change hands at the same time. But it’s dangerous and unpredictable. The big guys often prefer to just do a drop. Leave the drugs in the trunk of a car and get the money from you where you leave it. Clean hands. But many of the littler guys can’t afford to front the loss of even one shipment so they’re more likely to insist on a person-to-person meet,” said Victor, “As for the last three columns in the sheet, they’re all self-explanatory: the number of times we’ve personally dealt with the supplier, the max coke we can currently buy from them, and the price per kilo.”

I nodded, “Ok, so the first thing we need to do is get this categorical data into something we can analyse. We’re going to convert all the yes/no questions into +1/-1 values, and we’re going to take that transportation method column and split it into dummy variable columns with a 1 in the slot of the transportation method they prefer similarly to how we did it for our LSD trip AI model.”

“Yes, I remember dummy variables,” said Victor tapping his temple with his index finger.

I created a new sheet called “Numeric” and filled it in with the converted data using simple IF() formulas:
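(A minimal Python sketch of the same conversion. The category labels in the usage line are placeholders I made up for illustration; the actual values in Victor's transportation column aren't listed here.)

```python
def yes_no_to_signed(value):
    """Map a 'Y'/'N' answer to +1/-1, like a simple IF() formula."""
    return 1 if value == "Y" else -1


def one_hot(value, categories):
    """Dummy-variable columns: 1 in the slot matching `value`, 0 elsewhere."""
    return [1 if value == category else 0 for category in categories]
```

For example, with the hypothetical categories `["mule", "sub", "car"]`, a supplier who prefers subs becomes `one_hot("sub", ["mule", "sub", "car"])`, with a 1 only in the middle slot.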

“So now the data is numeric, but what do we do with it?” asked Victor.

“Well, we’re going to compute a value for each supplier called their ‘Local Outlier Factor’ or LOF for short,” I said.

“And what will the LOF tell us?” asked Victor.

“The LOF is a single number that when it’s near 1 means ‘Hey, I look like all the suppliers whose data is most similar to mine’ and when it’s far greater than 1 it means ‘Hey, I look unusually different from those suppliers whose data is most similar to mine,’” I answered, “It’s a local outlier detection method as opposed to a global outlier detection method, meaning that while the outlying supplier may not be worlds away from everyone in this data set, he’s odd when compared to his closest peers.”

Victor made a face that said he tentatively accepted what I was saying.

I smiled, “It’s a bit of a journey to get there, but I swear once you see the result, you’ll get it.”

“OK,” said Victor, “So what do we do with this numeric data?”

“Well, the first thing we’re going to do is scale and center it,” I said, “So I’m going to take the trimmed mean and mean absolute deviation of each column to start.”

I added two rows at the bottom of the numeric data.

“The trimmed mean is just like a regular mean except that I’m gonna toss out the lowest and highest 5% of values from each column to prevent any outliers in a single column from skewing the mean. A trimmed mean is kind of like what they use when judging gymnastics meets and they ‘throw out the highest and lowest scores.’”

“Ah,” said Victor, “That makes sense.”

“And the mean absolute deviation is just the average of the absolute values of the differences between each value and the trimmed mean. It’s like variance except we’re using absolute values instead of squared values to minimize the effect of any outliers,” I said.

“OK,” said Victor as he peered at the sheet, “And what do we do with them?”

“Well,” I said and created a new sheet called Scaled, “We’re going to subtract the trimmed mean from each value in the sheet to center the values around 0. Then we’ll divide through by that column’s mean absolute deviation. That’s going to take something like the price column and put it on the same scale as the ‘# of Past Deals’ column, so that no column is going to be numerically more important than any other just because it’s got a big scale.”

I filled in the ‘Scaled’ spreadsheet, showing Victor how, for example, in the Transpo column for the supplier ‘Abraham’ (B2) its new scaled value was just its old numeric value minus the column’s trimmed mean, divided by the column’s mean absolute deviation.
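(Sketched in Python, the centering and scaling might look like the following. The rounding rule for how many values to trim is my assumption and may differ slightly from Excel's, and the function names are hypothetical.)

```python
def trimmed_mean(values, trim_each_side=0.05):
    """Mean after dropping the lowest and highest 5% of values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_each_side)     # values dropped from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def mean_abs_deviation(values, center):
    """Average absolute distance of every value from `center`."""
    return sum(abs(v - center) for v in values) / len(values)


def scale_column(values):
    """Center on the trimmed mean, then divide by the mean absolute deviation.

    Assumes the column is not constant (a zero spread would divide by zero).
    """
    center = trimmed_mean(values)
    spread = mean_abs_deviation(values, center)
    return [(v - center) / spread for v in values]
```

Note that the deviation is measured from the trimmed mean but averaged over all values, matching the description in the dialogue.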

“Now we have all numeric, scaled data,” I said, smiling.

“And we use it to tell suppliers apart?” asked Victor.

“We want to use it to measure the distance between each pair of suppliers,” I said, “And by distance, there’s a lot of different distance metrics out there.”

“Like the cosine similarity we used on my wholesale data?” asked Victor.

“Yes, exactly, although this time let’s keep it simple and just use Euclidean distance,” I said.

“So when measuring the distance between two suppliers, then,” said Victor, “That’s just the square root of the sum of the squared differences of the values from each of the columns?”

“Yep, just like you used to do in elementary school when you calculated the length of the hypotenuse of a triangle as the square root of the height squared plus the length squared,” I said.

I created a new tab in the workbook called “Distances” and filled it with a Suppliers X Suppliers grid.

“Can you explain that Excel formula to me?” asked Victor.

“Sure,” I said and pointed to the top left distance in the spreadsheet, “I’m looking up the scaled data for the supplier for this row using a VLOOKUP and pulling out the numeric columns I want from the previous sheet with that curly braced list. I’m doing the same VLOOKUP for the column. Then I’m taking the difference between the two sets of looked up values, squaring each difference, summing up the list, and taking the square root a la Euclidean distance.”

Victor nodded, “And this is another one of your array formulas?”

“Yes,” I said, “Since we’re using the VLOOKUP to grab a list of values which we’re subtracting from another list, we need to use an array formula in Excel instead of a regular one to do that list operation. And to use an array formula, I just type the formula normally into the cell and then instead of hitting return I hit ‘control/command + shift + return’ to engage it. That’s what added those curly braces around the entire formula.”
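(The array formula boils down to a pairwise Euclidean distance. A short Python sketch, assuming each supplier is a list of scaled values; function names are hypothetical.)

```python
import math


def euclidean(a, b):
    """Square root of the summed squared differences between two rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_matrix(rows):
    """Suppliers x Suppliers grid of pairwise distances."""
    return [[euclidean(a, b) for b in rows] for a in rows]
```

The diagonal of the grid is all zeros, since every supplier is at distance 0 from itself.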

“So now we have a distance between each (row, column) pair of suppliers?” asked Victor.

“Exactly,” I said, “Very similar to the grid we made for the wholesale customer clustering we did a while back. And the next thing we’re going to do is rank them. For a given column, we’re going to replace the distance to the supplier in a row with its rank-order starting with nearest first. Since the self-distance is on the diagonal, we’ll start the rankings at 0 to make it easy to ignore the self-distance.”

I created a new sheet called ‘Rank’ with the rankings. The formula for the top left rank looked like this:

“So you’re just using the RANK() formula to look up a distance and compare it to the rest of the distances on that column?” asked Victor.

“That’s it,” I said and made another sheet called ‘Rank-Inv’ which was a transpose of the rankings so they were row-oriented.
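(A Python sketch of the ranking step for one column of distances. One difference to flag: as far as I know, Excel's RANK gives tied values the same rank, while this sketch breaks ties by position, so the two can disagree when distances are exactly equal.)

```python
def rank_column(distances):
    """Rank distances nearest-first, starting at 0 for the smallest.

    The self-distance (0) normally lands at rank 0, which makes it easy
    to ignore. Ties are broken by position, an assumption of this sketch.
    """
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    ranks = [0] * len(distances)
    for rank, index in enumerate(order):
        ranks[index] = rank
    return ranks
```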

“Now,” I said, “The reason this approach is called Local Outlier Factors is that it compares how densely points are gathered in your local neighborhood to how densely they’re gathered in your neighbors’ neighborhoods. If your neighborhood is a bit watery compared to the neighborhoods around you, then you’re an outlier. The people you consider your friends don’t consider you much of a friend.”

“So how big is a neighborhood?” asked Victor.

“Well,” I said, “The size of the neighborhood is the only parameter we need to choose when calculating LOFs.”

“And what do we set it to?” asked Victor.

I shrugged, “Let’s start with a neighborhood of 5 points. The algorithm is fairly robust to changes in the size of K, but 5 is a good place to begin. And we’re going to create a sheet of ‘reach distances’ where if you’re in my neighborhood of 5 closest points, then your reach distance with respect to me is the distance to the perimeter of the neighborhood, but if you’re outside of my neighborhood, then the reach distance to you is our actual distance.”

Victor looked confused.

“Think of it like Lord of the Rings. If I’m the King of Gondor and you’re in my kingdom, i.e. Gondor, then the reach distance to you is the city walls. But if you’re not in my walled city, if you’re in say, Minas Tirith, then the reach distance between you and me is just our actual distance.”

“I hated that movie,” said Victor, “but I think I’m following you.”

“How can you hate Lord of the Rings?” I asked.

“I liked all the evil creatures,” he said, “But the nice ones, the elves and the like, they were silly and poorly done.”

An international criminal would think that, wouldn’t he, I thought. But I said nothing.

I added a sheet called ‘Reach-dist,’ filled in ‘5’ as the size of the neighborhood, and then calculated the distance to the fifth and furthest point within each supplier’s neighborhood, called the ‘K-distance.’

Victor leaned in to read the formula, “So you’re going over to the Rank sheet, finding the rank that’s equal to 5, setting that value equal to 1 and the rest to 0, and then multiplying that unit vector with the column’s distance vector before summing that up. All that really does is just look up the fifth ranked distance in the end.”

“That’s right,” I said, “It’s just a ghetto-hacked lookup across the ‘Rank’ and ‘Distances’ sheets. And to do that array calculation, we have to make the whole thing into an array formula.”

He nodded, “That makes sense.”
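The indicator-vector trick Victor describes translates almost word for word into Python. This is a sketch with hypothetical distances and ranks, and it assumes every rank in the column is unique (a tie at rank K would add two distances together).

```python
def k_distance(distances_col, ranks_col, k):
    """Pick out the distance whose rank equals k, spreadsheet-style."""
    # (rank == k) acts as the 1/0 indicator; multiply by the distance and sum.
    return sum((ranks_col[name] == k) * distances_col[name] for name in distances_col)

# Hypothetical column for supplier "A": distances and their 0-based ranks.
distances_a = {"A": 0.0, "B": 5.0, "C": 10.0, "D": 2.0}
ranks_a = {"A": 0, "D": 1, "B": 2, "C": 3}

print(k_distance(distances_a, ranks_a, 2))  # 5.0
```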

“OK,” I said, “So now we’re going to copy in all the distances from the Distances sheet, except if you’re in my neighborhood, I’m going to replace the distance with the distance to my neighborhood’s edge, the K-distance.”

Here’s what the sheet looked like where the MAX function is used to pick between the neighborhood’s edge and the distance to the supplier:
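In code, the same MAX logic might look like the sketch below. The four suppliers sit at made-up positions on a number line so the distances are easy to check by hand, and the neighborhood size is 2 rather than 5 to keep the example small.

```python
def reach_distances(distances, k):
    """reach[me][you] = max(my K-distance, actual distance between us)."""
    reach = {}
    for me, column in distances.items():
        k_dist = sorted(column.values())[k]  # position 0 is my self-distance
        reach[me] = {you: max(k_dist, d) for you, d in column.items() if you != me}
    return reach

# Hypothetical suppliers placed on a line; D sits far from the rest.
positions = {"A": 0.0, "B": 1.0, "C": 2.0, "D": 10.0}
distances = {a: {b: abs(pa - pb) for b, pb in positions.items()}
             for a, pa in positions.items()}

reach = reach_distances(distances, k=2)
print(reach["A"])  # {'B': 2.0, 'C': 2.0, 'D': 10.0}
```

B and C are inside A's neighborhood, so both get pushed out to the neighborhood's edge at 2.0, while D keeps its actual distance.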

“And now that we have all our Reach Distances set, we’re ready to finish this calculation up,” I said, “We know that the reach distance to my five neighbors with respect to me is just the distance to the fifth and farthest in the neighborhood. But what’s my reach distance with respect to them? If their neighborhood is smaller than mine, i.e. their neighbors are closer to them than mine to me, then while they’re inside my ‘city walls,’ I’m outside theirs.”

“We’re going to calculate the average reachability of each supplier with respect to their five neighbors,” I continued, “To do that we just read across the rows of the Reach Distance sheet, grabbing the Reach Distances from the five closest points using the ranking information from the Rank-Inv sheet that has row-oriented rankings on it.”

I created a new sheet called ‘LOF,’ pasted the names of the suppliers down the first column, and filled in the formula for the average reachability distance to ‘Abraham’ and showed it to Victor.

“Once again, the lookup of rows of data between sheets makes it an array formula,” Victor said, reading over the formula, “And you’re dividing the sum by 5 to make it an average.”

I dragged down the calculation and the sheet looked like this:

“So now that we have the average reachability of each supplier with respect to their neighbors, if we invert that value what we’re left with is basically a density of the area around you,” I said, “We’ll call it ‘Local Outlier Density.’”

I inverted the Average Reachability column and dragged the density values down.

“Then, to finish this up,” I added, “The Local Outlier Factor for a supplier is just the average of the ratios of my 5 nearest neighbors’ densities to my own. If we’re all about the same density, then the value will tend toward 1, and I’m locally ordinary. If they’re in regions more dense than mine, then the ratios of their densities over my density will climb. That means that I consider them neighbors way more than they consider me a neighbor. I’m like that kid growing up who lived on a farm just outside of town. I’ve got friends in town who go to my school, but they’re better friends with each other than poor old me.”

I filled in the Local Outlier Factor formula which required that I look up the densities of my five nearest neighbors, divide each by the density of the point at hand, and average those ratios. The whole lookup was once again an array formula.
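Pulling the whole chain together (distances, neighborhoods, K-distances, average reachability, density, and the final ratio), a compact Python sketch could look like this. The points are hypothetical, the function name is my own, and the sketch assumes no two points are identical, since duplicates would produce zero distances and a divide-by-zero in the density step.

```python
import math

def local_outlier_factors(points, k):
    names = list(points)
    dist = {a: {b: math.dist(points[a], points[b]) for b in names} for a in names}
    # k nearest neighbors of each point, skipping the self-distance in position 0
    neighbors = {a: sorted(names, key=dist[a].get)[1:k + 1] for a in names}
    k_dist = {a: dist[a][neighbors[a][-1]] for a in names}
    density = {}
    for a in names:
        # My reach distance with respect to neighbor b: b's K-distance if I'm
        # inside b's neighborhood, otherwise our actual distance.
        avg_reach = sum(max(k_dist[b], dist[a][b]) for b in neighbors[a]) / k
        density[a] = 1 / avg_reach
    # Average ratio of my neighbors' densities to my own.
    return {a: sum(density[b] / density[a] for b in neighbors[a]) / k for a in names}

# Four points in a tight square plus one loner far away (made-up data).
points = {"A": (0, 0), "B": (0, 1), "C": (1, 0), "D": (1, 1), "E": (5, 5)}
lof = local_outlier_factors(points, k=3)
print(max(lof, key=lof.get))  # E
```

The four corner points all score 1 because their neighborhoods are equally dense, while the loner's factor climbs well above 2.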

I tossed some conditional formatting on the column and rubbed my hands together.

“All we have to do now is look for the largest factor. That’s the point whose neighbors don’t really consider him a neighbor. The point is like Edward Scissorhands up in that castle at the edge of town,” I said.

I scrolled down the list. Near the bottom, I stopped.

“Whoa, check out this Zhenli guy,” I said, “He’s the only one with an outlier factor over two, so reachability is way out of whack with those points nearest him.”

“Hmmm,” Victor said, “I’ve not dealt with this supplier, but I’ve heard of him. Up and coming large supplier out of the Pacific. Flip back to the raw data.”

I flipped back to the first sheet and scrolled down to the same supplier:

Victor nodded, “OK, he’s a big player. Operates his own narco sub, counterintelligence, has a criminal record. Even Margalo vouched for him, which isn’t nothing. But the other two don’t know him.”

Then Victor fell silent, “But look at his price.”

“Is it abnormally low?” I asked.

“Not for a small player, but for someone operating transpo and counter-intel, it’s very cheap. And for good quantities too,” he said.

“And he demands that you meet in person,” I said.

“Yes,” said Victor, “Also rare for a big player. I don’t like it…maybe a trap.”

Victor suddenly stood and pulled out his phone, “I need to make a call.”

He walked some steps away where his voice was drowned out by the sound of the children, rides, and games. I feared that I’d just royally screwed the DEA. I needed to warn them. I couldn’t make out the words on Victor’s lips, but his left hand was clenched into a fist and his eyes had a far-away look as he spoke.

He hung up the phone and returned to the bench.

“We’ll see if this man is a problem or not,” said Victor.

“How are you going to do that?” I asked.

Victor just smiled, “I have ways of dealing with such matters.”

I didn’t contact Agent Bestroff until I was back in my car and on I-20 speeding back toward Atlanta, safe from prying eyes and ears.

“Mr. Bestroff,” I said into my speakerphone, “I’m not sure if you’re working with a coke supplier named Zhenli Ye Gon, but if you are, Victor’s on to your honeypot.”

The other end of the phone was silent for a moment.

“Shit,” was all that finally came over the line.

“And to make matters worse,” I said, “You’ve got a mole inside the DEA. At least that’s where Victor says he got the intel. I sure hope you haven’t been using my real name in any of your interoffice communications.”

More silence.

“I’ll call you back,” was all Bestroff said, and he rang off.

That didn’t inspire confidence.

“I’m so screwed,” I said quietly to myself, and I sped back toward the frat house.

At MailChimp most of our abuse is conducted not by malicious folks but by unthinking customers. If I download the stolen list of email addresses from the Sony hack and send to it regarding my plumbing company, sure that’s illegal and malicious, but it’s not adversarial per se. Joe the plumber is not trying to hide from me — he’s just being an idiot. Most of the abuse I see in my job does not evolve over time.

In that case, training an AI model like we did in an earlier post, using labelled abusers to find future abusers, is possible. But what if your abusers are adversarial? They’re trying to hide, perhaps blend in. They change their behavior with each intrusion.

In this case, a supervised AI model ain’t gonna do you as much good. That’s where outlier detection techniques like LOF come in. You don’t know a priori what an outlier looks like, but local outlier factors give you a value that helps you prioritize cases to investigate. The example above is somewhat artificial, but you can envision situations where you’ve got HTTP requests and timestamps in a vector from your website and maybe you want to use them to detect aberrant behavior. LOF might be the way to go there. How does a user’s HTTP request vector line him up with his neighbors?

Now, that said, once you have distance data, you can do all sorts of fun stuff with it. You could construct a trimmed kNN graph like we did in our wholesale graphing discussion. Just applying a layout algorithm on the graph using 1 / distance as the weight on the edges might show you something visually. In 2004, Hautamäki et al. released a paper arguing that looking at the indegree of each node on the graph would let you know who’s an outlier.
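To give a feel for the indegree idea, here is a minimal sketch (my own illustration, not code from that paper): every point draws an edge to each of its k nearest neighbors, and we count how many edges point back at each node. The positions are made up.

```python
def knn_indegree(distances, k):
    """Count how many other points list each point among their k nearest."""
    indegree = {name: 0 for name in distances}
    for me, column in distances.items():
        nearest = sorted((n for n in column if n != me), key=column.get)[:k]
        for n in nearest:
            indegree[n] += 1
    return indegree

# Hypothetical points on a line; D sits far from everyone else.
positions = {"A": 0.0, "B": 1.0, "C": 2.0, "D": 10.0}
distances = {a: {b: abs(pa - pb) for b, pb in positions.items()}
             for a, pa in positions.items()}

indegree = knn_indegree(distances, k=2)
print(indegree["D"])  # 0
```

Nobody picks D as a neighbor, so its indegree of zero flags it for a closer look.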

It’s been three days since I last saw Victor, and I’ve been lying low ever since. I know that if he wanted me dead, I’d be dead already. But I can’t help it. I’m freaked. Screw the DEA.

It was Tuesday night when Andre whisked me away to the freaking Marietta Dave & Busters. I couldn’t believe it. Victor was just sitting there at a table in the midst of the arcade games. Around him kids and teens ran from game to game. The din and the lights were a constant aggravation, but there he sat in the middle of it, unbothered, like some drug-dealing Buddhist monk with a laptop. I nearly burst into laughter at the sight of him, but his expression made me rethink my levity.

I took a seat next to him and immediately felt something hard and cold press against my leg. I went weak in the knees. I knew I shouldn’t have cooperated with the DEA and installed that bug on Victor’s laptop. Now I was good and dead.

Victor just smiled at me. His face betrayed none of the situation below the table.

“Tell me, Alex,” he said, “Have you been talking to anyone about our little tutorials?”

He ran a hand along his laptop, and for the first time I noticed that it wasn’t the same laptop as before. It was a small MacBook Air. And all the USB ports had little rubber nubs in them to prevent drives from being inserted. That last bit had me really and truly panicked.

“N-No,” I stammered, “Why would I do that? I’m not exactly innocent myself.”

He stared at me for a time. Suddenly, a Skee-Ball machine blared a loud noise, and I jumped out of my chair. I thought Victor had pulled the trigger.

Victor just burst into laughter and tucked the gun back into a holster on his ankle.

“You sure are a jumpy one, Alex,” he said with a grin.

My heart was beating like a machine gun, and I felt nauseous. Victor waved me back into a seat.

“I’m not used to this,” I said, “I-I’m not familiar with this sort of world.”

Victor just kept smiling.

“And keep it that way,” he said, “I’m not interested in you learning this world. I’m interested in learning the techniques from yours. Then I pay you. Then you leave and say nothing. Understood?”

“Understood,” I said, half relieved that I wasn’t dead, half despondent that I was dealing with Victor instead of playing games like the kids around me. I was rethinking all my bad choices in a big way.

“You want something to eat before we get started?” asked Victor, “Chicken fingers, potato skins, and other vile stuff?”

I forced a smile, “I’m all set. Why don’t we just dive in?”

Victor slapped me on the leg, “Good, my boy. That’s what I like to hear. Especially today, because today I have all sorts of troubles.”

“What’s going on?” I asked.

“My meth supplier wants me to submit orders for the entire next year by month. In the past, I could just order everything the month before, but he says the feds are making life hard for him, so he needs more time to plan,” said Victor, “That means that I have to forecast demand for each city I sell in for all of 2013 at the monthly level.”

“OK, well, that’s not so bad,” I said, “What have you tried so far?”

Victor opened his computer and pushed the screen my way.

“Here are the demand numbers in kilos for Flint, Michigan for the last five years,” he said.

“So I just graphed it, fitted a line to it, and I’m going to use that trend line for projections,” he said and showed me the graph.

I nodded, “OK, and that’d work fine if you were just doing a yearly projection for total demand, but your supplier wants demand projections by month, right?”

“Exactly,” said Victor, “And I know that line is going to be terrible for that. I already know that demand peaks in the summer and lags in the winter in Flint, but that line isn’t going to account for it. Hell, I was thinking about just sketching out the shape of the next year. Just using my eyeballs to do the projection.”

I laughed, “Well, you wouldn’t be wrong in noting that visual inspection is one of the best ways to identify unknown patterns, but I think in this case we can do better than that. Plus, I’m guessing Flint isn’t the only city you’re going to have to do these projections for.”

“No,” said Victor, “and in fact, it’s not the only product either. I hear that some of my other suppliers are moving in this direction.”

“Well, in that case, I’m going to show you a really simple forecasting technique called Triple Exponential Smoothing or Holt-Winters Exponential Smoothing that is perfect for this problem,” I said.

I created a new tab in Victor’s Excel workbook called “HW Exp Smoothing,” and leaving the first couple of rows blank, I pasted the demand data in on row 3. I also added a fake year before 2008 called “initial” so that my data ended up looking like this:

“Now, let me explain a little bit about how Triple Exponential Smoothing works. The ‘Triple’ refers to the fact that we’re going to be more or less splitting this forecast into three components. We have **seasonal adjustment factors** which scoot the forecast up and down by month based on the historical data. So we’d expect the December factor to bump the forecast down and the July factor to bump it up. Then we have the **trend** component which helps us project out without considering the seasonal piece. This captures whether your demand is generally growing or shrinking. We also have what’s called the **level** which you can think of as the constant part of the forecast once we strip out a trend and seasonal variation. It’s a baseline.”

“Ok, so how do we start to break those out?” asked Victor.

“Well, I’m going to start in our fictional time just before the beginning of 2008, and the first thing I’m going to do is initialize the level as the demand from January 2008,” I said, “I’m also going to initialize the trend as the average of monthly increases in demand using same-month pairs from 2008 and 2009 to avoid seasonal wonkiness. So for instance, demand increased by .43 kg a month if I look at January 2009 versus January 2008. I’m going to average those across all 12 pairs in ’08 and ’09.”
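The two initial values can be sketched in Python as follows. The demand numbers are invented so the answer is easy to verify by eye, and the function name is just for illustration.

```python
def initial_level_and_trend(first_year, second_year):
    """Level starts at the first January; trend averages same-month increases."""
    level = first_year[0]
    # Each same-month pair is a year apart, so divide by 12 for a monthly rate.
    monthly_increases = [(b - a) / 12 for a, b in zip(first_year, second_year)]
    trend = sum(monthly_increases) / len(monthly_increases)
    return level, trend

# Hypothetical demand in kilos: every month is 12 kg higher a year later.
demand_2008 = [10 + m for m in range(12)]
demand_2009 = [22 + m for m in range(12)]

level, trend = initial_level_and_trend(demand_2008, demand_2009)
print(level, trend)  # 10 1.0
```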

I added columns to the sheet for level and trend and used this array formula to initialize the trend:

When I’d set the two initial values, the spreadsheet now looked like this:

“But what about the seasonal adjustment?” asked Victor.

“Well, for that we actually need 12 initial values. One for each month we’re adjusting. And this is where things get slightly more funky. The first thing I’m going to do is take the average monthly demand for each year in your data and calculate how each month in that year deviates from the average. For instance, January 2008 here had 81% of the average monthly demand for 2008,” I said.

“Ah,” said Victor, “If January is 81% of the average monthly demand, then I could use that value to adjust.”

“Yes,” I said, smiling, “But we’ve got five years worth of data, so we’re going to average these values across all five years to get our adjustment factors.”

I added an “Average Monthly Demand for Year” column into the sheet and calculated the value for each year. I then added the month-by-month variations from the average monthly demand in the next column.

I then added a column for my seasonal adjustment factors where I initialized them as averages of the values from the previous column, using an AVERAGEIF() to make sure I only averaged the correct months.
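A rough Python version of those two columns plus the AVERAGEIF() step is sketched below. The data is made up: two years with an identical seasonal shape, so the averaged factors simply reproduce that shape.

```python
def initial_seasonal_factors(demand_by_year):
    """Average each month's share of its year's average monthly demand."""
    ratios_by_year = []
    for year in demand_by_year:
        yearly_avg = sum(year) / len(year)
        ratios_by_year.append([d / yearly_avg for d in year])
    n_years = len(ratios_by_year)
    n_months = len(demand_by_year[0])
    return [sum(r[m] for r in ratios_by_year) / n_years for m in range(n_months)]

# Hypothetical data: the second year is double the first, same seasonal shape.
year_1 = [8, 10, 12, 10] * 3
year_2 = [16, 20, 24, 20] * 3

factors = initial_seasonal_factors([year_1, year_2])
print(round(factors[0], 2))  # 0.8
```

A factor of 0.8 says that month runs at 80% of a typical month, the same reading as the 81% January figure above.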

I clapped my hands together and rubbed them like I was heating them over a fire, “Now, we’ve got a starting place for these factors.”

“But what do we do with them?” asked Victor.

“Well, we’re going to roll them over the entire horizon of data here, refining them as each month ticks by. We’re going to take some percentage of the factor from the previous month combined with some percentage of a factor calculated from the data at the current month until we reach the very end of our data. Then we’ll have our final estimates for level, trend, and seasonal values,” I said.

Victor looked confused, “What do you mean you’re going to take some percentage of the value from the previous period?”

“Good question,” I said, “I’ll back up a bit. This is where this technique gets the term ‘smoothing’ from. I’m going to introduce three arbitrary terms called smoothing factors. One for level, another for trend, and a third for seasonal. And initially I’m just going to set them at 50%, so as we roll over the data, we’re going to take 50% of the previous estimate for level and 50% of the estimate dictated by the current month’s data, yeah?”

I added in the smoothing factors at the top of the sheet.

“So to see how this works,” I said, “Let me go ahead and roll one month down into January 2008. The first thing we’re going to do is set the January 2008 value of level. Since our level smoothing factor is 50%, we’re going to take 50% of the previous level value plus one month of trend and we’re going to combine it with 50% of the current month’s demand adjusted using the most recent seasonal factor we have for January.”

I added the value to the sheet:

“As for trend, that’s just 50% of the previous trend value and 50% of the difference between the current level value and the previous one,” I said and added the value to the sheet.

“And the new seasonal variation for January is just 50% the previous seasonal variation and 50% January 2008 over the current level,” I continued.
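The three updates for a single month can be written as one small Python function. This sketch assumes the multiplicative form of Holt-Winters, where “adjusting” the demand means dividing it by the seasonal factor. The argument names and the numbers in the example are hypothetical.

```python
def holt_winters_step(demand, level, trend, seasonal, alpha, beta, gamma):
    """Roll level, trend, and one seasonal factor forward by one month."""
    # Blend deseasonalized demand with the previous level plus a month of trend.
    new_level = alpha * (demand / seasonal) + (1 - alpha) * (level + trend)
    # Blend the latest change in level with the previous trend.
    new_trend = beta * (new_level - level) + (1 - beta) * trend
    # Blend demand over the current level with the previous seasonal factor.
    new_seasonal = gamma * (demand / new_level) + (1 - gamma) * seasonal
    return new_level, new_trend, new_seasonal

# Hypothetical month that lands exactly where the old estimates expected.
new_level, new_trend, new_seasonal = holt_winters_step(
    demand=40.0, level=48.0, trend=2.0, seasonal=0.8,
    alpha=0.5, beta=0.5, gamma=0.5,
)
```

When a month comes in exactly as expected, as in this example, the level moves up by one month of trend while the trend and seasonal factor stay put. Surprises in the data are what pull the estimates around.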

“And here’s the cool part,” I said, “Now that we have these first smoothed values set, we just copy them all the way down the sheet.”

“But what do we do with them?” asked Victor.

I smiled, “Now we’re ready to project into the future. And it’s super easy.”

I added some rows on the sheet for 2013.

“Let’s take June 2013 for example,” I said, “We can forecast that as the final estimate we have for the level plus 6 months of trend since we ended December 2012. Then we take that value and adjust it for June. Bam, we have a forecast.”
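As a one-line Python sketch, again assuming the multiplicative form, the projection is level plus months of trend, multiplied by the seasonal factor. The final level, trend, and June factor below are invented values, not the ones from the sheet.

```python
def forecast(final_level, final_trend, seasonal_factor, months_ahead):
    """Project the level forward by trend, then apply the month's factor."""
    return (final_level + months_ahead * final_trend) * seasonal_factor

# Hypothetical final estimates as of December 2012, projecting June 2013.
june_2013 = forecast(final_level=60.0, final_trend=0.5,
                     seasonal_factor=1.25, months_ahead=6)
print(june_2013)  # 78.75
```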

Victor looked pleased, “And that’s it? I can use these numbers as my forecast?”

I shrugged, “You could. They wouldn’t be terrible. But that 50/50 split we’ve been doing with the smoothing factors, we really should change that.”

“To what?” asked Victor.

“Well, honestly, I don’t know. How much do you let past data influence you, and how much does your current period’s data matter? That varies from forecast to forecast. But there is a way we can have the computer figure it out for us,” I said.

“How’s that?” asked Victor.

“Well, we can forecast 2012 as if we were sitting at the end of 2011, compare the forecast with actuals, and find the smoothing values which do the best. In this way, we’re training the model like you would any machine learning model,” I said.

“Ah, OK, let’s do it,” he said.

I forecasted 2012 (below in yellow), calculated the absolute percent error (APE) between the forecast and actual for each month, and then averaged the APEs at the top of the sheet to get a Mean Absolute Percentage Error (MAPE).

“Now, all I have to do is find the values of my smoothing factors between 0 and 1 that minimize my MAPE,” I said.

“So you use Solver?” asked Victor.

I laughed, “You’re catching on to my love of Solver, Victor. Absolutely.”

I opened up Solver and set up the problem, “We’re going to minimize the MAPE where our decisions are to change the three smoothing factors subject to keeping them between 0 and 1. And this time let’s select the evolutionary solver from the menu since this problem is so nonlinear and ugly.”

I pressed solve and let the thing roll. The MAPE dropped over various trial solutions until finally it stopped at just 2.6%.

“2.6%. That’s a 10x improvement on the MAPE we had before Solver ran,” I said.
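For readers without Solver, the same idea can be sketched in Python: hold out the last year, score each candidate set of smoothing factors by MAPE, and keep the best. To be clear about the substitution, this uses a coarse grid search rather than Excel's evolutionary solver, a 4-period “year” instead of 12 months to keep the made-up data short, and the multiplicative form of Holt-Winters. It also assumes the history covers whole seasonal cycles.

```python
import itertools

def mape(actuals, forecasts):
    """Mean absolute percentage error as a fraction (0.026 means 2.6%)."""
    return sum(abs(a - f) / a for a, f in zip(actuals, forecasts)) / len(actuals)

def fit_and_forecast(history, period, horizon, alpha, beta, gamma):
    """Smooth over the history, then project `horizon` periods ahead."""
    first, second = history[:period], history[period:2 * period]
    level = first[0]
    trend = sum((b - a) / period for a, b in zip(first, second)) / period
    n_cycles = len(history) // period
    seasonal = []
    for m in range(period):
        ratios = []
        for c in range(n_cycles):
            cycle = history[c * period:(c + 1) * period]
            ratios.append(cycle[m] / (sum(cycle) / period))
        seasonal.append(sum(ratios) / n_cycles)
    for t, demand in enumerate(history):
        m = t % period
        prev_level = level
        level = alpha * demand / seasonal[m] + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonal[m] = gamma * demand / level + (1 - gamma) * seasonal[m]
    n = len(history)
    return [(level + h * trend) * seasonal[(n + h - 1) % period]
            for h in range(1, horizon + 1)]

def grid_search(train, holdout, period, steps=4):
    """Try every combination of factors on a grid from 0 to 1; keep the best."""
    grid = [i / steps for i in range(steps + 1)]
    best = None
    for a, b, g in itertools.product(grid, repeat=3):
        err = mape(holdout, fit_and_forecast(train, period, len(holdout), a, b, g))
        if best is None or err < best[0]:
            best = (err, a, b, g)
    return best

# Hypothetical demand: steady growth times a repeating 4-period seasonal shape.
shape = [0.8, 1.0, 1.2, 1.0]
series = [(100 + 2 * t) * shape[t % 4] for t in range(16)]
train, holdout = series[:12], series[12:]

baseline = mape(holdout, fit_and_forecast(train, 4, 4, 0.5, 0.5, 0.5))
best_err, alpha, beta, gamma = grid_search(train, holdout, period=4)
print(best_err <= baseline)  # True
```

Because the 50/50/50 starting point sits on the grid, the search can never do worse than it. A finer grid or a real optimizer would get closer to what Solver finds.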

“Awesome,” said Victor, “And now I have a final forecast?”

“Yes,” I said, “you’re done. Let’s graph it real quick.”

I popped the values into a chart so we could look at them.

“Looks great!” said Victor, and then he paused, “but wait, what about the price adjustments we talked about putting in place in Flint?”

“Ah,” I said, “You’re right! We haven’t accounted for the fact that you moved your pricing strategy from an average of 5% less than your competitors to 8% more than your competitors. That means that this historical data is giving us an over-estimate for your newly raised prices. We’ll need to shift it down.”

“But by how much?” asked Victor.

“Well, if you’ll remember, we found in that previous conversation that for every 1% you increased prices over the competition, you lost .42 kilograms of demand for the month,” I said and added a “Price Sensitive Forecast” column to the sheet. Moving from 5% below the competition to 8% above it was a 13-point swing, so I took the existing forecast and subtracted from it the 5.46 kg (13 × .42) that our demand model anticipated forfeiting.

“There,” I said, “Done.”
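The arithmetic behind that column is small enough to spell out in Python. The .42 kg per point and the move from 5% below to 8% above come from the conversation above; the monthly forecast values are placeholders.

```python
def price_sensitive_forecast(forecast, old_premium_pct, new_premium_pct, kg_lost_per_pct):
    """Shift every month down by the demand lost to the price change."""
    shift = (new_premium_pct - old_premium_pct) * kg_lost_per_pct
    return [f - shift for f in forecast]

# Placeholder monthly forecasts in kilos (not the values from the sheet).
base_forecast = [70.0, 72.5, 75.0]

adjusted = price_sensitive_forecast(base_forecast, old_premium_pct=-5,
                                    new_premium_pct=8, kg_lost_per_pct=0.42)
print([round(a, 2) for a in adjusted])  # [64.54, 67.04, 69.54]
```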

Victor clapped me on the back and shook my hand before sliding me a massive wad of bills.

“Don’t spend it all on one video game, OK?” he said, smiling.

“And remember what we talked about,” he added with a twinkle in his eye, “If you talk about these conversations with anyone, you’ll need to adjust your forecasted lifespan a great deal lower.”

I nodded and stuttered to respond, but Victor had already turned away to flag down a server.

“I want more of those jalapeño poppers,” he muttered to himself.

A lot of software packages these days come with Holt-Winters built in. It’s an old method but extremely accurate for how simple it is, oftentimes beating ARIMA models in actual business settings. I wouldn’t try anything more complex than HW until I’d given it a whirl first as a baseline for comparison.

For more reading on forecasting, I’d recommend anything by Rob Hyndman.

The price-sensitive forecasting at the end of this post is simple, but believe it or not, it’s not too far from how many Fortune 500 companies combine price elasticity models and forecasts to create price optimization models.

This post, as written, is decidedly “small data.” Indeed, we don’t get into big data until we start thinking about someone like Amazon where you may need to forecast demand across zillions of SKUs. Then you’ve got a nightmare on your hands.

Luckily, that problem is highly parallelizable. And one of the reasons why I prefer exponential smoothing to something more exotic (like machine learning with exogenous variables included) is that if you’ve got this stuff in a highly automated system across scads of products, you’re not going to necessarily notice when the forecast gets screwed up. I prefer to avoid constant babysitting and retraining. For me, Holt-Winters strikes the right balance of simplicity and effectiveness that many businesses want when they talk about “set it and forget it.”

Hey folks. I’m going to take a brief break from the narrative to talk to you directly about data science as a discipline. There’s a lot of noise floating around about how data scientists are the sexy saviors of the world. Well, we’re not. At least, anyone who’s ever seen my hairy, white thighs knows I’m not sexy. I won’t speak for the rest of my data science colleagues, but few look like this:

When trying to be sexy, we look more like this:

So whether you actually are a so-called data scientist or are looking to hire one or become one, let’s have a little quiet time for reflection. Having been in this field for a while, allow me to pontificate uncontrollably:

1) **You are not the most important function of your organization.** This is a toughie. I do love me some me.

But consider the airline industry. They’ve been doing big data-ish analytics for a long while. For example, they’ve been engaging in revenue management and yield management since before you were a twinkle in yo’ mama’s eye. They’ve used all their data to squeeze that last nickel out of you for that seat you can barely fit in. It’s a huge win for mathematics.

But you know what? The most important part of their business is flying. The products and services an organization sells matter more than the big data models that tack on pennies to those dollars. Your goals should be things like using data to facilitate better targeting, forecasting, pricing, decision-making, reporting, compliance, etc. In other words, working with the rest of your organization **to do better,** not to do data science for its own sake.

2) **Leave the complexity at the door.** A long time ago in a galaxy somewhere near Cambridge, MA, I helped build a supply chain optimization model for a Fortune 500 company. The model used the tangents of an integral over a truncated normal distribution as a function of the mean as linear upper bounds for embedding a probabilistic model inside a Mixed-Integer Program. Sound complicated? It was. And it worked. So long as we baby-sat the fuck out of it and fed it a loving diet of accurate standard deviations of demand forecasts.

The model was dead on arrival, because it was too complex. The same thing happened to the Netflix prize winners. Your model is not the goal; your job is not a Kaggle competition (unless you work at Kaggle or something). Sustainable, repeatable business improvement is the goal, and on-going effort to use your complex big data product can be at odds with this goal if you’re not sensible in your design. If that means using a regular mixed-integer program instead of a probabilistic one, do that. Because this isn’t about proving how smart you are. Do you want to be the only one who can run your model? That’s the most boring kind of job security I can imagine.

3) **Much of what you do is marketing.** Big Data is a hot topic right now, and company leaders are dying to show off how they’re using their data. Especially if the company is going public or being acquired. There’s money to be made off of doing cool shit with data. What do you think LinkedIn’s InMaps product was? Did you play with it? Maybe. Was it useful? Errrm, not really. Was it cool? Fuck yeah. Did it impress investors and launch DJ Patil’s career? Ah, there you go.

Go ahead and do the marketing. It’s good for your business. But don’t forget that it’s just you being all “sexy data scientist” again. In certain cases, it may be your responsibility to steer the organization back toward something that delivers real value but uses less Gephi.

Tip: If you’re making Gephi graphs out of tweets, you’re probably doing more data science marketing than data science analytics. And stop it. Please. I can’t take any more. To paraphrase the Gospel of Mark, what does it gain a man to have graphs of tweets and do jack for analysis with them? I thought we were doing science, not our best impression of something that belongs in MoMA.

**4) The tools are wagging the analytics right now.** I can get a little hadoop added to my latte at Starbucks now. hive, pig, mongodb, riak, redis. Rub a little cassandra on it and lay down on the couchdb for an hour. The money to be made is in selling tools and services around the tools. Doing the actual nitty-gritty of big data analytics, that’s secondary. So everyone’s talking up their tool set, and you may have a great setup, but it’s not about the tools or how big your dataset is, it’s about how you use them. I’ve seen better analytics done with ten megs of data in a pivot table than what most seem to be doing with their petabytes of unstructured garbage.

It’s great to have a shit-ton of data, but find a way to use it to actually make money. Not draw graphs of tweets. Draw graphs of emails instead.

**5) Data scientist is a poor term. Communication and creativity are most important.** Whoever chose the term data scientist has downplayed what’s most important about this job. A data scientist needs to be someone who can bridge the gap between complex analytics on large data sets and the dreams of company leadership. A data scientist needs to be creative about identifying ways that data can solve company problems. And if the data’s not collected yet to solve a problem? They need to figure out how to get it. Here’s a breakdown of how I spend my time:

-1/3 talking with others and figuring out how we can use our data to solve problems

-1/3 up to my elbows in uggo data cleaning and prepping to solve a problem

-1/3 hunting down data that’s logged in some strange way, plying developers with drinks to get the data moved into our big data store in the appropriate way for modeling

-.00000001% actual training of models.

This is why Kaggle is bullshit. It’s like focusing on the cherry on your milkshake. Who gives two shits about the cherry? You need the rest of the milkshake.

“Getting some early morning air?” Andre asked as I returned to the house from my meeting with Agent Bestroff.

Was he suspicious? I don’t know. I couldn’t tell because I’d never paid attention before. I’d never had to worry about Andre perceiving me as some kind of traitor.

“Rough night,” I said, “Don’t ask.”

Andre cracked a smile, “I hear you. Now go do your make-up, because Victor wants to talk.”

“Geez,” I muttered under my breath.

“Huh?” asked Andre.

My palms were sweaty, and I fingered the USB drive in my pocket that the DEA agent had given me. I hadn’t even had time to collect my thoughts about this, and it was already happening.

“Can this wait?” I asked.

“No, son,” Andre said, “It can’t. And you get paid for it to not, so let’s get moving.”

Great, I thought. I’m dead. I’m so dead. That thought continued right up until I sat down with Victor half an hour later.

We met at the restaurant at the Druid Hills Golf Club. How Victor had become a member at this stodgy hellhole was beyond me. But there he was digging into a plate of huevos rancheros surrounded by a bunch of people whose lives seemed to consist of banalities like Hilton Head, Botox, college football, and marital affairs. Not that I don’t love me some college football, but it was apparent that these people had no idea who they shared their dining room with.

“Alex, my boy,” Victor said with a huge smile, “So glad Andre found you. He’d been waiting at your house for some time. Where were you?”

Did I see a hint of suspicion in Victor’s eyes?

“I was already up,” I shrugged, “Decided to take a walk around town.”

He stared at me for a moment, deep in thought. Then suddenly he crumpled up the cloth napkin in his lap and tossed it onto his plate before pushing his food aside.

“Well, let’s get started then,” he said, “No time like the present.”

He pulled out his laptop and some scrap paper from his briefcase and set it on the dining room table. You had to admire the man’s audacity. There seemed to be no place he wouldn’t talk about dealing drugs.

“What’s on the agenda for today?” I asked.

“Acid,” said Victor, “Specifically, bad trips.”

“What about bad trips?” I asked.

“I make a stellar product,” Victor replied, “I pride myself on the quality of all my products; my LSD is some of the best out there. The problem is that it’s very potent. If it could travel back in time, this acid would have killed all the hippies.”

The man started laughing loudly at the thought, and some old folks in the dining room turned to look.

Victor held up a hand apologetically and stifled his laughter, “Sorry. Sorry. Anyway, yes, the people who love my product, really love it. I get plenty of repeat buyers.”

“But?” I asked.

“There seems to be a certain set of people who can’t handle it,” he said, “They can’t take it, and they have really bad trips. I don’t want to water my product down just to make something mediocre that everyone will kind of like. But if I knew who was going to have a bad trip before they had it, then I could steer them toward the shrooms instead or something.”

I nodded, “Got any data on who these people are? The bad trippers versus the good trippers?”

Victor gave one of his Cheshire grins, “As a matter of fact, I do. I had my dealers write down a few things about their customers. If the customer let them know it was a bad trip or if they were a repeat customer but never came back after buying acid, then we logged the sale as a bad trip. If the customer came back and said they had a good time, then we logged the trip as good,” he said.

He fired up Excel and turned a spreadsheet toward me. The data looked like this:

“So 1 is a bad trip and 0 is good trip?” I asked. Victor nodded yes.

“This is a really odd set of features,” I said staring at the data, “Some physical descriptors, the customer’s product preference, the phone they carried. I can’t see how their phone matters. Is 1 male or female?”

Victor smiled, “1 is male. And with attire, 10 is a full suit and nice shoes, while 1 is dirty, homeless clothing. It’s very subjective.”

“I thought I could run a linear regression on the numeric values,” he added, “but I wasn’t sure what to do with the last two columns. They’re not numeric.”

“Well, first off, linear regression is totally the wrong way to go, because your response variable, i.e. the thing you’re trying to predict, is binary, not continuous. You want to predict things that are either a good trip or a bad trip, 0 or 1, but if you just plot a trend line through your data, you’re going to get predictions above and below that,” I said, “What you need is to do some logistic regression, which is going to take your data and spit out a probability value between 0 and 1 where the closer you are to 1, the more likely you are to have a bad trip.”

“OK, so logistic regression is something I use when I want to predict the answer to a ‘yes or no’ question?” Victor asked.

“Exactly,” I said, “Will my customer buy? Will they use the coupon I gave them? Is my supplier a cop in disguise? Will they give my meth a five star review on Yelp? All those questions could be answered with logistic regression.”

“But before we do that,” I continued, “the first step is to transform these favorite drug and cell phone columns from categorical data to dummy variables.”

“What’s a dummy variable?” asked Victor.

“Well,” I said, “Rather than have one column for all your favorite drugs, we’re going to make one column for each drug, and if your favorite drug is meth then the meth column gets a 1 and the rest get 0s.”

Victor nodded, “So you’re taking each option and giving it its own indicator column?”

“Yep,” I answered, “Except that since you’ve got 7 drug options, we only need columns for 6 of them. If they’re all 0, then we’ll know that the 7th, absent column is a 1. That’ll keep us from encoding any redundancy into our new data.”

I copied the data over into the new format in a tab called “CategoricalFixed” where it now looked like this (apologies for the width; download the *spreadsheet for this post* to see it better):

“See how the categorical data is now blown out across these dummy columns?” I asked.

“Yes, I see,” said Victor, “And if I were to have 0s all the way across for my phone columns, that means my phone was an Android phone in the original data since that’s the absent column?”

I smiled, “Exactly.”
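For anyone following along outside of Excel, here is a minimal Python sketch of the same dummy-variable idea. The three phone names come from the story, but treating them as the complete list is my assumption for illustration, and `dummy_encode` is a made-up helper name rather than anything from the workbook.

```python
def dummy_encode(value, categories, reference):
    """Return one 0/1 indicator per category, skipping the reference category."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return {c: int(c == value) for c in categories if c != reference}

# Assumed phone list for illustration; Android is the absent (reference) column.
PHONES = ["Android", "Blackberry", "iPhone"]

print(dummy_encode("Blackberry", PHONES, reference="Android"))  # {'Blackberry': 1, 'iPhone': 0}
print(dummy_encode("Android", PHONES, reference="Android"))     # {'Blackberry': 0, 'iPhone': 0}
```

An Android customer shows up as all 0s, which is exactly the "absent column" reading Victor describes.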

“Ok, so what do we do with this new data?” asked Victor.

“Well, let’s talk about it as un-technically as possible. For some future customer, I want to gather data just like this and combine it in some way, run it through some function, such that out comes a number between 0 and 1, right?” I asked.

“Correct,” said Victor.

“And what we have here, in machine learning speak, is ‘training data,’” I said, “We’re going to train an artificial intelligence model on this data, and by ‘model,’ all I mean is a function that combines a row of data like this and spits out a value between 0 and 1.”

I added a row at the top of my sheet that had all the same columns as below, except I added a column called ‘constant’ to the end of it.

“This row here is going to give the coefficients that we’re going to multiply by the data you’ll gather from a customer. If this were a linear regression, you’d just fit it such that the sum product of this row with the gathered data for the customer would give you a prediction.”

I jotted the equation down on a sheet of paper:

“That sumproduct looks like this in mathematical notation, where the x values are what we gather from the customer and the b values are what I’ve got in this green coefficient row,” I said, “I’m just going to call that sumproduct a ‘logit,’ but it kinda looks like what we’d get out of a linear regression, right?”

“Yes. But we’re not doing a linear regression.” said Victor.

“Exactly,” I said, “So instead we’re going to take this simple equation I’m calling the logit, and we’re going to transform it into a value that’s guaranteed to be between 0 and 1,” I said.

“How?” asked Victor.

“Well, consider this equation,” I said and jotted down a fraction:

“That e is just the mathematical constant used as the base of the natural logarithm. It’s approximately 2.72, and you can use it in Excel via the EXP() function,” I said.
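Written out, the two equations from the scrap paper can be reconstructed from the dialogue roughly as follows. The symbol names are my own choice: the x values are the customer's data, the b values are the coefficient row, and b with the "constant" subscript is the extra constant column.

```latex
\text{logit} = b_1 x_1 + b_2 x_2 + \dots + b_n x_n + b_{\text{constant}}

\text{probability} = \frac{e^{\text{logit}}}{1 + e^{\text{logit}}}
```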

“Ok,” said Victor, “So why is this always between 0 and 1?”

“Well, ok, let’s start with the coefficients I currently have in the workbook. All 0s,” I said.

“If I use those coefficients in my logit, then the logit is equal to what?” I asked.

Victor thought a second, “No matter what data the customer gives, the logit would be 0 because all your multipliers are.”

I nodded, “Exactly. Which means that this second equation is e raised to the 0th power divided by one plus e raised to the zero, right?”

“Right,” said Victor, “Which is just one divided by two. It’s just .5.”

“Exactly,” I said, “And what if my coefficients were something else? What if they made my logit a negative number instead of 0?”

“Hmmm,” said Victor, “Well if the logit were, let’s say, -2, then e to the -2 is 1 divided by e squared, which is something like 1 divided by 7 or 8. So that’d give us something close to an eighth divided by one plus an eighth, which is approximately one ninth.”

“Yeah, it’s about .12. So already we’ve fallen well below the .5 value we got from a logit of 0. And what if instead of -2 I had a larger negative number, like -100?” I asked.

“Well, e to the -100 would be a very, very small number. So the whole calculation would give a small number over 1 plus that same small number,” Victor said, “So the whole calculation is getting closer and closer to zero.”

“Right,” I added, “And if we go the other way, if the logit is big and positive, we approach 1. We’d get a big ass number divided by a big ass number plus 1, which is going to end up as .999-something. Since our coefficients are going to end up being set to actual numbers, we’ll never hit a solid 0 or a solid 1, but we can get arbitrarily close.”
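If you want to check those numbers yourself, here is a small Python sketch of the transformation. The function name `logistic` is my own label, and the branch for positive logits is just an algebraically equivalent form that keeps `math.exp` from overflowing.

```python
import math

def logistic(logit):
    """Turn any logit into a value between 0 and 1 via e^logit / (1 + e^logit)."""
    if logit >= 0:
        # Same fraction with top and bottom divided by e^logit.
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)

for z in (0, -2, -100, 5):
    print(z, logistic(z))
```

A logit of 0 gives exactly .5, a logit of -2 gives about .12, and -100 gives something vanishingly close to 0, matching the walk-through above.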

I jotted down some examples on paper:

“So no matter what value the logit takes,” said Victor, “We’ve got a value between 0 and 1. But how do we set this green row of coefficients such that the value between 0 and 1 we get for the customer’s input data is correct?”

“Ah ha,” I said, “This is where we train the model.”

I added a column into the sheet for the logit for each row of training data the dealers had gathered:

I then added the probability calculation next to each logit:

“Now, all we need to do is find the coefficients that make the probability column as close to the ‘Bad?’ column as possible. If we can find coefficients that give a 0 probability for the people we know had good trips and a 1 for the people who had bad trips in our training data, then bam, we have our model,” I said.

“So how do we find those coefficients?” asked Victor, “Trial and error?”

I laughed, “Oh lord. That’d take an eternity. No, we’re going to use Excel Solver just like we did some weeks ago.”

Victor nodded, “But to use Solver, don’t we need an objective, just like how cocaine cost was our objective in the other problem?”

“Right,” I said, “So this time around, we want our probabilities on our training data to be very near our actual values. So consider this value:

“What happens when I had a bad trip and the model predicted I would have one? In other words, what happens when ‘bad?’ is a 1 and ‘probability’ is super high, like .99?” I asked.

Victor stared at the equation a moment, “Well, you get .99^1*.01^0 which is more or less 1.”

“And if I had a bad trip but the model completely whiffs and says I’ll have a good one?”

“You’d get .01^1 *.99^0,” Victor said, “Which is more or less 0.”

“So then if we calculate one of these values for each row of our training data and sum them up, all we have to do is maximize the value of their sum by changing the coefficients around. If we get a lot of scores of 1, then we’ve got a pretty good fit on the training data, don’t we?” I asked.
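The row score being discussed can be read off Victor's arithmetic as the probability raised to the "Bad?" value, times one minus the probability raised to one minus the "Bad?" value. Here is a minimal Python sketch under that reading; the name `score` is mine.

```python
def score(bad, probability):
    """Row score: probability^bad * (1 - probability)^(1 - bad)."""
    return probability ** bad * (1 - probability) ** (1 - bad)

print(score(1, 0.99))  # bad trip, model agrees: close to 1
print(score(1, 0.01))  # bad trip, model whiffs: close to 0
print(score(0, 0.01))  # good trip, model agrees: close to 1
```

Whichever way the trip went, the score is high when the probability sits near the actual outcome and low when it sits near the opposite one.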

I added the score values to each row of the sheet, summed them up, and opened solver.

I set solver to maximize the sum of the scores, while changing my coefficient row.

Before I hit solve, I made sure I’d chosen the GRG Nonlinear solver option from the Solving Methods list.

“Why not use the simplex algorithm, like last time?” asked Victor.

“Because these exponential functions aren’t linear,” I said, “It’d barf. But the nonlinear solver will handle the problem nicely.”

I hit solve, and the algorithm set the coefficients.
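Solver is a black box here, so as a rough stand-in, this Python sketch maximizes the same sum of scores with plain gradient ascent. That is a swapped-in technique, not what GRG Nonlinear does internally, and the training rows, step size, and iteration count are all made up for illustration. It is also worth knowing that textbook logistic regression usually maximizes the sum of the logs of these scores rather than the raw sum; the sketch sticks with the objective used in this post.

```python
import math

def probability(coefs, row):
    """Logistic of the sumproduct of the coefficients and one row of data."""
    logit = sum(b * x for b, x in zip(coefs, row))
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)

def total_score(coefs, rows, labels):
    """Sum of p^y * (1 - p)^(1 - y) over the training rows (the objective)."""
    total = 0.0
    for row, y in zip(rows, labels):
        p = probability(coefs, row)
        total += p ** y * (1 - p) ** (1 - y)
    return total

def fit(rows, labels, step=0.01, iterations=5000):
    """Gradient ascent on total_score, starting from all-zero coefficients."""
    coefs = [0.0] * len(rows[0])
    for _ in range(iterations):
        grad = [0.0] * len(coefs)
        for row, y in zip(rows, labels):
            p = probability(coefs, row)
            # d(score)/d(logit) is +p(1-p) for a bad trip, -p(1-p) for a good one.
            slope = (2 * y - 1) * p * (1 - p)
            for j, x in enumerate(row):
                grad[j] += slope * x
        coefs = [b + step * g for b, g in zip(coefs, grad)]
    return coefs

# Made-up training data: [attire score, constant]; label 1 means bad trip.
rows = [[2, 1], [3, 1], [4, 1], [6, 1], [8, 1], [9, 1]]
labels = [0, 0, 0, 1, 1, 1]

coefs = fit(rows, labels)
print(total_score([0.0, 0.0], rows, labels))  # 3.0 with all-zero coefficients
print(total_score(coefs, rows, labels))       # higher after fitting
```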

“Bam!” I said and pointed to the top row.

“If the coefficient is positive, that’ll push the probability of a bad trip up, right? And if it’s negative, it’s pushing the probability down,” I said, “So what do you notice?”

Victor studied the coefficients for a moment, “Attire is very important. The nicer the customer dresses, the more likely they are to have a bad trip.”

“Yeah, same with jitteriness, while tattoos work the opposite direction,” I said, “And check out the favorite drug coefficients. X, meth, and coke all increase likelihood, while Ketamine, LSD, and shrooms all decrease likelihood.”

“And with phones, Blackberries are most likely to correlate with bad trips,” said Victor.

“I think what we’re seeing is that Type A individuals are most likely to have bad trips,” I said.

“Really?” asked Victor.

“Well, when I think about a suit-wearing, non-tattooed, jittery, Blackberry-using cokehead, I think Type A. When I think of a scruffy, laid-back person who likes Ketamine and uses an iPhone instead of a Blackberry, I think Type B,” I said.

Victor nodded his head side to side, “It’s not an airtight theory, but I see your point; people who are controlling might have difficulty with very strong acid.”

“OK, so the next thing we need to do is set up a calculator for future predictions,” I said, “For that we just take a new row of data through the same calculations with the coefficients we just found.”

I set up a calculator section in the spreadsheet for Victor and jammed in some made-up data:

“Let’s say we’ve got a 5’2″ dude with mediocre attire, no tattoos, and the jitters. He loves coke and is glued to his Blackberry,” I said, “In that case, the survey says…he’s a .76 so he’s more likely to have a bad trip than a good one.”

Victor smiled, “Neat! So I could turn this into a little iPhone app and have my dealers refuse a customer or sell them a diluted sample if they score too high.”

“Exactly,” I said, “You can use this simple AI model to better customize your product based on how you think your customers will react to it.”

“This is stellar my boy,” said Victor, “Thanks.”

I laughed and reached out a hand to shake Victor’s, half mockingly. As I moved my hand toward him, I knocked our scrap paper onto the floor.

“Whoops,” I said, and Victor raised a hand and bent under the table to gather the spilt paper. Quickly, I slid the USB drive that Agent Bestroff had given me into the USB port of the laptop.

The computer made a brief bump-bump noise, indicating something had been plugged in. I immediately broke into a sweat, but Victor seemed not to notice.

As he leaned back up with the papers, I slid the laptop a little more my way, palming the USB port.

“Let me just save this for you,” I said and clicked save while subtly sliding the USB drive out of the port and back along the table to myself. Victor collected the papers in his hands and seemed not to notice anything I did.

“That all you want to look at today?” I asked cheerily. My heart felt like it was about to burst, and my voice wavered slightly.

Victor set the papers down and fished a wad of bills out of his pants pocket.

He handed it to me, “That is all for today Alex. Thank you so much for being so useful.”

There was that word again. Useful. I wondered what happened to me when “useful” turned into “screwed me over with the DEA.”

Short of basic summary statistics on large datasets, predictive modeling is probably the most common big data pursuit. Can I predict my customers’ behavior using the data I’ve gotten in the past?

And I hope this post has convinced you that very simple predictive modeling is actually quite easy. In fact, this Excel exercise is way harder than what it takes to create the same model in Matlab, R, SPSS, etc., because you’ve gotta solve for the coefficients yourself. Understanding the guts of predictive models is a lot harder than blindly using them, but for better or for worse, the latter approach can often work just fine.

Of course, the idea that this is the sum total of predictive modeling is something I’ve termed the “Kaggle fallacy.” There’s a lot more to this stuff than just training models. For instance, which data should Victor have collected? Which features should be selected for the model? What if there’s a class imbalance between good and bad trips? Maybe the data’s dirty, or it’s scaled oddly. What about dreaming up problems based on the data available, or dreaming up what data to gather and features to assemble in order to solve problems that are plaguing your business? The truth is that a holistic approach to machine learning is important, because shit like this is just stupidity.

OK, let me catch my breath. All this to say, machine learning is an awesome tool for your analytics toolbox. It ain’t that hard to use (say, versus building a mixed integer program). So download the workbook and give it a shot.

If you’re interested in further study on this topic, this book is the effing bee’s knees.

Plus, the dude’s name is Torgo which reminds me of Torgo from *Manos: The Hands of Fate*.


Graham pulled me out of bed, fuming, “Dude, you gotta stop these people bugging us at the house this early. I need my beauty sleep.”

Now, Graham doesn’t need beauty sleep any more than Albert Finney, but he was right. I wasn’t even sure how Victor got my address now that I’d come to think of it.

When I’d dragged myself to the door and opened it, it was neither Andre nor Victor. Instead, the man standing in front of me looked like Ving Rhames with a big old bald dome of a head. His tone was as flat as his boring, gray suit.

“Alex Sheffield?” he asked in a bass voice.

“Yeah,” I said sleepily, not really understanding what was going on.

“Agent Bestroff, DEA. You need to come with me.”

I’m not afraid to admit that I nearly shat a brick. My ears must have been bright red, and my knees had started knocking a little. I would’ve been embarrassed at my lack of composure had I not been too scared shitless to care.

“Am I under arrest or something?” I asked.

Bestroff raised an eyebrow and spoke with a Mississippi drawl, “Do you wanna be?”

He loaded me into his Tahoe, and we drove to a brutalist, khaki-colored building in the center of downtown. He took me through security and escorted me into an interview room on the sixth floor. Bestroff didn’t say a word to me til he’d settled across from me in an Emeco chair.

“Why do you think you’re here?” asked Bestroff. He ran a hand across his bald head.

I shrugged and remained mute.

Bestroff stood and left the room. A moment later he reentered with a folder, which he flopped disinterestedly on the table in front of me. I opened it and looked inside to find a stack of photos.

Riding in the truck with Andre. Victor and I at the hotel. Victor and I at the airport. Even on the freaking airplane.

“We’ve got all the audio too,” said Bestroff, “You’re a very good tutor.”

For the first time, Bestroff cracked a smile. He knew he had me in a position to squirm.

“I want a lawyer,” was all I said.

Bestroff started to laugh, “Look Alex, you little shit. I don’t care about you, and I sure as hell don’t want to waste my time charging you. No, I want Victor, and I want his associates. All of ’em. And you’re going to help me get them.”

I was still freaking out a bit, but Bestroff’s tone was reassuring. He continued, “And if you don’t want to help, I’m going to approach your school, your frat; hell, I’m gonna have a talk with your parents. I’ve got enough to jail you, but even if I didn’t, I love dealing with spoiled college kids. People like you are always scared and always have too much to lose.”

He had me pegged. I’m not afraid to admit it. And I sure as hell didn’t want my dad involved in this. I’d never hear the end of it.

I just nodded, “What do you want me to do?”

Bestroff cracked another huge smile, “We’re gonna start small. You’re gonna sign some papers. We’ll have a chat. And then we’re gonna all get back to work.”

Bestroff held out a hand across the table, “Welcome to the team, asshole.”

Two hours later, Bestroff escorted me out the building and made me walk back home past the Varsity. In my pocket I carried a little USB drive, but it weighed on me like a brick.

“When Victor’s not looking, pop it in his laptop for five seconds, then remove it. That’s all you’ve got to do,” he’d said, “We’ll log all his communications after that.”

It was a simple task, but those five seconds could get me killed. By the time I’d made it back to the house, the stress was palpable. And that was before I saw Andre leaning against his car door outside the front of the house. I was shitting so many bricks today, I could’ve built a pyramid out of them.

What a crazy weekend. Nuts. I’m completely wiped, but I’m sitting here typing, because I’m jet lagged and can’t sleep a wink.

It all started Friday morning with a knock on my bedroom door. Graham was standing there in his tighty whiteys (why does he wear those again?) and mustache looking like some kind of molester. He was rubbing sleep from his eye.

“There’s someone at the door for you,” said Graham.

“The front door?” I asked groggily. The night before had been fun but unkind, and my head felt a little like it’d been taxidermied.

“Yes, Alex. The front door you jackass. What other door would I be talking about?” he said, irritated.

My first thought was Andre. I just knew it was him. Victor’d sent him again. How much math can one drug dealer need help on, I wondered.

“Some older dude in a suit,” Graham added as I rolled out of bed and tossed on some pants, “Looks like he could be your dad, man.”

Oh shit, I thought. This is gonna be the FBI, DEA, who knows who. I’m so screwed.

But it was neither Andre nor the heat. It was Victor. Victor in a three piece suit, looking dapper as hell. He was standing at the door of the frat house eying his watch with impatience when I walked up.

“Come on my boy,” he said when he saw me, “We must get going.”

“Huh?” I asked, still a bit groggy. It was only just now 6 AM.

“We have a plane to catch,” he said, “Put on some clothes, and grab your passport and wallet. You do have a passport, don’t you?”

“Yeah, I have a passport,” I answered, “But why?”

“I’ll tell you in the car, now go get what you need for the weekend,” he said and turned back to the CLK Black that was idling in front of the house.

I shoved some stuff in a bag as fast as I could, and moments later, we were speeding down 85 toward the airport.

“I must go to Amsterdam on business this weekend, and I have some decisions I need to make before I meet with my suppliers. There’s no time for you to help me here, so you must work with me on the plane,” he said.

“I’m going to Amsterdam?” I asked.

“Have you ever been?” he asked.

“Never left the states except to Canada and the Bahamas,” I said.

“Well, then you should enjoy this,” he laughed and looked over at me, “Consider it a short European holiday.”

Any thought of this being a holiday faded when I passed through security at Hartsfield in the company of an international drug dealer. The TSA agent stopped me, and for a second I thought that surely I’d be arrested for who knows what. But it turned out that I’d left my lighter in my pocket, so I mopped away the bullets I was sweating and followed Victor to the international concourse.

You get this idea from the movies that all drug dealers fly in private jets, but that’s just not the case it seems. I asked him about it.

“Private jets attract scrutiny. I lead a simple life, but I do like to fly first class on international travel,” he said with a shrug.

When we’d boarded, gotten our gin and tonics in order, and leveled out at thirty six thousand feet, Victor removed his laptop from his carry-on.

“So here’s the deal,” he said, “I’m meeting with my main X supplier in Amsterdam, and he wants me to pick a design for my pills from the new ones he’s got rolling off the line.”

“OK,” I nodded.

He opened a spreadsheet and pointed the screen my way.

“These are my options,” he said.

“Wow, ok,” I said, “Kind of random, aren’t they?”

“He makes what he makes. Strange guy,” said Victor.

“So why not just pick one, any one?” I asked, “Like the four leaf clover. That could be cool.”

“If you were a car company would you just release any car? Would you just plop out an Aztek and be done with it?” he asked.

I shook my head no.

“Of course not,” he said, “And in this case, we know for a fact that pill attributes directly affect sales. That’s what my dealers tell me. At the bigger festivals and shows, there’s plenty of competition. I need my pills to stand out.”

“So what are the most important attributes?” I asked.

“Well, I asked that of twenty of my trusted dealers. I had them give me a ranked list of the attributes that mattered when they tried to sell this stuff,” Victor said, “I also had them vote individually on which attribute was most important in each pair of attributes. This is what they gave me.”

He showed me two small tables of data. The ranked list:

“So color is more important than purity?” I asked.

“Druggies like pretty colors. Especially under black light. Very few bring Simon’s reagent with them, and even when they do, they’re willing to accept less than 100% MDMA so long as the additives are benign,” he shrugged.

“Explain texture,” I said.

“If the pill looks bumpy or brittle, people seem to think it’s cheap, like back alley lab cheap,” he answered.

“And timeliness?” I asked.

“Is the pill stamped with an image, shape, or text that provides a timely reference to world events?” Victor said, “So the Obama pill and the Republican pill are both timely since it’s an election season, while the four leaf clover would be better next March for St. Patty’s. People get a kick out of those touches, and it makes the pill seem fresh.”

“So ‘pop culture’ is like the Bart Simpson pill,” I said, “It’s a reference.”

“Yes, and likewise ‘Brand’ just means it gives a brand reference,” he said.

“What about happy?” I asked.

Victor laughed, “Does the pill look happy? This seemed stupid to me, but many of my dealers agreed on this. People like uplifting images on the pill. Something that’s not intimidating.”

“Weird,” I said.

“And here’s the vote data,” he added and pulled up another table, “I had 20 dealers weigh in.”

“OK, so here you’ve got all the pairwise comparisons with vote counts,” I said.

Victor nodded, “And here I graded my options as best I could based on these attributes.” He flipped over to a new sheet:

“One means yes, zero means no. The MDMA column gives the percentage MDMA. Also, I gave Bart only half a point for his color since it’s not very vibrant,” he said.

Victor took a sip of his drink and looked over at me, “So which one do I pick?”

I laughed, “So what you’re bumping up against here is a topic called multi-criteria decision analysis. How do you make decisions in the midst of multiple, competing objectives, such as the oft-cited ‘risk versus reward’?”

“So this is like picking stocks?” he asked.

“To a degree. We’ve got a whole host of criteria here, but we can only pick one pill design. I can tell you up front though that whatever we pick, it ain’t going to be the Optimus Prime, the clover, the smiley face, or the crown,” I said.

Victor furrowed his brow and looked back at his spreadsheet, “Why?”

“Because those four pills are all strictly dominated by other pills. They’re not Pareto efficient,” I said.

Victor’s brow furrowed even further, so I pointed out a few rows in the spreadsheet, “What I mean by that is that if we look at the grades you’ve given here I can see that Bart has everything the Smiley pill has and more. So no matter which attributes are most important, Bart is always going to beat Smiley. So we can ignore Smiley.”

Victor nodded, “That makes sense.”
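To make the dominance check concrete, here is a short Python sketch. The grades below are hypothetical stand-ins on three attributes, not Victor's actual sheet, and both function names are mine. It uses the usual definition: one option dominates another if it is at least as good everywhere and strictly better somewhere.

```python
def dominates(a, b):
    """True if a is at least as good as b on every attribute and better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_efficient(options):
    """Names of the options that no other option dominates."""
    return [
        name
        for name, grades in options.items()
        if not any(
            dominates(other, grades)
            for other_name, other in options.items()
            if other_name != name
        )
    ]

# Hypothetical grades on three attributes (higher is better).
pills = {
    "Bart": [0.5, 1, 1],
    "Smiley": [0.5, 0, 1],
    "Obama": [1.0, 1, 0],
}
print(pareto_efficient(pills))  # ['Bart', 'Obama']
```

In this toy data Bart matches or beats Smiley everywhere, so Smiley drops out, while Bart and Obama each win somewhere the other loses.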

“So then the question is,” I said, “How do we combine your attribute scores into a single score here?”

“You average them,” said Victor.

I smiled, “Right, but we can’t do a straight average. After all, some attributes are more important. We have to do a weighted average.”

“Sure,” said Victor, “But what weights do we use?”

“For that, we get into a strange pseudo-scientific field of weighting techniques. We know that our weights should sum to 1, and we know that if an attribute is more important than another, then its weight should be bigger.”

“But how much bigger?” Victor interrupted.

“Precisely,” I said, “How should the weights decay as you go down in ranking? It depends. A lot of psychology goes into it. People tend to overweight their lower criteria when buying things only to make the decision almost entirely off their top criterion. For instance, they may say they care about fuel economy right after cup holders and paint color, but when your customer buys a Hummer, you know that that preference didn’t hold much weight.”

“So the weights should decay rapidly?” he asked.

“Fairly rapidly, yes,” I answered, “I wouldn’t give all the weight to ‘Color’ but I wouldn’t give but a tiny bit to ‘Brand.’”

“Got it,” he said, “So where do we start assigning the weights?”

“Well, let’s take a look at three ‘direct weighting’ techniques, and one ‘indirect’ weight technique that will use the vote data you’ve given me instead of the ranking data,” I said.

“Sounds good,” answered Victor.

I set Victor’s computer on the seat tray in front of me and added a column to the attribute ranking where I inverted the ranks:

“We’ll need that in a sec,” I said, “The first technique I want to show you is the simplest. It’s called Rank Sum, and the way it works is that if I come in first place out of eight criteria, I get 8 points out of a total of 8+7+6+5+4+3+2+1 points equal to 36 points. So my weight is .22. If I come in last place I only get 1 out of 36 points so my weight is .03.”

I scribbled out the formula on a napkin:

“This is the calculation for each attribute *i* with rank *r_i*, where in this case *K* is 8,” I said. In Excel it looked like this:

=(9-A2)/(9*8-SUM(A$2:A$9))

“Those weights don’t taper off very fast,” I said as I dragged down the formula, “In fact, they decay linearly.”
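The same Rank Sum calculation as a minimal Python sketch, assuming ranks run from 1 (most important) to k with no ties. The function name is my own.

```python
def rank_sum_weights(k):
    """Rank Sum weights for ranks 1..k: (k + 1 - rank) / (1 + 2 + ... + k)."""
    total = k * (k + 1) // 2
    return [(k + 1 - rank) / total for rank in range(1, k + 1)]

weights = rank_sum_weights(8)
print([round(w, 2) for w in weights])
```

With eight criteria the first weight rounds to .22 and the last to .03, and each step down the ranking loses the same 1/36.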

“So what other options are there?” asked Victor.

“Well, on the flip side (pun intended), there’s something called Rank Reciprocal where I get the value of my inverse rank divided by the sum of all the inverse ranks,” I said.

I filled in the column in Excel with this formula:

=C2/SUM($C$2:$C$9)

“Here we get better decay at the beginning,” I said.

“But ‘Happy’ and ‘Brand’ actually count for more at the end,” he said.
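Here is Rank Reciprocal as a minimal Python sketch, again assuming ranks 1 through k with no ties and a function name of my own choosing.

```python
def rank_reciprocal_weights(k):
    """Rank Reciprocal weights: (1 / rank) divided by the sum of all 1 / rank."""
    inverses = [1 / rank for rank in range(1, k + 1)]
    total = sum(inverses)
    return [inv / total for inv in inverses]

weights = rank_reciprocal_weights(8)
print([round(w, 3) for w in weights])
```

For eight criteria the top weight comes out near .37, while the last-place weight is about .046, which is higher than the 1/36 that Rank Sum gives last place.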

“Right, so let’s look at my favorite technique. It’s kind of in the middle of these two. It’s called Rank Order Centroid, although occasionally you’ll hear it referred to as MAGIQ or SMARTER in the literature,” I said and jotted down a new formula on my napkin:

“Just like rank reciprocal, it uses the inverted rank values,” I said, “but this one gives us a decay that’s a bit more agreeable. Not that you care, but the calculation is also a bit better grounded in the natural world. It’s a bastardized center of mass calculation.”

I put the formula in Excel and dragged it down:

=SUM(C2:C$9)/8

I graphed the three different weight columns:

“See how the Rank Order Centroid weights decay quickly and finish low?” I asked.

Victor nodded.

“That fits pretty well with human psychology. Whenever I don’t have weights to start with in a problem, I go the ROC route,” I said.
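And Rank Order Centroid as a minimal Python sketch that mirrors the Excel formula: sum the inverse ranks from your own position down to the bottom, then divide by the number of criteria. Ranks are assumed to be 1 through k with no ties, and the function name is mine.

```python
def rank_order_centroid_weights(k):
    """ROC weight for rank i: (1/i + 1/(i+1) + ... + 1/k) / k."""
    inverses = [1 / rank for rank in range(1, k + 1)]
    return [sum(inverses[i:]) / k for i in range(k)]

weights = rank_order_centroid_weights(8)
print([round(w, 3) for w in weights])
```

For eight criteria the top weight is about .34 and the bottom one is 1/64, lower than last place gets under either Rank Sum or Rank Reciprocal.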

“But then what about the votes?” asked Victor.

“Ah ha,” I said, “You’re right. We need to go over indirect weighting.”

I adjusted myself in my seat a little. The cabin was hot, and drinking only made it hotter. My butt was starting to stick to the seat.

“So here’s the deal,” I said, “Asking a group of people like your dealers to create a ranked list can be an inherently flawed question. Group decision-making on deciding the placement of low-ranking criteria is shit. Twenty people cannot collectively decide whether fifth and sixth place shouldn’t be reversed. If there’s a loud person in the room, no one’s gonna fight to make sure their ordering of low-ranking criteria wins.”

“So instead, using pairwise comparisons, like we have in this voting data, keeps the task manageable,” I said, “Furthermore, having votes instead of a pure ‘this is better than that’ gives us some more data we can use.”

I flipped the tab in the spreadsheet over to the voting data, “So we can use something called the analytic hierarchy process or AHP to transform these votes into weights. It’s a bit convoluted though.”

Victor nodded, “I’m trapped on a plane with nothing better to do. Let’s try it.”

“OK,” I said, folding my hands and cracking my knuckles, “You’ve got the winners in each vote in the left column, so what I’m going to do is take the vote difference and then normalize that difference to a score between 1 and 9 where 1 is a tie and 9 is a 20-to-0 vote.”

I plugged this formula into Excel to do the normalization:

=E2*(8/20)+1

and dragged it down. I then created a new tab called ‘AHP’ and created a criteria X criteria grid with 1s on the diagonal, the normalized scores on the upper triangle, and the inverse of the normalized scores on the lower triangle like this:

“So this is just a matrix representation of your vote data where a value greater than 1 indicates that the row value is more important and a value less than 1 indicates that the column value is more important,” I said.

“What do we do with it though?” asked Victor.

“We’re going to pull an eigenvector out of it that’s going to be our weights,” I said, “which is a fancy word for a simple computation. The first thing we do is multiply all the elements on a row together and take their eighth root in this case because there’s 8 elements.”

I added in the column to the right of the matrix taking the 8th root of the product of the elements on each row.

=PRODUCT(B2:I2)^(1/8)

“Then I’m going to total these values up and normalize them by their sum, so that the vector adds up to 1. And that’ll be my weight vector,” I said.
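The spreadsheet steps above (scale the vote margin to 1 through 9, build the reciprocal matrix, take row geometric means, normalize) can be sketched in Python. The three criteria and their vote margins below are invented for illustration and are not Victor's data, and the function names are hypothetical. Note that the row geometric mean is a common shortcut that approximates the principal eigenvector rather than computing it exactly.

```python
from math import prod

def vote_to_scale(vote_diff, voters=20):
    """Map a winning margin (0..voters) onto the 1-9 scale, like =E2*(8/20)+1."""
    return vote_diff * (8 / voters) + 1

def ahp_weights(matrix):
    """Row geometric means of a reciprocal comparison matrix, normalized to sum to 1."""
    n = len(matrix)
    geo_means = [prod(row) ** (1 / n) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]

# Hypothetical 3-criteria example: A beats B by 10 votes,
# A beats C by 20 votes, B beats C by 5 votes.
ab, ac, bc = vote_to_scale(10), vote_to_scale(20), vote_to_scale(5)
matrix = [
    [1,      ab,     ac],   # row A: winner's score on the upper triangle
    [1 / ab, 1,      bc],   # row B: inverse score on the lower triangle
    [1 / ac, 1 / bc, 1],    # row C
]
weights = ahp_weights(matrix)
```

With eight criteria the same function takes the eighth root automatically, because it uses the row length as the root.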

“And so the advantage of these weights is that I can elicit them without making the group create a ranking together?” asked Victor.

“Yes, and there are all kinds of boring guides out there on how to do this,” I said, “But I think they can be a waste of time for most stuff.”

“So can we see how the various weights perform?” asked Victor.

“Let’s do it,” I said. I moved back over to Victor’s choices, pasted in the weights and took the sumproduct of weights and scores for each pill:

We stared at the results a moment.

“Obama wins for everything but Rank Sum,” said Victor, motioning to the last row, “for that technique, Bart barely wins.”

“Right,” I said, “And that’s because Rank Sum decays slower, so even though Bart’s color is only a half point, he still gets lots of points later. That’s the wrong call in my opinion.”

Victor nodded, “I agree.”

“The rest are quite similar with departures occurring further down the pill ranking,” I said, “And that should tell you that your time is best not spent doing a crapload of complex scoring on pairwise comparisons. Just be careful when you rank your criteria.”

“That makes sense,” said Victor, “I think it’s interesting that some of those points you excluded did well here.”

“Yeah, that’s because while they’re not Pareto efficient, they still had points in the right weighted places. The elephant was Pareto efficient but on low-weighted criteria.”

“Ah ha,” said Victor.

He slammed the lid to his laptop shut and smiled a broad smile, “Green Obama it is! Care for another drink?”

Wanting to weight things comes up more often than you’d think in analytics. It’s not exciting, but when you have to do it, it’s nice to have some tools. If you can’t tell, I gravitate toward ROC as a quick-and-dirty technique that provides generally OK results.
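For reference, here is a rough sketch of the two rank-based weightings compared in the dialogue, assuming the standard textbook definitions of rank order centroid (ROC) and rank sum. The function names are hypothetical, and `weighted_score` stands in for the spreadsheet's sumproduct.

```python
def roc_weights(n):
    """Rank order centroid: w_i = (1/n) * sum of 1/k for k = i..n, rank i = 1..n."""
    return [sum(1 / k for k in range(i, n + 1)) / n for i in range(1, n + 1)]

def rank_sum_weights(n):
    """Rank sum: w_i = 2 * (n + 1 - i) / (n * (n + 1)), rank i = 1..n."""
    return [2 * (n + 1 - i) / (n * (n + 1)) for i in range(1, n + 1)]

def weighted_score(weights, scores):
    """Equivalent of SUMPRODUCT(weights, scores) for one option."""
    return sum(w * s for w, s in zip(weights, scores))

roc = roc_weights(8)            # eight criteria, as in the AHP grid
rank_sum = rank_sum_weights(8)
```

Under these definitions ROC puts more weight on the top criterion and less on the bottom one than rank sum does, which is the "decays slower" behavior of rank sum mentioned above.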

It’s terribly common when dealing with KPIs, performance dashboards, and other various Lean Six Sigma garbage to need to weight scores. Doing a weighted average of scores is also one of the few ways to solve a linear programming problem when you’ve got multiple, conflicting objectives. You can always do something like put one of the objectives in the problem as a constraint or something, but if you can shove everything in the objective function, why not?

The government really digs this stuff, especially the military. If you’ve got a ton of primary and secondary objectives, how do you build a simplified dashboard full of red/yellow/green indicators? You’re gonna have to combine some stuff, and these weighting techniques are one way to do it.

Now, if you’re working a problem with tons of data…for instance, let’s say Victor had months of demand data for his pills in a database. In that case, it might be possible to build a CART model or a random forest model of pill attribute versus likelihood to be purchased or demand or something. Then you could use the relative importance of the attributes from the model as weights. That’s how I’d do it if I had the data.


Two days later, I was walking to class at 9 in the morning when Andre pulled up in his truck.

“Get in the car, Alex,” he said to me through bites of a breakfast burrito.

“But I’ve got class,” I protested. The other students who walked by me on Tech Parkway eyed our conversation with curiosity.

Andre squirted some hot sauce on the end of his burrito and nodded.

“I don’t give two shits what you’ve got, man. Duty calls,” he said and took another bite.

I had a data structures class that morning, and I was truly hesitant to miss it. I had already slept through four others this semester.

“Can’t we just do this later?” I asked with a sigh.

“If you want to keep the big man waiting,” Andre said with a raised eyebrow, “It’s your funeral.”

I didn’t like Andre’s choice of idioms. I muttered some profanities to myself and got in the car. We sped off.

Half an hour later I was in neither Hapeville nor Buckhead. No, of all places, I met Victor in a Barnes & Noble. I couldn’t freaking believe it, but there he was sitting at a table in the coffee shop drinking a triple espresso and staring at his laptop. When he saw me walk in, he smiled and motioned me over to his table.

“Alex, my friend,” he said cheerily, “How are you this morning?”

I frowned, “I had class, but Andre pulled me away.”

Victor frowned along with me, “You should have told him to wait. I don’t mean to disrupt your schedule.”

But I could tell from Victor’s tone that he was all politeness. Of course he meant to disrupt my schedule. This was not a man who waited for anything. I took a seat next to him and draped my backpack across the back of my chair.

“Would you like something to drink?” he asked.

“Nah, I’ve already had two cups. So what’s on the agenda today?” I asked.

Victor gave me one of his Cheshire cat smiles.

“Today we talk about my wholesale business,” he said.

“Wholesale?”

Victor nodded, “Yes, it used to be that I’d occasionally have more product than I needed and less cash than I needed, so I’d sell some of my product in bulk to other dealers. Sometimes even competitors. Slowly over time though, I’ve formalized this business. I handle transpo into the states and sell in bulk to a variety of clients, most operating in cities that I do not. They get a substantial discount off street value for buying in bulk, and I make good money as well.”

“OK,” I said, “So what’s the challenge?”

“Ah, you wouldn’t be here if there wasn’t a problem, right?” he asked.

“Well,” he continued, “I now have a hundred customers who buy from me regularly, and I send them deals regularly. But I want to know more about their interests. I want to know what they’re looking for. I do not know all these fellows personally. Some are friends of friends, flashing headlights in an empty parking lot, but I have their purchasing history. I was wondering what it told me about them. Whether there was a way to target them better in the future. Maybe hit them with the same deal twice if they didn’t take it the first time I contacted them, but I know they should be interested.”

“Sure, they call that retargeting,” I said.

“Yes, retargeting,” he nodded and took a sip of his espresso.

I was a bit freaked out to be talking about this in a public place, but Victor seemed not to mind, and with the sounds of the espresso machine and the grinder and folks talking on their phones, our conversation just kind of blended in.

“So can you show me the data?” I asked.

Victor pushed his laptop screen my way. On it was a spreadsheet with a tab called “Inventory” that looked like this:

I looked the data over.

“So I’ve got deals offered on certain dates going down the rows, with the product type, the quantity, the discount off street value, the country of origin, and this last column ‘Ready for use’…what does that mean?” I said.

“Ready for use just means that it’s not formed into some weird brick or shoved up a teddy bear’s ass and encased in concrete,” Victor answered, “You could take it streetside the day you bought it without further processing.”

I nodded, “And then the columns are your customers where a 1 indicates that they took you up on an offer?”

“Correct,” said Victor.

“OK, so here’s what we’re going to do with this data,” I said, “We’re going to try to detect communities in it. Patterns, types of customers. We’re going to use some algorithms to find hidden segments in it that you can use to target your customers better.”

“That sounds like a perfect idea to me,” Victor said, smiling, “Where do we begin?”

“OK, I’m gonna set up a few things in this spreadsheet, and I just want you to follow along with me as best you can,” I said, “First, I’m going to put a row at the bottom of the table that totals up how many deals each customer took.”

“Then, I’m going to make another tab called ‘InventoryInvert’ in this spreadsheet where I’m going to take your data and transpose it by copying the whole table and doing a paste special –> values on it with the transpose box checked,” I said. When I’d done that, I had a new tab that looked like this:

“Now that I have that,” I continued, “I’m going to create a graph of customer distances in a third sheet.”

“Graph?” asked Victor.

I nodded, “Yeah, let me explain this part. Let’s imagine a network or a web, kinda like a spiderweb, where each customer is connected to every other customer. The thickness of each connection is based on how many deals the two customers both matched up on.”

“So you and I would get a point or something if we both took the same deal?” asked Victor.

“Yes,” I said, “Exactly. Except that I want to penalize people a little bit who are just going through and picking every deal, so instead of giving us a solid point, we’re going to get 1 point divided by the square root of the number of deals I took times the square root of the number of deals you took.”

“Those are the numbers in that sum row you just added, yes?”

“Correct,” I said, “This is a concept called cosine similarity just for your own edification.”

I created a new tab in the spreadsheet called ‘EdgeWeights’, and in it, I created a matrix of customers X customers where each edge weight was calculated by dragging this formula around from the top left cell:

{=IF(ROW()<>COLUMN(),SUMPRODUCT(Inventory!G$3:G$101,TRANSPOSE(InventoryInvert!$B7:$CV7))/(SQRT(Inventory!G$102)*SQRT(InventoryInvert!$CW7)),0)}

“What do the curly braces mean?” asked Victor.

“Ah, that’s called an array formula in Excel. Because I have to use the transpose function in my calculation, I need the curly braces. To get the curly braces, just type the formula without them and hit CTRL+SHIFT+ENTER (Apple is COMMAND+SHIFT+ENTER). They’ll appear on their own,” I answered, continuing, “So what this formula is saying is that so long as the two customers in the matrix don’t match, I want to have a weight on the edge of the graph. And I want that weight to be equal to the sumproduct of their two deal vectors divided through by the product of the square roots of the number of deals they took. It’s just the cosine similarity calculation in an Excel formula.”
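As a cross-check on the array formula, here is a minimal Python sketch of the same edge-weight calculation. For 0/1 deal vectors, cosine similarity is the number of shared deals divided by the product of the square roots of each customer's deal count. The customer names and purchase vectors are invented for illustration.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity of two 0/1 deal vectors."""
    shared = sum(x * y for x, y in zip(a, b))
    # For 0/1 vectors, sum(v) equals the sum of squares, so sqrt(sum(v)) is the norm.
    norm = sqrt(sum(a)) * sqrt(sum(b))
    return shared / norm if norm else 0.0

def edge_weights(purchases):
    """Customer x customer weights with zeros on the diagonal."""
    names = list(purchases)
    return {
        r: {c: 0.0 if r == c else cosine_similarity(purchases[r], purchases[c])
            for c in names}
        for r in names
    }

# Hypothetical customers, one 0/1 flag per deal offered.
purchases = {
    "cust_a": [1, 1, 0, 0],
    "cust_b": [1, 1, 1, 0],
    "cust_c": [0, 0, 0, 1],
}
weights = edge_weights(purchases)
```

The `r == c` check plays the role of `ROW()<>COLUMN()` in the spreadsheet: a customer gets no edge to themselves.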

I dragged the formula from the first cell in the table to the rest of the table, and the sheet looked like this:

“So what do we do with this?” asked Victor.

“Well, we’ve got a whole bunch of edges here, but really all we care about are the most significant relationships. For each customer, who are their nearest neighbors?” I said, “So I’m going to create what’s called a trimmed nearest neighbors graph out of this, and then things are going to get interesting.”

“The first step,” I continued, “Is that I’m going to calculate the top 3-ish nearest neighbors for each customer here by first calculating the 97th percentile weight in each column.”

I added a row at the bottom of the sheet with the formula:

=PERCENTILE(B2:B101,0.97)

and dragged it across. The result looked like this:

I then created a new tab in the workbook called ‘3NN’ where I pasted the same customer X customer matrix, except this time I filled the cells with the following formula:

=IF(EdgeWeights!B2>=EdgeWeights!B$102,EdgeWeights!B2,"")

“So you’re only copying over those edges that are the most important?” asked Victor.

“That’s right,” I said, “And now we’re ready to get community detecting. For that, we need to leave Excel behind temporarily and move over to Gephi.”
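The trimming step can be sketched outside Excel too. The helper below mirrors the linear interpolation that Excel's PERCENTILE uses, and `trim_columns` keeps only the weights at or above each column's cutoff, leaving `None` where the spreadsheet leaves a blank. The function names and the small 4 x 4 matrix are made up for illustration, and the example uses the 75th percentile only because the matrix is tiny.

```python
def percentile_inc(values, p):
    """Percentile with linear interpolation over the inclusive range."""
    ordered = sorted(values)
    pos = p * (len(ordered) - 1)
    lower = int(pos)
    frac = pos - lower
    if lower + 1 >= len(ordered):
        return ordered[-1]
    return ordered[lower] + frac * (ordered[lower + 1] - ordered[lower])

def trim_columns(matrix, p=0.97):
    """Keep entries at or above their column's p-th percentile; blank the rest."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    cutoffs = [percentile_inc([matrix[r][c] for r in range(n_rows)], p)
               for c in range(n_cols)]
    return [[matrix[r][c] if matrix[r][c] >= cutoffs[c] else None
             for c in range(n_cols)]
            for r in range(n_rows)]

# Hypothetical edge weights for four customers.
edge = [
    [0.0, 0.9, 0.1, 0.2],
    [0.9, 0.0, 0.3, 0.4],
    [0.1, 0.3, 0.0, 0.8],
    [0.2, 0.4, 0.8, 0.0],
]
trimmed = trim_columns(edge, p=0.75)
```

With 100 customers and `p=0.97`, each column keeps roughly its top three weights, which is where the "3-ish" comes from.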

I fired up a browser and downloaded a free copy of Gephi from https://gephi.org/.

“What is this Gephi?” asked Victor.

“It’s a graphing software,” I said, “Basically it’s going to visualize this data for us, create pretty pictures and stuff.”

I installed the software, and saved the 3NN tab to a csv file.

“So you just load the data from the 3NN tab into Gephi?” asked Victor.

“Not quite,” I said, opening up a text editor on Victor’s machine, “I need to replace all the commas in the csv export with semi-colons first.”

I did the find-replace and saved off the file. Then I was ready to open it in Gephi. I navigated to File -> Open and selected the csv file I’d created from the window. The import screen popped up, and I made sure Gephi knew that the graph I was working with was undirected and then pressed OK.
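If the trimmed matrix already lives in code, the comma-to-semicolon step can be skipped by writing semicolons directly. This is only a sketch using Python's standard `csv` module; the labels and values are invented, and whether a given Gephi version needs this exact layout is an assumption based on the find-replace described above.

```python
import csv
import io

def to_semicolon_csv(header, rows):
    """Write a labeled matrix as semicolon-delimited text; None becomes an empty cell."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=";", lineterminator="\n")
    writer.writerow([""] + header)          # blank corner cell, then column labels
    for label, row in zip(header, rows):
        writer.writerow([label] + ["" if v is None else v for v in row])
    return buf.getvalue()

# Hypothetical two-customer matrix.
text = to_semicolon_csv(["cust_a", "cust_b"], [[None, 0.8], [0.8, None]])
```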

“Why did you select ‘undirected’ from the graph type menu?” asked Victor.

“Because customer edges in this case go both ways. If you’re connected to me, because you like the deals that I like, then I’m connected to you for the same reason,” I said.

A jumble of points popped up on the screen:

“So there’s the graph,” I said.

“It looks like shit,” said Victor with a sigh, “There’s nothing there.”

I laughed a little, “Just hang on. Don’t be impatient. We need to use a layout algorithm to pretty it up. I’m going to select ‘ForceAtlas 2’ from the layout menu over here on the bottom left and hit play.”

Immediately the graph snapped into shape and I pressed stop on the algorithm.

“Now what do you think?” I asked him.

“Looks better,” he said, “So there are two communities?”

“Mmm,” I shrugged, “Let’s find out. I’m going to press the modularity button over on the right side of the screen first.”

“That’s going to run a modularity calculation on the network,” I said and pressed the button. When the pop-up asked me whether or not to randomize the calculation, I clicked OK.

Instantaneously, a window with the results of the calculation popped up:

“It found 6 communities,” I said, “Let’s color them on the graph.”

I navigated to the partition section of the window in the top left corner, refreshed the drop down menu and picked modularity class. I hit the apply button and the six communities popped up in color on the graph.

“Ah ha,” said Victor, “So what do they mean?”

I smiled, “Well, Gephi can kinda suck for giving us that kind of insight sometimes. Depends on the graph and what you’re trying to do. But let’s dump these communities back into Excel and do some more analysis. I think we’re getting close to some insight here.”

I clicked on the Data Laboratory tab at the top of Gephi and exported each label (customer) with its assigned modularity class (community) to a csv, like so:

Next I opened this data back up in Excel and copy-pasted it back into the original workbook we had open in a new tab called ‘Communities’.

I then inserted a new top row in the ‘Inventory’ sheet and used a vlookup to assign the customer names there to their respective communities.

I created a new tab called ‘TopDealsByCommunity’ and pasted all the deals (columns A:F) from the ‘Inventory’ tab into the new sheet. I then added communities 0 thru 5 as headers for columns G thru L.

“What are you doing?” asked Victor.

“The way we can tell what each community means is to figure out which deals were the most important for each community,” I replied, “So what I want to do is for each deal here I want to go back to the ‘Inventory’ tab and sum up all the ones for the customers in my group.”

I filled in the grid by dragging this formula around to all the cells:

=SUMIF(Inventory!$G$1:$DB$1,"=0",Inventory!$G3:$DB3)
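The same per-community tally can be sketched in Python: for every deal, count how many customers in each community took it, then sort the deals by one community's count. The customers, community labels, and deals here are invented for illustration.

```python
def deals_by_community(takes, community_of):
    """For each deal, count the 1s per community (the SUMIF across customer columns)."""
    communities = sorted(set(community_of.values()))
    table = []
    for deal_row in takes:                      # deal_row maps customer -> 0/1
        counts = {c: 0 for c in communities}
        for customer, took in deal_row.items():
            counts[community_of[customer]] += took
        table.append(counts)
    return table

# Hypothetical community assignments and purchase flags.
community_of = {"cust_a": 0, "cust_b": 0, "cust_c": 1}
takes = [
    {"cust_a": 1, "cust_b": 1, "cust_c": 0},   # deal 0
    {"cust_a": 0, "cust_b": 0, "cust_c": 1},   # deal 1
]
table = deals_by_community(takes, community_of)

# Deal indexes sorted by popularity within community 0, most popular first.
ranked_for_0 = sorted(range(len(takes)), key=lambda i: table[i][0], reverse=True)
```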

I added some conditional formatting to get a sheet like this:

“Now all I need to do is sort,” I said, “Let’s start with Community 0.”

I sorted, and Victor leaned in close to examine the values:

Victor began to laugh so loudly that a few folks in the coffee shop turned their heads.

“They are my dirt heads,” he said, pointing to the screen, “Look, they only order weed and shrooms. They’re my crunchy people.”

I nodded, “Absolutely. Let’s try Community 1.”

“Hmmm,” said Victor as he studied the data, “They buy all sorts of products.”

He sat for a moment and then his eyebrows raised, “But look at the volumes. They only buy the large quantities.”

“Yeah,” I said, “And look at the ready for use column. It’s always true.”

“Ah,” he said, “That makes sense. These are the folks who make money by moving large volume fast. They don’t want prep, and they don’t deal in piddly shit. They are my Walmarts.”

“Yeah,” I said, “Let’s check out number 5 here real quick and then I’ll leave the rest for you to interpret.”

“My patriots,” Victor said smiling.

“Made in the U.S.A. baby,” I allowed myself a laugh.

Victor nodded, “My stateside producers do make some very good products.”

He looked over at me, “Thank you, Alex. This has been most insightful. Very cool stuff.”

He reached into his laptop bag and pushed a wad of bills into my hand. He clasped my hand in both of his and shook it.

“You do not understand how refreshing it is to talk to someone like you. I like playing with data. It is a relief from the other demands of my position. Less messy,” he said.

He took a final swig from his cup of espresso and crushed the paper cup in his hand. The thought flitted across my mind momentarily that I had no desire to find out exactly what ‘messy’ meant.

Obviously, you’re going to want to move out of Excel for large datasets. A database can provide a sparse representation of this data, because you don’t actually need to save any records of when someone doesn’t take a deal. You only need the actions. Furthermore, the edge weight calculation is very parallelize-able / map-reduce-able.

If you had a big dataset that you prepped into a trimmed nearest neighbors graph, keep in mind that visualizing it in Gephi is just for fun. It’s not necessary for actual insight regardless of what the scads of presentations of tweets-spreading-as-visualized-in-Gephi might tell you (gag me). You just need to do the community detection piece. You can use Gephi for that or the libraries it uses. R and Python both have a package called igraph that does this stuff too. Whatever you use, you just need to get community assignments out of your large dataset so that you can run things like the aggregate analysis over them to bubble up intelligence about each group.
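To make "modularity" a little less of a black box, here is a small sketch that scores a partition you hand it, using the standard definition of modularity for an undirected weighted graph. It does not search for the best partition the way a community detection tool does. The function name and the two-triangle toy graph are invented for illustration.

```python
def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected weighted graph.

    edges: list of (u, v, weight), each undirected edge listed once.
    Q = sum over communities of internal_weight / m - (degree_sum / (2m)) ** 2,
    where m is the total edge weight.
    """
    m = sum(w for _, _, w in edges)
    internal, degree = {}, {}
    for u, v, w in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] = degree.get(cu, 0.0) + w    # each edge adds to both endpoints
        degree[cv] = degree.get(cv, 0.0) + w
        if cu == cv:
            internal[cu] = internal.get(cu, 0.0) + w
    return sum(internal.get(c, 0.0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())

# Toy graph: two triangles joined by a single bridge edge.
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
         (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0),
         (2, 3, 1.0)]
split = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
lumped = {node: 0 for node in range(6)}
q_split = modularity(edges, split)
q_lumped = modularity(edges, lumped)
```

Splitting the toy graph at the bridge scores higher than lumping everything together, which is the kind of difference a modularity-based detector is chasing.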

On another note, in the problem above I treated all deals equally, but let’s say your problem was such that there were types of deals that everyone took, and they drowned out insight from less popular deals. You could alter the ‘Inventory’ tab to have a 1/sqrt(number of people who took deal) in each cell instead of just a 1. That way, two people who took an unpopular deal have more in common than two people who took a popular one. When you’d do the aggregations / ranking out the other side of the community detection, similarly you’d want to use these modified points.
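A quick sketch of that reweighting, assuming the purchase data is held as one list of 0/1 flags per deal. The function name and the two example deals are hypothetical.

```python
from math import sqrt

def downweight_popular_deals(takes):
    """Replace each 1 with 1/sqrt(number of customers who took that deal)."""
    weighted = []
    for deal_row in takes:                  # one list of 0/1 flags per deal
        takers = sum(deal_row)
        scale = 1 / sqrt(takers) if takers else 0.0
        weighted.append([flag * scale for flag in deal_row])
    return weighted

# Hypothetical deals across four customers.
takes = [
    [1, 1, 1, 1],   # popular deal: everyone took it
    [1, 0, 0, 0],   # unpopular deal: a single taker
]
weighted = downweight_popular_deals(takes)
```

Each taker of the four-person deal now contributes 0.5 instead of 1, while the lone taker of the unpopular deal keeps a full point.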


When I left Victor that first time, my head was spinning. I had just done a drug dealer’s math homework. I felt the ethical weight of what I’d done for a split second and then I realized…if I hadn’t optimized his coke blending operations for him, he may have made a mistake and blended it himself outside his specs. And that could have been dangerous. So really I’m doing people a favor!

With that half-assed rationalizing of my nefarious deeds, I went to bed completely self-satisfied. I’d helped humanity and gotten free coke to boot! Well, pretty soon my stash dried up again, and I rang Andre to hook me up with some more.

“You wanna pay with your wallet or your mind?” he asked me over his cell.

Ugh, I thought, what a cheesy question.

“Like last time?” I asked.

“Yeuh, like that,” he answered. He must have had all the windows down on the interstate, because his voice was coming over the line like he was in a wind tunnel.

An hour later I found myself once again getting dropped off not in Hapeville where I’d met Victor last time but in Buckhead outside the JW.

“Room 1504,” was all Andre said, and he sped off in his Escalade.

I made my way through the hotel lobby. I felt the eyes of the desk clerk on me and for a split second panicked that maybe I was walking into a sting or something. How many years does a guy get for helping a drug dealer out with the numbers, I wondered. But the desk clerk busied himself again with the computer in front of him, and I took the elevator up to the 15th floor. The JW was nicely decorated, although a little stuffy for my taste. The air in the hall was a bit humid, and its thickness seemed to play off my nervous breathing. I’d begun to sweat bullets before I even knocked on 1504.

I didn’t have to knock though. A man from the kitchen was dropping off a bottle of champagne, and when I poked my head through the door, I found Victor sitting in a bathrobe at a desk. He had a laptop open next to him, and when he reached over to take the bottle that had been left for him, he looked up and found me standing in front of him.

He gave me a quizzical look.

“The door was open,” I sputtered a little nervously as the staff sidled out of the room.

“I would wait and knock next time,” he said with a shrug, “It’s best not to surprise me. I’m jumpy at times.”

He stood and grabbed an arm chair that was pushed into the corner of the room. He dragged it over to the desk next to where he’d been sitting.

“Come join me here, and let’s talk this through,” he said, “Would you like something to drink?”

“Nah,” I said, “Long night last night. I wanna try to keep as much a clear head as possible.”

Victor smiled, “This is a good idea. What use are you to me all fuzzy?”

His toothy smile unnerved me. I didn’t like the thought of being useless to a guy like Victor. What had I gotten myself into?

“So what’s on the agenda for today?” I asked. I took a seat in the over-stuffed armchair next to Victor, and he turned his computer screen my way. He had a simple Excel spreadsheet open.

“Mr. Sheffield, I run a simple operation. I try to keep the number of non-contract employees on my books down to as few as possible,” he said, “You see, in the past, I’ve found a direct relationship between the number of employees on my books and the amount of heat coming round the corner.”

I nodded, “I can imagine that you’d want to keep the number of folks who know about all this to a minimum.”

“Yes, and so I’ve been pricing my products myself,” he said, “Sensitive things like finances, I don’t want many to know about them. But the time has come where I can’t eyeball each of my products in each city and set prices. It’s too hard.”

“Why not let your dealers set them?” I asked.

Victor smiled, “My dealers are not properly incentivized to return to me the maximum amount of revenue. Furthermore, the drug trade does not employ the most strategically minded individuals. Courageous, perhaps, but not intelligent.”

“So you want a quick and dirty way to set prices on each of your products in every market yourself?” I asked.

Victor gave me another Cheshire grin, “Exactly, quick and dirty. And the data I have mostly looks like this. This is a set of monthly numbers for my meth dealers in Flint, Michigan.”

He pushed his computer screen a little my way.

I smiled, “I hear Flint’s nice this time of year.”

Victor nodded, “Flint is shit every time of year. That’s why I do so well there.”

I took a look at the spreadsheet. It was pretty straightforward.

“OK, so I’m seeing monthly demand, pricing, and revenue data,” I said, “And you’ve even got your competitor’s prices in there. Nice.”

“Meth heads shop around,” said Victor.

I laughed, but Victor gave me a serious look.

“I’m not joking. They’re broke as hell, and they’re more than willing to tell you when the guy down the street’s cheaper. Of course, they also care about quality, and like my coke, my meth is *at least* 50% more pure than the competition,” Victor trailed off into thought briefly, “But up until now I’ve just been doing a straight 30% markup over cost. I know I can go higher than that.”

I nodded, “Yeah, so you’re bumping up against the fact that price is a strategic variable, but when you do cost-based pricing, you’re losing that strategic edge. Cost has nothing to do with willingness to pay.”

“Right,” said Victor, “That’s exactly what I wanted to say. So how do I fix that?”

“Quick and dirty?” I asked.

“Quick and dirty,” he repeated.

“Well, this is a revenue management problem, and since you’re already shopping competitors’ rates, what we want to do is build a price optimization model that responds both to your own pricing as well as competitor pricing. It’s called a price ratio model,” I said.

Victor looked flummoxed, “I thought I said quick and dirty.”

I laughed, “Trust me. This is going to take about two minutes.”

“The first thing we want to do,” I continued, “is calculate a price ratio column in this sheet. Price ratio is just your own price divided by the average price of your competition.”

I added the column for him at the end of the table:

“Now what?” Victor asked.

“Well, let’s graph price ratio versus demand just to get a sense of your customers’ price elasticity,” I said.

“Their what?” asked Victor.

“Price elasticity is just how demand changes with price changes,” I answered and pulled up a graph in Excel that looked like this:

We took a look at the graph together.

“Pretty good fit,” said Victor.

“Yeah, not bad for the real world,” I nodded, “And what we have is a fitted line that gives us the relationship between demand and price ratio.”

“So now we just need to pick the ratio that gives us the most demand?” asked Victor.

“Well, demand isn’t what you’re interested in but rather revenue,” I answered and broke out a sheet of paper.

On it I wrote these equations:

“Ugh,” said Victor with a sigh.

“Just stay with me,” I leaned forward and pointed to the graph in Excel, “This function on the graph is the same as the first one I have, only I’ve put in our own price and the competitor’s average price as two separate values in a fraction p-o over p-c. You see that?”

“Yeah, so this is demand as a function of my price and my competitor’s price?” Victor asked.

“Yes, although for the moment, let’s assume that your competitor’s price is fixed, ok? Whatever p-c is, let’s assume it’s stuck there,” I answered, “That way all we need to worry about is moving our price around their price.”

Victor nodded.

“The second function here, all I’ve done is multiplied the demand function through with your own price. So now it’s a revenue function. And the first thing I notice about this revenue function is that it’s a quadratic,” I said, “In other words, it’s raised to the power of two.”

“Yeah, I see that,” said Victor.

“Furthermore, if I look at the acceleration of this function, its second derivative in calculus, I can see that it’s negative,” I said and jotted down the function.

“That means that the function has exactly one optimal price where revenue is maximized,” I said. I looked at Victor, and he seemed confused, so I drew a picture:

“The revenue is 0 when I give away stuff for free, then it goes up, then I get too expensive, and it goes back down,” he said.

“Exactly, and see how there’s only one optimum where the smiley face is? That’s the sweet spot we need to figure out,” I said.

“And how do we do that?” he asked.

“Well, if this were a baseball we were throwing straight up in the air, it’s the place where the speed of that baseball is zero. That’s the apex. That’s as high as the ball is gonna go,” I said, “To find that, we take the speed of our revenue function, also called the first derivative, and find out where it’s zero.”

I jotted down a couple more equations:

I spoke as I wrote, “So here I’ve taken the first derivative, which you’ll remember from high school calculus. Then I’ve set it equal to zero to find the apex. Moving terms across and dividing through, I get that my optimal price is simply 8% higher than that of my competitors.”
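The handwritten equations are not reproduced here, so the following is only a generic reconstruction of the steps described in the dialogue. It assumes the fitted line has slope $a$ and intercept $b$ (both placeholder symbols, with $a < 0$ because demand falls as the ratio rises), and it writes the "p-o over p-c" fraction as $p_o / p_c$.

```latex
\begin{align*}
D(p_o) &= a\,\frac{p_o}{p_c} + b, \qquad a < 0
  && \text{(demand from the fitted line)} \\
R(p_o) &= p_o\,D(p_o) = \frac{a}{p_c}\,p_o^{2} + b\,p_o
  && \text{(revenue, a quadratic in } p_o\text{)} \\
R''(p_o) &= \frac{2a}{p_c} < 0
  && \text{(so there is a single maximum)} \\
R'(p_o) &= \frac{2a}{p_c}\,p_o + b = 0
  \;\Longrightarrow\; p_o^{*} = -\frac{b}{2a}\,p_c
  && \text{(the apex)}
\end{align*}
```

Under these assumptions, an optimal price 8% above the competition corresponds to fitted coefficients with $-b/(2a) = 1.08$.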

I looked over at Victor. He was smiling.

“8% higher eh?” he said.

“8%,” I nodded.

“Sounds like I need to brush up on taking derivatives,” he said with a shrug.

“Sure, but they’re not that hard, and there’s all sorts of sites on the internet that will take them for you,” I said, “And the cool part is, you can just keep a running log of this data and readjust your pricing each month as price elasticity changes. When the economy bounces back, maybe your customers will be less price sensitive, and you can start gouging a little.”
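The monthly refit could be scripted along these lines: fit a straight line of demand against price ratio by ordinary least squares, then read the revenue-maximizing ratio off the coefficients. The numbers below are invented and were deliberately chosen so the answer lands on 1.08; they are not Victor's data, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x

def optimal_price_ratio(a, b):
    """Ratio p_o / p_c maximizing revenue p_o * (a * p_o / p_c + b); needs a < 0."""
    if a >= 0:
        raise ValueError("demand must fall as the price ratio rises")
    return -b / (2 * a)

# Hypothetical monthly log: own price, competitor average price, kilos sold.
own = [100.0, 110.0, 120.0, 130.0]
comp = [100.0, 100.0, 100.0, 100.0]
demand = [116.0, 106.0, 96.0, 86.0]

ratios = [o / c for o, c in zip(own, comp)]
a, b = fit_line(ratios, demand)
best_ratio = optimal_price_ratio(a, b)   # multiply by the competitor price to get p_o
```

Appending each new month to the three lists and rerunning the fit is all the "running log" amounts to in this sketch.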

“Don’t count on it,” Victor said, and he slapped me on the back really freaking hard. Reaching up, he grabbed the bottle of champagne and poured himself a glass. He turned around in his chair and reached into the inner breast pocket of a blazer that he’d draped across the chair back. From the blazer he pulled a wad of cash held together with a rubber band and tossed it to me.

I caught it.

“What’s this?” I asked, gently weighing the wad in my hand.

“That’s five thousand dollars,” he said.

I was a bit dumbfounded, “For what?”

“For being so helpful. There’s a future for you here if you want it,” he said.

Later that night I lay in bed holding the wad of cash. I still hadn’t bothered to count it. Five grand for thirty minutes of work, I thought. That’s gotta be the easiest money I ever made. There was a voice in my head telling me to go ahead and get out of town. Quit helping Victor. Get clean and start fresh. But my excitement and adrenaline drowned the voice out. This was a sweet gig.

Revenue Management is one of the original ‘Big Data’ pursuits before big data was even a twinkle in your mama’s eye. But it’s been confined to enterprises like airlines, car rental companies, and hotels implementing crap like Teradata and Oracle. This field has not been democratized, so it’s kind of gone unnoticed in the big data conversation. But if a system that takes company-wide hotel demand data, prices, competitor prices, yada yada, makes forecasts, and optimizes future prices based on inventory and demand isn’t more big data-y than some silly recommendation engine put out by some bland mobiley/socialy app, then I don’t know what is.
