Introduction to Machine Learning: Logistic Regression for Predicting Bad Trips

NOTE: This one is pretty complex, so I’d highly recommend following along in the spreadsheet for this post. And don’t forget to sign up for the newsletter to hear about future Analytics Made Skeezy tutorials.

“Getting some early morning air?” Andre asked as I returned to the house from my meeting with Agent Bestroff.

Was he suspicious? I don’t know. I couldn’t tell because I’d never paid attention before. I’d never had to worry about Andre perceiving me as some kind of traitor.

“Rough night,” I said, “Don’t ask.”

Andre cracked a smile, “I hear you. Now go do your make-up, because Victor wants to talk.”

“Geez,” I muttered under my breath.

“Huh?” asked Andre.

My palms were sweaty, and I fingered the USB drive in my pocket that the DEA agent had given me. I hadn’t even had time to collect my thoughts about this, and it was already happening.

“Can this wait?” I asked.

“No, son,” Andre said, “It can’t. And you get paid for it to not, so let’s get moving.”

Great, I thought. I’m dead. I’m so dead. That thought continued right up until I sat down with Victor half an hour later.

We met at the restaurant at the Druid Hills Golf Club. How Victor had become a member at this stodgy hellhole was beyond me. But there he was digging into a plate of huevos rancheros surrounded by a bunch of people whose lives seemed to consist of banalities like Hilton Head, Botox, college football, and marital affairs. Not that I don’t love me some college football, but it was apparent that these people had no idea who they shared their dining room with.

“Alex, my boy,” Victor said with a huge smile, “So glad Andre found you. He’d been waiting at your house for some time. Where were you?”

Did I see a hint of suspicion in Victor’s eyes?

“I was already up,” I shrugged, “Decided to take a walk around town.”

He stared at me for a moment, deep in thought. Then suddenly he crumpled up the cloth napkin in his lap and tossed it onto his plate before pushing his food aside.

“Well, let’s get started then,” he said, “No time like the present.”

He pulled out his laptop and some scrap paper from his briefcase and set it on the dining room table. You had to admire the man’s audacity. There seemed to be no place he wouldn’t talk about dealing drugs.

“What’s on the agenda for today?” I asked.

“Acid,” said Victor, “Specifically, bad trips.”

“What about bad trips?” I asked.

“I make a stellar product,” Victor replied, “I pride myself on the quality of all my products; my LSD is some of the best out there. The problem is that it’s very potent. If it could travel back in time, this acid would have killed all the hippies.”

The man started laughing loudly at the thought, and some old folks in the dining room turned to look.

Victor held up a hand apologetically and stifled his laughter, “Sorry. Sorry. Anyway, yes, the people who love my product, really love it. I get plenty of repeat buyers.”

“But?” I asked.

“There seems to be a certain set of people who can’t handle it,” he said, “They can’t take it, and they have really bad trips. I don’t want to water my product down just to make something mediocre that everyone will kind of like. But if I knew who was going to have a bad trip before they had it, then I could steer them toward the shrooms instead or something.”

I nodded, “Got any data on who these people are? The bad trippers versus the good trippers?”

Victor gave one of his Cheshire grins, “As a matter of fact, I do. I had my dealers write down a few things about their customers. If the customer let them know it was a bad trip or if they were a repeat customer but never came back after buying acid, then we logged the sale as a bad trip. If the customer came back and said they had a good time, then we logged the trip as good.”

He fired up Excel and turned a spreadsheet toward me. The data looked like this:

“So 1 is a bad trip and 0 is good trip?” I asked. Victor nodded yes.

“This is a really odd set of features,” I said staring at the data, “Some physical descriptors, the customer’s product preference, the phone they carried. I can’t see how their phone matters. Is 1 male or female?”

Victor smiled, “1 is male. And with attire, 10 is a full suit and nice shoes, while 1 is dirty, homeless clothing. It’s very subjective.”

“I thought I could run a linear regression on the numeric values,” he added, “but I wasn’t sure what to do with the last two columns. They’re not numeric.”

“Well, first off, linear regression is totally the wrong way to go, because your response variable, i.e. the thing you’re trying to predict, is binary, not continuous. You want to predict things that are either a good trip or a bad trip, 0 or 1, but if you just plot a trend line through your data, you’re going to get predictions above and below that,” I said, “What you need is to do some logistic regression, which is going to take your data and spit out a probability value between 0 and 1 where the closer you are to 1, the more likely you are to have a bad trip.”

“OK, so logistic regression is something I use when I want to predict the answer to a ‘yes or no’ question?” Victor asked.

“Exactly,” I said, “Will my customer buy? Will they use the coupon I gave them? Is my supplier a cop in disguise? Will they give my meth a five star review on Yelp? All those questions could be answered with logistic regression.”

“But before we do that,” I continued, “the first step is to transform these favorite drug and cell phone columns from categorical data to dummy variables.”

“What’s a dummy variable?” asked Victor.

“Well,” I said, “Rather than have one column for all your favorite drugs, we’re going to make one column for each drug, and if your favorite drug is meth then the meth column gets a 1 and the rest get 0s.”

Victor nodded, “So you’re taking each option and giving it its own indicator column?”

“Yep,” I answered, “Except that since you’ve got 7 drug options, we only need columns for 6 of them. If they’re all 0, then we’ll know that the 7th, absent column is a 1. That’ll keep us from encoding any redundancy into our new data.”

I copied the data over into the new format in a tab called “CategoricalFixed” where it now looked like this (Apologies for the width. Click on the picture or download the spreadsheet for this post to see it better):

“See how the categorical data is now blown out across these dummy columns?” I asked.

“Yes, I see,” said Victor, “And if I were to have 0s all the way across for my phone columns, that means my phone was an Android phone in the original data since that’s the absent column?”

I smiled, “Exactly.”
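For readers following along outside the spreadsheet, the dummy-variable step can be sketched in a few lines of Python. This is only an illustration: the function name `dummy_encode` is made up, and the list of seven drugs is inferred from the categories mentioned in the post and comments, with weed as the dropped baseline column.

```python
# Sketch of dummy (indicator) encoding with one category dropped.
# ASSUMPTION: the category names below are inferred from the post and its
# comments; the real spreadsheet columns may be labeled differently.
DRUGS = ["weed", "x", "meth", "coke", "ketamine", "lsd", "shrooms"]


def dummy_encode(value, categories, baseline):
    """Return one 0/1 indicator per category, skipping the baseline."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return {c: int(c == value) for c in categories if c != baseline}


# A meth fan gets a 1 in the meth column and 0s elsewhere.
meth_row = dummy_encode("meth", DRUGS, baseline="weed")

# A weed fan gets 0s all the way across: the absent column is implied.
weed_row = dummy_encode("weed", DRUGS, baseline="weed")

print(meth_row)
print(weed_row)
```

Seven options become six columns, and an all-zero row stands for the baseline, which is the same trick Victor describes for Android in the phone columns.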

“Ok, so what do we do with this new data?” asked Victor.

“Well, let’s talk about it as un-technically as possible. For some future customer, I want to gather data just like this and combine it in some way, run it through some function, such that out comes a number between 0 and 1, right?” I asked.

“Correct,” said Victor.

“And what we have here, in machine learning speak, is ‘training data,’” I said, “We’re going to train an artificial intelligence model on this data, and by ‘model,’ all I mean is a function that combines a row of data like this and spits out a value between 0 and 1.”

I added a row at the top of my sheet that had all the same columns as below, except I added a column called ‘constant’ to the end of it.

“This row here is going to give the coefficients that we’re going to multiply by the data you’ll gather from a customer. If this were a linear regression, you’d just fit it such that the sum product of this row with the gathered data for the customer would give you a prediction.”

I jotted the equation down on a sheet of paper:

“That sumproduct looks like this in mathematical notation, where the x values are what we gather from the customer and the b values are what I’ve got in this green coefficient row,” I said, “I’m just going to call that sumproduct a ‘logit,’ but it kinda looks like what we’d get out of a linear regression, right?”
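The scrap-paper equation is not reproduced in this text, so here is a reconstruction from Alex’s description. It assumes n feature columns, with x values gathered from the customer, b values taken from the coefficient row, and the extra ‘constant’ column contributing a stand-alone term:

```latex
\text{logit} = b_1 x_1 + b_2 x_2 + \dots + b_n x_n + b_{\text{constant}}
```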

“Yes. But we’re not doing a linear regression,” said Victor.

“Exactly,” I said, “So instead we’re going to take this simple equation I’m calling the logit, and we’re going to transform it into a value that’s guaranteed to be between 0 and 1.”

“How?” asked Victor.

“Well, consider this equation,” I said and jotted down a fraction:

“That e is just the mathematical constant used as the base of the natural logarithm. It’s approximately 2.72, and you can use it in Excel via the EXP() function,” I said.
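The fraction itself is missing from this text. Based on how Alex and Victor work through it in the next few lines (e raised to the logit, divided by one plus e raised to the logit), it is presumably the standard logistic function, written here with p as a label chosen for the resulting probability:

```latex
p = \frac{e^{\text{logit}}}{1 + e^{\text{logit}}}
```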

“Ok,” said Victor, “So why is this always between 0 and 1?”

“Well, ok, let’s start with the coefficients I currently have in the workbook. All 0s,” I said.

“If I use those coefficients in my logit, then the logit is equal to what?” I asked.

Victor thought a second, “No matter what data the customer gives, the logit would be 0 because all your multipliers are.”

I nodded, “Exactly. Which means that this second equation is e raised to the 0th power divided by one plus e raised to the zero, right?”

“Right,” said Victor, “Which is just one divided by two. It’s just .5.”

“Exactly,” I said, “And what if my coefficients were something else? What if they made my logit a negative number instead of 0?”

“Hmmm,” said Victor, “Well if the logit were, let’s say, -2, then e to the -2 is 1 divided by e squared, which is something like 1 divided by 7 or 8. So that’d give us something close to an eighth divided by one plus an eighth, which is approximately one ninth.”

“Yeah, it’s about .12. So already we’ve fallen well below the .5 value we got from a logit of 0. And what if instead of -2 I had a larger negative number, like -100?” I asked.

“Well, e to the -100 would be a very, very small number. So the whole calculation would give a small number over 1 plus that same small number,” Victor said, “So the whole calculation is getting closer and closer to zero.”

“Right,” I added, “And if we go the other way, if the logit is big and positive, we approach 1. We’d get a big ass number divided by a big ass number plus 1, which is going to end up as .999 repeating. Since our coefficients are going to end up being set to actual numbers, we’ll never hit a solid 0 or a solid 1, but we can get arbitrarily close.”

I jotted down some examples on paper:
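The same arithmetic is easy to check in code. The sketch below is not from the spreadsheet; it just evaluates the e^logit / (1 + e^logit) fraction for the logit values discussed above, using a function name (`probability`) picked for illustration.

```python
import math


def probability(logit):
    """Squash any logit into the open interval (0, 1)."""
    # Direct form of e^logit / (1 + e^logit). math.exp overflows for
    # logits above roughly 709, so this is only meant for modest values.
    return math.exp(logit) / (1 + math.exp(logit))


# A logit of 0 gives exactly 0.5; -2 gives about 0.12; -100 is nearly 0;
# large positive logits creep toward 1 without reaching it.
for logit in (0, -2, -100, 2, 20):
    print(logit, probability(logit))
```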

“So no matter what value the logit takes,” said Victor, “We’ve got a value between 0 and 1. But how do we set this green row of coefficients such that the value between 0 and 1 we get for the customer’s input data is correct?”

“Ah ha,” I said, “This is where we train the model.”

I added a column into the sheet for the logit for each row of training data the dealers had gathered:

I then added the probability calculation next to each logit:

“Now, all we need to do is find the coefficients that make the probability column as close to the ‘Bad?’ column as possible. If we can find coefficients that give a 0 probability for the people we know had good trips and a 1 for the people who had bad trips in our training data, then bam, we have our model,” I said.

“So how do we find those coefficients?” asked Victor, “Trial and error?”

I laughed, “Oh lord. That’d take an eternity. No, we’re going to use Excel Solver just like we did some weeks ago.”

Victor nodded, “But to use Solver, don’t we need an objective, just like how cocaine cost was our objective in the other problem?”

“Right,” I said, “So this time around, we want our probabilities on our training data to be very near our actual values. So consider this value:

“What happens when I had a bad trip and the model predicted I would have one? In other words, what happens when ‘bad?’ is a 1 and ‘probability’ is super high, like .99?” I asked.

Victor stared at the equation a moment, “Well, you get .99^1*.01^0 which is more or less 1.”

“And if I had a bad trip but the model completely whiffs and says I’ll have a good one?”

“You’d get .01^1 *.99^0,” Victor said, “Which is more or less 0.”

“So then if we calculate one of these values for each row of our training data and sum them up, all we have to do is maximize the value of their sum by changing the coefficients around. If we get a lot of scores of 1, then we’ve got a pretty good fit on the training data, don’t we?” I asked.
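The score formula is not shown in this text, but Victor’s arithmetic (.99^1 * .01^0) implies it is the probability raised to the ‘bad?’ value, times one minus the probability raised to one minus the ‘bad?’ value. Under that assumption, a small sketch with a hypothetical `score` function looks like this:

```python
def score(bad, probability):
    """Per-row score: near 1 when the prediction matches the outcome."""
    # ASSUMPTION: reconstructed from the worked examples in the dialogue.
    return probability ** bad * (1 - probability) ** (1 - bad)


# Bad trip, and the model was confident it would be bad: score near 1.
hit = score(1, 0.99)

# Bad trip, but the model predicted a good one: score near 0.
miss = score(1, 0.01)

# Good trip, and the model predicted a good one: also near 1.
good_hit = score(0, 0.01)

# The quantity handed to Solver is just the sum of the per-row scores.
total = hit + miss + good_hit
print(hit, miss, good_hit, total)
```

Because ‘bad?’ is always 0 or 1, one of the two exponents is always zero, so the score reduces to whichever probability the model assigned to the outcome that actually happened.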

I added the score values to each row of the sheet, summed them up, and opened Solver.

I set Solver to maximize the sum of the scores, while changing my coefficient row.

Before I hit solve, I made sure I’d chosen the GRG Nonlinear solver option from the Solving Methods list.

“Why not use the simplex algorithm, like last time?” asked Victor.

“Because these exponential functions aren’t linear,” I said, “It’d barf. But the nonlinear solver will handle the problem nicely.”

I hit solve, and the algorithm set the coefficients.
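To make the training step concrete without Excel, here is a rough Python sketch of the same objective. It does not reproduce Solver’s GRG Nonlinear method; plain gradient ascent is swapped in as a simple stand-in. The eight rows of data (a rescaled attire score, a jittery flag, and a constant column of 1s) are made up for illustration and are not Victor’s data.

```python
import math

# ASSUMPTION: toy training data, invented for this sketch.
# Columns: attire rescaled to 0..1, jittery (0/1), constant (always 1).
ROWS = [
    [0.9, 1, 1], [0.8, 0, 1], [0.7, 1, 1], [0.4, 1, 1],
    [0.3, 0, 1], [0.2, 1, 1], [0.1, 0, 1], [0.6, 0, 1],
]
LABELS = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = bad trip, 0 = good trip


def logit(coefs, row):
    """Sumproduct of the coefficient row with one row of data."""
    return sum(b * x for b, x in zip(coefs, row))


def probability(z):
    """Logistic transform of a logit z."""
    return math.exp(z) / (1 + math.exp(z))


def total_score(coefs, rows, labels):
    """Sum of per-row scores, the quantity the post maximizes."""
    total = 0.0
    for row, bad in zip(rows, labels):
        p = probability(logit(coefs, row))
        total += p ** bad * (1 - p) ** (1 - bad)
    return total


def fit(rows, labels, steps=500, rate=0.1):
    """Climb the total score by gradient ascent, starting from all 0s."""
    coefs = [0.0] * len(rows[0])
    for _ in range(steps):
        grad = [0.0] * len(coefs)
        for row, bad in zip(rows, labels):
            p = probability(logit(coefs, row))
            # The row's score is p when bad == 1 and (1 - p) when bad == 0,
            # so its slope with respect to the logit is +/- p * (1 - p).
            slope = (2 * bad - 1) * p * (1 - p)
            for j, x in enumerate(row):
                grad[j] += slope * x
        coefs = [b + rate * g for b, g in zip(coefs, grad)]
    return coefs


start = [0.0, 0.0, 0.0]
fitted = fit(ROWS, LABELS)
print("score with all-zero coefficients:", total_score(start, ROWS, LABELS))
print("score after fitting:", total_score(fitted, ROWS, LABELS))
print("fitted coefficients:", fitted)
```

With all-zero coefficients every row scores 0.5, so the starting total is half the number of rows; any fitted set of coefficients worth keeping should beat that.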

“Bam!” I said and pointed to the top row.

“If the coefficient is positive, that’ll push the probability of a bad trip up, right? And if it’s negative, it’s pushing the probability down,” I said, “So what do you notice?”

Victor studied the coefficients for a moment, “Attire is very important. The nicer the customer dresses, the more likely they are to have a bad trip.”

“Yeah, same with jitteriness, while tattoos work the opposite direction,” I said, “And check out the favorite drug coefficients. X, meth, and coke all increase likelihood, while Ketamine, LSD, and shrooms all decrease likelihood.”

“And with phones, Blackberries are most likely to correlate with bad trips,” said Victor.

“I think what we’re seeing is that Type A individuals are most likely to have bad trips,” I said.

“Really?” asked Victor.

“Well, when I think about a suit-wearing, non-tattooed, jittery, Blackberry-using cokehead, I think Type A. When I think of a scruffy, laid-back person who likes Ketamine and uses an iPhone instead of a Blackberry, I think Type B,” I said.

Victor nodded his head side to side, “It’s not an airtight theory, but I see your point; people who are controlling might have difficulty with very strong acid.”

“OK, so the next thing we need to do is set up a calculator for future predictions,” I said, “For that we just take a new row of data through the same calculations with the coefficients we just found.”

I set up a calculator section in the spreadsheet for Victor and jammed in some made-up data:

“Let’s say we’ve got a 5’2″ dude with mediocre attire, no tattoos, and the jitters. He loves coke and is glued to his Blackberry,” I said, “In that case, the survey says…he’s a .76 so he’s more likely to have a bad trip than a good one.”
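A calculator like this is only a few lines in code. In the sketch below, the coefficients are hypothetical placeholders, not the values Solver found, so the output will not match the .76 from the spreadsheet. The 0.5 cutoff and the names `predict` and `recommend` are likewise assumptions made for illustration.

```python
import math

# ASSUMPTION: invented coefficients, NOT the fitted values from the post.
COEFS = {
    "attire": 0.3,
    "jittery": 1.2,
    "tattoos": -0.8,
    "coke": 0.9,
    "blackberry": 0.7,
    "constant": -2.5,
}


def predict(customer, coefs=COEFS):
    """Probability of a bad trip for one customer's feature values."""
    z = coefs["constant"] + sum(
        coefs[name] * value for name, value in customer.items()
    )
    return math.exp(z) / (1 + math.exp(z))


def recommend(customer, cutoff=0.5):
    """Turn the probability into a yes/no call using a chosen cutoff."""
    return "steer elsewhere" if predict(customer) > cutoff else "sell"


# Mediocre attire, jittery, no tattoos, loves coke, carries a Blackberry.
customer = {"attire": 5, "jittery": 1, "tattoos": 0, "coke": 1, "blackberry": 1}
print(predict(customer), recommend(customer))
```

Where to set the cutoff is a business decision rather than a modeling one: a lower cutoff turns away more customers who would have been fine, while a higher one lets more bad trips through.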

Victor smiled, “Neat! So I could turn this into a little iPhone app and have my dealers refuse a customer or sell them a diluted sample if they score too high.”

“Exactly,” I said, “You can use this simple AI model to better customize your product based on how you think your customers will react to it.”

“This is stellar my boy,” said Victor, “Thanks.”

I laughed and reached out a hand to shake Victor’s, half mockingly. As I moved my hand toward him, I knocked our scrap paper onto the floor.

“Whoops,” I said, and Victor raised a hand and bent under the table to gather the spilt paper. Quickly, I slid the USB drive that agent Bestroff had given me into the USB port of the laptop.

The computer made a brief bump-bump noise, indicating something had been plugged in. I immediately broke into a sweat, but Victor seemed not to notice.

As he leaned back up with the papers, I slid the laptop a little more my way, palming the USB port.

“Let me just save this for you,” I said and clicked save while subtly sliding the USB drive out of the port and back along the table to myself. Victor collected the papers in his hands and seemed not to notice anything I did.

“That all you want to look at today?” I asked cheerily. My heart felt like it was about to burst, and my voice wavered slightly.

Victor set the papers down and fished a wad of bills out of his pants pocket.

He handed it to me, “That is all for today Alex. Thank you so much for being so useful.”

There was that word again. Useful. I wondered what happened to me when “useful” turned into “screwed me over with the DEA.”

Big Data-ish Tip

Short of basic summary statistics on large datasets, predictive modeling is probably the most common big data pursuit. Can I predict my customers’ behavior using the data I’ve gotten in the past?

And I hope this post has convinced you that very simple predictive modeling is actually quite easy. In fact, this Excel exercise is way harder than what it takes to create the same model in Matlab, R, SPSS, etc., because you’ve gotta solve for the coefficients yourself. Understanding the guts of predictive models is a lot harder than blindly using them, but for better or for worse, the latter approach can often work just fine.

Of course, the idea that this is the sum total of predictive modeling is something I’ve termed the “Kaggle fallacy.” There’s a lot more to this stuff than just training models. For instance, which data should Victor have collected? Which features should be selected for the model? What if there’s a class imbalance between good and bad trips? Maybe the data’s dirty, or it’s scaled oddly. What about dreaming up problems based on the data available, or dreaming up what data to gather and features to assemble in order to solve problems that are plaguing your business? The truth is that a holistic approach to machine learning is important, because shit like this is just stupidity.

OK, let me catch my breath. All this to say, machine learning is an awesome tool for your analytics toolbox. It ain’t that hard to use (say, versus building a mixed integer program). So download the workbook and give it a shot.

If you’re interested in further study on this topic, this book is the effing bee’s knees.

Plus, the dude’s name is Torgo which reminds me of Torgo from Manos: The Hands of Fate.

 

  • Joseph Robert Brown

    I love the site so far. I am just recently getting into data science, and I am finding these posts very helpful. Keep up the good work. It would be nice if you could include the original spreadsheet so that we can do the formula entering ourselves.

    • http://twitter.com/John4man John Foreman

      Good call. On this exercise (and I think on some others), the original data Victor brings is in its own tab, so you can just delete the tabs Alex adds and save a copy. In the future, I’ll make sure the starting data is on its own tab.

  • http://www.facebook.com/dlhansen David Hansen

    I have read every post and walked through each tutorial. Do you know if there are any other sites with similar tutorials? Granted none will be as lively as yours. Keep up the good work – this stuff is interesting and you make it accessible to non CS peeps.

    • http://twitter.com/John4man John Foreman

      You know, one of the reasons I put this blog together is because I couldn’t find any good blogs that cover all the basic analytics topics from a beginner level. Usually, blogs (like practitioners) are siloed, so I’ve gotta go here to learn about optimization and here to learn about AI and somewhere else to learn about graphing. Yuck.

      And like you point out, so many of the “big data” analytics discussions out there come from the CS angle which is terribly annoying for someone like me who comes from a math, not CS, background. I don’t want to configure and compile for 3 weeks before I get to play around with something. Ever sat down for 30 minutes and thought “I’m gonna install Apache Mahout and do a little analytics?” God help you.

      If I find some other comprehensive blogs, I’ll be sure to link to them and let you know. Thanks!

  • Fred

    The link to the previous post is broken, it seems to be 29 instead of 25. Cool series of posts by the way! Thanks

    • John

      Fixed. Thanks!

  • Lena

    Hey.

    First of all thanks for the useful posts.

    Would an evolutionary algorithm instead of the GRG be considered an error?
    I’m asking because by using an evolutionary algorithm and bounding the coefficients in the [-10,10] range, I can get a “Total score” of almost 84.6, with some differences in the coefficients’ colors (and thus significance).

    For example, Meth and Coke are now green, Blackberry and Jittery light green, Attire, None, Constant, LSD and X yellow etc. This differs somewhat from your results.

    Any thoughts?

    • http://twitter.com/John4man John Foreman

      Lena,

      The optimization problem is nonlinear with plenty of local optima, so depending on which solver you use and the luck of the draw, you’re gonna get better or worse performance.

      I tried out the evo solver with the same bounds and hit a score of 90! Here’s a paste of the output: http://i.imgur.com/l0v0t.jpg

      It’s a better fit. Much more accurate. The interpretation of the coefficients appears to still be the same: coke, meth, jittery, blackberry, X, and attire all seem to cause bad trips while K, shrooms, flip phone, tattoos don’t. But man is that spread better. I’m going to have to change the post at some point, because the results are just so much clearer.

      Thanks for running that and sharing your results!

  • http://www.websitedoctor.com/ Alastair McDermott

    “Except that since you’ve got 7 drug options, we only need columns for 6
    of them. If they’re all 0, then we’ll know that the 7th, absent column
    is a 1.”

    Is this regular practice in datamining? From a software perspective including data by implication like this would be considered very bad practice.

    • john4man

      Yes, it’s standard practice to consider degrees of freedom and drop a column when doing dummy variables since it’s not providing any additional information.

  • Stuart Ketcham

    John,

    Thank you for the very useful tutorial. I understand the reason for dropping weed and Android from the stages where you look for the best coefficients, to minimize the degrees of freedom, as you explained. But I am wondering: after you have got the coefficients and are creating a table of conclusions, is there any way you can extract from the analysis predictions of the effects of weed and Android on the probability of a bad trip? I guess one time-intensive way would be to repeat the whole analysis, but choose other features to drop (that is, keep weed and Android, but drop, for example, shrooms and iPhone). But is there an easier way than repeating the whole analysis? Thanks!!

  • Joel

    Great work! I am teaching high school Excel, always thought this stuff was well out of scope… mind changed. @Joel_Logic

    • john4man

      Thanks for checking it out. There’s no reason why a motivated teen couldn’t learn this stuff. Wish I’d gotten the chance! And if you’re interested, my book will have some drug-free tutorials: http://amzn.com/111866146X

  • ted

    When I opened up the spreadsheet and ran the solver (made sure I selected right fields), I got the following error: “One of the cells in the worksheet became an error value when solver tried certain values for the variable cells.” I clicked OK to let it keep solving anyways and the first individual scores value was “#NUM”

    So yeah not sure why it is popping out this error. It messes all the results up and I don’t get what you have in the post.

    Great post nonetheless btw!!!

  • A. Friend

    Thanks for the line on Torgo’s book!