Even Wholesale Drug Dealers Can Use a Little Retargeting: Graphing, Clustering & Community Detection in Excel and Gephi

NOTE: Here’s the final spreadsheet for this post if you’d like to work along with Alex and Victor. No drugs needed.

Two days later, I was walking to class at 9 in the morning when Andre pulls up in his truck.

“Get in the car, Alex,” he said to me through bites of a breakfast burrito.

“But I’ve got class,” I protested. The other students who walked by me on Tech Parkway eyed our conversation with curiosity.

Andre squirted some hot sauce on the end of his burrito and nodded.

“I don’t give two shits what you’ve got, man. Duty calls,” he said and took another bite.

I had a data structures class that morning, and I was truly hesitant to miss it. I had already slept through four others this semester.

“Can’t we just do this later?” I asked with a sigh.

“If you want to keep the big man waiting,” Andre said with a raised eyebrow, “It’s your funeral.”

I didn’t like Andre’s choice of idioms. I muttered some profanities to myself and got in the car. We sped off.

Half an hour later I was in neither Hapeville nor Buckhead. No, of all places, I met Victor in a Barnes & Noble. I couldn’t freaking believe it, but there he was sitting at a table in the coffee shop drinking a triple espresso and staring at his laptop. When he saw me walk in, he smiled and motioned me over to his table.

“Alex, my friend,” he said cheerily, “How are you this morning?”

I frowned, “I had class, but Andre pulled me away.”

Victor frowned along with me, “You should have told him to wait. I don’t mean to disrupt your schedule.”

But I could tell from Victor’s tone that he was all politeness. Of course he meant to disrupt my schedule. This was not a man who waited for anything. I took a seat next to him and draped my backpack across the back of my chair.

“Would you like something to drink?” he asked.

“Nah, I’ve already had two cups. So what’s on the agenda today?” I asked.

Victor gave me one of his cheshire cat smiles.

“Today we talk about my wholesale business,” he said.

“Wholesale?”

Victor nodded, “Yes, it used to be that I’d occasionally have more product than I needed and less cash than I needed, so I’d sell some of my product in bulk to other dealers. Sometimes even competitors. Slowly over time though, I’ve formalized this business. I handle transpo into the states and sell in bulk to a variety of clients, most operating in cities that I do not. They get a substantial discount off street value for buying in bulk, and I make good money as well.”

“OK,” I said, “So what’s the challenge?”

“Ah, you wouldn’t be here if there wasn’t a problem, right?” he asked.

“Well,” he continued, “I now have a hundred customers who buy from me regularly, and I send them deals regularly. But I want to know more about their interests. I want to know what they’re looking for. I do not know all these fellows personally. Some are friends of friends, flashing headlights in an empty parking lot, but I have their purchasing history. I was wondering what it told me about them. Whether there was a way to target them better in the future. Maybe hit them with the same deal twice if they didn’t take it the first time I contacted them, but I know they should be interested.”

“Sure, they call that retargeting,” I said.

“Yes, retargeting,” he nodded and took a sip of his espresso.

I was a bit freaked out to be talking about this in a public place, but Victor seemed not to mind, and with the sounds of the espresso machine and the grinder and folks talking on their phones, our conversation just kindof blended in.

“So can you show me the data?” I asked.

Victor pushed his laptop screen my way. On it was a spreadsheet with a tab called “Inventory” that looked like this:

I looked the data over.

“So I’ve got deals offered on certain dates going down the rows, with the product type, the quantity, the discount off street value, the country of origin, and this last column ‘Ready for use’…what does that mean?” I said.

“Ready for use just means that it’s not formed into some weird brick or shoved up a teddy bear’s ass and encased in concrete,” Victor answered, “You could take it streetside the day you bought it without further processing.”

I nodded, “And then the columns are your customers where a 1 indicates that they took you up on an offer?”

“Correct,” said Victor.

“OK, so here’s what we’re going to do with this data,” I said, “We’re going to try to detect communities in it. Patterns, types of customers. We’re going to use some algorithms to find hidden segments in it that you can use to target your customers better.”

“That sounds like a perfect idea to me,” Victor said, smiling, “Where do we begin?”

“OK, I’m gonna set up a few things in this spreadsheet, and I just want you to follow along with me as best you can,” I said, “First, I’m going to put a row at the bottom of the table that totals up how many deals each customer took.”

“Then, I’m going to make another tab called ‘InventoryInvert’ in this spreadsheet where I’m going to take your data and transpose it by copying the whole table and doing a paste special –> values on it with the transpose box checked,” I said. When I’d done that, I had a new tab that looked like this:

“Now that I have that,” I continued, “I’m going to create a graph of customer distances in a third sheet.”

“Graph?” asked Victor.

I nodded, “Yeah, let me explain this part. Let’s imagine a network or a web, kinda like a spiderweb, where each customer is connected to every other customer. The thickness of each connection is based on how many deals the two customers both matched up on.”

“So you and I would get a point or something if we both took the same deal?” asked Victor.

“Yes,” I said, “Exactly. Except that I want to penalize people a little bit who are just going through and picking every deal, so instead of giving us a solid point, we’re going to get 1 point divided by the square root of the number of deals I took plus the square root of the number of deals you took.”

“Those are the numbers in that sum row you just added, yes?”

“Correct,” I said, “This is a concept called cosine similarity just for your own edification.”

I created a new tab in the spreadsheet called ‘EdgeWeights’, and in it, I created a matrix of customers X customers where each edge weight was calculated by dragging this formula around from the top left cell:

{=IF(ROW()<>COLUMN(),SUMPRODUCT(Inventory!G$3:G$101,TRANSPOSE(InventoryInvert!$B7:$CV7))/(SQRT((Inventory!G$102)+SQRT(InventoryInvert!$CW7))),0)}

“What do the curly braces mean?” asked Victor.

“Ah, that’s called an array formula in Excel. Because I have to use the transpose function in my calculation, I need the curly braces. To get the curly braces, just type the formula without them and hit CTRL+SHIFT+ENTER (Apple is COMMAND+SHIFT+ENTER). They’ll appear on their own,” I answered, continuing, “So what this formula is saying is that so long as the two customers in the matrix don’t match, I want to have a weight on the edge of the graph. And I want that weight to be equal to the sumproduct of their two deal vectors divided through by the sum of the square roots of the number of deals they took. It’s just the cosine similarity calculation in an excel formula.”

I dragged the formula from the first cell in the table to the rest of the table, and the sheet looked like this:

“So what do we do with this?” asked Victor.

“Well, we’ve got a whole bunch of edges here, but really all we care about are the most significant relationships. For each customer, who are their nearest neighbors?” I said, “So I’m going to create what’s called a trimmed nearest neighbors graph out of this, and then things are going to get interesting.”

“The first step,” I continued, “Is that I’m going to calculate the top 3-ish nearest neighbors for each customer here by first calculating the 97th percentile weight in each column.”

I added a row at the bottom of the sheet with the formula:

=PERCENTILE(B2:B101,0.97)

and dragged it across. The result looked this this:

I then created a new tab in the workbook called ’3NN’ where I pasted the same customer X customer matrix, except this time I filled the cells with the following formula:

=IF(EdgeWeights!B2>=EdgeWeights!B$102,EdgeWeights!B2,”")

“So you’re only copying over those edges that are the most important?” asked Victor.

“That’s right,” I said, “And now we’re ready to get community detecting. For that, we need to leave Excel behind temporarily and move over to Gephi.”

I fired up a browser and downloaded a free copy of Gephi from https://gephi.org/.

“What is this Gephi?” asked Victor.

“It’s a graphing software,” I said, “Basically it’s going to visualize this data for us, create pretty pictures and stuff.”

I installed the software, and saved the 3NN tab to a csv file.

“So you just load the data from the 3NN tab into Gephi?” asked Victor.

“Not quite,” I said, opening up a text editor on Victor’s machine, “I need to replace all the commas in the csv export with semi-colons first.”

I did the find-replace and saved off the file. Then I was ready to open it in Gephi. I navigated to File -> Open and selected the csv file I’d created from the window. The import screen popped up, and I made sure Gephi knew that the graph I was working with was undirected and then pressed OK.

“Why did you select ‘undirected’ from the graph type menu?” asked Victor.

“Because customer edges in this case go both ways. If you’re connected to me, because you like the deals that I like, then I’m connected to you for the same reason,” I said.

A jumble of points popped up on the screen:

“So there’s the graph,” I said.

“It looks like shit,” said Victor with a sigh, “There’s nothing there.”

I laughed a little, “Just hang on. Don’t be impatient. We need to use a layout algorithm to pretty it up. I’m going to select ‘ForceAtlas 2′  from the layout menu over here on the bottom left and hit play.”

Immediately the graph snapped into shape and I pressed stop on the algorithm.

“Now what do you think?” I asked him.

“Looks better,” he said, “So there are two communities?”

“Mmm,” I shrugged, “Let’s find out. I’m going to press the modularity button over on the right side of the screen first.”

“That’s going to run a modularity calculation on the network,” I said and pressed the button. When the pop-up asked me whether or not to randomize the calculation, I clicked OK.

Instantaneously, a window with the results of the calculation popped up:

“It found 6 communities,” I said, “Let’s color them on the graph.”

I navigated to the partition section of the window in the top left corner, refreshed the drop down menu and picked modularity class. I hit the apply button and the six communities popped up in color on the graph.

“Ah ha,” said Victor, “So what do they mean?”

I smiled, “Well, Gephi can kinda suck for giving us that kind of insight sometimes. Depends on the graph and what you’re trying to do. But let’s dump these communities back into Excel and do some more analysis. I think we’re getting close to some insight here.”

I clicked on the Data Laboratory tab at the top of Gephi and exported each label (customer) with its assigned modularity class (community) to a csv, like so:

Next I opened this data back up in excel and copy-pasted it back into the original workbook we had open in a new tab called ‘Communities’.

I then inserted a new top row in the ‘Inventory’ sheet and used a vlookup to assign the customer names there to their respective communities.

I created a new tab called ‘TopDealsByCommunity’ and pasted all the deals (columns A:F) from the ‘Inventory’ tab into the new sheet. I then added communities 0 thru 5 as headers for columns G thru L.

“What are you doing?” asked Victor.

“The way we can tell what each community means is to figure out which deals were the most important for each community,” I replied, “So what I want to do is for each deal here I want to go back to the ‘Inventory’ tab and sum up all the ones for the customers in my group.”

I filled in the grid by dragging this formula around to all the cells:

=SUMIF(Inventory!$G$1:$DB$1,”=0″,Inventory!$G3:$DB3)

I added some conditional formatting to get a sheet like this:

“Now all I need to do is sort,” I said, “Let’s start with Community 0.”

I sorted, and Victor leaned in close to examine the values:

Victor began to laugh so loudly that a few folks in the coffee shop turned their heads.

“They are my dirt heads,” he said, pointing to the screen, “Look, they only order weed and shrooms. They’re my crunchy people.”

I nodded, “Absolutely. Let’s try Community 1.”

“Hmmm,” said Victor as he studied the data, “They buy all sorts of products.”

He sat for a moment and then his eyebrows raised, “But look at the volumes. They only buy the large quantities.”

“Yeah,” I said, “And look at the ready fur use column. It’s always true.”

“Ah,” he said, “That makes sense. These are the folks who make money by moving large volume fast. They don’t want prep, and they don’t deal in piddly shit. They are my Walmarts.”

“Yeah,” I said, “Let’s check out number 5 here real quick and then I’ll leave the rest for you to interpret.”

“My patriots,” Victor said smiling.

“Made in the U.S.A. baby,” I allowed myself a laugh.

Victor nodded, “My stateside producers do make some very good products.”

He looked over at me, “Thank you, Alex. This has been most insightful. Very cool stuff.”

He reached into his laptop bag and pushed a wad of bills into my hand. He clasped my hand in both of his shook it.

“You do not understand how refreshing it is to talk to someone like you. I like playing with data. It is a relief from the other demands of my position. Less messy,” he said.

He took a final swig from his cup of espresso and crushed the paper cup in his hand. The thought flitted across my mind momentarily that I had no desire to find out exactly what ‘messy’ meant.

Big Data-ish Tip

Obviously, you’re going to want to move out of Excel for large datasets. A database can provide a sparse representation of this data, because you don’t actually need to save any records of when someone doesn’t take a deal. You only need the actions. Furthermore, the edge weight calculation is very parallelize-able / map-reduce-able.

If you had a big dataset that you prepped into a trimmed nearest neighbors graph, keep in mind that visualizing it in Gephi is just for fun. It’s not necessary for actual insight regardless of what the scads of presentations of tweets-spreading-as-visualized-in-Gephi might tell you (gag me). You just need to do the community detection piece. You can use Gephi for that or the libraries it uses. R and python both have a package called igraph that does this stuff too. Whatever you use, you just need to get community assignments out of your large dataset so that you can run things like the aggregate analysis over them to bubble up intelligence about each group.

On another note, in the problem above I treated all deals equally, but let’s say your problem was such that there were types of deals that everyone took, and they drowned out insight from less popular deals. You could alter the ‘Inventory’ tab to have a 1/sqrt(number of people who took deal) in each cell instead of just a 1. That way, two people who took an unpopular deal have more in common than two people who took a popular one. When you’d do the aggregations / ranking out the other side of the community detection, similarly you’d want to use these modified points.

 

  • Pingback: Digging Deeper into Wavelength and EGP Data: Finding Interest Clusters in MailChimp’s Network | MailChimp Email Marketing Blog

  • Pingback: Digging Deeper into Wavelength and EGP Data – MailChimp Email … | Email Marketing Software

  • http://www.clicktruemedia.com/ Jonah Keegan

    Spreadsheet link is not working for me… could i get an email copy?

    • http://twitter.com/John4man John Foreman

      I’m sorry to hear that! I’ve sent you a copy.

  • Pingback: Digging Deeper into Wavelength and EGP Data: Finding Interest … | find-and-address

  • Pingback: Chit Chat with New Datasets – Facets in Open (Was “Google”) Refine « OUseful.Info, the blog…

  • Pingback: “Drug Deal” Network Analysis with Gephi (Tutorial) « OUseful.Info, the blog…

  • Pingback: “Drug Deal” Network Analysis with Gephi (Tutorial) « Another Word For It

  • http://twitter.com/n_dayspring nathan dayspring

    In the last section you mention modifying the values so that they aren’t equal. What happens if you have really drastic values for vectors (vectors diff than the example above) – would this methodology for clustering still be viable? If I had a zip code instead of a person, and my rows were population, income, visits to a website from that location, etc., would this kind of clustering still work?

  • Woon Peng Goh

    Bravo! Absolutely hilarious!

  • J

    Hey I was wondering, why aren’t edge-weights equal for (A, B) and (B, A)? For example, the edge-weight of (Smith, Johnson) is 1.34622708 and the edge-weight of (Johnson, Smith) is 1.23740666.

    I’m confused because the graph type is undirected.

    • J

      I just realized it’s because of the division of the distance by its square-root. However, how does Gephi handle the two different values for the same edge-weight?

      • Justin

        I ran into the same issue with my dataset. I believe the numbers should be the same based on the concept of the formula. I believe i’ve figured out what causes the formula to change the value. There is an extraneous parenthesis after the first SQRT – The formula should be …(SQRT(Inventory!G$102)

        • Justin

          There is a parenthesis at the end of the denominator that would need to removed too-

  • rahul

    Where can I have the original excel sheet so that I could try all these steps. Please upload it somewhere or send on my mail id

  • John Berryman

    @JnBryman again… I dig your work! The deeper I get into my own data analytics pursuits, the more commonalities I see among all fields of data analysis. In your post here, you are finding relationships among drug dealers based upon their choices. [In my post here](http://www.opensourceconnections.com/2013/08/25/semantic-search-with-solr-and-python-numpy/) I identify relationships among words based upon documents that contain them. I identify that for instance that the word “vader” is closely associated with words like “luke”, “emperor”, “darth”, “anakin”, “sith”, and “skywalk”. But in your post you take it a big further by applying graph clustering to the dealers. I wonder what it would be to apply this to my word relationships. Scaling might be an issue… there are a lot of words out there :-/

  • http://www.communitydataroundtable.org Community Data Round

    This was great! Very well written and insightful. Bravo!

  • Shankar

    What a splendid way to introduce several things at once: retargeting, excel manipulation of graph data, gephi visualisation of data, and the kicker: all this without the jazz in R. Keep blogging away, please!