If you’ve never contemplated an algorithm for predicting lottery numbers or picking your fantasy football lineup, then this will be boring. But if you’re a secret data science nerd, then please join me in an epic high five!
We just completed coding a cluster analysis tool for data: the competition took 90 minutes; we finished in under an hour.
The problem space is typically O(n³) and resembles the classic travelling salesman problem.
Our Slalom client had a data and cost problem.
The data problem
Take names, addresses, and other personal information and consolidate it across various sources. I’m writing this in a hurry, so I won’t reveal any client-specific details or actual implementation (that belongs to the client now), but a similar problem space might be genealogy.
So, given: you have death records, marriage records, birth records, and census records from different sources, across different times, with different spellings. You don’t want to duplicate data. You don’t want to lose data. So you need to form a cluster. A cluster is how you identify that 1890 census Jimmy Joehosaphat @ 10 Band St is the same person as 1880 census James Johosaphat @ 10 Bond Street.
I’m sure you can think of how to do this for one person, but how do you scale this problem to millions without killing your server?
The cost problem
Cost was a concern for the client. They were spending six figures a year on enterprise licensing. Naturally, I imagine they would use the money they save to buy a water fountain connected to a never-ending supply of Monster energy drinks.
Microsoft SQL Server 2012 and SSIS. Sometimes you gotta work with what you got. ;)
We used SQL Server Master Data Services (MDS) similarity functions and SSIS to find similarities among all the records.
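As a rough sketch of what “find similarities among all the records” can look like in plain T-SQL — using the built-in SOUNDEX and DIFFERENCE functions as stand-ins for the MDS fuzzy-matching functions, with hypothetical table and column names:

```sql
-- Hypothetical schema: dbo.Person(PersonId, FullName, Street).
-- SOUNDEX/DIFFERENCE are built-in stand-ins for the MDS similarity functions.
CREATE TABLE dbo.Similarity (
    PersonIdA INT NOT NULL,
    PersonIdB INT NOT NULL,
    Score     INT NOT NULL  -- DIFFERENCE returns 0 (no match) to 4 (strong match)
);

INSERT INTO dbo.Similarity (PersonIdA, PersonIdB, Score)
SELECT a.PersonId, b.PersonId, DIFFERENCE(a.FullName, b.FullName)
FROM dbo.Person AS a
JOIN dbo.Person AS b
  ON a.PersonId < b.PersonId                  -- compare each pair once
 AND SOUNDEX(a.Street) = SOUNDEX(b.Street)    -- cheap blocking key to avoid n² pairs
WHERE DIFFERENCE(a.FullName, b.FullName) >= 3;
```

The blocking key is the important trick: you only score pairs that share something cheap to compute, instead of scoring every record against every other record.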
Luke has a father.
Luke has a sister.
Vader’s son is Luke.
Leah’s father is Vader.
Leah’s brother’s name is Luke.
Chewbacca is a Wookiee.
We stored those similarities in a table.
We then ran a stored procedure to consolidate all associative and transitive associations down to a single parent.
The Luke Skywalker cluster: Luke, Leah, Vader
The Wookiee cluster: Chewbacca
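In table form, those stored facts might look something like this (table and column names are invented for illustration):

```sql
-- Hypothetical relationship table holding the pairwise facts above.
CREATE TABLE dbo.Relationship (
    NodeA VARCHAR(50) NOT NULL,
    NodeB VARCHAR(50) NOT NULL
);

INSERT INTO dbo.Relationship (NodeA, NodeB) VALUES
    ('Luke', 'Vader'),   -- Luke has a father / Vader's son is Luke
    ('Luke', 'Leah'),    -- Luke has a sister / Leah's brother is Luke
    ('Leah', 'Vader');   -- Leah's father is Vader
-- Chewbacca relates to no one, so he ends up a cluster of one.
```

The consolidation step then walks these rows and assigns every connected node the same parent.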
We did it. We came, we saw, we kicked its ass!
Enter the Slalom solution team
Data management (MDM) team
Our superstars Vaibhav Sathe and Soumen Chakraborty — personality-wise the Winston (Ernie Hudson) and Ray (Dan Aykroyd) of the team — spent several weeks building out the rules and fuzzy analysis for comparing the data and identifying possible cluster relationships. I worked with Soumen only briefly, so I’m typecasting the Aykroyd fit, but Vaibhav and I worked together at another client on a similar problem. He’s usually quiet and collected, but I could picture him screaming, “I love this town!” after defeating a giant marshmallow man.
Our data management practice area lead, Teddy Wei, is a force to be reckoned with. His left-field wit is equally matched by his commitment to the team and the client. He cares. Personality-wise, he’s definitely the Venkman (Bill Murray).
Also helping out as our MS SQL Server guru was Kevin Feit. I’m saving the last Ghostbuster for myself, so I’ll have to give Kevin the miscast of Rick Moranis… Sorry Kevin, I’m writing the article. ;) Intellectually, though, he is an Egon; he knows SQL Server and T-SQL like Neo knows kung fu.
Application Development team (we’re too good for acronyms)
And by we, I mean me. I helped out on this project for a little over a week’s worth of time. My job, and the part I’m particularly proud of, was writing a stored procedure to connect all the related clusters to a single parent using set-based operations.
Did you imagine a SQL Server cursor? No, I did it with set-based operations. Gosh! ;)
In our final tests, my set-based operation took 13% of the time of a cursor-based approach. I am the Egon (Harold Ramis). Plus my hair fro’s up like Egon’s when I go too long without a haircut.
The scoreboard, race for speed
The most exciting part of this project for me came back in June when Vaibhav (Winston) reached out to me. They were running into trouble with performance on their final clustering stored procedure, the one that looks at all the relations between data points and creates one parent for each cluster.
They were using an approach that was taking six hours to run. The client needed a solution that completed in under two hours. And this was only a small sample of test data. If we couldn’t beat two hours for the end-to-end analysis, the project would be a failure.
If six hours seems excessive, here’s the problem-space math.
You have a table with 40,000 relationships.
All many-to-many associative relationships.
(A==B, B==A, B==C)
Only a single level of a relationship is spelled out. You have to figure out the transitive relationships on your own.
(if A==B and B==C then A==C)
Doing this the hard way, you first look at record 1 and find every record related to it. Then, for each of those related records, you check whether they have further relations of their own and mark those as related to record 1 too. You keep going until you can’t go any further. Then you move on to record 2.
Without the help of any indexes, SQL Server will have to perform 40,000! reads. I couldn’t find a calculator to do a factorial that large.
It’s a recursive problem. This is why 40,000 relationships was taking six hours.
40,000! = forever
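The per-record walk above can be sketched as a recursive CTE (again, hypothetical table names; this sketch assumes edges are stored in one direction with no cycles — real data has cycles like A==B, B==A, which is part of why the naive walk is so painful):

```sql
-- The "hard way", for one starting record: follow edges until no new nodes
-- turn up, then repeat the whole thing for the next record.
WITH Reachable AS (
    SELECT NodeB AS Node
    FROM dbo.Relationship
    WHERE NodeA = 'Luke'          -- record 1
    UNION ALL
    SELECT r.NodeB
    FROM Reachable AS x
    JOIN dbo.Relationship AS r
      ON r.NodeA = x.Node         -- expand one more level
)
SELECT DISTINCT Node
FROM Reachable
OPTION (MAXRECURSION 0);          -- clusters can be deeper than the default 100
```

Run once per record, this is exactly the repeated re-reading of the same rows that blows the cost up.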
Then the algorithm was optimized to make the best use of indexes. What does that give you? You’re now doing factorials at a smaller scale. If the average depth of a cluster is 6 nodes, that becomes 6!, which is 6*5*4*3*2*1 = 720. So if you end up with 1,700 final clusters, that gives you this optimal number.
6! * 1,700 = 1,224,000 records to read
So in your best scenario, SQL Server will not have to look at more than 1.2 million rows for this small sample size.
With index optimization we got the time down to 15 minutes
However, this sample was 3% of the size of what production would face. The solution still wouldn’t survive scaling out.
I converted the algorithm to a set-based operation.
With set-based operations we got the time down to 2 minutes
This was awesome! When we went to production, this stored procedure was able to analyze 1.19 million relationships and return the uniquely identified cluster for each relationship tree in under ten minutes.
Final result: For half a million records, it took 40 minutes to identify all relationships and 10 minutes to consolidate and identify all clusters.
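The client’s actual procedure isn’t mine to share, but here’s a minimal sketch of one common set-based way to consolidate clusters: start every node as its own parent, then repeatedly let each node adopt the smallest parent seen across its edges until an iteration changes nothing. Table and column names are hypothetical.

```sql
-- Assumes dbo.Relationship(NodeA, NodeB) holds the pairwise relations.
CREATE TABLE #Cluster (Node VARCHAR(50) PRIMARY KEY, Parent VARCHAR(50));

-- Every node starts as its own parent.
INSERT INTO #Cluster (Node, Parent)
SELECT Node, Node
FROM (SELECT NodeA AS Node FROM dbo.Relationship
      UNION
      SELECT NodeB FROM dbo.Relationship) AS n;

WHILE @@ROWCOUNT > 0        -- loop until an UPDATE touches zero rows
BEGIN
    UPDATE c
    SET c.Parent = m.MinParent
    FROM #Cluster AS c
    JOIN (
        -- For each node, the smallest parent among all its neighbors.
        SELECT e.Node, MIN(p.Parent) AS MinParent
        FROM (SELECT NodeA AS Node, NodeB AS Peer FROM dbo.Relationship
              UNION ALL
              SELECT NodeB, NodeA FROM dbo.Relationship) AS e
        JOIN #Cluster AS p ON p.Node = e.Peer
        GROUP BY e.Node
    ) AS m ON m.Node = c.Node
    WHERE m.MinParent < c.Parent;   -- only ever shrink, so it converges
END;
```

Each pass is one big joined UPDATE over the whole table rather than a per-record walk, which is why the engine gets to use indexes and parallelism instead of grinding through a cursor.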
It’s like looking at all college students and figuring out who has slept with who so you can figure out which students now have herpes. Okay, maybe that was a bit of a prudish metaphor — but same type of problem space. ;)
Wishes and lessons learned
If I had more time I would’ve loved to re-write the fuzzy analysis steps and see if I could invent a way to index records in SQL for Levenshtein and Jaro-Winkler scores. I got close to this using ElasticSearch on another project. Overall, I would like to try to get as close as I could to O(n log n).
I also re-learned that when you start to work with millions in SQL Server the query execution planner doesn’t care for indexes with composite columns. It may think a table scan is better than trying to exactly line up two indexes, especially if the columns are not unique. When working with millions, index one column and include the rest.
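In index terms, that lesson looks like this (hypothetical names; one keyed column with the rest carried along via INCLUDE):

```sql
-- The composite key the optimizer tended to skip at scale:
-- CREATE INDEX IX_Rel_Composite ON dbo.Relationship (NodeA, NodeB);

-- What worked better for us: key one column, include the rest.
CREATE INDEX IX_Rel_NodeA
    ON dbo.Relationship (NodeA)
    INCLUDE (NodeB);   -- NodeB rides along at the leaf level, no key lookup needed
```

INCLUDE columns make the index covering for the query without widening the key the optimizer has to match on.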
Unit tests saved a lot of time and headache in this work. Each time I found a new bug in the stored procedure I was writing, I would write a new unit test for it. The unit test wasn’t complex or anything, just a single T-SQL file. It would empty out production records into a temp table. Insert test seed data into the relationship table. Then run the clustering stored procedure and use IF EXISTS (SELECT…) and RAISERROR statements to report if any expectations failed. At the end, it would delete the test seed data and put the production records back into the table.
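The shape of one of those single-file tests looked roughly like this (procedure and table names are invented; the real ones belong to the client):

```sql
-- Park production rows so the test runs against known data.
SELECT * INTO #ProdBackup FROM dbo.Relationship;
DELETE FROM dbo.Relationship;

-- Seed: A==B and B==C, so A==C must come out transitively.
INSERT INTO dbo.Relationship (NodeA, NodeB)
VALUES ('A', 'B'), ('B', 'C');

EXEC dbo.usp_ConsolidateClusters;   -- hypothetical clustering sproc

-- Assert the expectation; RAISERROR makes the failure loud.
IF NOT EXISTS (SELECT 1 FROM dbo.Cluster
               WHERE Node = 'C' AND Parent = 'A')
    RAISERROR ('Transitive A==C link was not clustered.', 16, 1);

-- Put production back the way we found it.
DELETE FROM dbo.Relationship;
INSERT INTO dbo.Relationship SELECT * FROM #ProdBackup;
```

Cheap to run, and every new bug got pinned down by a new seed-and-assert block like this one.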
An architect’s work is never done. That’s why the Pharaohs used to bury them with their works. To shut them up.
This Android game (Osmos) was my inspiration for how I imagined solving the problem.