If you’ve never contemplated an algorithm for predicting lottery numbers or winning your fantasy football league, then this will be boring. But if you are a secret data science nerd, then please join me in an epic high five!

We reduced cluster analysis time to 66% of an enterprise product’s

We just completed coding a cluster analysis tool for data: the competing enterprise product took 90 minutes; we finished in under an hour. The problem space is typically O(n³) and is like your classical travelling salesman problem. Our client had a data problem and a cost problem.

The data problem

Take names, addresses, and other personal information, and consolidate that information across various sources. I’m writing this in a hurry, so I won’t reveal any client-specific details or the actual implementation (that belongs to the client now), but a similar problem space might be genealogy. Given death records, marriage records, birth records, and census records from different sources, across different times, and with different spellings: you don’t want to duplicate data, and you don’t want to lose data. So you need to form a cluster. A cluster is how you identify that “Jimmy Joehosaphat @ 10 Band St” in the 1890 census is the same person as “James Johosaphat @ 10 Bond Street” in the 1880 census. I’m sure you can think of how to do this for one person, but how do you scale this problem to millions of records without killing your server?

The cost problem

Cost was a concern for the client; they were spending six figures a year on enterprise licensing. Naturally, I imagine they would use the money they save to buy a water fountain connected to a never-ending supply of Monster energy drinks.

The tools

Sometimes you gotta work with what you got. ;) Microsoft SQL Server 2012 and SSIS.

The solution

We used SQL Server Master Data Services (MDS) similarity functions and SSIS to find similarities among all the records:

Luke has a father.
Luke has a sister.
Vader’s son is Luke.
Leah’s father is Vader.
Leah’s brother’s name is Luke.
Chewbacca is a Wookiee.

We stored those similarities in a table. We then ran a stored procedure to consolidate all associative and transitive associations down to a single parent:

The Luke Skywalker cluster: Luke, Leah, Vader
The Wookiee cluster: Chewbacca

We did it. We came, we saw, we kicked its ass!

Enter the Slalom solution team

The Ghostbusters were represented by Slalom’s Data Management (MDM) and Application Development teams.

Data Management (MDM) team

Our superstars, Vaibhav Sathe and Soumen Chakraborty (personality-wise, they would be Winston (Ernie Hudson) and Ray (Dan Aykroyd)), spent several weeks building out the rules and fuzzy analysis for comparing the data and identifying possible cluster relationships. I worked with Soumen only briefly, so I’m typecasting the Aykroyd fit, but Vaibhav and I worked together at another client on a similar problem. He’s usually quiet and collected, but I could picture him screaming, “I love this town!” after defeating a giant marshmallow man.

Our data management practice area lead, Teddy Wei, is a force to be reckoned with. His left-field wit is equally matched by his commitment to the team and the client. He cares. Personality-wise, he’s definitely the Venkman (Bill Murray).

Also helping out at the level of MS SQL Server guru was Kevin Feit. I’m saving the last Ghostbuster for myself, so I’ll have to give Kevin the miscast of Rick Moranis… sorry Kevin, I’m writing the article. ;) Intellectually, though, he is an Egon: he knows MS SQL Server and T-SQL like Neo knows kung fu.

Application Development team (we’re too good for acronyms)

And by “we,” I mean me. I helped out on this project for a little over a week. My job, and the part I’m particularly proud of, was writing the stored procedure that connects all the related clusters to a single parent using set-based operations. No cursors. Gosh!
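To make that consolidation step concrete: collapsing pairwise relations down to a single parent per group is the classic connected-components problem. Here is a minimal sketch in Python using union-find, with the Star Wars toys from above as data. This is an illustration of the idea, not the client’s actual T-SQL implementation:

```python
# Union-find (disjoint set) sketch: consolidate pairwise relations
# into clusters, each identified by a single parent record.
# Illustrative only -- the real solution was a T-SQL stored procedure.

def find(parent, x):
    # Walk up to the root, compressing the path as we go.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def cluster(relations):
    parent = {}
    for a, b in relations:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # smallest id wins as parent
    # Group every record under its final root.
    clusters = {}
    for x in parent:
        clusters.setdefault(find(parent, x), set()).add(x)
    return clusters

relations = [("Luke", "Vader"), ("Leah", "Vader"), ("Leah", "Luke"),
             ("Chewbacca", "Chewbacca")]   # a record with no matches still forms its own cluster
print(cluster(relations))
```

Each relation is looked at exactly once, and the path compression inside `find` keeps the trees shallow, which is what makes this approach scale far better than re-walking every chain per record.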
Did you imagine a SQL Server cursor? ;) In our final tests, my set-based operation took 13% of the time of the cursor-based approach. I am the Egon (Harold Ramis). Plus, my hair fros up like Egon’s when I go too long without a haircut.

The scoreboard: race for speed

The most exciting part of this project for me came back in June, when Vaibhav (Winston) reached out to me. They were running into performance trouble with their final clustering stored procedure, the one that looks at all the relations between data points and creates one parent for each cluster. The approach they were using was taking six hours to run, and that was on only a small sample of test data.

The client needed a solution that completed in under two hours

If we couldn’t beat two hours for the end-to-end analysis, the project would be a failure.

If six hours seems excessive, here’s the problem-space math. You have a table with 40,000 relationships. You’re trying to figure out which piece of spaghetti is touching which piece of spaghetti, and which pieces those pieces are also touching. Ahh, recursion. All the relationships are many-to-many associative relationships (A==B, B==C). Only a single level of a relationship is spelled out; you have to figure out the transitive relationships on your own (if A==B and B==C, then A==C).

Doing this the hard way, you first look at record 1 and find every record related to it. Then, for each of those related records, you check whether they have any other related records and mark those as related to record 1 as well. You keep going until you can’t go any more. Then you move on to record 2. Without the help of any indexes, SQL Server would have to perform something like 40,000! reads. It’s a recursive problem, and that’s why 40,000 relationships were taking six hours. I couldn’t find a calculator that would even do a factorial that large.

40,000! = forever

So the algorithm was first optimized to make the best use of indexes. What does that give you? You’re now doing factorials on a smaller level. If the average depth of a cluster is 6 nodes, that turns 40,000! into 6!, which is 6 × 5 × 4 × 3 × 2 × 1 = 720. If you end up with 1,700 final clusters, that gives you the optimal number:

6! × 1,700 = 1,224,000 records to read

So in the best scenario, SQL Server will not have to look at more than about 1.2 million rows for this small sample size.

With index optimization we got the time down to 15 minutes

However, this sample was 3% the size of what production would face, and the solution still wouldn’t survive scaling out. So I converted the algorithm to a set-based operation.

With set-based operations we got the time down to 2 minutes

This was awesome! When we went to production, this stored procedure was able to analyze 1.19 million relationships and return the uniquely identified clusters for each relationship tree in under ten minutes.

Final result: for half a million records, it took 40 minutes to identify all relationships and 10 minutes to consolidate and identify all clusters.

It’s like looking at all the students at a college and figuring out who has slept with whom, so you can figure out which students now have herpes. Okay, maybe that was a bit of a crude metaphor, but it’s the same type of problem space. ;)

Wishes and lessons learned

If I had more time, I would’ve loved to rewrite the fuzzy analysis steps and see if I could invent a way to index records in SQL Server by Levenshtein and Jaro-Winkler scores. I got close to this using Elasticsearch on another project. Overall, I would like to get as close as I can to O(n log n). I also re-learned that the query execution planner may think a table scan is better than trying to line up two indexes exactly, especially if the columns are not unique. When working with millions of rows, index one column and INCLUDE the rest.
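For flavor, here is one common set-based pattern for this kind of consolidation, sketched in Python: iterative minimum-label propagation, where every pass updates the entire relationship table at once (in T-SQL, a single UPDATE with a JOIN) instead of crawling record by record. This illustrates the general technique; it is not the client’s actual procedure:

```python
# Set-based label propagation sketch: rather than walking each cluster
# record-by-record, repeatedly set every node's cluster id to the
# minimum id among itself and its neighbors, as one whole-table pass.
# Each pass corresponds to a single set-based UPDATE/JOIN in T-SQL.
# Illustrative only -- not the client's actual code.

def consolidate(relations):
    label = {}
    for a, b in relations:
        label.setdefault(a, a)
        label.setdefault(b, b)
    changed = True
    while changed:                 # a handful of sweeps, not 40,000! reads
        changed = False
        for a, b in relations:     # one "set-based" sweep over the table
            low = min(label[a], label[b])
            if label[a] != low or label[b] != low:
                label[a] = label[b] = low
                changed = True
    return label

relations = [(1, 2), (2, 3), (3, 4), (5, 6)]
print(consolidate(relations))   # each record mapped to its cluster's smallest id
```

The number of sweeps is bounded by the longest chain in a cluster, so with an average cluster depth of 6, the whole table settles in a few passes.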
When you start to work with millions of rows in SQL Server, the query execution planner doesn’t care for indexes with composite columns

Each time I found a new bug in the stored procedure I was writing, I would write a new unit test for it. The unit tests weren’t complex or anything: each was a single T-SQL file. It would empty the production records out into a temp table, insert test seed data into the relationship table, then run the clustering stored procedure and use IF EXISTS (SELECT…) and RAISERROR statements to report any failed expectations. At the end, it would delete the test seed data and put the production records back into the table. Unit tests saved a lot of time and headache in this work.

An architect’s work is never done. That’s why the Pharaohs used to bury them with their works. To shut them up.

This Android game (Osmos) was my inspiration for how I imagined solving the problem.

If you liked this article, then show some love and click the heart on the left to recommend it. Follow me on Medium for enthusiastic, dry-witted uber-nerd content. I am also on Twitter and Instagram.
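For anyone curious what that test pattern looks like outside T-SQL, here is a Python-flavored sketch of the same flow: park the production rows, seed test data, run the procedure, assert the expectations, restore. The table and clustering function below are stand-ins invented for illustration, not the client’s schema:

```python
# Sketch of the unit-test pattern described above, in Python instead
# of T-SQL: back up the "production" rows, insert seed data, run the
# procedure, assert expectations, then restore. Illustrative stand-ins.

def cluster_parents(rows):
    # Stand-in for the clustering stored procedure: map each id to the
    # smallest id it is transitively related to.
    parent = {x: x for pair in rows for x in pair}
    changed = True
    while changed:
        changed = False
        for a, b in rows:
            low = min(parent[a], parent[b])
            if parent[a] != low or parent[b] != low:
                parent[a] = parent[b] = low
                changed = True
    return parent

relationship_table = [(101, 102), (103, 104)]   # pretend production rows

def run_unit_test():
    backup = list(relationship_table)           # empty production rows into a "temp table"
    relationship_table[:] = [(1, 2), (2, 3)]    # insert test seed data
    try:
        result = cluster_parents(relationship_table)
        # The IF EXISTS (...) RAISERROR equivalent:
        assert result == {1: 1, 2: 1, 3: 1}, "clustering expectation failed"
    finally:
        relationship_table[:] = backup          # put production records back

run_unit_test()
print("all expectations passed")
```

The try/finally mirrors the T-SQL cleanup step: even a failed expectation still restores the production rows.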