Name Matching by Brute Force – It’s Not Worth It

The client had a base of 5 million customers, each with multiple identifier fields and several IDs. Because the data was collected in a distributed fashion, there were many duplicates and mixed-up field entries: Date of Birth was missing or set to a default value, the Tax ID field held phone numbers, and vice versa. A 5MM by 5MM brute-force match with multiple matching rules was estimated to take 3 days per iteration. And simply getting the program to run without running out of memory was a separate problem altogether.
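To put the 5MM by 5MM figure in perspective, here is a quick back-of-the-envelope count of how many comparisons a brute-force pass implies. The function name is made up for illustration, and the count covers a single matching rule only; each additional rule would multiply the work.

```python
def pair_counts(n):
    """Return (full cross product, distinct unordered pairs) for n records."""
    full = n * n                 # every record against every record
    unique = n * (n - 1) // 2    # each pair once, no self-comparisons
    return full, unique

full, unique = pair_counts(5_000_000)
print(f"full cross product: {full:,}")
print(f"distinct pairs:     {unique:,}")
```

Even when each pair is compared only once, that is roughly 12.5 trillion comparisons per rule, which is why both run time and memory become a problem.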

Necessity Is the Mother of Innovation

With our backs against the wall and a limited budget and time, we set about looking for innovative ways to address the issue. We drew on team members' experience matching credit card transactions at volumes of more than 200MM per month. Choosing a cascading match, going from the most conservative rules to progressively looser ones, reduced the iteration run time from 3 days to 30 minutes and let us experiment until we got the best results.
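The general idea of a cascading match can be sketched roughly as follows. This is a minimal illustration, not the actual rules used on the project: the field names, the two passes, the 0.75 similarity threshold, and the use of Python's difflib for name similarity are all assumptions. One plausible reason a cascade is so much faster is that each pass only compares records sharing an exact key, and records matched by a strict pass drop out of the pool before the looser, more expensive passes run.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def digits(value):
    """Keep only the digits, so '123-45-6789' and '123456789' agree."""
    return "".join(ch for ch in (value or "") if ch.isdigit())

def name_similarity(a, b):
    """Similarity in [0, 1] between two names, ignoring case and extra spaces."""
    norm = lambda s: " ".join(s.lower().split())
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

def tax_id_and_dob(rec):
    tax = digits(rec["tax_id"])
    return (tax, rec["dob"]) if tax and rec["dob"] else None

def phone_only(rec):
    return digits(rec["phone"]) or None

# Hypothetical passes, ordered from most conservative to loosest:
# (label, blocking-key function, minimum name similarity within a block).
PASSES = [
    ("tax_id+dob", tax_id_and_dob, 0.0),   # strong key, no name check
    ("phone+name", phone_only, 0.75),      # weaker key, so names must agree
]

def cascade_match(records, passes):
    """Return {duplicate_id: (kept_id, pass_label)}."""
    pool = list(records)
    duplicate_of = {}
    for label, key_fn, min_similarity in passes:
        # Group the remaining records by key; records with no usable key
        # (missing or default values) skip this pass.
        blocks = defaultdict(list)
        for rec in pool:
            key = key_fn(rec)
            if key is not None:
                blocks[key].append(rec)
        matched = set()
        for block in blocks.values():
            keeper = block[0]
            for other in block[1:]:
                if name_similarity(keeper["name"], other["name"]) >= min_similarity:
                    duplicate_of[other["id"]] = (keeper["id"], label)
                    matched.add(other["id"])
        # Matched records leave the pool, so looser passes see fewer rows.
        pool = [r for r in pool if r["id"] not in matched]
    return duplicate_of

# Made-up sample records for illustration only.
RECORDS = [
    {"id": 1, "name": "John A. Smith", "tax_id": "123-45-6789", "dob": "1980-02-01", "phone": "555-0100"},
    {"id": 2, "name": "John Smith", "tax_id": "123456789", "dob": "1980-02-01", "phone": ""},
    {"id": 3, "name": "Jon Smith", "tax_id": "", "dob": "", "phone": "5550100"},
    {"id": 4, "name": "Mary Jones", "tax_id": "987-65-4321", "dob": "1975-07-04", "phone": "555-0199"},
]

result = cascade_match(RECORDS, PASSES)
print(result)
```

In this toy data, record 2 is caught by the strict pass and record 3 only by the looser one, while record 4 stays unmatched. A real pipeline would also need to handle swapped fields (a phone number sitting in the Tax ID column, for example) and to follow chains where a kept record is itself matched in a later pass.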

Did I mention that we had enough time to try out multiple name matching algorithms (about five)?