Data Matching: How Does it Work?
Today we will look at the theory behind Data Matching, what is it and how it works.
Data matching is the ability to identify duplicates in large data sets. These duplicates could be people with multiple entries in one or many databases. It could also be duplicate items, of any description, in stock systems. This allows you to identify duplicates and merge them into one. Equally important, it lets you identify non duplicates.
First, how does Data Matching really work? What is the math behind it? How do we know that two “things” are really the same “thing”? What uniquely identifies something. This is something we do intuitively. We recognize features in objects that are similar, then reason that they could be the same thing. We also do this with people, objects…anything.
This problem was formalized in the 60’s in the seminal work of Fellegi and Sunter, two American statisticians. The first use was for the US census bureau. It’s called ‘Record Linkage’, i.e. how are records from different data sets linked together? For duplicate records it is sometimes called De-duplication, or the process of identifying duplicates and linking them. So, what properties help identify duplicates?
We need ‘unique identifiers’. These are properties that are unlikely to change over time. We can associate and weigh probabilities for each property. For example, noting the probability that those two things are actually the same. This can then be applied to both people and things.
The issues is that things change or become misidentified. We need to figure out what is most likely to change, such as name, address or DOB. Other things, like size and color, are less likely to change.
Record linkage is highly sensitive to the quality of the data being linked. Data should first be ‘standardized’ so it is all of a similar quality.
Now there are two sorts of data linkage-
1. Deterministic Record Linkage
a. That is based on a number of identifiers that match
2. Probabilistic Record Linkage
a. This is based on the probability that a number of identifiers match
The vast majority of Data Matching is Probabilistic Data Matching.
Finally, how do you match? The first step is called blocking. You sort data into blocks with the same attribute then identify what is unlikely to change. Finally, you do the matching. First, assign a match type for each attribute (there are lots of different ways to match these attributes). Names can be matched phonetically; dates can be matched by similarity. Next you calculate the RELATIVE weight for each match attribute. It’s similar to a measure of ‘importance’. Then you calculate the probabilities for matching and also accidently un-matching those fields. Finally, you assign an algorithm for adjusting the relative weight for each attribute to get what is called a ‘Total Match Weight’. That is then the probabilistic match for two things.
• Standardize the Data
• Pick Attributes that are unlikely to change
• Block, sort into similar sized blocks
• Match via probabilities (remember there are lots of different match types)
• Assign Weights to the matches
• Add it all up – get a TOTAL weight
The final step is to tune your matching algorithms so that you can obtain better and better matches.
There are several tools in Talend that will help with data matching that we will cover in a later blog.