After discussing this with a few people, I think we've worked out that clustering is not appropriate for what we're doing. The purpose of clustering seems to be to identify records that are connected through similarity, not to transfer weights onto implicit links. In other words, Splink reports which records are connected, but not the weights associated with those connections. This means that our link collapsing approach, when applied to the connected clusters, was actually harmful: it preserved only a subset of the links we wanted to capture. The problem is most obvious with our one-to-many link collapsing, because a link collapsing step carried out after Splink clustering, but before uid-collapsing, was removing most of the "many" side of the links.
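To illustrate the point about clusters carrying membership but not weights, here is a minimal sketch of threshold-based connected-components clustering. It is not Splink's implementation; the function name `cluster`, the record IDs and the probabilities are all hypothetical, and it only assumes that clustering groups records joined by any chain of links above a threshold.

```python
def cluster(pairs, threshold):
    """Group record IDs into connected components using links at or above threshold.

    pairs: iterable of (id_a, id_b, match_probability) tuples (hypothetical data).
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, prob in pairs:
        root_a, root_b = find(a), find(b)  # registers both IDs, even if not linked
        if prob >= threshold and root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), set()).add(x)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical pairwise predictions: A-B and B-C are strong, A-C is weak.
pairs = [("A", "B", 0.95), ("B", "C", 0.90), ("A", "C", 0.20)]
clusters = cluster(pairs, threshold=0.5)
print(clusters)
```

In this sketch A and C land in the same cluster even though their direct pairwise probability is only 0.20, and the cluster output says nothing about what weight the implicit A-C link should have. Any collapsing step that runs on the clusters has to get its weights from somewhere else.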
Here is how the Splink API documentation describes its clustering:
I thought that "mutually best match" would mean Splink does the right thing and keeps as many links as possible, but I can't get my head around what is going on here. I understand that Splink uses an iterative process for choosing links, and that it is substantially slower than a simple "choose best link" algorithm, so I assume it's doing something more than that.
We have a separate program, implemented in SAS, that does link collapsing based on individual identifiers. We run it when evaluating the link rate and false positive rate for links in our link database. This is the approach we use:
This is the relevant part of the code in SAS:
And this is how I've implemented it in SQL for use in our Splink workflow (which we currently run after Splink clustering, because we assume that Splink is doing some other magic to properly deduplicate links):
With this approach, I can see that links will be dropped due to a naive "best-link" prioritisation. Here's a demonstrative example:
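As a rough Python sketch of how a naive "best-link" prioritisation drops links: this is not the SAS or SQL code referred to above, and the function name `greedy_one_to_one`, the IDs and the probabilities are hypothetical values chosen only to show the effect.

```python
def greedy_one_to_one(links):
    """Keep links in descending probability order, skipping any link whose
    left or right ID has already been used by a kept link.

    links: iterable of (left_id, right_id, probability) tuples (hypothetical data).
    """
    used_left, used_right, kept = set(), set(), []
    # Highest probability first; ties go to the lowest-numbered IDs.
    for left, right, prob in sorted(links, key=lambda l: (-l[2], l[0], l[1])):
        if left not in used_left and right not in used_right:
            kept.append((left, right, prob))
            used_left.add(left)
            used_right.add(right)
    return kept


# Hypothetical candidate links between a left and a right dataset.
links = [(1, 1, 0.90), (1, 2, 0.85), (2, 1, 0.80)]
greedy = greedy_one_to_one(links)
print(greedy)
```

With these made-up numbers the greedy pass keeps only the 0.90 link, because that link blocks both of the others. Keeping the 0.85 and 0.80 links instead would retain two links rather than one.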
Because this is such a small example, I can work out by hand an optimal solution that maximises the sum of link probabilities in the result (with ties resolved by choosing the lowest-numbered IDs). Coincidentally, this same example turns out to be a case where the optimal one-to-one solution contains different links from the optimal one-to-many solution:
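For tiny inputs, the "work it out by hand" step can be mimicked with a brute-force search. The sketch below is only an illustration: the function name `optimal_links` and the data are hypothetical, "one-to-many" is assumed to mean that each right-hand ID keeps at most one link while a left-hand ID may keep several, and ties simply go to the first subset encountered in sorted order, which is only a rough stand-in for the lowest-ID rule.

```python
from collections import Counter
from itertools import combinations


def optimal_links(links, one_to_many=False):
    """Exhaustively find the subset of links with the highest total probability.

    Each right ID may appear at most once. Each left ID may appear at most
    once unless one_to_many is True. Runtime is exponential in len(links),
    so this is only usable on very small examples.
    """
    best, best_total = [], 0.0
    for size in range(1, len(links) + 1):
        for subset in combinations(sorted(links), size):
            lefts = Counter(left for left, _, _ in subset)
            rights = Counter(right for _, right, _ in subset)
            if max(rights.values()) > 1:
                continue  # a right ID is linked more than once
            if not one_to_many and max(lefts.values()) > 1:
                continue  # a left ID is linked more than once
            total = round(sum(prob for _, _, prob in subset), 9)
            if total > best_total:  # strict: the first subset found wins ties
                best, best_total = list(subset), total
    return best, best_total


# Hypothetical candidate links: (left_id, right_id, probability).
links = [(1, 1, 0.90), (1, 2, 0.85), (2, 1, 0.80)]
one_to_one, total_1to1 = optimal_links(links)
one_to_many, total_1tom = optimal_links(links, one_to_many=True)
print(one_to_one, total_1to1)
print(one_to_many, total_1tom)
```

On this made-up data the one-to-one optimum keeps the 0.85 and 0.80 links, while the one-to-many optimum keeps the 0.90 and 0.85 links, so the two solutions share only one link. For realistic data volumes a proper assignment solver would be needed instead of enumeration.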