Trying to get my head around clustering #3049

gringer · 2026-05-01T00:25:20Z

Here is the Splink API documentation description of how Splink does clustering:

Clusters the pairwise match predictions that result from linker.inference.predict() into groups of connected records using a single best links method that restricts the clusters to have at most one record from each source dataset in the duplicate_free_datasets list.

This method will include a record into a cluster if it is mutually the best match for the record and for the cluster, and if adding the record will not violate the criteria of having at most one record from each of the duplicate_free_datasets.

I thought that "mutually best match" would mean that Splink does the right thing and keeps as many links as possible, but I can't get my head around what is actually going on here. I understand that Splink uses an iterative process for choosing links, and that it is substantially slower than a simple "choose best link" algorithm, so I assume that it's doing something more than that.

We have a separate program, implemented in SAS, that collapses links based on individual identifiers. We run it when evaluating the link rate and false positive rate for links that are in our link database. This is the approach we use:

One-to-many: pick unique RHS (node) UIDs from the input link table, choosing the most likely LHS (spine) link, or the first UID when there are ties
One-to-one: pick unique LHS (spine) UIDs from the one-to-many table, choosing the most likely RHS (node) link, or the first UID when there are ties

This is the relevant part of the code in SAS:

* Deduplicate *;
data matches_deduped;
	set links;
	if spine_unique_id='.' then spine_unique_id='1';
	if node_unique_id='.'  then node_unique_id='1';
	by node_uid;
	if first.node_uid;
run;

%if %upcase(&onetoone.) = Y %then %do;
	proc sort data = matches_deduped; by spine_uid qsmatchpassnumber descending match_weight; 
	run;
	data matches_deduped;
		set matches_deduped;
		by spine_uid;
		if first.spine_uid;
	run;
%end;

And this is how I've implemented it in SQL for use in our Splink workflow (which we currently run after Splink clustering, because we assume that Splink is doing some other magic to properly deduplicate links):

sql = f"""
    -- collect best links, grouped by unique RHS UID (one-to-many collapsing)
    WITH rhs AS (
        SELECT
            *,
            ROW_NUMBER() OVER (
                PARTITION BY snz_uid_r
                ORDER BY match_weight DESC, snz_uid_l
            ) AS rn_l
        FROM predict_results_all
    ),
    -- collect best links, grouped by unique LHS UID (one-to-one collapsing)
    lhs AS (
        SELECT
            *,
            ROW_NUMBER() OVER (
                PARTITION BY snz_uid_l
                ORDER BY match_weight DESC, snz_uid_r
            ) AS rn_r
        FROM rhs
        WHERE rn_l = 1
    )
    SELECT *
    FROM lhs
    WHERE rn_r = 1
"""
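
A minimal pure-Python sketch of the same two-pass logic, run on three made-up links, makes the behaviour easy to check by hand. The link values and the helper `best_per_key` are hypothetical and exist only for illustration; the column order follows `snz_uid_l`, `snz_uid_r`, `match_weight` from the query above.

```python
# Hypothetical toy links: (snz_uid_l, snz_uid_r, match_weight).
links = [
    (1, 11, 0.9),
    (1, 12, 0.8),
    (2, 11, 0.7),
]


def best_per_key(rows, key_idx, other_idx):
    """Keep one row per key: highest match_weight, then lowest other-side UID."""
    best = {}
    for row in sorted(rows, key=lambda r: (-r[2], r[other_idx])):
        # The first row seen for a key is its best row, so later rows are ignored.
        best.setdefault(row[key_idx], row)
    return sorted(best.values())


# Pass 1: one row per RHS UID (one-to-many collapsing).
one_to_many = best_per_key(links, key_idx=1, other_idx=0)
# Pass 2: one row per LHS UID (one-to-one collapsing).
one_to_one = best_per_key(one_to_many, key_idx=0, other_idx=1)

print(one_to_many)
print(one_to_one)
```

On this toy input, the link between 2 and 11 is lost in the first pass and the link between 1 and 12 is lost in the second, so only one of the three links survives.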

With this approach, I can see that links will be dropped due to a naive "best-link" prioritisation. Here's a demonstrative example:

Because this is such a small example, I can work out an optimal solution by hand that maximises the sum of link probabilities in the result (with ties resolved by choosing the lowest-numbered IDs). Coincidentally, that same example turns out to be a case where the optimal one-to-one solution has different links from the optimal one-to-many solution:
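
For a case this small, the hand calculation can be sketched as a brute-force search over every subset of links. This is only an illustration under stated assumptions: the three links are hypothetical data, the function names are made up, ties are broken by enumeration order rather than by a strict lowest-ID rule, and the search is exponential in the number of links, so it is not suitable for real link tables.

```python
from itertools import combinations

# Hypothetical toy links: (snz_uid_l, snz_uid_r, match_probability).
links = [(1, 11, 0.9), (1, 12, 0.8), (2, 11, 0.7)]


def is_one_to_many(subset):
    """Each RHS UID appears at most once; an LHS UID may repeat."""
    rhs = [r for _, r, _ in subset]
    return len(set(rhs)) == len(rhs)


def is_one_to_one(subset):
    """Each LHS UID and each RHS UID appears at most once."""
    lhs = [l for l, _, _ in subset]
    return len(set(lhs)) == len(lhs) and is_one_to_many(subset)


def best_subset(rows, is_valid):
    """Return the valid subset of links with the largest total probability."""
    candidates = [
        subset
        for size in range(len(rows) + 1)
        for subset in combinations(sorted(rows), size)
        if is_valid(subset)
    ]
    return max(candidates, key=lambda s: sum(p for _, _, p in s))


optimal_one_to_many = best_subset(links, is_one_to_many)
optimal_one_to_one = best_subset(links, is_one_to_one)

print(optimal_one_to_many)
print(optimal_one_to_one)
```

With these made-up links, the best one-to-many answer keeps both links from 1, while the best one-to-one answer keeps 1 with 12 and 2 with 11, so the two optima share only one link. For anything larger, an assignment solver such as SciPy's `scipy.optimize.linear_sum_assignment` would be the usual starting point for the one-to-one case; that is a suggestion, not something tested here.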

Is anyone able to explain what Splink is doing using a similar visual example?
Would Splink be able to find either of these optimal solutions using its clustering algorithms?
If so, how would I implement that through Splink functions?

gringer · 2026-05-06T03:13:01Z

After discussions with a few people, I think we've worked out that clustering is not appropriate for what we're doing.

The purpose of clustering seems to be to identify records that are connected due to similarity, but not to do any weight transfer to implicit links.

In other words, Splink reports connected records, but not the associated weights.

This means that our link collapsing approach, when applied to the connected clusters, was actually harmful: it preserved only a subset of the links that we wanted to capture. This is most obviously an issue with our one-to-many link collapsing, where a link collapsing step carried out after Splink clustering, but before uid-collapsing, was removing most of the "many" side of the links.
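
To illustrate the point about clusters carrying no weights, here is a minimal connected-components sketch using a small union-find. It is an assumption-laden toy, not Splink's implementation: the links, the threshold, and the `("l", uid)` / `("r", uid)` node labels are all hypothetical.

```python
from itertools import product

# Hypothetical scored links: (snz_uid_l, snz_uid_r, match_probability).
links = [(1, 11, 0.9), (1, 12, 0.8), (2, 11, 0.7)]
threshold = 0.5  # assumed cut-off, chosen for illustration only

parent = {}


def find(node):
    """Return the representative (root) of the cluster containing node."""
    parent.setdefault(node, node)
    while parent[node] != node:
        parent[node] = parent[parent[node]]  # path halving
        node = parent[node]
    return node


def union(a, b):
    """Merge the clusters of a and b, keeping the smaller root as the ID."""
    root_a, root_b = find(a), find(b)
    if root_a != root_b:
        parent[max(root_a, root_b)] = min(root_a, root_b)


for uid_l, uid_r, prob in links:
    if prob >= threshold:
        union(("l", uid_l), ("r", uid_r))

# The output maps each record to a cluster ID; no weights are attached.
clusters = {node: find(node) for node in sorted(parent)}

# Pairs that share a cluster but were never scored directly.
scored = {(l, r) for l, r, _ in links}
lhs = [uid for side, uid in clusters if side == "l"]
rhs = [uid for side, uid in clusters if side == "r"]
implicit = [
    (l, r)
    for l, r in product(lhs, rhs)
    if clusters[("l", l)] == clusters[("r", r)] and (l, r) not in scored
]

print(clusters)
print(implicit)
```

All four toy records end up in one cluster, and the pair 2 and 12 is connected only implicitly, with no weight of its own. Collapsing on the original pairwise weights after this step can therefore only ever see the explicitly scored links.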
