Leveraging small labeled datasets for improving dedupe only model #2869

Fricadelle59 · 2026-01-05T16:41:59Z


Hello
Thanks for the amazing work on Splink. I wanted to pick your brains on this:

My use case is to be able to quickly answer the question: does this list of twenty records contain any duplicates?
Currently, I have trained a dedupe-only model on a large unlabeled dataset (>1M rows) and use it to predict duplicates in a new, unseen small dataset (say 100 rows). The model is completely static.

Let's assume we can label a small portion of example pairs over time (using LLMs or human annotators, for example) and generate a small daily labeled dataset of about 1,000 pairs (duplicate/not duplicate). Would that be useful for improving the existing model? For example, could we dynamically set some hyperparameters and get an updated model each day?

Thanks a lot

Best

RobinL · 2026-01-07T13:40:54Z

Maintainer

> Currently, I have trained a dedupe-only model on a large unlabeled dataset (>1M rows) and use it to predict duplicates in a new, unseen small dataset (say 100 rows). The model is completely static.

I think that's a good approach; that's exactly how I would do it.

> Let's assume we can label a small portion of example pairs over time (using LLMs or human annotators, for example) and generate a small daily labeled dataset of about 1,000 pairs (duplicate/not duplicate). Would that be useful for improving the existing model? For example, could we dynamically set some hyperparameters and get an updated model each day?

Honestly, I think it's unlikely to make a big difference. It will only really affect the m probabilities, and empirically we've found that the accuracy of models is not hugely sensitive to small differences in the m values.
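
One way to build intuition for this, as a rough sketch rather than a proof of the empirical claim: in a Fellegi-Sunter style model, the match weight of an agreement level is log2(m/u). The values of m and u below are made up for illustration, and `match_weight` is a hypothetical helper, not a Splink function. When m is already high, nudging it moves the weight only slightly, whereas a change in u of one order of magnitude moves it a lot.

```python
import math


def match_weight(m: float, u: float) -> float:
    """Match weight of a comparison level: log2 of the Bayes factor m / u."""
    return math.log2(m / u)


# Hypothetical probabilities for an "exact match" level of one comparison.
u = 0.01
for m in (0.85, 0.90, 0.95):
    print(f"m={m:.2f}, u={u}: match weight = {match_weight(m, u):.2f}")

# For contrast: keep m fixed and make u ten times smaller.
print(f"m=0.90, u=0.001: match weight = {match_weight(0.90, 0.001):.2f}")
```

This only covers agreement levels where m is well above u; it says nothing about how re-estimated m values behave on any particular dataset.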

I think you'll find there is more mileage in:

- using the labels to find false positives
- manually inspecting the false positives, e.g. with the waterfall chart, to understand why each one occurs and to work out whether any improvements to the model spec (i.e. the definition of the Comparisons) might help.

I've written a bit more about this here:
https://www.robinlinacre.com/measuring_data_linking_accuracy/
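
The label-driven check described above can be sketched in plain Python. This is a minimal illustration that deliberately does not use Splink's own evaluation tooling: `ScoredPair`, `false_positives`, the 0.9 threshold, and the sample rows are all hypothetical. It assumes each labeled pair has already been scored by the static model.

```python
from typing import NamedTuple


class ScoredPair(NamedTuple):
    id_l: str
    id_r: str
    match_probability: float  # score from the static model for this pair
    is_duplicate: bool        # label from a human or LLM annotator


def false_positives(pairs, threshold=0.9):
    """Pairs scored at or above the threshold but labeled as non-duplicates.

    Sorted with the most confident mistakes first, since those are usually
    the most informative ones to inspect by hand.
    """
    fps = [
        p for p in pairs
        if p.match_probability >= threshold and not p.is_duplicate
    ]
    return sorted(fps, key=lambda p: p.match_probability, reverse=True)


# Made-up labeled pairs standing in for one day's labels.
labelled = [
    ScoredPair("a1", "a2", 0.98, True),
    ScoredPair("b1", "b2", 0.97, False),
    ScoredPair("c1", "c2", 0.40, False),
    ScoredPair("d1", "d2", 0.92, False),
]

for pair in false_positives(labelled):
    print(pair.id_l, pair.id_r, pair.match_probability)
```

Each pair returned this way is a candidate for manual inspection, which is where changes to the comparison definitions tend to come from.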


nikosbosse · 2026-01-19T21:50:41Z


Do you need to do this at scale? If not, you might want to just use an LLM directly. I found Gemini (huge context window) or Claude Code (essentially free if you have the $20 tier) to be quite good. Also, how accurate do you need to be, and are you more worried about false negatives or false positives?
For the project that we're currently working on (https://github.com/futuresearch/everyrow-sdk), we're using a pipeline that starts with fuzzy matching, then uses embeddings, and lastly checks the rest using LLMs.
But I would also be very keen to learn whether your approach of training a custom model works. Do you always have the same kind of data, or do you expect some drift between training and production data?
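
A staged pipeline of the kind mentioned above (cheap fuzzy matching first, more expensive checks only for undecided pairs) could be sketched as follows. This is not the everyrow-sdk implementation: `fuzzy_stage`, `cascade`, and the thresholds are hypothetical, and the embedding and LLM stages are left as callables the reader would supply.

```python
from difflib import SequenceMatcher
from typing import Callable, Optional, Sequence

# A stage returns True (duplicate), False (not duplicate), or None (undecided).
Stage = Callable[[str, str], Optional[bool]]


def fuzzy_stage(a: str, b: str, low: float = 0.5, high: float = 0.9) -> Optional[bool]:
    """Cheap first pass: decide only the clear cases, defer the rest."""
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    if ratio >= high:
        return True
    if ratio <= low:
        return False
    return None


def cascade(a: str, b: str, stages: Sequence[Stage], default: bool = False) -> bool:
    """Run the stages in order and stop at the first one that decides."""
    for stage in stages:
        decision = stage(a, b)
        if decision is not None:
            return decision
    return default


# Only the fuzzy stage is wired in here; an embedding stage and an LLM stage
# with the same signature would be appended to the list.
print(cascade("Jon Smith", "jon smith", [fuzzy_stage]))
print(cascade("abc", "xyz", [fuzzy_stage]))
```

The point of the ordering is cost: each later stage only sees the pairs that the earlier, cheaper stages could not decide.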
