Leveraging small labeled datasets for improving dedupe only model #2869
Replies: 2 comments
-
I think that's a good approach - that's exactly how I would do it
Honestly, I think it's unlikely to make a big difference. It will only really affect the [...] I think you'll find there is more mileage to:
I've written a bit more about this here:
-
Do you need to do this at scale? If not, you might want to just use an LLM directly. I found Gemini (huge context window) or Claude Code (essentially free if you have the $20 tier) to be quite good. Similarly: how accurate do you need to be, and are you more worried about false negatives or false positives?
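To make the false negatives vs. false positives question concrete, here is a minimal sketch of checking a match-probability threshold against a small labeled set. It assumes each labeled pair already carries a score from the model; the function name `precision_recall` and the sample scores are made up for illustration and are not part of Splink.

```python
from typing import List, Tuple


def precision_recall(
    scored: List[Tuple[float, bool]], threshold: float
) -> Tuple[float, float]:
    """Precision and recall when pairs scoring >= threshold are called duplicates.

    Each item is (match_probability, is_duplicate_according_to_the_label).
    """
    tp = sum(1 for p, is_dup in scored if p >= threshold and is_dup)
    fp = sum(1 for p, is_dup in scored if p >= threshold and not is_dup)
    fn = sum(1 for p, is_dup in scored if p < threshold and is_dup)
    # Convention: with nothing predicted (or nothing to find), report 1.0.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


# Hypothetical labeled pairs: (model score, human/LLM label).
labeled = [
    (0.98, True),
    (0.91, True),
    (0.75, False),
    (0.60, True),
    (0.30, False),
    (0.05, False),
]

for threshold in (0.5, 0.9):
    precision, recall = precision_recall(labeled, threshold)
    print(f"threshold={threshold}: precision={precision:.2f} recall={recall:.2f}")
```

A lower threshold catches more true duplicates at the cost of more false positives, and a higher one does the reverse, so which threshold is "best" depends on which error matters more for the use case.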
-
Hello
Thanks for the amazing work on Splink. I wanted to pick your brains on this:
My use case is to be able to quickly answer the question: does this list of twenty records contain any duplicates?
I have trained a dedupe-only model on a large unlabeled dataset (>1M rows) and use it to predict duplicates in new, unseen small datasets (say 100 rows). The model is completely static.
Let's assume we can manually label a small number of example pairs over time (using LLMs or human annotators, for example) and build a small labeled dataset of about 1,000 pairs per day (duplicate / not duplicate). Would that be useful for improving the existing model, for example by dynamically setting some hyperparameters to produce an updated model each day?
Thanks a lot
Best
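As an illustration of what a daily batch of labeled pairs could feed into, here is a minimal sketch of estimating m probabilities from labels in a Fellegi-Sunter style model: for each column, the share of labeled duplicates that fall into each comparison level. The data layout, the level names (`exact`, `fuzzy`, `different`) and the function `estimate_m` are assumptions made up for this sketch; it does not reproduce Splink's own label-based estimation.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def estimate_m(labeled_pairs: List[dict]) -> Dict[str, Dict[str, float]]:
    """Estimate m = P(comparison level | pair is a true duplicate) per column.

    Only pairs labeled as duplicates contribute; non-duplicates are skipped.
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    n_matches = 0
    for pair in labeled_pairs:
        if not pair["is_duplicate"]:
            continue
        n_matches += 1
        for column, level in pair["levels"].items():
            counts[column][level] += 1
    # If there were no labeled duplicates, counts is empty and {} is returned.
    return {
        column: {level: c / n_matches for level, c in level_counts.items()}
        for column, level_counts in counts.items()
    }


# Hypothetical labeled pairs with the comparison level observed per column.
pairs = [
    {"is_duplicate": True, "levels": {"name": "exact", "email": "exact"}},
    {"is_duplicate": True, "levels": {"name": "fuzzy", "email": "exact"}},
    {"is_duplicate": True, "levels": {"name": "exact", "email": "different"}},
    {"is_duplicate": True, "levels": {"name": "exact", "email": "exact"}},
    {"is_duplicate": False, "levels": {"name": "different", "email": "different"}},
]

m = estimate_m(pairs)
print(m["name"])
print(m["email"])
```

With roughly 1,000 pairs per day, only a fraction will be true duplicates, so estimates for rare comparison levels would be noisy; pooling labels across days is one way to stabilise them.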