Block on only DOB #3000

gringer · 2026-03-31T03:10:48Z


Conceptually, blocking on only DOB seems like something that should be feasible, and yet Splink is choking for us when we use DOB alone as a blocking rule for 2M-10M records, with 90 GB disk-cache memory errors from DuckDB, i.e. "max_temp_directory_size". Here's an example of the error we get. This wasn't for a DOB-only blocking rule (FWIW, it was when I checked for exact match first name, exact match last name, exact match sex, and DOB difference of <= 3 years), but the type of error is identical:

If I think about the splits, I could assume 50 years' worth of records with 365 days per year, giving 18,250 partitions, which works out to an average of around 550 records per partition for 10M records (more years means more partitions; I'm choosing 50 as a ball-park figure). In my own testing of partitions using Splink's investigation functions, that goes up to something like 4000 records for some dates.

But even so... 4000 records isn't something that I expect would be running into memory issues.

So... what's going on here? Is Splink generating all pairwise comparisons for all partitions all at once before exploring any of them? If so, is there any way to encourage serial processing of the blocks to reduce memory consumption? Will sharding in Splink v5 or removal of the implicit cache fix this issue?


RobinL · 2026-04-03T15:18:30Z

Maintainer

The issue here is that Splink computes the whole lot (all buckets) in parallel. That's a big reason it's fast.

So, using your example of 18,250 buckets: if there are 100 records per bucket (roughly 2M records in total), then you get n(n-1)/2 comparisons per bucket = 5,000ish, and hence, at the lower end, roughly 100M comparisons.

That's a lower bound because as you say, if you have skew it's worse. So your 4,000 record single dob creates roughly 8 million comparisons.

You're right to say that 8 million comparisons is not too many, but we're not computing one bucket at a time, we're doing all simultaneously.
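The arithmetic above can be checked with a few lines of Python. This is only a sketch of the counting argument: the bucket sizes are the ball-park figures from this thread, and the function name is made up for illustration.

```python
def pairs_within_bucket(n: int) -> int:
    """Number of unordered record pairs in a bucket of n records: n(n-1)/2."""
    return n * (n - 1) // 2

# Ball-park figures from the thread: 50 years x 365 days of distinct dobs.
n_buckets = 50 * 365

# No skew: every dob bucket holds 100 records (about 1.8M records in total).
even_total = n_buckets * pairs_within_bucket(100)

# Skew: a single heavily populated dob with 4,000 records.
largest_bucket = pairs_within_bucket(4_000)

print(f"{n_buckets:,} buckets x 100 records: {even_total:,} pairs")
print(f"one bucket of 4,000 records: {largest_bucket:,} pairs")
```

The evenly spread case already comes to about 90 million pairs, and each skewed date adds millions more on top, all of which are generated in the same pass.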

Of course, you could run this bucket by bucket yourself by iterating through each dob, filtering the input data down to that dob, and running predict() on just that bucket.

If you really want to compute all within-dob comparisons, sharding (chunking) in Splink v5 will help with this quite a lot, so you could switch to that. The key thing here is that I imagine you'll only want to retain a very small percentage of the comparisons. So like, in your 4,000 -> 8 million comparison DOB, you may only keep 100 rows (by e.g. setting threshold_match_weight=1). Since it will run in an arbitrary number of chunks, you shouldn't run out of disk space.
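To illustrate why a threshold shrinks the output but not the work, here is a toy sketch. It is not Splink's scoring model: `score` is a made-up stand-in for a match weight, and the field names are assumptions. The point is only that every pair in a bucket is still scored, while just the pairs at or above the threshold are kept.

```python
from itertools import combinations

def score(a: dict, b: dict) -> float:
    """Toy stand-in for a match weight: +5 per agreeing field, -5 per disagreeing one."""
    return sum(5.0 if a[k] == b[k] else -5.0 for k in ("first_name", "surname"))

def predict_bucket(records: list, threshold: float):
    """Score every pair in one dob bucket; keep only pairs at or above the threshold."""
    scored = 0
    kept = []
    for a, b in combinations(records, 2):
        scored += 1                      # the comparison is always computed
        weight = score(a, b)
        if weight >= threshold:          # but only strong pairs are retained
            kept.append((a["id"], b["id"], weight))
    return scored, kept

# Four records sharing one dob (illustrative data only).
bucket = [
    {"id": 1, "first_name": "ann", "surname": "lee"},
    {"id": 2, "first_name": "ann", "surname": "lee"},
    {"id": 3, "first_name": "bob", "surname": "lee"},
    {"id": 4, "first_name": "cat", "surname": "ray"},
]
scored, kept = predict_bucket(bucket, threshold=1)
print(f"scored {scored} pairs, kept {len(kept)}")
```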

0 replies

gringer · 2026-04-06T02:42:39Z

Author

If you really want to compute all within-dob comparisons, sharding (chunking) in Splink v5 will help with this quite a lot, so you could switch to that.

That's great to know, thanks. I'll try to work towards getting authorisation to use Splink v5.

in your 4,000 -> 8million comparison DOB, you may only keep 100 rows (by e.g. setting threshold_match_weight=1). Since it will run in an arbitrary number of chunks you shouldn't run out of disk space.

Does that mean the memory consumption [from computing all buckets] will be reduced in Splink v4 by using a threshold weight?

Of course, you could run this bucket by bucket yourself by iterating through each dob, filtering the input data down to that dob, and running predict() on just that bucket.

Are rules operated on sequentially? Could I explode the rule set to do something that has a similar memory reduction effect?

(l.MonthOfBirth = 1) AND (l.splink_birth_date = r.splink_birth_date)
(l.MonthOfBirth = 2) AND (l.splink_birth_date = r.splink_birth_date)
...
(l.MonthOfBirth = 12) AND (l.splink_birth_date = r.splink_birth_date)

Or were you suggesting separate predict steps, and joining the tables afterwards?

1 reply

RobinL Apr 6, 2026
Maintainer

Does that mean the memory consumption [from computing all buckets] will be reduced in Splink v4 by using a threshold weight?

Yes. At the very least the size of the output dataset should be far smaller, especially when using a very loose blocking rule like blocking on dob. It still has to do the same number of calculations, it just only materialises a small % of the results.

Are rules operated on sequentially? Could I explode the rule set to do something that has a similar memory reduction effect?

No - they're computed in parallel. If you want the behaviour you're implying, you'd want to have an outer loop over MonthOfBirth, something like (pseudocode):

for each m in MonthOfBirth:
   df_filtered = df.filter(MonthOfBirth = m)
   linker.inference.predict(df_filtered)

and then UNION ALL the results (or just save them as parquet to a folder and load them using read_parquet("path/*parquet")).
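A runnable toy version of that outer loop might look like the following. It deliberately avoids the Splink API: `predict_pairs` is a hypothetical stand-in for one predict run blocked on dob, and the record layout (`id`, ISO-formatted `dob`) is an assumption. With Splink, each iteration would instead run the linker over the filtered data and save that chunk's output.

```python
from collections import defaultdict
from itertools import combinations

def predict_pairs(records: list) -> list:
    """Hypothetical stand-in for one predict run blocked on dob: all same-dob id pairs."""
    by_dob = defaultdict(list)
    for r in records:
        by_dob[r["dob"]].append(r["id"])
    return [pair for ids in by_dob.values() for pair in combinations(sorted(ids), 2)]

def predict_by_month(records: list) -> list:
    """Run predict_pairs once per month of birth and concatenate the results."""
    by_month = defaultdict(list)
    for r in records:
        by_month[r["dob"][5:7]].append(r)   # "YYYY-MM-DD" -> "MM"
    results = []
    for month in sorted(by_month):
        # Only this month's records are compared in this iteration. In a real
        # run, write each chunk to disk rather than growing a list in memory.
        results.extend(predict_pairs(by_month[month]))
    return results

records = [
    {"id": 1, "dob": "1980-01-15"},
    {"id": 2, "dob": "1980-01-15"},
    {"id": 3, "dob": "1975-06-02"},
    {"id": 4, "dob": "1975-06-02"},
    {"id": 5, "dob": "1975-06-02"},
    {"id": 6, "dob": "1990-01-15"},
]
chunked = predict_by_month(records)
print(sorted(chunked) == sorted(predict_pairs(records)))
```

Splitting by month loses no pairs here because two records with the same dob always share a month of birth, so the chunked result matches the all-at-once result while each iteration only handles about a twelfth of the data.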

Answer selected by gringer
