Conceptually, blocking on only DOB seems like something that should be feasible, and yet Splink is choking for us when we use DOB alone as a blocking rule for 2M-10M records, with 90 GB disk-cache memory errors from DuckDB, i.e. "max_temp_directory_size". Here's an example of the error we get. This wasn't for a DOB-only blocking rule (FWIW, it was when I checked for exact match first name, exact match last name, exact match sex, and DOB difference of <= 3 years), but the type of error is identical:
If I think about the splits, I could assume 50 years' worth of records with 365 days per year, giving 18,250 partitions, which works out to around 550 records per partition on average at 10M records (more years means more partitions; I'm choosing 50 as a ball-park figure). In my own testing of partitions using Splink's investigation functions, that goes up to something like 4,000 records for some dates. But even so, 4,000 records isn't something I'd expect to run into memory issues. So what's going on here? Is Splink generating all pairwise comparisons for all partitions at once before exploring any of them? If so, is there any way to encourage serial processing of the blocks to reduce memory consumption? Will sharding in Splink v5 or removal of the implicit cache fix this issue?
Replies: 2 comments 1 reply
The issue here is that Splink computes all buckets in parallel rather than one at a time. That's a big reason it's fast.

Take your example of 18,250 buckets. With 100 records per bucket (roughly 2M records, the lower end of your range), you get n(n-1)/2 ≈ 5,000 comparisons per bucket, and hence around 100M comparisons in total. That's a lower bound because, as you say, skew makes it worse: your single DOB with 4,000 records creates roughly 8 million comparisons on its own. You're right that 8 million comparisons is not too many, but we're not computing one bucket at a time, we're doing them all simultaneously.

Of course, you could run this bucket by bucket yourself by iterating through each DOB, filtering the input data down to that DOB, and running predict() on just that bucket.

If you really want to compute all within-DOB comparisons, sharding (chunking) in Splink v5 will help with this quite a lot, so you could switch to that. The key thing is that I imagine you'll only want to retain a very small percentage of the comparisons. For example, in your 4,000-record DOB with 8 million comparisons, you may only keep 100 rows (by e.g. setting threshold_match_weight=1). Since it will run in an arbitrary number of chunks, you shouldn't run out of disk space.
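To make the counts quoted above concrete, here is a minimal sketch of the arithmetic. It assumes the ball-park figures from the question (50 years of dates, a uniform 100 records per date, and one skewed date with 4,000 records); the function and variable names are illustrative only and are not part of Splink.

```python
def pairs_within(n: int) -> int:
    """Number of unordered record pairs inside one bucket of n records: n(n-1)/2."""
    return n * (n - 1) // 2


# Ball-park figures taken from the thread (assumptions, not measurements).
N_BUCKETS = 50 * 365        # 18,250 distinct dates of birth
RECORDS_PER_BUCKET = 100    # uniform spread, roughly 2M records overall

per_bucket = pairs_within(RECORDS_PER_BUCKET)   # comparisons in one typical bucket
total_uniform = N_BUCKETS * per_bucket          # comparisons across all buckets
skewed_bucket = pairs_within(4_000)             # comparisons in one heavily skewed bucket

print(f"per bucket:    {per_bucket:,}")
print(f"all buckets:   {total_uniform:,}")
print(f"skewed bucket: {skewed_bucket:,}")
```

The uniform total is what has to be handled when every bucket is evaluated together, which is why a single 8-million-pair bucket looks harmless in isolation while the whole job does not.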
-
That's great to know, thanks. I'll try to work towards getting authorisation to use Splink v5.
Does that mean the memory consumption [from computing all buckets] will be reduced in Splink v4 by using a threshold weight?
Are rules operated on sequentially? Could I explode the rule set to do something that has a similar memory reduction effect? Or were you suggesting separate predict steps, and joining the tables afterwards?

Yes. At the very least the size of the output dataset should be far smaller, especially when using a very loose blocking rule like blocking on DOB. It still has to do the same number of calculations; it just only materialises a small % of the results.
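As a toy illustration of that point (same number of comparisons scored, far fewer rows kept), here is a pure-Python sketch. It is not how Splink or DuckDB work internally: `score` is a hypothetical stand-in for a match weight, and `link_within_dob` and the sample records are invented for the example.

```python
from collections import defaultdict
from itertools import combinations


def score(a: dict, b: dict) -> float:
    """Hypothetical stand-in for a match weight; not Splink's model."""
    return 5.0 if a["first_name"] == b["first_name"] else -5.0


def link_within_dob(records: list[dict], threshold: float):
    """Score every within-DOB pair, but keep only pairs at or above the threshold."""
    buckets = defaultdict(list)
    for record in records:
        buckets[record["dob"]].append(record)

    n_scored = 0   # every candidate pair is still evaluated
    kept = []      # only pairs passing the threshold are materialised
    for bucket in buckets.values():
        for a, b in combinations(bucket, 2):
            n_scored += 1
            if score(a, b) >= threshold:
                kept.append((a["id"], b["id"]))
    return n_scored, kept


records = [
    {"id": 1, "dob": "1980-01-01", "first_name": "ann"},
    {"id": 2, "dob": "1980-01-01", "first_name": "ann"},
    {"id": 3, "dob": "1980-01-01", "first_name": "bob"},
    {"id": 4, "dob": "1990-06-15", "first_name": "cat"},
    {"id": 5, "dob": "1990-06-15", "first_name": "dan"},
]

n_scored, kept = link_within_dob(records, threshold=1.0)
print(f"pairs scored: {n_scored}, pairs kept: {len(kept)}")
```

In this sketch the threshold changes only the size of `kept`; `n_scored` is the same whatever threshold is chosen.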
No - they're computed in parallel. If you want the behaviour you're implying, you'd want to have an outer loop over MonthOfBirth, something like (pseudocode):