Thanks, these are useful findings.

One primary reason for blocking, materialising the blocked pairs, and then predicting from the blocked pairs is that it helps parallelisation. Blocking joins do not parallelise well (or at least, not predictably), because SQL engines tend not to be well optimised for queries that create lots of new rows (i.e. the mini Cartesian products you get 'within' blocking rules).

As a result, big workflows often fail to parallelise, e.g. because of straggler tasks or other issues.
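To make the skew concrete, here is a small sketch in plain Python (not part of any library's API). It assumes a hypothetical `surname` blocking key with one very common value, and counts how many candidate pairs each key value generates. The helper name `pairs_per_block` and the data are made up for illustration.

```python
from collections import Counter


def pairs_per_block(keys):
    """Count the candidate pairs produced within each blocking-key value.

    A block with n records yields n * (n - 1) / 2 unordered pairs, which is
    the 'mini Cartesian product' created by the blocking join.
    """
    counts = Counter(keys)
    return {key: n * (n - 1) // 2 for key, n in counts.items()}


# Hypothetical blocking column: one very common surname, several rare ones.
surnames = ["smith"] * 1000 + ["jones"] * 100 + ["zielinski"] * 3 + ["okafor"] * 2

pair_counts = pairs_per_block(surnames)
total_pairs = sum(pair_counts.values())

# Share of all candidate pairs that come from the single largest block.
largest_share = max(pair_counts.values()) / total_pairs

for key, n_pairs in sorted(pair_counts.items(), key=lambda kv: -kv[1]):
    print(f"{key:10s} {n_pairs:>8d} pairs")
print(f"largest block share: {largest_share:.1%}")
```

If the engine hands out work by blocking-key value, whichever task receives the largest block ends up with almost all of the pairs, which is one way a straggler task can arise.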

If you create the blocked pairs first, then you effectively eliminate skew, and the database engine also 'knows' how many rows it's dealing with, so can split up the workloa…
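The two-step pattern can be sketched with the standard-library `sqlite3` module. This is only an illustration under stated assumptions, not the library's actual implementation: the table names (`records`, `blocked_pairs`), the toy comparison (first two characters of `first_name`), and the helper `score_chunk` are all hypothetical. SQLite runs the chunks one after another here; the point is only that, once the pairs are materialised, the row count is known and the work divides into equal-sized pieces.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE records (id INTEGER PRIMARY KEY, first_name TEXT, surname TEXT)"
)
con.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [
        (1, "john", "smith"),
        (2, "jon", "smith"),
        (3, "jane", "smith"),
        (4, "amy", "jones"),
        (5, "amie", "jones"),
        (6, "raj", "patel"),
    ],
)

# Step 1: run the blocking join once and materialise the candidate pairs.
con.execute(
    """
    CREATE TABLE blocked_pairs AS
    SELECT l.id AS id_l, r.id AS id_r
    FROM records AS l
    JOIN records AS r
      ON l.surname = r.surname AND l.id < r.id
    """
)
n_pairs = con.execute("SELECT COUNT(*) FROM blocked_pairs").fetchone()[0]


def score_chunk(chunk, n_chunks):
    """Step 2: score one evenly sized slice of the materialised pairs."""
    return con.execute(
        """
        SELECT p.id_l, p.id_r,
               substr(l.first_name, 1, 2) = substr(r.first_name, 1, 2) AS name_match
        FROM blocked_pairs AS p
        JOIN records AS l ON l.id = p.id_l
        JOIN records AS r ON r.id = p.id_r
        WHERE p.rowid % ? = ?
        """,
        (n_chunks, chunk),
    ).fetchall()


n_chunks = 2
chunks = [score_chunk(i, n_chunks) for i in range(n_chunks)]
chunk_sizes = [len(c) for c in chunks]
n_matches = sum(row[2] for c in chunks for row in c)
print(n_pairs, chunk_sizes, n_matches)
```

Because the slices are defined on row position in `blocked_pairs` rather than on the blocking key, a very common key value no longer concentrates work in a single task.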

Answer selected by gringer
RobinL May 16, 2026
Maintainer

RobinL Jun 3, 2026
Maintainer

RobinL Jun 4, 2026
Maintainer
