Multi-stage blocking - how do I combine results? #3044

gringer · 2026-04-23T04:07:12Z

I'm currently working on getting multi-stage blocking into our Splink workflows. I have a blocking_stage variable which stores the current stage, and am feeding in rules from a JSON file, where they are stored as an array of rule sets (one set per stage). Here's the current loading part of my code, which fetches the trained model from the JSON file as a dict, then patches in the blocking rules for the current stage. I assume there's a more direct way to do this, but it seems to be working well enough:

import json
import os

from splink import Linker, block_on, blocking_analysis
from splink.blocking_rule_library import CustomRule

if blocking_stage > 0:
    if (not multiple_stages) or (blocking_stage + 1 > len(params["rule_params"])):
        raise Exception(f"Blocking stage loop counter ({blocking_stage + 1}) greater than expected rule stage count from JSON file")

### Load Settings model into dict
filename = os.path.join(outputs_folder, "settings_model.json")
logger.info(f"Loading trained model from settings file: {filename}")
with open(filename, "r") as f:
    settings_dict = json.load(f)

# Only log the stage number for multi-stage runs (avoids logging an empty message)
if multiple_stages:
    logger.info(f"Loading blocking rules for stage {blocking_stage + 1}")
stage_rules = params["rule_params"][blocking_stage] if multiple_stages else params["rule_params"]

### add next blocking rule set to the dict
blocking_rules = [
    block_on(*rule["columns"], arrays_to_explode=rule.get("arrays_to_explode", []))
    if "columns" in rule
    else CustomRule(
        blocking_rule=rule["blocking_rule"],
        sql_dialect=rule["sql_dialect"],
        arrays_to_explode=rule.get("arrays_to_explode", []),
    )
    for rule in stage_rules
]

settings_dict["blocking_rules_to_generate_predictions"] = blocking_rules

linker = Linker(
    [
        connections["processing"]["conn"].table(f"{params['lhs_table_name']}_clean"),
        connections["processing"]["conn"].table(f"{params['rhs_table_name']}_clean")
    ],
    settings=settings_dict,
    db_api=connections["processing"]["api"]
)
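
The stage check and rule lookup above could be folded into one small helper. The following is a minimal sketch in plain Python, not part of the original post: the function name `select_stage_rules`, the example column names, and the exact JSON layout are assumptions inferred from how `params["rule_params"]` is indexed in the code above.

```python
from typing import Any

# Hypothetical layout of params["rule_params"] for a multi-stage run:
# a list of stages, each stage being a list of rule dicts.
rule_params: list[list[dict[str, Any]]] = [
    [  # stage 1
        {"columns": ["first_name", "surname", "dob"]},
    ],
    [  # stage 2
        {"columns": ["postcode"], "arrays_to_explode": []},
        {"blocking_rule": "l.surname = r.surname", "sql_dialect": "duckdb"},
    ],
]


def select_stage_rules(rule_params, blocking_stage, multiple_stages):
    """Return the rule dicts for one stage, failing early on bad input."""
    if multiple_stages:
        if not 0 <= blocking_stage < len(rule_params):
            raise ValueError(
                f"Blocking stage {blocking_stage + 1} is outside the "
                f"{len(rule_params)} stage(s) defined in the JSON file"
            )
        stage_rules = rule_params[blocking_stage]
    else:
        if blocking_stage != 0:
            raise ValueError("A single-stage run only has stage 1")
        stage_rules = rule_params  # flat list of rule dicts

    # Each rule needs either "columns" or a custom rule with its dialect.
    for rule in stage_rules:
        is_custom = {"blocking_rule", "sql_dialect"} <= rule.keys()
        if "columns" not in rule and not is_custom:
            raise ValueError(f"Unrecognised blocking rule entry: {rule}")
    return stage_rules


stage_rules = select_stage_rules(rule_params, blocking_stage=1, multiple_stages=True)
```

Validating the rule entries up front means a malformed JSON entry fails with a readable message before any `block_on` or `CustomRule` object is built.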

My current challenge is on the other end of things, combining results from multiple stages. I thought that a row-based append concatenation would work:

con = connections["processing"]["conn"]
# Set up output result table
table_name = df_predict_one_to_one.physical_name
if blocking_stage == 0:  # the first stage creates a new table
    con.execute(f"CREATE OR REPLACE TABLE predict_results_all AS SELECT * FROM {table_name} LIMIT 0")
else:  # subsequent stages will append to an existing table (if it exists)
    con.execute(f"CREATE TABLE IF NOT EXISTS predict_results_all AS SELECT * FROM {table_name} LIMIT 0")

# Append results
con.execute(f"INSERT INTO predict_results_all SELECT * FROM {table_name}")

Unfortunately, the different stages have different field names, and my code spits out an error:

---------------------------------------------------------------------------
BinderException                           Traceback (most recent call last)
Cell In[41], line 12
      9 print(f"== Blocking Stage {blocking_stage + 1} ==\n")
     11 # Append results
---> 12 con.execute(f"INSERT INTO predict_results_all SELECT * FROM {table_name}")
     14 # Count matching rows from input tables
     15 lhs_total = con.sql(f"""
     16 SELECT
     17     COUNT(*) AS count_lhs
     18 FROM {params["lhs_table_name"]}_clean
     19 """).fetchone()[0]

BinderException: Binder Error: table predict_results_all has 57 columns but 48 values were supplied

What seems to be happening is that output fields are filtered to only include input field names that were used in the blocking rule set. Is there any way in the current Splink code to make the outputs from different predict runs consistent?

- retaining all input column names, OR
- restricting the output to a subset that should be present in all outputs, regardless of the blocking rules used

[If not, I guess I just need to inspect the columns, and pick out the ones that are important to me]
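
The fallback of inspecting the columns could look roughly like the sketch below, which appends by explicit column name rather than with `SELECT *`. It is an illustration only: Python's built-in sqlite3 stands in for the DuckDB connection so the example runs without extra dependencies, the helper names `table_columns` and `append_shared_columns` are hypothetical, and the two stage tables are made-up data.

```python
import sqlite3


def table_columns(con, table):
    """List a table's column names in declaration order (empty if missing)."""
    return [row[1] for row in con.execute(f'PRAGMA table_info("{table}")')]


def append_shared_columns(con, target, source):
    """Create `target` from `source` if needed, else append by column name."""
    source_cols = table_columns(con, source)
    target_cols = table_columns(con, target)
    if not target_cols:
        # First stage: the target takes the first stage's full schema.
        con.execute(f'CREATE TABLE "{target}" AS SELECT * FROM "{source}"')
        return source_cols
    # Later stages: insert only the columns both tables have. Target columns
    # missing from the source are left NULL; source-only columns are dropped.
    shared = [c for c in target_cols if c in source_cols]
    col_list = ", ".join(f'"{c}"' for c in shared)
    con.execute(
        f'INSERT INTO "{target}" ({col_list}) SELECT {col_list} FROM "{source}"'
    )
    return shared


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE stage_1 (unique_id_l, unique_id_r, match_weight, surname_l, surname_r)")
con.execute("INSERT INTO stage_1 VALUES (1, 2, 7.5, 'Smith', 'Smith')")
con.execute("CREATE TABLE stage_2 (unique_id_l, unique_id_r, match_weight, postcode_l, postcode_r)")
con.execute("INSERT INTO stage_2 VALUES (3, 4, 4.2, 'AB1', 'AB1')")

append_shared_columns(con, "predict_results_all", "stage_1")
shared = append_shared_columns(con, "predict_results_all", "stage_2")
rows = con.execute(
    "SELECT unique_id_l, match_weight, surname_l FROM predict_results_all ORDER BY unique_id_l"
).fetchall()
```

Naming the columns on both sides of the INSERT avoids the column-count mismatch, at the cost of NULLs wherever a later stage did not output a column that the first stage did.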

Answered by medwar99

medwar99 · 2026-05-20T12:35:04Z

Are you after https://moj-analytical-services.github.io/splink/api_docs/settings_dict_guide.html#additional_columns_to_retain? This lets you denote which columns you want to retain in outputs.
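
As a rough sketch of how that setting might be patched into a settings dict loaded from JSON: the key name `additional_columns_to_retain` comes from the linked guide, while the helper `with_retained_columns`, its `exclude` parameter, and the example column names are hypothetical. Whether Splink tolerates listing columns it already outputs (such as the ID column or columns used in comparisons) is not confirmed in this thread, which is why the sketch allows some to be left out.

```python
def with_retained_columns(settings_dict, input_columns, exclude=()):
    """Return a copy of the settings with additional_columns_to_retain set."""
    retained = []
    for col in input_columns:
        # Keep order, skip duplicates and anything explicitly excluded.
        if col not in exclude and col not in retained:
            retained.append(col)
    patched = dict(settings_dict)  # shallow copy; the original is untouched
    patched["additional_columns_to_retain"] = retained
    return patched


# Hypothetical stand-in for the dict loaded from settings_model.json.
settings_dict = {"link_type": "link_only", "unique_id_column_name": "unique_id"}
input_columns = ["unique_id", "first_name", "surname", "postcode", "surname"]

patched = with_retained_columns(settings_dict, input_columns, exclude=("unique_id",))
```

Using the same retained list for every stage should give each predict run the same extra columns, independent of which blocking rules that stage used.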

Alternatively, depending on what you want to do with the outputs, you could select a subset of columns that are the bare minimum required to describe the table and then reconstruct it later on once you UNION / UNION ALL them together.

For df_predict and df_cluster I save a concise/compressed version which only contains the following fields:

df_predict:
- match_weight
- match_probability
- match_key
- unique_id_l
- unique_id_r
- source_dataset_l
- source_dataset_r
df_cluster:
- cluster_id
- unique_id
- source_dataset

We also sometimes include the Splink-produced Bayes factor, term frequency and gamma columns in the df_predict table (needed for certain visualisations). However, the above are the minimum we like to keep, and from these we can reconstruct the full df_predict and df_cluster tables later.
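
One way the compress-then-reconstruct idea could work is sketched below. This is an illustration under several assumptions rather than medwar99's actual pipeline: sqlite3 stands in for DuckDB, the table names and data are invented, every `_l` record is assumed to come from the left-hand input table and every `_r` record from the right-hand one, and the extra `blocking_stage` column is an addition (useful because match_key values are only meaningful within one stage's rule set).

```python
import sqlite3

# The minimal df_predict fields listed above.
KEEP = [
    "match_weight", "match_probability", "match_key",
    "unique_id_l", "unique_id_r", "source_dataset_l", "source_dataset_r",
]

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE lhs_clean (unique_id, surname);
CREATE TABLE rhs_clean (unique_id, surname);
INSERT INTO lhs_clean VALUES (1, 'Smith'), (2, 'Jones');
INSERT INTO rhs_clean VALUES (10, 'Smith'), (20, 'Jonse');

-- Stage outputs with different widths, as in the question.
CREATE TABLE predict_stage_1 (match_weight, match_probability, match_key,
    unique_id_l, unique_id_r, source_dataset_l, source_dataset_r,
    surname_l, surname_r);
INSERT INTO predict_stage_1 VALUES (9.1, 0.99, '0', 1, 10, 'lhs', 'rhs', 'Smith', 'Smith');
CREATE TABLE predict_stage_2 (match_weight, match_probability, match_key,
    unique_id_l, unique_id_r, source_dataset_l, source_dataset_r);
INSERT INTO predict_stage_2 VALUES (3.4, 0.91, '0', 2, 20, 'lhs', 'rhs');
""")

# Compress: keep only the shared minimal columns, tagged with their stage.
cols = ", ".join(KEEP)
con.execute(f"""
CREATE TABLE predict_compact AS
SELECT {cols}, 1 AS blocking_stage FROM predict_stage_1
UNION ALL
SELECT {cols}, 2 AS blocking_stage FROM predict_stage_2
""")

# Reconstruct: join back to the input tables to recover the record columns.
rebuilt = con.execute("""
SELECT p.unique_id_l, l.surname AS surname_l,
       p.unique_id_r, r.surname AS surname_r,
       p.match_probability, p.blocking_stage
FROM predict_compact AS p
JOIN lhs_clean AS l ON l.unique_id = p.unique_id_l
JOIN rhs_clean AS r ON r.unique_id = p.unique_id_r
ORDER BY p.unique_id_l
""").fetchall()
```

Because every stage contributes the same seven columns, the UNION ALL never hits a column-count mismatch, and the wider record columns are recovered from the inputs only when they are needed.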

1 reply

gringer May 21, 2026
Author

> Depending on what you want to do with the outputs, you could select a subset of columns that are the bare minimum required to describe the table and then reconstruct it later on once you UNION / UNION ALL them together.

What are the bare minimum columns required for the Comparison Viewer to work? That's what I want to get working for the final union dataset, and I don't expect it to work currently because columns are being dropped.

> Are you after https://moj-analytical-services.github.io/splink/api_docs/settings_dict_guide.html#additional_columns_to_retain? This lets you denote which columns you want to retain in outputs.

Yes, thank you. That's exactly what I was looking for! I assume I can just dump all my input columns into that, and Splink will sort out the rest...?
