I'm currently working on getting multi-stage blocking into our Splink workflows. My current challenge is on the other end of things: combining results from multiple stages.

I thought that a row-based append concatenation would work:

Unfortunately, the different stages have different field names, and my code spits out an error:

What seems to be happening is that output fields are filtered to only include input field names that were used in the blocking rule set. Is there any way in the current Splink code to make the outputs from different stages share the same fields?

[If not, I guess I just need to inspect the columns and pick out the ones that are important to me.]
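
The fallback of inspecting the columns and keeping only the ones every stage has in common could look roughly like the sketch below. It is not Splink code: it uses an in-memory SQLite database as a stand-in for wherever the stage outputs live, and the table names (`stage_1`, `stage_2`), the column names, and the helper functions are all assumptions made for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Hypothetical stand-ins for two stages' prediction outputs.
# The column sets differ, as they do in the situation described above.
con.execute(
    "CREATE TABLE stage_1 "
    "(unique_id_l, unique_id_r, match_probability, surname_l, surname_r)"
)
con.execute(
    "CREATE TABLE stage_2 "
    "(unique_id_l, unique_id_r, match_probability, dob_l, dob_r)"
)
con.execute("INSERT INTO stage_1 VALUES (1, 2, 0.97, 'smith', 'smith')")
con.execute("INSERT INTO stage_2 VALUES (3, 4, 0.81, '1990-01-01', '1990-01-01')")


def column_names(con, table):
    # PRAGMA table_info returns one row per column; the name is at index 1.
    return [row[1] for row in con.execute(f"PRAGMA table_info({table})")]


def shared_columns(con, tables):
    # Keep the first table's column order, dropping anything missing elsewhere.
    first, *rest = tables
    others = [set(column_names(con, t)) for t in rest]
    return [c for c in column_names(con, first) if all(c in s for s in others)]


tables = ["stage_1", "stage_2"]
cols = shared_columns(con, tables)

# Row-based append restricted to the columns every stage actually has.
select_list = ", ".join(cols)
union_sql = " UNION ALL ".join(f"SELECT {select_list} FROM {t}" for t in tables)
combined = con.execute(union_sql).fetchall()
```

Selecting an explicit column list for every stage also fixes the column order, which matters because a row-based append matches columns by position rather than by name.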
Are you after https://moj-analytical-services.github.io/splink/api_docs/settings_dict_guide.html#additional_columns_to_retain? This lets you denote which columns you want to retain in outputs.

Alternatively, depending on what you want to do with the outputs, you could select a subset of columns that are the bare minimum required to describe the table and then reconstruct it later on once you UNION / UNION ALL them together.

For df_predict and df_cluster I save a concise/compressed version which only contains the following fields:

df_predict:

df_cluster:

We also sometimes include the Splink-produced bayes factor, term frequency and gamma columns in the df_predict table (needed for certain visualisations). However, the above are the minimum we like to keep, and from these we can reconstruct the full df_predict and df_cluster tables later.
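
As a rough illustration of the `additional_columns_to_retain` suggestion, the sketch below builds one settings dictionary per stage with the same retained-column list and different blocking rules. The column names, the blocking rules, and the `stage_settings` helper are hypothetical, and only the keys relevant here are shown; a real settings dictionary also needs comparisons. Whether this alone makes the stage outputs line up probably depends on which comparisons each stage uses.

```python
# Columns to carry through every stage. These names are hypothetical.
RETAINED = ["source_dataset_name", "first_name", "surname", "dob"]


def stage_settings(blocking_rules):
    # Hypothetical helper: only the keys relevant to this thread are set.
    return {
        "link_type": "dedupe_only",
        "blocking_rules_to_generate_predictions": list(blocking_rules),
        # The same list in every stage, so each output keeps these columns
        # regardless of which fields that stage's blocking rules touch.
        "additional_columns_to_retain": list(RETAINED),
    }


stage_1 = stage_settings(["l.surname = r.surname"])
stage_2 = stage_settings(["l.dob = r.dob"])
```

Keeping the retained list in one place means a column added for one stage is added for all of them.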