Is there a rule-of-thumb difference between the blocking rules in settings and the "training" blocking rules? #1543
Replies: 1 comment
For training rules, I've been trying to optimise to reduce false positives when there's a link:
https://moj-analytical-services.github.io/splink/topic_guides/blocking/model_training.html

For the blocking rules for prediction, I try to optimise to reduce false negatives. That means thinking of all the edge cases that correspond to an "almost-there" link, and creating rule sets, each as small as possible, that capture those edge cases:
https://moj-analytical-services.github.io/splink/topic_guides/blocking/blocking_rules.html
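
To make that trade-off concrete, here is a toy sketch in plain Python. It is not Splink's API: the records, the `person` ground-truth column, the `candidate_pairs` helper, and the extra `dob` rule are all made up for illustration. It treats each rule as a tuple of columns that must match exactly, and compares a strict training rule against looser prediction rules.

```python
from itertools import combinations

# Toy records, invented for illustration. "person" is ground truth that
# would not be available in a real dedupe job.
records = [
    {"id": 1, "person": "A", "first_name": "john", "surname": "smith", "dob": "1990-01-01"},
    {"id": 2, "person": "A", "first_name": "john", "surname": "smith", "dob": "1990-01-01"},
    {"id": 3, "person": "A", "first_name": "jon", "surname": "smith", "dob": "1990-01-01"},   # first-name typo
    {"id": 4, "person": "A", "first_name": "john", "surname": "smyth", "dob": "1990-01-01"},  # surname typo
    {"id": 5, "person": "B", "first_name": "mary", "surname": "jones", "dob": "1982-03-14"},
    {"id": 6, "person": "C", "first_name": "john", "surname": "jones", "dob": "1971-11-23"},
]


def candidate_pairs(records, rules):
    """Return id pairs that agree exactly on every column of at least one rule."""
    pairs = set()
    for a, b in combinations(records, 2):
        if any(all(a[col] == b[col] for col in rule) for rule in rules):
            pairs.add((a["id"], b["id"]))
    return pairs


# Ground-truth matches: every pair of records belonging to the same person.
true_pairs = {
    (a["id"], b["id"])
    for a, b in combinations(records, 2)
    if a["person"] == b["person"]
}

# Strict rule, as in the question's training session: few pairs, mostly real links.
training_pairs = candidate_pairs(records, [("first_name", "surname")])

# Looser rules, as in the question's settings: more pairs, fewer missed links.
prediction_pairs = candidate_pairs(records, [("first_name",), ("surname",)])

# Hypothetical extra rule to cover the "typo in both names" edge case.
wider_pairs = candidate_pairs(records, [("first_name",), ("surname",), ("dob",)])

print("training pairs:", sorted(training_pairs))
print("missed by prediction rules:", sorted(true_pairs - prediction_pairs))
print("missed after adding dob rule:", sorted(true_pairs - wider_pairs))
```

On this toy data the strict rule yields only true links but finds just one of the six, while the two looser rules find five and let in some non-matches for the model to score. The remaining missed pair has a typo in both names, which is the kind of edge case an additional rule is meant to catch.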
I am wondering whether there is any logic behind choosing different rules for the settings' "blocking_rules_to_generate_predictions":
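# Imports assume the Splink 3 DuckDB backend; they were not shown in the original post.
import splink.duckdb.comparison_library as cl
import splink.duckdb.comparison_template_library as ctl
from splink.duckdb.blocking_rule_library import block_on
from splink.duckdb.linker import DuckDBLinker

settings = {
    "link_type": "dedupe_only",
    "comparisons": [
        ctl.name_comparison("first_name"),
        ctl.name_comparison("surname"),
        ctl.date_comparison("dob", cast_strings_to_date=True),
        cl.exact_match("city", term_frequency_adjustments=True),
        ctl.email_comparison("email", include_username_fuzzy_level=False),
    ],
    # Loose rules: a pair is scored if it matches on first_name OR surname.
    "blocking_rules_to_generate_predictions": [
        block_on("first_name"),
        block_on("surname"),
    ],
    "retain_matching_columns": True,
    "retain_intermediate_calculation_columns": True,
}
# df is the input DataFrame (not defined in the original post).
linker = DuckDBLinker(df, settings)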
and the rules used for training:
training_blocking_rule = block_on(["first_name", "surname"])
training_session_fname_sname = linker.estimate_parameters_using_expectation_maximisation(training_blocking_rule)