Can splink identify join keys? #2873

sahilzs8 · 2026-01-11T15:10:20Z

sahilzs8
Jan 11, 2026

so my use case is -
i have 2 files (csv data) which contains many headers (columns). so i want to join this 2 files based on some join key......and this join key will not be straight forward....sometimes it will be a combination of 2-3 columns or sometimes we have to strip a column to create a column which is join key or so.....

so for this use case can splink is a good fit before i do a poc?

jackwildman · 2026-01-23T13:56:45Z

jackwildman
Jan 23, 2026

A lot of this depends on how “not straightforward” we’re talking here. If your join logic is well-defined, following clear and predictable rules for preprocessing then joining (concatenating columns into a unique key, stripping whitespace), then something like pandas is more than enough:

df1['join_key'] = (df1['first_name'] + df1['last_name']).str.strip().str.lower()
df2['join_key'] = df2['full_name'].str.strip().str.lower()
result = df1.merge(df2, on='join_key')

This will start to become complicated if you’re dealing with some less well-defined noise within data, such as typos in fields, missing records, where you can’t easily construct a simple string match but can assume that there exists correlation between fields that can allow them to be identified. Splink is a great fit if this is the kind of thing you’re dealing with context-free correlation, and potentially covers everything you need.

Going further beyond, if the matching logic is sufficiently hard that it requires real judgement a context, you might consider a solution backed by LLM matching. For example, if you’re tracking companies over time and need to know that Facebook == Meta, or you have a set of software and a set of software vendors, and need to match on Excel == Microsoft, then an LLM approach works well. The trade-off, of course, being cost and latency, but we’ve found some of the newer generation of fast models (Gemini Flash 3, for example), really help keeping this under control. If this is kind of thing you need, then it’s available in https://github.com/futuresearch/everyrow-sdk (disclaimer: I work on this) using it’s merge operation:

result = await merge(
    task="Match the software to the company that produces it",
    left_table=df1,
    right_table=df2,
    merge_on_left="software_name",
    merge_on_right="vendor_name",
)

Agan, this approach really shines if the context and judgement are required to be able to make a good match under flexible criteria. For most other cases, where we can deterministically construct our own keys or infer matches by correlation, then a simple pandas approach or using splink is likely a better option.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Can splink identify join keys? #2873

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Can splink identify join keys? #2873

Uh oh!

sahilzs8 Jan 11, 2026

Replies: 1 comment

Uh oh!

jackwildman Jan 23, 2026

sahilzs8
Jan 11, 2026

jackwildman
Jan 23, 2026