Replies: 1 comment
-
|
A lot of this depends on how “not straightforward” we’re talking here. If your join logic is well-defined, following clear and predictable rules for preprocessing then joining (concatenating columns into a unique key, stripping whitespace), then something like df1['join_key'] = (df1['first_name'] + df1['last_name']).str.strip().str.lower()
df2['join_key'] = df2['full_name'].str.strip().str.lower()
result = df1.merge(df2, on='join_key')This will start to become complicated if you’re dealing with some less well-defined noise within data, such as typos in fields, missing records, where you can’t easily construct a simple string match but can assume that there exists correlation between fields that can allow them to be identified. Splink is a great fit if this is the kind of thing you’re dealing with context-free correlation, and potentially covers everything you need. Going further beyond, if the matching logic is sufficiently hard that it requires real judgement a context, you might consider a solution backed by LLM matching. For example, if you’re tracking companies over time and need to know that Facebook == Meta, or you have a set of software and a set of software vendors, and need to match on Excel == Microsoft, then an LLM approach works well. The trade-off, of course, being cost and latency, but we’ve found some of the newer generation of fast models (Gemini Flash 3, for example), really help keeping this under control. If this is kind of thing you need, then it’s available in https://github.com/futuresearch/everyrow-sdk (disclaimer: I work on this) using it’s merge operation: result = await merge(
task="Match the software to the company that produces it",
left_table=df1,
right_table=df2,
merge_on_left="software_name",
merge_on_right="vendor_name",
)Agan, this approach really shines if the context and judgement are required to be able to make a good match under flexible criteria. For most other cases, where we can deterministically construct our own keys or infer matches by correlation, then a simple pandas approach or using splink is likely a better option. |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
-
so my use case is -
i have 2 files (csv data) which contains many headers (columns). so i want to join this 2 files based on some join key......and this join key will not be straight forward....sometimes it will be a combination of 2-3 columns or sometimes we have to strip a column to create a column which is join key or so.....
so for this use case can splink is a good fit before i do a poc?
Beta Was this translation helpful? Give feedback.
All reactions