
fix: Remove catalog access from SparkSQLWriter #14083

Merged: nsivabalan merged 4 commits into apache:master from linliu-code:fix_catalog_access_spark_datasource on Oct 17, 2025

@linliu-code linliu-code commented Oct 13, 2025

Collaborator

Describe the issue this Pull Request addresses

#14081

During schema resolution in Spark Datasource operations, HoodieSparkSqlWriter would query the enabled catalog whenever it could not get the schema from the commit metadata, the table config, or the data files.

This behavior can be confusing: a Spark Datasource operation may inadvertently access the catalog and pick up the schema of a table that happens to have the same name but is otherwise unrelated.

Summary and Changelog

We remove the catalog access from the writer and instead pass the schema in from the SQL command. As a result, Spark Datasource operations no longer access the catalog.
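
For readers who want a concrete picture, here is a minimal sketch of the resolution order described above. It is not the Hudi implementation: `resolve_table_schema`, the three stand-in source functions, and the string-based schema are all hypothetical names chosen for illustration, and the lookup order simply follows the order listed in this description.

```python
from typing import Callable, Optional, Sequence

# A schema is modeled as a plain string; a source returns None when it has nothing.
Schema = str
SchemaSource = Callable[[], Optional[Schema]]


def resolve_table_schema(
    sources: Sequence[SchemaSource],
    schema_from_sql_command: Optional[Schema] = None,
) -> Optional[Schema]:
    """Return the first schema found in `sources`, else the schema handed in
    by the SQL command (if any). This function never queries a catalog."""
    for source in sources:
        schema = source()
        if schema is not None:
            return schema
    return schema_from_sql_command


# Hypothetical stand-ins for a brand-new table: nothing has been written yet.
def from_commit_metadata() -> Optional[Schema]:
    return None


def from_table_config() -> Optional[Schema]:
    return None


def from_data_files() -> Optional[Schema]:
    return None


SOURCES = [from_commit_metadata, from_table_config, from_data_files]

# Datasource write: no SQL command is involved, so no schema is passed in.
datasource_schema = resolve_table_schema(SOURCES)

# SQL write: the command has already resolved the table and passes its schema along.
sql_schema = resolve_table_schema(SOURCES, schema_from_sql_command="id INT, name STRING")
```

In this model a Datasource write against an empty table ends up with no table schema at all, rather than one borrowed from a same-named catalog table.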

Impact

Removes an unexpected behavior of the Spark writer: Spark Datasource writes no longer read a schema from the catalog.

Risk Level

Medium.

Documentation Update

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions Bot added the size:S PR with lines of changes in (10, 100] label Oct 13, 2025
@nsivabalan

Contributor

Can you link the offending commit, @linliu-code?

@nsivabalan

Contributor

Hey @linliu-code: can you follow up on the test failures on this?

@linliu-code linliu-code force-pushed the fix_catalog_access_spark_datasource branch from 56cab12 to 7aafc3e Compare October 14, 2025 20:48
@linliu-code linliu-code marked this pull request as ready for review October 14, 2025 20:48
@nsivabalan

Contributor

Hey @linliu-code: did you get to triage the test failures?

@nsivabalan

Contributor

Hey @linliu-code: any leads on the test failures?

@linliu-code linliu-code force-pushed the fix_catalog_access_spark_datasource branch from 3c45a03 to ebdc1a3 Compare October 15, 2025 17:18
@linliu-code linliu-code force-pushed the fix_catalog_access_spark_datasource branch from ebdc1a3 to c21b437 Compare October 16, 2025 01:37
@hudi-bot

Collaborator

CI report:

Bot commands

@hudi-bot supports the following commands:
  • `@hudi-bot run azure`: re-run the last Azure build

@nsivabalan

Contributor

I am not yet convinced that we should contact the catalog, even for spark-sql writes.
Can you get to the bottom of the changes made in https://github.com/apache/hudi/pull/6358/files#r2425086622
and understand what we were doing prior to that patch,
and why we had to poll the catalog? For example, prior to Alexey's patch, was INSERT_INTO polling the catalog for the schema?

If not, why can't we remove it completely?

In other words, within deduceWriterSchema, why can't we handle empty tables, i.e. tables with no schema for latestTableSchema?
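
To make the question concrete, here is a minimal sketch of what handling an empty table without any catalog lookup could look like. It is an assumption-laden simplification, not the actual deduceWriterSchema: the Python function `deduce_writer_schema`, its two parameters, and the string-based schemas are hypothetical, and the reconciliation that a real writer performs for existing tables is deliberately left out.

```python
from typing import Optional


def deduce_writer_schema(source_schema: str, latest_table_schema: Optional[str]) -> str:
    """Hypothetical simplification: pick the schema to write with."""
    if latest_table_schema is None:
        # Empty table: there is nothing to reconcile against,
        # so write with the schema of the incoming data.
        return source_schema
    # Existing table: a real implementation would reconcile the two schemas
    # (evolution, validation). This sketch simply keeps the table schema.
    return latest_table_schema
```

Under this assumption, the first write to an empty table is driven entirely by the incoming data's schema.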

@nsivabalan nsivabalan left a comment

Contributor

LGTM

@nsivabalan nsivabalan merged commit 8491d6a into apache:master Oct 17, 2025
70 checks passed

3 participants