Skip to content

Fix Kubernetes worker secret upsert race on concurrent job creation (#16447)#22036

Open
wtfashwin wants to merge 12 commits into
PrefectHQ:mainfrom
wtfashwin:fix/k8s-upsert-secret-concurrent-create-16447
Open

Fix Kubernetes worker secret upsert race on concurrent job creation (#16447)#22036
wtfashwin wants to merge 12 commits into
PrefectHQ:mainfrom
wtfashwin:fix/k8s-upsert-secret-concurrent-create-16447

Conversation

@wtfashwin
Copy link
Copy Markdown
Contributor

Closes #16447

Problem

KubernetesWorker._upsert_secret reads the API key secret and, on a 404, creates it. When a newly started worker picks up several scheduled flow runs at once, it issues consecutive create_namespaced_job calls that each run _upsert_secret concurrently. They all observe the 404, all attempt to create the secret, and every call after the first fails with a 409 AlreadyExists. That ApiException propagated unhandled and crashed the flow run.

This matches the root-cause analysis in #16447 (a classic check-then-act / TOCTOU race between the read and the create).

Fix

Treat the 409 for what it is — a benign lost create race. On conflict, re-read the secret to obtain the current resourceVersion and replace it with our value, mirroring the existing update path. The secret value is deterministic for a given worker, so the winning value is unchanged either way; the worker no longer crashes.

try:
    secret = await core_client.create_namespaced_secret(namespace=namespace, body=secret)
except ApiException as create_exc:
    if create_exc.status != 409:
        raise
    # A concurrent job submission created the secret between our read and
    # create. Re-read to obtain the current resourceVersion and replace it.
    current_secret = await core_client.read_namespaced_secret(name=name, namespace=namespace)
    current_secret.data = {"value": encoded_value}
    secret = await core_client.replace_namespaced_secret(name=name, namespace=namespace, body=current_secret)

Tests

  • Added test_upsert_secret_recovers_from_concurrent_create, which drives the read-404 -> create-409 -> re-read -> replace recovery and asserts the worker's value wins. It fails before the change (the 409 propagates) and passes after.
  • Existing secret tests (test_can_store_api_key_in_secret, test_store_api_key_in_existing_secret, and the rest of the secret/api_key suite) still pass.
  • ruff check and ruff format --check pass on both changed files.

…refectHQ#16447)

`KubernetesWorker._upsert_secret` read the API key secret and, on a 404,
created it. When a freshly started worker submits several scheduled flow
runs at once, multiple `create_namespaced_job` calls run `_upsert_secret`
concurrently. They all observe the 404, all attempt to create the secret,
and every call after the first fails with a 409 `AlreadyExists` that
propagated unhandled and crashed the flow run.

Treat the 409 as the benign lost-race it is: re-read the secret to obtain
the current resourceVersion and replace it with our value, mirroring the
existing update path. Add a regression test that drives the
read-404 -> create-409 -> re-read -> replace recovery.
@github-actions github-actions Bot added the bug Something isn't working label May 20, 2026
Comment on lines +1196 to +1198
secret = await core_client.replace_namespaced_secret(
name=name, namespace=namespace, body=current_secret
)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not replace the secret when there's a race between workers. I think we're ok not raising on a 409 in all cases, but if you want to compare values and raise if they don't match for added safety, then I'm cool with that.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call — updated to no longer replace the secret on a 409. We now re-read the secret created by the winning worker and leave it in place. For the added safety you mentioned, we compare the existing value against ours and raise only if they unexpectedly differ. Updated the regression test and added one covering the mismatch case.

Address review feedback on PrefectHQ#22036: on a 409 during the create fallback,
re-read the secret created by the other worker rather than replacing it.
Raise only if the concurrently created value unexpectedly differs.
@wtfashwin
Copy link
Copy Markdown
Contributor Author

@desertaxle thanks for the review! I've pushed a change addressing your feedback: we no longer replace the secret on a 409. The losing worker now re-reads the secret created by the winner and leaves it in place, raising only if the concurrently created value unexpectedly differs. Tests updated accordingly. Could you take another look when you get a chance? 🙏

wtfashwin added 3 commits May 23, 2026 12:32
The replacement pod after eviction emits pending and running events so
close together that the Kubernetes watcher can report them in either
order. Assert the expected event multiset plus the meaningful ordering
constraints (a running event precedes eviction, succeeded is last)
instead of a rigid positional sequence.
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented May 28, 2026

Merging this PR will not alter performance

✅ 8 untouched benchmarks
⏩ 1 skipped benchmark1


Comparing wtfashwin:fix/k8s-upsert-secret-concurrent-create-16447 (c125911) with main (f7d76f2)

Open in CodSpeed

Footnotes

  1. 1 benchmark was skipped, so the baseline result was used instead. If it was deleted from the codebase, click here and archive it to remove it from the performance reports.

wtfashwin added 2 commits May 28, 2026 23:21
…k test

The pod observer adopted Kubernetes lifecycle timestamps in PrefectHQ#22062, but those
have second-granularity. When a new pod's Pending and Running events both land
in the same wall-clock second, /events/filter sorted ASC by `occurred` can
return them out of lifecycle order (Running before Pending), breaking
`test_pod_eviction_with_backoff_limit`. Cache each pod's last emitted event
and bump the next event's `occurred` by 1us if the k8s timestamp would
collide or regress, restoring strict per-pod monotonicity.

Separately, `test_acquire_lock_when_previously_holder_timed_out` used
hold_timeout=0.1s — too tight for scheduling jitter on contended CI runners;
holder1's lease was expiring before the immediate `is_locked` assertion. Bump
to 2s; the test still exercises the same blocking-on-timeout behavior.
@wtfashwin
Copy link
Copy Markdown
Contributor Author

@desertaxle resolved

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please revert these changes to keep the PR scoped only to the changes necessary to solve the linked issue.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please revert these changes to keep the PR scoped only to the changes necessary to solve the linked issue. This change has already been made on main.

Comment thread tests/test_locking.py
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please revert these changes to keep the PR scoped only to the changes necessary to solve the linked issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flow run fails when newly started worker tries to create Kubernetes secret concurrently in concurrent create job function calls

2 participants