Fix flaky block_keyspace_notification test for HGETDEL notify race#3766
Conversation
The test added in valkey-io#3743 is racy and intermittently fails with: Expected '{event del key h1} {event hdel key h1}' to be equal to '{event hdel key h1}' When HGETDEL empties the hash, hgetdelCommand fires two notifications back-to-back on the main thread: 1. notifyKeyspaceEvent(NOTIFY_HASH, "hdel", ...) 2. notifyKeyspaceEvent(NOTIFY_GENERIC, "del", ...) (key removed) The block_keyspace_notification test module's callback always spawns a background thread that sleeps 1s, appends to the event log, and optionally unblocks the client. For the second notification the RM_BlockClient call returns NULL (the client is already blocked by the first), so the second thread never unblocks anyone — it only appends "del" to the log. The first thread is the one that unblocks the client, so HGETDEL returns as soon as the first thread logs "hdel". The test client then immediately reads b_keyspace.events, racing the second thread which has not yet acquired the event-log mutex. On a loaded host or under valgrind the second thread routinely loses the race, leaving only "hdel" visible. Fix: wait_for_condition until both events have been logged before asserting on their contents. This matches the pattern already used by the four other event-log assertions in this file. Reproduces in 2/5 runs without the fix on a quiet macOS host; 30/30 with the fix. Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughA test for the HGETDEL command in the blocking keyspace notification suite was updated to handle asynchronous ordering of ChangesHGETDEL Keyspace Notification Async Ordering Fix
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~3 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## unstable #3766 +/- ##
============================================
+ Coverage 76.68% 76.72% +0.03%
============================================
Files 162 162
Lines 80697 80697
============================================
+ Hits 61885 61911 +26
+ Misses 18812 18786 -26 🚀 New features to boost your workflow:
|
…3766) The HFE-blocking-keyspace-notify test added in #3743 is racy and intermittently fails with: ``` Expected '{event del key h1} {event hdel key h1}' to be equal to '{event hdel key h1}' (context: ... cmd {assert_equal [lsort {{event hdel key h1} {event del key h1}}] [lsort [r b_keyspace.events]]} proc ::test) ``` ## Why it races When `HGETDEL` empties the hash, `hgetdelCommand` fires two notifications back-to-back on the main thread: 1. `notifyKeyspaceEvent(NOTIFY_HASH, "hdel", ...)` 2. `notifyKeyspaceEvent(NOTIFY_GENERIC, "del", ...)` (the key was removed) The `block_keyspace_notification` test module's callback always spawns a background thread that sleeps 1s, appends to the event log, and optionally unblocks the client. For the second notification, `RM_BlockClient` returns `NULL` (the client is already blocked by the first), so the second thread never unblocks anyone — it only appends `"del"` to the log. The first thread is the one that unblocks the client, so `HGETDEL` returns as soon as that thread has logged `"hdel"`. The test client then immediately reads `b_keyspace.events`, racing the second thread which has not yet acquired the event-log mutex. On a loaded host or under valgrind the second thread routinely loses the race, leaving only `"hdel"` visible. ## Fix `wait_for_condition` until both events have been logged before asserting on their contents. This matches the pattern already used by the four other event-log assertions in this file (e.g. the `Server-created keyspace notification` and `Event that fires twice` blocks). ```tcl wait_for_condition 50 100 { [llength [r b_keyspace.events]] >= 2 } else { fail "Did not see both hdel and del events: [r b_keyspace.events]" } assert_equal [lsort {{event hdel key h1} {event del key h1}}] [lsort [r b_keyspace.events]] ``` Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
…3766) The HFE-blocking-keyspace-notify test added in #3743 is racy and intermittently fails with: ``` Expected '{event del key h1} {event hdel key h1}' to be equal to '{event hdel key h1}' (context: ... cmd {assert_equal [lsort {{event hdel key h1} {event del key h1}}] [lsort [r b_keyspace.events]]} proc ::test) ``` ## Why it races When `HGETDEL` empties the hash, `hgetdelCommand` fires two notifications back-to-back on the main thread: 1. `notifyKeyspaceEvent(NOTIFY_HASH, "hdel", ...)` 2. `notifyKeyspaceEvent(NOTIFY_GENERIC, "del", ...)` (the key was removed) The `block_keyspace_notification` test module's callback always spawns a background thread that sleeps 1s, appends to the event log, and optionally unblocks the client. For the second notification, `RM_BlockClient` returns `NULL` (the client is already blocked by the first), so the second thread never unblocks anyone — it only appends `"del"` to the log. The first thread is the one that unblocks the client, so `HGETDEL` returns as soon as that thread has logged `"hdel"`. The test client then immediately reads `b_keyspace.events`, racing the second thread which has not yet acquired the event-log mutex. On a loaded host or under valgrind the second thread routinely loses the race, leaving only `"hdel"` visible. ## Fix `wait_for_condition` until both events have been logged before asserting on their contents. This matches the pattern already used by the four other event-log assertions in this file (e.g. the `Server-created keyspace notification` and `Event that fires twice` blocks). ```tcl wait_for_condition 50 100 { [llength [r b_keyspace.events]] >= 2 } else { fail "Did not see both hdel and del events: [r b_keyspace.events]" } assert_equal [lsort {{event hdel key h1} {event del key h1}}] [lsort [r b_keyspace.events]] ``` Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
…3766) The HFE-blocking-keyspace-notify test added in #3743 is racy and intermittently fails with: ``` Expected '{event del key h1} {event hdel key h1}' to be equal to '{event hdel key h1}' (context: ... cmd {assert_equal [lsort {{event hdel key h1} {event del key h1}}] [lsort [r b_keyspace.events]]} proc ::test) ``` ## Why it races When `HGETDEL` empties the hash, `hgetdelCommand` fires two notifications back-to-back on the main thread: 1. `notifyKeyspaceEvent(NOTIFY_HASH, "hdel", ...)` 2. `notifyKeyspaceEvent(NOTIFY_GENERIC, "del", ...)` (the key was removed) The `block_keyspace_notification` test module's callback always spawns a background thread that sleeps 1s, appends to the event log, and optionally unblocks the client. For the second notification, `RM_BlockClient` returns `NULL` (the client is already blocked by the first), so the second thread never unblocks anyone — it only appends `"del"` to the log. The first thread is the one that unblocks the client, so `HGETDEL` returns as soon as that thread has logged `"hdel"`. The test client then immediately reads `b_keyspace.events`, racing the second thread which has not yet acquired the event-log mutex. On a loaded host or under valgrind the second thread routinely loses the race, leaving only `"hdel"` visible. ## Fix `wait_for_condition` until both events have been logged before asserting on their contents. This matches the pattern already used by the four other event-log assertions in this file (e.g. the `Server-created keyspace notification` and `Event that fires twice` blocks). ```tcl wait_for_condition 50 100 { [llength [r b_keyspace.events]] >= 2 } else { fail "Did not see both hdel and del events: [r b_keyspace.events]" } assert_equal [lsort {{event hdel key h1} {event del key h1}}] [lsort [r b_keyspace.events]] ``` Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
# Backport sweep for 9.1 Automated cherry-picks from PRs marked "To be backported". ## Applied | Source PR | Title | Detail | |---|---|---| | #3743 | Fix buffered_reply assert in HFE commands with module keyspace notifications | cherry-picked in a prior sweep | | #3766 | Fix flaky block_keyspace_notification test for HGETDEL notify race | cherry-picked in a prior sweep | | #3800 | Fix heap-use-after-free in ACL LOAD when client free is deferred | cherry-picked in a prior sweep | | #3723 | Fix double-finish and RESP reply violation in cluster slot migration | cherry-picked in a prior sweep | | #3872 | Redacting customer information when hide_user_data_from_log is true in rdb.c, networking.c, debug.c and t_hash | cherry-picked in a prior sweep | | #3846 | Fix use-after-free in VM_RegisterClusterMessageReceiver | cherry-picked in a prior sweep | | #3806 | Add ALL_DBS flag to CLUSTER FLUSHSLOT for database-level ACL | cherry-picked in a prior sweep | | #3847 | Harden SENTINEL commands and config rewrite against control-character injection | | | #3801 | Validate every DB clause in COPY against ACL db= permissions | | | #3851 | Replace AUTOMATION_PAT with Valkeyrie Bot GitHub App token | | | #3848 | Fix cluster AUX-field control-character and delimiter injection | | | #3544 | Revert "IO-Threads redesign cleanup work (#3544)" | cherry-picked in a prior sweep | | #3888 | Report exact dbid for COPY in ACL LOG when db= access is denied | conflicts resolved by Claude Code | | #3942 | Fix shard_id format specifier in UPDATE message log | | | #3941 | Avoid random() % 0 undefined behaviour when cluster-node-timeout < 30 | | --- *Generated by valkey-ci-agent using Claude Code.* --------- Signed-off-by: Binbin <binloveplay1314@qq.com> Signed-off-by: Ran Shidlansik <ranshid@amazon.com> Signed-off-by: chx9 <lovelypiska@outlook.com> Signed-off-by: zackcam <zackcam@amazon.com> Signed-off-by: Eran Ifrah <eifrah@amazon.com> Signed-off-by: Jules Lasarte <lasartej@amazon.com> Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com> Signed-off-by: akash kumar <akumdev@amazon.com> Co-authored-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Ran Shidlansik <ranshid@amazon.com> Co-authored-by: lovelypiska <lovelypiska@outlook.com> Co-authored-by: zackcam <zackcam@amazon.com> Co-authored-by: eifrah-aws <eifrah@amazon.com> Co-authored-by: jjuleslasarte <jules.lasarte@gmail.com> Co-authored-by: Jules Lasarte <lasartej@amazon.com> Co-authored-by: Akash Kumar <45854686+akashkgit@users.noreply.github.com>
The HFE-blocking-keyspace-notify test added in #3743 is racy and intermittently fails with:
Why it races
When
HGETDELempties the hash,hgetdelCommandfires two notifications back-to-back on the main thread:notifyKeyspaceEvent(NOTIFY_HASH, "hdel", ...)notifyKeyspaceEvent(NOTIFY_GENERIC, "del", ...)(the key was removed)The
block_keyspace_notificationtest module's callback always spawns a background thread that sleeps 1s, appends to the event log, and optionally unblocks the client. For the second notification,RM_BlockClientreturnsNULL(the client is already blocked by the first), so the second thread never unblocks anyone — it only appends"del"to the log.The first thread is the one that unblocks the client, so
HGETDELreturns as soon as that thread has logged"hdel". The test client then immediately readsb_keyspace.events, racing the second thread which has not yet acquired the event-log mutex. On a loaded host or under valgrind the second thread routinely loses the race, leaving only"hdel"visible.Fix
wait_for_conditionuntil both events have been logged before asserting on their contents. This matches the pattern already used by the four other event-log assertions in this file (e.g. theServer-created keyspace notificationandEvent that fires twiceblocks).wait_for_condition 50 100 { [llength [r b_keyspace.events]] >= 2 } else { fail "Did not see both hdel and del events: [r b_keyspace.events]" } assert_equal [lsort {{event hdel key h1} {event del key h1}}] [lsort [r b_keyspace.events]]Verification
On a quiet macOS host:
origin/unstable).Test-only change; no server code is touched.