Skip to content

Fix flaky block_keyspace_notification test for HGETDEL notify race#3766

Merged
enjoy-binbin merged 1 commit into
valkey-io:unstablefrom
ranshid:fix/hfe-blocking-keyspace-notify-test-race
May 19, 2026
Merged

Fix flaky block_keyspace_notification test for HGETDEL notify race#3766
enjoy-binbin merged 1 commit into
valkey-io:unstablefrom
ranshid:fix/hfe-blocking-keyspace-notify-test-race

Conversation

@ranshid

@ranshid ranshid commented May 19, 2026

Copy link
Copy Markdown
Member

The HFE-blocking-keyspace-notify test added in #3743 is racy and intermittently fails with:

Expected '{event del key h1} {event hdel key h1}' to be equal to '{event hdel key h1}'
(context: ... cmd {assert_equal [lsort {{event hdel key h1} {event del key h1}}] [lsort [r b_keyspace.events]]} proc ::test)

Why it races

When HGETDEL empties the hash, hgetdelCommand fires two notifications back-to-back on the main thread:

  1. notifyKeyspaceEvent(NOTIFY_HASH, "hdel", ...)
  2. notifyKeyspaceEvent(NOTIFY_GENERIC, "del", ...) (the key was removed)

The block_keyspace_notification test module's callback always spawns a background thread that sleeps 1s, appends to the event log, and optionally unblocks the client. For the second notification, RM_BlockClient returns NULL (the client is already blocked by the first), so the second thread never unblocks anyone — it only appends "del" to the log.

The first thread is the one that unblocks the client, so HGETDEL returns as soon as that thread has logged "hdel". The test client then immediately reads b_keyspace.events, racing the second thread which has not yet acquired the event-log mutex. On a loaded host or under valgrind the second thread routinely loses the race, leaving only "hdel" visible.

Fix

wait_for_condition until both events have been logged before asserting on their contents. This matches the pattern already used by the four other event-log assertions in this file (e.g. the Server-created keyspace notification and Event that fires twice blocks).

wait_for_condition 50 100 {
    [llength [r b_keyspace.events]] >= 2
} else {
    fail "Did not see both hdel and del events: [r b_keyspace.events]"
}
assert_equal [lsort {{event hdel key h1} {event del key h1}}] [lsort [r b_keyspace.events]]

Verification

On a quiet macOS host:

  • Without the fix: 2 of 5 runs fail with the exact error above (and intermittently in larger batches).
  • With the fix: 30 of 30 runs pass (10-run + 20-run batches, plus a 15-run rerun directly off origin/unstable).

Test-only change; no server code is touched.

The test added in valkey-io#3743 is racy and intermittently fails with:

    Expected '{event del key h1} {event hdel key h1}' to be equal to
    '{event hdel key h1}'

When HGETDEL empties the hash, hgetdelCommand fires two notifications
back-to-back on the main thread:

  1. notifyKeyspaceEvent(NOTIFY_HASH, "hdel", ...)
  2. notifyKeyspaceEvent(NOTIFY_GENERIC, "del", ...) (key removed)

The block_keyspace_notification test module's callback always spawns
a background thread that sleeps 1s, appends to the event log, and
optionally unblocks the client. For the second notification the
RM_BlockClient call returns NULL (the client is already blocked by
the first), so the second thread never unblocks anyone — it only
appends "del" to the log.

The first thread is the one that unblocks the client, so HGETDEL
returns as soon as the first thread logs "hdel". The test client
then immediately reads b_keyspace.events, racing the second thread
which has not yet acquired the event-log mutex. On a loaded host or
under valgrind the second thread routinely loses the race, leaving
only "hdel" visible.

Fix: wait_for_condition until both events have been logged before
asserting on their contents. This matches the pattern already used
by the four other event-log assertions in this file.

Reproduces in 2/5 runs without the fix on a quiet macOS host;
30/30 with the fix.

Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
@coderabbitai

coderabbitai Bot commented May 19, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: b314b9e5-395e-42ba-a02c-e7c1b903a01d

📥 Commits

Reviewing files that changed from the base of the PR and between 54fbe4a and 38a6096.

📒 Files selected for processing (1)
  • tests/unit/moduleapi/block_keyspace_notification.tcl

📝 Walkthrough

Walkthrough

A test for the HGETDEL command in the blocking keyspace notification suite was updated to handle asynchronous ordering of hdel and del keyspace events. The test now waits for both events to be recorded before asserting their presence, instead of assuming timing.

Changes

HGETDEL Keyspace Notification Async Ordering Fix

Layer / File(s) Summary
Async keyspace event ordering test fix
tests/unit/moduleapi/block_keyspace_notification.tcl
The HGETDEL keyspace notification test now uses wait_for_condition to wait until at least two events are recorded before asserting both hdel and del events are present, replacing a timing-dependent assumption and failing with the current event list on timeout.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Possibly related PRs

  • valkey-io/valkey#3743: The main PR's test change for HGETDEL's hdel/del keyspace notifications directly reflects the notification/reply timing behavior fixed in the retrieved PR's hgetdelCommand deferred-reply handling.

Suggested reviewers

  • madolson
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: fixing a flaky test related to HGETDEL notifications and race conditions in the block_keyspace_notification test.
Description check ✅ Passed The description thoroughly explains the test failure, root cause analysis of the race condition, the implemented fix, and verification results, all directly relevant to the changeset.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@ranshid ranshid requested a review from enjoy-binbin May 19, 2026 03:49
@codecov

codecov Bot commented May 19, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.72%. Comparing base (54fbe4a) to head (38a6096).

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #3766      +/-   ##
============================================
+ Coverage     76.68%   76.72%   +0.03%     
============================================
  Files           162      162              
  Lines         80697    80697              
============================================
+ Hits          61885    61911      +26     
+ Misses        18812    18786      -26     

see 20 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@enjoy-binbin enjoy-binbin merged commit 8274099 into valkey-io:unstable May 19, 2026
63 checks passed
@github-project-automation github-project-automation Bot moved this to To be backported in Valkey 9.1 May 19, 2026
@github-project-automation github-project-automation Bot moved this to To be backported in Valkey 9.0 May 19, 2026
@ranshid ranshid added the test-failure An issue indicating a test failure label May 19, 2026
valkeyrie-ops Bot pushed a commit that referenced this pull request May 20, 2026
…3766)

The HFE-blocking-keyspace-notify test added in #3743 is racy and
intermittently fails with:

```
Expected '{event del key h1} {event hdel key h1}' to be equal to '{event hdel key h1}'
(context: ... cmd {assert_equal [lsort {{event hdel key h1} {event del key h1}}] [lsort [r b_keyspace.events]]} proc ::test)
```

## Why it races

When `HGETDEL` empties the hash, `hgetdelCommand` fires two
notifications back-to-back on the main thread:

1. `notifyKeyspaceEvent(NOTIFY_HASH, "hdel", ...)`
2. `notifyKeyspaceEvent(NOTIFY_GENERIC, "del", ...)` (the key was
removed)

The `block_keyspace_notification` test module's callback always spawns a
background thread that sleeps 1s, appends to the event log, and
optionally unblocks the client. For the second notification,
`RM_BlockClient` returns `NULL` (the client is already blocked by the
first), so the second thread never unblocks anyone — it only appends
`"del"` to the log.

The first thread is the one that unblocks the client, so `HGETDEL`
returns as soon as that thread has logged `"hdel"`. The test client then
immediately reads `b_keyspace.events`, racing the second thread which
has not yet acquired the event-log mutex. On a loaded host or under
valgrind the second thread routinely loses the race, leaving only
`"hdel"` visible.

## Fix

`wait_for_condition` until both events have been logged before asserting
on their contents. This matches the pattern already used by the four
other event-log assertions in this file (e.g. the `Server-created
keyspace notification` and `Event that fires twice` blocks).

```tcl
wait_for_condition 50 100 {
    [llength [r b_keyspace.events]] >= 2
} else {
    fail "Did not see both hdel and del events: [r b_keyspace.events]"
}
assert_equal [lsort {{event hdel key h1} {event del key h1}}] [lsort [r b_keyspace.events]]
```

Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
valkeyrie-ops Bot pushed a commit that referenced this pull request May 29, 2026
…3766)

The HFE-blocking-keyspace-notify test added in #3743 is racy and
intermittently fails with:

```
Expected '{event del key h1} {event hdel key h1}' to be equal to '{event hdel key h1}'
(context: ... cmd {assert_equal [lsort {{event hdel key h1} {event del key h1}}] [lsort [r b_keyspace.events]]} proc ::test)
```

## Why it races

When `HGETDEL` empties the hash, `hgetdelCommand` fires two
notifications back-to-back on the main thread:

1. `notifyKeyspaceEvent(NOTIFY_HASH, "hdel", ...)`
2. `notifyKeyspaceEvent(NOTIFY_GENERIC, "del", ...)` (the key was
removed)

The `block_keyspace_notification` test module's callback always spawns a
background thread that sleeps 1s, appends to the event log, and
optionally unblocks the client. For the second notification,
`RM_BlockClient` returns `NULL` (the client is already blocked by the
first), so the second thread never unblocks anyone — it only appends
`"del"` to the log.

The first thread is the one that unblocks the client, so `HGETDEL`
returns as soon as that thread has logged `"hdel"`. The test client then
immediately reads `b_keyspace.events`, racing the second thread which
has not yet acquired the event-log mutex. On a loaded host or under
valgrind the second thread routinely loses the race, leaving only
`"hdel"` visible.

## Fix

`wait_for_condition` until both events have been logged before asserting
on their contents. This matches the pattern already used by the four
other event-log assertions in this file (e.g. the `Server-created
keyspace notification` and `Event that fires twice` blocks).

```tcl
wait_for_condition 50 100 {
    [llength [r b_keyspace.events]] >= 2
} else {
    fail "Did not see both hdel and del events: [r b_keyspace.events]"
}
assert_equal [lsort {{event hdel key h1} {event del key h1}}] [lsort [r b_keyspace.events]]
```

Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
valkeyrie-ops Bot pushed a commit that referenced this pull request Jun 10, 2026
…3766)

The HFE-blocking-keyspace-notify test added in #3743 is racy and
intermittently fails with:

```
Expected '{event del key h1} {event hdel key h1}' to be equal to '{event hdel key h1}'
(context: ... cmd {assert_equal [lsort {{event hdel key h1} {event del key h1}}] [lsort [r b_keyspace.events]]} proc ::test)
```

## Why it races

When `HGETDEL` empties the hash, `hgetdelCommand` fires two
notifications back-to-back on the main thread:

1. `notifyKeyspaceEvent(NOTIFY_HASH, "hdel", ...)`
2. `notifyKeyspaceEvent(NOTIFY_GENERIC, "del", ...)` (the key was
removed)

The `block_keyspace_notification` test module's callback always spawns a
background thread that sleeps 1s, appends to the event log, and
optionally unblocks the client. For the second notification,
`RM_BlockClient` returns `NULL` (the client is already blocked by the
first), so the second thread never unblocks anyone — it only appends
`"del"` to the log.

The first thread is the one that unblocks the client, so `HGETDEL`
returns as soon as that thread has logged `"hdel"`. The test client then
immediately reads `b_keyspace.events`, racing the second thread which
has not yet acquired the event-log mutex. On a loaded host or under
valgrind the second thread routinely loses the race, leaving only
`"hdel"` visible.

## Fix

`wait_for_condition` until both events have been logged before asserting
on their contents. This matches the pattern already used by the four
other event-log assertions in this file (e.g. the `Server-created
keyspace notification` and `Event that fires twice` blocks).

```tcl
wait_for_condition 50 100 {
    [llength [r b_keyspace.events]] >= 2
} else {
    fail "Did not see both hdel and del events: [r b_keyspace.events]"
}
assert_equal [lsort {{event hdel key h1} {event del key h1}}] [lsort [r b_keyspace.events]]
```

Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
ranshid added a commit that referenced this pull request Jun 11, 2026
# Backport sweep for 9.1

Automated cherry-picks from PRs marked "To be backported".

## Applied

| Source PR | Title | Detail |
|---|---|---|
| #3743 | Fix buffered_reply assert in HFE commands with module keyspace
notifications | cherry-picked in a prior sweep |
| #3766 | Fix flaky block_keyspace_notification test for HGETDEL notify
race | cherry-picked in a prior sweep |
| #3800 | Fix heap-use-after-free in ACL LOAD when client free is
deferred | cherry-picked in a prior sweep |
| #3723 | Fix double-finish and RESP reply violation in cluster slot
migration | cherry-picked in a prior sweep |
| #3872 | Redacting customer information when hide_user_data_from_log is
true in rdb.c, networking.c, debug.c and t_hash | cherry-picked in a
prior sweep |
| #3846 | Fix use-after-free in VM_RegisterClusterMessageReceiver |
cherry-picked in a prior sweep |
| #3806 | Add ALL_DBS flag to CLUSTER FLUSHSLOT for database-level ACL |
cherry-picked in a prior sweep |
| #3847 | Harden SENTINEL commands and config rewrite against
control-character injection | |
| #3801 | Validate every DB clause in COPY against ACL db= permissions |
|
| #3851 | Replace AUTOMATION_PAT with Valkeyrie Bot GitHub App token | |
| #3848 | Fix cluster AUX-field control-character and delimiter
injection | |
| #3544 | Revert "IO-Threads redesign cleanup work (#3544)" |
cherry-picked in a prior sweep |
| #3888 | Report exact dbid for COPY in ACL LOG when db= access is
denied | conflicts resolved by Claude Code |
| #3942 | Fix shard_id format specifier in UPDATE message log |  |
| #3941 | Avoid random() % 0 undefined behaviour when
cluster-node-timeout < 30 | |

---
*Generated by valkey-ci-agent using Claude Code.*

---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Signed-off-by: chx9 <lovelypiska@outlook.com>
Signed-off-by: zackcam <zackcam@amazon.com>
Signed-off-by: Eran Ifrah <eifrah@amazon.com>
Signed-off-by: Jules Lasarte <lasartej@amazon.com>
Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>
Signed-off-by: akash kumar <akumdev@amazon.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ran Shidlansik <ranshid@amazon.com>
Co-authored-by: lovelypiska <lovelypiska@outlook.com>
Co-authored-by: zackcam <zackcam@amazon.com>
Co-authored-by: eifrah-aws <eifrah@amazon.com>
Co-authored-by: jjuleslasarte <jules.lasarte@gmail.com>
Co-authored-by: Jules Lasarte <lasartej@amazon.com>
Co-authored-by: Akash Kumar <45854686+akashkgit@users.noreply.github.com>
@sarthakaggarwal97 sarthakaggarwal97 moved this from To be backported to Done in Valkey 9.1 Jun 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

test-failure An issue indicating a test failure

Projects

Status: To be backported
Status: Done

Development

Successfully merging this pull request may close these issues.

3 participants