[fix][meta] Fix ZooKeeper session reconnect race condition in PulsarZooKeeperClient.clientCreator #25910
@lhotari With the current direct forwarding approach, events from the new ZooKeeper instance are delayed until the new handle is published. However, if the old expired client still has any queued or late events delivered to the same Do you think we should add a lightweight generation fencing mechanism here, so that only events from the currently active ZooKeeper instance are forwarded/processed, and stale events from previous instances are ignored?
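For discussion, here is a minimal sketch of what such generation fencing could look like, using only the JDK. `GenerationFence` and `newGeneration()` are hypothetical names, not existing Pulsar or ZooKeeper APIs, and events are modeled as a generic type rather than ZooKeeper's `WatchedEvent`.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

/**
 * Hypothetical sketch of generation fencing. None of these names come from
 * the Pulsar code base; E stands in for whatever event type is forwarded.
 */
final class GenerationFence<E> {
    // Generation of the client whose events are currently allowed through.
    private final AtomicLong activeGeneration = new AtomicLong();
    private final Consumer<E> downstream;

    GenerationFence(Consumer<E> downstream) {
        this.downstream = downstream;
    }

    /**
     * Intended to be called once per (re)created client. Returns the watcher
     * to hand to that client; every previously returned watcher becomes stale.
     */
    Consumer<E> newGeneration() {
        final long generation = activeGeneration.incrementAndGet();
        return event -> {
            // Forward only while this watcher belongs to the active client.
            if (activeGeneration.get() == generation) {
                downstream.accept(event);
            }
            // Otherwise the event came from an older client and is dropped.
        };
    }
}
```

Each replacement client would receive the watcher returned by `newGeneration()`, so a late event from the expired client is dropped once the next generation exists. The check and the forward are not atomic in this sketch, so an event that passes the check just before a generation bump can still be delivered; closing that window would need a lock or similar around both steps.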
…ooKeeperClient.clientCreator (apache#25910) (cherry picked from commit 5627c01) (cherry picked from commit 9ebfc3b)
Motivation
`ZKSessionTest.testReacquireLeadershipAfterSessionLost` can observe unstable metadata session events after a ZooKeeper session expires and `PulsarZooKeeperClient` creates a replacement client.
Failure test case 1:
Failure test case 2:
The race happens during the handoff from the expired ZooKeeper instance to the new one:
1. `ZooKeeper` can deliver `SyncConnected` while the new client is still being constructed.
2. `ZooKeeperWatcherBase` forwards that session event to child watchers.
3. `PulsarZooKeeperClient` publishes the new `ZooKeeper` handle.

This can produce extra or incomplete session transitions around `ConnectionLost`, `SessionLost`, `Reconnected`, and `SessionReestablished`.

Modifications
This change keeps the reconnect flow local to `PulsarZooKeeperClient` and `ZKSessionWatcher`.

- `PulsarZooKeeperClient` now creates replacement ZooKeeper clients with a forwarding watcher instead of passing `watcherManager` directly.
- `watcherManager.waitForConnection()` runs after that release because it depends on the forwarded `SyncConnected` event.
- `ZKSessionWatcher` records the session id used for its async `exists("/")` probe and only applies the probe result if the current session id still matches.

Verifying this change
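The session-id check on the async `exists("/")` probe described under Modifications can be illustrated in isolation. The following is a minimal sketch using only the JDK, not the actual `ZKSessionWatcher` code: `SessionBoundProbe`, `probe()` and `appliedResults()` are hypothetical names, the probe is modeled as a `CompletableFuture<Boolean>`, and the session id comes from an assumed `LongSupplier`.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.LongSupplier;

/**
 * Hypothetical sketch of binding an async probe to the session it was issued on.
 * The names are illustrative and are not the ZKSessionWatcher API.
 */
final class SessionBoundProbe {
    private final LongSupplier currentSessionId;
    private int appliedResults;          // guarded by "this"
    private boolean lastProbeSucceeded;  // guarded by "this"

    SessionBoundProbe(LongSupplier currentSessionId) {
        this.currentSessionId = currentSessionId;
    }

    /** Applies the probe result only if the session id is unchanged when the probe completes. */
    CompletableFuture<Void> probe(CompletableFuture<Boolean> asyncExists) {
        // Record the session id the probe is issued against.
        final long sessionIdAtStart = currentSessionId.getAsLong();
        return asyncExists.thenAccept(succeeded -> {
            if (currentSessionId.getAsLong() != sessionIdAtStart) {
                return; // The result belongs to a previous session: ignore it.
            }
            synchronized (this) {
                lastProbeSucceeded = succeeded;
                appliedResults++;
            }
        });
    }

    synchronized int appliedResults() {
        return appliedResults;
    }

    synchronized boolean lastProbeSucceeded() {
        return lastProbeSucceeded;
    }
}
```

In this sketch, a probe that was started on one session and completes after the session id has changed leaves the watcher state untouched, while a probe that starts and completes on the same session is applied.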
Does this pull request potentially affect one of the following parts:
If the box was checked, please highlight the changes