There are a couple of cases where using WebSockets with WebFlux on Tomcat can leave connections in the CLOSE_WAIT state after the websocket session is closed. These connections stick around and will eventually cause Tomcat to reach its connection limit (if set). At that point Tomcat can no longer accept new connections, so the server becomes unresponsive (except for previously established connections).
When running the same test cases with WebFlux on Netty or Undertow, the connections are closed properly.
I have provided an example project (ws-close-waiting.zip) that shows the cases where the connection gets stuck in CLOSE_WAIT on Tomcat after the websocket session is closed.
The project has three websocket endpoints, each showing a different case (only 2 cases fail). In each case, the server will close the websocket session (but in different ways) after receiving a message from the client.
/closeZip - Calls session.close(...) while processing the input stream. The input and output streams are merged with the zip operator. This case leaves the connection in CLOSE_WAIT on Tomcat.
/closeZipDelayError - Calls session.close(...) while processing the input stream. The input and output streams are merged with the zipDelayError operator. This case properly closes the connection. I included this case for comparison with the first case. I'm not sure what the downsides of using zipDelayError would be, though. Advice appreciated.
/exceptionZipDelayError - Propagates an exception on the input stream, but handles that exception with onErrorResume by calling session.close(...). The input and output streams are merged with the zipDelayError operator. This case leaves the connection in CLOSE_WAIT on Tomcat. I included this case to show that the zipDelayError operator will "fix" some cases (case 2), but not every case.
I have enabled the following logging:
logging.level.org.springframework.http.server.reactive=debug
logging.level._org.springframework.http.server.reactive.AbstractListenerReadPublisher=trace
logging.level._org.springframework.http.server.reactive.AbstractListenerWriteProcessor=trace
logging.level._org.springframework.http.server.reactive.AbstractListenerWriteFlushProcessor=trace
logging.level._org.springframework.http.server.reactive=trace
logging.level.reactor.netty=debug
logging.level.org.apache.tomcat.websocket=debug
In the failing cases (1 and 3), the read publisher logs a cancel message, and I see the following log lines:
2023-04-28T13:48:29.358+02:00 TRACE 227341 --- [nio-8080-exec-4] _.s.h.s.r.AbstractListenerReadPublisher : [37936546] cancel [READING]
2023-04-28T13:48:29.358+02:00 TRACE 227341 --- [nio-8080-exec-4] _.s.h.s.r.AbstractListenerReadPublisher : [37936546] READING -> COMPLETED
In the successful case (2), the read publisher does not log a cancel message. I think the cancelling is the underlying problem. It prevents the server from noticing that the client has closed the connection.
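For background on what "noticing that the client has closed the connection" means at the socket level, here is a minimal plain-TCP sketch. It is not taken from the example project and does not use Tomcat or WebFlux; the class and method names (CloseWaitDemo, readAfterPeerClose) are made up for illustration. It only shows the general mechanism: once the peer closes, the local socket sits in CLOSE_WAIT until the application closes it, and a read is how the application learns about the peer's close.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

class CloseWaitDemo {

    // Returns what the server-side read() yields after the client has closed its end.
    static int readAfterPeerClose() throws IOException {
        // Port 0 asks the OS for any free port; loopback keeps the demo local.
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
            Socket accepted = server.accept();
            try {
                // The client sends FIN; the accepted socket is now in CLOSE_WAIT.
                client.close();
                // Guard against hanging if something unexpected happens.
                accepted.setSoTimeout(5000);
                // Reading is how the application sees the FIN: read() returns -1 (end of stream).
                return accepted.getInputStream().read();
            } finally {
                // Only this close() moves the server side out of CLOSE_WAIT.
                accepted.close();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read() after peer close: " + readAfterPeerClose());
    }
}
```

If the read side is cancelled and nothing else closes the socket, the final close() in this sketch never happens, which would match the stuck CLOSE_WAIT entries. That is my assumption about the failing cases, not something I have confirmed in the Tomcat or Spring code.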
To test each use case, I used netstat to observe connections, and websocat as the websocket client. Specifically...
I started netstat in a loop to observe connections every second...
while true ; do clear; date; sudo netstat -pn | grep 8080; sleep 1; done
Then I used websocat in another terminal as follows:
- connect to one of the three websocket endpoints...
e.g. websocat -v -v ws://localhost:8080/closeZip (or closeZipDelayError or exceptionZipDelayError)
netstat will show something like...
Fri Apr 28 01:59:55 PM CEST 2023
tcp 0 0 127.0.0.1:57316 127.0.0.1:8080 ESTABLISHED 232014/./websocat
tcp6 0 0 127.0.0.1:8080 127.0.0.1:57316 ESTABLISHED 231835/java
- type something on the websocat console and press enter. websocat will send what you typed as a text websocket message and leave the connection open. The netstat output remains unchanged.
- press CTRL-D on the websocat console to end the input stream. websocat will exit.
For the successful cases, the connections will disappear from netstat.
For the failure cases, netstat will show something like...
Fri Apr 28 02:01:36 PM CEST 2023
tcp 0 0 127.0.0.1:57316 127.0.0.1:8080 FIN_WAIT2 -
tcp6 7 0 127.0.0.1:8080 127.0.0.1:57316 CLOSE_WAIT 231835/java
Eventually the old client-side connection (the one in FIN_WAIT2) will go away, but the server-side connection (the one in CLOSE_WAIT) will remain until the server is shut down.
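To track this over a longer run without watching the terminal, the netstat lines can be tallied per state. Below is a small sketch that assumes the column layout shown above (proto, recv-q, send-q, local address, foreign address, state, pid/program); the class and method names (NetstatStates, countServerStates) are hypothetical helpers, not part of the example project.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class NetstatStates {

    // Counts TCP states for lines whose local address ends with the given port,
    // i.e. the server side of each connection.
    static Map<String, Integer> countServerStates(List<String> lines, int port) {
        Map<String, Integer> counts = new TreeMap<>();
        String suffix = ":" + port;
        for (String line : lines) {
            String[] fields = line.trim().split("\\s+");
            // Skip anything that is not a TCP row (for example the date line).
            if (fields.length < 6 || !fields[0].startsWith("tcp")) {
                continue;
            }
            // fields[3] is the local address; skip the client side of the pair.
            if (!fields[3].endsWith(suffix)) {
                continue;
            }
            // fields[5] is the connection state.
            counts.merge(fields[5], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
                "Fri Apr 28 02:01:36 PM CEST 2023",
                "tcp 0 0 127.0.0.1:57316 127.0.0.1:8080 FIN_WAIT2 -",
                "tcp6 7 0 127.0.0.1:8080 127.0.0.1:57316 CLOSE_WAIT 231835/java");
        System.out.println(countServerStates(sample, 8080)); // prints {CLOSE_WAIT=1}
    }
}
```

A CLOSE_WAIT count that keeps growing after every client disconnect is the symptom described here.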
Again, when running WebFlux on Netty or Undertow, the connections always go away in all three cases.