Problem
While studying the Visdom realtime transport pipeline, I noticed that the current WebSocket/event delivery architecture may become unstable under long-running or high-frequency update workloads.
The current implementation appears to maintain shared mutable subscription state across browser clients and Python event sources, while also supporting both polling and WebSocket-based synchronization paths.
Under sustained realtime updates, this could lead to:
- stale subscriptions
- duplicate event delivery
- dropped updates during reconnects
- inconsistent state replay after temporary disconnects
- unbounded event queue growth for slow clients
- browser instability under rapid update workloads
There are also existing reports related to realtime rendering instability and socket-related failures that may be connected to transport-level synchronization behavior.
Proposed Improvement
I would like to propose a transport reliability improvement focused on:
1. Reconnect-safe synchronization
- replay missing events after reconnect
- maintain per-client session state
- improve env/window consistency after browser refresh
2. Bounded queue/backpressure handling
- prevent unlimited queue accumulation
- drop/coalesce outdated updates safely
- keep slow clients from causing unbounded memory growth on the server
3. Heartbeat/connection lifecycle improvements
- detect stale/disconnected clients
- clean up dead subscriptions properly
- reduce ghost subscribers
4. Unified transport abstraction
Currently polling and WebSocket paths appear partially duplicated.
A unified event delivery abstraction could improve:
- maintainability
- transport consistency
- testing reliability
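To make items 1 and 2 more concrete, here is a minimal sketch of what I have in mind. It is not based on the existing Visdom code: `ReplayLog`, `CoalescingQueue`, and all of their methods are hypothetical names I made up for illustration, and keying updates by an (env, window) pair is an assumption about how updates could be grouped.

```python
from collections import OrderedDict, deque
from typing import Any, Deque, Hashable, List, Tuple


class ReplayLog:
    """Hypothetical bounded log of recent events, tagged with sequence numbers."""

    def __init__(self, capacity: int = 1000) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        # deque(maxlen=...) silently evicts the oldest entry when full.
        self._events: Deque[Tuple[int, Any]] = deque(maxlen=capacity)
        self._next_seq = 1

    def append(self, payload: Any) -> int:
        """Store an event and return the sequence number assigned to it."""
        seq = self._next_seq
        self._next_seq += 1
        self._events.append((seq, payload))
        return seq

    def since(self, last_seen: int) -> Tuple[bool, List[Tuple[int, Any]]]:
        """Return (complete, events newer than last_seen).

        complete is False when some events after last_seen were already
        evicted, in which case the client would need a full state resync.
        """
        events = [(s, p) for s, p in self._events if s > last_seen]
        oldest = self._events[0][0] if self._events else self._next_seq
        return last_seen + 1 >= oldest, events


class CoalescingQueue:
    """Hypothetical bounded per-client outbox: newest update per key wins."""

    def __init__(self, max_keys: int = 256) -> None:
        self._pending: "OrderedDict[Hashable, Any]" = OrderedDict()
        self._max_keys = max_keys
        self.dropped = 0  # number of keys evicted because the queue was full

    def put(self, key: Hashable, payload: Any) -> None:
        if key in self._pending:
            # Coalesce: discard the stale payload, re-insert as newest.
            del self._pending[key]
        elif len(self._pending) >= self._max_keys:
            # Queue full: evict the oldest pending key.
            self._pending.popitem(last=False)
            self.dropped += 1
        self._pending[key] = payload

    def drain(self) -> List[Tuple[Hashable, Any]]:
        """Return all pending updates, oldest first, and clear the queue."""
        items = list(self._pending.items())
        self._pending.clear()
        return items
```

On reconnect, a client would send the last sequence number it saw; the server would call `since()` and either replay the returned events or fall back to a full resync when `complete` is False. Note that coalescing as sketched is only safe for updates that carry full window state. Append-style updates would need to be merged rather than replaced.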
Potentially Relevant Areas
- socket_handlers.py
- ApiProvider.js
- Legacy.js
- server_utils.py
- WebSocket subscription/event handling flow
Expected Benefits
- improved realtime stability
- better long-running training session reliability
- lower memory growth risk
- smoother browser reconnect behavior
- more scalable multi-client visualization
Additional Notes
I am currently studying the existing transport/event flow in detail and would be interested in contributing improvements in this area if maintainers think this direction would be valuable.