Improve WebSocket transport reliability with reconnect-safe state synchronization and bounded event queues #1331

@Ridanshi

Problem

While studying the Visdom realtime transport pipeline, I noticed that the current WebSocket/event delivery architecture may become unstable under long-running or high-frequency update workloads.

The current implementation appears to maintain shared mutable subscription state across browser clients and Python event sources, while also supporting both polling and WebSocket-based synchronization paths.

Under sustained realtime updates, this could lead to:

  • stale subscriptions
  • duplicate event delivery
  • dropped updates during reconnects
  • inconsistent state replay after temporary disconnects
  • unbounded event queue growth for slow clients
  • browser instability under rapid update workloads

There are also existing reports related to realtime rendering instability and socket-related failures that may be connected to transport-level synchronization behavior.


Proposed Improvement

I would like to propose a transport reliability improvement focused on:

1. Reconnect-safe synchronization

  • replay missing events after reconnect
  • maintain per-client session state
  • improve env/window consistency after browser refresh
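
To make the replay idea concrete, here is a minimal sketch, assuming each event gets a monotonically increasing sequence number and that clients report the last number they saw when reconnecting. `ReplayLog` and its methods are hypothetical names for illustration and are not part of Visdom's current code.

```python
from collections import deque
from typing import Any, Deque, List, Optional, Tuple


class ReplayLog:
    """Hypothetical bounded event log; not an existing Visdom class."""

    def __init__(self, capacity: int = 1000) -> None:
        # deque(maxlen=...) evicts the oldest entry once capacity is reached.
        self._events: Deque[Tuple[int, Any]] = deque(maxlen=capacity)
        self._next_seq = 1

    def append(self, payload: Any) -> int:
        """Store an event and return the sequence number assigned to it."""
        seq = self._next_seq
        self._next_seq += 1
        self._events.append((seq, payload))
        return seq

    def replay_after(self, last_seen: int) -> Optional[List[Tuple[int, Any]]]:
        """Return events newer than last_seen, or None if a full resync is needed."""
        oldest_retained = self._events[0][0] if self._events else self._next_seq
        if last_seen + 1 < oldest_retained:
            # Some events the client missed were already evicted.
            return None
        return [(seq, payload) for seq, payload in self._events if seq > last_seen]
```

A reconnecting client would send its last seen sequence number; a `None` result signals that the server should fall back to sending full env/window state instead of a delta.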

2. Bounded queue/backpressure handling

  • prevent unlimited queue accumulation
  • drop/coalesce outdated updates safely
  • keep slow clients from causing unbounded memory growth on the server
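
One possible shape for this, sketched under the assumption that updates can be keyed (for example by window id) and that only the newest update per key matters to a slow client. `CoalescingOutbox` is a hypothetical name, and the eviction policy shown (drop the oldest pending key) is just one option.

```python
from collections import OrderedDict
from typing import Any, Optional, Tuple


class CoalescingOutbox:
    """Hypothetical per-client outbox: bounded, keeps the newest update per key."""

    def __init__(self, max_pending: int = 256) -> None:
        self._pending: "OrderedDict[str, Any]" = OrderedDict()
        self._max_pending = max_pending
        self.dropped = 0  # number of pending keys evicted due to overflow

    def put(self, key: str, payload: Any) -> None:
        if key in self._pending:
            # Coalesce: overwrite the stale payload, keeping its queue position.
            self._pending[key] = payload
            return
        if len(self._pending) >= self._max_pending:
            # Bounded: evict the oldest pending key rather than growing.
            self._pending.popitem(last=False)
            self.dropped += 1
        self._pending[key] = payload

    def get(self) -> Optional[Tuple[str, Any]]:
        """Pop the oldest pending (key, payload), or None when empty."""
        if not self._pending:
            return None
        return self._pending.popitem(last=False)

    def __len__(self) -> int:
        return len(self._pending)
```

With this design, memory per client is capped at `max_pending` entries regardless of how fast the Python side publishes, and the `dropped` counter gives a signal that a client may need a full resync.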

3. Heartbeat/connection lifecycle improvements

  • detect stale/disconnected clients
  • clean up dead subscriptions properly
  • reduce ghost subscribers
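
As a rough illustration of stale-client detection, the sketch below assumes each client sends a periodic heartbeat (or any message) and that the server reaps clients that stay silent past a timeout. `HeartbeatTracker` is a hypothetical helper, and the clock is injectable only to make the behavior easy to test.

```python
import time
from typing import Callable, Dict, List


class HeartbeatTracker:
    """Hypothetical liveness tracker; not an existing Visdom class."""

    def __init__(
        self,
        timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._timeout = timeout
        self._clock = clock
        self._last_seen: Dict[str, float] = {}

    def touch(self, client_id: str) -> None:
        """Record that a heartbeat (or any message) arrived from this client."""
        self._last_seen[client_id] = self._clock()

    def reap(self) -> List[str]:
        """Forget and return the ids of clients silent for longer than the timeout."""
        now = self._clock()
        stale = [
            cid for cid, seen in self._last_seen.items()
            if now - seen > self._timeout
        ]
        for cid in stale:
            del self._last_seen[cid]
        return stale
```

The ids returned by `reap()` would then be used to remove the matching subscriptions, so that ghost subscribers do not keep receiving (or queueing) events.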

4. Unified transport abstraction

Currently, the polling and WebSocket paths appear to be partially duplicated.

A unified event delivery abstraction could improve:

  • maintainability
  • transport consistency
  • testing reliability
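
A minimal sketch of what such an abstraction might look like, assuming both paths can be reduced to a single `deliver` call. All class and function names here are hypothetical and do not correspond to existing Visdom code; the push transport simply stands in for a WebSocket write.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Iterable, List


class Transport(ABC):
    """Hypothetical common interface for all event delivery paths."""

    @abstractmethod
    def deliver(self, event: Any) -> None:
        """Hand one event to the client behind this transport."""


class PushTransport(Transport):
    """Stands in for a WebSocket path: forwards each event immediately."""

    def __init__(self, send: Callable[[Any], None]) -> None:
        self._send = send

    def deliver(self, event: Any) -> None:
        self._send(event)


class PollTransport(Transport):
    """Stands in for a polling path: buffers events until the client asks."""

    def __init__(self) -> None:
        self._buffer: List[Any] = []

    def deliver(self, event: Any) -> None:
        self._buffer.append(event)

    def drain(self) -> List[Any]:
        """Return and clear everything buffered since the last poll."""
        events, self._buffer = self._buffer, []
        return events


def broadcast(transports: Iterable[Transport], event: Any) -> None:
    """Publish one event through every transport using the same code path."""
    for transport in transports:
        transport.deliver(event)
```

Because event sources only call `broadcast`, tests could exercise delivery logic against an in-memory transport without opening a real socket.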

Potentially Relevant Areas

  • socket_handlers.py
  • ApiProvider.js
  • Legacy.js
  • server_utils.py
  • websocket subscription/event handling flow

Expected Benefits

  • improved realtime stability
  • better long-running training session reliability
  • lower memory growth risk
  • smoother browser reconnect behavior
  • more scalable multi-client visualization

Additional Notes

I am currently studying the existing transport/event flow in detail and would be interested in contributing improvements in this area if maintainers think this direction would be valuable.
