Scaling Horizontally, Connecting Vertically: Solving WebSockets' Scaling Challenge in Thursday Socials

Thursday, True Sparrow’s own product, is a place for remote teams to hold socials. It is an audio- and video-enabled space consisting of a lounge and mixers. WebSockets are essential to running these online social events. In this post, we outline the scaling problem we encountered with WebSockets and present our solution.

WebSockets in Thursday

In Thursday, each user’s actions are relayed to others in real time. For example, in the lounge, a user’s cursor movement is shared with everyone else as it happens. For this reason, we use WebSockets, a full-duplex, low-latency protocol, rather than a half-duplex, near-real-time approach such as long polling.

The underlying mechanism involves creating an event on the server, where all users establish WebSocket connections with that unique event ID. When a user performs an action, such as moving their cursor, the action is sent to the server, which then broadcasts it to all participants connected to that particular event ID.
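The per-event broadcast described above can be sketched on a single server as follows. This is a minimal illustration rather than Thursday's actual code: the `Hub` type, its method names, and the use of buffered Go channels in place of real WebSocket connections are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// Hub is a hypothetical registry that tracks, per event ID, the outbound
// channel of every connected client. Each channel stands in for one
// WebSocket connection.
type Hub struct {
	mu     sync.RWMutex
	events map[string]map[string]chan string // eventID -> clientID -> outbound
}

func NewHub() *Hub {
	return &Hub{events: make(map[string]map[string]chan string)}
}

// Join registers a client under an event and returns its outbound channel.
func (h *Hub) Join(eventID, clientID string) <-chan string {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.events[eventID] == nil {
		h.events[eventID] = make(map[string]chan string)
	}
	ch := make(chan string, 16)
	h.events[eventID][clientID] = ch
	return ch
}

// Broadcast sends msg to every client connected to eventID and returns
// how many clients it was delivered to.
func (h *Hub) Broadcast(eventID, msg string) int {
	h.mu.RLock()
	defer h.mu.RUnlock()
	delivered := 0
	for _, ch := range h.events[eventID] {
		select {
		case ch <- msg:
			delivered++
		default:
			// The client's buffer is full: drop instead of blocking everyone.
		}
	}
	return delivered
}

func main() {
	hub := NewHub()
	alice := hub.Join("event-1", "alice")
	bob := hub.Join("event-1", "bob")
	hub.Join("event-2", "carol") // different event, must not receive anything

	n := hub.Broadcast("event-1", `{"type":"cursor","x":10,"y":20}`)
	fmt.Println("delivered to", n, "clients")
	fmt.Println("alice got:", <-alice)
	fmt.Println("bob got:", <-bob)
}
```

Note that the hub only knows about clients connected to this one process, which is exactly the limitation the rest of this post addresses.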

Scaling Challenge

The challenge stems from the fact that WebSocket connections are stateful: for a server to broadcast to every participant, all connections belonging to a specific event would have to live on that one server. If these connections are distributed across a cluster of servers, clients connected to different nodes won't know about each other. This is a real obstacle to horizontal scaling. Vertical scaling alone was not a viable way to handle millions of concurrent connections.

Solution

We deployed RabbitMQ to solve this problem. All WebSocket servers subscribe to a RabbitMQ server. When a server receives a message for a client that is not in its local connections dictionary, it publishes the message to RabbitMQ, which delivers it to all subscribed servers. Each server then checks whether the client is connected to it and, if so, forwards the message.
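The "deliver locally, otherwise publish" rule can be sketched as below. Everything here is hypothetical and in-memory: `Broker` merely stands in for RabbitMQ by handing each published envelope to every subscribed server, and `Envelope`, `Server`, and `Route` are names invented for this example.

```go
package main

import "fmt"

// Envelope is a hypothetical message addressed to a single client.
type Envelope struct {
	ClientID string
	Body     string
}

// Broker stands in for RabbitMQ: every published envelope is handed to
// all subscribed servers.
type Broker struct {
	subscribers []*Server
}

func (b *Broker) Publish(env Envelope) {
	for _, s := range b.subscribers {
		s.HandleBrokerMessage(env)
	}
}

// Server is one WebSocket server with its local connections dictionary.
type Server struct {
	Name   string
	local  map[string]chan string // clientID -> outbound channel (stands in for a socket)
	broker *Broker
}

// NewServer creates a server and subscribes it to the broker.
func NewServer(name string, b *Broker) *Server {
	s := &Server{Name: name, local: make(map[string]chan string), broker: b}
	b.subscribers = append(b.subscribers, s)
	return s
}

// Connect registers a client locally and returns its outbound channel.
func (s *Server) Connect(clientID string) <-chan string {
	ch := make(chan string, 16)
	s.local[clientID] = ch
	return ch
}

// Route delivers the envelope locally when the client is connected here;
// otherwise it publishes to the broker. It reports whether delivery was local.
func (s *Server) Route(env Envelope) bool {
	if s.deliverLocal(env) {
		return true
	}
	s.broker.Publish(env)
	return false
}

// HandleBrokerMessage runs for every envelope published on the broker and
// silently ignores clients this server does not hold.
func (s *Server) HandleBrokerMessage(env Envelope) {
	s.deliverLocal(env)
}

func (s *Server) deliverLocal(env Envelope) bool {
	ch, ok := s.local[env.ClientID]
	if ok {
		ch <- env.Body
	}
	return ok
}

func main() {
	broker := &Broker{}
	s1 := NewServer("ws-1", broker)
	s2 := NewServer("ws-2", broker)
	s1.Connect("alice")
	bob := s2.Connect("bob")

	// alice's server does not hold bob, so the message goes through the broker.
	local := s1.Route(Envelope{ClientID: "bob", Body: "hello bob"})
	fmt.Println("delivered locally:", local)
	fmt.Println("bob received:", <-bob)
}
```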

For each social, every server declares its own exclusive queue and binds it to the exchange using the social's event ID as the binding key; an exclusive queue in RabbitMQ belongs to the connection that declared it, so each server consumes from its own queue. Any message pertaining to a social carries the event ID as its routing key. We use a Direct Exchange so that each message goes straight to the queues whose binding key exactly matches the message's routing key, i.e., the event ID.
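To make the exact-match routing concrete, here is a toy, in-memory model of a direct exchange. It does not use a RabbitMQ client library; `DirectExchange`, `Bind`, and `Publish` are hypothetical names, and the sketch assumes each server binds its own queue for a given event ID.

```go
package main

import "fmt"

// DirectExchange is a toy, in-memory stand-in for a RabbitMQ direct exchange.
// It is not safe for concurrent use; a real broker handles that for you.
type DirectExchange struct {
	bindings map[string][]chan string // binding key -> queues bound with that key
}

func NewDirectExchange() *DirectExchange {
	return &DirectExchange{bindings: make(map[string][]chan string)}
}

// Bind creates a new queue bound with the given key and returns it.
func (e *DirectExchange) Bind(bindingKey string) <-chan string {
	q := make(chan string, 16)
	e.bindings[bindingKey] = append(e.bindings[bindingKey], q)
	return q
}

// Publish copies msg into every queue whose binding key exactly equals
// routingKey and returns the number of queues reached.
func (e *DirectExchange) Publish(routingKey, msg string) int {
	queues := e.bindings[routingKey]
	for _, q := range queues {
		q <- msg // blocks if a queue's 16-slot buffer is full
	}
	return len(queues)
}

func main() {
	ex := NewDirectExchange()
	serverA := ex.Bind("event-42") // server A's queue for the social
	serverB := ex.Bind("event-42") // server B's queue for the same social
	other := ex.Bind("event-7")    // a queue for an unrelated social

	n := ex.Publish("event-42", "cursor moved")
	fmt.Println("queues reached:", n)
	fmt.Println("server A:", <-serverA)
	fmt.Println("server B:", <-serverB)
	fmt.Println("pending on event-7 queue:", len(other))
}
```

Because matching is exact, a message for one social never lands in the queue of another, even when the event IDs share a prefix.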

Implementation

We divide the implementation into three modules that manage socials and their respective WebSocket connections.

Social

The Social struct represents a social instance. When it is initialized, it creates a queue and binds it to the Direct Exchange using the event ID as the key. It also starts a publisher that listens for social-related messages on the PublishChan Go channel and publishes them to RabbitMQ with the event ID as the routing key. Finally, it starts a consumer that reads messages from the queue and writes them to all WebSocket connections within the social.
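A rough sketch of this publisher/consumer pair is shown below. Only the names `Social` and `PublishChan` come from the description above; `memBroker` is an in-memory placeholder for the RabbitMQ exchange, `AddConn` and `writeAll` are invented helpers, and buffered channels stand in for WebSocket connections.

```go
package main

import (
	"fmt"
	"sync"
)

// memBroker is an in-memory placeholder for the RabbitMQ direct exchange.
type memBroker struct {
	mu     sync.Mutex
	queues map[string][]chan string // binding key -> bound queues
}

func newMemBroker() *memBroker {
	return &memBroker{queues: make(map[string][]chan string)}
}

func (b *memBroker) Bind(key string) <-chan string {
	b.mu.Lock()
	defer b.mu.Unlock()
	q := make(chan string, 16)
	b.queues[key] = append(b.queues[key], q)
	return q
}

func (b *memBroker) Publish(key, msg string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for _, q := range b.queues[key] {
		q <- msg
	}
}

// Social is one server's view of a social.
type Social struct {
	EventID     string
	PublishChan chan string

	mu    sync.Mutex
	conns map[string]chan string // connection ID -> outbound channel
}

// NewSocial binds a queue for the event ID, then starts the publisher
// and consumer goroutines.
func NewSocial(eventID string, b *memBroker) *Social {
	s := &Social{
		EventID:     eventID,
		PublishChan: make(chan string, 16),
		conns:       make(map[string]chan string),
	}
	queue := b.Bind(eventID)
	go func() { // publisher: PublishChan -> broker, keyed by event ID
		for msg := range s.PublishChan {
			b.Publish(eventID, msg)
		}
	}()
	go func() { // consumer: broker queue -> every local connection
		for msg := range queue {
			s.writeAll(msg)
		}
	}()
	return s
}

// AddConn registers a local connection and returns its outbound channel.
func (s *Social) AddConn(id string) <-chan string {
	s.mu.Lock()
	defer s.mu.Unlock()
	ch := make(chan string, 16)
	s.conns[id] = ch
	return ch
}

func (s *Social) writeAll(msg string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, ch := range s.conns {
		select {
		case ch <- msg:
		default: // drop for a slow connection rather than block the consumer
		}
	}
}

func main() {
	broker := newMemBroker()
	// Two Social values for the same event model two different servers.
	onServer1 := NewSocial("event-42", broker)
	onServer2 := NewSocial("event-42", broker)
	alice := onServer1.AddConn("alice")
	bob := onServer2.AddConn("bob")

	onServer1.PublishChan <- "alice moved her cursor"
	fmt.Println("alice:", <-alice)
	fmt.Println("bob:", <-bob)
}
```

A message pushed into PublishChan on one server reaches connections on both servers, because each server's consumer reads from its own queue bound with the same event ID.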

Master

The Master struct represents a server master that maintains a list of all the running socials on that server and manages the connection to the RabbitMQ server. When initialized, it declares a Direct Exchange on the RabbitMQ server and starts listening to the deleteChan Go channel for any social that has ended and needs to be removed from the list.
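The bookkeeping half of the master can be sketched as follows. The RabbitMQ connection and the exchange declaration are deliberately left out, and apart from `Master` and `deleteChan`, every name here (`AddSocial`, `Running`, `Shutdown`) is an assumption made for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// Master tracks the socials running on this server. In this sketch a social
// is represented only by its event ID.
type Master struct {
	mu         sync.RWMutex
	socials    map[string]bool // event IDs of running socials
	deleteChan chan string     // event IDs of socials that have ended
	done       chan struct{}   // closed when the delete listener exits
}

// NewMaster creates the master and starts listening on deleteChan.
func NewMaster() *Master {
	m := &Master{
		socials:    make(map[string]bool),
		deleteChan: make(chan string),
		done:       make(chan struct{}),
	}
	go m.listenForDeletes()
	return m
}

// listenForDeletes removes every social whose event ID arrives on deleteChan.
func (m *Master) listenForDeletes() {
	defer close(m.done)
	for eventID := range m.deleteChan {
		m.mu.Lock()
		delete(m.socials, eventID)
		m.mu.Unlock()
	}
}

// AddSocial records a newly started social.
func (m *Master) AddSocial(eventID string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.socials[eventID] = true
}

// Running returns the number of socials currently tracked.
func (m *Master) Running() int {
	m.mu.RLock()
	defer m.mu.RUnlock()
	return len(m.socials)
}

// Shutdown stops the listener and waits until pending deletions are applied.
func (m *Master) Shutdown() {
	close(m.deleteChan)
	<-m.done
}

func main() {
	m := NewMaster()
	m.AddSocial("event-1")
	m.AddSocial("event-2")

	m.deleteChan <- "event-1" // the social for event-1 has ended
	m.Shutdown()
	fmt.Println("running socials:", m.Running())
}
```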

WS Handler

When the social starts, each client opens a WebSocket connection to one of the servers. This connection gets registered under the social using its RegisterUnregisterChan Go channel.
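The registration step might look roughly like the sketch below, which covers only the connection bookkeeping of a social. `RegisterUnregisterChan` is the channel named above; the `Registration` message shape, `HandleConnect`, `HandleDisconnect`, and `Stop` are hypothetical, and the HTTP upgrade to a WebSocket is omitted.

```go
package main

import (
	"fmt"
	"sort"
)

// Registration is a hypothetical request to add or remove a connection.
type Registration struct {
	ConnID   string
	Outbound chan string // channel standing in for the WebSocket; unused on unregister
	Register bool        // true to register, false to unregister
}

// Social here models only the connection bookkeeping of a social.
type Social struct {
	RegisterUnregisterChan chan Registration
	conns                  map[string]chan string // owned by the run goroutine
	done                   chan struct{}
}

func NewSocial() *Social {
	s := &Social{
		RegisterUnregisterChan: make(chan Registration),
		conns:                  make(map[string]chan string),
		done:                   make(chan struct{}),
	}
	go s.run()
	return s
}

// run is the only goroutine that touches conns, so no mutex is needed.
func (s *Social) run() {
	defer close(s.done)
	for r := range s.RegisterUnregisterChan {
		if r.Register {
			s.conns[r.ConnID] = r.Outbound
		} else {
			delete(s.conns, r.ConnID)
		}
	}
}

// HandleConnect is what a WS handler would call once the connection is open.
func HandleConnect(s *Social, connID string) chan string {
	out := make(chan string, 16)
	s.RegisterUnregisterChan <- Registration{ConnID: connID, Outbound: out, Register: true}
	return out
}

// HandleDisconnect is what a WS handler would call when the socket closes.
func HandleDisconnect(s *Social, connID string) {
	s.RegisterUnregisterChan <- Registration{ConnID: connID, Register: false}
}

// Stop ends the run loop and returns the sorted IDs still registered.
func (s *Social) Stop() []string {
	close(s.RegisterUnregisterChan)
	<-s.done
	ids := make([]string, 0, len(s.conns))
	for id := range s.conns {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return ids
}

func main() {
	social := NewSocial()
	HandleConnect(social, "alice")
	HandleConnect(social, "bob")
	HandleDisconnect(social, "alice")
	fmt.Println("still registered:", social.Stop())
}
```

Funnelling every register and unregister request through one channel means a single goroutine owns the connection map, which avoids locking around it.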

Conclusion

WebSockets play a crucial role in enabling real-time communication and interactions in Thursday socials. However, scaling WebSockets is challenging because the connections are stateful and each one is tied to a specific server. We solved this problem with RabbitMQ, a message broker that relays messages among the WebSocket servers. This combination of WebSockets and RabbitMQ provides a robust foundation for horizontal scaling and for handling millions of concurrent connections efficiently.

Sarthak M Das
