Postmortem Incident Report – “Message System Outage, November 12, 2026 (Evening)”
Summary
On November 12, 2026, at 21:20, the Zisson Interact platforms experienced a disruption in the internal message system. During the incident, agents were unable to log in or out of queues, receive or make calls using the softphone, change their queue status, or process ongoing queue traffic. The issue was mitigated through manual intervention, and services were gradually restored until full functionality was confirmed at around 23:00 on one platform and 23:30 on the other.
Description
At 21:20, alarms were triggered indicating instability in the internal message system on the Interact platforms. Operations initiated troubleshooting immediately. Agents were unable to interact with queues or use the softphone for inbound or outbound calls. The issue was not caused by abnormal traffic or external factors, but by a synchronization failure within the platform’s message cluster. To restore stability, traffic was redirected to a single functioning message server. However, other servers on the platform experienced connection issues, requiring manual reconnection and verification to ensure proper communication between components. The platforms gradually stabilized, and all services were fully operational by 23:00 on one platform and 23:30 on the other.
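The mitigation step of redirecting traffic to a single functioning message server can be sketched as a simple health-check-and-failover routine. This is a minimal illustration, not Zisson’s actual tooling; the server names and the health-check function are assumptions for the example.

```python
# Minimal sketch of the mitigation: probe each message server and
# route all traffic to the first one that passes a health check.
# Server names and the health-check callable are illustrative only.

def pick_healthy_server(servers, is_healthy):
    """Return the first server that passes the health check, else None."""
    for server in servers:
        if is_healthy(server):
            return server
    return None

servers = ["msg-01", "msg-02", "msg-03"]
# During the incident only one node was behaving correctly; simulate that:
healthy = {"msg-02"}
active = pick_healthy_server(servers, lambda s: s in healthy)
print(active)  # msg-02
```

In practice this decision was made manually by Operations; the sketch only shows the shape of the routing logic involved.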
Timeline of Events
21:20: Alarms triggered – message system instability detected on one Interact platform.
21:22: Immediate troubleshooting initiated by Operations team.
21:30: Situation Room established; engineers began isolating the affected servers.
21:40: Traffic redirected to a single functioning message server to stabilize the system.
21:50–22:45: Manual reconnection and synchronization of servers to restore communication between components.
23:30: All services confirmed operational, normal traffic resumed.
Root Cause
A synchronization failure occurred between servers within the message system cluster, preventing normal communication between components.
Further investigation revealed that a new network configuration was implemented by our hosting partner before the incident.
This change caused the message servers within the message system cluster to lose communication with each other, leading each node to operate independently (a “split-brain” condition).
As a result, multiple servers assumed the master role simultaneously, which generated message loops and excessive load on the system. The issue was not caused by an external network event, but by an unintended consequence of the configuration change, which disrupted internal synchronization between the cluster nodes.
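The split-brain condition described above is commonly prevented with a quorum rule: a node may only act as master if it can see a strict majority of the cluster, so an isolated minority partition steps down instead of electing its own master. The sketch below illustrates that rule under assumed names; it does not reflect Zisson’s internal election mechanism.

```python
# Hypothetical quorum check: a node counts itself plus the peers it
# can reach, and only claims the master role with a strict majority.
# Under the incident's network change, partitioned nodes would fail
# this check and stand by rather than all becoming master at once.

def has_quorum(reachable_peers: int, cluster_size: int) -> bool:
    """True if this node plus its reachable peers form a strict majority."""
    return (reachable_peers + 1) > cluster_size // 2

def decide_role(reachable_peers: int, cluster_size: int) -> str:
    # Nodes that lose quorum step down instead of acting as master.
    return "candidate-master" if has_quorum(reachable_peers, cluster_size) else "standby"

# In a 3-node cluster, a node cut off from both peers must stand by:
print(decide_role(0, 3))  # standby
print(decide_role(1, 3))  # candidate-master
```

With such a rule in place, the configuration change would have degraded the minority partitions to standby rather than producing competing masters and message loops.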
Actions Taken
Immediate investigation initiated by Operations and Support teams.
Traffic redirected to a single operational message server to restore stability.
Manual synchronization and reconnection of affected servers.
Verified full system functionality by 23:00 on one platform and 23:30 on the other.
Hosting partner rolled back the change.
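The manual reconnection and synchronization step above can be sketched as a retry loop with exponential backoff: each affected server is repeatedly asked to rejoin the cluster, with a growing pause between attempts. The connect function here is an illustrative stand-in, not a Zisson API.

```python
# Hypothetical sketch of the reconnection step as a retry loop.
# try_connect is an assumed callable that attempts to rejoin one
# server to the message cluster and returns True on success.
import time

def reconnect(server, try_connect, attempts=3, base_delay=1.0):
    """Try to reconnect a server, doubling the delay after each failure."""
    for attempt in range(attempts):
        if try_connect(server):
            return True
        time.sleep(base_delay * (2 ** attempt))
    return False
```

During the incident this verification was done by hand, server by server; the sketch only shows how the retry-and-verify pattern is typically structured.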
Next Steps
We recognize the impact such disruptions have on our customers and sincerely apologize for the inconvenience caused.
— Zisson Operations Team, 2026-11-12