Please see below for the postmortem report about the incident that occurred on the afternoon of November 12th.
Summary
On November 12, 2026, the Zisson Interact platform experienced an outage in its internal message system between 14:20 and 15:55. During this time, agents were unable to log in or out of queues, receive or make calls using the softphone, change their queue status, or process ongoing queue traffic. Temporary mitigations implemented from 15:15 onward gradually improved the situation until full resolution was achieved at 15:55.
Description
At 14:20, monitoring alarms were triggered by a growing queue in the Interact message system. During the incident, the problem appeared to be caused by a network issue that disrupted coordination within the message system cluster, resulting in two servers assuming the master role simultaneously (see Root Cause below for the confirmed cause). This created a feedback loop in which thousands of messages per second were generated, heavily overloading the system. As a result, agents were unable to log in or out of queues, receive or make calls with the softphone, change their availability, or process incoming queue traffic. The incident was difficult to diagnose because the symptoms initially appeared to stem from abnormally high traffic levels.
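To illustrate why a conflict of this kind escalates so quickly, the following is a deliberately simplified toy model in Python, not a description of the actual Interact message flow: each of two self-declared masters echoes every message it receives from the other, so the backlog grows geometrically instead of draining.

    # Toy model only: two nodes that both believe they are master each
    # re-publish the messages they receive from the other, so the number of
    # messages in flight grows instead of being consumed.
    def simulate_two_masters(seed_messages: int = 100, hops: int = 10) -> None:
        in_flight = seed_messages
        for hop in range(1, hops + 1):
            in_flight *= 2  # every message is echoed back by the other "master"
            print(f"hop {hop:2d}: {in_flight} messages in flight")

    if __name__ == "__main__":
        simulate_two_masters()  # 100 seed messages become 102,400 after 10 hops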
Timeline of Events
14:20: Alarms triggered – message queue congestion detected in Zisson Interact.
14:25: Situation Room initiated – personnel from Operations and Support gathered, troubleshooting started.
14:35: Issue escalated – additional engineers joined the investigation.
14:55: Troubleshooting continued; cause still unclear. Extremely high message volume observed, and queue traffic could not be processed.
15:00: External specialists engaged to assist in diagnostics.
15:15–15:30: Temporary mitigation measures implemented – agents gradually regained ability to log in/out of queues, process queue traffic, and handle calls using the softphone.
15:55: Master role conflict resolved – system returned to stable operation; detailed root cause analysis continued overnight.
Root Cause
Further investigation during the night revealed that a new network configuration was implemented by our hosting partner shortly before the incident. This change caused the RabbitMQ servers within the message system cluster to lose communication with each other, leading each node to operate independently (a “split-brain” condition).
As a result, multiple servers assumed the master role simultaneously, which generated message loops and excessive load on the system. The issue was not caused by an external network event, but by an unintended consequence of the configuration change, which disrupted internal synchronization between the cluster nodes.
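For illustration, the sketch below shows one generic way a split-brain condition can be observed from the outside: querying the RabbitMQ management HTTP API and inspecting the partition list each node reports. The host name and credentials are placeholders, and this is not a description of our production monitoring.

    # Illustration only: detect a cluster partition ("split-brain") by asking the
    # RabbitMQ management HTTP API which peers each node reports as unreachable.
    # Host name and credentials below are placeholders.
    import requests

    def find_partitions(host="rabbitmq.example.internal",
                        user="monitor", password="secret"):
        resp = requests.get(f"http://{host}:15672/api/nodes",
                            auth=(user, password), timeout=5)
        resp.raise_for_status()
        # Each node object carries a "partitions" list naming the peers it can
        # no longer reach; any non-empty list means the cluster has split.
        return {node["name"]: node["partitions"]
                for node in resp.json() if node.get("partitions")}

    if __name__ == "__main__":
        partitioned = find_partitions()
        if partitioned:
            print("WARNING: cluster partition detected:", partitioned)
        else:
            print("No partitions reported; all nodes see each other.")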
Actions Taken
Established Situation Room with Operations, Support, and external experts.
Implemented temporary mitigation to restore softphone functionality, queue control, and traffic flow.
Identified and resolved master role conflict within the message system.
Verified system stability after resolution.
Next Steps
Implement improved monitoring for network drops and role conflicts (an illustrative sketch of such a check follows this list).
Review automatic failover logic to ensure stability during transient network issues.
Our hosting partner has updated its procedures for network configuration changes so that such changes are reviewed, tested, and coordinated in advance, to help prevent similar issues in the future.
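To make the monitoring item above concrete, here is a minimal sketch, assuming the RabbitMQ management API is reachable, of a periodic check that alerts on an abnormal publish rate, the first visible symptom of the message loop. Host, credentials, and the threshold are placeholder values, not our production configuration.

    # Illustrative only: poll the RabbitMQ management HTTP API and alert when
    # the cluster-wide publish rate exceeds a threshold.
    import time
    import requests

    API = "http://rabbitmq.example.internal:15672/api"  # placeholder host
    AUTH = ("monitor", "secret")                        # placeholder credentials
    PUBLISH_RATE_THRESHOLD = 2000.0                     # msg/s, tune to normal load

    def check_publish_rate():
        overview = requests.get(f"{API}/overview", auth=AUTH, timeout=5).json()
        rate = (overview.get("message_stats", {})
                        .get("publish_details", {})
                        .get("rate", 0.0))
        if rate > PUBLISH_RATE_THRESHOLD:
            print(f"ALERT: publish rate {rate:.0f} msg/s exceeds threshold")
        # A complete check would also include the partition query shown in the
        # Root Cause section above, covering both symptoms of a split-brain.

    if __name__ == "__main__":
        while True:
            check_publish_rate()
            time.sleep(30)  # poll every 30 seconds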
We understand the critical nature of this disruption and sincerely apologize for the impact it had on our customers’ operations.
— Zisson Operations Team, 2026-11-12