Impact
Approximately 740 sessions were impacted, roughly 32.04% of all sessions during the incident period, affecting both internal and external users attempting to access the platform. The issue began at 15:26 on 2025-07-02 (UTC-5) and was discovered reactively 17.5 hours later (TTD), when a client reported to a staff member [1] that some users attempting to access the platform were encountering a “504 Gateway Timeout” error from Cloudflare, preventing normal interaction with the platform. No modules other than platform access were affected during the incident. The problem was resolved 2.6 hours after detection (TTF), for a total window of exposure of 20.1 hours (WOE) [2].
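The exposure window follows directly from the detection and fix intervals. A minimal sketch of the arithmetic in Python (the total-session figure is inferred from the reported percentage and is an approximation, not a number taken from the incident data):

```python
# Window of exposure: onset (2025-07-02 15:26 UTC-5) until resolution.
ttd_hours = 17.5  # time to detect: onset -> client report
ttf_hours = 2.6   # time to fix: client report -> resolution
woe_hours = ttd_hours + ttf_hours
print(woe_hours)  # 20.1

# Rough total-session count implied by the reported impact percentage.
impacted_sessions = 740
impact_ratio = 0.3204
print(round(impacted_sessions / impact_ratio))  # ~2310
```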
Cause
The problem started when a user tried to generate a report. This triggered a series of automatic actions that, because of an issue with one of our external providers (Twilio), caused a large number of repeated requests to be sent. On top of that, two additional factors made things worse:
- Automatic scaling was too slow to absorb the sudden increase in load.
- Error handling was inefficient, which amplified the volume of repeated requests (see the sketch below).
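The repeated-request pattern described above is typical of unbounded retries against a failing dependency. The following is an illustrative sketch only, not the platform’s actual code: it shows how capping retry attempts and spacing them with exponential backoff keeps a provider outage from multiplying traffic.

```python
import random
import time

def call_with_backoff(request_fn, max_attempts=4, base_delay=0.5):
    """Call an external provider with a bounded number of retries.

    Exponential backoff with jitter spreads the retries out, so a provider
    outage produces at most `max_attempts` requests instead of an
    unbounded stream of immediate re-sends.
    """
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up and surface the error instead of retrying forever
            # Delays of 0.5s, 1s, 2s, ... plus jitter to avoid synchronized retries.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```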
As a result, the servers became overloaded and stopped responding in time, causing Cloudflare to return the 504 error to some users. This was a rare and complex situation arising from the combination of an external service failure, slow automatic scaling, and inefficient error handling [3].
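One common way to keep slow upstream calls from piling up until the gateway starts returning 504s, and the approach the conclusion below alludes to, is to enforce a per-request deadline. A minimal sketch assuming a Python service using the requests library (the endpoint and timeout values are illustrative, not the platform’s actual configuration):

```python
import requests

UPSTREAM_URL = "https://api.example.com/reports"  # hypothetical endpoint

def fetch_report(payload: dict) -> dict:
    try:
        # Fail fast instead of holding a worker while the upstream hangs:
        # (connect timeout, read timeout) in seconds.
        response = requests.post(UPSTREAM_URL, json=payload, timeout=(3, 10))
        response.raise_for_status()
        return response.json()
    except requests.Timeout:
        # Return a controlled error to the caller rather than letting the
        # request queue grow until the edge proxy responds with 504s.
        raise RuntimeError("upstream report service timed out") from None
```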
Solution
Two key actions were taken:
Conclusion
These changes improved the system’s ability to handle sudden traffic spikes and external service failures. The incident highlighted the need to stop requests that take too long to process instead of letting them overload the system, and we are planning further improvements to how the system handles errors and timeouts. To further improve reliability and security, we plan to implement a Time-based One-Time Password (TOTP) system for user verification; this approach will reduce our dependence on external providers such as Twilio and make authentication faster and safer [6].
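As a rough illustration of the planned TOTP flow, the sketch below uses the third-party pyotp library; the secret handling, issuer name, and clock-drift window shown are assumptions for the example, not the final design.

```python
import pyotp

# Each user gets a per-account secret, generated once at enrollment and
# shared with their authenticator app (e.g. via a QR code).
secret = pyotp.random_base32()
totp = pyotp.TOTP(secret)  # 30-second time steps by default

# Provisioning URI the user scans into an authenticator app.
uri = totp.provisioning_uri(name="user@example.com", issuer_name="ExamplePlatform")

# At login the submitted code is verified locally, with no call to an
# external SMS/voice provider.
submitted_code = totp.now()  # stands in for the user's input in this sketch
assert totp.verify(submitted_code, valid_window=1)  # tolerate one step of clock drift
```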