Impact
Approximately 740 sessions were impacted, roughly 32.04% of all sessions during the incident period, affecting both internal and external users attempting to access the platform. The issue began at 15:26 on 2025-07-02 (UTC-5) and was discovered reactively 17.5 hours later (TTD), when a client reported to a staff member [1] that some users attempting to access the platform were encountering a “504 Gateway Timeout” error from Cloudflare, preventing normal interaction with the platform. No modules other than platform access were affected during the incident. The problem was resolved 2.6 hours after detection (TTF), for a total window of exposure of 20.1 hours (WOE) [2].
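The exposure window follows directly from the detection and fix intervals. A minimal sketch of the arithmetic in Python (the total-session figure is inferred from the reported percentage and is an approximation, not a number taken from the incident data):

```python
# Window of exposure: onset (2025-07-02 15:26 UTC-5) until resolution.
ttd_hours = 17.5  # time to detect: onset -> client report
ttf_hours = 2.6   # time to fix: client report -> resolution
woe_hours = ttd_hours + ttf_hours
print(woe_hours)  # 20.1

# Rough total-session count implied by the reported impact percentage.
impacted_sessions = 740
impact_ratio = 0.3204
print(round(impacted_sessions / impact_ratio))  # ~2310
```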
Cause
The problem started when a user tried to generate a report. This triggered a series of automatic actions that, because of an issue with one of our external providers (Twilio), caused a large number of repeated requests to be sent. On top of that, two additional factors made things worse:
- Automatic scaling was too slow to absorb the sudden increase in load.
- Error handling was inefficient, which amplified the volume of repeated requests (see the sketch below).
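The repeated-request pattern described above is typical of unbounded retries against a failing dependency. The following is an illustrative sketch only, not the platform’s actual code: it shows how capping retry attempts and spacing them with exponential backoff keeps a provider outage from multiplying traffic.

```python
import random
import time

def call_with_backoff(request_fn, max_attempts=4, base_delay=0.5):
    """Call an external provider with a bounded number of retries.

    Exponential backoff with jitter spreads the retries out, so a provider
    outage produces at most `max_attempts` requests instead of an
    unbounded stream of immediate re-sends.
    """
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up and surface the error instead of retrying forever
            # Delays of 0.5s, 1s, 2s, ... plus jitter to avoid synchronized retries.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```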
As a result, the servers became overloaded and stopped responding in time, causing Cloudflare to return the 504 error to some users. This was a rare and complex situation arising from the combination of an external service failure, slow automatic scaling, and inefficient error handling [3].
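One common way to keep slow upstream calls from piling up until the gateway starts returning 504s, and the approach the conclusion below alludes to, is to enforce a per-request deadline. A minimal sketch assuming a Python service using the requests library (the endpoint and timeout values are illustrative, not the platform’s actual configuration):

```python
import requests

UPSTREAM_URL = "https://api.example.com/reports"  # hypothetical endpoint

def fetch_report(payload: dict) -> dict:
    try:
        # Fail fast instead of holding a worker while the upstream hangs:
        # (connect timeout, read timeout) in seconds.
        response = requests.post(UPSTREAM_URL, json=payload, timeout=(3, 10))
        response.raise_for_status()
        return response.json()
    except requests.Timeout:
        # Return a controlled error to the caller rather than letting the
        # request queue grow until the edge proxy responds with 504s.
        raise RuntimeError("upstream report service timed out") from None
```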
Solution
Two key actions were taken:
Conclusion
These changes improved the system’s ability to handle sudden traffic spikes and external service failures. The incident highlighted the need to stop requests that take too long to process instead of letting them overload the system, and we are planning further improvements to how the system handles errors and timeouts. To further improve reliability and security, we plan to implement a Time-based One-Time Password (TOTP) system for user verification; this approach will reduce our dependence on external providers such as Twilio and make authentication faster and safer [6].
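As a rough illustration of the planned TOTP flow, the sketch below uses the third-party pyotp library; the secret handling, issuer name, and clock-drift window shown are assumptions for the example, not the final design.

```python
import pyotp

# Each user gets a per-account secret, generated once at enrollment and
# shared with their authenticator app (e.g. via a QR code).
secret = pyotp.random_base32()
totp = pyotp.TOTP(secret)  # 30-second time steps by default

# Provisioning URI the user scans into an authenticator app.
uri = totp.provisioning_uri(name="user@example.com", issuer_name="ExamplePlatform")

# At login the submitted code is verified locally, with no call to an
# external SMS/voice provider.
submitted_code = totp.now()  # stands in for the user's input in this sketch
assert totp.verify(submitted_code, valid_window=1)  # tolerate one step of clock drift
```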