Impact
At least one user experienced issues accessing the platform. The issue started on UTC-5 25-06-24 15:47 and was proactively discovered 28.8 minutes (TTD) later by a staff member who reported through our help desk [1] that, upon accessing the platform, only a blank screen was displayed. The problem was resolved in 1.6 hours (TTF), resulting in a total window of exposure of 2.1 hours (WOE) [2].
Cause
During the implementation of Datadog as a monitoring system, the previous Coralogix configurations were retained to avoid gaps in observability. However, both tools were running simultaneously with tracing enabled, which caused a conflict and ultimately corrupted the headers of requests made to the API [3].
Solution
A rollback of the changes was performed to restore the platform’s stability [4].
Conclusion
This incident highlighted the lack of manual testing that treats the local environment as production. Since both services operate only in production, the development environment remained unaffected, and the issue was not detected beforehand. Developers need to exercise caution and manually test integrations that are exclusive to the production environment, treating their machines as production when necessary. INFRASTRUCTURE_ERROR < INCOMPLETE_PERSPECTIVE