A series of cascading failures resulted in downtime for our Australian region. This was initially caused by high load on the internal service that handles extensibility, which, at its peak, caused our Authentication API to return errors. Those errors then caused a storm of retries from an affected client which prevented the service from recovering. Service was restored after the Webtask service was scaled up and the affected client blocked while the system returned to a healthy state. The service was down from 9:21 AM UTC until 10:02 AM UTC.
We would like to apologize for the impact this had on you and your users and explain what caused the problem, how we reacted to the incident, and what we are doing to prevent incidents like this from happening in the future.
On June 7, 2020, at 5:00 AM UTC, an internal service which handles Rules, Hooks, Customer Database Scripts, and Extensions for our Australian environment started to receive an increase in load. By 9:10 AM UTC we hit a peak load, at which point the cpu resources of the internal service’s VMs became saturated. While saturated, operations being run by the internal service began failing at increasing rates from 9:16 AM UTC to 9:22 AM UTC. These failures tripped an internal rate-limiting mechanism causing this internal service to produce 429 errors. Errors continued until autoscaling automatically provided enough capacity to support the surge in load.
This internal service saturation led to a peak in error rates and latency at 9:21AM UTC causing our authentication service to return 500 errors to clients. These errors resulted in a storm of retries from a customer application that did not have a retry backoff configured which put significant additional strain on the already overloaded service. At that point the proxy layer began returning 502 errors.
At 9:37 AM UTC, the recovery of the unhealthy authentication server instances commenced - new nodes were being brought online and additional capacity was being provisioned to handle the increased number of requests from the clients executing the retries. At 9:50 AM UTC we took action to block the customer application that was retrying constantly in hopes of reducing load to let the system recover.
From 10:02 AM UTC the authentication server began to successfully serve requests again. At 10:40 AM UTC the clients that were executing large numbers of retry requests were unblocked and traffic patterns were being monitored. By 11:37 AM UTC, the incident was declared as resolved.
We have now over-provisioned the internal service which handles extensibility while we investigate and work to mitigate the issue.
Additionally we would like to thank the temporarily blocked customer for their prompt response to update their application the same day and the details they were able to provide on what was seen from their side.
We are taking the following actions to ensure this does not happen again:
05:00 UTC - Load increase started on internal extensibility service.
09:10 UTC - Internal extensibility service request volume increased to a peak of 222 RPS sustained.
09:16 UTC - Internal extensibility service began logging socket hang ups and ECONNRESETs. This resulted in the authentication server returning 500s to clients attempting to execute Rules.
09:21 UTC - Number of 500s returned from the authentication server peaked. Downstream client retry logic resulted in a retry storm. P99 response times, memory and CPU usage in the authentication server hosts peaked.
09:21 UTC - Authentication server nodes started to report as unhealthy. Proxy layer started to return 502s for all authentication requests. Monitoring identified that the authentication server requests were failing, and the on-call Engineer was automatically paged.
09:22 UTC - The internal extensibility service autoscaler kicked in and scaled up. At this point, the service was healthy again.
09:37 UTC - Recovery of unhealthy authentication server instances commenced and additional capacity was being provisioned.
09:50 UTC - Clients executing a large number of retries were blocked.
10:02 UTC - Authentication server nodes were fully restored.
10:40 UTC - Clients executing large numbers of retries were unblocked. Traffic patterns were monitored as they stabilized.
11:37 UTC - Incident was declared resolved.