Elevated response times and increased error rates in Auth0
Incident Report for Auth0
Postmortem

2020-05-06 - Brief Disruption of Service to Auth0 EU

Summary

On Wednesday, May 6, 2020, a traffic spike caused a brief disruption of service to the Auth0 Production EU environment.

What Happened

On Wednesday, May 6, 2020, between 00:58 and 01:15 UTC, an unexpected traffic spike to the Auth0 Production EU environment caused a disruption of service.

During the spike, some customers in the Auth0 EU region may have observed connectivity issues and/or high latency for API requests using custom domains. Not all customers in this region were impacted, and tenants in the US and AU Auth0 regions were not impacted. In practical terms, some tenants in the EU region may have noticed slow or failed logins for approximately 15 minutes in total.

Resolution

Service started to recover on its own as soon as the spike ended. However, because requests from the spike were still being processed, high response times were still observed in the backend services.

As part of the mitigation, both Auth0 and the custom domain services were scaled up substantially after the recovery to handle the increased load.

Action Items

Auth0 takes its customer commitments and user experience seriously. To prevent another occurrence, here are the actions we are taking.

  • We are in the process of adding new rate-limit protections to our custom domains service. These protections will help prevent this entire class of issues in the future.
  • We are also increasing the scaling rate of our custom domains service. Faster scaling will enable us to react more quickly to increased traffic loads.
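
This report does not describe how the new rate-limit protections are implemented. As a rough illustration of the general idea only, the sketch below assumes a per-tenant token-bucket policy: each tenant may burst up to a fixed capacity, and tokens refill at a steady rate. The class names, the rate, and the capacity are all hypothetical and are not taken from the Auth0 custom domains service.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class TokenBucket:
    """Hypothetical token bucket; not Auth0's actual implementation."""
    rate: float      # tokens refilled per second (assumed value)
    capacity: float  # maximum burst size (assumed value)
    tokens: float = field(init=False)
    updated: Optional[float] = field(init=False, default=None)

    def __post_init__(self) -> None:
        # Start full so that normal traffic is never rejected at startup.
        self.tokens = self.capacity

    def allow(self, now: Optional[float] = None) -> bool:
        """Return True if one request may pass at time `now` (seconds)."""
        if now is None:
            now = time.monotonic()
        if self.updated is None:
            self.updated = now
        # Refill in proportion to elapsed time, capped at the burst capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class TenantRateLimiter:
    """Keeps one bucket per tenant so one tenant's spike cannot starve others."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, tenant: str, now: Optional[float] = None) -> bool:
        bucket = self.buckets.get(tenant)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity)
            self.buckets[tenant] = bucket
        return bucket.allow(now)
```

In a sketch like this, a request for which `allow` returns False would typically be rejected early (for example with an HTTP 429 response) rather than queued, so that load from a single tenant's spike does not build up in backend services.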

Annex 1: Events Timeline

2020-05-06 00:59 UTC: A tenant begins sending a large number of requests to the Auth0 service.

2020-05-06 01:01 UTC: The spike in traffic triggers automated Auth0 alerts. Auth0 engineering starts to investigate.

2020-05-06 01:05 UTC: Spike in customer traffic ends.

2020-05-06 01:05 UTC: Auth0 service starts the recovery process; customer impact is mitigated.

2020-05-06 01:15 UTC: Auth0 engineering starts to scale up internal infrastructure as a preventive measure.

2020-05-06 01:15 UTC: All traffic levels return to normal and alarms are cleared.

Posted May 21, 2020 - 03:49 UTC

Resolved
We identified elevated response times in our Production EU environment, which presented themselves as HTTP 502 errors. These errors occurred on May 6th, 2020, between 00:58 and 01:15 UTC. All services have now been restored and are functioning as expected. Auth0 thanks you for your understanding, and we will provide a full Root Cause Analysis document in the coming weeks.
Posted May 06, 2020 - 01:15 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted May 06, 2020 - 00:59 UTC
This incident affected: Auth0 Europe (PROD) (Authentication API, Custom Domains).