Errors in Extensibility Service
Incident Report for Auth0
Postmortem

June 10, 2020 - Errors in Extensibility Service

Summary

The internal service that handles Rules, Hooks, Custom Database Scripts, and Extensions in our US region experienced a spike in errors between 18:35 UTC and 18:50 UTC. This might have affected authentication requests if you were using Rules, Hooks, or Custom DB Scripts. Extensions might have been unavailable or experienced increased latency during this period. This was caused by all traffic shifting to a previous-version cluster with lower capacity.

We apologize for any issues you might have encountered, and thank you for your continued trust in Auth0.

What Happened

Our internal service for handling extensibility features, including Rules, Hooks, Custom Database Scripts, and Extensions, has a routing layer that handles traffic shifting and blue-green deployments. During a deployment, it is common to have two clusters of this service behind the routing layer at the same time, and they might have different capacities.
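As a rough illustration of the kind of routing layer described above, a weighted blue-green router can be sketched as follows. This is a hypothetical, simplified model, not Auth0's actual implementation; the `Router` class, cluster names, and weight semantics are all assumptions for illustration.

```python
import random

class Router:
    """Minimal sketch of a blue-green traffic router.

    Holds two upstream clusters and a weight that controls what
    fraction of requests go to the 'green' (latest) cluster.
    """

    def __init__(self, blue, green, green_weight=0.0):
        self.blue = blue              # previous-version cluster
        self.green = green            # latest cluster
        self.green_weight = green_weight

    def set_green_weight(self, weight):
        # Shift traffic gradually during a deployment, e.g. 0.0 -> 1.0.
        # Clamp to the valid range to avoid invalid weights.
        self.green_weight = max(0.0, min(1.0, weight))

    def pick_cluster(self):
        # random.random() returns a value in [0.0, 1.0), so a weight of
        # 1.0 always selects green and 0.0 always selects blue.
        return self.green if random.random() < self.green_weight else self.blue


router = Router(blue="extensibility-prev", green="extensibility-latest")
router.set_green_weight(1.0)  # all traffic on the latest cluster
assert router.pick_cluster() == "extensibility-latest"
```

In a model like this, the incident corresponds to the weight unexpectedly reverting toward the lower-capacity cluster after the routing layer itself was redeployed.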

At 18:32 UTC, when we deployed a new release of the routing layer, traffic shifted from the latest version of the extensibility service, which had more capacity, to the previous one with less capacity.

Operators were alerted about the problem, and once it was identified, traffic was immediately shifted back to the latest extensibility service cluster, which had pre-scaled capacity. Once traffic was 100% back on the latest cluster, the incident was fully resolved.

Mitigation Actions

  1. Improve the machinery that drives our blue-green canarying logic by loading its last state on startup. This ensures that new deployments and restarts retain the prior known-good state.
  2. Improve our process to increase test coverage under load in the staging environment for the scenarios exposed by this incident.
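Mitigation 1 amounts to persisting the routing state and reloading it on startup, so a redeploy of the routing layer resumes from the last known-good traffic split rather than a default. A minimal sketch of that idea, assuming a JSON state file (the path, schema, and default are illustrative, not Auth0's actual design):

```python
import json
import os

STATE_FILE = "/var/lib/router/traffic_state.json"
# A safe fallback if no prior state exists: keep all traffic on the
# latest cluster rather than silently reverting to an old one.
DEFAULT_STATE = {"green_weight": 1.0}


def save_state(state, path=STATE_FILE):
    # Write to a temp file and rename it into place; os.replace is
    # atomic, so a crash mid-write cannot leave partial state behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_state(path=STATE_FILE):
    # On startup, resume from where the previous instance left off;
    # fall back to the default only if no valid state is found.
    try:
        with open(path) as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return dict(DEFAULT_STATE)
```

With persistence like this in place, deploying a new release of the routing layer would pick up the existing traffic split instead of shifting traffic back to the previous cluster.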

Annex 1: Events Timeline

18:32 UTC - Rollout of internal extensibility service routing layer performed.

18:35 UTC - Automated alerts started firing.

18:44 UTC - Operator shifted traffic back to the latest extensibility service cluster.

18:52 UTC - Service was back to nominal state.

Posted Jun 22, 2020 - 22:43 UTC

Resolved
Webtask in our US region experienced a spike in errors between 18:35 UTC and 18:50 UTC. This might have affected authentication requests if you were using Rules, Hooks, or Custom DB Scripts. Extensions might also have been affected. This is now resolved.
Posted Jun 10, 2020 - 18:50 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Jun 10, 2020 - 18:35 UTC
This incident affected: Auth0 US (PREVIEW) (Authentication using Custom DB Connections & Rules (PREVIEW)) and Auth0 US (PROD) (Authentication using Custom DB Connections & Rules).