Errors in Authentication API
Incident Report for Auth0
Postmortem

Overview

On November 13th between 20:33 and 21:28 UTC, Auth0 customers in the US-2 environment received 8321 responses with HTTP status code 502. This represented 0.3% of the total requests served during that timeframe.

We would like to apologize for the impact this had on you and your customers and explain what caused the problem, how we reacted to the incident, and what we are doing to prevent incidents like this from happening in the future.

What happened?

At Auth0, our current default deployment method for our main services, the Authentication and Management APIs, is rolling deployments. These services run on multiple instances, and traffic is distributed among those instances by an AWS Application Load Balancer for scalability and high availability.

The following is the flow of the rolling deployment for each node running the service (a simplified sketch follows the list):

1. We remove the node from the load balancer. The load balancer stops forwarding new requests/connections to the node, but the node can still send data back. This is important for connection draining.
2. We drain connections, so that requests that are in flight at the time of the removal can complete.
3. We update the node with the new version of the code.
4. We start new instances of the services running the new code.
5. We monitor the health check endpoint until it returns HTTP 200, which means all services on that instance are running correctly.
6. We add the node back to the load balancer.
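For illustration only, here is a minimal sketch of this flow against an AWS Application Load Balancer target group using boto3. The target group ARN, health check URL, and deploy command are placeholders and not our actual tooling.

```python
import subprocess
import time
import urllib.request

import boto3  # AWS SDK for Python

elb = boto3.client("elbv2")

# Placeholder values -- not our actual infrastructure.
TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:us-west-2:123456789012:targetgroup/authn-api/abc123"
HEALTH_CHECK_URL = "http://{host}:3000/health"  # hypothetical health endpoint


def run_remote(host, command):
    """Run a command on the node over SSH (stand-in for our deploy tooling)."""
    subprocess.run(["ssh", host, command], check=True)


def wait_for_drain(node_id):
    """Step 2: wait until the load balancer reports the target as drained."""
    while True:
        health = elb.describe_target_health(
            TargetGroupArn=TARGET_GROUP_ARN,
            Targets=[{"Id": node_id}],
        )
        state = health["TargetHealthDescriptions"][0]["TargetHealth"]["State"]
        if state == "unused":  # no longer registered; in-flight requests done
            return
        time.sleep(5)


def wait_for_healthy(host):
    """Step 5: poll the health check endpoint until it returns HTTP 200."""
    url = HEALTH_CHECK_URL.format(host=host)
    while True:
        try:
            if urllib.request.urlopen(url, timeout=5).status == 200:
                return
        except OSError:
            pass  # service still restarting
        time.sleep(5)


def deploy_to_node(node_id, host):
    # 1. Remove the node from the load balancer; it stops receiving new
    #    requests/connections but can still send data back.
    elb.deregister_targets(TargetGroupArn=TARGET_GROUP_ARN,
                           Targets=[{"Id": node_id}])
    # 2. Drain connections so in-flight requests can complete.
    wait_for_drain(node_id)
    # 3-4. Update the node and restart the service processes on the new code
    #      (placeholder command).
    run_remote(host, "deploy-service --restart")
    # 5. Wait until every service on the instance reports healthy.
    wait_for_healthy(host)
    # 6. Add the node back to the load balancer.
    elb.register_targets(TargetGroUPARN if False else TARGET_GROUP_ARN,
                         Targets=[{"Id": node_id}])
```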

Authentication and Management API nodes are fronted by two different load balancers: one for regular domains ({tenant}.auth0.com) and another one for our custom domains feature. Our deployment automation did not remove nodes from, or add them back to, the custom domains load balancer. As a result, deployments that required service restarts would result in failed requests, as a node would keep receiving requests while its service processes were restarting and thus unable to reply.
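To make the failure mode concrete, here is a minimal sketch (with placeholder target group ARNs) of the deregistration step as it should have worked: a node has to be removed from every target group fronting it, not only the one for regular domains.

```python
import boto3

elb = boto3.client("elbv2")

# Both target groups front the same API nodes (ARNs are placeholders).
REGULAR_DOMAINS_TG = "arn:aws:elasticloadbalancing:us-west-2:123456789012:targetgroup/regular-domains/aaa111"
CUSTOM_DOMAINS_TG = "arn:aws:elasticloadbalancing:us-west-2:123456789012:targetgroup/custom-domains/bbb222"

ALL_TARGET_GROUPS = [REGULAR_DOMAINS_TG, CUSTOM_DOMAINS_TG]


def remove_node_from_all_load_balancers(node_id):
    """Deregister a node from every target group that routes traffic to it.

    Our automation only did this for the regular domains load balancer, so the
    custom domains load balancer kept routing requests to nodes whose service
    processes were restarting.
    """
    for target_group_arn in ALL_TARGET_GROUPS:
        elb.deregister_targets(TargetGroupArn=target_group_arn,
                               Targets=[{"Id": node_id}])
```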

On November 13th at 20:16 UTC we launched a rolling deployment in the US-2 environment. This particular deployment included a change in the core authentication service that required a restart. Deployments requiring restarts are uncommon; the last time we required a restart of this service was June 18th, 2017, before the custom domains feature was made available in this environment.
Because nodes were not removed from the custom domains load balancer while their processes restarted, the restart affected the active requests on each node in sequence, resulting in failures with status code 502.

Timeline

  • 20:16 UTC - Deployment started
  • 20:33 UTC - First node in the list restarts the service; first batch of errors occurs
  • 21:28 UTC - Last node in the list restarts the service; last batch of errors occurs
  • 21:34 UTC - Deployment finished

What Are We Doing About It?

We have updated our deployment scripts to handle all load balancers associated with a service.
We are evaluating alternatives for triggering more specific alerts with lower thresholds, so that we get notified faster when a single node reports failures or a small number of errors occurs.
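As a rough illustration of the second item, a low-threshold CloudWatch alarm on the load balancer's 502 count is one possible way to get paged after a handful of errors instead of a large aggregate spike. The metric name and namespace are standard ALB metrics, but the alarm name, load balancer dimension, and SNS topic below are placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm on more than a handful of 502s from the ALB in any one-minute window.
# The LoadBalancer dimension value and SNS topic ARN are placeholders.
cloudwatch.put_metric_alarm(
    AlarmName="authn-api-502-low-threshold",
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_ELB_502_Count",
    Dimensions=[{"Name": "LoadBalancer",
                 "Value": "app/custom-domains/0123456789abcdef"}],
    Statistic="Sum",
    Period=60,                # one-minute windows
    EvaluationPeriods=1,      # fire on the first bad window
    Threshold=5,              # a small number of errors is enough to page
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-west-2:123456789012:oncall"],
)
```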

Posted Nov 24, 2017 - 15:03 UTC

Resolved
This incident has been resolved.
Posted Nov 13, 2017 - 22:14 UTC
Monitoring
We identified a deployment issue and have mitigated it. The errors have stopped. We will continue to monitor for additional issues.
Posted Nov 13, 2017 - 21:40 UTC
Investigating
We are investigating errors in the Authentication API in the US-2 environment.
Posted Nov 13, 2017 - 21:09 UTC
This incident affected: Auth0 US-2 (PROD) (User Authentication).