[Minor] A small percentage of authentication requests using our AD connector component is failing

Incident Report for Auth0

Postmortem

Overview

Between 2017-12-13 20:22 UTC and 2017-12-14 01:25 UTC, end users of Auth0 customers in the AU environment who use the AD/LDAP connector without caching experienced login errors.

We would like to apologize for the impact this had on you and your customers and explain what caused the problem, how we reacted to the incident, and what we are doing to prevent incidents like this from happening in the future.

What Happened

On December 13th, our infrastructure team carried out scheduled maintenance to migrate our AU environment to new infrastructure created with automation they had been developing. The main goal of this "more automated" environment is to reduce risk and increase our velocity when delivering updates to our infrastructure.

As part of this migration, we deployed new AWS EC2 instances of all our services, including the one that supports the connector feature.

When this service was deployed to the new instances, it included a configuration mistake that caused it to spawn 2 processes per EC2 instance instead of 1. The service supports this setting in environments where the AD/LDAP feature is not used, but here it caused intermittent failures when the client tried to connect to the AD/LDAP server.

During the first hours of the migration, a single customer reported this issue. All our tests were successful, and our monitoring systems showed successful transactions from other customers. This was because Auth0 caches AD/LDAP users by default, which allowed end users to authenticate even when the connector was down, so the vast majority of login attempts succeeded. Tenants that had the caching feature disabled were affected by this issue.
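
To illustrate why the problem stayed hidden, here is a minimal Python sketch of a login path that falls back to cached users when the connector cannot be reached. All names (`ConnectorError`, `authenticate`, `failing_connector`, `_digest`) are hypothetical and do not reflect Auth0's actual implementation. The sketch also simplifies the intermittent failures into a connector that always fails.

```python
import hashlib


class ConnectorError(Exception):
    """Raised when the (hypothetical) AD/LDAP connector cannot be reached."""


def _digest(password):
    # Illustrative only: a real system would use a salted, slow password hash.
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


def authenticate(username, password, connector, cache, caching_enabled=True):
    """Return True if the login succeeds, False if the credentials are rejected.

    Raises ConnectorError when the connector is unreachable and no cached
    entry can be used for this user.
    """
    try:
        ok = connector(username, password)
    except ConnectorError:
        # Connector unavailable: fall back to the cache if the tenant allows it.
        if caching_enabled and username in cache:
            return cache[username] == _digest(password)
        raise
    if ok and caching_enabled:
        # Remember the user so later logins can survive a connector outage.
        cache[username] = _digest(password)
    return ok


def failing_connector(username, password):
    # Stands in for the misconfigured service: every connection attempt fails.
    raise ConnectorError("connector unreachable")
```

With caching enabled, a user who has logged in before still authenticates through `failing_connector`. With `caching_enabled=False`, the same call raises `ConnectorError`, which matches the pattern we saw: most logins succeeded while tenants without caching saw errors.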

This situation initially led us to consider a possible configuration issue on the customer's side. We started two parallel investigations: one looking for errors in the tenant configuration and one reviewing our own changes (which were still the more likely cause). The bug took a long time to find because our logs did not indicate that running this service with more than one process per instance was a problem.

Additionally, even though our migration plan included progressive checks at different stages, it did not include clear guidelines for rollback conditions or a clear rollback procedure for all situations. We considered rolling back when we could not find the issue, but because we had not done that analysis beforehand, we decided against it: a bad rollback could have increased the impact.

On Dec 14th at 01:15 UTC, we found the configuration bug, and at 01:18 UTC we deployed a fix to decrease the number of workers, which solved the issue.

Timeline

Dec 13th, 20:22 UTC: The DNS record for the Active Directory connector was migrated.
Dec 13th, 23:27 UTC: We received the first notification about this issue.
Dec 14th, 01:15 UTC: We identified the configuration bug causing the incident.
Dec 14th, 01:18 UTC: We deployed a hotfix to reduce the number of workers in auth0-users.
Dec 14th, 01:25 UTC: The deployment was completed and the incident was over.

What Are We Doing About It?

Adding and improving monitors to detect service health.
Updating the functional tests so they can run even when an environment is not the active one. This will allow us to run these tests before changing the DNS record.
Adding warning logs to the auth0-users service to notify us when workers is set to a number that prevents AD from working.
Improving our migration procedure to define the roll-forward step, success conditions, rollback criteria, and the rollback procedure.
Improving alerts on failing functional tests.
Separating the more critical functional tests to make sure they are run more frequently, complete faster and alert faster on failures.
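
As an illustration of the warning-log item above, a startup check could look roughly like the following Python sketch. The function name `check_worker_config` and its parameters are assumptions made for this example; the real auth0-users service and its configuration keys may look quite different. The only rule encoded here is the one from this incident: more than one process per instance breaks AD/LDAP connections.

```python
import logging

# Logger name borrowed from the service name; the logging setup is assumed.
logger = logging.getLogger("auth0-users")


def check_worker_config(workers, ad_ldap_enabled):
    """Return True if the worker count is safe, logging a warning otherwise.

    Hypothetical rule: more than one process per instance is only
    acceptable when the AD/LDAP feature is not used.
    """
    if ad_ldap_enabled and workers > 1:
        logger.warning(
            "workers=%d, but AD/LDAP connections need a single process per instance",
            workers,
        )
        return False
    return True
```

Running a check like this at service start, for example `check_worker_config(2, ad_ldap_enabled=True)`, would have put the misconfiguration into the logs as soon as the new instances came up.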

Summary

We are sorry about this issue. We will incorporate the lessons we have learned and work on the action items above to help us prevent similar situations.

Thank you for your understanding and your continued support of Auth0.

Posted Feb 05, 2018 - 20:18 UTC

Resolved

This incident has been resolved.

Posted Dec 14, 2017 - 01:42 UTC

Monitoring

An issue with our connector mesh caused some requests to fail. We have fixed this and errors have stopped.

Posted Dec 14, 2017 - 01:28 UTC

Investigating

We are currently investigating this issue.

Posted Dec 14, 2017 - 00:32 UTC

This incident affected: Auth0 Australia (PROD) (User Authentication) and Auth0 Australia (PREVIEW) DEPRECATED (User Authentication).