Between 2017-12-13 20:22 UTC and 2017-12-14 01:25 UTC, end users of Auth0 customers in the AU environment using the AD/LDAP connector without caching experienced login errors.
We would like to apologize for the impact this had on you and your customers and explain what caused the problem, how we reacted to the incident, and what we are doing to prevent incidents like this from happening in the future.
On December 13th, our infrastructure team performed scheduled maintenance to migrate our AU environment to new infrastructure created via automation they had been developing. The main goal of this more automated environment is to reduce risk and increase our velocity when delivering infrastructure updates.
As part of this migration, we deployed new AWS EC2 instances of all our services, including the one that supports the connector feature.
When this service was deployed to the new instances, its configuration contained a mistake that caused it to spawn two processes per EC2 instance instead of one. The service supports this setting in environments where the AD/LDAP feature is not used, but here it caused intermittent failures when clients tried to connect to the AD/LDAP server.
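The report does not detail the exact failure mechanism, but one common way a multi-process configuration breaks a stateful service is that a persistent link (such as the connector's long-lived connection) lands on only one of the processes, while requests are routed across all of them. The sketch below is a toy model of that failure mode, not Auth0's implementation; the round-robin routing and the `Worker` class are assumptions for illustration only.

```python
import itertools

class Worker:
    """One hypothetical service process on an EC2 instance.

    holds_connection marks the single process that the connector's
    persistent link happens to land on.
    """
    def __init__(self, name, holds_connection):
        self.name = name
        self.holds_connection = holds_connection

    def handle_login(self):
        # A login only reaches the directory if this process owns the link.
        return "ok" if self.holds_connection else "error"

def run(workers, attempts=10):
    # Assumed round-robin routing across the instance's processes.
    rr = itertools.cycle(workers)
    return [next(rr).handle_login() for _ in range(attempts)]

# Misconfigured: two processes, but the connector link lands on only one,
# so logins fail intermittently depending on which process they hit.
two_procs = [Worker("p0", True), Worker("p1", False)]
print(run(two_procs))

# Fixed: a single process owns the link, so every login reaches it.
one_proc = [Worker("p0", True)]
print(run(one_proc))
```

Under this model, half the attempts fail with two processes and none fail with one, matching the intermittent errors observed and the eventual fix of reducing the worker count.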
During the first hours of the migration, a single customer reported this issue. All our tests were successful, and our monitoring systems showed successful transactions from other customers. This was because Auth0 caches AD/LDAP users by default, which allowed end users to authenticate even when the connector was down and let the vast majority of login attempts succeed. Only tenants that had the caching feature disabled were affected.
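The caching behavior described above is why the outage stayed largely invisible: a user who had logged in successfully before could still authenticate from the cache while the connector was unreachable. This is a minimal toy model of that fallback, assuming a simple per-user cache; the function and its parameters are hypothetical, not Auth0's actual API.

```python
def authenticate(user, connector_up, cache):
    """Toy model of cache fallback for AD/LDAP logins.

    With caching enabled, a login can still succeed against a
    previously cached user when the connector is down.
    """
    if connector_up:
        cache[user] = True           # live lookup succeeded; remember the user
        return True
    return cache.get(user, False)    # connector down: fall back to the cache

cache = {}
print(authenticate("alice", connector_up=True, cache=cache))   # live login, cached
print(authenticate("alice", connector_up=False, cache=cache))  # served from cache
print(authenticate("bob", connector_up=False, cache=cache))    # never cached, fails
```

Tenants with caching disabled behave like the last call on every attempt, which is why only they saw errors.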
This situation initially led us to suspect a configuration issue on the customer's side. We started two parallel investigations: one into possible errors in the tenant's configuration, and one into our own changes (which were still the more likely cause). The bug took a long time to find because we lacked logs indicating that running this service with more than one process per instance was a problem.
Additionally, even though our migration plan included progressive checks at different stages, it did not include clear rollback conditions or a rollback procedure covering all situations. We considered rolling back when we could not find the issue, but because we had not done that analysis beforehand, we decided against it: a bad rollback could have increased the impact.
At 01:18 UTC on December 14th, we found the configuration bug and deployed a fix that reduced the number of worker processes, resolving the issue.
We are sorry about this issue. We have learned lessons from it and identified action items that we will work on to help prevent similar situations.
Thank you for your understanding and your continued support of Auth0.