On January 17th between 02:01 UTC and 02:06 UTC, end users of customers in our EU environment were unable to authenticate using our services.
We would like to apologize for the impact this had on you and your customers and explain what caused the problem, how we reacted to the incident, and what we are doing to prevent incidents like this from happening in the future.
As part of the remediation actions for the Meltdown security vulnerability, we were upgrading Linux kernels across all our systems. Because the patches were too complex for the live patching mechanism, a full reboot was required.
When restarting MongoDB, our primary data store, we expected a downtime of around 20 seconds, but some services were unable to reconnect to MongoDB afterwards.
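For illustration, the sketch below shows the kind of retry-with-backoff a service needs so that a primary election on the order of 20 seconds does not leave it permanently disconnected. It is a minimal example assuming the official Node.js MongoDB driver; the `connectWithRetry` helper and its parameters are hypothetical, not our production code.

```typescript
import { MongoClient } from "mongodb";

// Hypothetical helper: retry the connection with exponential backoff
// instead of failing permanently while a new primary is being elected.
async function connectWithRetry(
  uri: string,
  maxAttempts = 10,
  baseDelayMs = 500
): Promise<MongoClient> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const client = new MongoClient(uri);
    try {
      await client.connect();
      return client; // connected to the current primary
    } catch (err) {
      await client.close().catch(() => {}); // discard the failed client
      if (attempt === maxAttempts) throw err;
      // Exponential backoff comfortably covers a ~20 second election window.
      const delay = baseDelayMs * 2 ** (attempt - 1);
      console.warn(`connect attempt ${attempt} failed, retrying in ${delay}ms`);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw new Error("unreachable");
}
```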
While the application was disconnected from MongoDB, unexpected 401 responses, 404 responses, and TLS certificate errors were observed.
Several issues combined to increase the time to recover after the new primary node was elected.
A change had been made to the database reconnection logic to improve how services handle database connection issues. Due to a configuration problem, the logic that handles process initialization failed and was unable to restart the master process.
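As a rough illustration of this failure mode, the sketch below assumes a Node.js cluster-style supervisor: the primary process respawns workers when they exit, so a configuration problem in that supervisor path leaves crashed workers with nothing to restart them. The names here are illustrative, not our actual initialization code.

```typescript
import cluster from "node:cluster";

function startWorker(): void {
  // Placeholder: load configuration, connect to the database, serve traffic.
  // Failing fast on invalid configuration keeps the error visible.
}

if (cluster.isPrimary) {
  // Supervisor: fork a worker and respawn it whenever it dies. If this
  // block itself fails to initialize (e.g. due to bad configuration),
  // dead workers are never restarted, which is the failure mode above.
  cluster.fork();
  cluster.on("exit", (worker, code) => {
    console.warn(`worker ${worker.process.pid} exited (code ${code}), respawning`);
    cluster.fork();
  });
} else {
  startWorker();
}
```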
The configuration problem had been detected in our tests and a patch was already in place, but it had not yet been deployed to that environment due to our rolling deployment process.
There is a dependency between services: although the main component (auth0-server) recovered as expected, the upstream service used to authenticate database connections had not yet recovered. The errors returned in this case were 401 (Unauthorized) instead of 503 (Service Unavailable).
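A minimal sketch of the distinction, assuming an Express-style service: checking the health of the upstream credential dependency before running authentication lets an outage surface as a 503 instead of being misreported as a 401. The `upstreamAuthHealthy` flag is hypothetical; in practice it would be driven by a health check.

```typescript
import express from "express";

const app = express();

// Hypothetical flag tracking whether the upstream service that
// authenticates database connections is reachable.
let upstreamAuthHealthy = true;

// Gate on dependency health *before* authentication runs, so a dead
// dependency returns 503 (Service Unavailable) rather than 401,
// which should be reserved for genuine credential failures.
app.use((req, res, next) => {
  if (!upstreamAuthHealthy) {
    res.status(503).json({ error: "authentication backend unavailable" });
    return;
  }
  next();
});

app.listen(3000);
```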
When our load balancer detected all instances as unhealthy (02:01-02:04 UTC), DNS failed over to eu.auth0.com instead of *.eu.auth0.com, which was pointing to the wrong set of IPs. Because those IPs did not belong to the authentication cluster, customers experienced 404 responses and TLS errors. Only tenants whose requests resolved from fresh DNS queries, with no cached records, were affected.
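To illustrate why the wrong IPs produce TLS errors in addition to 404s: a server that does not hold a certificate for the requested hostname fails validation at the TLS handshake, before any HTTP response is possible. The `probeTls` helper below is a hypothetical diagnostic sketch, not part of our tooling.

```typescript
import tls from "node:tls";

// Connect to a host and report whether the certificate it presents is
// valid for that hostname. A DNS name pointed at a cluster serving a
// different certificate fails this check.
function probeTls(host: string, port = 443): void {
  const socket = tls.connect({ host, port, servername: host }, () => {
    const cert = socket.getPeerCertificate();
    console.log(
      `${host}: authorized=${socket.authorized}, subject CN=${cert.subject?.CN}`
    );
    socket.end();
  });
  socket.on("error", (err) => console.error(`${host}: ${err.message}`));
}

probeTls("eu.auth0.com");
```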
As part of the upgrade process, we identified the services that were unable to recover on their own and restarted them. After that, every service was able to connect to the new MongoDB primary and normal operation resumed.
We are truly sorry for this issue and the downtime it caused you. We will use what we have learned to prevent similar situations in the future.
Thank you for your understanding and your continued support of Auth0.