[Major] Elevated errors on logins in EU
Incident Report for Auth0
Postmortem

Overview

On January 17th between 02:01 UTC and 02:06 UTC, end users of customers in our EU environment were unable to authenticate using our services.

We would like to apologize for the impact this had on you and your customers and explain what caused the problem, how we reacted to the incident, and what we are doing to prevent incidents like this from happening in the future.

What happened?

As part of the remediation actions for the Meltdown security vulnerability, we were upgrading Linux kernels across all of our systems. Because the patches were too complex, the live patching mechanism was not available and a full reboot was needed.

When restarting MongoDB, our primary data store, we expected a downtime of around 20 seconds; however, some services were not able to reconnect to MongoDB afterwards.
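
For context, a brief primary election of this kind is normally absorbed by the database driver itself. The sketch below shows what that generally looks like with the Node.js MongoDB driver; the connection string, database and collection names, and timeout values are illustrative assumptions, not our production configuration.

```typescript
import { MongoClient } from "mongodb";

// Illustrative values only; not our production configuration.
// retryWrites/retryReads let the driver transparently retry an operation
// that fails during a primary election, and serverSelectionTimeoutMS
// bounds how long it waits for a new primary to appear.
const client = new MongoClient("mongodb://db1,db2,db3/?replicaSet=rs0", {
  retryWrites: true,
  retryReads: true,
  serverSelectionTimeoutMS: 30_000, // longer than the ~20 second election window
});

export async function findUser(email: string) {
  await client.connect(); // no-op if the client is already connected
  // During a step down the driver re-selects the new primary; callers only
  // see an error if no primary becomes available within the timeout.
  return client.db("users").collection("accounts").findOne({ email });
}
```

In this incident, the driver-level retry was not the failing piece; the problem was in the process-level recovery described in the root cause analysis below.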

While the application was disconnected from MongoDB, some unexpected 401 and 404 responses and TLS certificate errors were observed.

Timeline

  • 02:00:22 UTC: A step down was performed on the primary MongoDB node.
  • 02:00:42 UTC: A new primary was elected and MongoDB was healthy.
  • 02:01:00 UTC: Our main and IdP services failed to reconnect to the database, and the load balancer detected all the instances as unhealthy.
  • 02:01:13 UTC: Our monitoring showed authentication failures.
  • 02:03:28 UTC: The affected applications were manually restarted.
  • 02:03:41 UTC: Our main service recovered and was ready to process requests.
  • 02:04:00 UTC: The load balancer detected some instances as healthy.
  • 02:05:31 UTC: Our internal IdP service recovered and was ready to process requests.
  • 02:05:49 UTC: We stopped receiving errors from our monitoring systems and all of our manual tests succeeded.

Root cause analysis

Several issues combined to increase the time to recover after the new primary node was elected.

A change had recently been made to the database reconnection logic to improve how services handle database connection issues. Due to a configuration problem, the logic that handles process initialization failed and was not able to restart the master process.
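
To illustrate the intended recovery pattern (a hedged sketch, not our actual implementation; the function name and retry parameters are invented for the example): when a service exhausts its reconnection attempts it exits, and the process initialization logic is expected to bring it back with a fresh connection. The misconfiguration sat in that restart path, so the exited processes were not brought back automatically.

```typescript
import { MongoClient } from "mongodb";

// Hedged sketch of the intended behaviour, not Auth0's actual code.
// If the replica set cannot be reached within a bounded number of attempts,
// the process exits so that the initialization/supervision layer can restart
// it with a clean connection. In this incident, a configuration problem in
// that restart layer meant exited processes were not respawned.
async function connectWithRetry(uri: string, attempts = 5): Promise<MongoClient> {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      const client = new MongoClient(uri);
      await client.connect();
      return client;
    } catch (err) {
      console.error(`connection attempt ${attempt}/${attempts} failed`, err);
      // simple exponential backoff between attempts
      await new Promise((resolve) => setTimeout(resolve, 2 ** attempt * 1000));
    }
  }
  // Give up and let the process manager restart the service.
  process.exit(1);
}
```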

This configuration problem had been detected in our tests and a patch was already in place, but it had not yet been deployed to that environment due to our rolling deployment process.

There is a dependency between services: although the main component (auth0-server) recovered as expected, the upstream service used to authenticate database connections had not yet recovered. The errors returned in this case were 401 (Unauthorized) instead of 503 (Service Unavailable).
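
The remediation item below about IdP error handling addresses exactly this: a dependency failure should surface as a 503 rather than a misleading 401. A minimal sketch of that mapping, written as Express-style middleware (the framework, route, and helper names are assumptions for illustration, not a description of our code):

```typescript
import express, { NextFunction, Request, Response } from "express";

const app = express();

// Hypothetical helper standing in for the call to the upstream IdP service
// that authenticates database connections.
async function authenticateViaIdp(token: string): Promise<boolean> {
  // ...call the internal IdP service...
  return token.length > 0;
}

app.post("/authorize", async (req, res, next) => {
  try {
    const ok = await authenticateViaIdp(req.header("authorization") ?? "");
    if (!ok) {
      // Genuine credential failure: 401 is the right answer here.
      return res.status(401).json({ error: "unauthorized" });
    }
    res.json({ status: "ok" });
  } catch (err) {
    next(err);
  }
});

// If the upstream IdP is down or unreachable, the failure should surface as
// 503 Service Unavailable rather than being reported as a credentials problem.
app.use((err: Error, _req: Request, res: Response, _next: NextFunction) => {
  res.status(503).json({ error: "service unavailable" });
});

app.listen(3000);
```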

When our load balancer detected all the instances as unhealthy (02:01-02:04 UTC), DNS failed over to eu.auth0.com instead of *.eu.auth0.com, which pointed to the wrong set of IPs. Because those IPs did not point to the authentication cluster, customers experienced 404 responses and TLS errors. This was seen only by tenants whose requests were resolved from fresh DNS queries, with no caching involved.
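
One of the remediation items below is to add testing procedures for failover configurations. A minimal sketch of the kind of check involved is shown here, assuming a dedicated failover record name that can be queried directly; the record name and addresses are placeholders, not our real values.

```typescript
import { promises as dns } from "dns";
import assert from "assert";

// Placeholder record name and addresses for illustration only; the real
// values live in our infrastructure configuration.
const FAILOVER_RECORDS: Record<string, string[]> = {
  "failover.eu.auth0.com": ["203.0.113.10", "203.0.113.11"],
};

// Resolve each failover record with a fresh query and assert it points at
// the intended set of addresses, so a misconfigured failover target is
// caught before the failover is ever exercised.
async function checkFailoverRecords(): Promise<void> {
  for (const [name, expected] of Object.entries(FAILOVER_RECORDS)) {
    const addresses = await dns.resolve4(name);
    assert.deepStrictEqual(
      [...addresses].sort(),
      [...expected].sort(),
      `unexpected addresses for ${name}: ${addresses.join(", ")}`
    );
  }
}

checkFailoverRecords().catch((err) => {
  console.error(err);
  process.exit(1);
});
```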

How did we fix this?

As part of the upgrade process, we identified the services that were unable to recover and restarted them. After that, every service was able to connect to the new MongoDB primary instance and everything started working as expected.

What Are We Doing About It?

  • We will do more thorough testing of failure scenarios.
  • We have ensured that the faulty database reconnection patch was rolled back in all environments.
  • We will improve our internal processes to detect risks and make sure every critical patch is applied before making any kind of change to our infrastructure.
  • We are looking for alternatives to handle database disconnections more gracefully and reconnect faster.
  • We fixed our DNS failover records by pointing them to our canary environment in case of health check failures.
  • We are defining new testing procedures to exercise failover configurations and catch configuration issues beforehand.
  • We are improving the IdP error response handling in our main service so that it answers with the proper status codes when the IdP fails.

Summary

We are really sorry about this issue and the downtime we caused for you. We will use what we have learned to help prevent similar situations in the future.

Thank you for your understanding and your continued support of Auth0.

Posted Feb 14, 2018 - 11:19 UTC

Resolved
No new errors have occurred
Posted Jan 17, 2018 - 02:24 UTC
Monitoring
A service restart has fixed the issue
Posted Jan 17, 2018 - 02:08 UTC
Investigating
A percentage of authentication transactions is failing to process correctly. The team is currently investigating the root cause. We'll keep you updated.
Posted Jan 17, 2018 - 02:05 UTC
This incident affected: Auth0 Europe (PROD) (User Authentication, [DEPRECATED] Custom DB Connections & Rules) and Auth0 Europe (PREVIEW) (User Authentication, [DEPRECATED] Custom DB Connections & Rules).