Elevated error rates for MFA with Auth0 Guardian
Incident Report for Auth0
Postmortem

Overview

On December 13th at 20:19 UTC, the Auth0 Australia environment started experiencing TLS connection errors in authentication requests to Guardian. Around 22:56 UTC, the correct certificate was uploaded and service returned to normal.

What Happened

On December 13th, our infrastructure team carried out scheduled maintenance to migrate our AU environment to new infrastructure created with automation they had been developing. The main goal of this "more automated" environment is to reduce risk and increase our velocity when delivering updates to our infrastructure. We began the migration at a time when traffic in the AU region is typically at its minimum. At 20:19 UTC, the DNS record for guardian.au.auth0.com (the hostname of our MFA service) was pointed to the new cluster. We observed traffic in the new cluster across the load balancer and the backend nodes, but because of the hour, it consisted only of health check requests from the load balancers. At this point we wrongly assumed that Guardian had been migrated successfully.

Automated tests for Guardian run every hour to check that all environments are working as expected. The tests run against every region, and in this case they took 45 minutes to complete; we got our first alert at 22:05 UTC. We then started to diagnose the problem, and a manual test showed that requests were failing during the authentication flow because a TLS connection could not be established. The problem was then quickly identified: the wrong certificate was configured on the new load balancers. We decided to deploy a fix for the certificate configuration instead of rolling back, since there was no customer traffic on these instances at the time. At 22:56 UTC, the correct certificate was configured on the load balancers and requests against the MFA API succeeded.
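The manual test described above amounts to attempting a verified TLS handshake against the new load balancer while presenting the production hostname. A minimal Python sketch of such a check follows. The function name check_tls, its parameters, and the return shape are illustrative assumptions, not Auth0's actual tooling.

```python
import socket
import ssl


def check_tls(address, hostname, port=443, timeout=5.0):
    """Attempt a verified TLS handshake with `address`, sending `hostname` as SNI.

    Hypothetical helper for illustration. Returns a tuple (ok, detail).
    """
    # The default context verifies the certificate chain and the hostname.
    context = ssl.create_default_context()
    try:
        with socket.create_connection((address, port), timeout=timeout) as sock:
            # server_hostname sets SNI and is the name the certificate must match.
            with context.wrap_socket(sock, server_hostname=hostname) as tls:
                return True, tls.version()
    except ssl.SSLCertVerificationError as exc:
        # Raised when the presented certificate does not validate for `hostname`.
        return False, "certificate rejected: " + exc.verify_message
    except (ssl.SSLError, OSError) as exc:
        # Covers other TLS failures, refused connections, and timeouts.
        return False, "connection failed: " + str(exc)
```

Called as check_tls("203.0.113.10", "guardian.au.auth0.com"), where the address is a placeholder for a new load balancer, a wrongly configured certificate would be expected to surface as a "certificate rejected" result even though plain health checks still pass.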

Timeline

  • 2017-12-13 20:19 UTC: DNS record for the Guardian service was pointed to the new cluster.
  • 2017-12-13 22:31 UTC: Functional test reported SSL connection errors.
  • 2017-12-13 22:40 UTC: We identified the wrong certificate as the cause of the incident.
  • 2017-12-13 22:56 UTC: The correct certificate was uploaded and functional tests ran successfully.
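The intervals between these timeline entries can be worked out directly. The short Python sketch below uses only the four timestamps listed above; the variable names are illustrative.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Timestamps (UTC) copied from the timeline entries.
cutover = datetime.strptime("2017-12-13 20:19", FMT)
detected = datetime.strptime("2017-12-13 22:31", FMT)
identified = datetime.strptime("2017-12-13 22:40", FMT)
resolved = datetime.strptime("2017-12-13 22:56", FMT)

time_to_detect = detected - cutover      # DNS change to first failing test
time_to_identify = identified - detected  # failing test to root cause
time_to_fix = resolved - identified       # root cause to corrected certificate
total = resolved - cutover                # full duration of the incident

print(time_to_detect)    # 2:12:00
print(time_to_identify)  # 0:09:00
print(time_to_fix)       # 0:16:00
print(total)             # 2:37:00
```

Detection accounts for most of the 2 hours 37 minutes; diagnosis and the fix together took 25 minutes.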

What Are We Doing About It?

  • Improving our migration procedure to define the roll-forward step, success conditions, rollback criteria, and the rollback procedure.
  • Updating the functional tests so they can run against an environment that is not the active one. This will allow us to run these tests before changing the DNS record.
  • Adding and improving monitors to detect service health.
  • Improving alerts on failing functional tests.
  • Separating the more critical functional tests so that they run more frequently, complete faster, and alert faster on failures.
  • Improving the performance of the MFA functional tests in general.
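The first two improvements above, explicit success conditions and tests that can run before the DNS change, could be combined into a simple pre-cutover gate. The Python sketch below is one possible shape for such a gate under stated assumptions: the names Check, failed_checks, and cutover_decision are hypothetical, and the individual checks (for example a TLS handshake or a test login against the inactive environment) are left to the caller.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Check:
    """One success condition for the migration (hypothetical structure)."""
    name: str
    run: Callable[[], bool]  # returns True when the condition holds


def failed_checks(checks: List[Check]) -> List[str]:
    """Run every check and return the names of those that failed or raised."""
    failures = []
    for check in checks:
        try:
            passed = check.run()
        except Exception:
            passed = False  # a check that errors counts as a failure
        if not passed:
            failures.append(check.name)
    return failures


def cutover_decision(checks: List[Check]) -> str:
    """Return 'proceed' only if every success condition holds."""
    failures = failed_checks(checks)
    if failures:
        return "hold: " + ", ".join(failures)
    return "proceed"
```

With a gate like this, the DNS record would be changed only after cutover_decision returns "proceed" for the new environment, and the same checks could be rerun after the change as rollback criteria.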

Closing

We are sorry about the issues this incident caused. We use these opportunities to learn and to implement improvements that aim to prevent similar situations and help us react better and faster if they do happen.

Thank you for your understanding and your continued support of Auth0.

Posted Jan 09, 2018 - 21:56 UTC

Resolved
This incident has been resolved.
Posted Dec 13, 2017 - 23:29 UTC
Monitoring
An incorrect certificate was causing the issue. We have fixed the issue and errors are no longer occurring.
Posted Dec 13, 2017 - 23:10 UTC
Investigating
We are currently investigating this issue.
Posted Dec 13, 2017 - 22:58 UTC
This incident affected: Auth0 Australia (PROD) (User Authentication) and Auth0 Australia (PREVIEW) (User Authentication).