Errors with MFA
Incident Report for Auth0
Postmortem

2020-05-20 - Errors with MFA

Summary

A faulty code change was applied to our MFA implementation, causing end-users to be unable to log in. For the duration of the incident, this issue affected 100% of login attempts requiring MFA for customers using SAML and configured to use the Classic Universal Login Experience.

For Preview environments, the incident lasted from May 18th 23:00 UTC to May 20th 16:30 UTC.

For Production environments, the incident lasted from May 20th 13:00 UTC to May 20th 14:35 UTC.

We would like to apologize for the impact this had on you and your customers and explain what caused the problem, how we reacted to the incident, and what we are doing to prevent incidents like this from happening in the future.

What Happened

On Monday, May 18th, our engineers merged a code change to our authorization server implementation that affected the authentication flow logic. The change altered the token generation logic used as part of the authentication flow; in particular, it added a new attribute to the token that is used when the user transitions to the MFA challenge step.

When end-users reached that step via a SAML authentication flow, the new logic passed an invalid value to that token attribute. As a result, when end-users transitioned to the MFA step, the authentication flow was interrupted because downstream validation logic detected the invalid value.
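The failure mode described above can be sketched in a few lines. This is an illustration only, not our actual implementation: the report does not name the token attribute, its valid values, or the validation routine, so `origin`, `KNOWN_FLOWS`, `build_token`, and `validate_token` are all hypothetical names, and the specific invalid value (`None`) is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical set of values the downstream validator accepts.
KNOWN_FLOWS = {"database", "social", "saml"}


@dataclass
class MfaTransitionToken:
    user_id: str
    origin: Optional[str]  # stands in for the newly added token attribute


def build_token(user_id: str, flow: str, buggy: bool) -> MfaTransitionToken:
    """Build the token handed to the MFA challenge step."""
    if buggy and flow == "saml":
        # Assumed failure mode: the SAML path supplies a value that the
        # downstream validator does not accept.
        return MfaTransitionToken(user_id, origin=None)
    return MfaTransitionToken(user_id, origin=flow)


def validate_token(token: MfaTransitionToken) -> bool:
    """Downstream check that rejects tokens carrying an invalid attribute."""
    return token.origin in KNOWN_FLOWS
```

In this sketch, only the SAML path produces a token that fails validation, which matches the observed pattern: other flows reaching the MFA step were unaffected.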

Because the change was first deployed to our Preview environments, a small number of customers were affected there first. Our monitoring and alerting automation failed to detect the increase in errors before the change was promoted to our Production environments.

Once the change reached the Production environments, the volume of failures increased. As reports of failures were escalated to our engineering teams, they took steps to roll back the change.
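One reason a failure like this can go unnoticed is that a flow with a 100% failure rate may still be a small share of total traffic. The sketch below shows, under assumed names and thresholds, how breaking failure rates down per authentication flow would surface such a case. The event shape, `threshold`, and `min_samples` are hypothetical and do not describe our monitoring configuration.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Each event is (flow, succeeded) for one transition into the MFA step.
Event = Tuple[str, bool]


def failure_rates(events: List[Event]) -> Dict[str, float]:
    """Return the fraction of failed MFA transitions for each flow."""
    totals: Counter = Counter()
    failures: Counter = Counter()
    for flow, succeeded in events:
        totals[flow] += 1
        if not succeeded:
            failures[flow] += 1
    return {flow: failures[flow] / totals[flow] for flow in totals}


def flows_to_alert(
    events: List[Event], threshold: float = 0.5, min_samples: int = 5
) -> List[str]:
    """List flows whose failure rate meets the (hypothetical) threshold."""
    totals = Counter(flow for flow, _ in events)
    rates = failure_rates(events)
    return sorted(
        flow
        for flow, rate in rates.items()
        if totals[flow] >= min_samples and rate >= threshold
    )
```

For example, with 95 successful database logins and 5 failed SAML logins, the aggregate failure rate is only 5%, but `flows_to_alert` flags `saml` because every SAML attempt failed.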

Mitigation Actions

  • Because our monitoring and alerting automation failed to detect the anomaly before the change was promoted to our Production environments, we are reviewing its configuration so that it detects this kind of failure in the future. The alerts showed that a subset of authentication flows failed with status code 401, which made it difficult for our systems to separate a normal failure flow from an internal one.
  • We have identified gaps in our automated testing approach, in particular in covering the various possible transitions from a given authentication flow (e.g. SAML) to our MFA flows. Our team will work to improve the targeted test suites that will allow us to detect regressions in this use case.
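The test coverage gap in the second item can be illustrated with a small matrix test that exercises every transition from an authentication flow into an MFA factor. This is a sketch under stated assumptions: the lists of flows and factors, and the `login_with_mfa` callable under test, are hypothetical placeholders rather than our real test suite.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

# Hypothetical lists; a real suite would enumerate the supported
# authentication flows and MFA factors.
AUTH_FLOWS = ["database", "social", "saml"]
MFA_FACTORS = ["otp", "sms", "push"]


def find_regressions(
    login_with_mfa: Callable[[str, str], bool],
    flows: Sequence[str] = AUTH_FLOWS,
    factors: Sequence[str] = MFA_FACTORS,
) -> List[Tuple[str, str]]:
    """Return every (flow, factor) pair for which the login fails."""
    return [
        (flow, factor)
        for flow, factor in product(flows, factors)
        if not login_with_mfa(flow, factor)
    ]


def fake_login_with_regression(flow: str, factor: str) -> bool:
    """Stand-in system under test that breaks only SAML-to-MFA transitions."""
    return flow != "saml"
```

Running `find_regressions(fake_login_with_regression)` reports all three SAML pairs and nothing else, so a regression confined to one flow is caught even when every other combination passes.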

Summary

We realize that Auth0 is a critical part of your development and production infrastructure. Again, we would like to apologize for the impact that this outage had on your operations. We are deeply aware of the pain that you and your subscribers feel as a result of downtime. Our teams continue to work tirelessly to provide you with the best authentication experience possible.

Thank you for your continued support of Auth0.

Annex 1: Events Timeline

Monday May 18th

21:15 - Faulty code change merged

23:00 - Change was deployed to preview (Australia and Europe)

23:12 - First errors appear in preview (Australia)

Tuesday May 19th

04:45 - First errors appear in preview (Europe)

13:28 - Change was deployed to preview (US)

~13:30 - First errors appear in preview (US)

Wednesday May 20th

13:00 - Starting deployment of change to production (US, Europe, Australia)

13:03 - First errors appear in production

14:12 - Customer ticket escalation to the on-call engineer

14:27 - Start rollback job in production (US, Europe, Australia)

~14:35 - Errors stopped happening in production (US, Europe, Australia)

16:30 - Change was reverted and errors stopped happening in preview (US, Europe, Australia)

Posted Jun 03, 2020 - 15:16 UTC

Resolved
This incident has been resolved.
Posted May 20, 2020 - 16:53 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted May 20, 2020 - 16:38 UTC
Update
We are continuing to work on a fix for our Preview environments. Our Production environments are running as expected, as of 14:39 UTC. We will keep you updated on our progress.
Posted May 20, 2020 - 15:49 UTC
Update
We are continuing to work on a fix for this issue.
Posted May 20, 2020 - 14:39 UTC
Identified
We have identified the issue and are working on providing a fix as soon as possible. We have also confirmed that this is only affecting customers using the Universal Login Classic Experience.
Posted May 20, 2020 - 14:32 UTC
Investigating
We are currently experiencing errors with our MFA endpoints. Some users might be unable to perform multi-factor authentication when logging in. We are investigating and will provide more information as it becomes available.
Posted May 20, 2020 - 14:27 UTC
This incident affected: Auth0 Europe (PREVIEW) (Multi Factor Authentication), Auth0 US (PREVIEW) (Multi Factor Authentication), Auth0 US (PROD) (Multi Factor Authentication), Auth0 US-2 (PROD) (Multi Factor Authentication), Auth0 Australia (PREVIEW) (Multi Factor Authentication), Auth0 Europe (PROD) (Multi Factor Authentication), and Auth0 Australia (PROD) (Multi Factor Authentication).