A faulty code change was applied to our MFA implementation, leaving end-users unable to log in. During the incident, this issue affected 100% of login attempts requiring MFA for customers using SAML and configured to use the Classic Universal Login Experience.
For Preview environments, the incident lasted from May 18th 23:00 UTC to May 20th 16:30 UTC.
For Production environments, the incident lasted from May 20th 13:00 UTC to May 20th 14:35 UTC.
We would like to apologize for the impact this had on you and your customers and explain what caused the problem, how we reacted to the incident, and what we are doing to prevent incidents like this from happening in the future.
On Monday, May 18th, our engineers merged a code change to our authorization server implementation that affected the authentication flow logic. The change altered the token generation logic used as part of the authentication flow; in particular, it added a new attribute to the token that is used when the user transitions to the MFA challenge step.
When end-users reached that step via a SAML authentication flow, the new logic passed an invalid value to that token attribute. As a result, when end-users transitioned to the MFA step, the authentication flow was interrupted because downstream validation logic detected the invalid value.
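To make the failure mode above concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not our actual implementation: the attribute name `connection_kind`, the flow names, and every function shown are hypothetical. It shows only the general pattern, in which one code path leaves a newly added token attribute with an invalid value and a downstream validation step then rejects the token.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical names throughout; none of this reflects the real code base.
VALID_CONNECTION_KINDS = {"database", "social", "saml"}


@dataclass(frozen=True)
class MfaTransitionToken:
    user_id: str
    connection_kind: Optional[str]  # stands in for the newly added attribute


def build_token(user_id: str, flow: str) -> MfaTransitionToken:
    """Build the token used when the user transitions to the MFA step."""
    kind_by_flow = {"password": "database", "social": "social"}
    # Illustrated bug: the "saml" flow has no entry, so .get() returns None.
    return MfaTransitionToken(user_id, kind_by_flow.get(flow))


def validate_for_mfa(token: MfaTransitionToken) -> None:
    """Downstream check that rejects a token carrying an invalid attribute."""
    if token.connection_kind not in VALID_CONNECTION_KINDS:
        raise ValueError(f"invalid connection_kind: {token.connection_kind!r}")


def can_reach_mfa(user_id: str, flow: str) -> bool:
    """Return True if the flow gets past validation into the MFA step."""
    try:
        validate_for_mfa(build_token(user_id, flow))
        return True
    except ValueError:
        return False
```

In this sketch, `can_reach_mfa("u1", "password")` succeeds while `can_reach_mfa("u1", "saml")` fails on every attempt, which mirrors how a single affected flow can fail 100% of the time while other flows remain healthy.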
The change was first deployed to our Preview environments, where a small number of customers began to be affected. Our monitoring and alerting automation failed to detect the increase in errors before the change was promoted to our Production environments.
Once the change reached the Production environments, the volume of failures increased. As reports of failures were escalated to our engineering teams, they undertook the steps to roll back the change.
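The detection gap described above, where errors in Preview did not block promotion to Production, is the kind of thing a per-flow promotion check can address. The Python sketch below is one possible shape for such a check, not a description of our tooling: `FlowStats`, `safe_to_promote`, and the threshold values are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowStats:
    """Login counts for one slice of traffic (hypothetical structure)."""
    attempts: int
    failures: int

    @property
    def failure_rate(self) -> float:
        return self.failures / self.attempts if self.attempts else 0.0


def safe_to_promote(
    baseline: FlowStats,
    canary: FlowStats,
    max_increase: float = 0.05,  # assumed tolerance, not a real setting
    min_attempts: int = 20,      # assumed minimum sample size
) -> bool:
    """Allow promotion only if the canary slice has enough traffic and its
    failure rate has not risen more than max_increase above the baseline."""
    if canary.attempts < min_attempts:
        return False  # too little data to judge, so hold the promotion
    return canary.failure_rate - baseline.failure_rate <= max_increase
```

The design point is that the comparison runs per slice (for example, SAML logins that require MFA) and not on the aggregate: a flow used by a small number of customers can fail completely while barely moving the overall error rate.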
We realize that Auth0 is a critical part of your development and production infrastructure. Again, we would like to take a moment to apologize for the impact that this outage had on your operations. We are deeply aware of the pains that you and your subscribers feel as a result of downtime. Our teams continue to work tirelessly to provide you with the best authentication experience possible.
Thank you for your continued support of Auth0.
May 18th 21:15 - Faulty code change merged
May 18th 23:00 - Change was deployed to preview (Australia and Europe)
May 18th 23:12 - First errors appear in preview (Australia)
May 19th 04:45 - First errors appear in preview (Europe)
May 19th 13:28 - Change was deployed to preview (US)
May 19th ~13:30 - First errors appear in preview (US)
May 20th 13:00 - Starting deployment of change to production (US, Europe, Australia)
May 20th 13:03 - First errors appear in production
May 20th 14:12 - Customer ticket escalation to the on-call engineer
May 20th 14:27 - Start rollback job in production (US, Europe, Australia)
May 20th ~14:35 - Errors stopped happening in production (US, Europe, Australia)
May 20th 16:30 - Change was reverted and errors stopped happening in preview (US, Europe, Australia)