[Major] Elevated errors on logins
Incident Report for Auth0
Postmortem

Overview

On November 1st 2018 between 08:20 UTC and 13:27 UTC, approximately 0.13% of all authentication requests to the ‘login/callback’ endpoint in the EU region failed (87.6% of the failed requests used the waad strategy and 12.4% used the adfs strategy) due to certificate rollover issues and customers were not informed because the scheduled job that sends certificate rollover notifications failed to run. A manual run of the certificate rollover job in the US region also led to authentication errors between 17:38 UTC and 18:24 UTC that accounted for 0.007% of calls to the ‘login/callback’ endpoint (99.3% of the failed requests used the SAMLP strategy, 0.5% adfs and 0.2% pingfederate).

We would like to apologize for the impact this had on you and your customers and explain what caused the problem, how we reacted to the incident, and what we are doing to prevent incidents like this from happening in the future.

What Happened

Customer support ticket reporting outage

At 08:20 UTC our Developer Success team escalated a customer support ticket to our Engineering team related to possible certificate rollover issues in the EU region. The customer had reported a production outage.  At 08:25 UTC, Developer Success reported that the customer had manually updated their connection settings and resolved the issue. Engineering acknowledged the issue and stated that there was an existing support ticket with our Infrastructure team to resolve previous issues with the certificate rollover script. The script attempts to updates certificates automatically and if it cannot, it notifies the customer that their certificate is near expiration and they should manually update their connection settings with the new certificate.

Previous certificate rollover issues

On October 12th 2018 a customer asked for an API that would allow querying for customers’ certificate expiry. We had also received sporadic support tickets about certificate expiry issues. These tickets led to Engineering investigating the current state of the certificate rollover job and identifying that the scheduled job to run the script had not been run for at least 6 months in any region, and monitoring was not sufficient to alert us to this failure.

An Infrastructure support ticket was created to resolve the immediate issue by running the job as a one-off. The work to reinstate the scheduled job and add monitoring was passed to our Engineering team.

More customer support tickets escalated

At 11:27 UTC another customer support ticket concerning certificate rollover and authentication errors was escalated to Engineering, followed by confirmation from Developer Success that multiple tenants were affected. Engineering undertook a further short investigation into the tenants affected and whether Azure issues could have caused it. Azure issues were ruled out and an incident team gathered to address the issue.

Manual run of the script

With the scheduled job failing, the decision was taken to manually run the certificate rollover script. Our Infrastructure team ran the script in the EU region at 12:39 UTC. The script ran the certificate rollover successfully.

At 13:14 UTC the script was run successfully in the AU region. At 13:22 UTC the script was run in the US region and % failed with XML parsing errors. Authentication errors were seen in the US region between 17:38 UTC and 18:24 UTC that accounted for 0.03% of calls to the ‘login/callback’ endpoint. The Engineering team is investigating if these were the customers affected by the XML parsing errors.

The incident team continued to monitor the issue by watching logs for similar errors until 01:15 UTC when the incident was declared closed.

Timeline

08:20 UTC: Our Developer Success team reported that they had received a production outage ticket regarding Azure Active Directory certificate rollover in the EU region

08:25 UTC: Developer Success that the customer who reported the issue had manually updated their connection settings and their outage had been resolved, and Developer Success recommended we continue to investigate the issue

08:59 UTC: Our Engineering team began investigation into the issue and why certificate rollover notifications had not been sent to customers in the EU region

11:27 UTC: Our Developer Support team reported another ticket from a different customer with the same issue

11:34 UTC: Developer Success confirmed that the issue was affecting multiple tenants in the EU region and Engineering continued to investigate and confirmed that the issue was not caused by an outage with Azure but instead an issue with the script that sends the certificate rollover notifications

12:07 UTC: Engineering formed an incident team to address the issues and included our Core Infrastructure team who had access to manually run the script to send the notifications

12:17 UTC: The Auth0 status page was updated with the incident

12:17 UTC: Core Infrastructure started preparing the script to run by confirming any arguments needed to run the script

12:39 UTC: The script was run in EU successfully sent the notifications and the incident team continued to monitor logs. The script did error when attempting to send metrics to our monitoring systems due to a bug in our instrumentation code but this did not affect the sending of the notifications

13:14 UTC: The script was run in AU

13:22 UTC: The script was run in US of which a small percentage had XML parsing errors and this is being investigated as a follow-up action

13:27 UTC: The incident status was updated to Monitoring on the Auth0 status page

01:15 UTC: The incident was marked as resolved on the Auth0 status page

What Are We Doing About It?

  • Extracting the certificate rollover job to improve its reliability and our ability to monitor and respond to incidents related to this job
  • Investigate and fix the bug causing the XML parsing errors shown when the script was run in the US region
Posted Nov 21, 2018 - 15:28 UTC

Resolved
This incident has been resolved.
Posted Nov 02, 2018 - 01:15 UTC
Update
A fix has been implemented and we are monitoring the results.
Posted Nov 01, 2018 - 13:29 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Nov 01, 2018 - 13:28 UTC
Investigating
A small percentage of authentication transactions is failing to process correctly due to a certificate rollover issue. The team is currently investigating. We'll keep you updated.
Posted Nov 01, 2018 - 12:16 UTC
This incident affected: Auth0 Europe (PROD) (User Authentication).