[Minor] Elevated errors on logins using the Github strategy
Incident Report for Auth0
Postmortem

Overview

On 6 September 2018 between 07:01 UTC and 11:16 UTC, approximately 80% of authentication requests using the Github strategy failed. The errors were caused by timeouts when attempting to connect to Github’s API.

We would like to apologize for the impact this had on you and your customers and explain what caused the problem, how we reacted to the incident, and what we are doing to prevent incidents like this from happening in the future.

What Happened

A customer raised a support ticket via Zendesk stating that users could not log in with Github

A customer raised a support ticket via Zendesk to inform us that their users were unable to log in with Github. Our Technical Support team confirmed with the customer that the issue affected approximately 95% of their users attempting to log in with Github. Technical Support investigated further by examining logs and found an increasing number of errors, across multiple customers in the US region, for requests to Auth0’s login/callback endpoint (called during the authentication process). Technical Support then escalated the issue to our Engineering team.

Our Engineering team investigated and contacted Github

Initial investigation indicated that the failure was with the Github API. However, because Github’s status page reported all systems as normal, our Engineering team looked into other possible causes to rule out issues with Auth0’s services.

With no evidence of issues with Auth0’s services, we contacted Github support with details of the failed requests to Github’s API, including the IP addresses that Auth0 attempted to connect to.

Our investigation highlighted the need to improve our monitoring systems so that we can respond faster to authentication strategy errors in the future.

Github’s response and confirmation of the cause of the incident

Github support responded that their data center had investigated an issue during the time window in which Auth0 customers could not log in with Github. This investigation identified some packet loss on the way out of AWS. Github stated that they have set up additional monitors to alert them to similar issues in the future.

Timeline

  • 07:01 UTC: The first authentication request to Github that timed out was logged by Auth0’s logging system
  • 09:05 UTC: A customer raised a support ticket via Zendesk informing us that their users were unable to log in with Github
  • 09:26 UTC: Our Technical Support team responded and confirmed with the customer that only users attempting to log in with the Github authentication strategy were affected and that, of these, approximately 95% of requests were failing due to timeouts
  • 09:30 UTC: Technical Support escalated to Engineering
  • 09:30 UTC: Engineering opened an investigation into the issue
  • 10:38 UTC: Auth0’s status page was updated: https://status.auth0.com/incidents/b6jwvpxb300t
  • 11:16 UTC: Request timeouts stopped and normal service resumed
  • 11:36 UTC: Engineering concluded that the errors were not caused by problems with Auth0 services
  • 12:29 UTC: Auth0’s status page updated to confirm that timeouts were no longer being observed
  • 12:34 UTC: Auth0 contacted Github support with details of the failed requests
  • 14:03 UTC: Auth0’s status page updated to mark the issue as resolved
  • 15:30 UTC: Github support responded to confirm that there was some packet loss from the AWS environment hosting the IPs that Auth0 made requests to
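
The gaps between the timeline entries above can be worked out with a short calculation. The following is a minimal sketch in Python. The interval labels (for example "impact duration") are illustrative names chosen here, not terms used by Auth0, and the timestamps are copied from the timeline.

```python
from datetime import datetime, timedelta

# All timeline entries fall on the same day (UTC), per the report.
DAY = "2018-09-06"


def at(hhmm: str) -> datetime:
    """Parse an HH:MM timeline entry into a datetime on the incident day."""
    return datetime.strptime(f"{DAY} {hhmm}", "%Y-%m-%d %H:%M")


# Timestamps copied from the timeline.
first_timeout = at("07:01")
ticket_raised = at("09:05")
escalated = at("09:30")
status_page_posted = at("10:38")
timeouts_stopped = at("11:16")

# Interval labels are illustrative, not Auth0 terminology.
intervals = {
    "impact duration": timeouts_stopped - first_timeout,
    "first timeout to customer report": ticket_raised - first_timeout,
    "customer report to escalation": escalated - ticket_raised,
    "first timeout to status page update": status_page_posted - first_timeout,
}

for label, delta in intervals.items():
    print(f"{label}: {delta}")
```

The roughly two-hour gap between the first logged timeout and the customer report is the interval that alerting on authentication strategy errors is meant to shorten.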

What Are We Doing About It?

  • [Done] Add alerts for authentication strategy errors to our monitoring systems so that we can respond faster to similar incidents in the future.
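
As an illustration of what an alert on authentication strategy errors could look like, here is a minimal sketch of a sliding-window failure-rate check, written in Python. It is not Auth0's implementation: the class name `StrategyErrorMonitor`, the five-minute window, the 50% threshold, and the minimum request count are all assumptions chosen for the example.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class StrategyErrorMonitor:
    """Hypothetical monitor: tracks recent login outcomes for one strategy."""

    window_seconds: float = 300.0  # assumed: consider the last 5 minutes
    threshold: float = 0.5         # assumed: alert above a 50% failure rate
    min_requests: int = 20         # assumed: ignore windows with little traffic
    _events: deque = field(default_factory=deque)  # (timestamp, failed) pairs

    def record(self, timestamp: float, failed: bool) -> None:
        """Record one authentication attempt and drop expired entries."""
        self._events.append((timestamp, failed))
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        """Remove events that fall outside the sliding window."""
        while self._events and self._events[0][0] <= now - self.window_seconds:
            self._events.popleft()

    def failure_rate(self, now: float) -> float:
        """Fraction of requests in the window that failed (0.0 if empty)."""
        self._evict(now)
        if not self._events:
            return 0.0
        failures = sum(1 for _, failed in self._events if failed)
        return failures / len(self._events)

    def should_alert(self, now: float) -> bool:
        """True when there is enough traffic and the failure rate is too high."""
        self._evict(now)
        enough_traffic = len(self._events) >= self.min_requests
        return enough_traffic and self.failure_rate(now) > self.threshold


# Usage: one monitor per strategy, fed from the login/callback outcome.
github_monitor = StrategyErrorMonitor()
for second in range(100):
    # Simulate roughly 80% of requests timing out, as in this incident.
    github_monitor.record(float(second), failed=(second % 5 != 0))
print(github_monitor.should_alert(now=100.0))
```

Keeping one monitor per strategy means a failure confined to a single upstream provider is not diluted by healthy traffic on other strategies. The minimum request count is there to avoid alerting on a handful of failed logins for a rarely used strategy.
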
Posted Sep 27, 2018 - 16:05 UTC

Resolved
This incident has been resolved.
Posted Sep 06, 2018 - 14:03 UTC
Monitoring
We are no longer seeing errors from authentication transactions using the Github strategy. We believe the issue was caused by a change made by Github. We are confirming this with Github.
Posted Sep 06, 2018 - 12:29 UTC
Investigating
A number of authentication transactions using the Github strategy are failing to process correctly. The team is currently investigating the root cause. We will keep you updated.
Posted Sep 06, 2018 - 10:38 UTC
This incident affected: Auth0 US (PREVIEW) (User Authentication) and Auth0 US (PROD) (User Authentication).