[Feature Disruption] Tenant Logs API in EU
Incident Report for Auth0
Postmortem

Overview

On 14 September 2018, between 06:09 UTC and 10:56 UTC, all requests to the /api/v2/logs endpoint in the EU region returned errors with status code 500. These errors accounted for 2.43% of all requests to the Management API during this time frame. The issue prevented customers in the EU region from accessing their logs.
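
To make the impact concrete, the snippet below shows a minimal request to the affected endpoint. This is only a sketch: the tenant domain and Management API token are placeholders, not values from the incident.

    import requests

    # Placeholder values; substitute a real EU tenant domain and a
    # Management API token with the read:logs scope.
    TENANT_DOMAIN = "your-tenant.eu.auth0.com"
    MGMT_API_TOKEN = "<management-api-token>"

    # Fetch the most recent tenant log entries from the Management API.
    response = requests.get(
        f"https://{TENANT_DOMAIN}/api/v2/logs",
        headers={"Authorization": f"Bearer {MGMT_API_TOKEN}"},
        params={"per_page": 10},
    )

    # During the incident window this call returned HTTP 500 rather than 200.
    print(response.status_code)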

We would like to apologize for the impact this had on you and your customers and explain what caused the problem, how we reacted to the incident, and what we are doing to prevent incidents like this from happening in the future.

What Happened

Customer support tickets and PagerDuty alerts

The first alert for issues with the Elasticsearch cluster that holds customers’ tenant logs was sent via PagerDuty at 04:48 UTC on 14 September 2018. The alert was configured as low urgency and should have reached the primary on-call member of our team, but it did not reach that person due to a delay in receiving notifications from PagerDuty (related to an incident in the PagerDuty platform itself: https://status.pagerduty.com/incidents/nw33wnq4748w). Because the alert was low urgency, it was not automatically escalated to the secondary on-call.

At 09:21 UTC, the Engineering team responsible for tenant logs was notified that several support tickets had been raised by customers unable to access their logs (via the Management API /api/v2/logs, log extensions, or the Dashboard).

Investigation into Elasticsearch nodes failing

An incident response team was mobilised at 09:30 UTC and convened on a Zoom call to investigate the issue. Within 5 minutes of mobilising, the team confirmed that the issue was caused by Elasticsearch nodes failing in the EU region. However, the team members who were present did not have the level of access to our systems needed to identify the root cause or resolve it, so they escalated to our Infrastructure team.

The team responding to the incident had trouble reaching the Infrastructure team to help debug and resolve the issue. This was caused by a delay in updating PagerDuty on-call rotas, which meant that no calls or push notifications reached the Infrastructure team’s phones for over 1.75 hours after the incident team was mobilised.

Delayed response to alerts and reduced support in EU time zones

Following the Security team’s best practices, we restrict administrative access to our systems. Unfortunately, a failure in our alerting processes meant that we were unable to contact any member of our team with administrative access to our core systems until 6 hours after the first alert was sent and 2.75 hours after the first customer support ticket regarding the incident was received.

Enabling additional nodes and resolution

When members of our Infrastructure and SRE teams were able to join the incident Zoom, the root cause was quickly identified: 3 of the 6 Elasticsearch nodes had failed due to out-of-memory errors. Due to issues with our automation, the remaining 3 healthy nodes had never been added to the Elastic Load Balancer (ELB) handling requests to Elasticsearch for customers’ logs, so the load balancer had no healthy nodes to route traffic to.

For this incident, the issue was resolved by manually registering the 3 healthy nodes with the load balancer.
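
For illustration, this kind of manual fix can be applied to a Classic ELB with a couple of API calls. The sketch below uses boto3; the load balancer name, region, and instance IDs are placeholders rather than the values involved in the incident.

    import boto3

    # Placeholder names and IDs for illustration only.
    ELB_NAME = "eu-tenant-logs-elasticsearch"
    HEALTHY_INSTANCE_IDS = ["i-0aaa1111", "i-0bbb2222", "i-0ccc3333"]

    # Region is also a placeholder.
    elb = boto3.client("elb", region_name="eu-west-1")

    # Register the healthy Elasticsearch nodes with the Classic ELB so it
    # can start routing log queries to them again.
    elb.register_instances_with_load_balancer(
        LoadBalancerName=ELB_NAME,
        Instances=[{"InstanceId": i} for i in HEALTHY_INSTANCE_IDS],
    )

    # Confirm the newly registered nodes pass the ELB health check.
    health = elb.describe_instance_health(LoadBalancerName=ELB_NAME)
    for state in health["InstanceStates"]:
        print(state["InstanceId"], state["State"])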

Timeline

  • 04:48 UTC: a monitor for Elasticsearch sent a low-urgency alert via PagerDuty
  • 05:23 UTC: the monitor for Elasticsearch sent another low-urgency alert via PagerDuty
  • 06:11 UTC: the monitor alerted again via PagerDuty
  • 08:03 UTC: the first support ticket reporting that logs were failing in the EU region was received
  • 09:00 UTC: a Technical Support Engineer (TSE) responded to the customer
  • 09:04 UTC: the support ticket was escalated to our Engineering team
  • 09:21 UTC: after initial triage, the incident was raised with the team responsible for tenant logs
  • 09:30 UTC: an incident team was mobilised and joined a Zoom call to investigate the issue
  • 09:35 UTC: the incident team confirmed that the issue was caused by Elasticsearch nodes failing and going out of service
  • 09:45 UTC: the incident team made the first attempt to contact a member of our Infrastructure team to assist in resolving the issue
  • 10:06 UTC: Auth0 status page updated
  • 10:48 UTC: members of our Infrastructure and SRE teams joined the incident Zoom and started investigating the issue
  • 10:59 UTC: our SRE team resolved the issue by adding 3 nodes that already existed in the Elasticsearch cluster to the ELB
  • 11:10 UTC: Auth0 status page updated to mark the issues as resolved

What Are We Doing About It?

  • [Done] Upgrade Elasticsearch connection failure alerts to high priority
  • [Done] Improve monitoring to ensure that all engineering teams responsible for this feature are notified when error rates spike
  • Improve monitoring to ensure that Elasticsearch issues are detected sooner
  • Reach out to PagerDuty to understand the impact of any delays to our notifications
  • Ensure that our automation adds all Elasticsearch nodes to the ELB (see the sketch below this list)
  • Update status page documentation to include more detailed guidance and examples
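
As a sketch of the automation and monitoring items above, the check below reconciles the instances registered with the Classic ELB against an expected set of Elasticsearch node instance IDs, registers any that are missing, and flags nodes failing the ELB health check. It is illustrative only; the names and IDs are placeholders, not our production tooling.

    import boto3

    # Placeholders for illustration; not the names used in production.
    ELB_NAME = "eu-tenant-logs-elasticsearch"
    EXPECTED_NODE_IDS = {
        "i-0aaa1111", "i-0bbb2222", "i-0ccc3333",
        "i-0ddd4444", "i-0eee5555", "i-0fff6666",
    }

    elb = boto3.client("elb", region_name="eu-west-1")

    # Find out which instances are currently registered with the ELB.
    lb = elb.describe_load_balancers(LoadBalancerNames=[ELB_NAME])
    registered = {
        inst["InstanceId"]
        for inst in lb["LoadBalancerDescriptions"][0]["Instances"]
    }

    missing = EXPECTED_NODE_IDS - registered
    if missing:
        # Register any Elasticsearch nodes the automation missed ...
        elb.register_instances_with_load_balancer(
            LoadBalancerName=ELB_NAME,
            Instances=[{"InstanceId": i} for i in sorted(missing)],
        )
        # ... and surface it so the gap in the automation gets fixed.
        print(f"Registered missing Elasticsearch nodes: {sorted(missing)}")

    # Alert if any registered node is failing the ELB health check.
    health = elb.describe_instance_health(LoadBalancerName=ELB_NAME)
    unhealthy = [
        s["InstanceId"] for s in health["InstanceStates"]
        if s["State"] != "InService"
    ]
    if unhealthy:
        print(f"Elasticsearch nodes out of service: {unhealthy}")
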
Posted Oct 01, 2018 - 16:40 UTC

Resolved
The issue was caused by a number of Elasticsearch nodes failing in EU. This issue has now been resolved.
Posted Sep 14, 2018 - 11:10 UTC
Investigating
All requests to the /api/v2/logs endpoint in the EU region are failing with 500 errors.
Posted Sep 14, 2018 - 10:06 UTC
This incident affected: Auth0 Europe (PROD) (Management API) and Auth0 Europe (PREVIEW) (Management API).