On 14 September 2018, between 06:09 UTC and 10:56 UTC, all requests to the /api/v2/logs endpoint in the EU region returned errors with status code 500. These accounted for 2.43% of all requests to the Management API during this time frame. This issue prevented customers in the EU region from accessing their logs.
We would like to apologize for the impact this had on you and your customers and explain what caused the problem, how we reacted to the incident, and what we are doing to prevent incidents like this from happening in the future.
Customer support tickets and PagerDuty alerts
The first alert about the Elasticsearch cluster that holds customers’ tenant logs was sent via PagerDuty at 04:48 UTC on 14 September 2018. The alert should have reached the primary on-call member of our team, but it was not delivered due to a notification delay in the PagerDuty platform itself (https://status.pagerduty.com/incidents/nw33wnq4748w). Because the alert was configured as low urgency, it was not automatically escalated to the secondary on-call.
At 09:21 UTC the Engineering team responsible for tenant logs was notified that several support tickets had been raised by customers unable to access their logs (via the Management API /api/v2/logs, log extensions or the Dashboard).
Investigation into Elasticsearch nodes failing
An incident response team was mobilised at 09:30 UTC and convened on a Zoom call to investigate the issue. Within 5 minutes of mobilising, the team confirmed that the issue was caused by an Elasticsearch node failing in the EU region. However, the team members who were present did not have the level of access to our systems needed to identify the root cause or resolve it, so they escalated to our Infrastructure team.
The team responding to the incident had trouble reaching the Infrastructure team to help debug and resolve the issue. This was caused by an out-of-date PagerDuty on-call rota: no calls or push notifications reached the Infrastructure team’s phones for over 1.75 hours after the incident team was mobilised.
Delayed response to alerts and reduced support in EU time zones
Following the Security team’s best practices, we restrict administrative access to our systems. Unfortunately, a failure in our alerting processes meant that we were unable to contact any member of our team with administrative access to our core systems until 6 hours after the first alert was sent and 2.75 hours after the first customer support ticket regarding the incident was received.
Enabling additional nodes and resolution
When members of our Infrastructure and SRE teams were able to join the incident Zoom, the root cause was quickly identified: 3 of the 6 Elasticsearch nodes had failed with out-of-memory errors, and due to an automation issue the 3 remaining healthy nodes had never been added to the Elastic Load Balancer (ELB) handling requests to Elasticsearch for customers’ logs. As a result, the load balancer had no healthy nodes left to serve traffic.
The incident was resolved by manually registering the healthy nodes with the load balancer.
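The selection step of that manual fix can be sketched as follows. This is a minimal illustration, not Auth0’s actual tooling: the node data, instance IDs, and heap threshold are all hypothetical, and the function simply filters out nodes that are down or near their memory limit before they would be registered with the load balancer.

```python
# Hypothetical sketch of picking healthy Elasticsearch nodes to register
# with the load balancer. All IDs and thresholds are illustrative.

HEAP_LIMIT_PERCENT = 85  # assumed threshold, not a real production value


def select_healthy_nodes(nodes):
    """Return instance IDs of nodes that are up and below the heap limit.

    `nodes` is a list of dicts like:
        {"id": "i-02", "up": True, "heap_used_percent": 60}
    """
    return [
        n["id"]
        for n in nodes
        if n["up"] and n["heap_used_percent"] < HEAP_LIMIT_PERCENT
    ]


# Example mirroring the incident: 3 of 6 nodes have failed with
# out-of-memory errors, so only the other 3 should be registered.
cluster = [
    {"id": "i-01", "up": False, "heap_used_percent": 99},
    {"id": "i-02", "up": True,  "heap_used_percent": 60},
    {"id": "i-03", "up": False, "heap_used_percent": 98},
    {"id": "i-04", "up": True,  "heap_used_percent": 55},
    {"id": "i-05", "up": False, "heap_used_percent": 97},
    {"id": "i-06", "up": True,  "heap_used_percent": 64},
]

healthy = select_healthy_nodes(cluster)
# healthy == ["i-02", "i-04", "i-06"]
```

The resulting instances would then be registered manually, for example with the AWS CLI’s `aws elb register-instances-with-load-balancer` command for a Classic ELB.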