[Feature Disruption] User Indexing Delays impacting Search API
Incident Report for Auth0
Postmortem

Overview

Between February 19th at 09:00 UTC and February 22nd at 22:23 UTC, requests to the /api/v2/logs endpoint in our Auth0 Europe environment and the tenant logs view in the management dashboard returned outdated results. No logs were lost during the incident.
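
For reference, this is roughly how a tenant reads activity logs through the Management API; the domain and token below are placeholders, and the query parameters are standard paging options. During the incident these requests succeeded but returned stale entries until indexing caught up.

    import requests

    # Placeholders: replace with a real tenant domain and Management API token.
    DOMAIN = "your-tenant.eu.auth0.com"
    TOKEN = "MGMT_API_TOKEN"

    # During the incident these requests succeeded, but the entries returned
    # lagged behind real tenant activity because log indexing was delayed.
    resp = requests.get(
        f"https://{DOMAIN}/api/v2/logs",
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={"per_page": 10, "sort": "date:-1"},  # newest entries first
    )
    resp.raise_for_status()
    for entry in resp.json():
        print(entry.get("date"), entry.get("type"))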

What Happened

For historical reasons, our EU environment shared a single cluster for both user and log search. On February 19th at 09:00 UTC, indexing for user search began to be delayed due to mapping issues, which degraded performance across the whole cluster. No alerts were triggered at this point.

Our User and Logs search features use an Elasticsearch cluster for storage. User updates and tenant activity logs are written asynchronously to Elasticsearch. This is done for performance reasons and to ensure that our primary database component (which we use to process authentication transactions) is isolated from these operations. User and Logs search requests read from that Elasticsearch cluster. We provide a lot of flexibility in the data that can be stored and queried: you can store nested objects of any type without an upfront schema definition, and you can query with any sort of pattern. In the long term, this strategy started to show reliability issues.
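
As a rough illustration of that asynchronous write path (not Auth0's actual code; the index naming, worker shape, and cluster address are assumptions), a worker might index a schemaless log document like this:

    import json
    import requests

    # Assumed internal cluster address; names here are illustrative, not
    # Auth0's actual components.
    ES_URL = "http://elasticsearch.internal:9200"

    def index_log_event(tenant: str, event: dict) -> None:
        """Write one tenant activity log document to Elasticsearch.

        Documents are effectively schemaless from the producer's point of
        view: nested objects of any shape are accepted and mapped dynamically
        by the cluster, which is the flexibility described above.
        """
        resp = requests.put(
            f"{ES_URL}/logs-{tenant}/_doc/{event['log_id']}",  # index name is assumed
            data=json.dumps(event),
            headers={"Content-Type": "application/json"},
            timeout=5,
        )
        resp.raise_for_status()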

When a write operation is performed, the Elasticsearch primary node needs to propagate the cluster state to all nodes within the cluster before it can acknowledge the write. During the incident, some write operations failed to propagate to all nodes, resulting in timeouts. When these timeouts occurred, the workers that move logs from the data stream to Elasticsearch failed and retried the operation. This timeout/fail/retry loop increased the time needed to successfully write a log into Elasticsearch, and some tenants were affected more than others due to timing. Because of the way the cluster is architected, this is not something that can be solved by simply adding more capacity.
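
A minimal sketch of that timeout/fail/retry loop, assuming a generic worker with exponential backoff (the retry policy and names are illustrative, not the actual implementation):

    import time
    import requests

    def index_with_retries(write_fn, event, max_attempts=5, base_delay=1.0):
        """Retry an Elasticsearch write on timeout, with exponential backoff.

        Each failed cluster-state propagation surfaces as a timeout here, and
        every retry adds to the time before a log becomes searchable, which
        is the delay tenants observed.
        """
        for attempt in range(1, max_attempts + 1):
            try:
                return write_fn(event)
            except requests.exceptions.Timeout:
                if attempt == max_attempts:
                    raise  # give up; the worker re-queues the event
                time.sleep(base_delay * 2 ** (attempt - 1))  # back off before retrying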

On February 20th at 06:04 UTC, the incident response team started troubleshooting the incident and, at 19:20 UTC, tried to improve the situation by removing unused aliases (something that was already planned). That operation failed for the same reason that indexing was failing in the first place. After trying and considering other solutions to stabilize the cluster, on February 21st at 06:24 UTC the team decided to split audit logs into a separate cluster. The operation was completed on February 22nd at 22:23 UTC, and logs started being processed as usual. This setup, with separate clusters, had already been rolled out to all other regions and was planned for this one. By using different clusters we achieve separation of concerns and improve monitoring, capacity planning, and the reliability of the user and log search features, as they become independent of each other.
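
Conceptually, the split amounts to routing writes by document kind to one of two independent clusters; the addresses and helper below are purely illustrative, not the production configuration:

    # Sketch of the post-incident topology: two independent clusters, so
    # pressure on log indexing cannot degrade user search. Addresses are assumed.
    USER_SEARCH_ES = "http://user-search-es.internal:9200"
    TENANT_LOGS_ES = "http://tenant-logs-es.internal:9200"

    def cluster_for(document_kind: str) -> str:
        """Route a write to the cluster that owns this kind of document."""
        return TENANT_LOGS_ES if document_kind == "tenant_log" else USER_SEARCH_ES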

Timeline

2018-02-19 09:00 UTC: Delays started appearing on a few requests.
2018-02-20 06:04 UTC: The team started to investigate the issue.
2018-02-20 19:18 UTC: Timeouts in Elasticsearch when trying to create new aliases were diagnosed as the root cause.
2018-02-20 19:20 UTC: We attempted to recover the Elasticsearch cluster by cleaning up unused aliases.
2018-02-21 06:24 UTC: A decision was made to split the logs into a separate cluster.
2018-02-21 20:56 UTC: The new tenant logs Elasticsearch cluster finished its deployment.
2018-02-22 15:47 UTC: All customer data was successfully migrated and we started processing delayed logs.
2018-02-22 22:23 UTC: All previously delayed logs were fully ingested.

What Are We Doing About It?

A new tenant logs cluster has been deployed in order to split tenant logs from user search data. For a long-term fix, we have been working for the past few months on a complete rewrite of the user search feature (what we call search v3), which should have a high impact on the reliability of this feature and largely solve the capacity issues we had. This rewrite is already being tested, and we will share more details soon. We are also improving our monitoring around log processing so that we are alerted faster and more reliably.
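
As a hedged sketch of the kind of monitor being added, assuming indexing lag is measured as the gap between the newest ingested and newest indexed log (the threshold and alerting hook are hypothetical):

    # Illustrative threshold: alert when the newest indexed log is more than
    # five minutes behind the newest log accepted into the data stream.
    LAG_ALERT_THRESHOLD_SECONDS = 300

    def page_on_call(message: str) -> None:
        # Placeholder for the real alerting integration (assumed, not Auth0's).
        print(f"ALERT: {message}")

    def check_log_indexing_lag(latest_ingested_ts: float, latest_indexed_ts: float) -> None:
        """Compare ingestion and indexing high-water marks and alert on a large gap."""
        lag = latest_ingested_ts - latest_indexed_ts
        if lag > LAG_ALERT_THRESHOLD_SECONDS:
            page_on_call(f"Log indexing is {lag:.0f}s behind ingestion")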

Summary

We are sorry about the issues this incident caused. We have used this opportunity to learn and to implement improvements that aim both to prevent similar situations and to help us react better and faster to them if they happen again.

Thank you for your understanding and your continued support of Auth0.

Posted Apr 06, 2018 - 13:54 UTC

Resolved
This incident has been resolved.
Posted Feb 20, 2018 - 21:53 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Feb 20, 2018 - 21:20 UTC
Investigating
We are currently investigating this issue.
Posted Feb 20, 2018 - 20:04 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Feb 20, 2018 - 18:35 UTC
Investigating
Our user index processing pipeline is running behind. Queries to the Search API will return stale results until the system is caught up. We will post an update with an ETA as soon as possible.
Posted Feb 20, 2018 - 15:18 UTC
This incident affected: Auth0 US (PREVIEW) (Management API) and Auth0 US (PROD) (Management API).