[Major] Elevated errors on logins
Incident Report for Auth0
Postmortem

Overview

On January 15th between 16:35 UTC and 16:52 UTC, customers using our EU environment were unable to authenticate using our services.

We would like to apologize for the impact this had on you and your customers and explain what caused the problem, how we reacted to the incident, and what we are doing to prevent incidents like this from happening in the future.

What happened?

The kernel’s OOM killer killed our primary database process
In early December, we began seeing our MongoDB database clusters initiate elections at unexpected times because nodes in the cluster lost visibility of the primary node. Our initial research pointed to a networking issue. To diagnose it, on December 20th we deployed a script that ran every time it detected a network disconnection and logged network diagnostics, using a variety of tools, for the duration of the disconnection.
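For illustration, the collector followed the pattern sketched below (a simplified TypeScript sketch with a hypothetical peer address and tool choices, not our actual script): a periodic reachability probe that, on each failure, spawns a separate diagnostics process.

```typescript
// Simplified sketch of a disconnection-triggered diagnostics collector
// (hypothetical peer address and tool choices; not our actual script).
import { spawn } from "child_process";

const PEER = "10.0.0.10";          // hypothetical address of the peer we watch
const CHECK_INTERVAL_MS = 5000;

function collectDiagnostics(): void {
  // One collector per detected disconnection; it keeps logging until the
  // peer is reachable again. Nothing here limits how many can run at once.
  const collector = spawn("bash", ["-c", `
    while ! ping -c 1 -W 1 ${PEER} > /dev/null; do
      date; traceroute -n ${PEER}; ss -s
      sleep 1
    done
  `]);
  collector.stdout.pipe(process.stdout);
}

setInterval(() => {
  // Cheap reachability probe; any failure starts another collector.
  const probe = spawn("ping", ["-c", "1", "-W", "1", PEER]);
  probe.on("exit", (code) => {
    if (code !== 0) collectDiagnostics();
  });
}, CHECK_INTERVAL_MS);
```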

There is a known issue in our VPN between the primary and failover regions in Europe that causes short disconnections between them. We don’t route customer traffic through this VPN, and the issue doesn’t affect our regular operations. However, because of it, our network information collection script detected a large number of network disconnections on January 15th between 16:20 and 16:28 UTC, and spawned multiple diagnostic processes during that period.

All of these processes running at the same time consumed 40% of the available memory on our database instance. Combined with the 50% of memory assigned to the database process and the regular OS processes, this left no memory available. At this point, the Linux kernel invoked oom-killer. When oom-killer is invoked, it evaluates which process is the best candidate to kill in order to release memory for the system. The MongoDB process was killed, as it was using 50% of the available memory. When a primary database process goes down, our architecture immediately promotes a secondary node to primary. Under normal conditions this allows us to handle issues with our primary database instance with only a few seconds of downtime, or no downtime at all.
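For reference, the kernel exposes the score it uses to pick a victim as /proc/<pid>/oom_score; the process with the highest score is killed first, and a large resident process such as mongod normally tops that list. The snippet below is a minimal, illustrative way to inspect those scores on a Linux host.

```typescript
// List processes ordered by the kernel's OOM score (highest score = first to
// be killed). Illustrative only; requires a Linux host and Node.js.
import { readdirSync, readFileSync } from "fs";

const candidates = readdirSync("/proc")
  .filter((entry) => /^\d+$/.test(entry))
  .flatMap((pid) => {
    try {
      const score = parseInt(readFileSync(`/proc/${pid}/oom_score`, "utf8"), 10);
      const name = readFileSync(`/proc/${pid}/comm`, "utf8").trim();
      return [{ pid: Number(pid), name, score }];
    } catch {
      return []; // the process exited between listing and reading
    }
  })
  .sort((a, b) => b.score - a.score);

// A large resident process such as mongod typically tops this list, which is
// why it was the one chosen when the instance ran out of memory.
console.table(candidates.slice(0, 10));
```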

The Auth0 service stopped working when the database primary was migrated
On January 10th, we rolled out a change to the library that manages MongoDB connections in all our services. The change aimed to reduce downtime when a disconnection from the database occurs. Before the change, if an application was unable to connect to the database, the library would crash the application process, causing the service manager to respawn it. With the change, a SIGTERM is emitted instead, allowing the service process to handle its own exit. Due to a bug in the implementation, the process terminated with exit status 0, indicating a clean exit, and when a process exits with status 0 the service manager does not respawn it.
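A minimal sketch of the handler (illustrative only, not our actual library code) shows where the bug lived:

```typescript
// Minimal sketch of the disconnect handling change (illustrative only; not
// the actual library code).
import { EventEmitter } from "events";

const dbEvents = new EventEmitter(); // stands in for the MongoDB connection's events

// Old behavior: crash the process so the service manager respawns it.
// dbEvents.on("disconnected", () => { throw new Error("lost database connection"); });

// New behavior: ask the process to shut itself down gracefully.
dbEvents.on("disconnected", () => {
  process.kill(process.pid, "SIGTERM");
});

process.on("SIGTERM", () => {
  // Buggy version: exit status 0 looks like a clean, intentional exit, so the
  // service manager does not respawn the process.
  process.exit(0);

  // Intended behavior: a non-zero exit status marks the exit as a failure and
  // the service manager brings the process back up.
  // process.exit(1);
});
```

Exiting non-zero (or crashing, as before the change) is what tells the service manager a respawn is needed; exit status 0 made the shutdown look intentional.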

When the election caused by the OOM condition took place, the affected processes exited with status 0 and were not respawned by the service manager. We noticed this and manually restarted all processes at 16:51 UTC.

The wrong certificate was served during the outage
All traffic for the *.eu.auth0.com domain goes through an AWS ALB. When the ALB detects that an instance is not healthy or available, it takes it out of rotation. It does this by periodically requesting a health check endpoint on each instance.
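As a rough illustration (our actual services differ), a target instance answers the ALB through a small HTTP health route like the sketch below; when the service process is not running at all, the check simply never gets a response and the instance is marked unhealthy.

```typescript
// Illustrative health check endpoint for an ALB target (not our actual service).
import { createServer } from "http";

// Hypothetical dependency check, e.g. a cheap query against the database.
async function dependenciesHealthy(): Promise<boolean> {
  return true; // placeholder; a real check would verify the MongoDB connection
}

createServer(async (req, res) => {
  if (req.url === "/health") {
    const ok = await dependenciesHealthy();
    res.writeHead(ok ? 200 : 503);
    res.end(ok ? "ok" : "unhealthy");
    return;
  }
  res.writeHead(404);
  res.end();
}).listen(3000);

// The ALB requests the health check path on every registered instance; a
// non-2xx response, or no response at all from a dead process, takes that
// instance out of rotation.
```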

When the MongoDB process on the primary database instance was killed, there were no authentication service processes left running. As a result, all health checks failed, all instances were removed from the ALB, and the ALB entered the "Unhealthy" state.

Due to an error in our Route53 configuration, a DNS policy existed that automatically ALIASed *.eu.auth0.com to the auth0.com domain whenever the ALB mapped to *.eu.auth0.com was in the "Unhealthy" state. As a result, requests going to *.eu.auth0.com were served using the auth0.com (our website's) certificate, which resulted in security error messages in clients.

As soon as functionality was restored in the EU environment, the policy stopped being enforced, and traffic was sent to the proper ALB and served with the correct certificate.
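For context, this kind of failover behavior in Route53 looks roughly like the sketch below (AWS SDK for JavaScript v3; zone IDs and the ALB DNS name are placeholders, and this is not our actual configuration): a primary ALIAS record pointing at the ALB with target health evaluation enabled, plus a secondary ALIAS record that DNS falls back to when the ALB is reported unhealthy.

```typescript
// Sketch of the kind of Route53 failover policy that produces this behavior.
// Zone IDs and the ALB DNS name are placeholders, not our configuration.
import {
  Route53Client,
  ChangeResourceRecordSetsCommand,
} from "@aws-sdk/client-route-53";

async function main(): Promise<void> {
  const client = new Route53Client({ region: "us-east-1" });

  await client.send(new ChangeResourceRecordSetsCommand({
    HostedZoneId: "Z_EXAMPLE_ZONE",
    ChangeBatch: {
      Changes: [
        {
          // Primary: ALIAS to the EU ALB, with its health evaluated.
          Action: "UPSERT",
          ResourceRecordSet: {
            Name: "*.eu.auth0.com.",
            Type: "A",
            SetIdentifier: "eu-primary",
            Failover: "PRIMARY",
            AliasTarget: {
              HostedZoneId: "Z_ALB_ZONE", // the ALB's hosted zone ID (placeholder)
              DNSName: "eu-alb-123.eu-west-1.elb.amazonaws.com.",
              EvaluateTargetHealth: true,
            },
          },
        },
        {
          // Secondary: when the ALB is unhealthy, DNS answers with auth0.com
          // instead, so clients are served that domain's certificate.
          Action: "UPSERT",
          ResourceRecordSet: {
            Name: "*.eu.auth0.com.",
            Type: "A",
            SetIdentifier: "eu-secondary",
            Failover: "SECONDARY",
            AliasTarget: {
              HostedZoneId: "Z_EXAMPLE_ZONE",
              DNSName: "auth0.com.",
              EvaluateTargetHealth: false,
            },
          },
        },
      ],
    },
  }));
}

main().catch(console.error);
```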

Timeline

  • 16:20 UTC: Network information collector spawns multiple processes
  • 16:35 UTC: Database process on the primary instance is killed
  • 16:39 UTC: We find nodes failing to connect to the new primary instance
  • 16:44 UTC: We restore the previous primary database node
  • 16:51 UTC: We find the stopped processes and restart them on all nodes
  • 16:52 UTC: We stop receiving errors from our monitoring systems and all our manual tests succeed

What Are We Doing About It?

  • [Done] After evaluating alternatives to manage database disconnections gracefully, we rolled back to our previous flow.
  • [Done] Removed the incorrect DNS policy so that it does not cause requests to be performed against "auth0.com" when the ALB serving traffic for *.eu.auth0.com is in an "Unhealthy" state.
  • [Active] We found the underlying cause of the network disconnections: a bug in the version of the ixgbevf driver we are using. We are deploying a new version of this driver to all affected instances (bug report here).
  • [Pending] We are improving the controls we apply when deploying custom scripts, with a specific focus on monitoring them.
  • [Pending] We will roll out to the EU regions the automation we have recently developed, replacing our VPN nodes with AWS VPN services.

Summary

We are sorry about this issue. We have learned lessons from it and identified action items that we will work on to help us prevent similar situations in the future.

Thank you for your understanding and your continued support of Auth0.

Posted Feb 07, 2018 - 20:47 UTC

Resolved
Resolving as we no longer see any errors or alerts
Posted Jan 15, 2018 - 18:13 UTC
Investigating
We are receiving reports of some errors continuing to happen on logins. We are looking into this
Posted Jan 15, 2018 - 17:52 UTC
This incident affected: Auth0 Europe (PROD) (User Authentication, [DEPRECATED] Custom DB Connections & Rules) and Auth0 Europe (PREVIEW) (User Authentication, [DEPRECATED] Custom DB Connections & Rules).