On January 15th between 16:35 UTC and 16:52 UTC, customers using our EU environment were unable to authenticate using our services.
We would like to apologize for the impact this had on you and your customers and explain what caused the problem, how we reacted to the incident, and what we are doing to prevent incidents like this from happening in the future.
The Linux kernel’s OOM killer killed our primary database process
In early December, we began seeing our MongoDB database clusters initiate elections at unexpected times because nodes in the cluster lost visibility of the primary node. Our initial research pointed to a networking issue. To diagnose it, on December 20th we deployed a script that detected network disconnections and, for the duration of each disconnection, logged network diagnostics using a variety of tools.
There is a known issue in our VPN between the primary and failover regions in Europe that causes brief disconnections between them. We don’t route customer traffic through this VPN, and the issue doesn’t affect our regular operations. However, because of it, our network information collection script detected a large number of network disconnections on January 15th between 16:20 and 16:28, and spawned multiple processes during that period.
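The failure mode of the collection script can be illustrated with a short sketch. This is hypothetical code, not our actual script (the class and method names here are invented for illustration): a collector that spawns one long-running diagnostics worker per disconnection event, with no concurrency cap, piles up workers when disconnections arrive in a burst.

```python
# Hypothetical sketch of the flaw in a diagnostics collector: one
# long-running worker is spawned per disconnection event, with no check
# for workers that are already running.
class DiagnosticsCollector:
    def __init__(self, spawn_worker):
        self.spawn_worker = spawn_worker  # e.g. launches tcpdump/traceroute
        self.workers = []                 # no concurrency limit anywhere

    def on_disconnect(self):
        # A burst of disconnections therefore spawns a burst of workers,
        # each holding memory until its diagnostics run completes.
        self.workers.append(self.spawn_worker())


collector = DiagnosticsCollector(spawn_worker=lambda: object())
for _ in range(120):  # VPN flapping: many disconnect events in minutes
    collector.on_disconnect()
print(len(collector.workers))  # 120 concurrent workers
```

A bounded version would skip spawning while a worker is already running, or cap concurrency with a semaphore.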
All of these processes running at the same time consumed 40% of the available memory on our database instance. Combined with the 50% of memory assigned to the database process and the regular OS processes, this left no memory available. At that point, the Linux kernel invoked the OOM killer (oom-killer). When invoked, the OOM killer evaluates which process is the best candidate to kill in order to release memory for the system. The MongoDB process was killed, as it was taking up 50% of the available memory. When a primary database process goes down, our architecture immediately promotes a secondary node to primary. Under normal conditions, this allows us to handle issues with our primary database instance with only a few seconds of downtime, or none at all.
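To first order, the kernel’s victim selection favors the process with the largest memory footprint, adjusted by each process’s oom_score_adj. The following is a deliberately simplified model, not the kernel’s actual oom_badness() logic, and the process list is illustrative:

```python
def pick_oom_victim(processes):
    """Simplified model of Linux OOM victim selection: score each process
    by resident memory, shifted by oom_score_adj, and pick the highest.
    This is an illustration, not the kernel's exact oom_badness() code."""
    def score(p):
        # Resident memory dominates; oom_score_adj shifts the score by
        # (adj / 1000) of the system's total pages.
        return p["rss_pages"] + p["oom_score_adj"] * p["total_pages"] // 1000
    return max(processes, key=score)


total = 1_000_000  # total pages in the system (illustrative)
procs = [
    {"name": "mongod",      "rss_pages": 500_000, "oom_score_adj": 0, "total_pages": total},
    {"name": "diagnostics", "rss_pages": 4_000,   "oom_score_adj": 0, "total_pages": total},
    {"name": "sshd",        "rss_pages": 1_000,   "oom_score_adj": 0, "total_pages": total},
]
print(pick_oom_victim(procs)["name"])  # mongod: the single largest consumer
```

This is why the database process, the single largest memory consumer on the instance, was the one killed even though the diagnostics workers caused the pressure. Setting a negative oom_score_adj on critical processes is one way to bias the kernel away from them.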
The Auth0 service stopped working when the database primary was migrated
On January 10th, we rolled out a change to the library that manages MongoDB connections in all our services. This change aimed to reduce downtime when a disconnection from the database occurs.
Before the change, if an application was unable to connect to the database, the library would crash the application process, causing the service manager to respawn it. With this change, a SIGTERM is emitted, allowing the service process to handle its own exit. Due to a bug in the implementation, the process terminated with status 0, indicating a clean exit. When a process exits with status 0, the service manager does not respawn it.
When the election caused by the OOM condition took place, processes exited with status 0 and were not respawned by the service manager. We noticed that this was the case and manually restarted all processes at 16:51 UTC.
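The bug can be reproduced with a minimal sketch (Python here purely for illustration; the actual connection library is not shown in this post): a SIGTERM handler that exits without an explicit failure code reports status 0, which a service manager configured to respawn only on failure treats as a deliberate shutdown.

```python
import subprocess
import sys
import textwrap

# Child installs a SIGTERM handler that exits "cleanly" -- the bug is that
# no exit code is passed, so sys.exit() defaults to status 0.
buggy_child = textwrap.dedent("""
    import os, signal, sys

    def on_term(signum, frame):
        # BUG: should be sys.exit(1) so the service manager respawns us
        sys.exit()

    signal.signal(signal.SIGTERM, on_term)
    os.kill(os.getpid(), signal.SIGTERM)  # simulate the shutdown on DB loss
""")

status = subprocess.run([sys.executable, "-c", buggy_child]).returncode
print(status)  # 0 -- the service manager sees a clean exit and does not respawn
```

With sys.exit(1) in the handler, the process would report failure and be respawned automatically.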
The wrong certificate was served during the outage
All traffic to the *.eu.auth0.com domain goes through an AWS ALB. When the ALB detects that an instance is not healthy or available, it takes it out of rotation. It determines health by periodically polling a health check endpoint on each instance.
When the MongoDB process was killed on the primary database instance, no authentication service processes were running. This caused all health checks to fail, and all instances were removed from the ALB. The ALB entered the "Unhealthy" state.
Due to an error in our Route53 configuration, a DNS policy existed that automatically ALIASed *.eu.auth0.com to the auth0.com domain whenever the ALB mapped to *.eu.auth0.com was in the "Unhealthy" state. As a result, requests to *.eu.auth0.com were served using the auth0.com certificate (our website's), which resulted in certificate errors in clients.
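The record behaved like a Route53 failover routing policy: a PRIMARY alias pointing at the EU ALB with target-health evaluation enabled, and a SECONDARY alias falling back to auth0.com. A hypothetical record set of that shape, expressed as a boto3-style change_resource_record_sets change batch (all names and zone IDs are placeholders, not our actual configuration):

```python
# Hypothetical shape of the Route53 failover records involved. Zone IDs and
# DNS names are placeholders; this is not our actual configuration.
change_batch = {
    "Changes": [
        {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "*.eu.auth0.com.",
                "Type": "A",
                "SetIdentifier": "eu-primary",
                "Failover": "PRIMARY",
                "AliasTarget": {
                    "HostedZoneId": "ZEXAMPLE1",  # the EU ALB's hosted zone
                    "DNSName": "eu-alb.example.elb.amazonaws.com.",
                    # Route53 fails over when the ALB reports unhealthy:
                    "EvaluateTargetHealth": True,
                },
            },
        },
        {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "*.eu.auth0.com.",
                "Type": "A",
                "SetIdentifier": "eu-secondary",
                "Failover": "SECONDARY",
                "AliasTarget": {
                    "HostedZoneId": "ZEXAMPLE2",
                    # The fallback target serves auth0.com's certificate,
                    # hence the TLS errors clients saw:
                    "DNSName": "auth0.com.",
                    "EvaluateTargetHealth": False,
                },
            },
        },
    ]
}
```

The failover mechanism itself worked as designed; the problem was that the fallback target could never present a valid certificate for *.eu.auth0.com, so the policy traded an outage for a security error.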
As soon as functionality was restored in the EU environment, the policy stopped being enforced, traffic was sent to the proper ALB, and served with the correct certificate.
We are sorry about this issue. We have learned lessons from it that we will incorporate, and we have identified action items that we will work on to help prevent similar situations in the future.
Thank you for your understanding and your continued support of Auth0.