From 0:00 UTC until 07:16 UTC all Facebook logins through Auth0 failed.
Our Facebook connection was requesting fields that Facebook had deprecated in May 2018 and then removed access to in January 2019. They made a change on July 17 2020 which caused these previously deprecated fields to begin returning invalid rather than being ignored.
We created and deployed a hotfix which was rolled out to all environments and resolved the issue.
At 2020-07-17 0:00 UTC the API used for our Facebook native and social logins began responding with (#200) Missing Permissions, causing all logins using those connections to fail.
We learned of this when a customer contacted our Developer Support Engineering (DSE) team to report this issue at 02:43. The responding DSE paged the on-call engineer for this feature who responded immediately and began investigating. At 03:01 an incident was declared.
Our initial investigation checked which environments were affected and found that all of them were. Given this it seemed unlikely to be related to a change we had made. Assuming it was an external change we searched for recent deprecations or other communications about changes to the Facebook API but found none.
The code for the connection included one comment on a field mentioning deprecation. This led us to find a deprecation notice from Facebook in May 2018. The deprecation would take effect January 2019 and specified what fields would no longer be available as part of privacy changes. Given this, the hotfix work was to focus on removing these fields.
Once the hotfix and build was complete we deployed through our environments, first to our Preview environment to confirm the fix, and then to all others. The issue was resolved in our Production environments by 07:16 UTC.
Later in the day at 12:51 UTC our automated release process deployed an unpatched release to the Preview environment. This release did not contain the hotfix and the issue was seen again in the Preview environment until 13:08 UTC when a release containing the hotfix was deployed.
The majority of our incidents follow a few common resolution paths such as rolling back a change or troubleshooting infrastructure to add capacity. This incident was abnormal in that it was an external API change that we had to hotfix to resolve. Working through this issue identified several areas where we can strengthen our current practices for addressing external changes.
We are taking the following actions to address how we handle external deprecations as well as the issues which slowed us down as we worked to resolve this incident.
Review and improve the process and playbook for how we handle deprecations from identity providers like Facebook
Update our incident playbooks to include how we respond to external change related issues like this
Create alerts for IdP connection errors so that we detect them through monitoring rather than customer reports
Review and update our hotfix deployment process to see where we can accelerate the development and deployment actions.
Investigate options to ensure follow-on deployments contain the hotfix to prevent recurrence of the hotfixed issue that was seen later in the day with this incident
All times are UTC
2020-07-17 0:00 - Change is made on the Facebook side and Facebook logins through Auth0 begin to fail
2:43: Developer Support Engineer (DSE) receives report from customer, pages Engineering team, on call Engineer A begins investigation
2:57: Engineer A confirms DSE report and declares incident.
3:01: Eng. A confirms that all regions are affected and the errors started at the same time. Auth0 deployment discarded as reason. Suspects a missed deprecation of API change
3:07: Status page updated.
3:16: Incident team checks for recent deprecation notices and sees nothing for today or close to the current date.
3:20: Eng. A starts debugging Facebook connection code
3:36: Eng. A brings in Engineer B to help
3:55: Eng. A and B suspect there is one user profile field requested during login that may be causing the problem.
4:08: Eng. B notices the “cover” field appears to be what’s causing the issue and asks Eng. A to confirm. Eng. A confirms it works when removing “cover” from the list of fields. Eng. B notes this is detailed in a deprecation that had happened a long time ago: https://developers.facebook.com/blog/post/2018/05/01/facebook-login-updates-further-protect-privacy
4:08: Eng. A decides to go for a hotfix in all regions by removing all fields mentioned in the deprecation notice found by Eng. B. Eng. A Pages release team to prep them for the upcoming hotfix and Release Engineer A joins the incident
4:10: Status page updated to “identified”.
4:17-4:34: A hotfix is created and merged
5:27: Integration and unit tests passed in all hotfix branches.
6:16: Rel. Eng. A starts hotfix deployment to one Preview environment.
6:25: Rel. Eng. A starts preparing the release for Preview noting he will not release it until initial Preview deployment is confirmed working.
6:33: Rel. Eng. A start preparing the release for Production noting he will not release it until other deployments are confirmed working.
6:38: Rel. Eng. A deploys hotfix to initial Preview environment.
6:44: Eng. A and DSE confirms hotfix working in initial Preview environment.
6:55: Eng. A greenlights deployment to other Preview environments.
7:04: Rel. Eng. A decides to deploy to Production first, before Preview, given the low risk and urgency of the fix for our customers.
7:16: Fix confirmed working in Production environments
7:37: Fix confirmed working in Preview environments
7:38: Status updated to “monitoring”.
8:01: Status updated to “resolved”.
12:51: New reports of failures in Auth0 Community
13:00: Eng. B notes there may have been a deployment to Preview without the fix, and notes there is another one in the queue already coming with the fix.
13:08: Version with the fix got deployed to Preview, errors stopped.