Increased error rate for Facebook logins
Incident Report for Auth0
Postmortem

Summary

On July 17, 2020, from 00:00 UTC until 07:16 UTC, all Facebook logins through Auth0 failed.

Our Facebook connection was requesting fields that Facebook had deprecated in May 2018 and removed access to in January 2019. On July 17, 2020, Facebook made a change that caused requests for these deprecated fields to return an error rather than having the fields ignored.

We created a hotfix and deployed it to all environments, which resolved the issue.

What Happened

At 2020-07-17 0:00 UTC the API used for our Facebook native and social logins began responding with (#200) Missing Permissions, causing all logins using those connections to fail.

We learned of this when a customer contacted our Developer Support Engineering (DSE) team to report the issue at 02:43. The responding DSE paged the on-call engineer for this feature, who responded immediately and began investigating. At 03:01 an incident was declared.

Our initial investigation checked which environments were affected and found that all of them were. Given this, it seemed unlikely that the failures were related to a change we had made. Assuming an external change, we searched for recent deprecations or other communications about changes to the Facebook API but found none.

The code for the connection included one comment on a field mentioning deprecation. This led us to a deprecation notice that Facebook had published in May 2018. The deprecation took effect in January 2019 and specified which fields would no longer be available as part of privacy changes. Based on this, the hotfix focused on removing these fields from what the connection requests.
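
The field removal described above can be illustrated with a minimal sketch. It is not the actual Auth0 connection code: the function names and the shape of the field list are hypothetical, and only "cover" is confirmed by this report as an affected field. The remaining entries would have to come from the deprecation notice itself. The sketch also omits the access token that a real request would carry.

```python
from urllib.parse import urlencode

# Assumption: only "cover" is confirmed by this report. Extend this set from
# the provider's deprecation notice rather than from this example.
DEPRECATED_FIELDS = frozenset({"cover"})


def filter_fields(requested, deprecated=DEPRECATED_FIELDS):
    """Return the requested fields in order, without deprecated ones or duplicates."""
    seen = set()
    kept = []
    for field in requested:
        if field in deprecated or field in seen:
            continue
        seen.add(field)
        kept.append(field)
    return kept


def build_profile_url(requested, base_url="https://graph.facebook.com/me"):
    """Build a profile request URL that asks only for non-deprecated fields."""
    query = urlencode({"fields": ",".join(filter_fields(requested))})
    return f"{base_url}?{query}"
```

Keeping the deprecated set in one place means a future deprecation can be handled by a data change rather than by hunting through request-building code.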

Once the hotfix and build were complete, we deployed through our environments, first to our Preview environment to confirm the fix, and then to all others. The issue was resolved in our Production environments by 07:16 UTC.

Later in the day at 12:51 UTC our automated release process deployed an unpatched release to the Preview environment. This release did not contain the hotfix and the issue was seen again in the Preview environment until 13:08 UTC when a release containing the hotfix was deployed.
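
One way to guard against this kind of regression is a pre-deployment check that refuses a release which lacks a known hotfix. The following is a sketch under assumptions, not a description of our release tooling: the function names are hypothetical, and the commit identifiers are assumed to be available as plain strings (for example, gathered from the version control history of the release branch).

```python
def missing_hotfixes(release_commits, required_hotfixes):
    """Return the hotfix commit ids absent from the release, sorted for stable output."""
    return sorted(set(required_hotfixes) - set(release_commits))


def assert_release_is_patched(release_commits, required_hotfixes):
    """Raise if the release candidate does not contain every required hotfix."""
    missing = missing_hotfixes(release_commits, required_hotfixes)
    if missing:
        raise RuntimeError(
            "release is missing hotfix commits: " + ", ".join(missing)
        )
```

A gate like this would run before the automated release process promotes a build, so a release cut before the hotfix was merged is stopped instead of deployed.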

Mitigation Actions

The majority of our incidents follow a few common resolution paths such as rolling back a change or troubleshooting infrastructure to add capacity. This incident was abnormal in that it was an external API change that we had to hotfix to resolve. Working through this issue identified several areas where we can strengthen our current practices for addressing external changes.

We are taking the following actions to address how we handle external deprecations as well as the issues which slowed us down as we worked to resolve this incident.

  • Review and improve the process and playbook for how we handle deprecations from identity providers like Facebook

    • As an intermediary between IdPs and our customers, we need to balance responding to deprecations while giving customers time to adjust
  • Update our incident playbooks to include how we respond to issues caused by external changes like this one

  • Create alerts for IdP connection errors so that we detect them through monitoring rather than customer reports

  • Review and update our hotfix deployment process to see where we can accelerate the development and deployment steps

  • Investigate options to ensure follow-on deployments contain the hotfix, to prevent a recurrence like the one seen later in the day during this incident
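
As an illustration of the alerting action above, the sketch below tracks login outcomes for one IdP connection over a sliding time window and signals when the failure rate crosses a threshold. The class name, the default window, the minimum attempt count, and the threshold are all assumptions chosen for the example, not values from our monitoring. It also assumes timestamps arrive in non-decreasing order.

```python
from collections import deque


class ErrorRateAlert:
    """Sliding-window failure-rate check for a single IdP connection."""

    def __init__(self, window_seconds=300, min_attempts=20, threshold=0.5):
        self.window_seconds = window_seconds
        self.min_attempts = min_attempts  # avoid alerting on tiny samples
        self.threshold = threshold        # failure fraction that triggers an alert
        self._events = deque()            # (timestamp, succeeded) pairs, oldest first

    def record(self, timestamp, succeeded):
        """Record one login attempt and drop events that left the window."""
        self._events.append((timestamp, succeeded))
        self._evict(timestamp)

    def _evict(self, now):
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()

    def should_alert(self, now):
        """Return True when enough attempts exist and too many of them failed."""
        self._evict(now)
        attempts = len(self._events)
        if attempts < self.min_attempts:
            return False
        failures = sum(1 for _, ok in self._events if not ok)
        return failures / attempts >= self.threshold
```

With a check like this per connection, a total failure of one provider's logins would surface within one window of the first failed attempts rather than depending on a customer report.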

Annex 1: Events Timeline

All times are UTC

2020-07-17 0:00 - Change is made on the Facebook side and Facebook logins through Auth0 begin to fail

2:43: Developer Support Engineer (DSE) receives report from customer, pages Engineering team, on call Engineer A begins investigation

2:57: Engineer A confirms DSE report and declares incident.

3:01: Eng. A confirms that all regions are affected and that the errors started at the same time. An Auth0 deployment is ruled out as the cause. Eng. A suspects a missed deprecation or API change.

3:07: Status page updated.

3:16: Incident team checks for recent deprecation notices and sees nothing for today or close to the current date.

3:20: Eng. A starts debugging Facebook connection code

3:36: Eng. A brings in Engineer B to help

3:55: Eng. A and B suspect there is one user profile field requested during login that may be causing the problem.

4:08: Eng. B notices the “cover” field appears to be what’s causing the issue and asks Eng. A to confirm. Eng. A confirms it works when removing “cover” from the list of fields. Eng. B notes this is detailed in a deprecation that had happened a long time ago: https://developers.facebook.com/blog/post/2018/05/01/facebook-login-updates-further-protect-privacy

4:08: Eng. A decides to go for a hotfix in all regions by removing all fields mentioned in the deprecation notice found by Eng. B. Eng. A pages the release team to prepare them for the upcoming hotfix, and Release Engineer A joins the incident.

4:10: Status page updated to “identified”.

4:17-4:34: A hotfix is created and merged

5:27: Integration and unit tests passed in all hotfix branches.

6:16: Rel. Eng. A starts hotfix deployment to one Preview environment.

6:25: Rel. Eng. A starts preparing the release for Preview noting he will not release it until initial Preview deployment is confirmed working.

6:33: Rel. Eng. A starts preparing the release for Production, noting he will not release it until other deployments are confirmed working.

6:38: Rel. Eng. A deploys hotfix to initial Preview environment.

6:44: Eng. A and DSE confirm the hotfix is working in the initial Preview environment.

6:55: Eng. A greenlights deployment to other Preview environments.

7:04: Rel. Eng. A decides to deploy to Production first, before Preview, given the low risk and urgency of the fix for our customers.

7:16: Fix confirmed working in Production environments

7:37: Fix confirmed working in Preview environments

7:38: Status updated to “monitoring”.

8:01: Status updated to “resolved”.

12:51: New reports of failures in Auth0 Community

13:00: Eng. B notes there may have been a deployment to Preview without the fix, and notes there is another one in the queue already coming with the fix.

13:08: A version with the fix is deployed to Preview and the errors stop.

Posted Jul 31, 2020 - 05:19 UTC

Resolved
This incident has been resolved.
Posted Jul 17, 2020 - 08:00 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jul 17, 2020 - 07:39 UTC
Update
We are continuing to work on the rollout of the fix.
Posted Jul 17, 2020 - 06:15 UTC
Update
We are continuing to work on a fix for this issue.
Posted Jul 17, 2020 - 05:12 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Jul 17, 2020 - 04:10 UTC
Update
We are continuing to investigate this issue.
Posted Jul 17, 2020 - 03:59 UTC
Investigating
We are seeing an increased number of failures during Facebook web and native logins. We are investigating.
Posted Jul 17, 2020 - 03:05 UTC
This incident affected: Auth0 Europe (PROD) (Authentication API), Auth0 US-2 (PROD) (Authentication API), Auth0 Australia (PROD) (Authentication API), Auth0 Europe (PREVIEW) (Authentication API), Auth0 Australia (PREVIEW) (Authentication API), Auth0 US (PREVIEW) (Authentication API (PREVIEW)), and Auth0 US (PROD) (Authentication API).