Technical Details of the 4th October Facebook Outage (BGP)
In recent times we have seen outages on some of the busiest services on the Internet, such as AWS. Most of these services are designed with fault tolerance and disaster recovery in mind and are self-healing, but that is not always enough. Facebook’s 4th October 2021 outage is no different, and every time something like this happens we get an opportunity to learn something new.
From various sources, we know at least that the problem lies with BGP (Border Gateway Protocol). We saw that Facebook’s routes were withdrawn from BGP. So let’s see what could have caused this outage.
Read more on BGP withdrawal protection here: Preventing the Unnecessary Propagation of BGP Withdraws
1. What is BGP?
- The Border Gateway Protocol (BGP) allows Internet Service Providers (ISPs) and enterprise networks (like Facebook) to announce routes towards their IP prefixes.
- Simply put, BGP is the protocol that connects the networks that make up the Internet.
- As you see above, in an organization with a flat network, devices (A and B) can connect to each other using local network addresses.
- Now, if the organization has several LANs, they can connect through a router, as seen above. The router here uses various routing protocols, including internal BGP.
- Large organizations (like Facebook) and ISPs manage Internet connectivity from multiple sites, and the whole network under one administration is called an AS, or Autonomous System.
- An AS can handle its local traffic internally, but we need BGP to handle traffic between different Autonomous Systems.
External BGP: External BGP handles traffic between different Autonomous Systems.
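To make the internal/external distinction concrete, here is a minimal Python sketch. It assumes the usual rule that a BGP session between two routers in the same AS is internal (iBGP), while a session between routers in different ASes is external (eBGP). The function name is made up for illustration, and the AS numbers come from the range reserved for documentation, not from any real network.

```python
def session_type(local_as: int, peer_as: int) -> str:
    """Classify a BGP session by comparing the AS numbers of the two peers."""
    # Same AS on both ends: internal BGP. Different ASes: external BGP.
    return "iBGP" if local_as == peer_as else "eBGP"


# Hypothetical AS numbers from the documentation range (64496-64511).
print(session_type(64500, 64500))  # two routers inside one organization
print(session_type(64500, 64501))  # an organization peering with an ISP
```

The second call is the case this article cares about: traffic crossing from one Autonomous System to another.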
2. What possibly caused the Facebook downtime?
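- A router running external BGP maintains a routing table, which it uses to find the best path to the destination for network packets/traffic.
- As a path-vector protocol, BGP exchanges two types of routing messages: updates and withdraws. (On the wire, both are carried inside BGP UPDATE messages.)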
- A BGP update is used to advertise a path towards a prefix, or a change in a previously announced path towards a prefix (adding or updating a path).
- In layman’s terms, Facebook tells the BGP network how to reach the Facebook servers by publishing paths.
- A BGP withdraw indicates that a previously announced prefix has become unreachable (deleting paths from routing tables).
- In layman’s terms, Facebook tells the BGP network to withdraw, or delete, those paths.
- To conclude, as far as we know, Facebook’s servers reached out to the BGP network and withdrew the paths that allow the Internet to reach Facebook. Hence, routing tables had no information on how to reach the Facebook servers, and this caused the downtime.
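The effect of updates and withdraws on a routing table can be sketched in a few lines of Python. This is only a toy model, not how a real BGP implementation works: the class `ToyRoutingTable` and its methods are hypothetical, the prefix is taken from the address block reserved for documentation, and the AS numbers are from the documentation range rather than Facebook's real ones.

```python
from ipaddress import ip_address, ip_network


class ToyRoutingTable:
    """A toy routing table mapping each announced prefix to one AS path."""

    def __init__(self):
        self.routes = {}  # prefix (ip_network) -> AS path (list of AS numbers)

    def announce(self, prefix, as_path):
        # A BGP update adds a path or replaces a previously announced one.
        self.routes[ip_network(prefix)] = list(as_path)

    def withdraw(self, prefix):
        # A BGP withdraw removes the prefix; traffic to it has nowhere to go.
        self.routes.pop(ip_network(prefix), None)

    def lookup(self, address):
        # Pick the most specific (longest) prefix that contains the address.
        addr = ip_address(address)
        matches = [prefix for prefix in self.routes if addr in prefix]
        if not matches:
            return None
        best = max(matches, key=lambda prefix: prefix.prefixlen)
        return self.routes[best]


table = ToyRoutingTable()
table.announce("192.0.2.0/24", [64501, 64500])
print(table.lookup("192.0.2.10"))  # [64501, 64500]

table.withdraw("192.0.2.0/24")
print(table.lookup("192.0.2.10"))  # None
```

After the withdraw, the lookup returns nothing. That is the situation the rest of the Internet was in during the outage: the servers may still have been running, but no router knew a path to them.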
3. Possible causes of prefix withdrawal.
- The research paper at https://link.springer.com/content/pdf/10.1007%2F978-3-642-01399-7_39.pdf explains some of the reasons for path withdrawal, though the possible causes are not limited to these. Two of the reasons are given below.
- First, some interdomain links are unstable and fail frequently. Each of these failures causes the transmission of a number of BGP withdraws.
- Second, as BGP relies on path vectors, it suffers from the path exploration problem when a route becomes unavailable. When a route fails, a new BGP convergence starts. During this convergence, routers may advertise paths that they consider valid although they are also affected by the failure. These paths will be withdrawn later, causing another exchange of BGP messages.
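Path exploration can be illustrated with a deliberately simplified Python sketch. It assumes a single router that has learned several paths to the same prefix, prefers the shortest AS path (real BGP best-path selection has more steps), and loses its paths one at a time as the withdraws reach it. The function name `explore_paths` and the AS numbers are hypothetical.

```python
def explore_paths(candidate_paths):
    """Return the messages a router sends as its stale paths disappear one by one."""
    # Prefer shorter AS paths first, a simplification of real best-path selection.
    remaining = sorted(candidate_paths, key=len)
    messages = []
    while remaining:
        # The current best path is withdrawn by the neighbor that advertised it.
        remaining.pop(0)
        if remaining:
            # The router falls back to the next path, which is also stale.
            messages.append(("update", remaining[0]))
        else:
            # Only now does the router tell its peers the prefix is gone.
            messages.append(("withdraw", None))
    return messages


# Three paths to the same origin AS 64500, all broken by the same failure.
paths = [
    [64501, 64500],
    [64502, 64501, 64500],
    [64503, 64502, 64501, 64500],
]
for message in explore_paths(paths):
    print(message)
```

In this toy run the router sends two updates for paths that are already dead before it finally sends the withdraw, where a single withdraw would have been enough. That extra churn is the path exploration problem described above.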