Facebook went down completely on Monday, bringing Instagram and WhatsApp (among others) down with it. According to sources inside Facebook, traffic analysis, and instinct, the incident was related to BGP, or Border Gateway Protocol. Facebook is on its way back up, but this does raise some questions.
What is BGP?
BGP helps traffic get delivered across the internet quickly; at its core, it is a routing protocol. Because so many different ISPs, backbone routers, and servers handle your data on its way to Facebook, your packets could take any of a number of paths. It is BGP’s job to work out which routes are available and to suggest the best one.
BGP has been compared to post offices, air traffic controllers, and more, but a map may be the most useful comparison. BGP is like a bunch of people who create and update maps showing you where to find YouTube or Facebook.
Routers use BGP to figure out whose bridges your traffic must cross in order to get to Facebook
With BGP, the internet is divided into large networks called autonomous systems. They’re kind of like island nations: each is controlled by a single large organization, such as an internet service provider like Comcast, a company, a university, or a government agency. Because it would be extremely challenging to connect every island directly to every other, BGP determines which islands (or autonomous systems) you have to pass through in order to reach your destination.
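To make the island picture concrete, here is a minimal sketch that finds a chain of islands between two points. The network names and the links between them are invented for illustration, and real BGP does not search a global map like this (each autonomous system only knows what its neighbors tell it), so treat it as a toy model of the idea rather than the protocol.

```python
from collections import deque

# Hypothetical islands (autonomous systems) and the bridges between them.
NEIGHBORS = {
    "HomeISP": ["RegionalCarrier"],
    "RegionalCarrier": ["HomeISP", "BackboneA", "BackboneB"],
    "BackboneA": ["RegionalCarrier", "Facebook"],
    "BackboneB": ["RegionalCarrier", "University"],
    "University": ["BackboneB"],
    "Facebook": ["BackboneA"],
}

def as_path(start, goal):
    """Return one shortest chain of autonomous systems from start to goal, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in NEIGHBORS[path[-1]]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

print(as_path("HomeISP", "Facebook"))
# ['HomeISP', 'RegionalCarrier', 'BackboneA', 'Facebook']
```

Nothing connects HomeISP to Facebook directly; the traffic has to cross two other islands on the way.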
Because the internet is constantly changing, updated maps are vital – you don’t want to be led down a path that doesn’t lead to Google. It would be impossible to map the entire internet continuously, so autonomous systems share their maps. Every now and then, they’ll check their island neighbors’ maps to see what updates they’ve made.
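The map-sharing idea can be sketched in a few lines. In this toy model (the network names are hypothetical, and the prefix is an address block reserved for documentation), one network announces where it can be found, and each neighbor copies the route onto its own map before passing it along. A network ignores any route that already contains its own name, which is a simplified version of how BGP avoids loops.

```python
# Hypothetical neighbor relationships between autonomous systems.
NEIGHBORS = {
    "Facebook": ["BackboneA"],
    "BackboneA": ["Facebook", "RegionalCarrier"],
    "RegionalCarrier": ["BackboneA", "HomeISP"],
    "HomeISP": ["RegionalCarrier"],
}

def propagate(origin, prefix):
    """Spread one route announcement from neighbor to neighbor.

    Returns {as_name: {prefix: as_path}}, where as_path lists the
    systems to cross, ending at the origin.
    """
    tables = {name: {} for name in NEIGHBORS}
    tables[origin][prefix] = [origin]
    pending = [origin]
    while pending:
        sender = pending.pop(0)
        path = tables[sender][prefix]
        for neighbor in NEIGHBORS[sender]:
            if neighbor in path:
                continue  # loop prevention: skip paths that already include you
            candidate = [neighbor] + path
            known = tables[neighbor].get(prefix)
            if known is None or len(candidate) < len(known):
                tables[neighbor][prefix] = candidate
                pending.append(neighbor)
    return tables

tables = propagate("Facebook", "203.0.113.0/24")
print(tables["HomeISP"]["203.0.113.0/24"])
# ['HomeISP', 'RegionalCarrier', 'BackboneA', 'Facebook']
```

No single network drew the whole map here; each one only learned the route from the neighbor next to it.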
Thinking of BGP as a map makes it easier to see what could go wrong. When GPS first became available to consumers, there were always jokes about it taking you off a cliff or out into the desert. Likewise, when BGP is used incorrectly, someone can steer traffic somewhere it is not supposed to go, causing problems. A mistake like that will end up on everyone’s map if it’s not caught. The situation can go wrong in other ways as well, but we’ll cover those later.
So BGP is like a map that shows you the fastest way to get to a website?
Mostly! Unfortunately, the route that wins isn’t always the shortest or the fastest. Many factors can influence a routing decision; cost is often one, since some networks charge others to be included in their routes.
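Here is a rough sketch of why the shortest route can lose. BGP routers compare an operator-assigned preference before they compare path length, so a network can favor a cheaper neighbor even when that route crosses more islands. The neighbor names and preference numbers below are made up, and the real decision process has more tie-breakers than the two shown.

```python
from dataclasses import dataclass

@dataclass
class Route:
    via: str          # neighbor that offered the route (hypothetical names)
    as_path: list     # autonomous systems the traffic would cross
    local_pref: int   # operator-assigned preference; higher wins

def best_route(routes):
    """Pick the highest local preference first, then the shortest AS path."""
    return max(routes, key=lambda r: (r.local_pref, -len(r.as_path)))

offers = [
    # Shorter, but this neighbor charges for carrying traffic.
    Route("PaidTransit", ["PaidTransit", "Facebook"], local_pref=100),
    # Longer, but free, so the operator gave it a higher preference.
    Route("FreePeer", ["FreePeer", "BackboneA", "Facebook"], local_pref=200),
]

print(best_route(offers).via)
# FreePeer
```

If both offers had the same preference, the two-hop route through PaidTransit would win instead.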
It’s hard to map unchanging roads; imagine mapping the internet
Mapping is also hard! While trying to plan a trip recently, I discovered that some roads were missing from certain maps or were drawn differently from one map to the next. One road was named differently across three maps. That was in a town with only five streets; imagine how difficult it would be to do the same for the entire internet. And roads do not change that often, whereas websites move countries, switch service providers, or add and remove functionality, and the internet just has to cope.
According to a paper presented earlier this year, Facebook has built its own BGP system, which enables it to perform “fast incremental updates.” What the paper describes, however, is a system for data center communication, and it’s hard to determine whether it had anything to do with Monday’s problems; I would need someone smarter than me to say. Brian Krebs of the cybersecurity blog KrebsOnSecurity says that the outage was the result of a routine BGP update.
In Facebook’s engineering update, it said that the issue was caused by “configuration changes on the backbone routers that coordinate network traffic between our data centers.” That then led to a “cascading effect on the way [Facebook’s] data centers communicate, bringing [its] services to a halt.” It looks like the issue was with Facebook communicating with itself, not with the outside world (though that could certainly cause an outage given how much Facebook controls).
DNS – what’s the deal?
In Cloudflare’s words: DNS tells you where you’re going, and BGP lets you get there. Computers use DNS to look up the IP address of a website or other resource, but that knowledge isn’t enough on its own. For example, if you ask your friend where his house is, it’s likely that you will still need a GPS system in order to find it.
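The two steps can be sketched separately. In this toy example, the domain, the addresses (taken from a block reserved for documentation), and the neighbor name are all placeholders. The first function plays the role of DNS, the second plays the role of the routing table that BGP fills in.

```python
import ipaddress

# Hypothetical data: a tiny DNS table and a tiny routing table.
DNS_RECORDS = {"example.com": "203.0.113.10"}
ROUTES = {ipaddress.ip_network("203.0.113.0/24"): "BackboneA"}

def resolve(name):
    """DNS step: turn a name into an IP address (None if unknown)."""
    return DNS_RECORDS.get(name)

def next_hop(address, routes):
    """Routing step: find which neighbor carries traffic for this address."""
    ip = ipaddress.ip_address(address)
    matches = [net for net in routes if ip in net]
    if not matches:
        return None
    # Prefer the most specific matching block of addresses.
    return routes[max(matches, key=lambda net: net.prefixlen)]

address = resolve("example.com")
print(address, next_hop(address, ROUTES))  # 203.0.113.10 BackboneA
print(address, next_hop(address, {}))      # 203.0.113.10 None
```

The second call shows the friend's-house problem: the address is known, but with an empty map there is no way to get there.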
A Cloudflare article goes into the specifics of Monday’s Facebook incident, so if you’re looking for an explanation from an autonomous system’s perspective, be sure to read it.
BGP: What could go wrong?
Several things. Cloudflare mentions two incidents in which an ISP made a mistake: one in 2004, when an ISP accidentally told the entire internet to route traffic through its service, and another in which a Pakistani ISP accidentally blocked YouTube worldwide while trying to block it only for its own subscribers. Because networks pass routes along to their neighbors, a mistake made by one group can cascade into others (the same sharing is also what makes the system work).
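One way a bad route can beat a good one is by being more specific. Routers generally prefer the narrowest block of addresses that matches a destination, so a mistaken announcement for a smaller slice of someone else's addresses can pull that traffic away. The sketch below assumes this mechanism and uses invented network names with a documentation address range; it is an illustration, not a reconstruction of either incident.

```python
import ipaddress

def pick_route(address, table):
    """Return the origin whose announced prefix is the most specific match."""
    ip = ipaddress.ip_address(address)
    matches = [(net, origin) for net, origin in table if ip in net]
    if not matches:
        return None
    return max(matches, key=lambda m: m[0].prefixlen)[1]

# Hypothetical announcements: the legitimate owner covers the whole block.
table = [(ipaddress.ip_network("192.0.2.0/24"), "VideoSite")]
print(pick_route("192.0.2.200", table))  # VideoSite

# A mistaken, narrower announcement leaks out and lands on everyone's map.
table.append((ipaddress.ip_network("192.0.2.128/25"), "MistakenISP"))
print(pick_route("192.0.2.200", table))  # MistakenISP
print(pick_route("192.0.2.5", table))    # VideoSite
```

Only the addresses inside the narrower block are diverted, which is why an error like this can be partial and confusing to diagnose.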
BGP has been called the “duct tape of the internet”
By compromising a separate ISP’s BGP servers, hackers in 2018 were able to hijack requests to Amazon’s DNS and steal thousands of dollars in Ethereum. Amazon itself was not hacked, but traffic meant for its servers was sent elsewhere.
An incorrect BGP update can cause you to lose your entire service. Although BGP is affectionately referred to as the “duct tape of the internet,” no adhesive is perfect.
What went wrong with Facebook?
Facebook’s servers apparently told everyone to take them off their maps. There is an initial report from Facebook, but it’s very light on details. Facebook may release a more in-depth explanation later, saying why the changes were made, but for now that’s likely all we’ll hear (at least officially).
According to Cloudflare’s CTO, however, the service saw a large number of BGP updates from Facebook just before it went dark (most of which were route withdrawals, that is, routes disappearing from the map). Several Fastly engineers have tweeted that Facebook withdrew the routes it provided to Fastly when the company went offline, and KrebsOnSecurity confirms this was caused by some update to Facebook’s BGP.
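A route withdrawal is easy to picture as an update to a neighbor's table. The sketch below is a simplified model: the prefixes are documentation ranges rather than Facebook's real ones, the second network name is invented, and `apply_update` is a hypothetical helper, not part of any BGP software.

```python
# Hypothetical routing table kept by one of Facebook's neighbors: prefix -> AS path.
routes = {
    "203.0.113.0/24": ["Facebook"],
    "198.51.100.0/24": ["Facebook"],
    "192.0.2.0/24": ["OtherNetwork"],
}

def apply_update(table, announce=None, withdraw=()):
    """Apply one BGP-style update: add announced routes, erase withdrawn ones."""
    for prefix, path in (announce or {}).items():
        table[prefix] = path
    for prefix in withdraw:
        table.pop(prefix, None)  # withdrawing an unknown route is harmless
    return table

# The origin withdraws its routes, so they vanish from the neighbor's map.
apply_update(routes, withdraw=["203.0.113.0/24", "198.51.100.0/24"])
print(sorted(routes))
# ['192.0.2.0/24']
```

Putting the routes back is the same operation in reverse, an update that announces them again, but someone has to be able to reach the routers to send it.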
If you want the nitty-gritty technical details, Cloudflare’s explanation is recommended.
If BGP was the problem, how does Facebook fix it?
With the outage lasting for hours, the answer must be “not easily.” Facebook had to make sure that the routes it advertised were correct and that they were picked up by the internet at large. In other words, it needed to ensure that its maps were up to date and that everyone could see them.
Getting there is trickier than it sounds. Several Facebook employees were reportedly locked out of badge-protected doors and unable to communicate with coworkers. In a situation like this, a big part of the challenge is working out who has the knowledge and the permission to solve the problem, and how to put those people in touch. Reports said that Facebook engineers were physically sent to a California data center to repair the problem while the company’s services were down.
Is Web3 a viable solution?
The quick answer is probably not. Even if Facebook hopped on the decentralized train, there would still need to be some protocol describing where to find its resources. As we’ve seen before, blockchain contracts can be misconfigured, so I’d be skeptical if anyone said that a blockchain-based internet is immune to this kind of issue.
With all the bad Facebook news, that outage sure was odd timing, wasn’t it?
Okay, yes: all of this happened right after a whistleblower went on TV to air Facebook’s dirty laundry, so there’s no shortage of alternate explanations. But the problem could just as easily have been caused by an innocent mistake made by a (very, very unfortunate) member of Facebook’s IT team.
That is the official explanation provided by Facebook. Rather than blaming any devious hack, it blames a “faulty configuration change.”