AS701 BGP leak
Incident Report for NamePros
Postmortem

Summary

Early on June 24, 2019, Verizon (AS701) leaked a number of routes, disrupting internet availability globally. The issue took several hours to resolve. Verizon could have prevented it by following basic BGP best practices. Verizon failed to respond to the issue, so the problem could only be resolved by the small upstream AS that announced the faulty routes to Verizon.

Impact

NamePros saw traffic abruptly decrease by about 60% around 10:34 UTC on June 24, 2019. Traffic briefly returned to near-normal levels during the following hour before settling around 11:30 UTC. Intermittent partial disruptions continued to occur until around 13:35 UTC.
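As a rough cross-check of these figures, the sharp traffic drops and recoveries recorded in the Timeline section can be paired into windows and summed. This is a minimal sketch: pairing each drop with the next "near-normal" entry is an interpretation of the timeline, not something the monitoring system reported, and entries given only to the minute are assumed to fall at :00 seconds.

```python
from datetime import datetime, timedelta

# (drop, recovery) pairs in UTC, read off the timeline for June 24, 2019.
# The pairing itself is an assumption made for this illustration.
WINDOWS = [
    ("10:34:37", "11:20:00"),
    ("11:27:00", "11:30:00"),
    ("11:50:00", "11:59:00"),
    ("12:25:00", "12:28:00"),
]


def parse(t):
    """Parse an HH:MM:SS string as a time on the day of the incident."""
    return datetime.strptime("2019-06-24 " + t, "%Y-%m-%d %H:%M:%S")


def degraded_time(windows):
    """Total time spent inside the sharp-drop windows."""
    return sum((parse(end) - parse(start) for start, end in windows), timedelta())


def incident_span(first_drop="10:34:37", back_to_normal="13:35:00"):
    """Time from the first drop until traffic is reported as fully normal."""
    return parse(back_to_normal) - parse(first_drop)


if __name__ == "__main__":
    print("sharply degraded:", degraded_time(WINDOWS))
    print("whole incident:  ", incident_span())
```

Under these assumptions, the sharp drops add up to roughly one hour of an incident that spanned about three hours; for the remainder, traffic was near normal but not fully recovered.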

During disruptions, NamePros' infrastructure was fragmented. Infrastructure automatically recovered without intervention when disruptions ended. This fragmentation did not affect the functionality of NamePros from a user’s perspective, but it did interrupt database and file replication to backup servers. As a result, some snapshots taken during and shortly after the disruptions may represent outdated views of the databases at the times they were taken. The integrity and consistency of the snapshots were unaffected, and up-to-date snapshots were taken at one datacenter.

Remediation

NamePros is not an AS and cannot mitigate routing leaks. Pressure needs to be placed on ISPs, especially large ISPs like Verizon, to follow BGP best practices. Until such time as all major ISPs follow best practices, these issues will continue to occur. Switching CDNs, datacenters, or technologies will not prevent future issues.
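This report does not list the specific practices in question. Two safeguards commonly cited for this kind of leak are filtering a customer's announcements against the prefixes it is expected to originate and enforcing a per-session maximum prefix count. The following is a simplified sketch of both ideas, not router configuration; the function name, its parameters, and the documentation-range prefixes are hypothetical.

```python
import ipaddress


def filter_session(announced, allowed, max_prefixes):
    """Apply an exact-match prefix filter and a max-prefix check.

    announced:    prefixes received from a neighbor (strings)
    allowed:      prefixes the neighbor is expected to originate (strings)
    max_prefixes: how many prefixes the neighbor is expected to send at most

    Returns (accepted, rejected, within_limit). A real router would
    typically tear the session down when within_limit is False.
    """
    allowed_nets = {ipaddress.ip_network(p) for p in allowed}
    accepted, rejected = [], []
    for prefix in announced:
        net = ipaddress.ip_network(prefix)
        # Exact match only: a more-specific of an allowed prefix is rejected.
        if net in allowed_nets:
            accepted.append(prefix)
        else:
            rejected.append(prefix)
    within_limit = len(announced) <= max_prefixes
    return accepted, rejected, within_limit


if __name__ == "__main__":
    announced = ["192.0.2.0/24", "198.51.100.0/25", "203.0.113.0/24"]
    print(filter_session(announced, allowed=["192.0.2.0/24"], max_prefixes=2))
```

In this example only the one expected prefix is accepted, and the session exceeds its limit because three prefixes arrive where at most two are expected. Either check alone would have flagged the unexpected announcements.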

NamePros' infrastructure behaved as designed during the disruption. Despite fragmentation, the website remained functional for those who were able to access it. When the disruptions ended, infrastructure automatically healed without intervention from NamePros staff.

Despite global internet issues, monitoring and alerting were mostly functional. The alerting system was unreachable from the on-call staff member’s residential internet connection; he worked around this by accessing the system from his phone over the cell network. Had the issue affected both networks, he wouldn’t have been alerted. It may be possible to mitigate this risk by triggering an alert locally when the alerting system is unreachable; however, this is likely to be of little benefit, as there is nothing NamePros can do to resolve global internet disruptions anyway.
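A local fallback of that kind could look roughly like the sketch below. It is not something NamePros runs: the class, the threshold, and the host name alerts.example.com are hypothetical. A small watchdog on the on-call machine probes the alerting system and raises a local notification after several consecutive failures.

```python
import socket


def can_reach(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class LocalWatchdog:
    """Fire a local notification after `threshold` consecutive failed probes."""

    def __init__(self, probe, notify, threshold=3):
        self.probe = probe          # callable returning True when reachable
        self.notify = notify        # callable taking a message string
        self.threshold = threshold
        self.failures = 0
        self.alerted = False

    def tick(self):
        """Run one probe; return True if a notification was sent this tick."""
        if self.probe():
            # Reachable again: reset so a later outage alerts once more.
            self.failures = 0
            self.alerted = False
            return False
        self.failures += 1
        if self.failures >= self.threshold and not self.alerted:
            self.alerted = True
            self.notify("alerting system unreachable from this network")
            return True
        return False


if __name__ == "__main__":
    # Hypothetical host; replace with the real alerting endpoint.
    watchdog = LocalWatchdog(lambda: can_reach("alerts.example.com"), print)
    watchdog.tick()
```

Calling tick() on a schedule, for example once a minute, would notify once per outage rather than on every failed probe. Requiring several consecutive failures avoids firing on a single dropped connection.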

Timeline

  • June 24, 2019 at 10:34:37 UTC: Traffic to NamePros abruptly drops to about 40% of normal levels.
  • 10:36: Alert #1384 is raised when monitors in Virginia, New Jersey, and Ohio fail to connect to NamePros. Over the course of the next few minutes, other monitors throughout the world also indicate they are unable to connect.
  • 10:40: Alert #1385 is raised when a backup server in Massachusetts fails to report metrics.
  • 10:43: Monitors in the US are intermittently able to connect to NamePros.
  • 10:45: Alerts #1384 and #1385 are determined to be caused by widespread internet issues. As the issue is outside NamePros' control, the alerts are marked as resolved. The on-call staff member cannot reach the alerting system from his residential internet connection because of the underlying internet issue, so he resolves the alerts from his phone.
  • 10:55: Alert #1386 indicates a backup server in Quebec has failed to report metrics. The alert is acknowledged and put on hold while the situation is monitored.
  • 11:02: Cloudflare posts a status update on cloudflarestatus.com: “Cloudflare is observing network related issues.”
  • 11:05: Alert #1386 is automatically resolved when the backup server in Quebec resumes reporting metrics.
  • 11:19: Traffic to NamePros rises to about 70% of normal levels and continues rising slowly.
  • 11:20: Traffic to NamePros returns to near-normal levels (within about 10%).
  • 11:24: Alert #1387 indicates the backup server in Massachusetts is failing to report metrics again. The alert is acknowledged and put on hold while the situation is monitored.
  • 11:27: Traffic to NamePros drops to about 30% of normal levels.
  • 11:27: Alert #1388 is raised when monitors in Virginia and New Jersey continue having trouble connecting to NamePros. The alert is acknowledged and put on hold while the situation is monitored.
  • 11:30: Traffic to NamePros returns to near-normal levels.
  • 11:30: Alert #1388 is automatically resolved when all monitors in the US report that they are able to connect to NamePros.
  • 11:36: Cloudflare posts a status update on cloudflarestatus.com: “We have identified a possible route leak impacting some Cloudflare IP ranges and are working with the network involved to resolve this.”
  • 11:38: Alert #1389 is raised when the backup server in Quebec fails to report metrics again. The alert is acknowledged and put on hold while the situation is monitored.
  • 11:38: Alert #1390 is raised when monitors in the US are unable to connect to NamePros again. It is automatically resolved less than a minute later, indicating that the monitors are intermittently able to connect to NamePros.
  • 11:39: Alert #1387 is automatically resolved when the backup server in Massachusetts resumes reporting metrics.
  • 11:43: Cloudflare posts a status update on cloudflarestatus.com: “We are continuing to work on a fix for this issue.”
  • 11:50: Traffic to NamePros drops to about 30% of normal levels.
  • 11:52: Alert #1391 is raised because monitors in the US continue to be intermittently unable to connect to NamePros. The alert is acknowledged and put on hold while the situation is monitored.
  • 11:54: Alert #1389 is automatically resolved when the backup server in Quebec resumes reporting metrics.
  • 11:59: Traffic to NamePros returns to near-normal levels.
  • 12:00: Further alerts are suppressed. At this point, it has become clear that a BGP leak has occurred and is impacting global internet availability. NamePros' infrastructure is now fragmented, but the website is mostly functional for those who are able to access it. Failing over to alternative CDNs or datacenters won’t resolve the issue, so no action can be taken by NamePros.
  • 12:25: Traffic to NamePros drops to about 40% of normal levels.
  • 12:28: Traffic to NamePros returns to near-normal levels.
  • 12:29: The only remaining monitor unable to connect to NamePros, located in Singapore, indicates that it is now able to connect.
  • 12:34: Cloudflare posts a status update on cloudflarestatus.com: “This leak is impacting many internet services including Cloudflare. We are continuing to work with the network provider that created this route leak to remove it.”
  • 12:42: Cloudflare posts a status update on cloudflarestatus.com: “The network responsible for the route leak has now fixed the issue. We are seeing improvement and are continuing to monitor this before we consider this issue resolved.”
  • 13:02: Cloudflare posts a status update on cloudflarestatus.com: “Traffic levels have returned to normal now that the route leak has been fixed. We are now marking this incident as resolved.”
  • 13:35: Traffic to NamePros returns to normal levels.
  • 19:58: Cloudflare publishes a postmortem on their blog, attributing the issue to Verizon (AS701) and scolding Verizon for their failure to follow basic BGP best practices.
Posted Jun 25, 2019 - 05:36 UTC

Resolved
AS701 (Verizon) BGP leak resulted in several hours of intermittent downtime. See: https://blog.cloudflare.com/how-verizon-and-a-bgp-optimizer-knocked-large-parts-of-the-internet-offline-today/
Posted Jun 24, 2019 - 10:36 UTC