BGP Graceful Restart (GR)

NSF/SSO, NSR, Graceful Restart

Nonstop forwarding (NSF) refers to the capability of the data plane to continue forwarding IP packets while the control plane is momentarily unavailable, most likely because of an RP switchover (a failover to a standby RP).

Stateful switchover (SSO) refers to the capability of the control plane to hold configuration and various states during this switchover, which reduces the time needed before the newly active control plane can take over. This is also handy when doing scheduled hitless upgrades within the ISSU execution path. The time for the newly active RP to reach SSO may vary depending on the type and scale of the configuration.

Graceful restart (GR) refers to the capability of the control plane to delay advertising the absence of a peer (one going through a control-plane switchover) for a “grace period”, and thus help minimize disruption during that time (assuming the standby control plane comes up). GR is based on extensions per routing protocol, which are interoperable across vendors. The downside of the grace period is significant when the peer fails completely and never comes back up: its neighbors keep forwarding toward it until the grace period ends, which slows down overall network convergence. This brings us to the final concept: nonstop routing (NSR).

NSR is an internal (vendor-specific) mechanism to extend the awareness of routing to the standby control plane so that in case of failover, the newly active control plane can take charge of the already established sessions.

BGP Graceful-Restart

The BGP Graceful-Restart (GR) feature allows a BGP speaker to express its ability to preserve forwarding state during a Border Gateway Protocol (BGP) restart or Route Processor (RP) switchover. In other words, it is a capability exchanged between BGP speakers to indicate their ability to perform Nonstop Forwarding (NSF). This helps minimize the impact on services caused by a BGP restart.

If the control plane could restart without impacting the data used by the data plane, it should be possible to install new software, replace certain pieces of hardware, and perform other such tasks without causing the network to lose routing through the router.

Beyond the local control plane and data plane in the router, however, we need to consider how the peers of this router will react if the local control plane, and the routing protocol processes along with it, are restarted.

Scenario 1:

  • Router C would learn two paths for every reachable destination within or through AS65500, with router A chosen as the best path for each route.
  • If the control plane on router A restarts, router C will continue forwarding traffic through router A until the BGP peering session fails, due to lost keepalives, for instance.
  • At this point, router C would recalculate the best path for each route in its local tables, redirecting traffic to router B.

Under normal circumstances, this is what we would want; in fact, if router A fails, we would want router C to detect this failure as quickly as possible, and switch all of its traffic to router B as soon as it can. But if router A is still capable of forwarding traffic, because its data plane is capable of retaining the information required to continue forwarding traffic (through NonStop Forwarding, or NSF), then the extra network disturbance of detecting the failure and recalculating paths is undesirable.

In some way, then, router A must keep router C from switching its best paths to router B. The most apparent way to accomplish this would be for router A to continue sending keepalives to router C while it is rebooting, or immediately on finishing its reload, so router C will continue to forward traffic through router A. However, BGP runs over TCP, and the TCP sequence numbers of the session are difficult (or impossible) to maintain across a restart, so the old session cannot simply be resumed.

There are two ways this problem can be addressed:

  1. By maintaining routing protocol state in some secondary memory within router A and referring to this state information while router A’s control plane is restarting.
  2. By introducing some signaling within the routing protocol, which allows router A’s routing process to recover the necessary state once it has restarted, including “I’ll be back” signaling, which indicates a router is simply restarting and not being removed from the network.

There are two changes to the BGP protocol that are used to support graceful restart between two peers:

  1. A new capability that is used to indicate if a BGP speaker supports graceful restart, or can support a peer that is restarting, along with other information about the restart process.
  2. An End-of-RIB marker, which indicates that a BGP speaker has finished sending all its local routing information. For IPv4 unicast, this is implemented as an UPDATE message with no withdrawn routes, no path attributes, and no NLRI.
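
As a rough illustration of the End-of-RIB marker, the sketch below builds and recognizes the IPv4 unicast form: a minimum-length (23-byte) UPDATE whose withdrawn-routes length and path-attribute length are both zero. The function names are made up for this example. Other address families signal End-of-RIB with an UPDATE that carries only an empty MP_UNREACH_NLRI attribute, which this sketch does not cover.

```python
import struct

BGP_MARKER = b"\xff" * 16   # 16-byte header marker, all ones
BGP_TYPE_UPDATE = 2         # BGP message type code for UPDATE


def build_ipv4_end_of_rib() -> bytes:
    """Build the minimum-length UPDATE used as the IPv4 unicast End-of-RIB marker."""
    # Body: withdrawn-routes length = 0, total path-attribute length = 0, no NLRI.
    body = struct.pack("!HH", 0, 0)
    length = len(BGP_MARKER) + 2 + 1 + len(body)  # 16 + 2 + 1 + 4 = 23
    return BGP_MARKER + struct.pack("!HB", length, BGP_TYPE_UPDATE) + body


def is_ipv4_end_of_rib(message: bytes) -> bool:
    """Return True if `message` is an UPDATE with nothing withdrawn and nothing advertised."""
    if len(message) != 23 or message[:16] != BGP_MARKER:
        return False
    length, msg_type, withdrawn_len, attr_len = struct.unpack("!HBHH", message[16:])
    return (length, msg_type, withdrawn_len, attr_len) == (23, BGP_TYPE_UPDATE, 0, 0)
```

A receiver would run a check like `is_ipv4_end_of_rib()` on each incoming UPDATE before normal route processing, since the marker carries no routes of its own.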

The new Graceful Restart capability includes the following:

  • The Restart State bit, indicating if this BGP speaker is currently restarting or not.
  • The Restart Timer, an indication of how long the speaker’s peer should maintain state before declaring the graceful restart a failure, and resetting the BGP session.
  • A series of address family identifiers (AFIs) and subsequent address family identifiers (SAFIs), one for each address family the peers are exchanging routing information for.
  • For each address family, a Forwarding State bit indicating if forwarding state has been preserved for this address family.
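
The fields listed above map onto a compact wire format. The sketch below encodes and decodes it, assuming the layout from RFC 4724: capability code 64, a 16-bit field holding the Restart State bit in its top bit and the restart time (in seconds) in its low 12 bits, then one 4-byte entry per address family with the Forwarding State bit as the top bit of the flags octet. The class and constant names are invented for this example, and only the capability itself is handled, not the surrounding OPEN message.

```python
import struct
from dataclasses import dataclass, field

GR_CAPABILITY_CODE = 64      # Graceful Restart capability code
RESTART_STATE_BIT = 0x8000   # "R" bit: top bit of the 16-bit flags/time field
RESTART_TIME_MASK = 0x0FFF   # restart time: low 12 bits, in seconds
FORWARDING_STATE_BIT = 0x80  # "F" bit: top bit of the per-family flags octet


@dataclass
class GracefulRestartCapability:
    restarting: bool
    restart_time: int
    # Maps (AFI, SAFI) -> True if forwarding state was preserved for that family.
    families: dict = field(default_factory=dict)

    def encode(self) -> bytes:
        if not 0 <= self.restart_time <= RESTART_TIME_MASK:
            raise ValueError("restart time must fit in 12 bits")
        head = self.restart_time | (RESTART_STATE_BIT if self.restarting else 0)
        value = struct.pack("!H", head)
        for (afi, safi), preserved in self.families.items():
            flags = FORWARDING_STATE_BIT if preserved else 0
            value += struct.pack("!HBB", afi, safi, flags)
        return struct.pack("!BB", GR_CAPABILITY_CODE, len(value)) + value

    @classmethod
    def decode(cls, data: bytes) -> "GracefulRestartCapability":
        code, length = struct.unpack("!BB", data[:2])
        if code != GR_CAPABILITY_CODE:
            raise ValueError("not a graceful restart capability")
        value = data[2:2 + length]
        (head,) = struct.unpack("!H", value[:2])
        families = {}
        for offset in range(2, len(value), 4):
            afi, safi, flags = struct.unpack("!HBB", value[offset:offset + 4])
            families[(afi, safi)] = bool(flags & FORWARDING_STATE_BIT)
        return cls(bool(head & RESTART_STATE_BIT), head & RESTART_TIME_MASK, families)
```

For example, `GracefulRestartCapability(True, 30, {(1, 1): True})` describes a speaker that has just restarted, expects to be back within 30 seconds, and preserved its IPv4 unicast forwarding state (AFI 1, SAFI 1).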

Graceful-Restart Process

The following steps illustrate the process used by a pair of BGP speakers when one of them restarts:

  1. Router A’s control plane restarts; the forwarding tables used by the data plane are preserved.
  2. When Router A’s control plane restarts, it marks any preserved routing or forwarding information as stale.
  3. Router A sends the graceful restart capability along with the BGP Open message. It sets the Restart bit in the graceful restart capability and includes a TLV for each address family it has preserved forwarding state for, with the Forwarding bit set on each one.
  4. When Router B receives this BGP Open message, it will examine its current open connections and realize that this Open message is from a BGP speaker it is already peered to. Router B will maintain all the information it has received from Router A, also maintaining forwarding state it built based on the received routing information.
    • Because Router A advertised that it supports graceful restart, Router B will:
      • Keep the routes from Router A in its RIB and FIB
      • Mark all routes learned from Router A as stale in the RIB
      • Continue forwarding traffic for those routes in the FIB
  5. Router B now starts its restart timer; if this timer expires before the peering relationship between Routers A and B is fully formed, Router B will reset the session.
  6. Router A’s control plane comes back up and sends a Route Refresh Request to Router B.
  7. Router B begins sending the contents of its BGP table to Router A.
  8. When Router B has finished sending the contents of its BGP table to Router A, it transmits an End-of-RIB marker, to let Router A know it has received all routing information.
  9. Router A, when it receives the End-of-RIB marker from Router B, runs BGP’s best path algorithm across the information it has received and installs the required routing information into the local routing tables, which should also update the local forwarding tables.
  10. Router A now removes any routing information that has not been refreshed by the route refresh.
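
Router B’s side of this exchange (keep the routes, mark them stale, run a restart timer) can be sketched as a small state holder. This is a simplified model under stated assumptions, not how any particular implementation works: the class and method names are hypothetical, time is passed in explicitly rather than read from a clock, and the prefixes are documentation addresses.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HelperPeerState:
    """Routes Router B holds from one graceful-restart-capable peer."""
    restart_time: float                          # seconds, from the peer's GR capability
    routes: dict = field(default_factory=dict)   # prefix -> next hop
    stale: set = field(default_factory=set)      # prefixes awaiting a refresh
    deadline: Optional[float] = None             # when the restart timer expires

    def learn(self, prefix, next_hop):
        """Install or refresh a route; a refreshed route is no longer stale."""
        self.routes[prefix] = next_hop
        self.stale.discard(prefix)

    def session_lost(self, now):
        """Peer is restarting: keep its routes, mark them stale, start the timer."""
        self.stale = set(self.routes)
        self.deadline = now + self.restart_time

    def session_reestablished(self):
        """Peer came back in time, so stop the restart timer."""
        self.deadline = None

    def end_of_rib(self):
        """Peer finished re-sending its routes: drop whatever was not refreshed."""
        return self._flush_stale()

    def tick(self, now):
        """Give up on the peer if the restart timer has expired."""
        if self.deadline is not None and now >= self.deadline:
            self.deadline = None
            return self._flush_stale()
        return []

    def _flush_stale(self):
        flushed = sorted(self.stale)
        for prefix in flushed:
            del self.routes[prefix]
        self.stale.clear()
        return flushed
```

Calling `session_lost(now=100)` on a peer with a 30-second restart time leaves its routes in place until either `end_of_rib()` runs after the session returns or `tick()` is called with a time of 130 or later.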

Scenario 2

Graceful Restart Flow Example:

  • All three routers exchange OPEN messages.
    • GR Capability = “I support graceful restart”
      • Flags:
        • R = 0 “I have not restarted”
        • Restart Time = 30 seconds “If I restart in the future, I expect to be UP in 30 seconds”
        • AFI/SAFI = IPv4 Unicast “I support GR for IPv4-Unicast”
        • F = 0 “Relevant only after a restart”
  • R1 advertises some prefixes: 1.1.0.0/16 and 2.2.0.0/16.
  • R2 receives updates and installs entries in its RIB/FIB.
  • R2 propagates the BGP updates received to R3.
  • R3 installs entries in its RIB/FIB.

R2’s Control Plane Restarts

Forwarding plane on R2

  • Keeps forwarding traffic using the routes installed in the FIB to 1.1.0.0/16 and 2.2.0.0/16.
  • Marks the routes in FIB as stale.

R3 notices that the BGP session with R2 goes down (detected by BFD or by hold-timer expiry, for example)

  • Normally, R3 would remove the BGP routes received from R2 (namely 1.1.0.0/16 and 2.2.0.0/16) from its RIB and FIB.

However, because R2 advertised that it supports graceful restart, R3 will:

  • Keep the routes from R2 in its RIB/FIB
  • Mark all routes in the RIB as stale
  • Continue forwarding traffic for those routes in the FIB

Topology Change

  • While R2 is down, something changes in the topology of the network. Let’s say R1 stops advertising prefix 1.1.0.0/16.
  • R1 withdraws route 1.1.0.0/16 from its RIB/FIB, and sends a BGP withdraw to all of its neighbors.
  • However, R1 cannot send a withdraw message to R2 because the R1-R2 BGP session is down at this point. We will recover from this “getting out of sync” problem later on.

R2’s Control Plane comes back up

  • We assume it took less than 30 seconds for the control plane to come back up.
    • If it took longer, R3 would have “given up” and flushed the routes learned from R2 from its RIB/FIB.
  • The BGP sessions come back up:
    • GR capability “I support graceful restart”
    • Flags:
      • R = 1 “I did restart”
      • Restart Time = 30 “If I restart in the future, I expect to be done in 30 seconds”
      • AFI SAFI = IPv4 Unicast “I support GR for IPv4-Unicast”
      • F = 1 “** I DID PRESERVE FORWARDING STATE IN THE FIB **”

Re-synchronization

  • R2 knows it restarted, so it is NOT going to send any updates until it has received all updates from its neighbors (R1 and R3), selected the best routes, and updated its RIB/FIB.
  • R1 originates prefixes 2.2.0.0/16 (but not 1.1.0.0/16 anymore due to the topology change).
    • R2 installs 2.2.0.0/16 in its RIB/FIB.
    • R2 already had an entry for 2.2.0.0/16 in its FIB which was marked stale. This stale marking is now removed; it is fresh again.
    • R2 does not receive a 1.1.0.0/16 update from R1. Hence, R2 does not have an entry for 1.1.0.0/16 in its RIB. But R2 does still have an entry for 1.1.0.0/16 in its FIB, which remains marked stale.
    • R1 has finished sending all routes to R2, so it sends an End-of-RIB marker to R2.

At this point, R2 has received an End-of-RIB marker from R1, but not yet from R3. So, it does not yet take any action (it needs to have received an End-of-RIB marker from all neighbors).

  • R3 does not have any prefixes to send to R2, so it immediately sends an End-of-RIB marker.

At this point, R2 has received End-of-RIB markers from all of its neighbors (R1 and R3), so it will take the following actions:

  • R2 will run the best route selection process for every destination prefix in its BGP table (in this example only 2.2.0.0/16)
  • R2 will install the selected best route for every prefix in the RIB into the FIB (only 2.2.0.0/16)
  • R2 will flush any remaining stale routes from the FIB (in this case 1.1.0.0/16)
  • R2 will start sending updates to advertise the routes in its RIB to the neighbors.
  • R2 propagates the BGP updates received from R1, to R3.
    • R2 has finished sending all routes to R3, so it sends an End-of-RIB marker to R3.
    • Note that R2 does not have routes to send to R1 (specifically it does not send the route for 2.2.0.0/16 back to R1 because of the AS-path loop). So, R2 immediately sends an End-of-RIB marker to R1 as well.
    • When R3 receives the End-of-RIB marker from R2, it flushes all routes learned from R2 that are still marked stale (in this case 1.1.0.0/16) from both its RIB and FIB.
    • R1 does the same when it receives the End-of-RIB marker from R2, but in this example there is nothing to flush since R2 did not advertise any routes to R1.
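
R2’s side of the re-synchronization can be modeled in a few lines. The sketch below is a toy under stated assumptions: the class and method names are hypothetical, and best-path selection is replaced by a placeholder that picks the lowest peer name. What it does show is the ordering described above: stale marks are cleared as updates arrive, and nothing is installed, flushed, or advertised until every neighbor has sent its End-of-RIB marker.

```python
class RestartingSpeaker:
    """Toy model of R2 after its control plane comes back up."""

    def __init__(self, peers, preserved_fib):
        self.awaiting_eor = set(peers)     # peers that still owe an End-of-RIB
        self.paths = {}                    # prefix -> set of peers advertising it
        self.fib = dict(preserved_fib)     # prefix -> next hop, kept across the restart
        self.stale = set(preserved_fib)    # every preserved entry starts out stale

    def receive_update(self, peer, prefix):
        """Record a path and clear the stale mark on the matching FIB entry."""
        self.paths.setdefault(prefix, set()).add(peer)
        self.stale.discard(prefix)

    def receive_end_of_rib(self, peer):
        """Return the prefixes to advertise once all peers are done, else None."""
        self.awaiting_eor.discard(peer)
        if self.awaiting_eor:
            return None                    # keep deferring route selection
        for prefix, peers in self.paths.items():
            # Placeholder for best-path selection: pick the lowest peer name.
            self.fib[prefix] = min(peers)
        for prefix in self.stale:          # anything still stale was never refreshed
            del self.fib[prefix]
        self.stale.clear()
        return sorted(self.paths)


# Replaying the example: R2 preserved both prefixes, but R1 re-sends only one.
r2 = RestartingSpeaker(
    peers={"R1", "R3"},
    preserved_fib={"1.1.0.0/16": "R1", "2.2.0.0/16": "R1"},
)
r2.receive_update("R1", "2.2.0.0/16")
first = r2.receive_end_of_rib("R1")    # still waiting on R3
second = r2.receive_end_of_rib("R3")   # now R2 selects, installs, and flushes
```

After the second call, `r2.fib` holds only 2.2.0.0/16, and that prefix is what R2 goes on to advertise to R3.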

GR Deployment Considerations

When deploying graceful restart for any routing protocol, there are two issues you need to keep in mind:

  • the impact of partial deployments, and
  • the interactions between BGP and the underlying IGP if both are not capable of and configured for graceful restart.

In this network, we assume router D is not graceful restart capable, or it is not configured to respond to peers gracefully restarting. When B’s control plane restarts, it signals router A so A doesn’t reset its peering session and continues forwarding through router B. However, router D doesn’t recognize this signaling, and it resets its session with B.

Instead, it reconverges on the path through router C as the best path and drops traffic along the path until the reconvergence is complete. In this case, then, the path between B and D will become asymmetric; in other cases, it’s possible to form a routing loop. It’s also possible that router D will begin rejecting the traffic forwarded by router B because its unicast reverse path forwarding check will fail.

The solution to this problem is to make certain D is graceful restart capable, even if it’s not configured to restart gracefully on failure (or isn’t capable of it). Router D must be able to understand and respond correctly to router B’s signals during a graceful restart to prevent network problems from developing.

If BGP is not capable of (or isn’t configured for) graceful restart, but the underlying interior gateway protocol is, some amount of traffic can be dropped while BGP is reconverging after a control plane restart.

  • Assume router A has chosen the routes learned through B as its best paths.
  • When router B restarts, A will reset its BGP peering session with B, but it will continue learning the same information through C. In fact, C will be setting the next hop on the BGP routes it is learning to the same next hop as B did before it reset.
  • Router A will not, however, reset its OSPF adjacency with B. Since OSPF is configured for graceful restart, it will believe that all paths reachable before the restart are still reachable through B, including the next hop for the routes it is learning through router C.
  • So router A will continue forwarding packets through router B, because router B is still the best path to the destinations learned through router C, based on the interior gateway protocol cost to the next hop, which is D. However, router B, when it receives these packets, may not have the forwarding information needed to forward them.
  • Since the BGP process on router B has reset, it’s likely that all the BGP learned routing information in the forwarding tables has been discarded, even though the OSPF learned forwarding information has been retained.
  • Router B, then, will drop all the BGP traffic forwarded along this path by router A. To resolve this problem, always make certain BGP and the underlying interior gateway protocols are both capable of and configured for graceful restart.
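
Both deployment issues come down to mismatches that can be checked ahead of time. The sketch below assumes you can assemble a small inventory of which routers have graceful restart enabled for BGP and for the IGP; the data model and function name are hypothetical and not tied to any vendor tool. It also collapses “capable of” and “configured for” into a single flag per protocol, which is coarser than the distinction drawn above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouterGrSupport:
    name: str
    bgp_gr: bool   # BGP graceful restart available and enabled
    igp_gr: bool   # IGP graceful restart available and enabled


def audit_graceful_restart(routers, bgp_sessions):
    """Return human-readable findings for the two pitfalls described above."""
    findings = []
    by_name = {router.name: router for router in routers}
    # Pitfall 2: BGP and the IGP disagree on a single router.
    for router in routers:
        if router.bgp_gr != router.igp_gr:
            findings.append(
                f"{router.name}: BGP and IGP graceful restart settings differ"
            )
    # Pitfall 1: partial deployment across a BGP session.
    for a, b in bgp_sessions:
        if by_name[a].bgp_gr != by_name[b].bgp_gr:
            findings.append(
                f"{a}-{b}: only one side of the BGP session supports graceful restart"
            )
    return findings
```

Run against the partial-deployment example (A and B enabled, D not, with sessions A-B and B-D), the audit would flag only the B-D session.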
