High Availability

In networking, availability refers to the operational uptime of the network. The aim of high availability is to achieve continuous network uptime by designing a network to avoid single points of failure, incorporate deterministic network patterns, and utilize event-driven failure detection to provide fast network convergence.

Network Convergence Overview

Network convergence is the time required to redirect traffic around a failure that causes loss of connectivity (LoC). The latency requirements for convergence vary by application. For example, an Open Shortest Path First (OSPF) network using default settings may take 5 or more seconds to converge around a link failure. This length of time may be acceptable for users reading Internet website articles, but it is completely unacceptable for IP telephony users.

A number of factors influence network convergence speed. In general, the higher the number of network prefixes and routers in the network, the slower the convergence will be. The primary factors that influence network convergence are as follows:

  • T1: Time to detect the failure event
  • T2: Time to propagate the event to neighbors
  • T3: Time to process the event and calculate new best path
  • T4: Time to update the routing table and program forwarding tables
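
As a rough illustration of how the four factors above combine, the sketch below treats them as sequential phases and sums them into a single convergence budget. The class name, field names, and sample values are hypothetical and are not measurements from any platform; real convergence phases can overlap.

```python
from dataclasses import dataclass


@dataclass
class ConvergenceBudget:
    """Per-phase convergence times in seconds (illustrative values only)."""
    detect: float     # T1: detect the failure event
    propagate: float  # T2: propagate the event to neighbors
    compute: float    # T3: process the event and calculate the new best path
    program: float    # T4: update the routing table and program forwarding tables

    def total(self) -> float:
        # Simplifying assumption: the phases run one after another.
        return self.detect + self.propagate + self.compute + self.program


def meets_target(budget: ConvergenceBudget, target_seconds: float) -> bool:
    """Return True when the summed phases fit within the application's target."""
    return budget.total() <= target_seconds
```

Under these assumptions, a budget dominated by a multi-second T1 would fail a sub-second target no matter how quickly the remaining phases complete, which is why failure detection is usually the first factor to tune.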

Continuous Forwarding

Routers specifically designed for high availability include hardware redundancy, such as dual power supplies and route processors (RPs). An RP, which is also called a supervisor on some platforms, is responsible for learning the network topology and building the route table (Routing Information Base [RIB]). An RP failure can trigger routing protocol adjacencies to reset, resulting in packet loss and network instability. During an RP failure, it is often preferable to hide the failure and let the router keep forwarding packets using the previously programmed CEF table entries, rather than drop packets while the secondary RP reestablishes the routing protocol adjacencies and rebuilds the forwarding table.

The following two high availability features allow the network to route through a failure during an RP switchover:

  • Stateful switchover (SSO) with nonstop forwarding (NSF)
  • Stateful switchover (SSO) with nonstop routing (NSR)

Stateful Switchover (SSO)

Stateful switchover (SSO) is a redundancy feature that allows a Cisco router with two RPs to synchronize router configuration and control plane state information.

  • The process of mirroring information between RPs is referred to as checkpointing.
  • SSO-enabled routers always checkpoint line card operation and Layer 2 protocol states.
  • During a switchover, the standby RP immediately takes control and will prevent problems such as interface link flaps and router reloads; however, Layer 3 packet forwarding is disrupted without additional configuration. The standby RP does not have any Layer 3 checkpoint information about the routing peer, so a switchover will trigger a routing protocol adjacency flap that clears the route table. After the route table is cleared, the CEF entries are purged, and traffic is no longer routed until the network topology is relearned and the forwarding table is reprogrammed.
  • Enabling the NSF or NSR high availability capabilities instructs the router to maintain the CEF entries for a short duration and continue forwarding packets through an RP failure until the control plane recovers.
  • SSO requires that both RPs have the same software version.
  • The feature is enabled by default on all IOS XR routers.
  • The IOS and IOS XR command show redundancy provides details on the current SSO state operation.
  • Manually triggering a switchover between route processors is performed with the command redundancy force-switchover on IOS routers and with the command redundancy switchover on IOS XR nodes.

Configuration / Verification

				
					// IOS - enabling stateful switchover using the redundancy mode command
redundancy
 mode sso
				
			

Nonstop Forwarding and Graceful Restart (NSF/GR)

Nonstop forwarding (NSF) is a feature deployed along with SSO to protect the Layer 3 forwarding plane during an RP switchover. With NSF enabled, the router continues to forward packets using the stored entries in the FIB table.

There are three categories of NSF routers:

  • NSF-capable router: A router that has dual RPs and is manually configured to use NSF to preserve the forwarding table through a switchover. The router restarts the routing process upon completion of the RP switchover.
  • NSF-aware router: A neighbor router that assists the NSF-capable router during the restart by preserving the routes and adjacency state during the RP switchover. An NSF-aware router does not require dual RPs.
  • NSF-unaware router: A router that is not aware of or capable of assisting a neighboring router during an RP switchover.

SSO with NSF is a high availability feature that is part of the internal router operation. Graceful restart (GR) is a subcomponent of NSF and is the mechanism the routing protocols use to signal NSF capabilities and awareness.

The GR signaling mechanism differs slightly for each protocol, but the general concept is the same for all.

Example:

  • R1 has NSF with SSO enabled.
  • The primary RP has failed, but the router continues to forward packets using the existing CEF tables (FIB).
  • During this time, the backup RP transparently takes over and reestablishes communication with R2 to restore the control plane and repopulate the routing tables (RIB).
  • Throughout this grace period, R2 does not notify the rest of the network that a failure has occurred on R1, which maintains stability in the network and prevents a networkwide topology change event.

[Figure: SSO with NSF]

NSF Capability Exchange Process

  1. R1 signals to R2 that it is NSF/SSO-capable while forming the initial routing adjacency. The two agree that if R1 signals a control plane reset, R2 will not drop the peering and will continue sending traffic to and receiving traffic from R1 as long as the routing protocol hold timers do not expire.
  2. R1 sends a GR message to R2 indicating that the control plane is temporarily going offline immediately preceding the RP failover.
  3. R1 maintains the CEF table programming to forward traffic to R2 while the RP switchover takes place.
  4. Upon completion of the switchover, the new primary RP on R1 reestablishes communication with R2 and requests updates for repopulating the route table.
  5. R2 provides the route information to R1, while at the same time suppressing a notification to the rest of the network that an adjacency flap has occurred to R1. Stability is improved by preventing an unnecessary networkwide best path recalculation for the route flap.

[Figure: NSF Graceful Restart Relationship Building Process]

Note: NSF freezes the CEF table and allows the router to forward packets to the last known good next-hop from prior to the RP switchover. If the network topology changes while the router is recovering, the packets may be suboptimally routed or possibly sent to the wrong destination and dropped.

NSF should not be deployed in parallel with routing protocol keepalive and holddown timers of less than 4 seconds. The NSF-capable router requires time to reestablish control plane communication during an RP switchover, and aggressive holddown timers can expire before this activity completes, leading to a neighbor adjacency flap. Bidirectional forwarding detection (BFD) is a better solution in most scenarios than aggressive keepalive timers.
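
The timer guidance above can be expressed as a simple pre-deployment check. The following is a minimal sketch: the function name, the dictionary keys, and the idea of auditing timers from a dictionary are assumptions made for illustration, and only the 4-second threshold comes from the text.

```python
from typing import Dict, List

# Threshold from the guidance above: timers under 4 seconds are risky with NSF.
NSF_MIN_TIMER_SECONDS = 4.0


def risky_timers(timers: Dict[str, float]) -> List[str]:
    """Return the names of timers that are too aggressive to combine with NSF.

    `timers` maps a hypothetical timer label (for example "ospf_dead") to its
    configured value in seconds.
    """
    return [name for name, seconds in timers.items()
            if seconds < NSF_MIN_TIMER_SECONDS]
```

Any timer this check flags is a candidate for relaxing the timer and relying on BFD for fast failure detection instead.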

Topology

The interior routing protocols OSPF, IS-IS, and EIGRP are automatically NSF-aware on both IOS and IOS XR. Routers with dual RPs need to be configured as NSF-capable within the routing protocol.

NSF Routers

OSPF

Two GR configuration modes are available for OSPF:

  • Cisco: The Cisco mode for performing GR adds Link Local Signaling (LLS) bits to the hello and DBD packets.
    • The LSDB Resynchronization (LR) bit is included in the database description (DBD) packets to indicate out-of-band resynchronization (OOB) capability.
    • A hello packet with the LLS Restarting R bit set indicates that the router is about to perform a restart. This method was developed by Cisco and was later standardized by the IETF in RFC 4811, 4812, and 5613.
  • IETF: The IETF RFC 3623 method for performing GR uses link-local opaque LSAs.
    • A router sends a Grace LSA to indicate it is about to restart the OSPF process. The router resynchronizes the LSDB using Grace LSAs once the restart completes.

Configuration

				
					// IOS
router ospf 100
 nsf cisco
				
			
				
					// IOS XR
router ospf 100
 nsf cisco
				
			

Verification

NSF Graceful Restart Agreement

Demonstrates that the neighbors are using LLS, which is required for NSF awareness and successful GR negotiations. Notice that the neighbor adjacency peering includes the LLS LSDB OOB Resynchronization (LR) capability bit set.

OSPF Graceful Restart Initiated

  • Demonstrates that a GR has just taken place on the neighboring router R1.
  • Notice that R2 and XR3 do not terminate the adjacency session because of the previously negotiated GR agreement.
  • During a GR event, IOS routers display the OOB resynchronization countdown timer for the recovery.
    • If R1 does not respond by the end of the timer, R2 and XR3 will consider the connection down.
    • IOS XR routers also use the OOB-Resync timer to track the state of the neighbor adjacency.

OSPF Graceful Restart Completed

  • Demonstrates that a GR completed successfully 11 seconds ago.
  • R1 recovered, and the neighbor adjacency was not reset on the R2 and XR3 side of the connection per the GR agreement.

Note: During the GR, the NSF-aware router does not clear the route table entries or the neighbor adjacency, so the age of the routes predates the GR event as if nothing happened. The NSF-capable router performing the SSO restarts the routing process, so the ages of the neighbor adjacency and the route table entries are reset to zero on the restarting router.

EIGRP

EIGRP includes a Restart (RS) bit that allows for the signaling of a GR.

  • When the RS bit is set to 1, the neighboring NSF-aware router knows that an RP switchover is about to take place on the NSF-capable router.
  • Once the SSO event completes, the two routers synchronize route tables with the RS bit still enabled.
    • The NSF-aware router sends an End-of-Table (EOT) signal to indicate that it has provided all the updates, at which point the two routers clear the RS bit from the EIGRP packets.

Configuration

				
					// IOS (Classic AS Configuration)
router eigrp 100
 nsf
				
			
				
					// IOS (Named Mode Configuration)
router eigrp LAB
 address-family ipv4 unicast autonomous-system 100
   nsf
				
			

Verification

Displays the EIGRP neighbor status on an NSF-aware router. The highlighted output indicates the neighbor has been up for 50 minutes but that an RP switchover occurred on R1 52 seconds ago.

BGP

  • RFC 4724 describes BGP GR signaling.
  • Enabling the feature modifies the initial BGP open message to include GR capability code 64.
  • The GR capability informs the neighboring router that it should not reset the BGP session and immediately purge the routes when it performs an SSO.
  • During an RP SSO event, the TCP connection used to form the BGP session is reset, but the routes in the RIB are not immediately purged.
  • The BGP NSF-aware router detects that the BGP TCP socket has cleared, marks the old routes stale, and begins a countdown timer, while at the same time continuing to forward traffic using the route table information from prior to the reset.
  • Once the NSF-capable router recovers, it forms a new TCP session and sends a new GR message notifying the NSF-aware router that it has restarted.
  • The two routers exchange updates until the NSF-capable router signals the end-of-RIB (EOR) message.
  • The NSF-aware router clears the stale countdown timer and removes any stale entries that are no longer present.
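
The stale-route handling described in the steps above can be sketched as a small model of the NSF-aware (helper) router. This is a toy illustration, not router code: the class and method names are hypothetical, and timers are reduced to an explicit "expired" event rather than real clocks.

```python
from typing import Dict, Set


class GracefulRestartHelper:
    """Toy model of the NSF-aware side of a BGP graceful restart."""

    def __init__(self, routes: Dict[str, str]) -> None:
        self.routes = dict(routes)    # prefix -> next hop learned from the peer
        self.stale: Set[str] = set()  # prefixes currently marked stale

    def on_session_reset(self) -> None:
        # TCP session cleared: keep the routes for forwarding, mark all stale.
        self.stale = set(self.routes)

    def on_update(self, prefix: str, next_hop: str) -> None:
        # A prefix re-advertised after the restart is refreshed, no longer stale.
        self.routes[prefix] = next_hop
        self.stale.discard(prefix)

    def on_end_of_rib(self) -> None:
        # EOR received: remove entries the restarting peer did not re-advertise.
        for prefix in self.stale:
            del self.routes[prefix]
        self.stale.clear()

    def on_stale_timer_expired(self) -> None:
        # Modeled as the same purge when the countdown ends without an EOR.
        self.on_end_of_rib()
```

In this model, forwarding is never interrupted for prefixes that the restarting peer advertises again; only routes that disappear across the restart are withdrawn, and only after the EOR or the timer expiry.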

Note: An RP failure will cause the BGP session to temporarily reset on both sides of the connection. The session uptime will correlate to the RP failure. On the restarting router (NSF-capable router), the route table entries will be purged, but on the NSF-aware router, the learned routes will remain unchanged with an age that precedes the GR.

Unlike the other routing protocols, BGP routers are not NSF-aware or NSF-capable by default. The GR capability requires manual configuration on both sides of the session.

Enabling BGP NSF

Step 1. Enable GR and NSF awareness

GR support is enabled for all peers:

  • [IOS] bgp graceful-restart
  • [IOS XR] graceful-restart

To selectively enable GR per peer:

  • [IOS] neighbor ip-address ha-mode graceful-restart
  • [IOS XR] graceful-restart [disable]

Step 2. Set the restart time (optional)

The GR restart time determines how long the router will wait for the restarting router to send an open message before declaring the neighbor down and resetting the session.

  • [IOS] bgp graceful-restart restart-time seconds
  • [IOS XR] graceful-restart restart-time seconds

The default value is 120 seconds.

Step 3. Set stale route timeouts (optional)

The stale routes timeout determines how long the router will wait for the end-of-RIB (EOR) message from the restarting neighbor before purging routes.

  • [IOS] bgp graceful-restart stalepath-time seconds
  • [IOS XR] graceful-restart stalepath-time seconds

The default value is 360 seconds.

Note: Enabling GR capabilities on an IOS router after the session has already been established will trigger the session to renegotiate and will result in an immediate session flap. In IOS XR, enabling GR is nondisruptive. The session will continue working with the previous setting until the session is manually reset and the capability is negotiated.

Configuration

				
					// R1
router bgp 65001
 bgp graceful-restart restart-time 120
 bgp graceful-restart stalepath-time 360
 neighbor 10.0.13.3 ha-mode graceful-restart
 
// XR3
router bgp 65003
 bgp graceful-restart restart-time 120
 bgp graceful-restart stalepath-time 360
 bgp graceful-restart
				
			

Verification

The command show bgp ipv4 unicast neighbor shows whether GR capabilities have been negotiated. Here it is used to verify that routers R1 and XR3 have successfully negotiated NSF capabilities with each other.

It is common to have BGP and IGP routing protocols on the same routers. To avoid suboptimal routing during an RP failover, the protocols should have matching NSF capabilities configured.

Nonstop Routing (NSR)

Nonstop routing (NSR) is an internal Cisco router feature that does not use a GR mechanism to signal to neighboring routers that an RP switchover has taken place.

Instead, the primary RP is responsible for constantly transferring all relevant routing control plane information to the backup RP, including routing adjacency and TCP sockets.

During a failure, the new RP uses the “checkpoint” state information to maintain the routing adjacencies and recalculate the route table without alerting the neighboring router that a switchover has occurred.

Example:

  • Figure demonstrates an RP switchover on R1, which has SSO with NSR enabled.
  • The routing protocol peering between R1 and R2 is unaffected by the RP failure.
  • R2 is unaware that a failure has occurred on R1.
  • Throughout the entire RP failover process, the traffic continues to flow between the two routers unimpeded using the Cisco Express Forwarding (CEF) forwarding table.

[Figure: SSO with NSR]

NSR’s primary benefit over NSF is that it is a completely self-contained high availability solution. There is no disruption to the routing protocol interaction, so the neighboring router does not need to be NSR- or NSF-aware.

NSR Routing Protocol Interaction

OSPF

				
					// IOS
router ospf 100
 nsr
				
			
				
					// IOS XR
router ospf 100
 nsr
				
			

The IOS command to verify whether NSR is operational is show ip ospf nsr, and the IOS XR command is show redundancy.

IS-IS

				
					// IOS
router isis LAB
 nsf cisco
				
			
				
					// IOS XR
router isis LAB
 nsf cisco
				
			

BGP

  • Per peer basis:
    • [IOS] neighbor ip-address ha-mode sso
  • All neighbor sessions:
    • [IOS XR] nsr

Note: NSR and NSF/GR can be configured at the same time. Typically, NSR will take precedence over NSF. However, when deployed together with BGP on IOS, the router will attempt to use the NSF GR method over NSR. The IOS command neighbor ip-address ha-mode graceful-restart disable ensures that NSR is the active high availability feature used for the peering.

IOS XR routers give NSR preference over NSF GR when the two features are deployed in unison.

Example:

  • Demonstrates how to configure NSR for BGP.
  • GR has been globally enabled within the BGP process.
  • The GR capability has been disabled for the specific peer to ensure that SSO with NSR is employed.
				
					// R1
router bgp 65001
 bgp graceful-restart restart-time 120
 bgp graceful-restart stalepath-time 360
 bgp graceful-restart
 neighbor 10.0.13.3 ha-mode sso
 neighbor 10.0.13.3 ha-mode graceful-restart disable

// XR3
router bgp 65003
 nsr
				
			

The IOS command show bgp ipv4 unicast sso summary or the IOS XR command show bgp ipv4 unicast sso summary may be used to verify BGP NSR operational status.

NSF and NSR Together

Routing protocols may use NSF and NSR together at the same time. NSR takes precedence for the IGP routing protocols, and NSF will be used as a fallback option where NSR recovery is not possible. For example, NSR does not support process restarts, and therefore there are benefits to deploying the two high availability features in tandem.

The IOS XR global command nsr process-failures switchover may be used when NSF is not enabled to force an RP failover when a routing process restarts.
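
The precedence described in this section for IGP routing protocols can be summarized as a small decision function. This is a simplified sketch: the function name, the event labels, and the "adjacency_reset" outcome are hypothetical, and it does not model the nsr process-failures switchover option or the BGP-on-IOS exception noted earlier.

```python
def recovery_mechanism(event: str, nsr_enabled: bool, nsf_enabled: bool) -> str:
    """Pick the high availability feature that handles a control plane event.

    `event` is either "rp_switchover" or "process_restart" (hypothetical labels).
    """
    # NSR takes precedence, but it does not cover routing process restarts.
    if nsr_enabled and event == "rp_switchover":
        return "nsr"
    # NSF/GR is the fallback wherever NSR recovery is not possible.
    if nsf_enabled:
        return "nsf"
    # Neither feature can help: the routing adjacency resets.
    return "adjacency_reset"
```

For example, with both features enabled, an RP switchover is handled by NSR while a routing process restart falls back to NSF, which is the benefit of deploying the two in tandem.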
