  • Packet-Forwarding Process

    When troubleshooting connectivity issues for an IP-based network, the network layer (Layer 3) of the OSI reference model is often an appropriate place to begin your troubleshooting efforts (divide-and-conquer method).

    For example, if you are experiencing connectivity issues between two hosts on a network, you could check Layer 3 by pinging between the hosts. If the pings are successful, you can conclude that the issue resides at upper layers of the OSI reference model (Layers 4 through 7). However, if the pings fail, you should focus your troubleshooting efforts on Layers 1 through 3. If you ultimately determine that there is a problem at Layer 3, your efforts might be centered on the packet-forwarding process of a router.

    Layer 3 Packet-Forwarding Process

    • PC1 needs to access HTTP resources on Server1.
    • Notice that PC1 and Server1 are on different networks.
    • So how does a packet from source IP address 192.168.1.2 get routed to destination IP address 192.168.3.2?

    Step 1.

    • PC1 compares its IP address and subnet mask 192.168.1.2/24 with the destination IP address 192.168.3.2.
      • PC1 determines the network portion of its own IP address.
      • It then compares these binary bits with the same binary bits of the destination address. If they are the same, it knows the destination is on the same subnet. If they differ, it knows the destination is on a remote subnet.
      • PC1 concludes that the destination IP address resides on a remote subnet. Therefore, PC1 needs to send the frame to its default gateway, which could have been manually configured on PC1 or dynamically learned via DHCP.
    • PC1 has the default gateway address 192.168.1.1 (that is, R1). To construct a proper Layer 2 frame, PC1 needs the MAC address of the frame’s destination, which is PC1’s default gateway. If the MAC address is not in PC1’s ARP cache, PC1 uses ARP to discover it.
      • Once PC1 receives an ARP reply from R1, PC1 adds R1’s MAC address to its ARP cache. PC1 then sends its data destined for Server1 in a frame addressed to R1.
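
    The local-versus-remote decision PC1 makes in Step 1 can be sketched with Python's standard ipaddress module. This is only an illustration of the comparison; the function name is_on_local_subnet is an assumption made for this example and is not part of any host's IP stack.

```python
import ipaddress

def is_on_local_subnet(interface_cidr: str, destination: str) -> bool:
    """Return True if destination lies inside the sender's own subnet."""
    # ip_interface keeps both the host address and its network (address AND mask).
    interface = ipaddress.ip_interface(interface_cidr)
    return ipaddress.ip_address(destination) in interface.network

# PC1 (192.168.1.2/24) checks Server1, then its own default gateway.
print(is_on_local_subnet("192.168.1.2/24", "192.168.3.2"))  # remote: use the default gateway
print(is_on_local_subnet("192.168.1.2/24", "192.168.1.1"))  # local: ARP for it directly
```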

    Step 2.

    • R1 receives the frame sent from PC1, and because the destination MAC address is R1’s, R1 tears off the Layer 2 header and interrogates the IP (Layer 3) header.
      • An IP header contains a time-to-live (TTL) field, which is decremented by one at each router hop. Therefore, R1 decrements the packet’s TTL field. If the value in the TTL field is reduced to zero, the router discards the packet and sends a time-exceeded Internet Control Message Protocol (ICMP) message back to the source.
      • Assuming that the TTL is not decremented to zero, R1 checks its routing table to determine the best path to reach the IP address 192.168.3.2.
    • R1’s routing table has an entry stating that network 192.168.3.0/24 is accessible through interface Serial 1/1.
      • Note that ARP is not required for serial interfaces because these interface types do not have MAC addresses. Therefore, R1 forwards the frame out its Serial 1/1 interface, using the Point-to-Point Protocol (PPP) Layer 2 framing header.
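
    The TTL handling described in Step 2 could be modeled roughly as follows. The Packet class and the process_ttl function are hypothetical names used only for this sketch; a real router performs this check in its forwarding path, not in Python.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    src: str
    dst: str
    ttl: int

def process_ttl(packet: Packet):
    """Decrement the TTL and decide what the router does next.

    Returns ("forward", packet) when the packet may continue, or
    ("time-exceeded", source_address) when it must be discarded and an
    ICMP time-exceeded message sent back to the source.
    """
    packet.ttl -= 1
    if packet.ttl <= 0:
        return ("time-exceeded", packet.src)
    return ("forward", packet)

action, detail = process_ttl(Packet("192.168.1.2", "192.168.3.2", ttl=64))
print(action, detail.ttl)
```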

    Step 3.

    • When R2 receives the frame, it removes the PPP header and then decrements the TTL in the IP header, just as R1 did.
      • Again, assuming that the TTL did not get decremented to zero, R2 interrogates the IP header to determine the destination network.
    • In this case, the destination network 192.168.3.0/24 is directly attached to R2’s Fast Ethernet 0/0 interface.
      • Much the way PC1 sent out an ARP request to determine the MAC address of its default gateway, R2 sends an ARP request to determine the MAC address of Server1 if it is not already known in the ARP cache.
      • Once an ARP reply is received from Server1, R2 stores the results of the ARP reply in the ARP cache and forwards the frame out its Fast Ethernet 0/0 interface to Server1.

    Router Data Structures

    The previous steps identified two router data structures:

    1. IP routing table: When a router needs to route an IP packet, it consults its IP routing table to find the best match. The best match is the route that has the longest prefix.
      • For example, suppose that a router has routing entries for networks 10.0.0.0/8, 10.1.1.0/24, and 10.1.1.0/26. Also, suppose that the router is trying to forward a packet with the destination IP address 10.1.1.10. The router selects the 10.1.1.0/26 route entry as the best match for 10.1.1.10 because that route entry has the longest prefix, /26 (so it matches the greatest number of bits).
    2. Layer 3-to-Layer 2 mapping table: R2’s ARP cache contains Layer 3-to-Layer 2 mapping information. Specifically, the ARP cache has a mapping that says MAC address 2222.2222.2222 corresponds to IP address 192.168.3.2.
      • An ARP cache is the Layer 3-to-Layer 2 mapping data structure used for Ethernet-based networks, but similar data structures are used for Multipoint Frame Relay networks and Dynamic Multipoint Virtual Private Network (DMVPN) networks.
      • However, for point-to-point links such as PPP or High-Level Data Link Control (HDLC), because there is only one other possible device connected to the other end of the link, no mapping information is needed to determine the next-hop device.
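
    The longest-prefix rule from the routing table example can be sketched in a few lines of Python. The function name longest_prefix_match is an assumption for this example, and the linear scan is chosen for readability; production routers use trie-like structures rather than a list search.

```python
import ipaddress

def longest_prefix_match(routes, destination):
    """Return the matching route with the longest prefix, or None if nothing matches."""
    dest = ipaddress.ip_address(destination)
    matches = [
        network
        for network in (ipaddress.ip_network(route) for route in routes)
        if dest in network
    ]
    if not matches:
        return None
    return str(max(matches, key=lambda network: network.prefixlen))

routes = ["10.0.0.0/8", "10.1.1.0/24", "10.1.1.0/26"]
print(longest_prefix_match(routes, "10.1.1.10"))   # falls inside all three; /26 wins
print(longest_prefix_match(routes, "10.1.1.200"))  # outside the /26, so the /24 wins
```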

    Continually querying a router’s routing table and its Layer 3-to-Layer 2 mapping data structure (for example, an ARP cache) is inefficient. Fortunately, Cisco Express Forwarding (CEF) gleans its information from the router’s IP routing table and Layer 3-to-Layer 2 mapping tables. CEF’s data structures, held in hardware, can then be referenced when forwarding packets.

    The two primary CEF data structures are as follows:

    1. Forwarding Information Base (FIB): The FIB contains Layer 3 information, similar to the information found in an IP routing table. In addition, an FIB contains information about multicast routes and directly connected hosts.
    2. Adjacency table: When a router is performing a route lookup using CEF, the FIB references an entry in the adjacency table.
      • The adjacency table entry contains the frame header information required by the router to properly form a frame.
      • Therefore, an egress interface and a next-hop MAC address are in an adjacency entry for a multipoint Ethernet interface, whereas a point-to-point interface requires only egress interface information.

    Tshoot Packet-Forwarding Process

    When troubleshooting packet-forwarding issues, you need to examine a router’s IP routing table.

    • If the observed behavior of the traffic is not conforming to information in the IP routing table, remember that the IP routing table is maintained by a router’s control plane and is used to build the tables at the data plane.
    • CEF is operating in the data plane and uses the FIB.
    • You need to view the CEF data structures (that is, the FIB and the adjacency table) that contain all the information required to make packet-forwarding decisions.

    The output indicates that, according to CEF, IP address 192.168.1.11 is accessible out interface Fast Ethernet 0/0, with the next-hop IP address 192.168.0.11.

    The output indicates that a packet sourced from IP address 10.2.2.2 and destined for IP address 192.168.1.11 will be sent out interface Fast Ethernet 0/0 to next-hop IP address 192.168.0.11.

    For a multipoint interface such as point-to-multipoint Frame Relay or Ethernet, when a router knows the next-hop address for a packet, it needs appropriate Layer 2 information (for example, next-hop MAC address or data link connection identifier [DLCI]) to properly construct a frame.

    • The output shows the Frame Relay interfaces, the corresponding DLCIs associated with the interfaces, and the next-hop IP address that is reachable out the interface using the permanent virtual circuit (PVC) associated with the listed DLCI.
      • In this case, if R2 needs to send data to the next-hop IP address 172.16.33.6, it uses the PVC associated with DLCI 406 to get there.
    • The show ip nhrp command displays the NHRP cache that is used with DMVPN networks.
      • In this example, if a packet needs to be sent to the 192.168.255.2 next-hop IP address, the nonbroadcast multiaccess (NBMA) address 198.51.100.2 is used to reach it.

    The output shows the CEF information used to construct frame headers needed to reach the next-hop IP addresses through the various router interfaces.

    • Notice the value 64510800 for Serial 1/0. This is a hexadecimal representation of information that is needed by the router to successfully forward the packet to the next-hop IP address 172.16.33.5, including the DLCI 405.
    • Notice the value CA1B01C4001CCA1C164000540800 for Fast Ethernet 3/0. This is the destination MAC address, the source MAC address, and the EtherType code for an Ethernet frame. The first 12 hex digits are the destination MAC address, the next 12 are the source MAC address, and 0800 is the IPv4 EtherType code.
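
    To make the Ethernet rewrite string easier to read, it can be split into its three fields. The helper below is a sketch written for this example (parse_ethernet_rewrite is not a Cisco tool); it assumes a plain 14-byte Ethernet II header with no VLAN tag.

```python
def parse_ethernet_rewrite(rewrite_hex: str) -> dict:
    """Split a 14-byte Ethernet header, given as hex, into its fields."""
    raw = bytes.fromhex(rewrite_hex)
    if len(raw) != 14:
        raise ValueError("expected 14 bytes: 6 destination + 6 source + 2 EtherType")

    def cisco_mac(mac: bytes) -> str:
        # Format six bytes in Cisco's dotted style, e.g. ca1b.01c4.001c
        digits = mac.hex()
        return ".".join(digits[i:i + 4] for i in range(0, 12, 4))

    return {
        "destination_mac": cisco_mac(raw[0:6]),
        "source_mac": cisco_mac(raw[6:12]),
        "ethertype": f"0x{int.from_bytes(raw[12:14], 'big'):04x}",
    }

print(parse_ethernet_rewrite("CA1B01C4001CCA1C164000540800"))
```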

    Routing Information Sources

    As a router receives routing information from a neighboring router, the information is stored in the data structures of the IP routing protocol and analyzed by the routing protocol to determine the best path, based on metrics. An IP routing protocol’s data structure can also be populated by the local router. For example, a router might be configured for route redistribution, where routing information is redistributed from the routing table into the IP routing protocol’s data structure. The router might be configured to have specific interfaces participate in an IP routing protocol process. In that case, the network that the interface belongs to is placed into the routing protocol data structure as well.

    A router could conceivably receive routing information from the following routing sources all at the same time:

    • Connected interface
    • Static route
    • RIP
    • EIGRP
    • OSPF
    • BGP

  • Classless InterDomain Routing (CIDR)

    Classful vs Classless Addressing

    Classful Addressing

    The “classful” IP addressing scheme divides the IP address space into five classes, A through E, of differing sizes. Classes A, B and C are the most important ones.

    • Class A: First octet 0-127 Mask /8
    • Class B: First octet 128-191 Mask /16
    • Class C: First octet 192-223 Mask /24

    Determining Address Class From the First Octet

    For example, consider Class B.

    • The first two bits of the first octet are “10”. The remaining bits can be any combination of ones and zeroes. This is normally represented as “10xx xxxx”.
    • Thus, the binary range for the first octet can be from “1000 0000” to “1011 1111”. This is 128 to 191 in decimal.
    • So, in the “classful” scheme, any IP address whose first octet is from 128 to 191 (inclusive) is a Class B address.

    Note: In the “classful” IP addressing scheme, the class of an IP address is identified by looking at the first one, two, three or four bits of the address.
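
    The leading-bit test can be written out directly. The sketch below checks the first one to four bits of the first octet, as the note describes; the function name address_class is an assumption for this example, and the Class D (224-239) and Class E (240-255) ranges are added for completeness.

```python
def address_class(ip: str) -> str:
    """Return the classful address class (A-E) based on the leading bits of the first octet."""
    first_octet = int(ip.split(".")[0])
    if not 0 <= first_octet <= 255:
        raise ValueError(f"invalid first octet: {first_octet}")
    if (first_octet & 0b10000000) == 0b00000000:   # 0xxx xxxx
        return "A"
    if (first_octet & 0b11000000) == 0b10000000:   # 10xx xxxx
        return "B"
    if (first_octet & 0b11100000) == 0b11000000:   # 110x xxxx
        return "C"
    if (first_octet & 0b11110000) == 0b11100000:   # 1110 xxxx
        return "D"
    return "E"                                     # 1111 xxxx

for address in ("10.0.0.1", "172.16.1.1", "192.168.1.2"):
    print(address, address_class(address))
```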

    Classless Addressing (CIDR)

    Classless Inter-Domain Routing (CIDR) is a system of IP addressing and routing that solves the many problems of “classful” addressing by eliminating fixed address classes in favor of a flexible, multiple-level, hierarchical structure of networks of varying size.

    While classful networks make life simpler, they are not efficient in terms of IP address usage. What if you want a Class C network with only two hosts on it? Well, for that network, you would need to have four IP addresses, that is, two for the hosts, one for the network address, and one for the broadcast address. We would have 252 IP addresses sitting there unused. Admittedly, that does give you scope to grow your network, but it is still not ideal.

    CIDR lets us escape the default subnet masks, so networks can be sized more flexibly. Do you only want two hosts? Not a problem: we can create a subnet mask for that. CIDR is based on Variable Length Subnet Masks (VLSMs), which enable network engineers to divide an IP address space into subnets of different sizes (as opposed to multiple smaller networks all of the same size), making it possible to create subnetworks with different host counts without wasting large numbers of addresses.
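
    As a rough illustration of sizing a subnet to the number of hosts, the sketch below picks the longest prefix that still leaves room for the hosts plus the network and broadcast addresses. The function name smallest_prefix_for_hosts and the 192.168.1.0 example block are assumptions for this example; it deliberately ignores special cases such as /31 point-to-point links.

```python
import ipaddress

def smallest_prefix_for_hosts(host_count: int) -> int:
    """Return the longest IPv4 prefix length whose subnet fits host_count hosts."""
    if host_count < 1:
        raise ValueError("host_count must be at least 1")
    needed = host_count + 2                 # hosts + network address + broadcast address
    host_bits = (needed - 1).bit_length()   # smallest n such that 2**n >= needed
    return 32 - host_bits

prefix = smallest_prefix_for_hosts(2)
subnet = ipaddress.ip_network(f"192.168.1.0/{prefix}")
print(subnet, subnet.num_addresses, [str(host) for host in subnet.hosts()])
```

    For the two-host case this yields a four-address subnet, which is exactly the two hosts, the network address, and the broadcast address counted earlier.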

    Classful vs Classless Routing Protocols

    The biggest distinction between classful and classless routing protocols is that classful routing protocols do not send subnet mask information in their routing updates. Classless routing protocols include subnet mask information in the routing updates.

    The two original IPv4 routing protocols developed were RIPv1 and IGRP. They were created when network addresses were allocated based on classes (class A, B, or C). At that time, a routing protocol did not need to include the subnet mask in the routing update, because the network mask could be determined based on the first octet of the network address.

    Classful routing protocols also create problems in discontiguous networks. A discontiguous network exists when subnets from the same classful major network address are separated by a different classful network address.

    Example:

    • Notice that the LANs of R1 (172.16.1.0/24) and R3 (172.16.2.0/24) are both subnets of the same class B network (172.16.0.0/16).
    • They are separated by different classful network addresses (192.168.1.0/30 and 192.168.2.0/30).
    • When R1 forwards an update to R2, RIPv1 does not include the subnet mask information with the update; it only forwards the class B network address 172.16.0.0.

    R1 Forwards a Classful Update to R2

    • R2 receives and processes the update. It then creates and adds an entry for the class B 172.16.0.0/16 network in the routing table.
    • When R3 forwards an update to R2, it also does not include the subnet mask information and therefore only forwards the classful network address 172.16.0.0.
    • R2 receives and processes the update and adds another entry for the classful network address 172.16.0.0/16 to its routing table.
      • When there are two entries with identical metrics in the routing table, the router shares the traffic load equally between the two links. This is known as load balancing.

    Two entries with identical metrics

    Discontiguous networks have a negative impact on a network. For example, a ping to 172.16.1.1 would return “U!U!U” because R2 would forward the first ping out its Serial 0/0/1 interface toward R3, and R3 would return a Destination Unreachable (U) error code to R2. The second ping would exit R2’s Serial 0/0/0 interface toward R1, and R1 would return a successful reply (!). This pattern would continue until the ping command completes.
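
    The reason R2 cannot tell the two LANs apart can be shown with a small sketch that mimics how a classful protocol derives the mask from the first octet. The function name classful_network is an assumption for this example; it is not how RIPv1 is implemented, only a model of the information that survives in the update.

```python
import ipaddress

def classful_network(subnet: str) -> str:
    """Return the classful major network that a maskless update would advertise."""
    address = ipaddress.ip_interface(subnet).ip
    first_octet = int(address) >> 24
    if first_octet < 128:
        prefix = 8        # Class A
    elif first_octet < 192:
        prefix = 16       # Class B
    elif first_octet < 224:
        prefix = 24       # Class C
    else:
        raise ValueError("Class D and E addresses have no default mask")
    # strict=False lets ip_network zero out the host bits for us.
    return str(ipaddress.ip_network(f"{address}/{prefix}", strict=False))

# Both LANs collapse to the same entry, so R2 sees two paths to one network.
print(classful_network("172.16.1.0/24"))
print(classful_network("172.16.2.0/24"))
```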

    Summary

    • Classful routing protocols do not carry subnet masks; classless routing protocols do.
    • Older routing protocols, including RIP and IGRP, are classful. Newer protocols, including RIP-2, EIGRP, and OSPF, are classless.
    • Since RIP-2 updates carry subnet masks, it is possible to associate different subnet masks within a single classful network; in other words, RIP-2 supports VLSM, which is a feature of classless routing protocols.

    Understanding CIDR

    The designers of the original IPv4 addressing scheme assumed that three unicast classes (A, B, and C) would be enough. Each class had one fixed subnet mask:

    • Class A: 255.0.0.0 (16,777,216 addresses)
    • Class B: 255.255.0.0 (65,536 addresses)
    • Class C: 255.255.255.0 (256 addresses)

    These networks are also known as classful networks.

    When the Internet started growing rapidly, large companies received entire Class A networks with millions of addresses. Smaller companies could get a Class B network with 65,536 addresses or Class C networks with 256 addresses. Many addresses were wasted, so something had to be done.

    The solution to this problem is Classless Interdomain Routing, in other words we stop working with the classful networks and start working with classless networks.

    Classless networking means we no longer use the Class A, B, or C networks but are free to use any subnet mask we like. Also, instead of writing the subnet mask as 255.255.255.0, we often use a “bit” notation such as /24. This number is the count of bits set to 1 in the subnet mask.

    Summary

    • The primary use of CIDR is to reduce the size of routing tables by aggregating several classful networks into a single route entry. It also slows the rapid exhaustion of IPv4 addresses.
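
    Route aggregation can be demonstrated with the collapse_addresses function from Python's standard ipaddress module. The four 192.168.x.0/24 prefixes below are illustrative values chosen for this sketch, and the aggregate wrapper is a hypothetical helper name.

```python
import ipaddress

def aggregate(prefixes):
    """Collapse contiguous prefixes into the fewest covering route entries."""
    networks = [ipaddress.ip_network(prefix) for prefix in prefixes]
    return [str(network) for network in ipaddress.collapse_addresses(networks)]

# Four contiguous Class C networks become a single /22 entry.
print(aggregate(["192.168.0.0/24", "192.168.1.0/24", "192.168.2.0/24", "192.168.3.0/24"]))

# Non-contiguous networks cannot be merged and stay as separate entries.
print(aggregate(["192.168.0.0/24", "192.168.2.0/24"]))
```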

    CIDR Notation

    CIDR notation (Classless Inter-Domain Routing) is an alternate method of representing a subnet mask. It is simply a count of the number of network bits (bits that are set to 1) in the subnet mask.

    The CIDR notation is easier to write down than typing the entire subnet mask. Unfortunately most operating systems and network devices still require you to type in the full subnet mask.

    A little overview with subnet masks and CIDR notation:

    CIDR Notation    Subnet Mask
    /8               255.0.0.0
    /9               255.128.0.0
    /10              255.192.0.0
    /11              255.224.0.0
    /12              255.240.0.0

    and so on.
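
    The rest of the table can be generated rather than memorized. The sketch below converts in both directions using the standard ipaddress module; the function names prefix_to_mask and mask_to_prefix are assumptions for this example.

```python
import ipaddress

def prefix_to_mask(prefix_length: int) -> str:
    """Convert a CIDR prefix length (0-32) to a dotted-decimal subnet mask."""
    return str(ipaddress.ip_network(f"0.0.0.0/{prefix_length}").netmask)

def mask_to_prefix(mask: str) -> int:
    """Convert a dotted-decimal subnet mask to its CIDR prefix length."""
    return ipaddress.ip_network(f"0.0.0.0/{mask}").prefixlen

for length in range(8, 33):
    print(f"/{length:<3} {prefix_to_mask(length)}")
```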

  • High Availability

    In networking, availability refers to the operational uptime of the network. The aim of high availability is to achieve continuous network uptime by designing a network to avoid single points of failure, incorporate deterministic network patterns, and utilize event-driven failure detection to provide fast network convergence.

    Network Convergence Overview

    Network convergence is the time required to redirect traffic around a failure that causes loss of connectivity (LoC). The latency requirements for convergence vary by application. For example, an Open Shortest Path First (OSPF) network using default settings may take 5 or more seconds to converge around a link failure. This length of time may be acceptable for users reading Internet website articles, but it is completely unacceptable for IP telephony users.

    A number of factors influence network convergence speed. In general, the higher the number of network prefixes and routers in the network, the slower the convergence will be. The primary factors that influence network convergence are as follows:

    • T1: Time to detect the failure event
    • T2: Time to propagate the event to neighbors
    • T3: Time to process the event and calculate new best path
    • T4: Time to update the routing table and program forwarding tables

    Continuous Forwarding

    Routers specifically designed for high availability include hardware redundancy, such as dual power supplies and route processors (RPs). An RP, which is also called a supervisor on some platforms, is responsible for learning the network topology and building the route table (Routing Information Base [RIB]). An RP failure can trigger routing protocol adjacencies to reset, resulting in packet loss and network instability. During an RP failure, it may be more desirable to hide the failure and allow the router to continue forwarding packets using the previously programmed CEF table entries versus temporarily dropping packets while waiting for the secondary RP to reestablish the routing protocol adjacencies and rebuild the forwarding table.

    The following two high availability features allow the network to route through a failure during an RP switchover:

    • Stateful switchover (SSO) with nonstop forwarding (NSF)
    • Stateful switchover (SSO) with nonstop routing (NSR)

    Stateful Switchover (SSO)

    Stateful switchover (SSO) is a redundancy feature that allows a Cisco router with two RPs to synchronize router configuration and control plane state information.

    • The process of mirroring information between RPs is referred to as checkpointing.
    • SSO-enabled routers always checkpoint line card operation and Layer 2 protocol states.
    • During a switchover, the standby RP immediately takes control and will prevent problems such as interface link flaps and router reloads; however, Layer 3 packet forwarding is disrupted without additional configuration. The standby RP does not have any Layer 3 checkpoint information about the routing peer, so a switchover will trigger a routing protocol adjacency flap that clears the route table. After the route table is cleared, the CEF entries are purged, and traffic is no longer routed until the network topology is relearned and the forwarding table is reprogrammed.
    • Enabling NSF or NSR high availability capabilities informs the routers to maintain the CEF entries for a short duration and continue forwarding packets through an RP failure until the control plane recovers.
    • SSO requires that both RPs have the same software version.
    • The feature is automatically enabled by default on all IOS XR routers.
    • The IOS and IOS XR command show redundancy provides details on the current SSO state operation.
    • Manually triggering a switchover between route processors is performed with the command redundancy force-switchover on IOS routers and with the command redundancy switchover on IOS XR nodes.

    Configuration / Verification

    				
    					// IOS - enabling stateful switchover using the redundancy mode command
    redundancy
     mode sso
    				
    			

    Nonstop Forwarding and Graceful Restart (NSF/GR)

    Nonstop forwarding (NSF) is a feature deployed along with SSO to protect the Layer 3 forwarding plane during an RP switchover. With NSF enabled, the router continues to forward packets using the stored entries in the FIB table.

    There are three categories of NSF routers:

    • NSF-capable router: A router that has dual RPs and is manually configured to use NSF to preserve the forwarding table through a switchover. The router restarts the routing process upon completion of the RP switchover.
    • NSF-aware router: A neighbor router, which assists the NSF-capable router during the restart by preserving the routes and adjacency state during the RP switchover. An NSF-aware router does not require dual RPs.
    • NSF-unaware router: A router that is not aware or capable of assisting a neighboring router during an RP switchover.

    SSO with NSF is a high availability feature that is part of the internal router operation. Graceful restart (GR) is a subcomponent of NSF and is the mechanism the routing protocols use to signal NSF capabilities and awareness.

    The GR signaling mechanism differs slightly for each protocol, but the general concept is the same for all.

    Example:

    • R1 has NSF with SSO enabled.
    • The primary RP has failed, but the router continues to forward packets using the existing CEF tables (FIB).
    • During this time, the backup RP transparently takes over and reestablishes communication with R2 to restore the control plane and repopulate the routing tables (RIB).
    • Throughout this grace period, R2 does not notify the rest of the network that a failure has occurred on R1, which maintains stability in the network and prevents a networkwide topology change event.

    SSO with NSF

    NSF Capability Exchange Process

    1. R1 signals to R2 that it is NSF/SSO-capable while forming the initial routing adjacency. The two agree that if R1 should signal a control plane reset, R2 will not drop the peering and will continue sending and receiving traffic to R1 as long as the routing protocol hold timers do not expire.
    2. R1 sends a GR message to R2 indicating that the control plane is temporarily going offline immediately preceding the RP failover.
    3. R1 maintains the CEF table programming to forward traffic to R2 while the RP switchover takes place.
    4. Upon completion of the switchover, the new primary RP on R1 reestablishes communication with R2 and requests updates for repopulating the route table.
    5. R2 provides the route information to R1, while at the same time suppressing a notification to the rest of the network that an adjacency flap has occurred to R1. Stability is improved by preventing an unnecessary networkwide best path recalculation for the route flap.

    NSF Graceful Restart Relationship Building Process

    Note: NSF freezes the CEF table and allows the router to forward packets to the last known good next-hop from prior to the RP switchover. If the network topology changes while the router is recovering, the packets may be suboptimally routed or possibly sent to the wrong destination and dropped.

    NSF should not be deployed in parallel with routing protocol keepalive and holddown timers of less than 4 seconds. The NSF-capable router requires time to reestablish control plane communication during an RP switchover, and aggressive holddown timers can expire before this activity completes, leading to a neighbor adjacency flap. Bidirectional forwarding detection (BFD) is a better solution in most scenarios than aggressive keepalive timers.

    Topology

    The interior routing protocols OSPF, IS-IS, and EIGRP are automatically NSF-aware for both IOS and IOS XR. Routers with dual RPs need to be configured as NSF-capable within the routing protocol.

    NSF Routers

    OSPF

    Two GR configuration modes are available for OSPF:

    • Cisco: The Cisco mode for performing GR adds Link Local Signaling (LLS) bits to the hello and DBD packets.
      • The LSDB Resynchronization (LR) bit is included in the database description (DBD) packets to indicate out-of-band resynchronization (OOB) capability.
      • A hello packet with the LLS Restarting R bit set indicates that the router is about to perform a restart. This method was developed by Cisco and was later standardized by the IETF in RFC 4811, 4812, and 5613.
    • IETF: The IETF RFC 3623 method for performing GR uses link-local opaque LSAs.
      • A router sends a Grace LSA to indicate it is about to restart the OSPF process. The router resynchronizes the LSDB using Grace LSAs once the restart completes.

    Configuration

    				
    					// IOS
    router ospf 100
     nsf cisco
    				
    			

    				
    					// IOS XR
    router ospf 100
     nsf cisco
    				
    			

    Verification

    NSF Graceful Restart Agreement

    Demonstrates that the neighbors are using LLS, which is required for NSF awareness and successful GR negotiations. Notice that the neighbor adjacency peering includes the LLS LSDB OOB Resynchronization (LR) capability bit set.

    OSPF Graceful Restart Initiated

    • Demonstrates that a GR has just taken place on the neighboring router R1.
    • Notice that R2 and XR3 do not terminate the adjacency session because of the previously negotiated GR agreement.
    • During a GR event, IOS routers display the OOB resynchronization countdown timer for the recovery.
      • If R1 does not respond by the end of the timer, R2 and XR3 will consider the connection down.
      • IOS XR routers also use the OOB-Resync timer to track the state of the neighbor adjacency.

    OSPF Graceful Restart Completed

    • Demonstrates that a GR completed successfully 11 seconds ago.
    • R1 recovered, and the neighbor adjacency was not reset on the R2 and XR3 side of the connection per the GR agreement.

    Note: During the GR, the NSF-aware router does not clear the route table entries or the neighbor adjacency, and therefore the age of the routes predates the GR event as if nothing happened. The NSF-capable router performing the SSO restarts the routing process, so the neighbor adjacency and route table entries age is reset to zero on the restarting router.

    EIGRP

    EIGRP includes a Restart (RS) bit that allows for the signaling of a GR.

    • When the RS bit is set to 1, the neighboring NSF-aware router knows that an RP switchover is about to take place on the NSF-capable router.
    • Once the SSO event completes, the two routers synchronize route tables with the RS bit still enabled.
      • The NSF-aware router sends an End-of-Table (EOT) signal to indicate that it has provided all the updates, at which point the two routers clear the RS bit from the EIGRP packets.

    Configuration

    				
    					// IOS (Classic AS Configuration)
    router eigrp 100
     nsf
    				
    			

    				
    					// IOS (Named Mode Configuration)
    router eigrp LAB
     address-family ipv4 unicast autonomous-system 100
       nsf
    				
    			

    Verification

    Displays the EIGRP neighbor status on an NSF-aware router. The highlighted output indicates the neighbor has been up for 50 minutes but that an RP switchover occurred on R1 52 seconds ago.

    BGP

    • RFC 4724 describes BGP GR signaling.
    • Enabling the feature modifies the initial BGP open negotiation message to include GR capability code 64.
    • The GR capability informs the neighboring router that it should not reset the BGP session and immediately purge the routes when it performs an SSO.
    • During an RP SSO event, the TCP connection used to form the BGP session is reset, but the routes in the RIB are not immediately purged.
    • The BGP NSF-aware router detects that the BGP TCP socket has cleared, marks the old routes stale, and begins a countdown timer, while at the same time continuing to forward traffic using the route table information from prior to the reset.
    • Once the NSF-capable router recovers, it forms a new TCP session and sends a new GR message notifying the NSF-aware router that it has restarted.
    • The two routers exchange updates until the NSF-capable router signals the end-of-RIB (EOR) message.
    • The NSF-aware router clears the stale countdown timer and any stale entries that are no longer present are removed.
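
    The stale-route handling on the NSF-aware side can be modeled as a small state sketch. Everything below (the class name GracefulRestartHelper, its methods, and the sample prefixes) is hypothetical and exists only to illustrate the sequence in the list above; the stale countdown timer is left out for brevity.

```python
class GracefulRestartHelper:
    """Toy model of an NSF-aware BGP peer's routes during a neighbor restart."""

    def __init__(self, routes):
        # prefix -> {"next_hop": str, "stale": bool}
        self.routes = {
            prefix: {"next_hop": next_hop, "stale": False}
            for prefix, next_hop in routes.items()
        }

    def session_reset(self):
        """TCP session cleared: keep forwarding on the old routes, but mark them stale."""
        for entry in self.routes.values():
            entry["stale"] = True

    def receive_update(self, prefix, next_hop):
        """A route re-advertised after the restart is refreshed and no longer stale."""
        self.routes[prefix] = {"next_hop": next_hop, "stale": False}

    def receive_end_of_rib(self):
        """End-of-RIB received: remove entries that were never refreshed."""
        self.routes = {p: e for p, e in self.routes.items() if not e["stale"]}

helper = GracefulRestartHelper({"10.1.0.0/16": "10.0.13.1", "10.2.0.0/16": "10.0.13.1"})
helper.session_reset()                            # both routes kept, both stale
helper.receive_update("10.1.0.0/16", "10.0.13.1")
helper.receive_end_of_rib()                       # 10.2.0.0/16 was not re-advertised
print(sorted(helper.routes))
```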

    Note: An RP failure will cause the BGP session to temporarily reset on both sides of the connection. The session uptime will correlate to the RP failure. On the restarting router (NSF-capable router), the route table entries will be purged, but on the NSF-aware router, the learned routes will remain unchanged with an age that precedes the GR.

    Unlike the other routing protocols, BGP routers are not NSF-aware or NSF-capable by default. The GR capability requires manual configuration on both sides of the session.

    Enabling BGP NSF

    Step 1. Enable GR and NSF awareness

    GR support is enabled for all peers:

    • [IOS] bgp graceful-restart
    • [IOS XR] graceful-restart

    To selectively enable GR per peer:

    • [IOS] neighbor ip-address ha-mode graceful-restart
    • [IOS XR] graceful-restart [disable]

    Step 2. Set the restart time (optional)

    The GR restart time determines how long the router will wait for the restarting router to send an open message before declaring the neighbor down and resetting the session.

    • [IOS] bgp graceful-restart restart-time seconds
    • [IOS XR] graceful-restart restart-time seconds

    The default value is 120 seconds.

    Step 3. Set stale route timeouts (optional).

    The stale routes timeout determines how long the router will wait for the end-of-RIB (EOR) message from the restarting neighbor before purging routes.

    • [IOS] bgp graceful-restart stalepath-time seconds
    • [IOS XR] graceful-restart stalepath-time seconds

    The default value is 360 seconds.

    Note: Enabling GR capabilities on an IOS router after the session has already been established will trigger the session to renegotiate and will result in an immediate session flap. In IOS XR, enabling GR is nondisruptive. The session will continue working with the previous setting until the session is manually reset and the capability is negotiated.

    Configuration

    				
    					// R1
    router bgp 65001
     bgp graceful-restart restart-time 120
     bgp graceful-restart stalepath-time 360
     neighbor 10.0.13.3 ha-mode graceful-restart
     
    // XR3
    router bgp 65003
     bgp graceful-restart restart-time 120
     bgp graceful-restart stalepath-time 360
     bgp graceful-restart
    				
    			

    Verification

    The command show bgp ipv4 unicast neighbor can be used to determine whether GR capabilities have been negotiated. Example verifies that routers R1 and XR3 have successfully negotiated NSF capabilities with each other.

    It is common to have BGP and IGP routing protocols on the same routers. To avoid suboptimal routing during an RP failover, the protocols should have matching NSF capabilities configured.

    Nonstop Routing (NSR)

    Nonstop routing (NSR) is an internal Cisco router feature that does not use a GR mechanism to signal to neighboring routers that an RP switchover has taken place.

    Instead, the primary RP is responsible for constantly transferring all relevant routing control plane information to the backup RP, including routing adjacency and TCP sockets.

    During a failure, the new RP uses the “checkpoint” state information to maintain the routing adjacencies and recalculate the route table without alerting the neighboring router that a switchover has occurred.

    Example:

    • Figure demonstrates an RP switchover on R1, which has SSO with NSR enabled.
    • The routing protocol peering between R1 and R2 is unaffected by the RP failure.
    • R2 is unaware that a failure has occurred on R1.
    • Throughout the entire RP failover process, the traffic continues to flow between the two routers unimpeded using the Cisco Express Forwarding (CEF) forwarding table.

    SSO with NSR

    NSR’s primary benefit over NSF is that it is a completely self-contained high availability solution. There is no disruption to the routing protocol interaction, so the neighboring router does not need to be NSR- or NSF-aware.

    NSR Routing Protocol Interaction

    OSPF

    				
    					// IOS
    router ospf 100
     nsr
    				
    			

    				
    					// IOS XR
    router ospf 100
     nsr
    				
    			

    The IOS command to verify whether NSR is operational is show ip ospf nsr, and the IOS XR command is show redundancy.

    IS-IS

    				
    					// IOS
    router isis LAB
     nsf cisco
    				
    			

    				
    					// IOS XR
    router isis LAB
     nsf cisco
    				
    			

    BGP

    • Per peer basis:
      • [IOS] neighbor ip-address ha-mode sso
    • All neighbor sessions:
      • [IOS XR] nsr

    Note: NSR and NSF/GR can be configured at the same time. Typically, NSR will take precedence over NSF. However, when deployed together with BGP on IOS, the router will attempt to use the NSF GR method over NSR. The IOS command neighbor ip-address ha-mode graceful-restart disable ensures that NSR is the active high availability feature used for the peering.

    IOS XR routers give NSR preference over NSF GR when the two features are deployed in unison.

    Example:

    • Demonstrates how to configure NSR for BGP.
    • GR has been globally enabled within the BGP process.
    • The GR capability has been disabled for the specific peer to ensure that SSO with NSR is employed.

    				
    					// R1
    router bgp 65001
     bgp graceful-restart restart-time 120
     bgp graceful-restart stalepath-time 360
     bgp graceful-restart
     neighbor 10.0.13.1 ha-mode sso
     neighbor 10.0.13.1 ha-mode graceful-restart disable
    
    // XR3
    router bgp 65003
     nsr
    				
    			

    The IOS command show bgp ipv4 unicast sso summary or the IOS XR command show bgp summary nsr may be used to verify BGP NSR operational status.

    NSF and NSR Together

    Routing protocols may use NSF and NSR together at the same time. NSR takes precedence for the IGP routing protocols, and NSF will be used as a fallback option where NSR recovery is not possible. For example, NSR does not support process restarts, and therefore there are benefits to deploying the two high availability features in tandem.

    The IOS XR global command nsr process-failures switchover may be used when NSF is not enabled to force an RP failover when a routing process restarts.

  • BGP fast-external-fallover

    The bgp fast-external-fallover command terminates the external BGP session of any directly adjacent peer as soon as the link used to reach that peer goes down, without waiting for the hold-down timer to expire.

    Historically, when the fast-external-fallover feature was not available and a link went down, the EBGP session remained up until the hold-down timer expired. In the meantime, traffic was black-holed and service was impacted. The bgp fast-external-fallover command was introduced to overcome this problem. With this command configured, the EBGP session terminates immediately if the link goes down.

    This feature is enabled by default for EBGP sessions but disabled for IBGP sessions.

    Although the bgp fast-external-fallover command improves convergence time, it is advisable to disable it if the EBGP link is flapping continuously. Disabling fast fallover avoids the instability caused by neighbors continually transitioning between the idle and established states, as well as the routing churn caused by the resulting flood of UPDATE messages that advertise and withdraw routes.

    Configuration

    Purpose: Fast external fallover is enabled by default on IOS, IOS XR, and NX-OS. When an interface that is used for a BGP connection goes down, the BGP session is immediately terminated. If the interface is flapping, this can cause instability, because the neighbors constantly transition between the idle and established states and flood each other with BGP UPDATE messages that advertise and withdraw routes. If you have a flapping interface, use the no form of this command.

    • Use the no bgp fast-external-fallover command to disable this feature on both Cisco IOS and NX-OS.
    • Use the bgp fast-external-fallover disable command to disable this feature on IOS XR.
    • The feature can also be enabled at the interface level using the command ip bgp fast-external-fallover on Cisco IOS.

    				
    					// no bgp fast-external-fallover (EBGP)
    *Jul 22 04:16:46.232: %BGP-3-NOTIFICATION: sent to neighbor 10.3.3.1 4/0 (hold time expired) 0 bytes 
    *Jul 22 04:16:46.233: %BGP-5-NBR_RESET: Neighbor 10.3.3.1 reset (BGP Notification sent)
    *Jul 22 04:16:46.236: %BGP-5-ADJCHANGE: neighbor 10.3.3.1 Down BGP Notification sent
    *Jul 22 04:16:46.236: %BGP_SESSION-5-ADJCHANGE: neighbor 10.3.3.1 IPv4 Unicast topology base removed from session  BGP Notification sent
    
    // bgp fast-external-fallover enabled by default (EBGP)
    *Jul 22 04:19:34.594: %BGP-5-NBR_RESET: Neighbor 10.3.3.1 reset (Interface flap)
    *Jul 22 04:19:34.598: %BGP-5-ADJCHANGE: neighbor 10.3.3.1 Down Interface flap
    *Jul 22 04:19:34.599: %BGP_SESSION-5-ADJCHANGE: neighbor 10.3.3.1 IPv4 Unicast topology base removed from session  Interface flap
    *Jul 22 04:19:36.561: %LINK-5-CHANGED: Interface GigabitEthernet0/3, changed state to administratively down
    *Jul 22 04:19:37.563: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/3, changed state to down
    
    // bgp fast-external-fallover disabled by default (IBGP)
    *Jul 22 04:25:09.181: %BGP-3-NOTIFICATION: sent to neighbor 10.1.1.1 4/0 (hold time expired) 0 bytes 
    *Jul 22 04:25:09.182: %BGP-5-NBR_RESET: Neighbor 10.1.1.1 reset (BGP Notification sent)
    *Jul 22 04:25:09.184: %BGP-5-ADJCHANGE: neighbor 10.1.1.1 Down BGP Notification sent
    *Jul 22 04:25:09.185: %BGP_SESSION-5-ADJCHANGE: neighbor 10.1.1.1 IPv4 Unicast topology base removed from session  BGP Notification sent
    
    // neighbor x.x.x.x fall-over configured (IBGP)
    *Jul 22 04:29:51.308: %BGP-5-NBR_RESET: Neighbor 10.1.1.1 reset (Route to peer lost)
    *Jul 22 04:29:51.311: %BGP-5-ADJCHANGE: neighbor 10.1.1.1 Down Route to peer lost
    *Jul 22 04:29:51.312: %BGP_SESSION-5-ADJCHANGE: neighbor 10.1.1.1 IPv4 Unicast topology base removed from session  Route to peer lost
    *Jul 22 04:29:53.273: %LINK-5-CHANGED: Interface GigabitEthernet0/1, changed state to administratively down
    *Jul 22 04:29:54.273: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/1, changed state to down
    				
    			
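
    The log samples above differ mainly in the reason attached to the NBR_RESET message. The following is a minimal Python sketch, not a Cisco tool, that pulls the neighbor and reset reason out of such lines. The function name classify_resets and the reason-to-trigger mapping are assumptions for illustration, drawn only from the four samples shown; other reset reasons exist and are reported as "unknown".

```python
import re

# Matches lines such as:
#   %BGP-5-NBR_RESET: Neighbor 10.3.3.1 reset (Interface flap)
NBR_RESET = re.compile(r"%BGP-5-NBR_RESET: Neighbor (\S+) reset \((.+?)\)\s*$")

# Assumption: mapping inferred only from the log samples in this section.
TRIGGER_BY_REASON = {
    "Interface flap": "fast-external-fallover",
    "Route to peer lost": "neighbor fall-over",
    "BGP Notification sent": "hold timer or other NOTIFICATION",
}


def classify_resets(log_text):
    """Return (neighbor, reason, likely_trigger) for each NBR_RESET line."""
    results = []
    for line in log_text.splitlines():
        match = NBR_RESET.search(line)
        if match:
            neighbor, reason = match.groups()
            trigger = TRIGGER_BY_REASON.get(reason, "unknown")
            results.append((neighbor, reason, trigger))
    return results


sample = """\
*Jul 22 04:16:46.233: %BGP-5-NBR_RESET: Neighbor 10.3.3.1 reset (BGP Notification sent)
*Jul 22 04:19:34.594: %BGP-5-NBR_RESET: Neighbor 10.3.3.1 reset (Interface flap)
*Jul 22 04:29:51.308: %BGP-5-NBR_RESET: Neighbor 10.1.1.1 reset (Route to peer lost)
"""

for row in classify_resets(sample):
    print(row)
```

    A reset reason of "Interface flap" or "Route to peer lost" suggests the session was torn down by a fast-fallover mechanism, whereas "BGP Notification sent" following a hold-time-expired NOTIFICATION suggests the router waited for the hold timer.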

  • BGP Nonstop Routing (NSR)

    High-availability features like GR are very useful in critical network environments, where traffic loss of even a few seconds can be costly to the organization, whether it is a service provider or an enterprise. But GR is not a feasible solution in all deployments. Consider a service provider network. It is easy to deploy GR everywhere in the service provider core and edge, but the service provider cannot expect its customers to enable GR or to be GR capable. The customer CPE might run a platform or software that does not support GR, or might have just a single RP. In such situations, GR is not feasible for the customers.

    An RP switchover should be transparent to the customer, and this was the primary motivation behind NSR. NSR is a feature where routing protocols explicitly checkpoint state from the active RP to the standby RP to maintain routing information across a switchover. Thus, NSR sessions are in the established state on the standby RP prior to switchover and remain established after the switchover. The main benefit of NSR is that it is transparent to the remote speaker; that is, the remote speaker does not need to be NSR capable for the feature to work.

    There are three phases in NSR operation. Each phase performs certain actions, and based on these phases, it becomes easier to identify any problem with BGP NSR.

    1. Synchronization: During this phase, session state is mirrored between the active and the standby RP. The TCP stack is synchronized first, followed by the application stack, in this case BGP.
    2. NSR-ready: The active and standby stacks operate independently, but the incoming packets or updates are replicated to both the RPs. The outgoing segments or updates are sent out via the standby RP or active RP depending on the underlying platform. On IOS/IOS XE, the active RP sends the update to the peers, but on IOS XR, the update is sent out via the standby RP. Note that the system uses asynchronous inter-process communication (IPC) between the active and standby RPs to replicate the information. In this state, the active RP sends prefix/best-path information to the standby.
    3. Switchover: When the switchover occurs, TCP activates the sockets based on the application trigger and restores keepalive functionality to maintain the session states. In other words, the new active RP (previously the standby RP) continues from where the old active RP left off.

    The BGP NSR feature is supported on IOS/IOS XE and IOS XR platforms.

    • To enable BGP NSR on Cisco IOS, use the command neighbor ip-address ha-mode sso.
    • On IOS XR, NSR is not supported on a per-neighbor basis and can only be enabled globally for all address families using the command nsr under the router bgp configuration mode.
    • NSR is enabled globally on Cisco IOS by using the command bgp sso route-refresh-enable. This command only allows BGP NSR to be enabled to peers that are Route Refresh capable.
    • The BGP NSR related information is found for each peer by using the command show bgp afi safi neighbor ip-address.
    • On IOS XR, another command to verify if NSR is enabled for the BGP process is the command show bgp process.

    				
    					//IOS
    router bgp 100
     bgp sso route-refresh-enable
     neighbor 192.168.2.2 ha-mode sso
    
    // IOS XR
    router bgp 100
     nsr
     commit
    				
    			

    In IOS XR, a process can crash for various reasons. If the TCP or BGP process restarts on the active RP, the system can force a failover from the active RP to the standby RP as a recovery action. This is not done automatically; to enable this behavior, configure the command nsr process-failures switchover. Note that if a process restarts on the standby RP, only the NSR functionality is lost until the process comes up again; there is no other service impact.

    From the command-line perspective, there isn’t much information that can be viewed on the Cisco IOS or IOS XE platforms, but on IOS XR, a lot of information is available for BGP NSR. The BGP NSR goes through various states.

    The following describes the different states of the BGP NSR finite state machine:

    • None: NSR is disabled (not configured).
    • Initializing: Basic initialization in progress. This is done after the first time NSR is configured.
    • Connecting: Attempting to connect to peer (ACTV/STDBY) process.
    • TCP Init-Sync: Synchronization of TCP sessions in progress.
    • BGP Init-Sync: Synchronization of BGP database in progress.
    • NSR-Ready: Ready to perform NSR-enabled switchover.

    Note that in the previous example output, the NSR state is None. This is because there is no standby RP present in the system. In an ideal situation with dual RPs, the NSR state should be NSR-Ready.

    • To view the NSR state on a dual RP system, use the command show redundancy. This command displays the active and the standby RP redundancy states.
    • Use the command show bgp afi safi [prefix | summary] [standby] to view the BGP session state and the BGP table for an AFI/SAFI on the standby RP.

    Note: If a manual switchover is required for maintenance purposes, ensure that the redundancy state is Standby hot and also the standby is in NSR-Ready state. This ensures seamless activity without any service impact.
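
    The pre-switchover check described in the note can be sketched in Python as follows. This is only an illustration under stated assumptions: the NsrState enum mirrors the state names listed above, and safe_for_manual_switchover is a hypothetical helper that takes values an operator has already read from the router. It does not parse real CLI output.

```python
from enum import Enum


class NsrState(Enum):
    """BGP NSR states as named in the list above."""
    NONE = "None"
    INITIALIZING = "Initializing"
    CONNECTING = "Connecting"
    TCP_INIT_SYNC = "TCP Init-Sync"
    BGP_INIT_SYNC = "BGP Init-Sync"
    NSR_READY = "NSR-Ready"


def safe_for_manual_switchover(redundancy_state, nsr_state):
    """Return True only if the standby is hot and BGP NSR is ready.

    redundancy_state is assumed to be the standby state string as read
    by the operator (for example "Standby hot").
    """
    standby_hot = redundancy_state.strip().lower() == "standby hot"
    return standby_hot and nsr_state is NsrState.NSR_READY


print(safe_for_manual_switchover("Standby hot", NsrState.NSR_READY))
print(safe_for_manual_switchover("Standby hot", NsrState.BGP_INIT_SYNC))
```

    Both conditions must hold; a hot standby whose BGP database is still synchronizing would not yet give a seamless switchover.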

    After a switchover, the standby RP goes through all the NSR states.

    • To display all the modes that the standby goes through after it moves to a standby-ready state, along with the timeline, use the following commands. They also show the state of each BGP neighbor along with its NSR state.
      • show bgp summary nsr
      • show bgp nsr
    • To view the NSR states and the neighbor state on the standby RP:
      • show bgp summary nsr standby
    • A cumulative view of all the session states, that is, Neighbor State and NSR State, is available with the command:
      • show bgp sessions
    • Sessions that are not NSR ready are listed with the command:
      • show bgp sessions [not-nsr-ready]
    • Because the TCP state must be synchronized between the active RP and the standby RP, it is vital to verify how many sessions an application (in this case BGP) asks TCP to synchronize and how many have actually synchronized. To verify this information, use the command:
      • show tcp nsr session-set brief

  • BGP Graceful-Restart Feature (GR)

    NSF/SSO, NSR, Graceful Restart

    Nonstop forwarding (NSF) refers to the capability of the data plane to continue forwarding IP packets when the control plane disappears (momentarily, that is), most likely during an RP switchover (failing over to a standby RP).

    Stateful switchover (SSO) refers to the capability of the control plane to hold configuration and various states during this switchover, and to thus effectively reduce the time to utilize the newly failed-over control plane. This is also handy when doing scheduled hitless upgrades within the ISSU execution path. The time to reach SSO for the newly active RP may vary depending on the type and scale of the configuration.

    Graceful restart (GR) refers to the capability of the control plane to delay advertising the absence of a peer (going through control-plane switchover) for a “grace period”, and thus help minimize disruption during that time (assuming the standby control plane comes up). GR is based on extensions per routing protocol, which are interoperable across vendors. The downside of the grace period is huge when the peer completely fails and never comes up, because that slows down the overall network convergence, which brings us to the final concept: nonstop routing (NSR).

    NSR is an internal (vendor-specific) mechanism to extend the awareness of routing to the standby control plane so that in case of failover, the newly active control plane can take charge of the already established sessions.

    BGP Graceful-Restart Feature

    The BGP Graceful-Restart (GR) feature allows a BGP speaker to express its ability to preserve forwarding state during a BGP restart or Route Processor (RP) switchover. In other words, it is the capability exchanged between BGP speakers to indicate the ability to perform Nonstop Forwarding (NSF). This helps minimize the impact on services caused by a BGP restart. Especially in large network deployments, where BGP carries a large number of prefixes, a BGP restart, particularly on a route-reflector (RR) router, can have a severe performance and service impact and can lead to major outages.

    Example:

    R1 is acting as the RR and is peering with multiple clients. If there is a BGP restart or RP switchover on R1, the peers detect the session flap and propagate routing updates throughout the network. This can lead to increased CPU utilization if the RR is holding a large BGP table. Traffic destined to the prefixes that were removed is impacted.

    Impact of Node Failure in a Network with BGP Route Reflectors

    RFC 4724 defines the GR mechanism for BGP. The BGP GR was developed with the following motivations:

    • Avoid widespread routing changes.
    • Decrease control plane overhead throughout the network.
    • Enhance overall stability of routing.

    A GR-capable device announces its ability to perform GR to its BGP peer. It also initiates the graceful-restart process when an RP switchover occurs, and it acts as a GR-aware device. A GR-aware device, also described as operating in GR helper mode, is capable of understanding that a peer router is transitioning and takes appropriate actions based on the configuration or default timers.

    GR capability should always be enabled for all routing protocols, especially when the routers are running with dual route processors (RP) and perform a switchover in case of any failure instance. Because BGP runs on TCP, GR should be enabled on both the peering devices. After GR is configured or enabled on both peering devices, reset the BGP session to exchange the capability and activate the GR feature.

    Note: GR is always on by default for non-TCP–based protocols such as Interior Gateway Protocols (IGPs).

    BGP GR is an optional feature and is not enabled by default. BGP peers announce GR capability in the BGP OPEN message. Within the OPEN message, the following information is negotiated:

    • Restart Flag: This bit indicates if a peer sending the GR capability has just restarted.
    • Restart Time: Indicates the length of time that the sender of the GR capability requires to complete a restart. The restart timer also helps in speeding up convergence in the event the peer never comes back up after a restart.
    • AFI/SAFI: Address-family for which GR is supported.
    • AFI Flags: It contains a Forwarding State bit. This bit indicates whether the peer sending the GR capability has preserved forwarding during the previous restart.
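
    As a rough illustration of how these fields are laid out, the sketch below packs and unpacks the GR capability following the wire format in RFC 4724: capability code 64, a 4-bit flags field whose high bit is the Restart bit, a 12-bit Restart Time, and then one AFI/SAFI/flags entry per address family, with the Forwarding State bit as the high bit of the flags octet. The function names are hypothetical, and the code is not taken from any router implementation.

```python
import struct

GR_CAPABILITY_CODE = 64  # Graceful Restart capability code (RFC 4724)
RESTART_BIT = 0x8000     # high bit of the 16-bit flags + restart-time field
FORWARDING_BIT = 0x80    # high bit of the per-address-family flags octet


def encode_gr_capability(restarting, restart_time, families):
    """Build the capability TLV.

    families is a list of (afi, safi, forwarding_preserved) tuples.
    """
    if not 0 <= restart_time <= 0x0FFF:
        raise ValueError("restart time must fit in 12 bits (0-4095 seconds)")
    head = (RESTART_BIT if restarting else 0) | restart_time
    value = struct.pack("!H", head)
    for afi, safi, preserved in families:
        value += struct.pack("!HBB", afi, safi, FORWARDING_BIT if preserved else 0)
    return struct.pack("!BB", GR_CAPABILITY_CODE, len(value)) + value


def decode_gr_capability(data):
    """Return (restarting, restart_time, families) from a capability TLV."""
    code, length = struct.unpack_from("!BB", data, 0)
    if code != GR_CAPABILITY_CODE:
        raise ValueError("not a Graceful Restart capability")
    (head,) = struct.unpack_from("!H", data, 2)
    families = []
    # Each address-family entry is 4 octets and starts after the 2-octet head.
    for offset in range(4, 2 + length, 4):
        afi, safi, flags = struct.unpack_from("!HBB", data, offset)
        families.append((afi, safi, bool(flags & FORWARDING_BIT)))
    return bool(head & RESTART_BIT), head & 0x0FFF, families


# A restarting speaker, 120-second restart time, IPv4 unicast (AFI 1, SAFI 1)
# with forwarding state preserved.
cap = encode_gr_capability(True, 120, [(1, 1, True)])
print(cap.hex())  # 4006807800010180
print(decode_gr_capability(cap))
```

    Because the Restart Time shares 16 bits with the flags, it cannot exceed 4095 seconds.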

    When a BGP restart happens on the peer router or when an RP switchover occurs, the routes currently held in the forwarding table (that is, in hardware) are marked as stale. This way, the forwarding state is preserved, because the control plane and the forwarding plane operate independently.

    1. On the restarting peer (where the switchover occurred), BGP on the newly active RP starts to establish sessions with all the configured peers.
    2. BGP on the other side, the nonrestarting side, sees new connection requests coming in while BGP already is in established state. Such an event is an indication for the nonrestarting peer that the peer has restarted. At this point, the restarting peer sends the GR capability with Restart State bit set to 1 and Forwarding State bit set to 1 for the AFI/SAFIs.
    3. The nonrestarting peer at this point cleans up old (dead) BGP sessions and marks all the routes in the BGP table that are received from the restarting peer as stale.
      • If the restarting peer never reestablishes the BGP session, the nonrestarting peer purges all stale routes after the Restart Time expires.
      • The nonrestarting peer sends an initial routing table update, followed by an End-of-RIB (EoR) marker.
      • Restarting peer delays best-path calculation for an AFI until after receiving EoR from all peers except for those that are not GR capable or for the ones that have Restart State bit set.
    4. The restarting peer finally generates updates for its peers and sends the EoR marker for each AFI after the initial table is sent.
    5. The nonrestarting peers receive the routing updates from the restarting peer and remove the stale marking from any refreshed route. They purge any remaining stale routes after the EoR is received from the restarting peer or the Stale Path Timer expires.
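
    The nonrestarting (helper) side of these steps can be sketched as a small event-driven model. This is a simplified Python illustration, not router code: the class GrHelper and its method names are hypothetical, routes are reduced to prefix strings, and the timers are represented only by the events that fire when they expire.

```python
class GrHelper:
    """Tracks the routes learned from one restarting peer."""

    def __init__(self, prefixes):
        # Map each prefix to a flag that says whether it is currently stale.
        self.routes = dict.fromkeys(prefixes, False)

    def on_peer_restart(self):
        """Step 3: keep the routes, but mark all of them as stale."""
        self.routes = dict.fromkeys(self.routes, True)

    def on_update(self, prefix):
        """Step 5: a refreshed (or new) route is no longer stale."""
        self.routes[prefix] = False

    def _purge_stale(self):
        purged = sorted(p for p, stale in self.routes.items() if stale)
        self.routes = {p: s for p, s in self.routes.items() if not s}
        return purged

    def on_eor(self):
        """EoR received: whatever was not refreshed is removed."""
        return self._purge_stale()

    def on_stalepath_timer_expired(self):
        """EoR never arrived in time: remove the remaining stale routes."""
        return self._purge_stale()

    def on_restart_timer_expired(self):
        """Peer never re-established: every route is still stale and is removed."""
        return self._purge_stale()


helper = GrHelper(["10.1.0.0/16", "10.2.0.0/16", "10.3.0.0/16"])
helper.on_peer_restart()
helper.on_update("10.1.0.0/16")
helper.on_update("10.2.0.0/16")
print(helper.on_eor())         # ['10.3.0.0/16']
print(sorted(helper.routes))   # ['10.1.0.0/16', '10.2.0.0/16']
```

    All three purge events do the same thing to the table; they differ only in when they fire, which is what the restart-time and stalepath-time settings control.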

    Configuration

    BGP GR is an optional feature and is not enabled by default.

    • Use the command bgp graceful-restart to enable GR globally.
    • Use the command bgp graceful-restart restart-time value to set the GR restart timer.
    • Use the command bgp graceful-restart stalepath-time value to set the maximum time for which the router maintains the stale path entries in case it does not receive an EoR from the restarting peer.

    The GR Restart Timer, which defaults to 120 seconds, takes care of clearing the stale path entries in case the BGP peer does not come up within this time period.

    If the BGP session is already in established state before GR configuration, the BGP sessions are required to be reset in order to exchange the GR capability.

    The GR capability is verified by using the command show bgp afi safi neighbors ip-address. Notice that in the command output, the GR capability is in the advertised and received state. If either the advertised or the received state is missing, one of the peers does not have GR configured, or GR was configured after the session came up.

    				
    					// IOS and NX-OS
    router bgp 100
     bgp graceful-restart
     bgp graceful-restart restart-time 300
     bgp graceful-restart stalepath-time 400
    
    // IOS XR
    router bgp 100
     bgp graceful-restart
     bgp graceful-restart restart-time 300
     bgp graceful-restart stalepath-time 400
     bgp graceful-restart purge-time 400
     commit
    				
    			

    Not all peers are GR capable, and not all of them need to be. GR can be enabled on a per-neighbor basis while remaining disabled globally, so that the GR capability is exchanged only with those neighbors for which forwarding should see little or no impact.

    • On Cisco IOS XR and NX-OS, enable GR for an individual neighbor with the graceful-restart command in neighbor configuration mode.
    • On Cisco IOS software, use the command neighbor ip-address ha-mode graceful-restart.

    				
    					// IOS
    router bgp 100
     neighbor 192.168.2.2 ha-mode graceful-restart
    
    // NX-OS and IOS XR
    router bgp 100
     neighbor 192.168.1.1
      graceful-restart
    				
    			

    GR/NSF

    Cisco’s implementation of GR assumes NSF is enabled and tells the peers: “If I ever drop this session, it is because I am failing over from primary RP to secondary RP and will keep forwarding packets.” This makes the peer think that it needs to keep sending the packets.

    This scenario works as long as there is no reload or reboot on the router. If the router goes down, the neighbor router keeps sending the packets to this router, instead of forwarding the traffic to a working path, assuming the router that restarted is performing a switchover and it has its Forwarding Information Base (FIB) updated. This causes the traffic to black hole and causes an outage.

    The problem is not with the feature itself but with the understanding between GR and NSF. GR does not mean that NSF is enabled but only assumes that NSF is enabled on the router. NSF is not configurable but is enabled by default when the router is running in Stateful Switchover (SSO) mode. NSF can also be defined as a function to checkpoint the FIB on the standby router.

    Stateful Switchover (SSO) is a redundancy feature that allows a Cisco device with two route processors to synchronize router configuration and control plane state information. In a modular chassis with dual supervisors, NSF/SSO synchronizes information between the primary and backup supervisor, allowing for rapid supervisor switchover in case the primary fails.

    Note: It is important to understand routers’ and switches’ different high-availability operating modes with dual RPs.

    • Stateful Switchover (SSO): Failover from the active RP (crashing or reloading) to the standby RP (which takes over as the active role) where state is preserved and the router was in hot-standby mode before the switchover.
    • RPR+: RP redundancy mode where standby RP is partially initialized, but there is no synchronization of state.

    Features like NSF, Nonstop Routing (NSR), and GR require the SSO state.

    Comparison of the RPR, RPR+, and SSO redundancy modes:

    • Failover time
      • RPR: 2-4 minutes
      • RPR+: 30-60 seconds
      • SSO: 2-4 seconds
    • Status in “show module” output
      • RPR: Cold
      • RPR+: Warm
      • SSO: Hot
    • Backup SUP engine status
      • RPR: The backup SUP engine is partially initialized and must reload every switch module after the primary engine fails.
      • RPR+: The backup SUP engine is partially initialized but doesn’t need to reload each switch module after the primary engine fails.
      • SSO: The backup SUP engine is completely initialized and Layer 2 information is synchronized with the primary engine.
    • Configuration
      • RPR: redundancy, then mode rpr
      • RPR+: redundancy, then mode rpr-plus
      • SSO: redundancy, then mode sso
    • FIB table status
      • RPR: The backup SUP engine doesn’t have the FIB table synchronized. All tables must be rebuilt after the backup engine is initialized.
      • RPR+: The backup SUP engine doesn’t have the FIB table synchronized. All tables must be rebuilt after the backup engine is initialized.
      • SSO: The FIB table is not flushed since it is already updated.
    • NSF
      • RPR: No support
      • RPR+: No support
      • SSO: Supported
    • NetFlow records
      • RPR: Not maintained
      • RPR+: Not maintained
      • SSO: Maintained

  • BGP Graceful Restart (GR)

    BGP Graceful-Restart

    The BGP Graceful-Restart (GR) feature allows a BGP speaker to express its ability to preserve forwarding state during a Border Gateway Protocol (BGP) restart or Route Processor (RP) switchover. In other words, it is the capability exchanged between BGP speakers to indicate the ability to perform Nonstop Forwarding (NSF). This helps minimize the impact on services caused by a BGP restart.

    If the control plane could restart without impacting the data used by the data plane, it should be possible to install new software, replace certain pieces of hardware, and perform other such tasks without causing the network to lose routing through the router.

    Beyond the local control plane and data plane in the router, however, we need to consider how the peers of this router will react if the local control plane, and the routing protocol processes along with it, are restarted.

    Scenario 1:

    • Router C would learn two paths for every reachable destination within or through AS65500, with router A chosen as the best path for each route.
    • If the control plane on router A restarts, router C will continue forwarding traffic through router A until the BGP peering session fails, due to lost keepalives, for instance.
    • At this point, router C would recalculate the best path for each route in its local tables, redirecting traffic to router B.

    Under normal circumstances, this is what we would want; in fact, if router A fails, we would want router C to detect this failure as quickly as possible, and switch all of its traffic to router B as soon as it can. But if router A is still capable of forwarding traffic, because its data plane is capable of retaining the information required to continue forwarding traffic (through NonStop Forwarding, or NSF), then the extra network disturbance of detecting the failure and recalculating paths is undesirable.

    In some way, then, router A must keep router C from switching its best paths to router B. The most apparent way to accomplish this would be for router A to continue sending keepalives to router C while it is rebooting, or immediately on finishing its reload, so router C will continue to forward traffic through router A. However, BGP is transported over TCP, whose sequence numbers are difficult, if not impossible, to maintain across system restarts.

    There are two ways this problem can be addressed:

    1. By maintaining routing protocol state in some secondary memory within router A and referring to this state information while router A’s control plane is restarting.
    2. By introducing some signaling within the routing protocol, which allows router A’s routing process to recover the necessary state once it has restarted, including “I’ll be back” signaling, which indicates a router is simply restarting and not being removed from the network.

    There are two changes to the BGP protocol that are used to support graceful restart between two peers:

    1. A new capability that is used to indicate if a BGP speaker supports graceful restart, or can support a peer that is restarting, along with other information about the restart process.
    2. An End-of-RIB marker, which indicates that a BGP speaker has finished sending all its local routing information. This is implemented as an empty UPDATE message; for IPv4 unicast, it is an UPDATE with no withdrawn routes and no advertised prefixes.
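
    To make the End-of-RIB marker concrete, the following Python sketch builds the IPv4 unicast form described in RFC 4724: a BGP UPDATE message that carries no withdrawn routes, no path attributes, and no NLRI. The header layout (16-octet all-ones marker, 2-octet length, 1-octet type) follows RFC 4271. The function names are hypothetical, and other address families use a different encoding that is not shown here.

```python
import struct

BGP_MARKER = b"\xff" * 16   # all-ones marker field of the BGP header
BGP_HEADER_LEN = 19         # marker (16) + length (2) + type (1)
BGP_TYPE_UPDATE = 2


def ipv4_unicast_end_of_rib():
    """Return the octets of an IPv4 unicast End-of-RIB marker."""
    # Withdrawn Routes Length = 0 and Total Path Attribute Length = 0,
    # with no NLRI following.
    body = struct.pack("!HH", 0, 0)
    header = BGP_MARKER + struct.pack("!HB", BGP_HEADER_LEN + len(body), BGP_TYPE_UPDATE)
    return header + body


def is_ipv4_unicast_end_of_rib(message):
    """True if the message is exactly the minimum-length empty UPDATE."""
    return message == ipv4_unicast_end_of_rib()


eor = ipv4_unicast_end_of_rib()
print(len(eor))   # 23
print(eor.hex())
```

    The result is the shortest possible UPDATE message (23 octets), which is why a receiver can recognize it without any new message type.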

    The new Graceful Restart capability includes the following:

    • The Restart State bit, indicating if this BGP speaker is currently restarting or not.
    • The Restart Timer, an indication of how long the speaker’s peer should maintain state before declaring the graceful restart a failure, and resetting the BGP session.
    • A series of address family identifiers (AFIs) and subsequent address family identifiers (SAFIs), one for each address family the peers are exchanging routing information for.
    • For each address family, a Forwarding State bit indicating if forwarding state has been preserved for this address family.

    Graceful-Restart Process

    The following steps illustrate the process used by a pair of BGP speakers when one of them restarts:

    1. Router A’s control plane restarts; the forwarding tables used by the data plane are preserved.
    2. When Router A’s control plane restarts, it marks any preserved routing or forwarding information as stale.
    3. Router A sends the graceful restart capability along with the BGP Open message. It sets the Restart bit in the graceful restart capability and includes a TLV for each address family it has preserved forwarding state for, with the Forwarding bit set on each one.
    4. When Router B receives this BGP Open message, it will examine its current open connections and realize that this Open message is from a BGP speaker it is already peered to. Router B will maintain all the information it has received from Router A, also maintaining forwarding state it built based on the received routing information.
      • Because Router A advertised that it supports graceful restart, Router B will:
        • Keep the routes from Router A in its RIB and FIB
        • Mark all routes in RIB as Stale
        • Continue forwarding traffic for those routes in the FIB
    5. Router B now starts its restart timer; if this timer expires before the peering relationship between Routers A and B is fully formed, Router B will reset the session.
    6. Router A’s control plane comes back up and sends a Route Refresh Request to Router B.
    7. Router B begins sending the contents of its BGP table to Router A.
    8. When Router B has finished sending the contents of its BGP table to Router A, it transmits an End-of-RIB marker, to let Router A know it has received all routing information.
    9. Router A, when it receives the End-of-RIB marker from Router B, runs BGP’s best path algorithm across the information it has received and installs the required routing information into the local routing tables, which should also update the local forwarding tables.
    10. Router A now removes any routing information that has not been refreshed by the route refresh.

    Scenario 2

    Graceful Restart Flow Example:

    • All three routers exchange OPEN messages.
      • GR Capability = “I support graceful restart”
        • Flags:
          • R = 0 “I have not restarted”
          • Restart Time = 30 seconds “If I restart in the future, I expect to be UP in 30 seconds”
          • AFI/SAFI = IPv4 Unicast “I support GR for IPv4-Unicast”
          • F = 0 “Relevant only after a restart”
    • R1 advertises some prefixes: 1.1.0.0/16 and 2.2.0.0/16.
    • R2 receives updates and installs entries in its RIB/FIB.
    • R2 propagates the BGP updates received to R3.
    • R3 installs entries in its RIB/FIB.

    R2’s Control Plane Restarts

    Forwarding plane on R2

    • Keeps forwarding traffic using the routes installed in the FIB to 1.1.0.0/16 and 2.2.0.0/16.
    • Marks the routes in FIB as stale.

    R3 notices that the BGP session with R2 has gone down (for example, through BFD or hold-time expiry)

    • Normally, R3 would remove the BGP routes received from R2 (namely 1.1.0.0/16 and 2.2.0.0/16) from its RIB and FIB.

    However, because R2 advertised that it supports graceful restart, R3 will:

    • Keep the routes from R2 in its RIB/FIB
    • Mark all routes in the RIB as stale
    • Continue forwarding traffic for those routes in the FIB

    Topology Change

    • While R2 is down, something changes in the topology of the network. Let’s say R1 stops advertising prefix 1.1.0.0/16.
    • R1 withdraws route 1.1.0.0/16 from its RIB/FIB, and sends a BGP withdraw to all of its neighbors.
    • However, R1 cannot send a withdraw message to R2 because the R1-R2 BGP session is down at this point. We will recover from this “getting out of sync” problem later on.

    R2’s Control Plane comes back up

    • We assume it took less than 30 seconds for the control plane to come back up.
      • If it took longer, R3 would have “given up” and flushed the routes learned from R2 from its RIB/FIB.
    • The BGP sessions come back up:
      • GR capability “I support graceful restart”
      • Flags:
        • R = 1 “I did restart”
        • Restart Time = 30 “If I restart in the future, I expect to be done in 30 seconds”
        • AFI SAFI = IPv4 Unicast “I support GR for IPv4-Unicast”
        • F = 1 “** I DID PRESERVE FORWARDING STATE IN THE FIB **”

    Re-synchronization

    • R2 knows it restarted, so it is NOT going to send any updates until it has received all updates from its neighbors (R1 and R3), selected the best routes, and updated its RIB/FIB.
    • R1 originates prefixes 2.2.0.0/16 (but not 1.1.0.0/16 anymore due to the topology change).
      • R2 installs 2.2.0.0/16 in its RIB/FIB.
      • R2 already had an entry for 2.2.0.0/16 in its FIB which was marked stale. This stale marking is now removed; it is fresh again.
      • R2 does not receive a 1.1.0.0/16 update from R1. Hence, R2 does not have an entry for 1.1.0.0/16 in its RIB. But R2 does still have an entry for 1.1.0.0/16 in its FIB, which is and remains marked stale.
      • R1 has finished sending all routes to R2, so it sends an End-of-RIB marker to R2.

    At this point, R2 has received an End-of-RIB marker from R1, but not yet from R3. So, it does not yet take any action (it needs to have received an End-of-RIB marker from all neighbors).

    • R3 does not have any prefixes to send to R2, so it immediately sends an End-of-RIB marker.

    At this point, R2 has received End-of-RIB markers from all of its neighbors (R1 and R3), so it will take the following actions:

    • R2 will run the best route selection process for every destination prefix in its BGP table (in this example only 2.2.0.0/16)
    • R2 will install the selected best route for every prefix in the RIB into the FIB (only 2.2.0.0/16)
    • R2 will flush any remaining stale routes from the FIB (in this case 1.1.0.0/16)
    • R2 will start sending updates to advertise the routes in its RIB to the neighbors.
    • R2 propagates the BGP updates received from R1, to R3.
      • R2 has finished sending all routes to R3, so it sends an End-of-RIB marker to R3.
      • Note that R2 does not have routes to send to R1 (specifically it does not send the route for 2.2.0.0/16 back to R1 because of the AS-path loop). So, R2 immediately sends an End-of-RIB marker to R1 as well.
      • When R3 receives the End-of-RIB marker from R2, it flushes all routes learned from R2 that are still marked stale (in this case 1.1.0.0/16) from both its RIB and FIB.
      • R1 does the same when it receives the End-of-RIB marker from R2, but in this example there is nothing to flush since R2 did not advertise any routes to R1.
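    The stale-marking and flushing behavior on the restarting router (R2 in the walkthrough above) can be sketched as a small state holder. This is an illustrative model only, not a real BGP implementation; the class and method names are hypothetical.

```python
class RestartingFib:
    """Toy model of a FIB preserved across a control-plane restart."""

    def __init__(self, prefixes, neighbors):
        # prefix -> stale flag; nothing is stale before the restart
        self.entries = {prefix: False for prefix in prefixes}
        # neighbors whose End-of-RIB marker has not arrived yet
        self.waiting_for = set(neighbors)

    def control_plane_restarted(self):
        """Mark every preserved entry as stale."""
        for prefix in self.entries:
            self.entries[prefix] = True

    def receive_update(self, prefix):
        """A refreshed (or new) route clears the stale flag."""
        self.entries[prefix] = False

    def receive_end_of_rib(self, neighbor):
        """Flush stale entries only after ALL neighbors have sent End-of-RIB."""
        self.waiting_for.discard(neighbor)
        if self.waiting_for:
            return []
        flushed = sorted(p for p, stale in self.entries.items() if stale)
        for prefix in flushed:
            del self.entries[prefix]
        return flushed
```

    Replaying the scenario: create the FIB with 1.1.0.0/16 and 2.2.0.0/16 and neighbors R1 and R3, call `control_plane_restarted()`, refresh 2.2.0.0/16, then deliver End-of-RIB from R1 and R3. Nothing is flushed after R1's marker; after R3's marker, only 1.1.0.0/16 is removed.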

    GR Deployment Considerations

    When deploying graceful restart for any routing protocol, there are two issues you need to keep in mind:

    • the impact of partial deployments, and
    • the interactions between BGP and the underlying IGP if both are not capable of and configured for graceful restart.

    In this network, we assume router D is not graceful restart capable, or it is not configured to respond to peers gracefully restarting. When B’s control plane restarts, it signals router A so A doesn’t reset its peering session and continues forwarding through router B. However, router D doesn’t recognize this signaling, and it resets its session with B.

    Instead, it reconverges on the path through router C as the best path and drops traffic along the path until the reconvergence is complete. In this case, then, the path between B and D will become asymmetric; in other cases, it’s possible to form a routing loop. It’s also possible that router D will begin rejecting the traffic forwarded by router B because its unicast reverse path forwarding check will fail.

    The solution to this problem is to make certain D is graceful restart capable, even if it’s not configured to restart gracefully on failure (or isn’t capable of it). Router D must be able to understand and respond correctly to router B’s signals during a graceful restart to prevent network problems from developing.

    If BGP is not capable of (or isn’t configured for) graceful restart, but the underlying interior gateway protocol is, some amount of traffic can be dropped while BGP is reconverging after a control plane restart.

    • Assume router A has chosen the routes learned through B as its best paths.
    • When router B restarts, A will reset its BGP peering session with B, but it will continue learning the same information through C. In fact, C will be setting the next hop on the BGP routes it is learning to the same next hop as B did before it reset.
    • Router A will not, however, reset its OSPF adjacency with B. Since OSPF is configured for graceful restart, it will believe that all paths reachable before the restart are still reachable through B, including the next hop for the routes it is learning through router C.
    • So router A will continue forwarding packets through router B, because router B is still the best path to the destinations learned through router C, based on the interior gateway protocol cost to the next hop, which is D. However, router B, when it receives these packets, may not have the forwarding information needed to forward them.
    • Since the BGP process on router B has reset, it’s likely that all the BGP learned routing information in the forwarding tables has been discarded, even though the OSPF learned forwarding information has been retained.
    • Router B, then, will drop all the BGP traffic forwarded along this path by router A. To resolve this problem, always make certain BGP and the underlying interior gateway protocols are both capable of and configured for graceful restart.

  • Understanding BGP Route Convergence

    BGP Route Convergence

    What is Routing Convergence? Routing convergence can be broadly defined as how quickly a routing protocol can become stable after changes occur in the network, for example, a protocol or link flap.

    Faster convergence leads to higher availability and improved network stability. Thus it is important that before the network is deployed in production, convergence time is properly calculated with thorough testing. But what is convergence time?

    If a link on the primary path fails, the best path is affected and traffic is lost. The failure event triggers the computation of a next-best path. The convergence time is the period of traffic loss: from the moment of the failure, while no alternate path is available to forward the traffic, to the point where traffic starts flowing again.

    Like any other dynamic routing protocol, BGP accepts routing updates from its neighbors. If a received route is selected as the best route, BGP advertises it to its peers, except the peer from which it was received. BGP uses an explicit withdrawal section in the update message to inform peers of the loss of a path so they can update their BGP tables accordingly.

    Topology with Primary and Secondary Path

    As networks grow larger, maintaining an ever-increasing number of Transmission Control Protocol (TCP) sessions and routes can eventually pose scalability challenges and convergence issues, especially for service provider and enterprise networks. As the scale of the network increases, the BGP process has to process all the routes present in the BGP table and update its peers. In addition, a router processing updates in such a scaled environment demands more memory and CPU resources. Because BGP is a key protocol for the Internet, it is important to ensure that BGP converges quickly even at increased scale.

    BGP convergence depends on various factors. BGP convergence is all about the speed of the following:

    • Establishing sessions with a number of peers
    • Locally generating all the BGP paths, either via network statements and redistribution of static/connected/IGP routes, and/or from other components for other address families (for example, Multicast Virtual Private Network (MVPN) from multicast, Layer 2 Virtual Private Network (L2VPN) from the l2vpn manager, and so on)
    • Sending and receiving multiple BGP tables; that is, different BGP address families to/from each peer
    • Upon receiving all the paths from peers, performing the best-path calculation to find the best path and/or multipath, additional path, and backup path
    • Installing the best path into multiple routing tables, such as the default or Virtual Routing and Forwarding (VRF) routing table
    • Running the import and export mechanism
    • For other address families, such as l2vpn or multicast, passing the path calculation result to different lower-layer components

    BGP uses a lot of CPU cycles when processing BGP updates and requires memory for maintaining BGP peers and routes in the BGP table. Based on the role of the BGP router in the network, appropriate hardware should be chosen. The more memory a router has, the more routes it can support, much like how a router with a faster CPU can support a larger number of peers.

    Because BGP updates rely on TCP, optimizing router resources such as memory, and TCP session parameters such as maximum segment size (MSS), path MTU discovery, interface input queues, TCP window size, and so on, helps improve convergence.

    Scenario

    • Assume R1’s session to R7 has just come up, and follow the path that prefix 20.0.0.0/8 takes as it propagates through AS 300.

    BGP Read-Only Mode

    Upon session establishment and exchange of the BGP OPEN messages, the router enters “BGP Read-Only Mode”. This means that R1 will not start the BGP Best-Path Selection Process until it either receives all prefixes from R7 or reaches the BGP read-only mode timeout. The timeout is defined using the BGP process command bgp update-delay.

    The reason to hold the BGP best-path selection process is to ensure that the peer has supplied us all routing information. This minimizes the number of best-path selection process runs, simplifies update generation, and ensures better prefix-per-message packing, thus improving transport efficiency.

    BGP Update-Delay Timer Read-Only Mode (Update Reception)

    This timer ensures that the peer has supplied us all routing information in order to minimize the number of BGP best path runs, simplify update generation and to better pack routes into TCP segments.

    When BGP establishes its first peer, a timer called the update-delay is triggered. It is set to 120 seconds by default, and the BGP best-path algorithm will not run until this timer expires or until the peer signals that it has sent all routes. The peer can signal that it is done by sending either a BGP Keepalive or the BGP End-of-RIB message, which is normally used with graceful restart (GR).

    The BGP End-of-RIB message is normally used for BGP graceful restart, but it can also be used to explicitly signal the end of the BGP UPDATE exchange process. Even if the BGP process does not support the End-of-RIB marker, Cisco’s BGP implementation always sends a Keepalive message when it finishes sending updates to a peer.

    The best-path selection delay is longer when peers have to exchange larger routing tables, or when the underlying TCP transport and router ingress queue settings make the exchange slower.
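    The read-only mode decision described above reduces to a simple rule: best-path selection may start once every peer has signaled completion, or once the update-delay timer has expired, whichever comes first. The sketch below models only that rule; the class name and the idea of passing elapsed seconds explicitly are assumptions made for illustration.

```python
class ReadOnlyMode:
    """Toy model of the BGP read-only (update-delay) gate."""

    def __init__(self, peers, update_delay=120.0):
        self.pending = set(peers)         # peers that have not finished sending
        self.update_delay = update_delay  # seconds; 120 is the stated default

    def peer_finished(self, peer):
        """Call when the peer sends a Keepalive or End-of-RIB after its updates."""
        self.pending.discard(peer)

    def may_run_best_path(self, elapsed):
        """True once all peers are done or the update-delay has expired."""
        return not self.pending or elapsed >= self.update_delay
```

    With a single peer R7, the gate stays closed at 10 seconds, opens as soon as `peer_finished("R7")` is called, and would otherwise open on its own when the elapsed time reaches the configured delay.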

    Defaults: 120 seconds
    bgp update-delay seconds [always]
    no bgp update-delay [seconds] [always]

    router bgp 64530
     bgp update-delay 240

    BGP Best-Path Selection

    When a BGP router leaves read-only mode, it starts the best-path selection process. This process walks over the new information and compares it with the local BGP RIB contents, selecting the best path for every prefix. As soon as the best-path process is finished, BGP has to upload all routes to the RIB before advertising them to its peers.

    This is a requirement of distance vector protocols: the routing information must be active in the RIB before it is propagated further. The RIB update will in turn trigger a FIB information upload to the router’s line cards, if the platform supports distributed forwarding. Both RIB and FIB updates are time-consuming and take time proportional to the number of prefixes being updated.

    BGP Advertisement-Interval Timer (Update Generation)

    The primary cause of slow BGP convergence is the Minimum Route Advertisement Interval (MRAI). This timer forces BGP routers to wait for at least that amount of time before sending another advertisement for the same prefix.

    The goal of this timer is to reduce route churn and to produce fewer BGP updates but it does slow down convergence. So instead of using flash updates triggered by a change, BGP waits for the expiration of the BGP advertisement-interval before sending out the BGP update. In this way if there are other changes that should be advertised the BGP process can prepare a more efficient update.

    After the information has been committed to the RIB, the router needs to replicate the best paths to every peer that should receive them. The replication process can be the most memory- and CPU-intensive step, because the process has to perform a full BGP table walk for every peer and construct the output for the corresponding BGP Adj-RIB-Out. This may require additional transient memory in the course of the update batch calculation. However, the update generation process is highly optimized in Cisco’s BGP implementation by means of dynamic update groups.

    With dynamic update groups, the BGP process dynamically finds all neighbors sharing the same outbound policies, elects the peer with the lowest IP address as the group leader, and generates the update batch only for the group leader. All other members of the same group receive the same updates.
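    The grouping rule described above can be sketched as follows: neighbors are bucketed by outbound policy, and the numerically lowest IP address in each bucket becomes the leader. This is a simplified illustration; the function name, the neighbor addresses, and the policy names are hypothetical, and a real implementation compares the full outbound policy rather than a single label.

```python
import ipaddress
from collections import defaultdict


def build_update_groups(neighbors):
    """Group neighbors (IP string -> outbound policy label) into update groups."""
    buckets = defaultdict(list)
    for address, policy in neighbors.items():
        buckets[policy].append(address)

    groups = {}
    for policy, members in buckets.items():
        # Sort numerically, not as strings, so 10.0.0.9 sorts before 10.0.0.11
        members.sort(key=ipaddress.ip_address)
        groups[policy] = {"leader": members[0], "members": members}
    return groups
```

    For example, one EBGP-style neighbor with its own policy and two route reflectors sharing a policy yield two groups, so only two update batches need to be generated instead of three.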

    In our case, R1 has to generate two update sets: one for R5 and another for the pair of RR1 and RR2 route reflectors.

    R1 starts sending updates to R5, RR1, and RR2. This will take some time, depending on the BGP TCP transport settings and BGP table size. However, before R1 starts sending any updates to a peer or update group, it checks whether the Advertisement-Interval timer is running for this peer.

    A BGP speaker starts this timer on a per-peer basis every time it is done sending the full batch of updates to the peer. If the subsequent batch is prepared to be sent and the timer is still running, the update will be delayed until the timer expires. This is a dampening mechanism to prevent unstable peers from flooding the network with updates. This timer really starts playing its role only for “Down-Up” or “Up-Down” convergence, as any rapid flapping changes are delayed for the amount of advertisement-interval seconds.
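    The per-peer timer behavior described above can be sketched as a small batching object: changes accumulate while the timer runs, and only the latest state of each prefix is sent once the interval has elapsed. This is a toy model with hypothetical names; time is passed in explicitly as seconds so the logic is easy to follow.

```python
class AdvertisementTimer:
    """Toy model of a per-peer advertisement-interval timer."""

    def __init__(self, interval):
        self.interval = interval     # seconds between update batches
        self.last_batch_sent = None  # time the previous batch finished
        self.pending = {}            # prefix -> latest state to advertise

    def route_changed(self, prefix, state):
        """Record a change; a later change overwrites an earlier one."""
        self.pending[prefix] = state

    def try_send(self, now):
        """Return the batch to send, or None if the timer is still running."""
        running = (self.last_batch_sent is not None
                   and now - self.last_batch_sent < self.interval)
        if running:
            return None
        batch, self.pending = self.pending, {}
        if batch:
            self.last_batch_sent = now  # restart the timer after a real batch
        return batch
```

    With a 30-second interval, a prefix that flaps down and up again 10 seconds after the first batch is held back; at 30 seconds a single update carrying only the final state is sent.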

    The process repeats itself on RR1 and RR2, starting with the incoming UPDATE packet reception, best-path selection and update generation.

    As we can see, the main limiting factors of BGP convergence are BGP table size, transport-level settings and advertisement delay. The best-path selection time is proportional to the table size as well as time required for update batching.

    Defaults: IBGP 5 seconds / EBGP 30 seconds
    Command: neighbor {ip-address | peer-group-name} advertisement-interval seconds
    If an advertised route is flapping, usually caused when an interface is unstable, a flood of UPDATE and WITHDRAWN messages occurs.
    With the default value of 30 seconds for EBGP neighbors, BGP routing updates are sent only every 30 seconds, even if a route is flapping many times during this 30-second interval.

    router bgp 1
     neighbor 10.1.1.1 remote-as 1
     neighbor 10.2.1.2 remote-as 2
     neighbor 10.1.1.1 advertisement-interval 15
     neighbor 10.2.1.2 advertisement-interval 45
    exit

    Update Generation Improvements

    The following methods improve update generation, which is the basis for any BGP convergence tuning:

    • Peer Groups
    • BGP Dynamic Update Peer Groups
    • BGP read-only mode

  • Tshoot Missing BGP Routes

    Reasons that route advertisement fails between BGP peers are as follows:

    • Next-Hop Check Failure
    • Bad Network Design
    • Validity Check Failure
    • BGP Communities
    • Route filtering

    Most of these issues can be found by using the following:

    • BGP Loc-RIB: Just because a route is missing from the Global RIB, it does not mean the route did not make it into the router’s BGP table.
      • Examine the BGP Loc-RIB to see if the prefix exists in the BGP table. It is possible that the route was installed in the BGP table but was not installed into the RIB. Viewing the local BGP table is the first step in troubleshooting any missing route.
      • show bgp afi safi
    • BGP Adj-RIB-in: The BGP Loc-RIB table contains only valid routes that passed the router’s inbound route policies.
      • Examining the BGP Adj-RIB-in table verifies whether the peer received the NLRI. If the peer received it, the local inbound route policy prevents the route from installing into the Loc-RIB table.
      • Inbound Soft Configuration is required to view the BGP Adj-RIB-in table, because the table is purged by default after all inbound route-policy processing has occurred.
    • BGP Adj-RIB-out: Viewing the BGP Adj-RIB-out table on the advertising router verifies that the route was advertised and provides a list of the BGP PAs that were included with the route.
      • In the event that the route is not present in the advertising router’s BGP Adj-RIB-out table, check the advertising router’s BGP Loc-RIB table to verify the prefix exists there.
      • Assuming the prefix is in the Loc-RIB table, but not in the Adj-RIB-out table, then the outbound route policies are preventing the advertisement of the route.
      • Contents of the BGP Adj-RIB-out are viewed with the command show bgp afi safi neighbor-ip-address [prefix/prefix-length] advertised-routes.
    • Viewing BGP Neighbor Sessions: The information contained in the BGP neighbor session varies from platform to platform, but still provides a lot of useful information, such as the number of prefixes advertised, session and address-family options, the route maps/route filters/route policy applied specifically for that neighbor. The BGP neighbor session is displayed with the command show bgp afi safi neighbor ip-address.
    • Debug Commands: Debug commands provide the most information about BGP.
      • On IOS nodes, BGP update debugs are enabled with the command debug bgp afi safi updates [in | out] [detail].
      • On IOS XR nodes, BGP update debugs are enabled with the command debug bgp update [afi afi safi] [in | out].
      • On NX-OS nodes, BGP update debugs are enabled with the command debug bgp updates [in | out].

    Topology

    This topology is used to demonstrate how to troubleshoot the various reasons a route could be missing from the routing table.

    • R1 is advertising the 10.0.0.0/8 aggregate prefix.
    • R1 is advertising the 10.1.1.0/24 prefix.
    • R2 is advertising the 10.2.2.0/24 prefix.

    BGP Troubleshooting Sample Topology

    Next-Hop Check Failures

    • R3 is missing the 10.0.0.0/8 network and the 10.1.1.0/24 network from the RIB.
    • Both of the missing routes are advertised from R1.

    The first step is to check R3’s Loc-RIB BGP table. The 10.0.0.0/8 network and the 10.1.1.0/24 network are present, but notice that both entries are missing the best path marker >.

    Displaying an explicit network prefix with the command show bgp afi safi prefix/prefix-length provides some clarity for why the NLRI was not selected as a best path.

    In the output, the next-hop 10.1.12.1 is inaccessible. Let’s verify that the next-hop exists on the router with the command show ip route next-hop-IP-address.

    The next-hop IP address is not available in the RIB. There are multiple solutions to this issue that include the following:

    1. R2 advertises the peering link (10.1.12.0/24) into BGP.
      1. R3 is adjacent to R2 and receives the route with a next-hop of 10.1.23.2, which is in R3’s RIB as a directly connected route. The 10.1.12.1 next-hop IP address would then be resolvable through a recursive lookup.
    2. Establish an IGP routing protocol within AS200 (R2, R3, and R4) and advertise the peering link (R1–R2) in OSPF, but make the peering link interface passive in OSPF.
    3. On R2 configure the next-hop-self feature in the address-family for the BGP peering with R3.
      1. All EBGP routes that R2 advertises to R3 (that is, routes learned from R1) would then use R2 as their next-hop.
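    The next-hop check that fails in this example can be sketched as a lookup of the next-hop address against the prefixes in the RIB. This is a minimal sketch with a hypothetical function name; the /24 masks on R3's connected networks and the 10.1.34.0/24 link are assumptions made for illustration, and real routers apply additional rules when resolving next hops.

```python
import ipaddress


def next_hop_is_resolvable(next_hop, rib_prefixes):
    """Return True if any RIB prefix covers the BGP next-hop address."""
    address = ipaddress.ip_address(next_hop)
    return any(address in ipaddress.ip_network(prefix)
               for prefix in rib_prefixes)


# Assumed RIB contents on R3 before any fix is applied
r3_rib = ["10.1.23.0/24", "10.1.34.0/24"]
```

    With this RIB, 10.1.12.1 is not resolvable, which is why the routes from R1 are not selected as best paths. Adding 10.1.12.0/24 to the RIB (solutions 1 and 2) or rewriting the next hop to 10.1.23.2 with next-hop-self (solution 3) both make the check pass.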

    Validity Check Failure

    BGP performs a validity check upon receipt of prefixes. Specifically, BGP is looking for indicators of a loop, such as:

    • Identifying the router’s ASN in the AS-Path
    • Identifying the router’s RID as the Route-Originator ID
    • Identifying the router’s RID as the Cluster ID
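    The three loop indicators listed above can be sketched as a single check that returns the reason a received route would be rejected. This is an illustrative sketch only; the function name and argument shapes are hypothetical, and the cluster check follows the description above, where the Cluster-ID defaults to the router's RID.

```python
def validity_check_failure(local_asn, local_rid, as_path,
                           originator_id=None, cluster_list=()):
    """Return a reason string if the route looks like a loop, else None."""
    if local_asn in as_path:
        return "AS-Path loop"
    if originator_id is not None and originator_id == local_rid:
        return "Originator-ID matches local RID"
    if local_rid in cluster_list:
        return "Cluster-ID matches local RID"
    return None
```

    For example, a router in AS 200 that receives a route with AS-Path 100 200 gets "AS-Path loop", while the same route with AS-Path 100 and no reflection attributes passes the check.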

    AS-Path

    The AS-Path (BGP attribute AS_PATH) is used as a loop prevention mechanism. The AS-Path is not prepended when an NLRI is advertised to other IBGP peers. Some common scenarios for a router to identify its ASN in an NLRI’s AS-Path are as follows:

    • AS-Prepending: Industry standards dictate that the AS being prepended should be owned by your organization. However, some organizations may prepend a route with an ASN that they do not own. This is done for malicious purposes or unintentionally.
    • Route Aggregation: Default behavior for route aggregation is to not include any BGP attributes of the smaller routes that are being aggregated, which adds the atomic aggregate BGP attribute. The loss of path visibility could result in route feedback when an organization advertises an aggregate route that includes a smaller network that is advertised from your network. If the as-set keyword is used with the aggregation command, all the BGP attributes of the routes being summarized are included. This includes the AS-Path.

    After configuring the as-set keyword on R1, R1 includes the PAs from the smaller aggregate routes. For example, the 10.2.2.0/24 network that is being learned on R1 from AS200 would be aggregated into the 10.0.0.0/8 aggregate with the AS200 as part of the AS-Path.

    Detecting a router’s ASN in a route that is received from a peer can be accomplished by the following:

    • Viewing the BGP session on IOS routers.
    • Viewing the network routes that are advertised to the router.
    • Enabling debugging for BGP updates, which will indicate the AS-Path loop.

    R2 displays the BGP neighbor session details for R1. After examining the IPv4 address-family, routes were denied for an AS_PATH loop. It is important to note that the count of routes is a cumulative count of route advertisements throughout the life of that BGP session.

    The second method is to list the routes on R1 that were advertised to R2 that include the ASN of R2 (200). The 10.0.0.0/8 route includes the AS-Path of 200.

    The third method is to enable BGP debugging on R2 and initiate an inbound BGP soft-refresh.

    Originator-ID/Cluster-ID

    Another potential reason an NLRI fails the validity check is if the Originator-ID or Cluster-ID matches the receiving router’s RID. The Originator-ID is populated by a route-reflector (RR) with the advertising router’s RID, and the Cluster-ID is populated by the RR. The default Cluster-ID setting is the RR’s RID, unless it is specifically set, which is done for certain design scenarios. Checking the Originator-ID or Cluster-ID is considered a loop prevention mechanism.

    Assume that, in the sample topology, R4’s BGP RID was configured as 192.168.2.2, which unknowingly matches R2’s BGP RID.

  • Conditional Matching

    Filtering of Prefixes by Route Policy

    The last component for finding missing BGP routes is through the examination of the BGP routing policies. As stated before, BGP route policies are applied before routes are inserted into the Loc-RIB table and as prefixes leave the Loc-RIB before they are advertised to a BGP peer.

    IOS and NX-OS devices provide three methods of filtering routes inbound or outbound for a specific BGP peer. Each method could be used individually or simultaneously with other methods. The three methods are as follows:

    1. Prefix-list: A list of prefix matching specifications that permit or deny network prefixes in a top-down fashion similar to an ACL. An implicit deny is associated for any prefix that is not permitted.
    2. AS-Path ACL/Filtering: A list of regex commands that allows for the permit or deny of a network prefix based on the current AS-Path values. An implicit deny is associated for any prefix that is not permitted.
    3. Route-maps: Route-maps provide a method of conditional matching on a variety of prefix attributes and taking a variety of actions. Actions could be a simple permit or deny or could include the modification of BGP path attributes. An implicit deny is associated for any prefix that is not permitted.

    Conditional Matching

    Prefix-lists, AS-Path filtering, route-maps, and route-policy language typically use some form of conditional matching so that only certain BGP prefixes are blocked or accepted. BGP prefixes can be conditionally matched by a variety of path attributes.

    The following sections describe the most common techniques for conditionally matching a BGP prefix.

    Access Control Lists (ACL)

    Originally, ACLs were intended to provide filtering of packets flowing into or out of a network interface, similar to the functionality of a basic firewall. Today, ACLs provide a method of identifying networks within a route-map that are then used in routing protocols for filtering or manipulating.

    ACLs are composed of access control entries (ACEs), which are entries in the ACL that identify the action to be taken (permit or deny) and the relevant packet classification. Packet classification starts at the top (lowest sequence) and proceeds down (higher sequence) until a matching pattern is identified. When a match is found, the appropriate action (permit or deny) is taken and processing stops. At the end of every ACL is an implicit deny ACE, which denies all packets that did not match an earlier ACE in the ACL.

    Extended ACLs react differently when matching BGP routes than when matching IGP routes. The source fields match against the network portion of the route, and the destination fields match against the network mask. Extended ACLs were originally the only match criteria used by IOS with BGP before the introduction of prefix-lists.

    permit ip 10.0.0.0 0.0.0.0 255.255.0.0 0.0.0.0
    --> Permits only the 10.0.0.0/16 network

    permit ip 10.0.0.0 0.0.255.0 255.255.255.0 0.0.0.0
    --> Permits any 10.0.x.0 network with a /24 prefix length

    permit ip 172.16.0.0 0.0.255.255 255.255.255.0 0.0.0.255
    --> Permits any 172.16.x.x network with a /24 – /32 prefix length

    permit ip 172.16.0.0 0.0.255.255 255.255.255.128 0.0.0.127
    --> Permits any 172.16.x.x network with a /25 – /32 prefix length
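    The way an extended ACL entry is applied to a BGP route can be sketched as two wildcard comparisons: the source fields are compared against the route's network, and the destination fields against the route's mask. This is a minimal sketch with hypothetical function names, intended only to make the entries above easier to reason about.

```python
import ipaddress


def _to_int(dotted):
    """Convert a dotted-quad string to a 32-bit integer."""
    return int(ipaddress.IPv4Address(dotted))


def ace_matches(ace, prefix):
    """ace is (network, network_wildcard, mask, mask_wildcard)."""
    net, net_wild, mask, mask_wild = (_to_int(field) for field in ace)
    route = ipaddress.ip_network(prefix)
    route_net = int(route.network_address)
    route_mask = int(route.netmask)
    # Wildcard bits set to 1 are "don't care"; compare only the other bits
    care_net = ~net_wild & 0xFFFFFFFF
    care_mask = ~mask_wild & 0xFFFFFFFF
    return ((route_net & care_net) == (net & care_net)
            and (route_mask & care_mask) == (mask & care_mask))
```

    For instance, the second entry above, written as `("10.0.0.0", "0.0.255.0", "255.255.255.0", "0.0.0.0")`, matches 10.0.5.0/24 but not 10.0.5.0/25, because the mask must be exactly 255.255.255.0.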

    Prefix Matching

    Prefix lists (IOS and NX-OS) and prefix sets (IOS XR) provide another method of identifying networks in a routing protocol. They identify a specific IP address, network, or network range and allow for the selection of multiple networks with a variety of prefix lengths (subnet masks) by using a prefix match specification. This technique is preferred over the ACLs network selection method because it is easier to understand.

    The structure for a prefix match specification contains two parts: high-order bit pattern and high-order bit count, which determines the high order bits in the bit pattern that are to be matched. Some documentation refers to the high-order bit pattern as the address or network, and the high-order bit count as length or mask length.

    In the figure, the prefix match specification has the high-order bit pattern of 192.168.0.0 and a high-order bit count of 16. The high-order bit pattern has been converted to binary to demonstrate where the high-order bit count lies. Because no additional matching length parameters are included, the high-order bit count is an exact match.

    Basic Prefix Match Pattern

    The prefix match specification logic might look identical to the functionality of an access list. The true power and flexibility come from using matching length parameters to identify multiple networks with specific prefix lengths in one statement. The matching length parameter options are as follows:

    • le (less than or equal to <=)
    • ge (greater than or equal to >=)
    • both le and ge together

    The figure demonstrates a prefix match specification with a high-order bit pattern of 10.168.0.0, a high-order bit count of 13, and a matching length that must be greater than or equal to 24.

    Prefix Match Pattern with Matching Length Parameters

    • The 10.168.0.0/13 prefix does not qualify because the prefix length is less than the minimum of 24 bits.
    • The 10.168.0.0/24 prefix qualifies because it meets the matching length parameter.
    • The 10.173.1.0/28 prefix qualifies because the first 13 bits match the high-order bit pattern, and the prefix length is within the matching length parameter.
    • The 10.104.0.0/24 prefix does not qualify because the high-order bit pattern does not match within the high-order bit count.

    The figure demonstrates a prefix match specification with a high-order bit pattern of 10.0.0.0, a high-order bit count of 8, and a matching length that must be between 22 and 26.

    Prefix Match with Ineligible Matched Prefixes

    • The 10.0.0.0/8 prefix does not match because the prefix length is too short.
    • The 10.0.0.0/24 prefix qualifies because the bit pattern matches and the prefix length is between 22 and 26.
    • The 10.0.0.0/30 prefix does not match because the prefix length is too long.
    • Any prefix that starts with 10 in the first octet and has a prefix length between 22 and 26 matches.
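The matching rules from the three examples above can be sketched with Python's standard `ipaddress` module. The function name `prefix_matches` and its parameters are hypothetical. The sketch assumes that, with no length parameters, the prefix length must equal the high-order bit count exactly; that `ge` alone allows lengths up to 32; and that `le` alone allows lengths from the bit count up to `le`.

```python
import ipaddress


def prefix_matches(pattern, count, candidate, ge=None, le=None):
    """Return True if candidate (e.g. '10.173.1.0/28') satisfies the spec."""
    spec = ipaddress.ip_network(f"{pattern}/{count}", strict=False)
    route = ipaddress.ip_network(candidate)

    # Work out the allowed range of prefix lengths.
    if ge is None and le is None:
        shortest, longest = count, count          # exact match only
    else:
        shortest = count if ge is None else ge
        longest = 32 if le is None else le

    if not shortest <= route.prefixlen <= longest:
        return False

    # The first `count` bits must equal the high-order bit pattern.
    return route.subnet_of(spec)


for candidate in ("10.168.0.0/13", "10.168.0.0/24", "10.173.1.0/28", "10.104.0.0/24"):
    print(candidate, prefix_matches("10.168.0.0", 13, candidate, ge=24))
```

The loop reproduces the second example (10.168.0.0/13 with a minimum length of 24); passing `ge=22, le=26` with a pattern of 10.0.0.0 and a count of 8 reproduces the third.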

    BGP Communities

    Conditionally matching BGP communities allows for selection of routes based on the BGP communities within the route’s path attributes so that selective processing can occur in IOS route maps or IOS XR route policies. Conditionally matching on IOS and NX-OS devices requires the creation of a community list. A community list shares a similar structure to an ACL, can be standard or expanded, and can be referenced by number or name. Standard community lists are numbered 1-99 and match either well-known communities or a private community number (as-number:16-bit-number). Expanded community lists are numbered 100-500 and use regex patterns.
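As a small illustration of the as-number:16-bit-number notation, the sketch below converts a private community between its text form and the 32-bit value carried in the path attribute. It assumes a standard (32-bit) community, where the AS number occupies the upper 16 bits; the helper names `parse_community` and `format_community` are hypothetical.

```python
def parse_community(text: str) -> int:
    """Convert 'as-number:16-bit-number' into the 32-bit community value."""
    asn_text, value_text = text.split(":")
    asn, value = int(asn_text), int(value_text)
    if not (0 <= asn <= 0xFFFF and 0 <= value <= 0xFFFF):
        raise ValueError(f"both parts must fit in 16 bits: {text!r}")
    return (asn << 16) | value          # AS in the high half, number in the low half


def format_community(raw: int) -> str:
    """Convert a 32-bit community value back to 'as-number:16-bit-number'."""
    return f"{raw >> 16}:{raw & 0xFFFF}"


raw = parse_community("65000:100")
print(hex(raw), format_community(raw))
```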

    Regular Expressions (Regex)

    There may be times when conditionally matching on network prefixes is too complicated, and identifying all routes from a specific organization is preferred. In this case, path selection can be based on the BGP AS-Path.

    To parse through the large number of available ASNs (4,294,967,295), regular expressions (regex) are used. Regular expressions are based on query modifiers to select the appropriate content. The BGP table is parsed with regex using the command show bgp afi safi regexp regex-pattern. NX-OS devices require the regex-pattern to be placed within a pair of double quotes “”.

    Regex Query Modifiers

    Note: The .^$*+()[]? characters are special control characters that cannot be matched literally without the backslash (\) escape character. For example, to match on the * in the output, you would use the \* syntax.

    Examples

    Regex               Description
    ^$                  Matches an empty AS PATH, so it matches all prefixes from the local AS.
    ^100_               Matches prefixes from AS 100 that is directly connected to our AS.
    _100_               Matches prefixes that transit AS 100.
    _100$               Matches prefixes that originated in AS 100. The $ anchors the match to the end of the AS PATH.
    ^([0-9]+)_100       Matches prefixes from AS 100 where AS 100 is behind one of our directly connected ASs.
    ^100_([0-9]+)       Matches prefixes from the clients of directly connected AS 100.
    ^(100_)+([0-9]+)    Matches prefixes from the clients of directly connected AS 100, where AS 100 might be doing AS PATH prepending.
    ^\(65200\)          Matches prefixes from confederation peer 65200.
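To experiment with these patterns away from a router, the sketch below approximates them with Python's `re` module. It is an approximation, not the router's regex engine: it assumes the underscore can be translated into "start of string, end of string, space, comma, brace, or parenthesis", and the naive replacement would be wrong for an underscore inside brackets. The names `DELIMITER` and `as_path_matches` are hypothetical.

```python
import re

# Assumed Python equivalent of the Cisco "_" modifier.
DELIMITER = r"(?:^|$|[ ,{}()])"


def as_path_matches(cisco_regex: str, as_path: str) -> bool:
    """Test an AS path string (ASNs separated by spaces) against a pattern."""
    translated = cisco_regex.replace("_", DELIMITER)
    return re.search(translated, as_path) is not None


samples = ["", "100 200", "1000 200", "200 100 300", "200 100", "(65200) 100"]
for pattern in ("^$", "^100_", "_100_", "_100$", r"^\(65200\)"):
    hits = [path for path in samples if as_path_matches(pattern, path)]
    print(pattern, "->", hits)
```

The path "1000 200" is a useful negative case: without the trailing underscore, ^100 would also match AS 1000.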

    Scenario

    Underscore _

    Query Modifier Function: Matches a space.
    Scenario: Display only routes whose AS path passes through AS 100.

    Caret ^

    Query Modifier Function: Indicates the start of the string.
    Scenario: Display only routes that were advertised from AS 300.

    Dollar Sign $

    Query Modifier Function: Indicates the end of the string.
    Scenario: Display only routes that originated in AS 40.

    Brackets [ ]

    Query Modifier Function: Matches a single character or nesting within a range.
    Scenario: Display only routes with an AS that contains 11 or 14 in it.

    Hyphen -

    Query Modifier Function: Indicates a range of numbers in brackets.
    Scenario: Display only routes with the last two digits of the AS of 40, 50, 60, 70, or 80.

    Caret in Brackets [^]

    Query Modifier Function: Excludes the character listed in brackets.
    Scenario: Display only routes where the second AS from AS 100 or AS 300 does not start with 3, 4, 5, 6, 7, or 8. The first component of the regex query restricts the AS to the AS 100 or 300 with the regex query ^[13]00_, and the second component filters out ASs starting with 3-8 with the regex filter _[^3-8].

    Parentheses ( ) and Pipe |

    Query Modifier Function: Nests search patterns and provides OR functionality.
    Scenario: Display only routes where the AS_PATH ends with AS 40 or 45.

    Period .

    Query Modifier Function: Matches a single character, including a space.
    Scenario: Display only routes with an originating AS of 1-99. The regex query _..$ requires a space followed by any two characters (each of which may itself be a space).

    Plus Sign +

    Query Modifier Function: One or more instances of the character or pattern.
    Scenario: Display only routes that contain at least one 10 in the AS path, where the pattern 100 should not be used in matching. To build this regex, first build the matching pattern (10)+, and then add the restriction [^(100)]. The combined regex pattern is (10)+[^(100)].

    Question Mark ?

    Query Modifier Function: Matches one or no instances of the character or pattern.
    Scenario: Display only routes from the neighboring AS or its directly connected AS (that is, restrict to two ASs away). This query is more complicated and requires you to define an initial query for identifying the AS, which is [0-9]+. The second component includes the space and an optional second AS. The ? limits the AS match to one or two ASs.

    Asterisk *

    Query Modifier Function: Matches zero or more characters or patterns.
    Scenario: Display all routes from any AS. This may seem like a useless task but may be a valid requirement when using AS-Path access lists.
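The three patterns that the scenarios above spell out in full can be tried against sample AS paths with a short Python sketch. The sample paths are invented for illustration, the helper name `cisco_search` is hypothetical, and the underscore translation is the same assumption as before (space, comma, brace, parenthesis, or either end of the string), so results on a real router may differ in edge cases.

```python
import re


def cisco_search(pattern: str, as_path: str) -> bool:
    """Approximate a Cisco AS-path regex with Python's re module."""
    translated = pattern.replace("_", r"(?:^|$|[ ,{}()])")
    return bool(re.search(translated, as_path))


# Invented AS paths, written the way they appear in show output.
paths = ["100 200", "300 400", "100 900", "200 40", "300 100", "10 20", "100 20"]

for pattern in ("^[13]00_[^3-8]", "_..$", "(10)+[^(100)]"):
    hits = [path for path in paths if cisco_search(pattern, path)]
    print(pattern, "->", hits)
```

One detail worth noticing: inside brackets, [^(100)] is a set of single characters, so it excludes any one of "(", "1", "0", or ")" after the 10 rather than the literal string 100.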