Reconvergence - Versatile Routing and Services with BGP: Understanding and Implementing BGP in SR-OS (2014)

Chapter 6. Reconvergence

The time to reconverge upon change and/or network failure has become an important factor in the delivery of business services. Historically, BGP was intentionally slow to reconverge as a way of avoiding route oscillation between Autonomous Systems. However, the requirement is now frequently to deliver sub-second reconvergence times because of the real-time nature of traffic being carried through Layer-3 and Layer-2 VPNs. This has meant changes not only in the speed with which BGP reacts to failure, but also changes in the protocol to allow for increased path visibility between BGP speakers. This chapter discusses some of the more notable changes.

Advertisement of Multiple Paths

Advertisement of Multiple Paths (ADD-PATH) is an extension to BGP that allows a speaker to advertise multiple paths for the same prefix/NLRI. Without it, a new advertisement for a prefix implicitly replaces the previous one, and an intervening BGP speaker propagates only the single path chosen by its path-selection process. With ADD-PATH, several paths for the same prefix can be propagated upstream side by side. The purpose is to reduce route oscillations, enable load-balancing, and improve routing convergence by making an alternative path immediately available to a reconverging router.

To achieve this, each advertised path is identified by a 4-octet Path Identifier. To carry the Path Identifier in the UPDATE message, the NLRI encodings are extended by prepending the Path Identifier field. The combination of Path Identifier and prefix thereafter identifies a given path. Assignment of Path Identifier values is local. Where a BGP speaker readvertises a route with Path Identifiers, it must generate its own Path Identifier.

Figure 6-1 ADD-PATH NLRI

ADD-PATH is a capability negotiated in the OPEN exchange. During the exchange, the peers negotiate Send/Receive values indicating for a given AFI/SAFI whether they are willing to receive multiple paths from the peer, would like to send multiple paths to the peer, or both.
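The outcome of this negotiation can be sketched in a few lines of Python. This is only an illustration, not SR-OS code: the Send/Receive values (1 = receive, 2 = send, 3 = both) are those defined for the ADD-PATH capability in RFC 7911, while the function name and the dictionary layout are assumptions made for the example.

```python
# Send/Receive field values as defined for the ADD-PATH capability (RFC 7911).
RECEIVE = 1
SEND = 2
BOTH = 3

IPV4_UNICAST = (1, 1)  # (AFI, SAFI)


def add_path_directions(local, remote, family):
    """Return the directions in which ADD-PATH is usable for one AFI/SAFI.

    local and remote map (AFI, SAFI) tuples to the Send/Receive value each
    side placed in its OPEN message; a missing family means "not offered".
    """
    mine = local.get(family, 0)
    theirs = remote.get(family, 0)
    return {
        # Multiple paths may be sent only if we offered Send and the peer offered Receive.
        "send": bool(mine & SEND) and bool(theirs & RECEIVE),
        # Multiple paths may be received only if we offered Receive and the peer offered Send.
        "receive": bool(mine & RECEIVE) and bool(theirs & SEND),
    }


rr1 = {IPV4_UNICAST: BOTH}
r1 = {IPV4_UNICAST: BOTH}
legacy_peer = {}  # hypothetical peer that does not advertise the capability

print(add_path_directions(rr1, r1, IPV4_UNICAST))
print(add_path_directions(rr1, legacy_peer, IPV4_UNICAST))
```

With both sides offering send and receive, ADD-PATH is usable in both directions; toward a peer that does not offer the capability, the session falls back to a single path per prefix.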

I'll use the topology in Figure 6-2 to illustrate the concept of the ADD-PATH capability. In this topology, routers R1 to R4 are in AS 64496; R1 and R2 are clients of RR1; and R3 and R4 are clients of RR2. Both R1 and R2 are peering externally with AS 64510 and are learning prefix 172.16.0.0/20. The physical topology isn't shown, but the objective is for R4 to receive multiple copies of the 172.16.0.0/20 prefix with redundant Next-Hops to provide for faster reconvergence under failure. Of course, without ADD-PATH, RR1 would receive the prefix 172.16.0.0/20 from both R1 and R2, but would execute the best-path selection algorithm and only propagate that best-path UPDATE upstream to RR2.

Figure 6-2 ADD-PATH Test Topology

In SR-OS, the ADD-PATH capability is configured for each required Address Family. For each family, the user must indicate the maximum number of paths that the router should send for each prefix, optionally followed by the receive keyword. (The receive capability is enabled by default even if the receive keyword is not included.) If you assume that a given BGP speaker has two paths to a given prefix and the send keyword is configured with a value of two, both paths are propagated to its peers, and each propagated path has a different Path Identifier. For illustration, routers R1 to R4 and RR1/RR2 are configured with a send value of 2. Output 6-1 shows an example of the required configuration at RR1.

Debug 6-1: Path Identifier Encoding

53 2013/04/24 13:34:28.80 UTC MINOR: DEBUG #2001 Base Peer 1: 192.0.2.12

"Peer 1: 192.0.2.12: UPDATE

Peer 1: 192.0.2.12 - Send BGP UPDATE:

Withdrawn Length = 0

Total Path Attr Length = 34

Flag: 0x40 Type: 1 Len: 1 Origin: 0

Flag: 0x40 Type: 2 Len: 6 AS Path:

Type: 2 Len: 1 < 64510 >

Flag: 0x40 Type: 3 Len: 4 Nexthop: 192.0.2.22

Flag: 0x40 Type: 5 Len: 4 Local Preference: 100

Flag: 0xc0 Type: 8 Len: 4 Community:

64510:3551

NLRI: Length = 8

172.16.0.0/20 Path-ID 3

"

First, you can verify the encoding of the Path Identifier in the IPv4 NLRI. This is shown in the UPDATE message from R1 towards RR1. In the output, the Path-ID is actually shown as a suffix of the prefix 172.16.0.0/20, but this is simply to make the output more readable—the Path Identifier is actually prepended to the prefix.
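To make the wire format concrete, here is a minimal Python sketch of the IPv4 NLRI encoding with a prepended Path Identifier. It is illustrative only; the function names are assumptions, and only the layout (4-octet Path Identifier, 1-octet prefix length, then the minimum number of prefix octets) comes from the description above.

```python
import ipaddress
import struct


def encode_add_path_nlri(path_id, prefix):
    """Encode one IPv4 NLRI with a prepended 4-octet Path Identifier."""
    network = ipaddress.ip_network(prefix)
    # Only the octets needed to hold the prefix length are carried.
    prefix_octets = (network.prefixlen + 7) // 8
    return (
        struct.pack("!IB", path_id, network.prefixlen)
        + network.network_address.packed[:prefix_octets]
    )


def decode_add_path_nlri(data):
    """Decode a single (Path Identifier, IPv4 prefix) pair."""
    path_id, prefix_len = struct.unpack("!IB", data[:5])
    prefix_octets = (prefix_len + 7) // 8
    # Pad the truncated prefix back to four octets before parsing it.
    address = data[5:5 + prefix_octets].ljust(4, b"\x00")
    return path_id, f"{ipaddress.IPv4Address(address)}/{prefix_len}"


encoded = encode_add_path_nlri(3, "172.16.0.0/20")
print(len(encoded), encoded.hex())
print(decode_add_path_nlri(encoded))
```

The encoded NLRI for 172.16.0.0/20 with Path-ID 3 is eight octets long (four for the Path Identifier, one for the prefix length, three for the prefix), which agrees with the NLRI length of 8 reported in Debug 6-1.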

The Path Identifier generated by the BGP speaker is also shown in the show router bgp routes <prefix> hunt command. Output 6-2 shows the Path-ID that is generated by R1 when advertising the 172.16.0.0/20 prefix to RR1.

When RR1 receives two paths for the same prefix but with different Path Identifiers, it advertises both paths upstream because of its add-paths send 2 configuration, generating its own Path Identifiers when readvertising the prefix. The same action is taken at RR2, which results in two paths being advertised to R4 as illustrated in Figure 6-3.

Figure 6-3 ADD-PATH Prefix Propagation

If a BGP speaker is configured with add-paths send n and has more than n paths available in the RIB-IN, it selects the best n overall paths for each prefix as candidates for upstream propagation while attempting to meet split-horizon and/or Next-Hop diversity objectives. This selection of the paths to advertise uses a modified path selection algorithm as follows:

i. If the best path is a non-BGP route exported to the neighbor, advertise only that path (unless advertise-inactive is set).

ii. If the best path is a BGP route from the neighbor and split-horizon applies, start with the next-best path and advertise the single best path from each set of paths with the same BGP next-hop until n paths have been advertised or there are no more valid paths.

iii. If the best path is a BGP route from the neighbor and split-horizon does not apply, start with the best path and advertise the single best path from each set of paths with the same BGP next-hop until n paths have been advertised or there are no more valid paths.
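The three rules above can be sketched as follows. This is a simplified model rather than the SR-OS implementation: the Path class, the function name, and the assumption that the input list is already ordered best-first by the normal decision process are all illustrative. Two behaviors are also assumptions: split-horizon is applied whenever the leading BGP candidate was learned from the neighbor, and with advertise-inactive set the sketch simply continues with the BGP candidates.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Path:
    next_hop: str
    from_peer: str        # peer the path was learned from ("" for non-BGP routes)
    is_bgp: bool = True


def select_add_paths(ranked: List[Path], neighbor: str, n: int,
                     split_horizon: bool = True,
                     advertise_inactive: bool = False) -> List[Path]:
    """Pick up to n paths for one prefix to advertise to one neighbor."""
    if not ranked:
        return []
    best = ranked[0]
    if not best.is_bgp and not advertise_inactive:
        return [best]                          # rule i: only the non-BGP best path
    candidates = [p for p in ranked if p.is_bgp]
    if candidates and split_horizon and candidates[0].from_peer == neighbor:
        candidates = candidates[1:]            # rule ii: start with the next-best path
    selected, seen_next_hops = [], set()
    for path in candidates:                    # rules ii/iii: one path per BGP next-hop
        if path.next_hop in seen_next_hops:
            continue
        selected.append(path)
        seen_next_hops.add(path.next_hop)
        if len(selected) == n:
            break
    return selected


# RR1's view of 172.16.0.0/20; the third path is hypothetical and shares a next-hop.
paths = [
    Path("192.0.2.22", "192.0.2.22"),
    Path("192.0.2.13", "192.0.2.13"),
    Path("192.0.2.13", "192.0.2.99"),
]
print(select_add_paths(paths, "192.0.2.23", 2))   # toward RR2
print(select_add_paths(paths, "192.0.2.22", 2))   # back toward R1
```

Toward RR2 both next-hops are advertised. Toward R1, the path learned from R1 is skipped and only the path with the other next-hop remains, because the third path duplicates a next-hop that has already been selected.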

Output 6-1: ADD-PATH Configuration

bgp
    group "RR"
        family ipv4 ipv6
        peer-as 64496
        add-paths
            ipv4 send 2 receive
        neighbor 192.0.2.23
        exit
    exit
    group "CLIENTS"
        family ipv4 ipv6
        cluster 192.0.2.12
        peer-as 64496
        add-paths
            ipv4 send 2 receive
        neighbor 192.0.2.13
        exit
        neighbor 192.0.2.22
        exit
    exit

Output 6-2: Path-ID Visibility

*A:R1# show router bgp routes 172.16.0.0/20 hunt | match post-lines 16 "RIB Out"

RIB Out Entries

-----------------------------------------------------------------------

Network : 172.16.0.0/20

Nexthop : 192.0.2.22

Path Id : 3

To : 192.0.2.12

Res. Nexthop : n/a

Local Pref. : 100 Interface Name : NotAvailable

Aggregator AS : None Aggregator : None

Atomic Aggr. : Not Atomic MED : None

AIGP Metric : None

Connector : None

Community : 64510:3551

Cluster : No Cluster Members

Originator Id : None Peer Router Id : 192.0.2.12

Origin : IGP

AS-Path : 64510

At R4, you can verify that both paths have been successfully advertised with a Next-Hop of R1 (Path-ID 8) and a Next-Hop of R2 (Path-ID 7). These paths can be used for backup (PIC) or load-balancing purposes (Multipath) depending on the user requirement.

Output 6-3: R4 BGP Routes

*A:R4# show router bgp routes 172.16.0.0/20
==================================================================
 BGP Router ID:192.0.2.21 AS:64496 Local AS:64496
==================================================================
 Legend -
 Status codes : u - used, s - suppressed, h - history, d - decayed, * - valid
 Origin codes : i - IGP, e - EGP, ? - incomplete, > - best, b - backup
==================================================================
BGP IPv4 Routes
==================================================================
Flag  Network            LocalPref   MED
      Nexthop            Path-Id     Label
      As-Path
------------------------------------------------------------------
u*>i  172.16.0.0/20      100         None
      192.0.2.22         8           -
      64510
*i    172.16.0.0/20      100         100
      192.0.2.13         7           -
      64510
------------------------------------------------------------------
Routes : 2
==================================================================

ADD-PATH is supported in SR-OS for IPv4, IPv6, VPN-IPv4, and VPN-IPv6 Address Families and provides a good mechanism to increase BGP path visibility. For the VPN-IPv4/IPv6 Address Families, a well-known and widely deployed mechanism to achieve the same objective is to use different Route-Distinguishers at dual-homed sites to make a given set of IPv4/IPv6 prefixes unique to each site. Because ADD-PATH can be enabled on an Address Family basis, the choice is with the operator as to which mechanism is most suitable to the environment.

Best External

Best External (draft-ietf-idr-best-external) is a mechanism that allows a BGP speaker to advertise its best external path to IBGP peers even if its own selected best path is received from an internal peer. By advertising the best external route when it differs from the best route, additional path visibility can be provided to the IBGP mesh. When two paths are available to reach a given destination and one is preferred, the availability of an alternate path in the RIB means that only a FIB update is required should the preferred Next-Hop fail. In addition, the presence of two paths can reduce route oscillation.

Best External does not require any protocol extensions, but instead modifies the route advertisement criteria of the base BGP specification (RFC 4271). Take the example of Figure 6-4 where routers R1, R2, and R3 form part of AS 64496 and are fully IBGP meshed. Router R1 is learning the prefix 172.16.0.0/20 externally with AS_PATH 64510, while router R2 is learning the prefix 172.16.0.0/20 with AS_PATH 64510 64511. In this scenario, router R2 does not advertise its externally learned prefix in IBGP. It prefers the internally learned prefix from R1 because of the shorter AS_PATH. The result is that router R3 only has one path to 172.16.0.0/20 in its RIB-IN.

Figure 6-4 Route Advertisement Without Best External

If, however, you enable Best External at R2, it advertises the prefix 172.16.0.0/20 learned through EBGP even though its best path is learned from an internal peer. The result is that router R3 now has two paths to prefix 172.16.0.0/20 and can therefore reconverge more quickly in the event of failure of the preferred Next-Hop. Best External is enabled through configuration of the advertise-external command followed by keywords for each applicable Address Family. It can be applied only at the global BGP level.
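The change in advertisement criteria can be illustrated with a small Python sketch. It is a deliberately reduced model, not the SR-OS logic: the Route class and function names are assumptions, and the decision process is collapsed to "shortest AS_PATH wins", which is enough to reproduce the R2 example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Route:
    learned_via: str            # "ebgp" or "ibgp"
    as_path: Tuple[int, ...]


def best(routes: List[Route]) -> Optional[Route]:
    # Stand-in for the full BGP decision process: shortest AS_PATH wins.
    return min(routes, key=lambda r: len(r.as_path)) if routes else None


def route_for_ibgp_peers(rib_in: List[Route], advertise_external: bool) -> Optional[Route]:
    """Return the route a speaker advertises to its IBGP peers, if any."""
    overall = best(rib_in)
    if overall is None:
        return None
    if overall.learned_via == "ebgp":
        return overall          # base behavior: the best path is external
    if not advertise_external:
        return None             # best path is internal, so nothing is advertised
    # Best External: advertise the best of the externally learned paths.
    return best([r for r in rib_in if r.learned_via == "ebgp"])


# R2's RIB-IN: the path via R1 (IBGP) and its own externally learned path.
r2_rib_in = [Route("ibgp", (64510,)), Route("ebgp", (64510, 64511))]
print(route_for_ibgp_peers(r2_rib_in, advertise_external=False))
print(route_for_ibgp_peers(r2_rib_in, advertise_external=True))
```

Without Best External, R2 advertises nothing into IBGP for the prefix; with it enabled, R2 advertises its external path even though that path is not its overall best.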

Output 6-4: Configuration of Best External

bgp
    advertise-external ipv4 ipv6
    group "IBGP"
        family ipv4 ipv6
        peer-as 64496
        …etc

In a simple topology like Figure 6-4, Best External provides a good solution to increase path visibility. This solution is arguably better than ADD-PATH because it doesn't require any signaled/advertised extensions (only modified route advertisement criteria), and therefore doesn't require the whole network to be upgraded or reconfigured to support it. However, if you modify the logical topology a little and insert a Route-Reflector as shown in Figure 6-5, the Best External mechanism doesn't achieve the objective of increasing path visibility to R3. This is because the Route-Reflector runs the standard best path algorithm across the internally learned paths from R1 and R2 (there is no external path) and advertises only the best path upstream to R3.

Figure 6-5 Best External with Route-Reflection

The Best External draft suggests some modifications to Route-Reflector route advertisement procedures in an effort to increase path visibility of paths advertised to and from a Route-Reflector cluster. To increase path visibility to a cluster, the draft suggests that if client-to-client reflection is disabled and the cluster operates as a mesh, a Route-Reflector may advertise to the cluster the preferred path from the set of paths not received from within the cluster. To increase path visibility from a cluster, the draft suggests that when advertising a route to a non-client IBGP peer, a BGP speaker may advertise an alternative best route from a cluster if the preferred path is learned from outside the cluster. Given the unlikely scenario of a Service Provider disabling client-to-client reflection on an existing Route-Reflector, these advertisement rules are not implemented in SR-OS. Where Route Reflection is used, the only solution is to enable ADD-PATH in conjunction with Best External.

The route advertisement rules suggested by the draft are, however, supported in SR-OS between members of a Confederated AS. That is, if a BGP speaker has advertise-external enabled and its preferred path is a route from a confed-EBGP peer in AS m, two things should happen:

· This preferred path should be advertised to all other confed-EBGP peers.

· The best internal route should be advertised to confed-EBGP peers in AS m. The best internal route is the one found by running the BGP path selection algorithm across the paths in the RIB-IN excluding those learned from member AS m.

Next-Hop Tracking

Next-Hop Tracking is a mechanism that actively monitors all route-table and MPLS tunnel-table modifications and immediately triggers BGP Next-Hop Resolution add/delete/modify messages to the FIB when a change is detected. Even when alternate paths are already programmed into the FIB (for example when PIC is enabled or ECMP/IBGP-Multipath is in use), the CPM still must notify the IOM/IMMs of reachability failure/changes to allow the datapath to be reprogrammed accordingly. Next-Hop Tracking is enabled by default (and cannot be disabled) and ensures that this update process is entirely event-driven based upon the current network state.

The active BGP Next-Hop for a unicast IPv4 NLRI is resolved by the longest prefix match of the IPv4 Next-Hop address that is installed and active in the forwarding table (and similar logic applies to an IPv6 BGP Next-Hop address associated with a unicast IPv6 NLRI). If there is no active and eligible longest prefix match for the Next-Hop address, associated BGP prefixes are flagged as invalid in the RIB-IN.

Assume a scenario as shown in Figure 6-6. Routers R1 to R5 form part of AS 64496 and each is peered in IBGP with Route-Reflector RR1. IS-IS is used as the IGP and the Autonomous System is entirely Level-2. Routers R1 and R2 are peering externally with AS 64510, and both are learning prefix 172.16.0.0/20. ADD-PATH is configured on all AS 64496 routers, and as a result router R5 receives two paths for 172.16.0.0/20 via R1 and R2. Because there is no BGP multipath in use, both are held in RIB-IN as valid but only one is installed in the route-table, which is the route via R2 (192.0.2.21).

Figure 6-6 Next-Hop Tracking Use-Case Topology

If you now simulate a failure of R2, router R5 must reconverge the BGP Next-Hop for the prefix 172.16.0.0/20. In Debug 6-2, the process of the Route-Table Manager (RTM) removing R2's system address (RTM DELETE) and modifying the active Next-Hop (RTM MODIFY) to R1's system address (192.0.2.22) is immediate. This is a function of Next-Hop Tracking.

Debug 6-2: Next-Hop Tracking

1 2013/06/18 11:23:04.04 UTC MINOR: DEBUG #2001 Base PIP

"PIP: ROUTE

instance 1 (Base), RTM DELETE event

New Route Info

prefix: 192.0.2.21/32 (0x9662bd38) preference: 18

metric: 200 backup metric: 0 owner: ISIS ownerId: 0

1 ecmp hops 0 backup hops:

hop 0: 192.0.2.150 @ if 2

"

4 2013/06/18 11:23:04.05 UTC MINOR: DEBUG #2001 Base PIP

"PIP: ROUTE

instance 1 (Base), RTM MODIFY event

New Route Info

prefix: 172.16.0.0/20 (0x96641d10) preference: 170

metric: 0 backup metric: 0 owner: BGP ownerId: 0

1 ecmp hops 0 backup hops:

hop 0: 192.0.2.22 @ if 0

"

Next, slightly modify the logical topology of Figure 6-6 so that router R5 becomes part of IS-IS Level 1 and routers R3 and R4 are Level-1-2 routers. R3 and R4 redistribute the 32-bit system addresses from Level 2 into Level 1 and set the Attach bit in the LSPs that they send into the Level-1 area, so router R5 has a default route from both Level-1-2 routers. This is a common scenario, but one that can affect how Next-Hop Tracking operates.

If you again simulate a failure of router R2, the output in Debug 6-3 outlines the sequence of events. Routers R3 and R4 source Level-1 LSPs toward R5, removing reachability for R2's system address (RTM DELETE in frame 88). However, R5 still has two default routes toward its Level-1-2 routers and as a result can still resolve the current BGP Next-Hop (192.0.2.21) for prefix 172.16.0.0/20. In this simple topology, R5 continues to forward traffic to routers R3 and/or R4, both of which have already reconverged on router R1 thanks to Next-Hop Tracking. In other scenarios this could lead to sub-optimal routing, or even create a blackhole until the reconverging router receives a withdraw.

In this example, router R5 receives a withdraw message from RR1 (frame 90), at which point it modifies the Next-Hop from R2 to R1 (RTM MODIFY in frame 91). The time to reconverge is therefore increased by roughly nine seconds, measured from the RTM DELETE of the active Next-Hop system address in frame 88 (12:46:50.34) to the RTM MODIFY of the Next-Hop to the alternate path in frame 91 (12:46:59.05). In general, the outage is largely determined by BGP withdraw propagation time.
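The size of the gap can be checked directly from the debug timestamps. The short Python calculation below uses the frame timestamps from Debug 6-2 and Debug 6-3; nothing else is assumed.

```python
from datetime import datetime

FORMAT = "%Y/%m/%d %H:%M:%S.%f"


def gap_seconds(start, end):
    """Elapsed time in seconds between two debug timestamps."""
    delta = datetime.strptime(end, FORMAT) - datetime.strptime(start, FORMAT)
    return delta.total_seconds()


# Debug 6-2 (all Level-2): RTM DELETE followed by RTM MODIFY.
level2_gap = gap_seconds("2013/06/18 11:23:04.04", "2013/06/18 11:23:04.05")

# Debug 6-3 (R5 in Level 1 with default routes): frame 88 to frame 91.
level1_gap = gap_seconds("2013/06/18 12:46:50.34", "2013/06/18 12:46:59.05")

print(f"Level-2 only:        {level2_gap:.2f} s")
print(f"With default routes: {level1_gap:.2f} s")
```

The first case reconverges within about 10 ms of the route-table change, whereas the second waits 8.71 seconds for the withdraw to arrive.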

Debug 6-3: Next-Hop Tracking with Default Routes

88 2013/06/18 12:46:50.34 UTC MINOR: DEBUG #2001 Base PIP

"PIP: ROUTE

instance 1 (Base), RTM DELETE event

New Route Info

prefix: 192.0.2.21/32 (0x96641ba0) preference: 15

metric: 73 backup metric: 0 owner: ISIS ownerId: 0

1 ecmp hops 0 backup hops:

hop 0: 192.0.2.150 @ if 2

"

90 2013/06/18 12:46:59.05 UTC MINOR: DEBUG #2001

Base Peer 1: 192.0.2.12

"Peer 1: 192.0.2.12: UPDATE

Peer 1: 192.0.2.12 - Received BGP UPDATE:

Withdrawn Length = 8

172.16.0.0/20 Path-ID 9

Total Path Attr Length = 0

"

91 2013/06/18 12:46:59.05 UTC MINOR: DEBUG #2001 Base PIP

"PIP: ROUTE

instance 1 (Base), RTM MODIFY event

New Route Info

prefix: 172.16.0.0/20 (0x96641d10) preference: 170

metric: 0 backup metric: 0 owner: BGP ownerId: 0

1 ecmp hops 0 backup hops:

hop 0: 192.0.2.22 @ if 0

In this scenario, it's beneficial to exclude the default routes from the set of routes that are considered eligible for BGP Next-Hop resolution, and this is the purpose of Next-Hop Tracking policies. Output 6-5 shows an example of a Next-Hop Tracking Policy used to overcome the scenario described previously. First, the route-policy framework is used to create a route-policy that excludes the default route. The same route-policy is then referenced within the global or VPRN BGP context using the policy keyword within the next-hop-resolution node.

The route-policy used with Next-Hop Tracking should only attempt to match a prefix-list and/or a protocol name/instance. Other match criteria are not supported.
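The effect of such a policy on Next-Hop resolution can be modeled with Python's ipaddress module. This is a sketch under stated assumptions: the function names are hypothetical, the route-table contents are a reduced version of R5's table after the failure of R2, and any route the policy does not reject is treated as eligible.

```python
import ipaddress
from typing import Callable, Iterable, Optional

Network = ipaddress.IPv4Network


def reject_default_route(route: Network) -> bool:
    """Mimics a policy whose only entry rejects 0.0.0.0/0 exact."""
    return route != ipaddress.ip_network("0.0.0.0/0")


def resolve(next_hop: str, route_table: Iterable[str],
            policy: Optional[Callable[[Network], bool]] = None) -> Optional[str]:
    """Resolve a BGP Next-Hop by longest prefix match over eligible routes."""
    address = ipaddress.ip_address(next_hop)
    candidates = [ipaddress.ip_network(r) for r in route_table]
    if policy is not None:
        candidates = [r for r in candidates if policy(r)]
    covering = [r for r in candidates if address in r]
    if not covering:
        return None   # unresolved: BGP paths using this Next-Hop become invalid
    return str(max(covering, key=lambda r: r.prefixlen))


# R5's route-table after R2's system address (192.0.2.21/32) has been removed.
after_failure = ["0.0.0.0/0", "192.0.2.22/32"]

print(resolve("192.0.2.21", after_failure))                        # still resolves
print(resolve("192.0.2.21", after_failure, reject_default_route))  # unresolved
print(resolve("192.0.2.22", after_failure, reject_default_route))  # R1 unaffected
```

Without the policy, the failed Next-Hop still resolves through the default route, so nothing triggers a switch to the alternate path. With the policy, resolution fails as soon as the /32 disappears, and the path via R1 can be activated immediately.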

Output 6-5: Configuration of Next-Hop Tracking Policy

router
    policy-options
        begin
        prefix-list "Default-Route"
            prefix 0.0.0.0/0 exact
        exit
        policy-statement "NEXT-HOP-TRACKING"
            entry 10
                from
                    prefix-list "Default-Route"
                exit
                action reject
            exit
        exit
    bgp
        next-hop-resolution
            policy "NEXT-HOP-TRACKING"
        exit
    exit

Once again, you can simulate a failure of router R2 with the preceding configuration in place. As shown in Debug 6-4, the modification of the BGP Next-Hop from R2 to R1 (frame 101) takes place immediately after R2's system address is deleted from the route-table (frame 100), proving that the Next-Hop Tracking policy is functioning as expected.

Debug 6-4: Next-Hop Tracking with Policy

100 2013/06/18 13:15:57.24 UTC MINOR: DEBUG #2001 Base PIP

"PIP: ROUTE

instance 1 (Base), RTM DELETE event

New Route Info

prefix: 192.0.2.21/32 (0x966418c0) preference: 15

metric: 73 backup metric: 0 owner: ISIS ownerId: 0

1 ecmp hops 0 backup hops:

hop 0: 192.0.2.150 @ if 2

"

101 2013/06/18 13:15:57.25 UTC MINOR: DEBUG #2001 Base PIP

"PIP: ROUTE

instance 1 (Base), RTM MODIFY event

New Route Info

prefix: 172.16.0.0/20 (0x96641d10) preference: 170

metric: 0 backup metric: 0 owner: BGP ownerId: 0

1 ecmp hops 0 backup hops:

hop 0: 192.0.2.22 @ if 0

"

The policy referenced in the next-hop-resolution node only affects which routes in the route-table are eligible to resolve a BGP Next-Hop address. The policy does not affect the way BGP Next-Hops are resolved to MPLS tunnels. If the network shown in Figure 6-6 were an MPLS network and the service at R5 were a VPRN, the policy would not be required even with R5 configured as an IS-IS Level-1 router. This is because during the simulated failure of R2, the system IP address of R2 would have been removed from the route-table of R5 and replaced by the default route or routes for BGP Next-Hop resolution. Importantly, however, the LSP to R2 would also have been removed from the tunnel-table. As a result, the BGP route with Next-Hop of R2 would have been held in RIB-IN and flagged as “invalid.”

Prefix Independent Convergence (PIC)

In many networks, large numbers of prefixes are reachable via more than one path. BGP Prefix Independent Convergence (PIC) is the name for techniques that can reconverge upon failure in a time period that does not depend on the number of prefixes being restored. This is done by organizing BGP prefixes into Path-Lists consisting of primary paths together with precomputed backup paths, and through implementation of FIB hierarchy.

PIC can be categorized as either Core PIC or Edge PIC. Core PIC describes a scenario where a link or node on the path to the BGP Next-Hop fails, but the BGP Next-Hop remains reachable. Edge PIC describes a scenario where an edge node or edge link fails, resulting in a change of the BGP Next-Hop.

Core PIC

In SR-OS, Core PIC is implemented by programming each BGP route to an IP prefix with an indirect Next-Hop that is actually a pointer to a set of one or more IGP next-hops. Many BGP routes can share the same indirect Next-Hop. If the IP interface and/or MPLS LSP used to reach a BGP Next-Hop transitions, or there is a topology change, only the Next-Hop set is modified. Only a small number of FIB objects are modified without the requirement to modify the possibly large number of BGP prefixes.

The process of triggering updates to the FIB is managed by an event-based mechanism similar to Next-Hop Tracking that actively monitors all IGP and/or MPLS route-table and tunnel-table modifications and immediately triggers Next-Hop Resolution add/delete/modify messages to the FIB as appropriate.
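The benefit of the indirection can be shown with a toy FIB in Python. The class name, the generated prefixes, and the replacement IGP next-hop 192.0.2.154 are all hypothetical; the point is only that many prefixes reference one shared object, so a core topology change touches that object once.

```python
class IndirectNextHop:
    """Shared FIB object: one BGP Next-Hop pointing at a set of IGP next-hops."""

    def __init__(self, bgp_next_hop, igp_next_hops):
        self.bgp_next_hop = bgp_next_hop
        self.igp_next_hops = list(igp_next_hops)


# One indirect Next-Hop shared by a large number of BGP prefixes.
shared = IndirectNextHop("192.0.2.22", ["192.0.2.150"])
fib = {}
for i in range(1000):
    fib[f"10.{i // 256}.{i % 256}.0/24"] = shared

# A core link fails and the IGP converges on a new next-hop: a single write,
# independent of how many prefixes reference the object.
shared.igp_next_hops = ["192.0.2.154"]

print(len(fib), "prefixes,", len({id(entry) for entry in fib.values()}), "indirect object")
print(fib["10.3.231.0/24"].igp_next_hops)
```

Every prefix sees the new IGP next-hop after one update, which is why the reconvergence time does not grow with the number of BGP prefixes.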

Core PIC is enabled by default and cannot be disabled.

Edge PIC

As previously described, Edge PIC describes the scenario where an edge node or edge link fails, resulting in a change of BGP Next-Hop for a given number of prefixes. Edge PIC is applicable to a router only when more than one path is known. In most cases you must run a mechanism such as ADD-PATH for IP and/or VPN-IP prefixes, or unique Route-Distinguishers for VPN-IP prefixes.

When Edge PIC is enabled, the BGP decision process is modified so that its output (the best path) becomes a Path-List consisting of {primary, backup}. There may be a single primary path, or more than one if, for example, BGP multipath is enabled and multiple equal-cost paths exist. The backup path is computed by executing the BGP decision process (down to the lowest IP address as the final tie-breaker) on all available paths except those already selected as primary paths and those that have a Next-Hop attribute in common with a selected primary path. There can be at most one backup path, and there is none if no remaining path meets these criteria.

When a route is programmed into the forwarding path (IOM/IMM) the associated {primary, backup} Path-List is also taken into account, and all routes with a common Path-List are grouped together retaining their primary/backup paths. If a BGP Next-Hop becomes unreachable (detected by Next-Hop Tracking) and no other valid primary paths are available, the IOM reprograms the common Path-Lists to use the backup path. This results in a rapid reconvergence of traffic that is independent of the number of prefixes.
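A compact Python model of the Path-List idea follows. It is a sketch rather than the IOM implementation: the class and function names are assumptions, each path carries a precomputed rank standing in for the full decision process, only a single primary path is modeled (no multipath), and the second prefix is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Path:
    next_hop: str
    rank: int   # position from the BGP decision process; lower is better


def build_path_list(paths: List[Path]) -> Tuple[str, Optional[str]]:
    """Return (primary, backup) Next-Hops; backup must differ from primary."""
    ranked = sorted(paths, key=lambda p: p.rank)
    if not ranked:
        raise ValueError("no paths available")
    primary = ranked[0]
    backup = next((p for p in ranked[1:] if p.next_hop != primary.next_hop), None)
    return primary.next_hop, backup.next_hop if backup else None


class Fib:
    def __init__(self):
        self.prefixes = {}   # prefix -> Path-List (primary, backup)
        self.in_use = {}     # Path-List -> Next-Hop currently forwarding

    def install(self, prefix, path_list):
        self.prefixes[prefix] = path_list
        self.in_use.setdefault(path_list, path_list[0])

    def next_hop_down(self, failed):
        # One update per affected Path-List, regardless of the prefix count.
        for path_list, active in self.in_use.items():
            backup = path_list[1]
            if active == failed and backup is not None:
                self.in_use[path_list] = backup

    def lookup(self, prefix):
        return self.in_use[self.prefixes[prefix]]


path_list = build_path_list([Path("192.0.2.22", 0), Path("192.0.2.13", 1)])
fib = Fib()
fib.install("172.16.0.0/20", path_list)
fib.install("172.16.16.0/20", path_list)   # hypothetical second prefix, same Path-List

fib.next_hop_down("192.0.2.22")
print(fib.lookup("172.16.0.0/20"), fib.lookup("172.16.16.0/20"))
```

Both prefixes move to the backup Next-Hop through a single change to the shared Path-List entry.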

Because the ADD-PATH feature is complementary to Edge PIC, I'll use the same topology used for ADD-PATH to illustrate Edge PIC (repeated in Figure 6-7 for readability). Once again, routers R1 to R4 are in AS 64496. R1 and R2 are clients of RR1, and R3 and R4 are clients of RR2. Both R1 and R2 are peering externally with AS 64510 and are learning prefix 172.16.0.0/20. Using ADD-PATH functionality, router R4 is receiving two paths for the prefix 172.16.0.0/20: one with a Next-Hop of R1 and one with a Next-Hop of R2.

Figure 6-7 Edge PIC Test Topology

Edge PIC is supported in the global routing context and within the VPRN context for IPv4 and IPv6 routes. It is enabled on a per-Address Family basis through the backup-path command, but can only be provisioned at the BGP level and not the group or neighbor level. Output 6-6 illustrates the configuration of Edge PIC at R4. Here the example is provided at global BGP level, but the configuration is identical within a VPRN context for the IPv4 and IPv6 Address Families. When enabled within the context of a VPRN, Edge PIC is applicable only to routes learned in IPv4/IPv6 BGP from PE-CE peers (I discuss Edge PIC for VPN-IPv4/VPN-IPv6 routes later in this section).

You can use a number of CLI commands to verify that a route with multiple paths has successfully installed a primary and backup route. One method is to validate against the RIB-IN as shown in Output 6-7. Here, the first entry for 172.16.0.0/20 with Next-Hop 192.0.2.22 is the primary path. The second entry with Next-Hop 192.0.2.13 is the backup path, denoted by the backup (b) flag. Another equally simple way is to validate against the route-table as shown in Output 6-8. In this output, the [B] flag denotes the presence of a backup route.

Output 6-6: Edge PIC Configuration at R4

bgp
    backup-path ipv4
    group "IBGP"
        family ipv4 ipv6
        peer-as 64496
        add-paths
            ipv4 send 1 receive
        neighbor 192.0.2.23
        exit
    exit

Output 6-7: RIB-IN with Backup Route at R4

*A:R4# show router bgp routes 172.16.0.0/20
==================================================================
 BGP Router ID:192.0.2.21 AS:64496 Local AS:64496
==================================================================
 Legend -
 Status codes : u - used, s - suppressed, h - history, d - decayed, * - valid
 Origin codes : i - IGP, e - EGP, ? - incomplete, > - best, b - backup
==================================================================
BGP IPv4 Routes
==================================================================
Flag  Network            LocalPref   MED
      Nexthop            Path-Id     Label
      As-Path
------------------------------------------------------------------
u*>i  172.16.0.0/20      100         None
      192.0.2.22         22          -
      64510
ub*i  172.16.0.0/20      100         None
      192.0.2.13         23          -
      64510
------------------------------------------------------------------
Routes : 2
==================================================================

Output 6-8: Route-Table with Backup at R4

*A:R4# show router route-table 172.16.0.0/20

==================================================================

Route Table (Router: Base)

==================================================================

Dest Prefix[Flags] Type Proto Age Pref

Next Hop[Interface Name] Metric

------------------------------------------------------------------

172.16.0.0/20 [B] Remote BGP 00h00m13s 170

192.0.2.130 0

------------------------------------------------------------------

No. of Routes: 1

Flags: L = LFA nexthop available B = BGP backup route available

n = Number of times nexthop is repeated

==================================================================

SR-OS also supports Edge PIC for the VPN-IPv4 and VPN-IPv6 Address Families on a per-VPRN basis. The backup-path command under the global or VPRN BGP context cannot be used for this purpose. Instead, a different command, enable-bgp-vpn-backup ipv4|ipv6, is required at the VPRN context level.

Once this is enabled, the VPRN route-table can again be used to verify the presence of a backup path, denoted by the [B] flag.

Output 6-9: Edge PIC for VPN-IPv4

service
    vprn 20 customer 1 create
        autonomous-system 64496
        route-distinguisher 64496:20
        auto-bind mpls
        enable-bgp-vpn-backup ipv4
        vrf-target target:64496:20
        no shutdown
    exit
exit

Output 6-10: VPRN Route-Table with Backup at R4

*A:R4# show router 20 route-table

=======================================================================

Route Table (Service: 20)

=======================================================================

Dest Prefix[Flags] Type Proto Age Pref

Next Hop[Interface Name] Metric

-----------------------------------------------------------------------

63.130.48.0/24 [B] Remote BGP VPN 00h31m16s 170

192.0.2.22 (tunneled) 0

-----------------------------------------------------------------------

No. of Routes: 1

Flags: L = LFA nexthop available B = BGP backup route available

n = Number of times nexthop is repeated

=======================================================================

Implementation of Edge PIC can have an impact on FIB resources in the presence of labeled BGP routes. If the primary Next-Hop corresponds to an unlabeled BGP route resolved by an IP route, the IOM CPU programs the data-plane (p-chip) with the backup path immediately after the failure has occurred. If the primary Next-Hop corresponds to a labeled BGP route resolved by an MPLS tunnel, the data-plane is programmed with the primary and backup NHLFE information prior to the failure. The difference between failure times is fairly insignificant; for the unlabeled BGP route the IOM CPU has to instruct the p-chip to replace one indirection object with another, and for the labeled BGP route the IOM CPU has to tell the p-chip that a failure of the primary path occurred.

Minimum Route Advertisement Interval

The Minimum Route Advertisement Interval (MRAI) is the minimum amount of time that must elapse between an advertisement and/or withdrawal of routes for a given prefix by a BGP speaker to a peer. Two UPDATE messages sent by a BGP speaker to a peer advertising or withdrawing routes must be separated by this interval.

Because an UPDATE message is subject to the MRAI, it clearly has an impact on convergence times. Therefore SR-OS allows for configuration of the MRAI at the BGP, group, and neighbor levels for both global and VPRN BGP using the min-route-advertisement command followed by an interval in seconds. The default is 30 seconds. Setting the correct value for the MRAI essentially represents a trade-off. The default 30-second configuration allows batching of multiple NLRIs into fewer UPDATE messages, but obviously implies a delay in propagation of those messages of up to 30 seconds at each BGP speaker that processes them. A more aggressive MRAI value of, for example, 1 or 2 seconds implies faster convergence but at the cost of a loss of NLRI packing.

The MRAI runs as a “time-window” and each configured MRAI runs independently. In other words, assume router R1 receives a BGP UPDATE for prefix “P” from an external peer and has an MRAI of five seconds configured for its IBGP peers. The prefix “P” is propagated into the Autonomous System “somewhere” between 0 and 5 seconds.

Output 6-11: MRAI Configuration

bgp
    group "EBGP"
        min-route-advertisement 10
    exit
    group "IBGP"
        min-route-advertisement 5
    exit
    no shutdown

Output 6-12: Rapid Withdrawal Configuration

bgp
    rapid-withdrawal
    group "EBGP"
        …etc

By default, the configured MRAI applies to advertisement of both feasible routes and unfeasible routes (withdraws). Frequently it is desirable that withdraws are propagated more quickly than UPDATE messages containing feasible routes in order to avoid black holes. In this case you can enable fast withdrawal of unfeasible routes independently of MRAI using the rapid-withdrawal command at the global BGP level or VPRN BGP level. When rapid withdrawal is enabled, the MRAI is completely bypassed and UPDATEs containing unfeasible routes are propagated immediately.
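The interaction between the MRAI window and rapid withdrawal can be sketched numerically. The model below is an assumption made for illustration, not a description of SR-OS internals: the MRAI is treated as a free-running timer that expires at multiples of the interval, and a pending UPDATE is released at the next expiry. The function names are hypothetical.

```python
import math


def send_time(arrival, mrai, is_withdraw=False, rapid_withdrawal=False):
    """Time at which an UPDATE that became pending at `arrival` is sent."""
    if is_withdraw and rapid_withdrawal:
        return arrival                       # withdraws bypass the MRAI entirely
    # Otherwise wait for the next expiry of the free-running MRAI window.
    return math.ceil(arrival / mrai) * mrai


def worst_case_delay(speakers, mrai):
    """Upper bound on the delay added when every speaker waits a full window."""
    return speakers * mrai


# A route change becomes pending 12.3 seconds into a 5-second MRAI cycle.
print(send_time(12.3, 5))                                            # 15
print(send_time(12.3, 5, is_withdraw=True))                          # 15
print(send_time(12.3, 5, is_withdraw=True, rapid_withdrawal=True))   # 12.3

# Three speakers in sequence, each running the default 30-second MRAI.
print(worst_case_delay(3, 30))                                       # 90
```

An advertisement waits up to one full interval at each speaker, whereas a withdraw with rapid withdrawal enabled is sent as soon as it becomes pending.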

When considering what MRAI timers to use, and whether rapid withdrawal is required, consider other mechanisms that may relax the requirement for an aggressive MRAI, which almost certainly increases control plane load. These mechanisms may already be in place. Mechanisms such as ADD-PATH or multiple Route-Distinguishers for VPN-IPv4/IPv6, in conjunction with Edge PIC ensure that an alternate path is available for fast convergence. If the failure notification is provided by Next-Hop Tracking, there is no dependency on rapid BGP UPDATE/Withdraw propagation. The cost of implementing mechanisms such as ADD-PATH or Edge PIC may come elsewhere (for example, increased memory consumption or FIB space), but there will be a compromise that meets the network's requirements.

Even where the failure notification is advertised by BGP, an aggressive MRAI isn't always needed. A service that frequently calls for fast reconvergence is IP-VPN, for which, by default, SR-OS uses a label-per-VRF approach when signaling VPN-IPv4/VPN-IPv6 prefixes. Because there may be more than one site of a VPN connected to a PE router, this label-per-VRF approach dictates a requirement for an IP lookup on egress before forwarding packets to the preferred CE device (we cannot simply label-switch toward the CE because there is a single label but there are potentially multiple CE devices). This egress IP lookup provides the capability to protect traffic during a PE-CE link failure. Assume a scenario like Figure 6-8, where CE1 and CE2 attach to PE1 and PE2 respectively, and both advertise prefix 172.31.100.0/24. PE1 and PE2 have different Route-Distinguishers, and as a result PE3 has two paths for prefix 172.31.100.0/24 but in steady state prefers PE1 as the Next-Hop. If the PE1-CE1 link fails, traffic will continue to be forwarded by PE3 toward PE1 until BGP has reconverged and PE3 selects PE2 as the new preferred Next-Hop. When this labeled traffic arrives, PE1 pops the label stack and performs a VRF table lookup, where the only remaining route for prefix 172.31.100.0/24 is the one advertised by PE2. PE1 thereafter reimposes a two-level label stack and forwards the traffic toward PE2, where it is subsequently forwarded to CE2. This “u-turn switch” will maintain traffic continuity of “in-flight” packets until PE3 receives the BGP withdraw (MP_UNREACH) from PE1 for VPN prefix 172.31.100.0/24 and updates its Next-Hop toward PE2.

Figure 6-8 Egress IP Lookup and PE-CE Link Failure

image

The egress IP lookup (or “u-turn switch”) minimizes service disruption during PE-CE link failure and also means that any outage experienced is independent of the configured MRAI. However, it relies on an egress IP lookup to function. If, for example, the failure were an ASBR to ASBR link where traffic is label-switched, it would be ineffective and the outage length would be largely influenced by the MRAI.
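The u-turn behavior at PE1 can be modeled as a route lookup that falls back to the remote PE when the local attachment is down. The Python sketch below is an illustration only: the `VrfRoute` type, the `egress_lookup` function, and the label names are hypothetical and do not reflect how SR-OS stores VRF routes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VrfRoute:
    next_hop: str       # "CE1" (local attachment) or "PE2" (remote PE)
    local: bool         # True if reached over a directly attached PE-CE link
    label_stack: tuple  # labels imposed when forwarding (empty toward a CE)


def egress_lookup(routes, failed_next_hops):
    """Return the usable route for the prefix, preferring a local CE."""
    usable = [r for r in routes if r.next_hop not in failed_next_hops]
    if not usable:
        return None
    # Local routes sort first because their key (False) is the smaller value.
    return sorted(usable, key=lambda r: not r.local)[0]


# PE1's VRF routes for 172.31.100.0/24; label names are placeholders.
vrf = [
    VrfRoute("CE1", local=True, label_stack=()),
    VrfRoute("PE2", local=False,
             label_stack=("transport-to-PE2", "vpn-label-PE2")),
]

if __name__ == "__main__":
    print(egress_lookup(vrf, set()).next_hop)    # steady state
    print(egress_lookup(vrf, {"CE1"}).next_hop)  # PE1-CE1 link has failed
```

Because the decision is made by an IP lookup at PE1, the redirect toward PE2 happens as soon as the local route is removed, without waiting for PE3 to learn of the failure.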

BGP Anycast

Chapter 5 discussed the concept of seamless MPLS and the use of ABRs to provide separation between domains (core, aggregation, and access). Labeled unicast IPv4 routes are advertised between domains, with the ABR imposing Next-Hop-Self on those routes so that a router in a given domain only needs to identify a transport label (LDP/RSVP) to an exit point from that domain and not the actual destination of the packet. Labeled unicast IPv4 routes act as the glue to stitch together the access/aggregation/core domains at the ABR. When an ABR imposes Next-Hop-Self on a labeled unicast IPv4 route, it implies that the ABR is part of the control-plane and the data-plane, and because the failure of an ABR is likely to affect a number of services, it is important to have mechanisms in place to provide for fast reconvergence on failure.

One mechanism is the use of PIC previously described in this chapter. When applied to labeled unicast IPv4 routes, this mechanism provides a preprogrammed backup-path for an ABR failure but typically also requires the use of ADD-PATH. Here, I'll discuss a second mechanism outlined in (draft-ietf-mpls-seamless-mpls) that uses Anycast BGP so that a Point of Local Repair (PLR) can provide protection against an ABR node failure and redirect traffic to a backup ABR.

To achieve this protection, a primary ABR is configured with an additional interface known as a “context identifier,” which is advertised into the IGP and LDP. The context identifier is actually just an Anycast address, but for now I'll continue to refer to it as a context identifier. (Later, I'll use the terms Anycast address and context identifier interchangeably). This is also the IPv4 address that the primary ABR sets as the Next-Hop attribute on labeled unicast IPv4 routes that are advertised to the core domain. The backup ABR is configured with the primary ABR's context identifier, which it advertises into the IGP and LDP (but potentially with a higher IGP metric). A PLR would have a feasible backup path (LSP) to the context identifier, and if the primary ABR were to fail, the PLR would simply redirect traffic toward the backup ABR.

However this presents a small problem. When the backup ABR pops the LDP label that it advertised for the context identifier, the next label in the stack is a BGP label with a value advertised by the primary ABR. Because the backup ABR did not distribute this label (using conventional downstream label allocation), it cannot correctly forward the packet. The only way this can work is if the backup ABR had somehow programmed the BGP labels advertised by the primary ABR into its label FIB. That is exactly how the Anycast BGP mechanism overcomes this problem. It uses the concept of upstream label distribution together with context-specific label forwarding (RFC 5331) to program {FEC->Label} mappings on the backup ABR that were actually advertised in labeled BGP by the primary ABR. When the primary ABR advertises {FEC->Label} mappings in labeled unicast IPv4 BGP, the backup ABR uses the Next-Hop attribute (set to the context identifier/Anycast address) as an indication that this {FEC->Label} mapping should be programmed in a backup forwarding context to be used if the primary ABR fails.

I'll use Figure 6-9 as an example where ABR1 is the primary ABR, and ABR2 is the backup. Both ABR1 and ABR2 advertise the context identifier 192.0.2.253 into IS-IS and LDP, with ABR2 advertising a higher metric so that ABR1 is the PLR's preferred IGP Next-Hop for that IP address. ABR1 and ABR2 are IBGP peered with AGN1, AGN2, and AGN3 and exchange labeled unicast IPv4 routes for their respective system addresses. ABR1 and ABR2 are also peered in IBGP through a core Route-Reflector and set Next-Hop-Self to the context identifier/Anycast address when they advertise the routes for AGN1, AGN2, and AGN3 toward the core. Without upstream label allocation and context-specific label forwarding, the forwarding state for the BGP advertised labels would be as shown in this diagram, where ABR1 and ABR2 have two different forwarding states based purely on labels that they each advertised downstream. In this state, you have no backup.

Figure 6-9 Anycast BGP Concept

image

Using the concept of context-specific label switching and upstream label allocation, if you now add the backup forwarding context shown in Figure 6-10 at ABR2, you have a backup solution. The backup forwarding state at ABR2 is a mirror of the native forwarding state at ABR1. If ABR1 fails and the PLR redirects traffic to ABR2, after it has popped the downstream advertised LDP label, it uses this backup forwarding context to correctly forward traffic toward its destination.
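The idea of a backup forwarding context can be sketched as a second label table on ABR2 that is selected by the context identifier on which the packet arrived. This Python model is a minimal illustration: the label values (100, 200, 300, and so on), the `ABR2-system` key, and the function name are hypothetical, and real label FIBs are not structured as dictionaries.

```python
# ABR1's native BGP label table: incoming BGP label -> (outgoing label, next hop).
abr1_native = {100: (200, "AGN1"), 101: (201, "AGN2")}

# ABR2's native table uses the labels that ABR2 allocated itself.
abr2_native = {300: (200, "AGN1"), 301: (201, "AGN2")}

# Backup context on ABR2, keyed by context identifier: a mirror of ABR1's table.
abr2_contexts = {"192.0.2.253": dict(abr1_native)}


def forward_at_abr2(ldp_fec, bgp_label):
    """Pick the label table from the LDP FEC the packet arrived on.

    Returns (outgoing label, next hop), or None if the label is unknown.
    """
    table = abr2_contexts.get(ldp_fec, abr2_native)
    return table.get(bgp_label)


if __name__ == "__main__":
    # Redirected traffic arrives on the context identifier FEC with ABR1's label.
    print(forward_at_abr2("192.0.2.253", 100))
    # The same label in ABR2's native context is meaningless.
    print(forward_at_abr2("ABR2-system", 100))
```

The second call shows why the backup context is needed: a label allocated by ABR1 has no meaning in the table that ABR2 built from its own downstream allocations.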

Figure 6-10 ABR2 Backup Forwarding Context

image

As previously indicated, you can use (Edge) PIC for labeled BGP or Anycast BGP to protect against ABR failures (or in certain circumstances downstream failures). However, there is a significant difference in how these methods detect failures. When you use PIC for labeled BGP, the failure very probably must be detected at a non-adjacent hop, so you need a mechanism like Next-Hop Tracking to be able to detect the failure. With Anycast BGP, the PLR is an adjacent hop, and therefore failure detection is fully localized.

Figure 6-11 illustrates the configuration requirements of Anycast BGP. In this figure, ABR1, ABR2, ABR3, and ABR4 form the core domain, and AGN1 and AGN2 each form an aggregation domain. Each of the AGNs is IBGP peered to the corresponding ABRs, while the ABRs are fully meshed in IBGP with each ABR representing its own cluster. IS-IS is enabled with the hierarchy depicted in the schematic with no redistribution of prefixes between IS-IS instances and/or levels. System addresses are advertised into labeled unicast IPv4 BGP, with ABRs imposing Next-Hop-Self on advertised prefixes in both directions (toward the core domain and toward the aggregation domain). The objective is that ABR1 and ABR2 will become primary/backup ABRs for each other. ABR1 is Master using Anycast address 192.0.2.253, while ABR2 is Master using Anycast address 192.0.2.254. Conversely, ABR1 is acting as backup for Anycast address 192.0.2.254, and ABR2 is backup for Anycast address 192.0.2.253.

Figure 6-11 Anycast BGP Test Topology

image

Before detailing the BGP Anycast configuration I'll cover the base BGP configuration required in this scenario. Much of this information is covered in Chapter 5's Seamless MPLS section, but I'll repeat it here for convenience. Output 6-13 shows the BGP configuration at ABR3 before Anycast is implemented.

Within the BGP configuration, the cluster command effectively means that the router is a Route-Reflector and that IBGP peers under this context are Route-Reflector clients. It can be enabled at the BGP level, group level, or neighbor level. In this instance all the peers of ABR3 are clients, so it is entered at the BGP level. The cluster command is followed by a cluster ID in dotted decimal format, which the Route-Reflector prepends to the CLUSTER_LIST attribute of routes that it reflects, with the purpose of avoiding cluster loops. The advertise-inactive command overcomes an issue when using labeled BGP to advertise prefixes that are also known by some other more preferred protocol such as the IGP. The advertise-inactive command causes the best BGP route, and only the best route, to be advertised even if it is not the most preferred route within the system for a given destination (for example, an IGP route also exists). When the labeled BGP prefix has been advertised, a label swap entry is programmed even though the BGP prefix is inactive. The transport-tunnel command instructs BGP what transport-level MPLS mechanism should be used to resolve the BGP Next-Hop when the peers are non-adjacent. The options are RSVP-TE, LDP, or simply MPLS. The latter means that either can be used, with a preference given to an RSVP-TE LSP if it is available.
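The loop-avoidance role of the cluster ID can be sketched in a few lines. The Python model below follows the general behavior defined for route reflection in RFC 4456 (the reflector prepends its cluster ID to the CLUSTER_LIST and discards routes that already carry it). The dictionary representation and the `reflect` function are hypothetical simplifications, and other reflection attributes such as ORIGINATOR_ID are ignored.

```python
def reflect(route: dict, my_cluster_id: str):
    """Return the route as it would be reflected, or None on a cluster loop."""
    cluster_list = route.get("cluster_list", [])
    if my_cluster_id in cluster_list:
        # The route has already passed through this cluster: discard it.
        return None
    reflected = dict(route)
    # Prepend the local cluster ID before reflecting the route onward.
    reflected["cluster_list"] = [my_cluster_id] + cluster_list
    return reflected


if __name__ == "__main__":
    route = {"prefix": "192.0.2.122/32"}
    first = reflect(route, "192.0.2.23")
    print(first["cluster_list"])
    # The same reflector seeing the route again detects the loop.
    print(reflect(first, "192.0.2.23"))
```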

Finally, each of the neighbor contexts contains the command advertise-label ipv4, which essentially enables the use of labeled BGP for the IPv4 Address Family.

To enable Anycast BGP, the first step is to configure the Anycast addresses, or context identifiers, at ABR1 and ABR2. Output 6-14 shows the necessary configuration at ABR1 where the mh-primary-interface and mh-secondary-interface commands at router level provide the context to configure the IPv4 addresses. The same configuration is made at ABR2 with the exception that the primary and secondary addresses are swapped.

Output 6-13: ABR1 Base BGP Configuration

bgp
    cluster 192.0.2.23
    transport-tunnel mpls
    group "IBGP"
        family ipv4
        export "IPv4-IBGP"
        peer-as 64496
        advertise-inactive
        neighbor 192.0.2.11
            advertise-label ipv4
        exit
        neighbor 192.0.2.13
            advertise-label ipv4
        exit
        neighbor 192.0.2.21
            advertise-label ipv4
        exit
        neighbor 192.0.2.23
            advertise-label ipv4
        exit
    exit
    no shutdown
exit
exit

Output 6-14: Anycast Address (Context Identifier) Configuration at ABR1

router
    mh-secondary-interface "Anycast-Backup"
        address 192.0.2.254/32
        no shutdown
    exit
    mh-primary-interface "Anycast-Master"
        address 192.0.2.253/32
        no shutdown
    exit
exit

The next step is to advertise the Anycast addresses into IS-IS. The example in Output 6-15 shows both the Anycast Master and Backup addresses being passively added to the core IS-IS instance (instance 0) at ABR1. In reality, given that the ABRs impose Next-Hop-Self toward the core and aggregation domains in this example, it would be beneficial to provide the Anycast ABR resilience not only toward the core, but also toward the aggregation domain, and therefore advertise the Anycast addresses into IS-IS instance 1 as well. One last point is that the Primary/Backup addresses are advertised into IS-IS with different metrics to ensure that ABR1 is the shortest path for the Anycast Master address 192.0.2.253.

Finally, LDP prefix FECs are advertised for the Anycast addresses. By default, SR-OS advertises LDP-prefix FECs only for the system interface. Therefore an LDP export policy is applied at ABR1 and ABR2 to advertise FECs for the Anycast Master and Backup IPv4 addresses.

Output 6-15: Anycast BGP IS-IS Configuration at ABR1

router
    isis
        interface "Anycast-Master"
            level-capability level-2
            passive
            level 2
                metric 1
            exit
            no shutdown
        exit
        interface "Anycast-Backup"
            level-capability level-2
            passive
            level 2
                metric 1000
            exit
            no shutdown
        exit

Output 6-16: Anycast BGP LDP Configuration at ABR1

router
    ldp
        export "Anycast-LDP"
    exit
    policy-options
        begin
        prefix-list "Anycast-Backup"
            prefix 192.0.2.254/32 exact
        exit
        prefix-list "Anycast-Master"
            prefix 192.0.2.253/32 exact
        exit
        policy-statement "Anycast-LDP"
            entry 10
                from
                    prefix-list "Anycast-Master"
                exit
                to
                    protocol ldp
                exit
                action accept
                exit
            exit
            entry 20
                from
                    prefix-list "Anycast-Backup"
                exit
                to
                    protocol ldp
                exit
                action accept
                exit
            exit
        exit
        commit
    exit

From an ABR perspective, the configuration for BGP Anycast is complete. However, from a general perspective of BGP Anycast redundancy when deployed at an ABR, consider one final point. In a failure scenario where the Backup ABR receives labeled packets that were originally destined toward the Master ABR, it can only perform a swap action in the Anycast context-specific label-switching table. That is, the ABR can only swap a BGP label for a BGP label. It cannot perform other actions such as pop.

For example, in Figure 6-11 AGN1 has a system address of 192.0.2.22, which is advertised into labeled IPv4 BGP to ABR1 and ABR2. ABR1, in its role as Route-Reflector, subsequently advertises that route to ABR2 with the Next-Hop set to its Anycast Master address 192.0.2.253. However, ABR2 does not install this as a valid Anycast-label because its preferred route to AGN1 is through IS-IS/LDP. Any packets arriving at ABR2 that were originally destined toward ABR1 have a three-level label stack consisting of {LDP label, BGP label, and service-label}, but must be forwarded toward AGN1 with a two-level label stack of {LDP label, service-label}, which dictates a pop of the middle label in the stack.

This pop action is not possible. Therefore a requirement of Anycast BGP is that separate service loopbacks are used at the AGNs to provide end-to-end connectivity with ABR redundancy. These separate loopbacks are advertised into labeled IPv4 BGP but not IS-IS/LDP. As a result, the ABR has only a labeled BGP route for this loopback and can perform a swap action in the Anycast label-switching context. Referring again to Figure 6-11, AGN1 has a separate loopback address of 192.0.2.122 while AGN2 has a loopback address of 192.0.2.113 for this purpose.

Output 6-17 shows the Anycast context-specific label-switching table at ABR2 when only the system address (192.0.2.22) is advertised into labeled IPv4 BGP from AGN1. In this scenario, the system address of AGN1 is known through IS-IS/LDP at ABR2, so the label swap entry is not programmed and BGP Anycast is not functional. Conversely, Output 6-18 shows the same Anycast label-switching table when AGN1 advertises its additional loopback 192.0.2.122 only into labeled IPv4 BGP.

The label swap action programmed into the Anycast context-specific label-switching table reflects the label value advertised by ABR1 and AGN1 respectively. You can confirm this by comparing the BGP UPDATEs from both routers in Output 6-19. Traffic arriving at ABR2 that was originally destined for ABR1 has a BGP label of value 262135. To forward packets toward AGN1, this label must be swapped to a value of 262134 advertised by AGN1.
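The condition under which ABR2 can program an Anycast swap entry can be sketched as follows. This Python model is an illustration only: the `anycast_entry` function, the dictionary layout, and the `bgp-label` and `isis` protocol strings are hypothetical, while the Next-Hop and label values are those quoted in this section for prefix 192.0.2.122/32.

```python
def anycast_entry(best_route_protocol, routes, context_id):
    """Return (abr_label, pe_addr, pe_label), or None if no swap is possible."""
    if best_route_protocol != "bgp-label":
        # Prefix is preferred via IGP/LDP: forwarding would need a pop, not a swap.
        return None
    from_primary = next((r for r in routes if r["nexthop"] == context_id), None)
    from_pe = next((r for r in routes if r["nexthop"] != context_id), None)
    if from_primary is None or from_pe is None:
        return None
    # Swap the label advertised by the primary ABR for the label from the AGN.
    return (from_primary["label"], from_pe["nexthop"], from_pe["label"])


# Labeled BGP routes seen at ABR2 for the separate loopback of AGN1.
routes = [
    {"nexthop": "192.0.2.22", "label": 262134},   # advertised by AGN1
    {"nexthop": "192.0.2.253", "label": 262135},  # advertised by ABR1
]

if __name__ == "__main__":
    print(anycast_entry("bgp-label", routes, "192.0.2.253"))
    # If the prefix were preferred through IS-IS/LDP, nothing is installed.
    print(anycast_entry("isis", routes, "192.0.2.253"))
```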

Output 6-17: Label-Switching Context at ABR2 without Separate Loopback

*A:ABR2# show router bgp anycast-label
==================================================================
BGP Anycast-MH labels
==================================================================
Secondary-MH-Addr   ABR-Lbl   Cfg-Time   VPRN-ID
PE-Addr             PE-Lbl    Rem-Time   Ref-Count
------------------------------------------------------------------
No Entries Found

Output 6-18: Label-Switching Context at ABR2 with Separate Loopback

*A:ABR2# show router bgp anycast-label
==================================================================
BGP Anycast-MH labels
==================================================================
Secondary-MH-Addr   ABR-Lbl   Cfg-Time   VPRN-ID
PE-Addr             PE-Lbl    Rem-Time   Ref-Count
------------------------------------------------------------------
192.0.2.253         262135    30         -
192.0.2.22          262134    -          1
==================================================================

Output 6-19: Advertised Labels from AGN1 and ABR1

*A:ABR2# show router bgp routes 192.0.2.122/32 detail | match expression "Nexthop"
Nexthop        : 192.0.2.22
Res. Nexthop   : 192.0.0.130 (LDP)
IPv4 Label     : 262134
Nexthop        : 192.0.2.253
Res. Nexthop   : Unresolved
IPv4 Label     : 262135

It should be clear that BGP Anycast can provide protection only to prefixes that the ABR does not know through the IGP/LDP, that is, prefixes whose preferred route is via labeled BGP. As illustrated throughout this section, this dictates the use of a separate loopback at the AGN that is thereafter used for signaling service labels. For services signaled using BGP, this dictates the use of this service loopback for VPN-IPv4/IPv6 peering using the local-address parameter at the group or neighbor level. For services signaled through targeted LDP, this dictates the use of the local-lsr-id parameter.

This use of a separate loopback not known in the IGP can be considered a significant drawback of BGP Anycast, but it can be argued that separate loopback addresses are beneficial if an operator is deploying seamless MPLS into a brownfield network. Most existing networks have evolved over time, and frequently IP addressing schemes can become untidy. The deployment of new addresses for seamless MPLS can provide an ideal opportunity to assign ranges to domains that can be more easily managed through policy.