Versatile Routing and Services with BGP: Understanding and Implementing BGP in SR-OS (2014)
Chapter 1. Getting Started
Although this book does not discuss the operation of BGP as a path-vector protocol, it's worth a quick recap on how a BGP speaker processes and stores routes in the Routing Information Bases (RIBs). The RIB within a BGP speaker is made up of three distinct parts: the Adj-RIB-In, the Loc-RIB, and the Adj-RIB-Out. The Adj-RIB-In stores routing information learned from inbound UPDATE messages advertised by peers to the local router. The routes in the Adj-RIB-In represent routes that are available to the path decision process. The Loc-RIB contains routing information the local router selected after applying policy to the routing information contained in the Adj-RIB-In. These are the routes that will be used by the local router. The Adj-RIB-Out stores information the local router selected for advertisement to its peers. This information is carried in UPDATE messages sourced by this router when advertising to peers. In summary, the Adj-RIB-In contains unprocessed routing information advertised by peers to the local router, the Loc-RIB contains the routes that have been selected by the local BGP speaker's best-path decision process, and the Adj-RIB-Out contains the routes for advertisement to peers in UPDATE messages. I'll use this terminology throughout the book, and may interchangeably use Adj-RIB-In or simply RIB-In, and Adj-RIB-Out or simply RIB-Out.
Enabling BGP in its most basic form is a very simple exercise. All you need is an IP interface toward a BGP peer and some minimal BGP configuration. For conciseness, Output 1-1 does not show the IP interface configuration. For exchange of IPv4 reachability, the only parameters required are an Autonomous System (AS) number defined within the global router context (or Virtual Private Routed Network [VPRN] context), an IP address for the peer, and a peer AS number. The IP address and peer AS number are entered in a BGP group context, often referred to as a peer group. Peer groups allow you to group together a set of peers that have a common administrative configuration, and are discussed further in Chapter 10.
Output 1-1: Basic BGP Configuration
router
autonomous-system 64496
bgp
group “EBGP”
neighbor 192.168.0.2
peer-as 64510
exit
exit
no shutdown
exit
exit
Session Negotiation and Capabilities
A Finite State Machine (FSM) is maintained for each BGP peer, and there are six possible states in the FSM. Initially, the FSM for the BGP peer is in the Idle state. In this state, the router listens for a TCP connection initiated by the remote peer or initiates the TCP connection itself. The second state is the Connect state, where the FSM is waiting for the TCP three-way handshake to be completed. If the TCP connection is not successfully established, the state is changed to Active and a further attempt is made to establish the TCP connection to the remote peer. (If the connection continues to fail, the FSM reverts to the Idle state.) If the TCP connection is successfully established, the FSM completes the BGP initialization, generates an OPEN message toward the peer, and changes its state to OpenSent. If an OPEN message is also received from the remote peer and the parameters contained in the OPEN message are acceptable, the router generates a KEEPALIVE message and changes its state to OpenConfirm. If the parameters of the OPEN message are not acceptable, a NOTIFICATION message is sent with the appropriate error code, and the state is reverted to Idle. While in the OpenConfirm state, if the router receives a KEEPALIVE message from the remote peer, it moves to the Established state. In the Established state, peers can send UPDATE messages to exchange routing information.
The OPEN message sent by each peer contains its AS number, Hold Time, BGP identifier, and some optional parameters. The notable optional parameter is the Capabilities parameter. The Capabilities parameter is defined in RFC 5942 and allows BGP speakers to exchange capability sets in the OPEN exchange. If both peers advertise a given capability, the peers can use that advertised capability on the peering. If either peer did not advertise the capability, it cannot be used.
The Capabilities parameter is encoded as a code, a length, and a value. The output in Debug 1-1 is taken from an OPEN negotiation between an SR-OS router and a test device. The SR-OS router sends its OPEN message with capability codes indicating support for IPv4 unicast Multi-Protocol (MP)-BGP, Route-Refresh, and 4-byte ASN support. The capability code for MP-BGP encodes a value (0x0 0x1 0x0 0x1) that represents an Address Family Identify (AFI) of IPv4 (0x0 0x01) and a Subsequent Address Family Identifier (SAFI) of unicast (0x0 0x1) indicating support only for IPv4 unicast MP-BGP. (The use of the AFI and SAFI for Multi-Protocol BGP is discussed in further detail later in this chapter.) The capability code for 4-Octet ASN also encodes a value indicating its 4-byte Autonomous System number. In this case the SR-OS router only has a 2-byte Autonomous System number; therefore, it is converted into a 4-byte Autonomous System number by setting the two high-order octets of the 4-octet field set to zero.
Figure 1-1 Finite State Machine
Conversely, the test device peer sends its OPEN message indicating support for IPv4 unicast MP-BGP, IPv6 unicast MP-BGP, and Route Refresh. In this OPEN message the capability code for MP-BGP appears twice; each occurrence contains a different capability value. The first occurrence indicates support for IPv4 unicast. The second occurrence, with value (0x0 0x2 0x0 0x1), represents an AFI of IPv6 (0x0 0x2) and a SAFI of unicast (0x0 0x1).
Debug 1-1: OPEN message with Capabilities Negotiation
135 2013/04/18 14:47:00.98 BST MINOR: DEBUG #2001 Base BGP
"BGP: OPEN
Peer 1: 192.168.0.2 - Send (Active) BGP OPEN: Version 4
AS Num 64496: Holdtime 90: BGP_ID 192.0.2.46: Opt Length 16
Opt Para: Type CAPABILITY: Length = 14: Data:
Cap_Code MP-BGP: Length 4
Bytes: 0x0 0x1 0x0 0x1
Cap_Code ROUTE-REFRESH: Length 0
Cap_Code 4-OCTET-ASN: Length 4
Bytes: 0x0 0x0 0x11 0xed
"
137 2013/04/18 14:47:00.97 BST MINOR: DEBUG #2001 Base BGP
"BGP: OPEN
Peer 1: 192.168.0.2 - Received BGP OPEN: Version 4
AS Num 64510: Holdtime 30: BGP_ID 192.168.0.2: Opt Length 16
Opt Para: Type CAPABILITY: Length = 14: Data:
Cap_Code MP-BGP: Length 4
Bytes: 0x0 0x1 0x0 0x1
Cap_Code MP-BGP: Length 4
Bytes: 0x0 0x2 0x0 0x1
Cap_Code ROUTE-REFRESH: Length 0
"
This asymmetric capability negotiation is acceptable from the perspective of the peering session, providing that the only optional capabilities used are IPv4 MP-BGP and Route-Refresh. If, for example, the peer advertises an IPv6 prefix using MP-BGP, this results in a NOTIFICATION message being sent. The integrity of the peering session thereafter is dependent on supported and configured error handling capabilities. Standard capabilities' codes are maintained by the Internet Assigned Numbers Authority (IANA) at www.iana.org/assignments/capability-codes/capability-codes.xml but vendor-specific capability codes are in widespread use. During capability exchange these should be ignored by a BGP speaker if not recognized.
Output 1-2: Local/Remote Capabilities
*A:R1# show router bgp neighbor 192.168.0.2 | match expression “Local|Remote”
Local AS : 64496 Local Port : 179
Local Address : 192.168.0.1
Local Family : IPv4
Remote Family : IPv4 IPv6
Local Capability : RtRefresh MPBGP 4byte ASN
Remote Capability : RtRefresh MPBGP
Local AddPath Capabi*: Disabled
Remote AddPath Capab*: Send - None
The Hold Times negotiated in the OPEN exchange do not have to be the same for the BGP session to be established. The BGP speaker calculates the active Hold Time value by using the smaller of its configured value and the value received in the OPEN message. In the OPEN exchange shown in Debug 1-1, SR-OS uses the default Hold Time of 90 seconds while the peer advertises a Hold Time of 30 seconds. This exchange results in both peers using a Hold Time of 30 seconds, with KEEPALIVE messages exchanged every (30/3) 10 seconds.
As previously described, when a BGP speaker has sent an OPEN message it moves to the OpenSent state, and when it has received a corresponding OPEN message from its peer it moves to OpenConfirm state. If the BGP speaker is happy with the contents of the received OPEN message, it responds with a KEEPALIVE message. When each BGP speaker has sent and received an OPEN message and KEEPALIVE message, they move to the ESTABLISHED state and can then exchange reachability information.
UPDATE Messages
This book does not explicitly detail all BGP message formats, but it's useful to review the basic BGP UPDATE format so you can understand the differences between it and the general format of Multi-Protocol BGP UPDATE messages. The Withdrawn Routes field contains a list of IP prefixes in the form <length, prefix> that are being withdrawn from service. The Network Layer Reachability Information (NLRI) field contains a list of IP prefixes, again in the form <length, prefix>, that can be reached from a given BGP speaker (subject to policy).
Debug 1-2: Active Hold Time
*A:R1# show router bgp neighbor 192.168.0.2 | match "Hold Time"
Hold Time : 90 Keep Alive : 30
Min Hold Time : 0
Active Hold Time : 30 Active Keep Alive : 10
Figure 1-2 UPDATE Message Format
The Path attributes field contains a sequence of attributes associated with an NLRI and each attribute can be placed into one of four categories: well-known mandatory, well-known discretionary, optional transitive, and optional non-transitive. Non-transitive simply refers to the fact that this attribute may be advertised into an AS but may not leave that AS.
Mandatory attributes must be present in the UPDATE message if NLRI is present (that is, the UPDATE does not purely carry Withdraw routes) and include the ORIGIN, AS_PATH, and NEXT_HOP attributes. Examples of well-known discretionary attributes include LOCAL_PREF and ATOMIC_AGGREGATE.
Output 1-3: UPDATE Message with NLRI
1 2013/06/09 09:07:10.11 BST MINOR: DEBUG #2001 Base Peer 1: 192.168.0.2
"Peer 1: 192.168.0.2: UPDATE
Peer 1: 192.168.0.2 - Received BGP UPDATE:
Withdrawn Length = 0
Total Path Attr Length = 18
Flag: 0x40 Type: 1 Len: 1 Origin: 0
Flag: 0x40 Type: 3 Len: 4 Nexthop: 192.168.0.2
Flag: 0x40 Type: 2 Len: 4 AS Path:
Type: 2 Len: 1 < 64510 >
NLRI: Length = 4
172.16.0.0/20
"
At the beginning of the path attribute field there is a 2-octet field that contains an Attribute Flags octet followed by the Attribute Type Code octet as shown in Figure 1-3.
Figure 1-3 Path Attribute Flags
The Attribute Type Code is a value defining the type of Attribute. Within the Attribute Flags octet the high-order bit (bit 0) is the Optional bit and defines whether the attribute is optional (1) or well-known (0). Bit 1 is the Transitive bit and defines whether an optional attribute is transitive (1) or non-transitive (0). Bit 2 is the Partial bit and defines whether an optional transitive attribute is recognized by a BGP speaker when advertising it to peers (0), or unrecognized (1). Note that if a BGP speaker recognizes the optional transitive attribute (and would therefore set the partial bit to 0), but the partial bit has already been set to 1 by some other AS, it must not be set back to zero by the processing speaker. In effect, when set, the partial bit provides visibility that some BGP speaker along the path didn't recognize the attribute. Bits 4-7 are reserved and should be set to zero (some early Internet drafts on error handling for optional-transitive attributes proposed the use of bit 4, but this proposal was largely superseded through widespread adoption of other error handling drafts discussed in Chapter 8).
Examples of optional non-transitive attributes include the MED, ORIGINATOR_ID, CLUSTER, MP_REACH, and MP_UNREACH attributes, while examples of optional transitive attributes include the AGGREGATOR and COMMUNITY attributes.
In order to withdraw a route from service once it has been advertised, the IP prefix previously advertised as NLRI in the UPDATE message can be advertised in the Withdrawn Routes field of an UPDATE message, or a replacement route with the same NLRI can be advertised. Equally, if the BGP session between two peers is closed, all routes advertised to each other are implicitly removed. If an UPDATE message carries only Withdrawn Routes and no NLRI, the mandatory attributes such as NEXT_HOP, ORIGIN, and AS_PATH need not be present.
NOTIFICATION Messages
A NOTIFICATION message is sent when an error condition is detected and causes the BGP session to close. The NOTIFICATION message contains fields for error codes, one or more error sub-codes associated with that error code, and a data field that provides some indication of the error condition. Error codes and sub-codes are contained in section 4.5 of RFC 4271, updated by RFC 4486 (Subcodes for BGP Cease NOTIFICATION Message).
Debug 1-3: UPDATE Message with Withdrawn Routes
3 2013/06/09 09:09:06.50 BST MINOR: DEBUG #2001 Base Peer 1: 192.168.0.2
"Peer 1: 192.168.0.2: UPDATE
Peer 1: 192.168.0.2 - Received BGP UPDATE:
Withdrawn Length = 4
172.16.0.0/20
Total Path Attr Length = 0"
Error conditions that require a NOTIFICATION message to be sent are categorized into three types:
· Those experienced during processing of the generic BGP message header
· Those experienced in processing of the OPEN message
· Those experienced in processing of UPDATE messages
When the BGP session is closed, the associated TCP connection is closed, the RIB-IN entries with the peer are cleared, and all resources allocated to that particular peer are released. Errors in the BGP message header are uncommon and indicate a fairly fundamental problem. Errors in the OPEN message are typically due to misconfiguration of peer parameters. However, errors in UPDATE messages are not uncommon, and have the potential to be extremely disruptive.
Debug 1-4: NOTIFICATION Message
11 2013/06/09 09:14:03.48 BST MINOR: DEBUG #2001 Base Peer 1: 192.168.0.2
"Peer 1: 192.168.0.2: NOTIFICATION
Peer 1: 192.168.0.2 - Received BGP NOTIFICATION: Code = 6 (CEASE) Subcode
= 4 (Administrative Reset)
Data Length = 16 Data: 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
0x0 0x0 0x0 0x0"
The original BGP specification called for a NOTIFICATION message to be generated under a number of conditions during error checking of attributes within UPDATE messages. More recent work (draft-ietf-grow-ops-reqs-for-bgp-error-handling) has called for alternative measures to be implemented under these circumstances in order to avoid this level of disruption. This point is discussed further in Chapter 8.
Multi-Protocol BGP
The Multi-Protocol extensions to BGP defined in RFC 4760 provide the capability for BGP to carry routing information for multiple network layer protocols such as IPv6, VPN-IPv4, VPN-IPv6, L2VPN, and Multicast-VPN, to name but a few. To identify individual network layer protocols and be able to associate them with Next-Hop information and the semantics of the NLRI, the extensions to Multi-Protocol BGP specified the use of the Address Family Identifier (AFI) and the Subsequent Address Family Identifier (SAFI).
AFI and SAFI assignments are administered by IANA at www.iana.org/assignments/address-family-numbers/address-family-numbers.xhtml and www.iana.org/assignments/safi-namespace/safi-namespace.xhtml. By way of example, a VPN-IPv4 prefix is represented as AFI 1 (IPv4), SAFI 128 (MPLS-labeled VPN address).
Two optional transitive attributes were introduced to support Multi-Protocol extensions to BGP: Multi-Protocol Reachable NLRI and Multi-Protocol Unreachable NLRI. The Multi-Protocol Reachable NLRI (MP_REACH_NLRI) is used to carry the set of reachable destination prefixes together with the Next-Hop information to be used for forwarding to those destination prefixes. Each MP_REACH_NLRI UPDATE message contains a single Next-Hop address and a list of NLRIs associated with that Next-Hop address.
At a minimum, an UPDATE message that carries the MP_REACH_NLRI must also carry the Next-Hop, Origin, and AS_PATH attributes in both EBGP and IBGP, and the LOCAL-PREF attribute in IBGP.
In contrast, Multi-Protocol Unreachable NLRI (MP_UNREACH_NLRI) is used to withdraw one or more unfeasible routes and has much the same format as the MP_REACH_NLRI attribute without the requirement to signal Next-Hop information.
Figure 1-4 MP_REACH_NLRI Encoding
Figure 1-5 MP_UNREACH_NLRI Encoding
Unlike the MP_REACH_NLRI, an UPDATE message containing the MP_UNREACH_NLRI attribute is not required to carry any other path attributes.
The capability to support Multi-Protocol BGP is negotiated in the OPEN exchange on an Address Family basis. By default, SR-OS signals the Multi-Protocol BGP capability for AFI/SAFI unicast IPv4 only. If other Address Families are added or removed at BGP/group/neighbor level, the OPEN exchange is renegotiated. To illustrate the encoding of the Multi-Protocol BGP MP_REACH_NLRI, Debug 1-5 shows an UPDATE message for IPv6 prefix 2a00:8010:1b00::/48. Note the Address Family, Next-Hop information, and prefix are all contained within the single MP_REACH_NLRI attribute.
The introduction of Multi-Protocol BGP was significant. BGP was already considered a very flexible protocol and relatively lightweight to support, and with the introduction of Multi-Protocol BGP AFI/SAFI and different NLRI it had become extensible to support any other network layer as you'll see in the following chapters.
UPDATE or MP_REACH, and Withdraw or MP_UNREACH are referred to interchangeably throughout this book.
Debug 1-5 UPDATE with MP_REACH_NLRI attribute
1 2013/05/02 13:54:46.39 BST MINOR: DEBUG #2001 Base Peer 1: 192.168.0.2
"Peer 1: 192.168.0.2: UPDATE
Peer 1: 192.168.0.2 - Received BGP UPDATE:
Withdrawn Length = 0
Total Path Attr Length = 42
Flag: 0x40 Type: 1 Len: 1 Origin: 0
Flag: 0x40 Type: 2 Len: 4 AS Path:
Type: 2 Len: 1 < 64510 >
Flag: 0x80 Type: 14 Len: 28 Multiprotocol Reachable NLRI:
Address Family IPV6
NextHop len 16 Global NextHop 2001:db8:1C00::3
2001:db8:1B00::/48"