Border Gateway Protocol

From Wikipedia, the free encyclopedia

The Border Gateway Protocol (BGP) is the core routing protocol of the Internet. It works by maintaining a table of IP networks or 'prefixes' which designate network reachability among autonomous systems (AS). It is described as a path vector protocol. BGP does not use traditional IGP metrics, but makes routing decisions based on path, network policies and/or rulesets.

Since 1994, version four of the protocol has been in use on the Internet. All previous versions are now obsolete. The major enhancement in version 4 was support of Classless Inter-Domain Routing and use of route aggregation to decrease the size of routing tables. Since January 2006, version 4 has been codified in RFC 4271, which went through well over 20 drafts after the earlier RFC 1771 specification of version 4. The RFC 4271 version corrected a number of errors, clarified ambiguities, and also brought the RFC much closer to industry practices.

BGP was created to replace the Exterior Gateway Protocol (EGP) and to allow fully decentralized routing, which in turn permitted the removal of the NSFNet Internet backbone network. This allowed the Internet to become a truly decentralized system.

Very large private IP networks can also make use of BGP. An example would be the joining of a number of large Open Shortest Path First (OSPF) networks where OSPF by itself would not scale to the required size. Another reason to use BGP would be multihoming a network for better redundancy, either to multiple access points of a single ISP (RFC 1998) or to multiple ISPs.

Most Internet users do not use BGP directly. However, since most Internet service providers must use BGP to establish routing between one another (especially if they are multihomed), it is one of the most important protocols of the Internet. Compare this with Signalling System 7 (SS7), which is the inter-provider core call setup protocol on the PSTN.

BGP operation

BGP neighbors, or peers, are established by manual configuration between routers to create a TCP session on port 179. A BGP speaker will periodically send 19-byte keep-alive messages to maintain the connection (every 60 seconds by default). Among routing protocols, BGP is unique in using TCP as its transport protocol.
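The 19-byte figure follows from the fixed BGP message header in RFC 4271: a 16-byte marker, a 2-byte length, and a 1-byte type, with no body for a KEEPALIVE. The following is a minimal sketch of that layout; the function name `build_keepalive` is an illustrative choice, not part of any BGP implementation.

```python
import struct

BGP_MARKER = b"\xff" * 16   # 16-byte marker field, all bits set (RFC 4271)
BGP_TYPE_KEEPALIVE = 4      # message type code for KEEPALIVE
BGP_PORT = 179              # TCP port on which BGP sessions are opened


def build_keepalive() -> bytes:
    """Return a KEEPALIVE message: the fixed BGP header with no body."""
    length = 16 + 2 + 1  # marker + length field + type field
    return BGP_MARKER + struct.pack("!HB", length, BGP_TYPE_KEEPALIVE)


if __name__ == "__main__":
    message = build_keepalive()
    print(len(message), message.hex())
```

A real speaker would write these bytes to the established TCP session on a timer; the timer itself is omitted here.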

When BGP runs inside an autonomous system (AS), it is referred to as Internal BGP (IBGP, or Interior Border Gateway Protocol). When BGP runs between ASs, it is called External BGP (EBGP, or Exterior Border Gateway Protocol). Routers that sit on the boundary of one AS and exchange information with another AS are called border or edge routers. In the Cisco operating system, IBGP routes have an administrative distance of 200, which is less preferred than either external BGP or any interior routing protocol. Other router implementations also prefer EBGP to IGPs, and IGPs to IBGP.

Optional Extensions negotiated at Connection Setup

During the OPEN handshake, BGP speakers can negotiate^[1] optional capabilities of the session, including multiprotocol extensions and various recovery modes. If the multiprotocol extensions to BGP ^[2] are negotiated at the time of creation, the BGP speaker can prefix the Network Layer Reachability Information (NLRI) it advertises with an address family prefix. These families include the default IPv4, but also IPv6, IPv4 and IPv6 Virtual Private Networks, and multicast BGP. Increasingly, BGP is used as a generalized signaling protocol to carry information about routes that may not be part of the global Internet, such as VPNs ^[3].
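As an illustration of how an address family is announced during the OPEN handshake, the sketch below encodes the multiprotocol capability as a code, a length, and a value made of an address family identifier (AFI), a reserved byte, and a subsequent address family identifier (SAFI). The numeric code points are the registered ones for these families; the function names and the simple set intersection used to model negotiation are assumptions made for the example.

```python
import struct

# Registered code points for a few address families.
AFI_IPV4, AFI_IPV6 = 1, 2
SAFI_UNICAST, SAFI_MULTICAST, SAFI_MPLS_VPN = 1, 2, 128
CAP_MULTIPROTOCOL = 1  # capability code for the multiprotocol extensions


def encode_mp_capability(afi: int, safi: int) -> bytes:
    """Encode one multiprotocol capability: code, length, AFI, reserved, SAFI."""
    value = struct.pack("!HBB", afi, 0, safi)
    return struct.pack("!BB", CAP_MULTIPROTOCOL, len(value)) + value


def negotiated_families(local, remote):
    """Model negotiation: a family is usable only if both speakers offered it."""
    return set(local) & set(remote)


if __name__ == "__main__":
    local = {(AFI_IPV4, SAFI_UNICAST), (AFI_IPV6, SAFI_UNICAST)}
    remote = {(AFI_IPV4, SAFI_UNICAST), (AFI_IPV4, SAFI_MPLS_VPN)}
    print(encode_mp_capability(AFI_IPV6, SAFI_UNICAST).hex())
    print(negotiated_families(local, remote))
```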

Finite state machine

In order to make decisions in its operations with other BGP peers, a BGP peer uses a simple finite state machine that consists of six states: Idle, Connect, Active, OpenSent, OpenConfirm, and Established. For each peer-to-peer session, a BGP implementation maintains a state variable that tracks which of these six states the session is in. The protocol defines the messages that each peer should exchange in order to change the session from one state to another. The first state is “Idle”. In this state the router refuses incoming BGP connections from the peer; when a start event occurs, it initiates a TCP connection and moves to the “Connect” state. In “Connect” the router waits for the TCP connection to complete. If the connection succeeds, the router sends an OPEN message and moves to “OpenSent”; if it does not, the session may fall back to the “Active” state, in which the router listens for, and retries, a connection. In “OpenSent” the router waits for an OPEN message from the neighbor; once a valid OPEN arrives, it replies with a KEEPALIVE and moves to “OpenConfirm”. In “OpenConfirm” it waits for a KEEPALIVE from the neighbor, and on receiving one enters the “Established” state. Once the session is established, the routers can exchange UPDATE, KEEPALIVE, and NOTIFICATION messages. An error in any state generally returns the session to Idle.

BGP state machine
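The six states can be sketched as a small transition table. This is a deliberately simplified outline of the RFC 4271 state machine: the event names are invented for the example, timers are omitted, and every unlisted event (errors, notifications, timer expiry) is assumed to return the session to Idle.

```python
from enum import Enum


class State(Enum):
    IDLE = "Idle"
    CONNECT = "Connect"
    ACTIVE = "Active"
    OPEN_SENT = "OpenSent"
    OPEN_CONFIRM = "OpenConfirm"
    ESTABLISHED = "Established"


# (current state, event) -> next state. Event names are hypothetical labels.
TRANSITIONS = {
    (State.IDLE, "start"): State.CONNECT,
    (State.CONNECT, "tcp_established"): State.OPEN_SENT,   # OPEN is sent here
    (State.CONNECT, "tcp_failed"): State.ACTIVE,            # simplification
    (State.ACTIVE, "tcp_established"): State.OPEN_SENT,
    (State.OPEN_SENT, "open_received"): State.OPEN_CONFIRM,  # KEEPALIVE is sent
    (State.OPEN_CONFIRM, "keepalive_received"): State.ESTABLISHED,
    (State.ESTABLISHED, "keepalive_received"): State.ESTABLISHED,
    (State.ESTABLISHED, "update_received"): State.ESTABLISHED,
}


def next_state(state: State, event: str) -> State:
    """Return the next state; any unlisted event drops the session to Idle."""
    return TRANSITIONS.get((state, event), State.IDLE)


if __name__ == "__main__":
    state = State.IDLE
    for event in ("start", "tcp_established", "open_received", "keepalive_received"):
        state = next_state(state, event)
        print(event, "->", state.value)
```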

Basic BGP UPDATES

Once a BGP session is running, the BGP speakers exchange UPDATE messages about destinations to which the speaker offers connectivity. In the protocol, the basic CIDR route description is called Network Layer Reachability Information (NLRI). The NLRI gives the destination prefix and prefix length; the path of autonomous systems to the destination, the next hop, and other properties travel with it as path attributes, which can carry a wide range of additional information that affects the acceptance policy of the receiving router. BGP speakers incrementally announce new NLRI to which they offer reachability, and also announce withdrawals of prefixes to which they no longer offer connectivity.
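On the wire, each IPv4 NLRI entry in an UPDATE is a one-byte prefix length followed by only as many prefix bytes as that length requires. The sketch below shows that encoding and its inverse for IPv4 only; the function names are illustrative, and path attributes are left out.

```python
import ipaddress


def encode_nlri(prefix: str) -> bytes:
    """Encode one IPv4 prefix as <length in bits><minimal prefix bytes>."""
    network = ipaddress.ip_network(prefix)
    nbytes = (network.prefixlen + 7) // 8  # round up to whole bytes
    return bytes([network.prefixlen]) + network.network_address.packed[:nbytes]


def decode_nlri(data: bytes):
    """Decode a concatenation of IPv4 NLRI entries into prefix strings."""
    prefixes = []
    i = 0
    while i < len(data):
        prefix_len = data[i]
        nbytes = (prefix_len + 7) // 8
        raw = data[i + 1 : i + 1 + nbytes].ljust(4, b"\x00")  # pad to 4 bytes
        prefixes.append(f"{ipaddress.IPv4Address(raw)}/{prefix_len}")
        i += 1 + nbytes
    return prefixes


if __name__ == "__main__":
    wire = encode_nlri("10.0.0.0/8") + encode_nlri("172.16.0.0/18")
    print(wire.hex(), decode_nlri(wire))
```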

BGP Router Connectivity and Learning Routes

In the simplest arrangement all routers within a single AS and participating in BGP routing must be configured in a full mesh: each router must be configured as peer to every other router. This causes scaling problems, since the number of required connections grows quadratically with the number of routers involved. To get around this, two solutions are built into BGP: route reflectors (RFC 4456, which obsoletes RFC 2796) and confederations (RFC 3065). For the following discussion of basic UPDATE processing, assume a full iBGP mesh.
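The quadratic growth is easy to see numerically: a full mesh of n routers needs n(n-1)/2 sessions, and each router terminates n-1 of them. A minimal sketch, with an illustrative function name:

```python
def full_mesh_sessions(routers: int) -> int:
    """Number of iBGP sessions when every router peers with every other."""
    return routers * (routers - 1) // 2


if __name__ == "__main__":
    for n in (5, 10, 50, 100):
        print(f"{n} routers: {full_mesh_sessions(n)} sessions, {n - 1} per router")
```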

Basic UPDATE Processing

A given BGP router may accept NLRI in UPDATEs from multiple neighbors and advertise NLRI to the same, or a different set, of neighbors. Conceptually, BGP maintains its own "master" routing table, called the Loc-RIB, separate from the main routing table of the router. For each neighbor, the BGP process maintains a conceptual Adj-RIB-In containing the NLRI received from the neighbor, and a conceptual Adj-RIB-Out for NLRI to be sent to the neighbor.

"Conceptual", in the preceding paragraph, means that the physical storage and structure of these various tables are decided by the implementer of the BGP code. Their structure is not visible to other BGP routers, although they usually can be interrogated with management commands on the local router. It is quite common, for example, to store both Adj-RIBs and the Loc-RIB in the same data structure, with additional information attached to the RIB entries. The additional information tells the BGP process such things as whether individual entries belong in the Adj-RIBs for specific neighbors, whether the per-neighbor route selection process made received routes eligible for the Loc-RIB, and whether Loc-RIB entries are eligible to be submitted to the local router's routing table management process.

"Eligible to be submitted" means that BGP will submit the routes that it considers best to the main routing table process. Depending on the implementation of that process, the BGP route is not necessarily selected. For example, a directly connected prefix, learned from the router's own hardware, is usually most preferred. As long as that directly connected route's interface is active, the BGP route to the destination will not be put into the routing table. Once the interface goes down, and there are no more preferred routes, the Loc-RIB route would be installed in the main routing table.

Until recently, it was a common mistake to say "BGP carries policies". BGP actually carries the information with which rules inside BGP-speaking routers can make policy decisions. Some of the information carried, such as communities and multi-exit discriminators (MED), is explicitly intended to be used in policy decisions.

Route Selection

The BGP standard specifies a number of decision factors, more than are used by any other common routing process, for selecting NLRI to go into the Loc-RIB. The first decision point for evaluating NLRI is that its next-hop attribute must be reachable (or resolvable). Another way of saying the next-hop must be reachable is that there must be an active route, already in the main routing table of the router, to the prefix in which the next-hop address is located.
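The reachability test can be pictured as a longest-prefix lookup of the next-hop address in the main routing table. The sketch below assumes the table is just a list of prefix strings; real routers also track which routes are active, which this example does not model.

```python
import ipaddress


def resolve_next_hop(next_hop: str, routing_table):
    """Return the most specific prefix covering next_hop, or None if unresolvable."""
    address = ipaddress.ip_address(next_hop)
    matches = [
        network
        for network in (ipaddress.ip_network(p) for p in routing_table)
        if address in network
    ]
    if not matches:
        return None
    return str(max(matches, key=lambda network: network.prefixlen))


if __name__ == "__main__":
    table = ["10.0.0.0/8", "10.1.0.0/16"]
    print(resolve_next_hop("10.1.2.3", table))    # resolvable
    print(resolve_next_hop("192.0.2.1", table))   # not resolvable
```

A route whose next hop returns None here would be excluded before any of the later tie-breakers are applied.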

Next, for each neighbor, the BGP process applies various standard and implementation-dependent criteria to decide which routes conceptually should go into the Adj-RIB-In. The neighbor could send several possible routes to a destination, but the first level of preference is at the neighbor level. Only one route to each destination will be installed in the conceptual Adj-RIB-In. This process will also delete, from the Adj-RIB-In, any routes that are withdrawn by the neighbor.

Whenever a conceptual Adj-RIB-In changes, the main BGP process decides if any of the neighbor's new routes are preferred to routes already in the Loc-RIB. If so, it replaces them. If a given route is withdrawn by a neighbor, and there is no other route to that destination, the route is removed from the Loc-RIB, and no longer sent, by BGP, to the main routing table manager. If the router does not have a route to that destination from any non-BGP source, the withdrawn route will be removed from the main routing table.
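The interplay between the per-neighbor Adj-RIB-In tables and the Loc-RIB can be sketched as follows. The class and method names are hypothetical, and each route is reduced to a single preference number that stands in for the full decision process; only the announce, replace, and withdraw bookkeeping is modeled.

```python
class BgpTables:
    """Hypothetical container for Adj-RIB-In tables and the Loc-RIB."""

    def __init__(self):
        self.adj_rib_in = {}  # neighbor -> {prefix: preference}
        self.loc_rib = {}     # prefix -> (neighbor, preference)

    def announce(self, neighbor, prefix, preference):
        """Store a route received from a neighbor, then re-run selection."""
        self.adj_rib_in.setdefault(neighbor, {})[prefix] = preference
        self._reselect(prefix)

    def withdraw(self, neighbor, prefix):
        """Delete a route withdrawn by a neighbor, then re-run selection."""
        self.adj_rib_in.get(neighbor, {}).pop(prefix, None)
        self._reselect(prefix)

    def _reselect(self, prefix):
        """Keep the most preferred remaining route, or drop the prefix."""
        candidates = [
            (neighbor, rib[prefix])
            for neighbor, rib in self.adj_rib_in.items()
            if prefix in rib
        ]
        if candidates:
            self.loc_rib[prefix] = max(candidates, key=lambda c: c[1])
        else:
            self.loc_rib.pop(prefix, None)


if __name__ == "__main__":
    tables = BgpTables()
    tables.announce("peer-a", "10.0.0.0/8", 100)
    tables.announce("peer-b", "10.0.0.0/8", 200)
    tables.withdraw("peer-b", "10.0.0.0/8")
    print(tables.loc_rib)
```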

Per-Neighbor Decisions

After verifying that the next hop is reachable, if the route comes from an internal (i.e., iBGP) peer, the first rule to apply, according to the standard, is to examine the LOCAL_PREF attribute. If there are several iBGP routes from the neighbor, the one with the highest LOCAL_PREF is selected, unless there are several routes with the same LOCAL_PREF. In the latter case, the route selection process moves to the next tie-breaker. While LOCAL_PREF is the first rule in the standard, once reachability of the NEXT_HOP is verified, Cisco and several other vendors first consider a decision factor called WEIGHT, which is local to the router (i.e., not transmitted by BGP). The route with the highest WEIGHT is preferred.

LOCAL_PREF, WEIGHT, and other criteria can be manipulated by local configuration and software capabilities. Such manipulation is outside the scope of the standard but is commonly used. For example, the COMMUNITY attribute (see below) is not directly used by the BGP selection process. The BGP neighbor process, however, can apply a manually configured rule that sets LOCAL_PREF or another factor when the COMMUNITY value matches some pattern-matching criterion. If the route was learned from an external peer, the per-neighbor BGP process computes a LOCAL_PREF value from local policy rules, and then compares the LOCAL_PREF of all routes from the neighbor.

At the per-neighbor level, ignoring implementation-specific policy modifiers, the order of tie-breaking rules is:

Prefer the route with the shortest AS_PATH. An AS_PATH is the sequence of AS numbers that must be traversed to reach the advertised destination. AS1-AS2-AS3 is shorter than AS4-AS5-AS6-AS7.
Prefer routes with the lowest value of their ORIGIN attribute.
Prefer routes with the lowest MULTI_EXIT_DISC (multi-exit discriminator or MED) value.

Before the most recent edition of the BGP standard, if an UPDATE had no MULTI_EXIT_DISC value, several implementations created an MED with the highest possible value. The current standard, however, specifies that missing MEDs are to be treated as the lowest possible value. Since the now-specified rule may cause different behavior than the vendor interpretations, BGP implementations that used the nonstandard default value have a configuration feature that allows the old or standard rule to be selected.
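The configurable treatment of a missing MED can be sketched as a small helper. MED is a 4-byte unsigned value, so the two candidate defaults are 0 and 2^32 - 1; the function name and the `missing_as_worst` flag are hypothetical, chosen only to make the two behaviors explicit.

```python
MED_LOWEST = 0
MED_HIGHEST = 2**32 - 1  # MED is a 4-byte unsigned integer


def effective_med(med, *, missing_as_worst: bool) -> int:
    """Return the MED to compare, substituting a default when it is absent.

    med is None when the UPDATE carried no MULTI_EXIT_DISC attribute.
    Because lower MEDs are preferred, the highest value is the "worst".
    """
    if med is not None:
        return med
    return MED_HIGHEST if missing_as_worst else MED_LOWEST


if __name__ == "__main__":
    print(effective_med(None, missing_as_worst=False))
    print(effective_med(None, missing_as_worst=True))
    print(effective_med(50, missing_as_worst=True))
```

Two routers that disagree on this setting can rank the same pair of routes differently, which is why the choice is exposed as configuration.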

Decision Factors at the Loc-RIB Level

Once candidate routes are received from neighbors, the Loc-RIB software applies additional tie-breakers to routes to the same destination.

If at least one route was learned from an external neighbor (i.e., the route was learned from eBGP), drop all routes learned from iBGP.
Prefer the route with the lowest interior cost to the NEXT_HOP, according to the main Routing Table. If two neighbors advertised the same route, but one neighbor is reachable via a low-bandwidth link and the other by a high-bandwidth link, and the interior routing protocol calculates lowest cost based on highest bandwidth, the route through the high-bandwidth link would be preferred and other routes dropped.

If there is more than one route still tied at this point, several BGP implementations offer a configurable option to load-share among the routes, accepting all (or all up to some number).

Prefer the route learned from the BGP speaker with the numerically lowest BGP identifier.
Prefer the route learned from the BGP speaker with the lowest peer IP address.
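The tie-breakers listed in this and the preceding section can be collapsed into a single sort key, as sketched below. This follows RFC 4271 in preferring the highest LOCAL_PREF, and it makes several simplifying assumptions that are not in the standard: the per-neighbor and Loc-RIB stages are merged, vendor factors such as WEIGHT are ignored, and MED is compared across all routes rather than only among routes from the same neighboring AS. The `Route` fields are hypothetical names.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class Route:
    local_pref: int
    as_path: tuple          # sequence of AS numbers
    origin: int             # 0 = IGP, 1 = EGP, 2 = INCOMPLETE
    med: int
    from_ebgp: bool
    igp_cost: int           # interior cost to the NEXT_HOP
    peer_bgp_id: str        # BGP identifier in dotted-quad form
    peer_address: str


def preference_key(route: Route):
    """Build a tuple where a smaller value means a more preferred route."""
    return (
        -route.local_pref,                 # highest LOCAL_PREF first
        len(route.as_path),                # then shortest AS_PATH
        route.origin,                      # then lowest ORIGIN
        route.med,                         # then lowest MED
        0 if route.from_ebgp else 1,       # then eBGP over iBGP
        route.igp_cost,                    # then lowest interior cost
        int(ipaddress.ip_address(route.peer_bgp_id)),   # lowest BGP identifier
        int(ipaddress.ip_address(route.peer_address)),  # lowest peer address
    )


def best_route(routes):
    """Return the most preferred route among candidates for one destination."""
    return min(routes, key=preference_key)


if __name__ == "__main__":
    short = Route(100, (1, 2, 3), 0, 0, True, 10, "10.0.0.1", "192.0.2.1")
    long_ = Route(100, (4, 5, 6, 7), 0, 0, True, 10, "10.0.0.2", "192.0.2.2")
    print(best_route([long_, short]).as_path)
```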

Communities

BGP communities are sets of routes with some common attribute (RFC 1997). RFC 1998 shows one technique, based on communities, for multihoming with several connections to the same AS.

Uses of Multi-Exit Discriminators

MEDs, defined in the main BGP standard, were originally intended to show a neighboring AS which of several links between the two ASs the advertising AS prefers as the place to which the accepting AS should transmit traffic. Another application of MEDs is to advertise the value, typically based on delay, that multiple ASs with a presence at an IXP impose to send traffic to some destination.

BGP problems and mitigation

iBGP scalability

An autonomous system with IBGP must have all of its IBGP peers connect to each other in a full mesh (where everyone speaks to everyone directly). This full-mesh configuration requires that each router maintain a session to every other router. In large networks, this number of sessions may degrade the performance of routers, due either to a lack of memory or to excessive CPU processing requirements.

Route reflectors and confederations both reduce the number of iBGP peers to each router and thus reduce processing overhead. Route reflectors are a pure performance-enhancing technique, while confederations also can be used to implement more fine-grained policy.

Route reflectors^[4] reduce the number of connections required in an AS. A single router (or two for redundancy) can be made a route reflector: other routers in the AS need only be configured as peers to them.
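The saving can be estimated with a short calculation. The sketch below assumes one simple design that the text does not prescribe: every non-reflector router peers with each reflector, and the reflectors are fully meshed among themselves. For comparison, a full mesh of 100 routers needs 100 × 99 / 2 = 4950 sessions.

```python
def reflector_sessions(routers: int, reflectors: int) -> int:
    """Sessions needed when clients peer only with the route reflectors.

    Assumes a single cluster: each client has one session per reflector,
    and the reflectors keep a full mesh among themselves.
    """
    if not 0 < reflectors <= routers:
        raise ValueError("reflectors must be between 1 and the number of routers")
    clients = routers - reflectors
    return clients * reflectors + reflectors * (reflectors - 1) // 2


if __name__ == "__main__":
    print(reflector_sessions(100, 1))  # single reflector
    print(reflector_sessions(100, 2))  # two reflectors for redundancy
```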

Confederations are sets of autonomous systems. In common practice, ^[5] only one of the confederation AS numbers is seen by the Internet as a whole. Confederations are used in very large networks, where a large AS can be configured to encompass smaller, more manageable internal ASs.

Confederations can be used in conjunction with route reflectors; either or both may be relevant to a particular situation.

Both confederations and route reflectors can be subject to persistent oscillation, unless specific design rules, affecting both BGP and the interior routing protocol, are followed ^[6].

However, these alternatives can introduce problems of their own, including the following:

route oscillation,
sub-optimal routing,
increase of BGP convergence time ^[7]

Additionally, route reflectors and BGP confederations were not designed to ease the configuration of BGP routers. Nevertheless, these are common tools for experienced BGP network architects. These tools may be combined, as, for example, a hierarchy of route reflectors.

Instability

The routing tables managed by a BGP implementation are adjusted continually to reflect actual changes in the network, such as links breaking and being restored or routers going down and coming back up. In the network as a whole it is normal for these changes to happen almost continuously, but for any particular router or link changes are supposed to be relatively infrequent. If a router is misconfigured or mismanaged then it may get into a rapid cycle between down and up states. This pattern of repeated withdrawal and reannouncement, known as route flapping, can cause excessive activity in all the other routers that know about the broken link, as the same route is continuously injected and withdrawn from the routing tables.

A feature known as route flap damping (RFC 2439) is built into many BGP implementations in an attempt to mitigate the effects of route flapping. Without damping, the excessive activity can cause a heavy processing load on routers, which may in turn delay updates on other routes, and so affect overall routing stability. With damping, the penalty accumulated by a flapping route decays exponentially. The first time a route becomes unavailable and quickly reappears, damping does not take effect, so as to maintain the normal fail-over times of BGP. At the second occurrence, BGP shuns that prefix for a certain length of time; subsequent occurrences are timed out for exponentially longer. After the abnormalities have ceased and a suitable length of time has passed, the offending route can be reinstated and its slate wiped clean. Damping can also mitigate denial of service attacks; damping timings are highly customizable.
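The damping behavior can be sketched as a per-route penalty that grows with each flap and halves after every half-life, in the spirit of RFC 2439. All numeric values below (penalty per flap, suppress and reuse thresholds, half-life) are illustrative placeholders chosen so that a second quick flap triggers suppression; they are not taken from the text or from any particular implementation, and the class name is hypothetical.

```python
class DampedRoute:
    """Track a flap penalty that decays exponentially over time (in seconds)."""

    def __init__(self, half_life=900.0, per_flap=1000.0,
                 suppress_at=1500.0, reuse_at=750.0):
        self.half_life = half_life
        self.per_flap = per_flap
        self.suppress_at = suppress_at
        self.reuse_at = reuse_at
        self.penalty = 0.0
        self.last_update = 0.0
        self.suppressed = False

    def _decay(self, now):
        """Halve the penalty for every half-life elapsed since the last update."""
        elapsed = now - self.last_update
        self.penalty *= 0.5 ** (elapsed / self.half_life)
        self.last_update = now

    def flap(self, now):
        """Record one withdrawal and reannouncement at time `now`."""
        self._decay(now)
        self.penalty += self.per_flap
        if self.penalty >= self.suppress_at:
            self.suppressed = True

    def usable(self, now):
        """Return True if the route may be used at time `now`."""
        self._decay(now)
        if self.suppressed and self.penalty < self.reuse_at:
            self.suppressed = False
        return not self.suppressed


if __name__ == "__main__":
    route = DampedRoute()
    route.flap(0)
    print("after first flap:", route.usable(1))
    route.flap(60)
    print("after second flap:", route.usable(61))
    print("much later:", route.usable(1400))
```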

However, subsequent research has shown that flap damping can actually lengthen convergence times in some cases, and can cause interruptions in connectivity even when links are not flapping.^[8]^[9] Moreover, as backbone links and router processors have become faster, some network architects have suggested that flap damping may not be as important as it used to be, since changes to the routing table can be absorbed much faster by routers. This has led the RIPE Route Working Group to write that "with the current implementations of BGP flap damping, the application of flap damping in ISP networks is NOT recommended. ... If flap damping is implemented, the ISP operating that network will cause side-effects to their customers and the Internet users of their customers' content and services ... . These side-effects would quite likely be worse than the impact caused by simply not running flap damping at all." [1] Improving stability without the problems of flap damping is the subject of current research.[2]

Routing table growth

One of the largest problems faced by BGP, and indeed the Internet infrastructure as a whole, comes from the growth of the Internet routing table. If the global routing table grows to the point where some older, less capable, routers cannot cope with the memory requirements or the CPU load of maintaining the table, these routers will cease to be effective gateways between the parts of the Internet they connect. In addition, and perhaps even more importantly, larger routing tables take longer to stabilize (see above) after a major connectivity change, leaving network service unreliable, or even unavailable, in the interim.

Until late 2001, the global routing table was growing exponentially, threatening an eventual widespread breakdown of connectivity. In an attempt to prevent this from happening, there was a cooperative effort by ISPs to keep the global routing table as small as possible, by using CIDR and route aggregation. While this slowed the growth of the routing table to a linear process for several years, with the expanded demand for multihoming by end user networks the growth was once again exponential by the middle of 2004. The global routing table hit 200,000 entries on or about October 13, 2006.

A network black hole is often used to improve aggregation of the BGP global routing table.^{[citation needed]} Consider an AS that has been allocated the address space 172.16.0.0/16, from which it has assigned the prefixes 172.16.0.0/18, 172.16.64.0/18, and 172.16.192.0/18. The AS can advertise the whole block, 172.16.0.0/16. This AS will still receive traffic sent to the "hole", 172.16.128.0/18, but will silently discard it.
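The "hole" in this example can be computed with Python's standard `ipaddress` module by subtracting the assigned prefixes from the allocated block. The helper name `unassigned_holes` is an illustrative choice.

```python
import ipaddress


def unassigned_holes(block, assigned):
    """Return the parts of `block` not covered by any prefix in `assigned`."""
    holes = [block]
    for used in assigned:
        remaining = []
        for hole in holes:
            if hole.subnet_of(used):
                continue  # this hole is fully consumed by the assignment
            if used.subnet_of(hole):
                remaining.extend(hole.address_exclude(used))
            else:
                remaining.append(hole)  # no overlap with this assignment
        holes = remaining
    return list(ipaddress.collapse_addresses(holes))


if __name__ == "__main__":
    allocated = ipaddress.ip_network("172.16.0.0/16")
    assigned = [
        ipaddress.ip_network(p)
        for p in ("172.16.0.0/18", "172.16.64.0/18", "172.16.192.0/18")
    ]
    for hole in unassigned_holes(allocated, assigned):
        print(hole)
```

Traffic for any address inside a returned prefix would reach the AS because of the aggregate announcement and then be discarded.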

Requirements of a router for use of BGP for Internet and backbone-of-backbones purposes

Routers, especially small ones intended for Small Office/Home Office (SOHO) use, may not include BGP software. Some SOHO routers simply are not capable of running BGP using BGP routing tables of any size. Other commercial routers may need a specific software executable image that contains BGP, or a license that enables it. Open source packages that run BGP include GateD, GNU Zebra, Quagga, and OpenBGPD. Devices marketed as Layer 3 switches are less likely to support BGP than devices marketed as routers, but high-end Layer 3 Switches usually can run BGP.

Products marketed as switches may or may not have a size limitation on BGP tables, such as 20,000 routes, far smaller than a full Internet table plus internal routes. These devices, however, may be perfectly reasonable and useful when used for BGP routing of some smaller part of the network, such as a confederation-AS representing one of several smaller enterprises that are linked, by a BGP backbone of backbones, or a small enterprise that announces routes to an ISP but only accepts a default route and perhaps a small number of aggregated routes.

A BGP router used only for a network with a single point of entry to the Internet may have a much smaller routing table size (and hence RAM and CPU requirement) than a multihomed network. Even simple multihoming can have modest routing table size. See RFC 4098 for vendor-independent performance parameters for single BGP router convergence in the control plane.

It is not a given that a router running BGP needs a large memory. The memory requirement depends on the amount of BGP information exchanged with other BGP speakers, and the way in which the particular router stores BGP information. Note that the router may have to keep more than one copy of a route, so that it can manage different policies for route advertising and acceptance to a specific neighboring AS. The term view is often used for these different policy relationships on a running router.

If one router implementation takes more memory per route than another implementation, this may be a legitimate design choice, trading processing speed against memory. A full BGP table from an external peer will have in excess of 222,000 routes as of June 2007. Large ISPs may add another 50% for internal and customer routes. Again depending on implementation, separate tables may be kept for each view of a different peer AS.

Open Source Implementations of BGP

6WINDGate, commercial embedded open-source routing modules from 6WIND including multi-core and network processors support.
Vyatta, a commercial open-source router / firewall.
Quagga, a fork of GNU Zebra for Unix-like systems.
GNU Zebra, a GPL routing suite supporting BGP4.
OpenBGPD, a BSD licensed implementation by the OpenBSD team.
XORP, the eXtensible Open Router Platform, a BSD licensed suite.
BIRD, a GPL routing package for Unix-like systems.

BGP simulators

BGPlay, a Java applet that presents a graphical visualization of BGP routes and updates for any real AS on the Internet
SSFnet, SSFnet network simulator includes a BGP implementation developed by BJ Premore
C-BGP, a BGP simulator able to perform large-scale simulation trying to model the ASes of the Internet or modelling ASes as large as Tier-1^[10].
BGP++, a patch integrating GNU Zebra software on ns-2 and GTNetS network simulators
ns-BGP, a BGP extension for ns-2 simulator based on the SSFnet implementation

References

^[1] Capabilities Advertisement with BGP-4, RFC 2842, R. Chandra & J. Scudder, May 2000
^[2] Multiprotocol Extensions for BGP-4, RFC 2858, T. Bates et al., June 2000
^[3] BGP/MPLS VPNs, RFC 2547, E. Rosen and Y. Rekhter, April 2004
^[4] BGP Route Reflection: An Alternative to Full Mesh Internal BGP (IBGP), RFC 4456, T. Bates et al., April 2006
^[5] Autonomous System Confederations for BGP, RFC 3065, P. Traina et al., February 2001
^[6] Border Gateway Protocol (BGP) Persistent Route Oscillation Condition, RFC 3345, D. McPherson et al., August 2002
^[7] Terminology for Benchmarking BGP Device Convergence in the Control Plane, RFC 4098, H. Berkowitz et al., June 2005
^[8] Route Flap Damping Exacerbates Internet Routing Convergence
^[9] Zhang, Beichuan; Pei Dan, Daniel Massey, Lixia Zhang (June 2005). Timer Interaction in Route Flap Damping. IEEE 25th International Conference on Distributed Computing Systems. Retrieved on 2006-09-26. “We show that the current damping design leads to the intended behavior only under persistent route flapping. When the number of flaps is small, the global routing dynamics deviates significantly from the expected behavior with a longer convergence delay.”
^[10] Modeling the routing of an Autonomous System with C-BGP

External links

LinkRank A tool for BGP routing visualization by University of California, Los Angeles
BGP Routing Resources (includes a dedicated section on BGP & ISP Core Security)
BGP table statistics
ASNumber Firefox Extension showing the AS number and additional information of the website currently open
RIPE Routing Information Service collecting over 550 IPv4 and IPv6 BGP feeds at 14 sites around the world
RIS Looking Glass into the Default Free Routing zone of the Internet
RISwhois providing IPv4/IPv6 Address to BGP AS Origin Mapping
RIS BGPlay BGP routing visualization tool by Università degli Studi Roma Tre
Linux Magazine: Demystifying BGP (Good, Detailed BGP explanation; requires registration)
Some important BGP RFCs
- RFC 4456, BGP Route Reflection - An Alternative to Full Mesh Internal BGP (IBGP)(obsoletes: RFC 2796)
- RFC 4278, Standards Maturity Variance Regarding the TCP MD5 Signature Option (RFC 2385) and the BGP-4 Specification
- RFC 4277, Experience with the BGP-4 Protocol
- RFC 4276, BGP-4 Implementation Report
- RFC 4275, BGP-4 MIB Implementation Survey
- RFC 4274, BGP-4 Protocol Analysis
- RFC 4273, Definitions of Managed Objects for BGP-4
- RFC 4272, BGP Security Vulnerabilities Analysis
- RFC 4271, A Border Gateway Protocol 4 (BGP-4) (obsoletes: RFC 1771)
- RFC 3392, Capabilities Advertisement with BGP-4
- RFC 3065, Autonomous System Confederations for BGP
- RFC 2918, Route Refresh Capability for BGP-4
- RFC 1772, Application of the Border Gateway Protocol in the Internet
Obsolete RFCs
- RFC 2796, Obsolete - BGP Route Reflection - An Alternative to Full Mesh IBGP
- RFC 1965, Obsolete - Autonomous System Confederations for BGP
- RFC 1771, Obsolete - A Border Gateway Protocol 4 (BGP-4)
- RFC 1657, Obsolete - Definitions of Managed Objects for the Fourth Version of the Border Gateway Protocol (BGP-4) using SMIv2
- RFC 1655, Obsolete - Application of the Border Gateway Protocol in the Internet
- RFC 1654, Obsolete - A Border Gateway Protocol 4 (BGP-4)
- RFC 1105, Obsolete - Border Gateway Protocol (BGP)

BGP Interactions at Router Startup Described as a Sequence Diagram (PDF)
