You are here

TCP reliable data transfer

19 January, 2016 - 11:23

The original TCP data transfer mechanisms were defined in RFC 793. Based on the experience of using TCP on the growing global Internet, this part of the TCP specification has been updated and improved several times, always while preserving the backward compatibility with older TCP implementations. In this section, we review the main data transfer mechanisms used by TCP.

TCP is a window-based transport protocol that provides a bi-directional byte stream service. This has several implications on the fields of the TCP header and the mechanisms used by TCP. The three fields of the TCP header are :

  • sequence number. TCP uses a 32 bits sequence number. The sequence number placed in the header of a TCP segment containing data is the sequence number of the first byte of the payload of the TCP segment.
  • acknowledgement number. TCP uses cumulative positive acknowledgements. Each TCP segment contains the sequence number of the next byte that the sender of the acknowledgement expects to receive from the remote host. In theory, the acknowledgement number is only valid if the ACK flag of the TCP header is set. In practice almost all 1 TCP segments have their ACK flag set.
  • window. a TCP receiver uses this 16 bits field to indicate the current size of its receive window expressed in bytes.


Note: The Transmission Control Block
For each established TCP connection, a TCP implementation must maintain a Transmission Control Block (TCB). A TCB contains all the information required to send and receive segments on this connection RFC 793. This includes 2:
  • the local IP address
  • the remote IP address
  • the local TCP port number
  • the remote TCP port number
  • the current state of the TCP FSM
  • the maximum segment size (MSS)
  • snd.nxt : the sequence number of the next byte in the byte stream (the first byte of a new data segment that you send uses this sequence number)
  • snd.una : the earliest sequence number that has been sent but has not yet been acknowledged
  • snd.wnd : the current size of the sending window (in bytes)
  • rcv.nxt : the sequence number of the next byte that is expected to be received from the remote host
  • rcv.wnd : the current size of the receive window advertised by the remote host
  • sending buffer : a buffer used to store all unacknowledged data
  • receiving buffer : a buffer to store all data received from the remote host that has not yet been delivered to the user. Data may be stored in the receiving buffer because either it was not received in sequence or because the user is too slow to process it

The original TCP specification can be categorised as a transport protocol that provides a byte stream service and uses go-back-n.

To send new data on an established connection, a TCP entity performs the following operations on the corresponding TCB. It first checks that the sending buffer does not contain more data than the receive window advertised by the remote host (rcv.wnd). If the window is not full, up to MSS bytes of data are placed in the payload of a TCP segment. The sequence number of this segment is the sequence number of the first byte of the payload. It is set to the first available sequence number : snd.nxt and snd.nxt is incremented by the length of the payload of the TCP segment. The acknowledgement number of this segment is set to the current value of rcv.nxt and the window field of the TCP segment is computed based on the current occupancy of the receiving buffer. The data is kept in the sending buffer in case it needs to be retransmitted later.

When a TCP segment with the ACK flag set is received, the following operations are performed. rcv.wnd is set to the value of the window field of the received segment. The acknowledgement number is compared to snd.una. The newly acknowledged data is remove from the sending buffer and snd.una is updated. If the TCP segment contained data, the sequence number is compared to rcv.nxt. If they are equal, the segment was received in sequence and the data can be delivered to the user and rcv.nxt is updated. The contents of the receiving buffer is checked to see whether other data already present in this buffer can be delivered in sequence to the user. If so, rcv.nxt is updated again. Otherwise, the segment’s payload is placed in the receiving buffer.

Segment transmission strategies

In a transport protocol such as TCP that offers a bytestream, a practical issue that was left as an implementation choice in RFC 793 is to decide when a new TCP segment containing data must be sent. There are two simple and extreme implementation choices. The first implementation choice is to send a TCP segment as soon as the user has requested the transmission of some data. This allows TCP to provide a low delay service. However, if the user is sending data one byte at a time, TCP would place each user byte in a segment containing 20 bytes of TCP header 3. This is a huge overhead that is not acceptable in wide area networks. A second simple solution would be to only transmit a new TCP segment once the user has produced MSS bytes of data. This solution reduces the overhead, but at the cost of a potentially very high delay.

An elegant solution to this problem was proposed by John Nagle in RFC 896. John Nagle observed that the overhead caused by the TCP header was a problem in wide area connections, but less in local area connections where the available bandwidth is usually higher. He proposed the following rules to decide to send a new data segment when a new data has been produced by the user or a new ack segment has been received

if rcv.wnd>= MSS and len(data) >= MSS :     send one MSS-sized segment else     if there are unacknowledged data:       place data in buffer until acknowledgement has been received else     send one TCP segment containing all buffered data 

The first rule ensures that a TCP connection used for bulk data transfer always sends full TCP segments. The second rule sends one partially filled TCP segment every round-trip-time.

This algorithm, called the Nagle algorithm, takes a few lines of code in all TCP implementations. These lines of code have a huge impact on the packets that are exchanged in TCP/IP networks. Researchers have analysed the distribution of the packet sizes by capturing and analysing all the packets passing through a given link. These studies have shown several important results :

  • in TCP/IPv4 networks, a large fraction of the packets are TCP segments that contain only an acknowledgement. These packets usually account for 40-50% of the packets passing through the studied link
  • in TCP/IPv4 networks, most of the bytes are exchanged in long packets, usually packets containing up to 1460 bytes of payload which is the default MSS for hosts attached to an Ethernet network, the most popular type of LAN

The figure below provides a distribution of the packet sizes measured on a link. It shows a three-modal distribution of the packet size. 50% of the packets contain pure TCP acknowledgements and occupy 40 bytes. About 20% of the packets contain about 500 bytes 4 of user data and 12% of the packets contain 1460 bytes of user data. However, most of the user data is transported in large packets. This packet size distribution has implications on the design of routers as we discuss in the next chapter.

media/image100.png
Figure 4.43 Packet size distribution in the Internet  

Recent measurements indicate that these packet size distributions are still valid in today’s Internet, although the packet distribution tends to become bimodal with small packets corresponding to TCP pure acks (40-64 bytes depending on the utilisation of TCP options) and large 1460-bytes packets carrying most of the user data.

TCP windows

From a performance point of view, one of the main limitations of the original TCP specification is the 16 bits window field in the TCP header. As this field indicates the current size of the receive window in bytes, it limits the TCP receive window at 65535 bytes. This limitation was not a severe problem when TCP was designed since at that time high-speed wide area networks offered a maximum bandwidth of 56 kbps. However, in today’s network, this limitation is not acceptable anymore. The table below provides the rough 5 maximum throughput that can be achieved by a TCP connection with a 64 KBytes window in function of the connection’s round-trip-time

RTT

Maximum Throughput

1 msec

10 msec

100 msec

500 msec

524 Mbps

52.4 Mbps

5.24 Mbps

1.05 Mbps

To solve this problem, a backward compatible extension that allows TCP to use larger receive windows was proposed in RFC 1323. Today, most TCP implementations support this option. The basic idea is that instead of storing snd.wnd and rcv.wnd as 16 bits integers in the TCB, they should be stored as 32 bits integers. As the TCP segment header only contains 16 bits to place the window field, it is impossible to copy the value of snd.wnd in each sent TCP segment. Instead the header contains snd.wnd >> S where S is the scaling factor ( 0 ≤ S ≤ 14) negotiated during connection establishment. The client adds its proposed scaling factor as a TCP option in the SYN segment. If the server supports RFC 1323, it places in the SYN+ACK segment the scaling factor that it uses when advertising its own receive window. The local and remote scaling factors are included in the TCB. If the server does not support RFC 1323, it ignores the received option and no scaling is applied.

By using the window scaling extensions defined in RFC 1323, TCP implementations can use a receive buffer of up to 1 GByte. With such a receive buffer, the maximum throughput that can be achieved by a single TCP connection becomes :

RTT

Maximum Throughput

1 msec

10 msec

100 msec

500 msec

8590 Gbps

859 Gbps

86 Gbps

17 Gbps

These throughputs are acceptable in today’s networks. However, there are already servers having 10 Gbps interfaces... Early TCP implementations had fixed receiving and sending buffers 6. Today’s high performance implementations are able to automatically adjust the size of the sending and receiving buffer to better support high bandwidth flows [SMM1998]

TCP’s retransmission timeout

In a go-back-n transport protocol such as TCP, the retransmission timeout must be correctly set in order to achieve good performance. If the retransmission timeout expires too early, then bandwidth is wasted by retransmitting segments that have already been correctly received; whereas if the retransmission timeout expires too late, then bandwidth is wasted because the sender is idle waiting for the expiration of its retransmission timeout.

A good setting of the retransmission timeout clearly depends on an accurate estimation of the round-trip-time of each TCP connection. The round-trip-time differs between TCP connections, but may also change during the lifetime of a single connection. For example, the figure below shows the evolution of the round-trip-time between two hosts during a period of 45 seconds.

media/image101.png
Figure 4.44 Evolution of the round-trip-time between two hosts  

The easiest solution to measure the round-trip-time on a TCP connection is to measure the delay between the transmission of a data segment and the reception of a corresponding acknowledgement 7 .As illustrated in the figure below, this measurement works well when there are no segment losses.

media/image102.png
Figure 4.45 How to measure the round-trip-time ?  

However, when a data segment is lost, as illustrated in the bottom part of the figure, the measurement is ambiguous as the sender cannot determine whether the received acknowledgement was triggered by the first transmission of segment 123 or its retransmission. Using incorrect round-trip-time estimations could lead to incorrect values of the retransmission timeout. For this reason, Phil Karn and Craig Partridge proposed, in [KP91], to ignore the round-trip-time measurements performed during retransmissions.

To avoid this ambiguity in the estimation of the round-trip-time when segments are retransmitted, recent TCP implementations rely on the timestamp option defined in RFC 1323. This option allows a TCP sender to place two 32 bit timestamps in each TCP segment that it sends. The first timestamp, TS Value (TSval) is chosen by the sender of the segment. It could for example be the current value of its real-time clock 8. The second value, TS Echo Reply (TSecr), is the last TSval that was received from the remote host and stored in the TCB. The figure below shows how the utilization of this timestamp option allows for the disambiguation of the round-trip-time measurement when there are retransmissions.

: media/image103.png
Figure 4.46 Disambiguating round-trip-time measurements with the RFC 1323 timestamp option  

Once the round-trip-time measurements have been collected for a given TCP connection, the TCP entity must compute the retransmission timeout. As the round-trip-time measurements may change during the lifetime of a connection, the retransmission timeout may also change. At the beginning of a connection 9, the TCP entity that sends a SYN segment does not know the round-trip-time to reach the remote host and the initial retransmission timeout is usually set to 3 seconds RFC 2988.

The original TCP specification proposed in RFC 793 to include two additional variables in the TCB :

  • srtt : the smoothed round-trip-time computed as srrt =(α × srtt) + ((1 − α) × rtt) where rtt is the round-trip-time measured according to the above procedure and α a smoothing factor (e.g. 0.8 or 0.9)
  • rto : the retransmission timeout is computed as rto = min(60, max(1,β × srtt)) where β is used to take into account the delay variance (value : 1.3 to 2.0). The 60 and 1 constants are used to ensure that the rto is not larger than one minute nor smaller than 1 second.

However, in practice, this computation for the retransmission timeout did not work well. The main problem was that the computed rto did not correctly take into account the variations in the measured round-trip-time. Van Jacobson proposed in his seminal paper [Jacobson1988] an improved algorithm to compute the rto and implemented it in the BSD Unix distribution. This algorithm is now part of the TCP standard RFC 2988.

Jacobson’s algorithm uses two state variables, srtt the smoothed rtt and rttvar the estimation of the variance of the rtt and two parameters : α and β. When a TCP connection starts, the first rto is set to 3 seconds. When a first estimation of the rtt is available, the srtt, rttvar and rto are computed as

srtt=rttrttvar=rtt/2 rto=srtt+4*rttvar 

Then, when other rtt measurements are collected, srtt and rttvar are updated as follows :

rttvar = (1 − β) × rttvar + β ×|srtt − rtt| srtt = (1 − α) × srtt + α × rtt rto = srtt +4 × rttvar 

The proposed values for the parameters are α =1 /8 and β =1/4. This allows a TCP implementation, implemented in the kernel, to perform the rtt computation by using shift operations instead of the more costly floating point operations [Jacobson1988]. The figure below illustrates the computation of the rto upon rtt changes.

: media/image104.png
Figure 4.47 Example computation of the rto  

Advanced retransmission strategies

The default go-back-n retransmission strategy was defined in RFC 793. When the retransmission timer expires, TCP retransmits the first unacknowledged segment (i.e. the one having sequence number snd.una). After each expiration of the retransmission timeout, RFC 2988 recommends to double the value of the retransmission time-out. This is called an exponential backoff. This doubling of the retransmission timeout after a retransmission was included in TCP to deal with issues such as network/receiver overload and incorrect initial estimations of the retransmission timeout. If the same segment is retransmitted several times, the retransmission timeout is doubled after every retransmission until it reaches a configured maximum. RFC 2988 suggests a maximum retransmission timeout of at least 60 seconds. Once the retransmission timeout reaches this configured maximum, the remote host is considered to be unreachable and the TCP connection is closed. This retransmission strategy has been refined based on the experience of using TCP on the Internet. The first refinement was a clarification of the strategy used to send acknowledgements. As TCP uses piggybacking, the easiest and less costly method to send acknowledgements is to place them in the data segments sent in the other direction. However, few application layer protocols exchange data in both directions at the same time and thus this method rarely works. For an application that is sending data segments in one direction only, the remote TCP entity returns empty TCP segments whose only useful information is their acknowledgement number. This may cause a large overhead in wide area network if a pure ACK segment is sent in response to each received data segment. Most TCP implementations use a delayed acknowledgement strategy. This strategy ensures that piggybacking is used whenever possible, otherwise pure ACK segments are sent for every second received data segments when there are no losses. When there are losses or reordering, ACK segments are more important for the sender and they are sent immediately RFC 813RFC 1122. This strategy relies on a new timer with a short delay (e.g. 50 milliseconds) and one additional flag in the TCB. It can be implemented as follows

reception of a data segment:     if pkt.seq==rcv.nxt: # segment received in sequence       if delayedack :         send pure ack segment         cancel acktimer         delayedack=False       else:       delayedack=True       start acktimer    else: # out of sequence segment       send pure ack segment       if delayedack:         delayedack=False         cancel acktimer transmission of a data segment: # piggyback ack     if delayedack:        delayedack=False cancel acktimer acktimer expiration: send pure ack segment delayedack=False 

Due to this delayed acknowledgement strategy, during a bulk transfer, a TCP implementation usually acknowledges every second TCP segment received.

The default go-back-n retransmission strategy used by TCP has the advantage of being simple to implement, in particular on the receiver side, but when there are losses, a go-back-n strategy provides a lower performance than a selective repeat strategy. The TCP developers have designed several extensions to TCP to allow it to use a selective repeat strategy while maintaining backward compatibility with older TCP implementations. These TCP extensions assume that the receiver is able to buffer the segments that it receives out-of-sequence.

The first extension that was proposed is the fast retransmit heuristic. This extension can be implemented on TCP senders and thus does not require any change to the protocol. It only assumes that the TCP receiver is able to buffer out-of-sequence segments.

From a performance point of view, one issue with TCP’s retransmission timeout is that when there are isolated segment losses, the TCP sender often remains idle waiting for the expiration of its retransmission timeouts. Such isolated losses are frequent in the global Internet [Paxson99]. A heuristic to deal with isolated losses without waiting for the expiration of the retransmission timeout has been included in many TCP implementations since the early 1990s. To understand this heuristic, let us consider the figure below that shows the segments exchanged over a TCP connection when an isolated segment is lost.

As shown above, when an isolated segment is lost the sender receives several duplicate acknowledgements since the TCP receiver immediately sends a pure acknowledgement when it receives an out-of-sequence segment. A duplicate acknowledgement is an acknowledgement that contains the same acknowledgement number as a previous segment. A single duplicate acknowledgement does not necessarily imply that a segment was lost, as a simple reordering of the segments may cause duplicate acknowledgements as well. Measurements [Paxson99] have shown that segment reordering is frequent in the Internet. Based on these observations, the fast retransmit heuristic has been included in most TCP implementations. It can be implemented as follows

media/image105.png
Figure 4.48 Detecting isolated segment losses  
ack arrival:     if tcp.ack==snd.una: # duplicate acknowledgement       dupacks++       if dupacks==3:         retransmit segment(snd.una)     else:         dupacks=0        # process acknowledgement 

This heuristic requires an additional variable in the TCB (dupacks). Most implementations set the default number of duplicate acknowledgements that trigger a retransmission to 3. It is now part of the standard TCP specification RFC 2581. The fast retransmit heuristic improves the TCP performance provided that isolated segments are lost and the current window is large enough to allow the sender to send three duplicate acknowledgements.

The figure below illustrates the operation of the fast retransmit heuristic.

: media/image106.png
Figure 4.49 TCP fast retransmit heuristics  

When losses are not isolated or when the windows are small, the performance of the fast retransmit heuristic decreases. In such environments, it is necessary to allow a TCP sender to use a selective repeat strategy instead of the default go-back-n strategy. Implementing selective-repeat requires a change to the TCP protocol as the receiver needs to be able to inform the sender of the out-of-order segments that it has already received. This can be done by using the Selective Acknowledgements (SACK) option defined in RFC 2018. This TCP option is negotiated during the establishment of a TCP connection. If both TCP hosts support the option, SACK blocks can be attached by the receiver to the segments that it sends. SACK blocks allow a TCP receiver to indicate the blocks of data that it has received correctly but out of sequence. The figure below illustrates the utilisation of the SACK blocks.

media/image107.png
Figure 4.50 TCP selective acknowledgements  

An SACK option contains one or more blocks. A block corresponds to all the sequence numbers between the left edge and the right edge of the block. The two edges of the block are encoded as 32 bit numbers (the same size as the TCP sequence number) in an SACK option. As the SACK option contains one byte to encode its type and one byte for its length, a SACK option containing b blocks is encoded as a sequence of 2+8 × b bytes. In practice, the size of the SACK option can be problematic as the optional TCP header extension cannot be longer than 44 bytes. As the SACK option is usually combined with the RFC 1323 timestamp extension, this implies that a TCP segment cannot usually contain more than three SACK blocks. This limitation implies that a TCP receiver cannot always place in the SACK option that it sends, information about all the received blocks.

To deal with the limited size of the SACK option, a TCP receiver currently having more than 3 blocks inside its receiving buffer must select the blocks to place in the SACK option. A good heuristic is to put in the SACK option the blocks that have most recently changed, as the sender is likely to be already aware of the older blocks.

When a sender receives an SACK option indicating a new block and thus a new possible segment loss, it usually does not retransmit the missing segment(s immediately. To deal with reordering, a TCP sender can use a heuristic similar to fast retransmit by retransmitting a gap only once it has received three SACK options indicating this gap. It should be noted that the SACK option does not supersede the acknowledgement number of the TCP header. A TCP sender can only remove data from its sending buffer once they have been acknowledged by TCP’s cumulative acknowledgements. This design was chosen for two reasons. First, it allows the receiver to discard parts of its receiving buffer when it is running out of memory without loosing data. Second, as the SACK option is not transmitted reliably, the cumulative acknowledgements are still required to deal with losses of ACK segments carrying only SACK information. Thus, the SACK option only serves as a hint to allow the sender to optimise its retransmissions.

TCP congestion control

In the previous sections, we have explained the mechanisms that TCP uses to deal with transmission errors and segment losses. In a heterogeneous network such as the Internet or enterprise IP networks, endsystems have very different levels of performance. Some endsystems are high-end servers attached to 10 Gbps links while others are mobile devices attached to a very low bandwidth wireless link. Despite these huge differences in performance, a mobile device should be able to efficiently exchange segments with a high-end server.

To understand this problem better, let us consider the scenario shown in the figure below, where a server (A) attached to a 10 Mbps link is sending TCP segments to another computer (C) through a path that contains a 2 Mbps link.

: media/image108.png
Figure 4.51 TCP over heterogeneous links  

In this network, the TCP segments sent by the server reach router R1. R1 forwards the segments towards router R2. Router R2 can potentially receive segments at 10 Mbps, but it can only forward them at 2 Mbps to router R2 and then to host C. Router R2 contains buffers that allow it to store the packets that cannot immediately be forwarded to their destination. To understand the operation of TCP in this environment, let us consider a simplified model of this network where host A is attached to a 10 Mbps link to a queue that represents the buffers of router R2. This queue is emptied at a rate of 2 Mbps.

media/image109.png
Figure 4.52 TCP self clocking  

Let us consider that host A uses a window of three segments. It thus sends three back-to-back segments at 10 Mbps and then waits for an acknowledgement. Host A stops sending segments when its window is full. These segments reach the buffers of router R2. The first segment stored in this buffer is sent by router R2 at a rate of 2 Mbps to the destination host. Upon reception of this segment, the destination sends an acknowledgement. This acknowledgement allows host A to transmit a new segment. This segment is stored in the buffers of router R2 while it is transmitting the second segment that was sent by host A... Thus, after the transmission of the first window of segments, TCP sends one data segment after the reception of each acknowledgement returned by the destination 10. In practice, the acknowledgements sent by the destination serve as a kind of clock that allows the sending host to adapt its transmission rate to the rate at which segments are received by the destination. This TCP self-clocking is the first mechanism that allows TCP to adapt to heterogeneous networks [Jacobson1988]. It depends on the availability of buffers to store the segments that have been sent by the sender but have not yet been transmitted to the destination.

However, TCP is not always used in this environment. In the global Internet, TCP is used in networks where a large number of hosts send segments to a large number of receivers. For example, let us consider the network depicted below which is similar to the one discussed in [Jacobson1988] and RFC 896. In this network, we assume that the buffers of the router are infinite to ensure that no packet is lost.

media/image110.png
Figure 4.53 The congestion collapse problem  

If many TCP senders are attached to the left part of the network above, they all send a window full of segments. These segments are stored in the buffers of the router before being transmitted towards their destination. If there are many senders on the left part of the network, the occupancy of the buffers quickly grows. A consequence of the buffer occupancy is that the round-trip-time, measured by TCP, between the sender and the receiver increases. Consider a network where 10,000 bits segments are sent. When the buffer is empty, such a segment requires 1 millisecond to be transmitted on the 10 Mbps link and 5 milliseconds to be the transmitted on the 2 Mbps link. Thus, the round-trip-time measured by TCP is roughly 6 milliseconds if we ignore the propagation delay on the links. Most routers manage their buffers as a FIFO queue 11. If the buffer contains 100 segments, the round-triptime becomes 1 + 100 × 5+5 milliseconds as new segments are only transmitted on the 2 Mbps link once all previous segments have been transmitted. Unfortunately, TCP uses a retransmission timer and performs go-back-n to recover from transmission errors. If the buffer occupancy is high, TCP assumes that some segments have been lost and retransmits a full window of segments. This increases the occupancy of the buffer and the delay through the buffer... Furthermore, the buffer may store and send on the low bandwidth links several retransmissions of the same segment. This problem is called congestion collapse. It occurred several times in the late 1980s. For example, [Jacobson1988] notes that in 1986, the usable bandwidth of a 32 Kbits link dropped to 40 bits per second due to congestion collapse 12!

The congestion collapse is a problem that all heterogeneous networks face. Different mechanisms have been proposed in the scientific literature to avoid or control network congestion. Some of them have been implemented and deployed in real networks. To understand this problem in more detail, let us first consider a simple network with two hosts attached to a high bandwidth link that are sending segments to destination C attached to a low bandwidth link as depicted below.

: media/image111.png
Figure 4.54 The congestion problem  

To avoid congestion collapse, the hosts must regulate their transmission rate 13 by using a congestion control mechanism. Such a mechanism can be implemented in the transport layer or in the network layer. In TCP/IP networks, it is implemented in the transport layer, but other technologies such as Asynchronous Transfer Mode (ATM) or Frame Relay include congestion control mechanisms in lower layers.

Let us first consider the simple problem of a set of i hosts that share a single bottleneck link as shown in the example above. In this network, the congestion control scheme must achieve the following objectives [CJ1989] :

  1. The congestion control scheme must avoid congestion. In practice, this means that the bottleneck link cannot be overloaded. If ri(t) is the transmission rate allocated to host i at time t and R the bandwidth of the bottleneck link, then the congestion control scheme should ensure that, on average,

media/image112.png

    2. The congestion control scheme must be efficient. The bottleneck link is usually both a shared and an expensive resource. Usually, bottleneck links are wide area links that are much more expensive to upgrade than the local area networks. The congestion control scheme should ensure that such links are efficiently used. Mathematically, the control scheme should ensure that

media/image112.png

    3. The congestion control scheme should be fair. Most congestion schemes aim at achieving maxmin fairness. An allocation of transmission rates to sources is said to be max-min fair if :
  • no link in the network is congested
  • the rate allocated to source j cannot be increased without decreasing the rate allocated to a source i whose allocation is smaller than the rate allocated to source j [Leboudec2008] .

Depending on the network, a max-min fair allocation may not always exist. In practice, max-min fairness is an ideal objective that cannot necessarily be achieved. When there is a single bottleneck link as in the example above, max-min fairness implies that each source should be allocated the same transmission rate.

To visualise the different rate allocations, it is useful to consider the graph shown below. In this graph, we plot on the x-axis (resp. y-axis) the rate allocated to host B (resp. A). A point in the graph (rB,rA) corresponds to a possible allocation of the transmission rates. Since there is a 2 Mbps bottleneck link in this network, the graph can be divided into two regions. The lower left part of the graph contains all allocations (rB,rA) such that the bottleneck link is not congested (rA + rB < 2). The right border of this region is the efficiency line, i.e. the set of allocations that completely utilise the bottleneck link (rA + rB =2). Finally, the fairness line is the set of fair allocations.

: media/image113.png
Figure 4.55 Possible allocated transmission rates  

As shown in the graph above, a rate allocation may be fair but not efficient (e.g. rA =0.7,rB =0.7), fair and efficient ( e.g. rA =1,rB =1) or efficient but not fair (e.g. rA =1.5, rB =0.5). Ideally, the allocation should be both fair and efficient. Unfortunately, maintaining such an allocation with fluctuations in the number of flows that use the network is a challenging problem. Furthermore, there might be several thousands of TCP connections or more that pass through the same link 14.

To deal with these fluctuations in demand, which result in fluctuations in the available bandwidth, computer networks use a congestion control scheme. This congestion control scheme should achieve the three objectives listed above. Some congestion control schemes rely on a close cooperation between the endhosts and the routers, while others are mainly implemented on the endhosts with limited support from the routers.

A congestion control scheme can be modelled as an algorithm that adapts the transmission rate (ri(t)) of host i based on the feedback received from the network. Different types of feedbacks are possible. The simplest scheme is a binary feedback [CJ1989] [Jacobson1988] where the hosts simply learn whether the network is congested or not. Some congestion control schemes allow the network to regularly send an allocated transmission rate in Mbps to each host [BF1995].

Let us focus on the binary feedback scheme which is the most widely used today. Intuitively, the congestion control scheme should decrease the transmission rate of a host when congestion has been detected in the network, in order to avoid congestion collapse. Furthermore, the hosts should increase their transmission rate when the network is not congested. Otherwise, the hosts would not be able to efficiently utilise the network. The rate allocated to each host fluctuates with time, depending on the feedback received from the network. The figure below illustrates the evolution of the transmission rates allocated to two hosts in our simple network. Initially, two hosts have a low allocation, but this is not efficient. The allocations increase until the network becomes congested. At this point, the hosts decrease their transmission rate to avoid congestion collapse. If the congestion control scheme works well, after some time the allocations should become both fair and efficient.

media/image114.png
Figure 4.56 Evolution of the transmission rates  

Various types of rate adaption algorithms are possible. Dah Ming Chiu and Raj Jain have analysed, in [CJ1989], different types of algorithms that can be used by a source to adapt its transmission rate to the feedback received from the network. Intuitively, such a rate adaptation algorithm increases the transmission rate when the network is not congested (ensure that the network is efficiently used) and decrease the transmission rate when the network is congested (to avoid congestion collapse).

The simplest form of feedback that the network can send to a source is a binary feedback (the network is congested or not congested). In this case, a linear rate adaptation algorithm can be expressed as :

  • rate(t +1) = αC + βC rate(t) when the network is congested
  • rate(t +1) = αN + βN rate(t) when the network is not congested

With a linear adaption algorithm, αC ,αN ,βC and βN are constants. The analysis of [CJ1989] shows that to be fair and efficient, such a binary rate adaption mechanism must rely on Additive Increase and Multiplicative Decrease. When the network is not congested, the hosts should slowly increase their transmission rate (βN = 1 and αN > 0). When the network is congested, the hosts must multiplicatively decrease their transmission rate (βC < 1 and αC =0). Such an AIMD rate adaptation algorithm can be implemented by the pseudo-code below

# Additive Increase Multiplicative Decrease if congestion :     rate=rate*betaC # multiplicative decrease, betaC<1 
else     rate=rate+alphaN # additive increase, v0>0 


Note: Which binary feedback ?
Two types of binary feedback are possible in computer networks. A first solution is to rely on implicit feedback. This is the solution chosen for TCP. TCP’s congestion control scheme [Jacobson1988] does not require any cooperation from the router. It only assumes that they use buffers and that they discard packets when there is congestion.
TCP uses the segment losses as an indication of congestion. When there are no losses, the network is assumed to be not congested. This implies that congestion is the main cause of packet losses. This is true in wired networks, but unfortunately not always true in wireless networks. Another solution is to rely on explicit feedback. This is the solution proposed in the DECBit congestion control scheme [RJ1995] and used in Frame Relay and ATM networks. This explicit feedback can be implemented in two ways. A first solution would be to define a special message that could be sent by routers to hosts when they are congested. Unfortunately, generating such messages may increase the amount of congestion in the network. Such a congestion indication packet is thus discouraged RFC 1812. A better approach is to allow the intermediate routers to indicate, in the packets that they forward, their current congestion status. Binary feedback can be encoded by using one bit in the packet header. With such a scheme, congested routers set a special bit in the packets that they forward while non-congested routers leave this bit unmodified. The destination host returns the congestion status of the network in the acknowledgements that it sends. Details about such a solution in IP networks may be found in RFC 3168. Unfortunately, as of this writing, this solution is still not deployed despite its potential benefits.

The TCP congestion control scheme was initially proposed by Van Jacobson in [Jacobson1988]. The current specification may be found in RFC 5681. TCP relies on Additive Increase and Multiplicative Decrease (AIMD). To implement AIMD, a TCP host must be able to control its transmission rate. A first approach would be to use timers and adjust their expiration times in function of the rate imposed by AIMD. Unfortunately, maintaining such timers for a large number of TCP connections can be difficult. Instead, Van Jacobsonnoted that the rate of TCP congestion can be artificially controlled by constraining its sending window. A TCP connection cannot send data faster than window/rtt where window is the maximum between the host’s sending window and the window advertised by the receiver.

TCP’s congestion control scheme is based on a congestion window. The current value of the congestion window (cwnd) is stored in the TCB of each TCP connection and the window that can be used by the sender is constrained by min(cwnd, rwin, swin) where swin is the current sending window and rwin the last received receive window. The Additive Increase part of the TCP congestion control increments the congestion window by MSS bytes every round-trip-time. In the TCP literature, this phase is often called the congestion avoidance phase. The Multiplicative Decrease part of the TCP congestion control divides the current value of the congestion window once congestion has been detected.

When a TCP connection begins, the sending host does not know whether the part of the network that it uses to reach the destination is congested or not. To avoid causing too much congestion, it must start with a small congestion window. [Jacobson1988] recommends an initial window of MSS bytes. As the additive increase part of the TCP congestion control scheme increments the congestion window by MSS bytes every round-trip-time, the TCP connection may have to wait many round-trip-times before being able to efficiently use the available bandwidth. This is especially important in environments where the bandwidth × rtt product is high. To avoid waiting too many round-trip-times before reaching a congestion window that is large enough to efficiently utilise the network, the TCP congestion control scheme includes the slow-start algorithm. The objective of the TCP slow-start is to quickly reach an acceptable value for the cwnd. During slow-start, the congestion window is doubled every round-trip-time. The slow-start algorithm uses an additional variable in the TCB : sshtresh (slow-start threshold). The ssthresh is an estimation of the last value of the cwnd that did not cause congestion. It is initialised at the sending window and is updated after each congestion event.

In practice, a TCP implementation considers the network to be congested once its needs to retransmit a segment. The TCP congestion control scheme distinguishes between two types of congestion :

  • mild congestion. TCP considers that the network is lightly congested if it receives three duplicate acknowledgements and performs a fast retransmit. If the fast retransmit is successful, this implies that only one segment has been lost. In this case, TCP performs multiplicative decrease and the congestion window is divided by 2. The slow-start threshold is set to the new value of the congestion window.
  • severe congestion. TCP considers that the network is severely congested when its retransmission timer expires. In this case, TCP retransmits the first segment, sets the slow-start threshold to 50% of the congestion window. The congestion window is reset to its initial value and TCP performs a slow-start.

The figure below illustrates the evolution of the congestion window when there is severe congestion. At the beginning of the connection, the sender performs slow-start until the first segments are lost and the retransmission timer expires. At this time, the ssthresh is set to half of the current congestion window and the congestion window is reset at one segment. The lost segments are retransmitted as the sender again performs slow-start until the congestion window reaches the sshtresh. It then switches to congestion avoidance and the congestion window increases linearly until segments are lost and the retransmission timer expires ...

: media/image115.png
Figure 4.57 Evaluation of the TCP congestion window with severe congestion  

The figure below illustrates the evolution of the congestion window when the network is lightly congested and all lost segments can be retransmitted using fast retransmit. The sender begins with a slow-start. A segment is lost but successfully retransmitted by a fast retransmit. The congestion window is divided by 2 and the sender immediately enters congestion avoidance as this was a mild congestion.

media/image116.png
Figure 4.58 Evaluation of the TCP congestion window when the network is lightly congested  

Most TCP implementations update the congestion window when they receive an acknowledgement. If we assume that the receiver acknowledges each received segment and the sender only sends MSS sized segments, the TCP congestion control scheme can be implemented using the simplified pseudo-code 15 below

# Initialisation cwnd = MSS; ssthresh= swin; # Ack arrival if tcp.ack > snd.una : # new ack, no congestion     if cwnd < ssthresh : 
      # slow-start : increase quickly cwnd       # double cwnd every rtt cwnd = cwnd + MSS     else:       # congestion avoidance : increase slowly cwnd       # increase cwnd by one mss every rtt       cwnd = cwnd+ mss*(mss/cwnd) else: # duplicate or old ack     if tcp.ack==snd.una: # duplicate acknowledgement       dupacks++       if dupacks==3:         retransmitsegment(snd.una)         ssthresh=max(cwnd/2,2*MSS)         cwnd=ssthresh       else:         dupacks=0        # ack for old segment, ignored 
Expiration of the retransmission timer:   send(snd.una) # retransmit first lost segment   sshtresh=max(cwnd/2,2*MSS)   cwnd=MSS 

Furthermore when a TCP connection has been idle for more than its current retransmission timer, it should reset its congestion window to the congestion window size that it uses when the connection begins, as it no longer knows the current congestion state of the network.


Note: Initial congestion window
The original TCP congestion control mechanism proposed in [Jacobson1988] recommended that each TCP connection should begin by setting cwnd = MSS. However, in today’s higher bandwidth networks, using such a small initial congestion window severely affects the performance for short TCP connections, such as those used by web servers. Since the publication of RFC 3390, TCP hosts are allowed to use an initial congestion window of about 4 KBytes, which corresponds to 3 segments in many environments.

Thanks to its congestion control scheme, TCP adapts its transmission rate to the losses that occur in the network. Intuitively, the TCP transmission rate decreases when the percentage of losses increases. Researchers have proposed detailed models that allow the prediction of the throughput of a TCP connection when losses occur [MSMO1997] . To have some intuition about the factors that affect the performance of TCP, let us consider a very simple model. Its assumptions are not completely realistic, but it gives us good intuition without requiring complex mathematics.

This model considers a hypothetical TCP connection that suffers from equally spaced segment losses. If 1/p is the segment loss ratio, then the TCP connection successfully transfers 1 − 1 segments and the next segment is lo st. If we ignore the slow-start at the beginning of the connection, TCP in this environment is always in congestion avoidance as there are only isolated losses that can be recovered by using fast retransmit. The evolution of the congestion window is thus as shown in the figure below. Note the that x-axis of this figure represents time measured in units of one round-trip-time, which is supposed to be constant in the model, and the y-axis represents the size of the congestion window measured in MSS-sized segments.

: media/image117.png
Figure 4.59 Evolution of the congestion window with regular losses  

As the losses are equally spaced, the congestion window always starts at some value (W/2) and is incremented by one MSS every round-trip-time until it reaches twice this value (W). At this point, a segment is retransmitted and the cycle starts again. If the congestion window is measured in MSS-sized segments, a cycle lasts W/2 times. The bandwidth of the TCP connection is the number of bytes that have been transmitted during a given period of time. During a cycle, the number of segments that are sent on the TCP connection is equal to the area of the yellow trapeze in the figure. Its area is thus :

media/image118.png

However, given the regular losses that we consider, the number of segments that are sent between two losses (i.e. during a cycle) is by definition equal to 1/p. Thus,

media/image119.png

. The throughput (in bytes per second) of the TCP connection is equal to the number of segments transmitted divided by the duration of the cycle :

media/image120.png or after having eliminated W,

media/image121.png

More detailed models and the analysis of simulations have shown that a first order model of the TCP throughput when losses occur was

media/image122.png

. This is an important result which shows that :

  • TCP connections with a small round-trip-time can achieve a higher throughput than TCP connections having a longer round-trip-time when losses occur. This implies that the TCP congestion control scheme is not completely fair since it favors the connections that have the shorter round-trip-time
  • TCP connections that use a large MSS can achieve a higher throughput that the TCP connections that use a shorter MSS. This creates another source of unfairness between TCP connections. However, it should be noted that today most hosts are using almost the same MSS, roughly 1460 bytes.

In general, the maximum throughput that can be achieved by a TCP connection depends on its maximum window size and the round-trip-time if there are no losses. If there are losses, it depends on the MSS, the round-trip-time and the loss ratio.

media/image123.png


Note: The TCP congestion control zoo
The first TCP congestion control scheme was proposed by Van Jacobson in [Jacobson1988]. In addition to writing the scientific paper, Van Jacobson also implemented the slow-start and congestion avoidance schemes in release 4.3 Tahoe of the BSD Unix distributed by the University of Berkeley. Later, he improved the congestion control by adding the fast retransmit and the fast recovery mechanisms in the Reno release of 4.3 BSD Unix. Since then, many researchers have proposed, simulated and implemented modifications to the TCP congestion control scheme. Some of these modifications are still used today, e.g. :
  • NewReno ( RFC 3782), which was proposed as an improvement of the fast recovery mechanism in the Reno implementation
  • TCP Vegas, which uses changes in the round-trip-time to estimate congestion in order to avoid it [BOP1994]
  • CUBIC, which was designed for high bandwidth links and is the default congestion control scheme in the Linux 2.6.19 kernel [HRX2008]
  • Compound TCP, which was designed for high bandwidth links is the default congestion control scheme in several Microsoft operating systems [STBT2009]

A search of the scientific literature will probably reveal more than 100 different variants of the TCP congestion control scheme. Most of them have only been evaluated by simulations. However, the TCP implementation in the recent Linux kernels supports several congestion control schemes and new ones can be easily added. We can expect that new TCP congestion control schemes will always continue to appear.