6.5 The Internet Transport Protocols: TCP
UDP is a simple protocol and it has
some niche uses, such as client-server interactions and multimedia, but for
most Internet applications, reliable, sequenced delivery is needed. UDP cannot
provide this, so another protocol is required. It is called TCP and is the main
workhorse of the Internet. Let us now study it in detail.
6.5.1 Introduction to TCP

TCP (Transmission Control Protocol)
was specifically designed to provide a reliable end-to-end byte stream over an
unreliable internetwork. An internetwork differs from a single network because
different parts may have wildly different topologies, bandwidths, delays,
packet sizes, and other parameters. TCP was designed to dynamically adapt to
properties of the internetwork and to be robust in the face of many kinds of
failures.
TCP was formally defined in RFC 793.
As time went on, various errors and inconsistencies were detected, and the
requirements were changed in some areas. These clarifications and some bug
fixes are detailed in RFC 1122. Extensions are given in RFC 1323.
Each machine supporting TCP has a
TCP transport entity, either a library procedure, a user process, or part of
the kernel. In all cases, it manages TCP streams and interfaces to the IP
layer. A TCP entity accepts user data streams from local processes, breaks them
up into pieces not exceeding 64 KB (in practice, often 1460 data bytes in order
to fit in a single Ethernet frame with the IP and TCP headers), and sends each
piece as a separate IP datagram. When datagrams containing TCP data arrive at a
machine, they are given to the TCP entity, which reconstructs the original byte
streams. For simplicity, we will sometimes use just ''TCP'' to mean the TCP
transport entity (a piece of software) or the TCP protocol (a set of rules).
From the context it will be clear which is meant. For example, in ''The user
gives TCP the data,'' the TCP transport entity is clearly intended.
The IP layer gives no guarantee that
datagrams will be delivered properly, so it is up to TCP to time out and
retransmit them as need be. Datagrams that do arrive may well do so in the
wrong order; it is also up to TCP to reassemble them into messages in the proper
sequence. In short, TCP must furnish the reliability that most users want and
that IP does not provide.
6.5.2 The TCP Service Model

TCP service is obtained by both the
sender and receiver creating end points, called sockets, as discussed in Sec. 6.1.3. Each socket has a socket number
(address) consisting of the IP address of the host and a 16-bit number local to
that host, called a port. A port is the TCP name for a TSAP. For TCP service to
be obtained, a connection must be explicitly established between a socket on
the sending machine and a socket on the receiving machine. The socket calls are
listed in Fig. 6-5.
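To make this concrete, here is a minimal sketch using Python's socket API, with one side doing a passive open (the LISTEN and ACCEPT primitives) and the other an active open (CONNECT). The loopback address, port 5000, and the message are arbitrary choices for illustration:

    import socket
    import threading

    # Passive open: bind a socket to an (IP address, port) pair and listen.
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 5000))        # port 5000 is an arbitrary example
    srv.listen(1)

    def serve():
        conn, addr = srv.accept()        # blocks until a client connects
        print("connection from", addr, "got:", conn.recv(4096))
        conn.close()

    t = threading.Thread(target=serve)
    t.start()

    # Active open: the client connects to the server's socket, which is
    # named by its (IP address, port) pair.
    cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    cli.connect(("127.0.0.1", 5000))
    cli.sendall(b"hello over TCP")
    cli.close()
    t.join()
    srv.close()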
A socket may be used for multiple
connections at the same time. In other words, two or more connections may
terminate at the same socket. Connections are identified by the socket
identifiers at both ends, that is, (socket1, socket2). No virtual circuit
numbers or other identifiers are used.
Port numbers below 1024 are called well-known
ports and are reserved for standard services. For example, any process wishing
to establish a connection to a host to transfer a file using FTP can connect to
the destination host's port 21 to contact its FTP daemon. The list of well-known
ports is given at www.iana.org. Over 300 have been assigned. A few
of the better known ones are listed in Fig. 6-27.
It would certainly be possible to
have the FTP daemon attach itself to port 21 at boot time, the telnet daemon to
attach itself to port 23 at boot time, and so on. However, doing so would
clutter up memory with daemons that were idle most of the time. Instead, what
is generally done is to have a single daemon, called inetd (Internet daemon) in
UNIX, attach itself to multiple ports and wait for the first incoming connection.
When that occurs, inetd forks off a new process and executes the appropriate
daemon in it, letting that daemon handle the request. In this way, the daemons
other than inetd are only active when there is work for them to do. Inetd
learns which ports it is to use from a configuration file. Consequently, the
system administrator can set up the system to have permanent daemons on the
busiest ports (e.g., port 80) and inetd on the rest.
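The dispatching idea can be sketched in a few lines of Python. This sketch is UNIX-only (it forks), and the port numbers and program paths in it are invented stand-ins; a real inetd reads the well-known ports and daemon paths from its configuration file:

    import os
    import select
    import socket

    # Hypothetical port-to-program table (a real one would come from the
    # configuration file and use ports such as 21 and 23).
    SERVICES = {2100: "/usr/sbin/in.ftpd", 2300: "/usr/sbin/in.telnetd"}

    listeners = []
    for port in SERVICES:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.bind(("", port))
        s.listen(5)
        listeners.append(s)

    while True:
        # One process waits on all the ports at once; no service daemon
        # exists until a connection actually arrives.
        ready, _, _ = select.select(listeners, [], [])
        for s in ready:
            conn, addr = s.accept()
            if os.fork() == 0:                # child: become the daemon
                os.dup2(conn.fileno(), 0)     # the connection becomes stdin
                os.dup2(conn.fileno(), 1)     # ... and stdout
                prog = SERVICES[s.getsockname()[1]]
                os.execv(prog, [prog])        # run the real service
            conn.close()                      # parent goes back to waiting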
All TCP connections are full duplex
and point-to-point. Full duplex means that traffic can go in both directions at
the same time. Point-to-point means that each connection has exactly two end
points. TCP does not support multicasting or broadcasting.
A TCP connection is a byte stream,
not a message stream. Message boundaries are not preserved end to end. For
example, if the sending process does four 512-byte writes to a TCP stream,
these data may be delivered to the receiving process as four 512-byte chunks,
two 1024-byte chunks, one 2048-byte chunk (see Fig. 6-28), or some other way. There is no way
for the receiver to detect the unit(s) in which the data were written.
Figure 6-28. (a) Four 512-byte segments sent as separate IP
datagrams. (b) The 2048 bytes of data delivered to the application in a single
READ call.
Files in UNIX have this property
too. The reader of a file cannot tell whether the file was written a block at a
time, a byte at a time, or all in one blow. As with a UNIX file, the TCP
software has no idea of what the bytes mean and no interest in finding out. A
byte is just a byte.
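The effect is easy to demonstrate. In the following Python sketch (run over the loopback interface), the sender makes four 512-byte writes, but the receiver's reads may return the data in chunks of any size, depending on timing and buffering:

    import socket
    import threading

    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))           # port 0: let the OS pick a port
    srv.listen(1)
    port = srv.getsockname()[1]

    def sender():
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.connect(("127.0.0.1", port))
        for _ in range(4):
            s.sendall(b"x" * 512)        # four separate 512-byte writes
        s.close()

    threading.Thread(target=sender).start()

    conn, _ = srv.accept()
    total = 0
    while total < 2048:
        chunk = conn.recv(4096)          # chunk sizes are unpredictable:
        total += len(chunk)              # 512, 1024, or 2048 are all legal
        print("read", len(chunk), "bytes")
    conn.close()
    srv.close()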
When an application passes data to
TCP, TCP may send it immediately or buffer it (in order to collect a larger
amount to send at once), at its discretion. However, sometimes, the application
really wants the data to be sent immediately. For example, suppose a user is
logged in to a remote machine. After a command line has been finished and the
carriage return typed, it is essential that the line be shipped off to the
remote machine immediately and not buffered until the next line comes in. To
force data out, applications can use the PUSH flag, which tells TCP not to
delay the transmission.
Some early applications used the
PUSH flag as a kind of marker to delineate message boundaries. While this
trick sometimes works, it sometimes fails since not all implementations of TCP
pass the PUSH flag to the application on the receiving side. Furthermore, if additional
PUSHes come in before the first one has been transmitted (e.g., because the
output line is busy), TCP is free to collect all the PUSHed data into a single
IP datagram, with no separation between the various pieces.
One last feature of the TCP service
that is worth mentioning here is urgent data. When an interactive user hits the
DEL or CTRL-C key to break off a remote computation that has already begun, the
sending application puts some control information in the data stream and gives
it to TCP along with the URGENT flag. This event causes TCP to stop
accumulating data and transmit everything it has for that connection
immediately.
When the urgent data are received at
the destination, the receiving application is interrupted (e.g., given a signal
in UNIX terms) so it can stop whatever it was doing and read the data stream to
find the urgent data. The end of the urgent data is marked so the application
knows when it is over. The start of the urgent data is not marked. It is up to
the application to figure that out. This scheme basically provides a crude
signaling mechanism and leaves everything else up to the application.
6.5.3 The TCP Protocol

In this section we will give a
general overview of the TCP protocol. In the next one we will go over the
protocol header, field by field.
A key feature of TCP, and one which
dominates the protocol design, is that every byte on a TCP connection has its
own 32-bit sequence number. When the Internet began, the lines between routers
were mostly 56-kbps leased lines, so a host blasting away at full speed took
over 1 week to cycle through the sequence numbers. At modern network speeds,
the sequence numbers can be consumed at an alarming rate, as we will see later.
Separate 32-bit sequence numbers are used for acknowledgements and for the
window mechanism, as discussed below.
The sending and receiving TCP
entities exchange data in the form of segments. A TCP segment consists of a
fixed 20-byte header (plus an optional part) followed by zero or more data
bytes. The TCP software decides how big segments should be. It can accumulate
data from several writes into one segment or can split data from one write over
multiple segments. Two limits restrict the segment size. First, each segment,
including the TCP header, must fit in the 65,515-byte IP payload. Second, each
network has a maximum transfer unit, or MTU, and each segment must fit in the
MTU. In practice, the MTU is generally 1500 bytes (the Ethernet payload size)
and thus defines the upper bound on segment size.
The basic protocol used by TCP
entities is the sliding window protocol. When a sender transmits a segment, it
also starts a timer. When the segment arrives at the destination, the receiving
TCP entity sends back a segment (with data if any exist, otherwise without data)
bearing an acknowledgement number equal to the next sequence number it expects
to receive. If the sender's timer goes off before the acknowledgement is
received, the sender transmits the segment again.
Although this protocol sounds
simple, there are a number of sometimes subtle ins and outs, which we will
cover below. Segments can arrive out of order, so bytes 3072–4095 can arrive
but cannot be acknowledged because bytes 2048–3071 have not turned up yet.
Segments can also be delayed so long in transit that the sender times out and
retransmits them. The retransmissions may include different byte ranges than
the original transmission, requiring a careful administration to keep track of
which bytes have been correctly received so far. However, since each byte in
the stream has its own unique offset, it can be done.
TCP must be prepared to deal with
these problems and solve them in an efficient way. A considerable amount of
effort has gone into optimizing the performance of TCP streams, even in the
face of network problems. A number of the algorithms used by many TCP
implementations will be discussed below.
6.5.4 The TCP Segment Header

Figure 6-29 shows the layout of a TCP segment.
Every segment begins with a fixed-format, 20-byte header. The fixed header may
be followed by header options. After the options, if any, up to 65,535 - 20 -
20 = 65,495 data bytes may follow, where the first 20 refer to the IP header
and the second to the TCP header. Segments without any data are legal and are
commonly used for acknowledgements and control messages.
Let us dissect the TCP header field
by field. The Source port and Destination port fields identify the local end
points of the connection. The well-known ports are defined at www.iana.org
but each host can allocate the others as it wishes. A port plus its host's IP
address forms a 48-bit unique end point. The source and destination end points
together identify the connection.
The Sequence number and Acknowledgement
number fields perform their usual functions. Note that the latter specifies the
next byte expected, not the last byte correctly received. Both are 32 bits long
because every byte of data is numbered in a TCP stream.
The TCP header length tells how many
32-bit words are contained in the TCP header. This information is needed
because the Options field is of variable length, so the header is, too.
Technically, this field really indicates the start of the data within the
segment, measured in 32-bit words, but that number is just the header length in
words, so the effect is the same.
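As a sketch of how the fixed fields are laid out, the following Python fragment decodes the 20-byte fixed header of a raw segment. The field names are ours, and the six flag bits it extracts are the ones described next:

    import struct

    def parse_tcp_header(segment):
        """Decode the fixed 20-byte header of a raw TCP segment."""
        (src_port, dst_port, seq, ack,
         off_flags, window, checksum, urgent) = struct.unpack(
            "!HHIIHHHH", segment[:20])
        header_len = (off_flags >> 12) * 4   # 32-bit words, converted to bytes
        flags = off_flags & 0x3F             # URG, ACK, PSH, RST, SYN, FIN
        return dict(src_port=src_port, dst_port=dst_port, seq=seq, ack=ack,
                    header_len=header_len, flags=flags, window=window,
                    checksum=checksum, urgent=urgent,
                    data=segment[header_len:])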
Next comes a 6-bit field that is not
used. The fact that this field has survived intact for over a quarter of a
century is testimony to how well thought out TCP is. Lesser protocols would
have needed it to fix bugs in the original design.
Now come six 1-bit flags. URG is set
to 1 if the Urgent pointer is in use. The Urgent pointer is used to indicate a
byte offset from the current sequence number at which urgent data are to be
found. This facility is in lieu of interrupt messages. As we mentioned above,
this facility is a bare-bones way of allowing the sender to signal the receiver
without getting TCP itself involved in the reason for the interrupt.
The ACK bit is set to 1 to indicate
that the Acknowledgement number is valid. If ACK is 0, the segment does not
contain an acknowledgement so the Acknowledgement number field is ignored.
The PSH bit indicates PUSHed data.
The receiver is hereby kindly requested to deliver the data to the application
upon arrival and not buffer it until a full buffer has been received (which it
might otherwise do for efficiency).
The RST bit is used to reset a
connection that has become confused due to a host crash or some other reason.
It is also used to reject an invalid segment or refuse an attempt to open a
connection. In general, if you get a segment with the RST bit on, you have a
problem on your hands.
The SYN bit is used to establish
connections. The connection request has SYN = 1 and ACK = 0 to indicate that
the piggyback acknowledgement field is not in use. The connection reply does
bear an acknowledgement, so it has SYN = 1 and ACK = 1. In essence the SYN bit
is used to denote CONNECTION REQUEST and CONNECTION ACCEPTED, with the ACK bit
used to distinguish between those two possibilities.
The FIN bit is used to release a
connection. It specifies that the sender has no more data to transmit. However,
after closing a connection, the closing process may continue to receive data
indefinitely. Both SYN and FIN segments have sequence numbers and are thus
guaranteed to be processed in the correct order.
Flow control in TCP is handled using
a variable-sized sliding window. The Window size field tells how many bytes may
be sent starting at the byte acknowledged. A Window size field of 0 is legal
and says that the bytes up to and including Acknowledgement number - 1 have
been received, but that the receiver is currently badly in need of a rest and
would like no more data for the moment, thank you. The receiver can later grant
permission to send by transmitting a segment with the same Acknowledgement
number and a nonzero Window size field.
In the sliding window protocols of Chap. 3, acknowledgements of frames received and permission to send new frames were tied together; this was a consequence of a fixed window size for each protocol. In TCP, acknowledgements and permission to send
additional data are completely decoupled. In effect, a receiver can say: I have
received bytes up through k but I do not want any more just now. This
decoupling (in fact, a variable-sized window) gives additional flexibility. We
will study it in detail below.
A Checksum is also provided for
extra reliability. It checksums the header, the data, and the conceptual
pseudoheader shown in Fig. 6-30. When performing this computation, the
TCP Checksum field is set to zero and the data field is padded out with an
additional zero byte if its length is an odd number. The checksum algorithm is
simply to add up all the 16-bit words in one's complement and then to take the
one's complement of the sum. As a consequence, when the receiver performs the
calculation on the entire segment, including the Checksum field, the result
should be 0.
The pseudoheader contains the 32-bit
IP addresses of the source and destination machines, the protocol number for
TCP (6), and the byte count for the TCP segment (including the header). Including
the pseudoheader in the TCP checksum computation helps detect misdelivered
packets, but including it also violates the protocol hierarchy since the IP
addresses in it belong to the IP layer, not to the TCP layer. UDP uses the same
pseudoheader for its checksum.
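A sketch of the computation in Python, assuming IPv4 addresses supplied in their 4-byte binary form and a segment whose Checksum field has already been zeroed:

    import socket
    import struct

    def ones_complement_sum16(data):
        """One's complement sum of 16-bit words with end-around carry."""
        if len(data) % 2:
            data += b"\x00"       # pad odd-length data with a zero byte
        total = 0
        for (word,) in struct.iter_unpack("!H", data):
            total += word
            total = (total & 0xFFFF) + (total >> 16)   # fold the carry back in
        return total

    def tcp_checksum(src_ip, dst_ip, segment):
        """Checksum over the pseudoheader plus the whole TCP segment."""
        pseudo = src_ip + dst_ip + struct.pack("!BBH", 0, 6, len(segment))
        return ~ones_complement_sum16(pseudo + segment) & 0xFFFF

    # socket.inet_aton turns dotted decimal into the 4-byte binary form:
    print(tcp_checksum(socket.inet_aton("10.0.0.1"),
                       socket.inet_aton("10.0.0.2"), b"\x00" * 20))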
The Options field provides a way to
add extra facilities not covered by the regular header. The most important
option is the one that allows each host to specify the maximum TCP payload it
is willing to accept. Using large segments is more efficient than using small
ones because the 20-byte header can then be amortized over more data, but small
hosts may not be able to handle big segments. During connection setup, each
side can announce its maximum and see its partner's. If a host does not use this
option, it defaults to a 536-byte payload. All Internet hosts are required to
accept TCP segments of 536 + 20 = 556 bytes. The maximum segment size in the
two directions need not be the same.
For lines with high bandwidth, high
delay, or both, the 64-KB window is often a problem. On a T3 line (44.736
Mbps), it takes only 12 msec to output a full 64-KB window. If the round-trip
propagation delay is 50 msec (which is typical for a transcontinental fiber),
the sender will be idle 3/4 of the time waiting for acknowledgements. On a
satellite connection, the situation is even worse. A larger window size would
allow the sender to keep pumping data out, but using the 16-bit Window size
field, there is no way to express such a size. In RFC 1323, a Window scale option
was proposed, allowing the sender and receiver to negotiate a window scale
factor. This number allows both sides to shift the Window size field up to 14
bits to the left, thus allowing windows of up to 2^30 bytes. Most TCP
implementations now support this option.
Another option proposed by RFC 1106
and now widely implemented is the use of the selective repeat instead of go
back n protocol. If the receiver gets one bad segment and then a large number
of good ones, the normal TCP protocol will eventually time out and retransmit
all the unacknowledged segments, including all those that were received
correctly (i.e., the go back n protocol). RFC 1106 introduced NAKs to allow the
receiver to ask for a specific segment (or segments). After it gets these, it can
acknowledge all the buffered data, thus reducing the amount of data
retransmitted.
6.5.5 TCP Connection Establishment

Connections are established in TCP
by means of the three-way handshake discussed in Sec. 6.2.2. To establish a connection, one side,
say, the server, passively waits for an incoming connection by executing the
LISTEN and ACCEPT primitives, either specifying a specific source or nobody in
particular.
The other side, say, the client,
executes a CONNECT primitive, specifying the IP address and port to which it
wants to connect, the maximum TCP segment size it is willing to accept, and
optionally some user data (e.g., a password). The CONNECT primitive sends a TCP
segment with the SYN bit on and ACK bit off and waits for a response.
When this segment arrives at the
destination, the TCP entity there checks to see if there is a process that has
done a LISTEN on the port given in the Destination port field. If not, it sends
a reply with the RST bit on to reject the connection.
If some process is listening to the
port, that process is given the incoming TCP segment. It can then either accept
or reject the connection. If it accepts, an acknowledgement segment is sent
back. The sequence of TCP segments sent in the normal case is shown in Fig. 6-31(a). Note that a SYN segment consumes 1
byte of sequence space so that it can be acknowledged unambiguously.
In the event that two hosts
simultaneously attempt to establish a connection between the same two sockets,
the sequence of events is as illustrated in Fig. 6-31(b). The result of these events is that
just one connection is established, not two because connections are identified
by their end points. If the first setup results in a connection identified by (x,
y) and the second one does too, only one table entry is made, namely, for (x, y).
The initial sequence number on a
connection is not 0 for the reasons we discussed earlier. A clock-based scheme
is used, with a clock tick every 4 µsec. For additional safety, when a host
crashes, it may not reboot for the maximum packet lifetime to make sure that no
packets from previous connections are still roaming around the Internet
somewhere.
6.5.6 TCP Connection Release

Although TCP connections are full
duplex, to understand how connections are released it is best to think of them
as a pair of simplex connections. Each simplex connection is released
independently of its sibling. To release a connection, either party can send a
TCP segment with the FIN bit set, which means that it has no more data to
transmit. When the FIN is acknowledged, that direction is shut down for new
data. Data may continue to flow indefinitely in the other direction, however.
When both directions have been shut down, the connection is released. Normally,
four TCP segments are needed to release a connection, one FIN and one ACK for
each direction. However, it is possible for the first ACK and the second FIN to
be contained in the same segment, reducing the total count to three.
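The two simplex connections are visible in the sockets API as a half-close. A minimal sketch (the server name and port are hypothetical):

    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect(("server.example.com", 5000))   # hypothetical server
    sock.sendall(b"request")

    # Send our FIN: this direction is done, but the other simplex
    # connection stays open for the reply.
    sock.shutdown(socket.SHUT_WR)

    reply = b""
    while True:
        chunk = sock.recv(4096)   # data may keep flowing toward us
        if not chunk:             # empty read: the peer has sent its FIN too
            break
        reply += chunk
    sock.close()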
Just as with telephone calls in
which both people say goodbye and hang up the phone simultaneously, both ends
of a TCP connection may send FIN segments at the same time. These are each
acknowledged in the usual way, and the connection is shut down. There is, in
fact, no essential difference between the two hosts releasing sequentially or
simultaneously.
To avoid the two-army problem,
timers are used. If a response to a FIN is not forthcoming within two maximum
packet lifetimes, the sender of the FIN releases the connection. The other side
will eventually notice that nobody seems to be listening to it any more and
will time out as well. While this solution is not perfect, given the fact that
a perfect solution is theoretically impossible, it will have to do. In
practice, problems rarely arise.
6.5.7 TCP Connection Management Modeling

The steps required to establish and
release connections can be represented in a finite state machine with the 11
states listed in Fig. 6-32. In each state, certain events are
legal. When a legal event happens, some action may be taken. If some other
event happens, an error is reported.
Each connection starts in the CLOSED
state. It leaves that state when it does either a passive open (LISTEN), or an
active open (CONNECT). If the other side does the opposite one, a connection is
established and the state becomes ESTABLISHED. Connection release can be
initiated by either side. When it is complete, the state returns to CLOSED.
The finite state machine itself is
shown in Fig. 6-33. The common case of a client actively
connecting to a passive server is shown with heavy lines—solid for the client,
dotted for the server. The lightface lines are unusual event sequences. Each
line in Fig. 6-33 is marked by an event/action pair. The
event can either be a user-initiated system call (CONNECT, LISTEN, SEND, or
CLOSE), a segment arrival (SYN, FIN, ACK, or RST), or, in one case, a timeout
of twice the maximum packet lifetime. The action is the sending of a control
segment (SYN, FIN, or RST) or nothing, indicated by —. Comments are shown in
parentheses.
Figure 6-33. TCP connection management finite state machine.
The heavy solid line is the normal path for a client. The heavy dashed line is
the normal path for a server. The light lines are unusual events. Each
transition is labeled by the event causing it and the action resulting from it,
separated by a slash.
One can best understand the diagram
by first following the path of a client (the heavy solid line), then later
following the path of a server (the heavy dashed line). When an application
program on the client machine issues a CONNECT request, the local TCP entity
creates a connection record, marks it as being in the SYN SENT state, and sends
a SYN segment. Note that many connections may be open (or being opened) at the
same time on behalf of multiple applications, so the state is per connection
and recorded in the connection record. When the SYN+ACK arrives, TCP sends the
final ACK of the three-way handshake and switches into the ESTABLISHED state.
Data can now be sent and received.
When an application is finished, it
executes a CLOSE primitive, which causes the local TCP entity to send a FIN
segment and wait for the corresponding ACK (dashed box marked active close).
When the ACK arrives, a transition is made to state FIN WAIT 2 and one
direction of the connection is now closed. When the other side closes, too, a FIN
comes in, which is acknowledged. Now both sides are closed, but TCP waits a
time equal to the maximum packet lifetime to guarantee that all packets from
the connection have died off, just in case the acknowledgement was lost. When
the timer goes off, TCP deletes the connection record.
Now let us examine connection
management from the server's viewpoint. The server does a LISTEN and settles
down to see who turns up. When a SYN comes in, it is acknowledged and the
server goes to the SYN RCVD state. When the server's SYN is itself
acknowledged, the three-way handshake is complete and the server goes to the ESTABLISHED
state. Data transfer can now occur.
When the client is done, it does a
CLOSE, which causes a FIN to arrive at the server (dashed box marked passive
close). The server is then signaled. When it, too, does a CLOSE, a FIN is sent
to the client. When the client's acknowledgement shows up, the server releases
the connection and deletes the connection record.
6.5.8 TCP Transmission Policy

As mentioned earlier, window
management in TCP is not directly tied to acknowledgements as it is in most
data link protocols. For example, suppose the receiver has a 4096-byte buffer,
as shown in Fig. 6-34. If the sender transmits a 2048-byte
segment that is correctly received, the receiver will acknowledge the segment.
However, since it now has only 2048 bytes of buffer space (until the
application removes some data from the buffer), it will advertise a window of
2048 starting at the next byte expected.
Now the sender transmits another
2048 bytes, which are acknowledged, but the advertised window is 0. The sender
must stop until the application process on the receiving host has removed some
data from the buffer, at which time TCP can advertise a larger window.
When the window is 0, the sender may
not normally send segments, with two exceptions. First, urgent data may be
sent, for example, to allow the user to kill the process running on the remote
machine. Second, the sender may send a 1-byte segment to make the receiver
reannounce the next byte expected and window size. The TCP standard explicitly
provides this option to prevent deadlock if a window announcement ever gets
lost.
Senders are not required to transmit
data as soon as they come in from the application. Neither are receivers
required to send acknowledgements as soon as possible. For example, in Fig. 6-34, when the first 2 KB of data came in,
TCP, knowing that it had a 4-KB window available, would have been completely
correct in just buffering the data until another 2 KB came in, to be able to
transmit a segment with a 4-KB payload. This freedom can be exploited to
improve performance.
Consider a telnet connection to an
interactive editor that reacts on every keystroke. In the worst case, when a
character arrives at the sending TCP entity, TCP creates a 21-byte TCP segment,
which it gives to IP to send as a 41-byte IP datagram. At the receiving side,
TCP immediately sends a 40-byte acknowledgement (20 bytes of TCP header and 20
bytes of IP header). Later, when the editor has read the byte, TCP sends a
window update, moving the window 1 byte to the right. This packet is also 40
bytes. Finally, when the editor has processed the character, it echoes the
character as a 41-byte packet. In all, 162 bytes of bandwidth are used and four
segments are sent for each character typed. When bandwidth is scarce, this
method of doing business is not desirable.
One approach that many TCP
implementations use to optimize this situation is to delay acknowledgements and
window updates for 500 msec in the hope of acquiring some data on which to
hitch a free ride. Assuming the editor echoes within 500 msec, only one 41-byte
packet now need be sent back to the remote user, cutting the packet count and
bandwidth usage in half.
Although this rule reduces the load
placed on the network by the receiver, the sender is still operating
inefficiently by sending 41-byte packets containing 1 byte of data. A way to
reduce this usage is known as Nagle's algorithm (Nagle, 1984). What Nagle
suggested is simple: when data come into the sender one byte at a time, just
send the first byte and buffer all the rest until the outstanding byte is
acknowledged. Then send all the buffered characters in one TCP segment and
start buffering again until they are all acknowledged. If the user is typing
quickly and the network is slow, a substantial number of characters may go in
each segment, greatly reducing the bandwidth used. The algorithm additionally
allows a new packet to be sent if enough data have trickled in to fill half the
window or a maximum segment.
Nagle's algorithm is widely used by
TCP implementations, but there are times when it is better to disable it. In
particular, when an X Windows application is being run over the Internet, mouse
movements have to be sent to the remote computer. (The X Window system is the
windowing system used on most UNIX systems.) Gathering them up to send in
bursts makes the mouse cursor move erratically, which makes for unhappy users.
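Most socket interfaces therefore let an application disable Nagle's algorithm on a per-connection basis. In Python, for example:

    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Disable Nagle's algorithm: small writes go out immediately instead
    # of waiting for the outstanding data to be acknowledged.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

Interactive programs set this option to trade a little bandwidth for lower latency.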
Another problem that can degrade TCP
performance is the silly window syndrome (Clark, 1982). This problem occurs
when data are passed to the sending TCP entity in large blocks, but an
interactive application on the receiving side reads data 1 byte at a time. To
see the problem, look at Fig. 6-35. Initially, the TCP buffer on the
receiving side is full and the sender knows this (i.e., has a window of size
0). Then the interactive application reads one character from the TCP stream.
This action makes the receiving TCP happy, so it sends a window update to the
sender saying that it is all right to send 1 byte. The sender obliges and sends
1 byte. The buffer is now full, so the receiver acknowledges the 1-byte segment
but sets the window to 0. This behavior can go on forever.
Clark's solution is to prevent the
receiver from sending a window update for 1 byte. Instead it is forced to wait
until it has a decent amount of space available and advertise that instead.
Specifically, the receiver should not send a window update until it can handle
the maximum segment size it advertised when the connection was established or
until its buffer is half empty, whichever is smaller.
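Expressed as code, the receiver's rule is a one-liner. A sketch with our own variable names:

    def may_advertise_window(free_space, buffer_size, mss):
        # Clark's rule: suppress window updates until the receiver can
        # absorb a full-sized segment or half its buffer, whichever is
        # smaller.
        return free_space >= min(mss, buffer_size // 2)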
Furthermore, the sender can also
help by not sending tiny segments. Instead, it should try to wait until it has
accumulated enough space in the window to send a full segment or at least one
containing half of the receiver's buffer size (which it must estimate from the
pattern of window updates it has received in the past).
Nagle's algorithm and Clark's
solution to the silly window syndrome are complementary. Nagle was trying to
solve the problem caused by the sending application delivering data to TCP a
byte at a time. Clark was trying to solve the problem of the receiving
application sucking the data up from TCP a byte at a time. Both solutions are
valid and can work together. The goal is for the sender not to send small
segments and the receiver not to ask for them.
The receiving TCP can go further in
improving performance than just doing window updates in large units. Like the
sending TCP, it can also buffer data, so it can block a READ
request from the application until it has a large chunk of data to provide.
Doing this reduces the number of calls to TCP, and hence the overhead. Of
course, it also increases the response time, but for noninteractive
applications like file transfer, efficiency may be more important than response
time to individual requests.
Another receiver issue is what to do
with out-of-order segments. They can be kept or discarded, at the receiver's
discretion. Of course, acknowledgements can be sent only when all the data up
to the byte acknowledged have been received. If the receiver gets segments 0,
1, 2, 4, 5, 6, and 7, it can acknowledge everything up to and including the
last byte in segment 2. When the sender times out, it then retransmits segment
3. If the receiver has buffered segments 4 through 7, upon receipt of segment 3
it can acknowledge all bytes up to the end of segment 7.
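The bookkeeping is simple because acknowledgements are cumulative: only bytes up to the first gap can be acknowledged. A sketch in Python, assuming 1024-byte segments as in the example above and tracking sequence numbers byte by byte:

    def cumulative_ack(expected, buffered):
        """Return the next byte expected, consuming any buffered bytes
        that extend the in-order data; bytes past the first gap wait."""
        while expected in buffered:
            buffered.discard(expected)
            expected += 1
        return expected

    buf = set(range(4096, 8192))      # segments 4-7 buffered out of order
    print(cumulative_ack(3072, buf))  # 3072: segment 3 is still missing
    buf.update(range(3072, 4096))     # the retransmitted segment 3 arrives
    print(cumulative_ack(3072, buf))  # 8192: everything through segment 7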
6.5.9 TCP Congestion Control

When the load offered to any network
is more than it can handle, congestion builds up. The Internet is no exception.
In this section we will discuss algorithms that have been developed over the
past quarter of a century to deal with congestion. Although the network layer
also tries to manage congestion, most of the heavy lifting is done by TCP
because the real solution to congestion is to slow down the data rate.
In theory, congestion can be dealt
with by employing a principle borrowed from physics: the law of conservation of
packets. The idea is to refrain from injecting a new packet into the network
until an old one leaves (i.e., is delivered). TCP attempts to achieve this goal
by dynamically manipulating the window size.
The first step in managing congestion
is detecting it. In the old days, detecting congestion was difficult. A timeout
caused by a lost packet could have been caused by either (1) noise on a
transmission line or (2) packet discard at a congested router. Telling the
difference was difficult.
Nowadays, packet loss due to
transmission errors is relatively rare because most long-haul trunks are fiber
(although wireless networks are a different story). Consequently, most
transmission timeouts on the Internet are due to congestion. All the Internet
TCP algorithms assume that timeouts are caused by congestion and monitor
timeouts for signs of trouble the way miners watch their canaries.
Before discussing how TCP reacts to
congestion, let us first describe what it does to try to prevent congestion
from occurring in the first place. When a connection is established, a suitable
window size has to be chosen. The receiver can specify a window based on its
buffer size. If the sender sticks to this window size, problems will not occur
due to buffer overflow at the receiving end, but they may still occur due to
internal congestion within the network.
In Fig. 6-36, we see this problem illustrated
hydraulically. In Fig. 6-36(a), we see a thick pipe leading to a
small-capacity receiver. As long as the sender does not send more water than
the bucket can contain, no water will be lost. In Fig. 6-36(b), the limiting factor is not the
bucket capacity, but the internal carrying capacity of the network. If too much
water comes in too fast, it will back up and some will be lost (in this case by
overflowing the funnel).
Figure 6-36. (a) A fast network feeding a low-capacity
receiver. (b) A slow network feeding a high-capacity receiver.
The Internet solution is to realize
that two potential problems exist—network capacity and receiver capacity—and to
deal with each of them separately. To do so, each sender maintains two windows:
the window the receiver has granted and a second window, the congestion window.
Each reflects the number of bytes the sender may transmit. The number of bytes
that may be sent is the minimum of the two windows. Thus, the effective window
is the minimum of what the sender thinks is all right and what the receiver
thinks is all right. If the receiver says ''Send 8 KB'' but the sender knows
that bursts of more than 4 KB clog the network, it sends 4 KB. On the other
hand, if the receiver says ''Send 8 KB'' and the sender knows that bursts of up
to 32 KB get through effortlessly, it sends the full 8 KB requested.
When a connection is established,
the sender initializes the congestion window to the size of the maximum segment
in use on the connection. It then sends one maximum segment. If this segment is
acknowledged before the timer goes off, it adds one segment's worth of bytes to
the congestion window to make it two maximum size segments and sends two
segments. As each of these segments is acknowledged, the congestion window is
increased by one maximum segment size. When the congestion window is n
segments, if all n are acknowledged on time, the congestion window is increased
by the byte count corresponding to n segments. In effect, each burst
acknowledged doubles the congestion window.
The congestion window keeps growing
exponentially until either a timeout occurs or the receiver's window is
reached. The idea is that if bursts of size, say, 1024, 2048, and 4096 bytes
work fine but a burst of 8192 bytes gives a timeout, the congestion window
should be set to 4096 to avoid congestion. As long as the congestion window
remains at 4096, no bursts longer than that will be sent, no matter how much
window space the receiver grants. This algorithm is called slow start, but it
is not slow at all (Jacobson, 1988). It is exponential. All TCP implementations
are required to support it.
Now let us look at the Internet
congestion control algorithm. It uses a third parameter, the threshold,
initially 64 KB, in addition to the receiver and congestion windows. When a
timeout occurs, the threshold is set to half of the current congestion window,
and the congestion window is reset to one maximum segment. Slow start is then
used to determine what the network can handle, except that exponential growth
stops when the threshold is hit. From that point on, successful transmissions
grow the congestion window linearly (by one maximum segment for each burst)
instead of one per segment. In effect, this algorithm is guessing that it is
probably acceptable to cut the congestion window in half, and then it gradually
works its way up from there.
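The whole algorithm fits in a few lines. Below is a sketch that applies one update per acknowledged burst or timeout, which compresses the per-acknowledgement bookkeeping of a real implementation:

    def next_cwnd(cwnd, threshold, mss, timeout):
        """One step of slow start plus congestion avoidance."""
        if timeout:
            # Timeout: halve the threshold, restart from one segment.
            return mss, max(cwnd // 2, mss)
        if cwnd < threshold:
            return cwnd * 2, threshold     # slow start: exponential growth
        return cwnd + mss, threshold       # congestion avoidance: linear growth

    mss, cwnd, threshold = 1024, 1024, 32 * 1024
    for burst in range(10):                # successful bursts, no timeouts
        cwnd, threshold = next_cwnd(cwnd, threshold, mss, False)
        print("burst", burst, "congestion window", cwnd)

Running it shows the window doubling up to 32 KB and then creeping up 1 KB per burst, the same pattern plotted in Fig. 6-37.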
As an illustration of how the
congestion algorithm works, see Fig. 6-37. The maximum segment size here is 1024
bytes. Initially, the congestion window was 64 KB, but a timeout occurred, so
the threshold is set to 32 KB and the congestion window to 1 KB for
transmission 0 here. The congestion window then grows exponentially until it
hits the threshold (32 KB). Starting then, it grows linearly.
Transmission 13 is unlucky (it
should have known) and a timeout occurs. The threshold is set to half the
current window (by now 40 KB, so half is 20 KB), and slow start is initiated
all over again. When the acknowledgements from transmission 14 start coming in,
the first four each double the congestion window, but after that, growth
becomes linear again.
If no more timeouts occur, the
congestion window will continue to grow up to the size of the receiver's
window. At that point, it will stop growing and remain constant as long as
there are no more timeouts and the receiver's window does not change size. As
an aside, if an ICMP SOURCE QUENCH packet comes in and is passed to TCP, this
event is treated the same way as a timeout. An alternative (and more recent
approach) is described in RFC 3168.
6.5.10 TCP Timer Management

TCP uses multiple timers (at least
conceptually) to do its work. The most important of these is the retransmission
timer. When a segment is sent, a retransmission timer is started. If the
segment is acknowledged before the timer expires, the timer is stopped. If, on
the other hand, the timer goes off before the acknowledgement comes in, the
segment is retransmitted (and the timer started again). The question that
arises is: How long should the timeout interval be?
This question is much harder to answer in the Internet transport layer than in the data link layer of Chap. 3. In the latter case, the expected
delay is highly predictable (i.e., has a low variance), so the timer can be set
to go off just slightly after the acknowledgement is expected, as shown in Fig. 6-38(a). Since acknowledgements are rarely
delayed in the data link layer (due to lack of congestion), the absence of an
acknowledgement at the expected time generally means either the frame or the
acknowledgement has been lost.
Figure 6-38. (a) Probability density of acknowledgement
arrival times in the data link layer. (b) Probability density of
acknowledgement arrival times for TCP.
TCP is faced with a radically
different environment. The probability density function for the time it takes
for a TCP acknowledgement to come back looks more like Fig. 6-38(b) than Fig. 6-38(a). Determining the round-trip time to
the destination is tricky. Even when it is known, deciding on the timeout
interval is also difficult. If the timeout is set too short, say, T1 in Fig. 6-38(b), unnecessary retransmissions will occur, clogging the Internet with useless packets. If it is set too long (e.g., T2), performance will suffer due to the long retransmission
delay whenever a packet is lost. Furthermore, the mean and variance of the
acknowledgement arrival distribution can change rapidly within a few seconds as
congestion builds up or is resolved.
The solution is to use a highly
dynamic algorithm that constantly adjusts the timeout interval, based on
continuous measurements of network performance. The algorithm generally used by
TCP is due to Jacobson (1988) and works as follows. For each connection, TCP
maintains a variable, RTT, that is the best current estimate of the round-trip
time to the destination in question. When a segment is sent, a timer is
started, both to see how long the acknowledgement takes and to trigger a
retransmission if it takes too long. If the acknowledgement gets back before
the timer expires, TCP measures how long the acknowledgement took, say, M. It
then updates RTT according to the formula

RTT = αRTT + (1 - α)M

where α is a smoothing factor that determines how much weight is given to the old value. Typically α = 7/8.
Even given a good value of RTT,
choosing a suitable retransmission timeout is a nontrivial matter. Normally,
TCP uses βRTT, but the trick is choosing β. In the initial implementations, β was always 2, but experience showed
that a constant value was inflexible because it failed to respond when the
variance went up.
In 1988, Jacobson proposed making β roughly proportional to the standard deviation of the acknowledgement arrival time probability density function, so that a large variance means a large β,
and vice versa. In particular, he suggested using the mean deviation as a cheap
estimator of the standard deviation. His algorithm requires keeping track of
another smoothed variable, D, the deviation. Whenever an acknowledgement comes
in, the difference between the expected and observed values, | RTT - M |, is computed.
A smoothed value of this is maintained in D by the formula

D = αD + (1 - α)|RTT - M|

where α may or may not be the same value used to smooth RTT. While D is not exactly the same as the standard deviation, it is good enough, and Jacobson showed how it could be computed using only integer adds, subtracts, and shifts, a big plus. Most TCP implementations now use this algorithm and set the timeout interval to

Timeout = RTT + 4 × D
The choice of the factor 4 is
somewhat arbitrary, but it has two advantages. First, multiplication by 4 can
be done with a single shift. Second, it minimizes unnecessary timeouts and
retransmissions because less than 1 percent of all packets come in more than
four standard deviations late. (Actually, Jacobson initially said to use 2, but
later work has shown that 4 gives better performance.)
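Putting the pieces together, here is a sketch of the estimator in Python. Initializing D to half the first sample is a common convention, not something specified above:

    class RTOEstimator:
        """Jacobson's smoothed RTT and mean deviation estimator."""

        def __init__(self, first_sample, alpha=7 / 8):
            self.alpha = alpha
            self.rtt = first_sample
            self.dev = first_sample / 2   # conventional starting value

        def ack_received(self, m):
            """Fold in measurement m; return the new timeout interval."""
            a = self.alpha
            # Update D using |RTT - M| before RTT itself changes.
            self.dev = a * self.dev + (1 - a) * abs(self.rtt - m)
            self.rtt = a * self.rtt + (1 - a) * m
            return self.rtt + 4 * self.dev

    est = RTOEstimator(first_sample=0.100)   # 100-msec first measurement
    print(est.ack_received(0.300))           # a late ACK raises the timeout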
One problem that occurs with the
dynamic estimation of RTT is what to do when a segment times out and is sent
again. When the acknowledgement comes in, it is unclear whether the
acknowledgement refers to the first transmission or a later one. Guessing wrong
can seriously contaminate the estimate of RTT. Phil Karn discovered this
problem the hard way. He is an amateur radio enthusiast interested in
transmitting TCP/IP packets by ham radio, a notoriously unreliable medium (on a
good day, half the packets get through). He made a simple proposal: do not
update RTT on any segments that have been retransmitted. Instead, the timeout
is doubled on each failure until the segments get through the first time. This
fix is called Karn's algorithm. Most TCP implementations use it.
The retransmission timer is not the
only timer TCP uses. A second timer is the persistence timer. It is designed to
prevent the following deadlock. The receiver sends an acknowledgement with a
window size of 0, telling the sender to wait. Later, the receiver updates the
window, but the packet with the update is lost. Now both the sender and the
receiver are waiting for each other to do something. When the persistence timer
goes off, the sender transmits a probe to the receiver. The response to the
probe gives the window size. If it is still zero, the persistence timer is set
again and the cycle repeats. If it is nonzero, data can now be sent.
A third timer that some
implementations use is the keepalive timer. When a connection has been idle for
a long time, the keepalive timer may go off to cause one side to check whether
the other side is still there. If it fails to respond, the connection is
terminated. This feature is controversial because it adds overhead and may
terminate an otherwise healthy connection due to a transient network partition.
The last timer used on each TCP
connection is the one used in the TIMED WAIT state while closing. It runs for
twice the maximum packet lifetime to make sure that when a connection is
closed, all packets created by it have died off.
6.5.11 Wireless TCP and UDP

In theory, transport protocols
should be independent of the technology of the underlying network layer. In
particular, TCP should not care whether IP is running over fiber or over radio.
In practice, it does matter because most TCP implementations have been
carefully optimized based on assumptions that are true for wired networks but that
fail for wireless networks. Ignoring the properties of wireless transmission
can lead to a TCP implementation that is logically correct but has horrendous
performance.
The principal problem is the
congestion control algorithm. Nearly all TCP implementations nowadays assume
that timeouts are caused by congestion, not by lost packets. Consequently, when
a timer goes off, TCP slows down and sends less vigorously (e.g., Jacobson's
slow start algorithm). The idea behind this approach is to reduce the network load
and thus alleviate the congestion.
Unfortunately, wireless transmission
links are highly unreliable. They lose packets all the time. The proper
approach to dealing with lost packets is to send them again, and as quickly as
possible. Slowing down just makes matters worse. If, say, 20 percent of all
packets are lost, then when the sender transmits 100 packets/sec, the
throughput is 80 packets/sec. If the sender slows down to 50 packets/sec, the
throughput drops to 40 packets/sec.
In effect, when a packet is lost on
a wired network, the sender should slow down. When one is lost on a wireless
network, the sender should try harder. When the sender does not know what the
network is, it is difficult to make the correct decision.
Frequently, the path from sender to
receiver is heterogeneous. The first 1000 km might be over a wired network, but
the last 1 km might be wireless. Now making the correct decision on a timeout
is even harder, since it matters where the problem occurred. A solution
proposed by Bakne and Badrinath (1995), indirect TCP, is to split the TCP
connection into two separate connections, as shown in Fig. 6-39. The first connection goes from the
sender to the base station. The second one goes from the base station to the
receiver. The base station simply copies packets between the connections in
both directions.
The advantage of this scheme is that
both connections are now homogeneous. Timeouts on the first connection can slow
the sender down, whereas timeouts on the second one can speed it up. Other
parameters can also be tuned separately for the two connections. The
disadvantage of the scheme is that it violates the semantics of TCP. Since each
part of the connection is a full TCP connection, the base station acknowledges
each TCP segment in the usual way. Only now, receipt of an acknowledgement by
the sender does not mean that the receiver got the segment, only that the base
station got it.
A different solution, due to
Balakrishnan et al. (1995), does not break the semantics of TCP. It works by
making several small modifications to the network layer code in the base
station. One of the changes is the addition of a snooping agent that observes
and caches TCP segments going out to the mobile host and acknowledgements
coming back from it. When the snooping agent sees a TCP segment going out to
the mobile host but does not see an acknowledgement coming back before its
(relatively short) timer goes off, it just retransmits that segment, without
telling the source that it is doing so. It also retransmits when it sees
duplicate acknowledgements from the mobile host go by, invariably meaning that
the mobile host has missed something. Duplicate acknowledgements are discarded
on the spot, to avoid having the source misinterpret them as congestion.
One disadvantage of this
transparency, however, is that if the wireless link is very lossy, the source
may time out waiting for an acknowledgement and invoke the congestion control
algorithm. With indirect TCP, the congestion control algorithm will never be
started unless there really is congestion in the wired part of the network.
The Balakrishnan et al. paper also
has a solution to the problem of lost segments originating at the mobile host.
When the base station notices a gap in the inbound sequence numbers, it
generates a request for a selective repeat of the missing bytes by using a TCP
option.
Using these fixes, the wireless link
is made more reliable in both directions, without the source knowing about it
and without changing the TCP semantics.
While UDP does not suffer from the
same problems as TCP, wireless communication also introduces difficulties for
it. The main trouble is that programs use UDP expecting it to be highly
reliable. They know that no guarantees are given, but they still expect it to
be near perfect. In a wireless environment, UDP will be far from perfect. For
programs that can recover from lost UDP messages but only at considerable cost,
suddenly going from an environment where messages theoretically can be lost but
rarely are, to one in which they are constantly being lost can result in a
performance disaster.
Wireless communication also affects
areas other than just performance. For example, how does a mobile host find a
local printer to connect to, rather than use its home printer? Somewhat related
to this is how to get the WWW page for the local cell, even if its name is not
known. Also, WWW page designers tend to assume lots of bandwidth is available.
Putting a large logo on every page becomes counterproductive if it is going to
take 10 sec to transmit over a slow wireless link every time the page is
referenced, irritating the users no end.
As wireless networking becomes more
common, the problems of running TCP over it become more acute. Additional work
in this area is reported in (Barakat et al., 2000; Ghani and Dixit, 1999;
Huston, 2001; and Xylomenos et al., 2001).
6.5.12 Transactional TCP

Earlier in this chapter we looked at
remote procedure call as a way to implement client-server systems. If both the
request and reply are small enough to fit into single packets and the operation
is idempotent, UDP can simply be used. However, if these conditions are not
met, using UDP is less attractive. For example, if the reply can be quite
large, then the pieces must be sequenced and a mechanism must be devised to
retransmit lost pieces. In effect, the application is required to reinvent TCP.
Clearly, that is unattractive, but
using TCP itself is also unattractive. The problem is the efficiency. The
normal sequence of packets for doing an RPC over TCP is shown in Fig. 6-40(a). Nine packets are required in the
best case.
The nine packets are as follows:
1. The client sends a SYN packet to establish a connection.
2. The server sends an ACK packet to acknowledge the SYN packet.
3. The client completes the three-way handshake.
4. The client sends the actual request.
5. The client sends a FIN packet to indicate that it is done sending.
6. The server acknowledges the request and the FIN.
7. The server sends the reply back to the client.
8. The server sends a FIN packet to indicate that it is also done.
9. The client acknowledges the server's FIN.
Note that this is the best case. In
the worst case, the client's request and FIN are acknowledged separately, as
are the server's reply and FIN.
The question quickly arises of
whether there is some way to combine the efficiency of RPC using UDP (just two
messages) with the reliability of TCP. The answer is: Almost. It can be done
with an experimental TCP variant called T/TCP (Transactional TCP), which is
described in RFCs 1379 and 1644.
The central idea here is to modify
the standard connection setup sequence slightly to allow the transfer of data
during setup. The T/TCP protocol is illustrated in Fig. 6-40(b). The client's first packet contains
the SYN bit, the request itself, and the FIN. In effect it says: I want to
establish a connection, here is the data, and I am done.
When the server gets the request, it
looks up or computes the reply, and chooses how to respond. If the reply fits
in one packet, it gives the reply of Fig. 6-40(b), which says: I acknowledge your FIN,
here is the answer, and I am done. The client then acknowledges the server's FIN
and the protocol terminates in three messages.
However, if the result is larger
than 1 packet, the server also has the option of not turning on the FIN bit, in
which case it can send multiple packets before closing its direction.
It is probably worth mentioning that
T/TCP is not the only proposed improvement to TCP. Another proposal is SCTP (Stream
Control Transmission Protocol). Its features include message boundary
preservation, multiple delivery modes (e.g., unordered delivery), multihoming
(backup destinations), and selective acknowledgements (Stewart and Metz, 2001).
However, whenever someone proposes changing something that has worked so well
for so long, there is always a huge battle between the ''Users are demanding
more features'' and ''If it ain't broken, don't fix it'' camps.