3.2 Error Detection and Correction
The telephone system has three
parts: the switches, the interoffice trunks, and the local loops. The first two
are now almost entirely digital in most developed countries. The local loops
are still analog twisted copper pairs and will continue to be so for years due
to the enormous expense of replacing them. While errors are rare on the digital
part, they are still common on the local loops. Furthermore, wireless
communication is becoming more common, and the error rates here are orders of
magnitude worse than on the interoffice fiber trunks. The conclusion is:
transmission errors are going to be with us for many years to come. We have to
learn how to deal with them.
As a result of the physical
processes that generate them, errors on some media (e.g., radio) tend to come
in bursts rather than singly. Having the errors come in bursts has both
advantages and disadvantages over isolated single-bit errors. On the advantage
side, computer data are always sent in blocks of bits. Suppose that the block
size is 1000 bits and the error rate is 0.001 per bit. If errors were
independent, most blocks would contain an error. If the errors came in bursts
of 100, however, only one or two blocks in 100 would be affected, on average.
The disadvantage of burst errors is that they are much harder to correct than
are isolated errors.
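The arithmetic behind this comparison is easy to check with a short Python
sketch (the numbers simply restate the example above):

    # Block size 1000 bits, error rate 0.001 per bit, as above.
    n, p = 1000, 0.001

    # Independent errors: probability that a block contains at least one error.
    print(1 - (1 - p) ** n)   # about 0.63, so most blocks are affected

    # Burst errors: the same 100 errors per 100,000 bits, arriving as one
    # burst, touch at most two adjacent 1000-bit blocks out of every 100.
    print(2 / 100)            # only one or two blocks in 100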
Network designers have developed two
basic strategies for dealing with errors. One way is to include enough
redundant information along with each block of data sent, to enable the
receiver to deduce what the transmitted data must have been. The other way is
to include only enough redundancy to allow the receiver to deduce that an error
occurred, but not which error, and have it request a retransmission. The former
strategy uses error-correcting codes and the latter uses error-detecting codes.
The use of error-correcting codes is often referred to as forward error
correction.
Each of these techniques occupies a
different ecological niche. On channels that are highly reliable, such as
fiber, it is cheaper to use an error detecting code and just retransmit the
occasional block found to be faulty. However, on channels such as wireless
links that make many errors, it is better to add enough redundancy to each
block for the receiver to be able to figure out what the original block was,
rather than relying on a retransmission, which itself may be in error.
To understand how errors can be
handled, it is necessary to look closely at what an error really is. Normally,
a frame consists of m data (i.e., message) bits and r redundant, or check,
bits. Let the total length be n (i.e., n = m + r). An n-bit unit containing
data and check bits is often referred to as an n-bit codeword.
Given any two codewords, say,
10001001 and 10110001, it is possible to determine how many corresponding bits
differ. In this case, 3 bits differ. To determine how many bits differ, just
exclusive OR the two codewords and count the number of 1 bits in the result,
for example:

    10001001
    10110001
    --------
    00111000
The number of bit positions in which
two codewords differ is called the Hamming distance (Hamming, 1950). Its
significance is that if two codewords are a Hamming distance d apart, it will
require d single-bit errors to convert one into the other.
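As a quick illustration, the distance between the two codewords above can be
computed in a few lines of Python (a sketch of the XOR-and-count method just
described):

    def hamming_distance(a: int, b: int) -> int:
        # XOR the codewords and count the number of 1 bits in the result.
        return bin(a ^ b).count("1")

    print(hamming_distance(0b10001001, 0b10110001))   # prints 3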
In most data transmission
applications, all 2^m possible data messages are legal, but due to
the way the check bits are computed, not all of the 2^n possible
codewords are used. Given the algorithm for computing the check bits, it is
possible to construct a complete list of the legal codewords, and from this
list find the two codewords whose Hamming distance is minimum. This distance is
the Hamming distance of the complete code.
The error-detecting and
error-correcting properties of a code depend on its Hamming distance. To detect
d errors, you need a distance d + 1 code because with such a code there is no
way that d single-bit errors can change a valid codeword into another valid
codeword. When the receiver sees an invalid codeword, it can tell that a
transmission error has occurred. Similarly, to correct d errors, you need a
distance 2d + 1 code because that way the legal codewords are so far apart that
even with d changes, the original codeword is still closer than any other
codeword, so it can be uniquely determined.
As a simple example of an
error-detecting code, consider a code in which a single parity bit is appended
to the data. The parity bit is chosen so that the number of 1 bits in the
codeword is even (or odd). For example, when 1011010 is sent in even parity, a
bit is added to the end to make it 10110100. With odd parity 1011010 becomes
10110101. A code with a single parity bit has a distance of 2, since any
single-bit error produces a codeword with the wrong parity. It can be used to
detect single errors.
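A minimal sketch of appending such a parity bit (even parity shown; odd parity
just inverts the appended bit):

    def add_parity(bits: str, even: bool = True) -> str:
        # Choose the extra bit so the number of 1s is even (or odd).
        ones = bits.count("1")
        return bits + str(ones % 2 if even else (ones + 1) % 2)

    print(add_parity("1011010"))               # 10110100, even parity
    print(add_parity("1011010", even=False))   # 10110101, odd parity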
As a simple example of an
error-correcting code, consider a code with only four valid codewords:
0000000000, 0000011111, 1111100000,
and 1111111111
This code has a distance of 5, which
means that it can correct double errors. If the codeword 0000000111 arrives,
the receiver knows that the original must have been 0000011111. If, however, a
triple error changes 0000000000 into 0000000111, the error will not be
corrected properly.
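Both properties can be checked mechanically: the distance of the code is the
minimum pairwise Hamming distance, and correction picks the nearest legal
codeword. A sketch:

    from itertools import combinations

    CODE = ["0000000000", "0000011111", "1111100000", "1111111111"]

    def distance(a: str, b: str) -> int:
        return sum(x != y for x, y in zip(a, b))

    # The Hamming distance of the complete code: minimum over all pairs.
    print(min(distance(a, b) for a, b in combinations(CODE, 2)))   # prints 5

    def correct(received: str) -> str:
        # Decode to the nearest legal codeword.
        return min(CODE, key=lambda c: distance(received, c))

    print(correct("0000000111"))   # 0000011111, as the text predicts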
Imagine that we want to design a
code with m message bits and r check bits that will allow all single errors to
be corrected. Each of the 2^m legal messages has n illegal codewords
at a distance of 1 from it. These are formed by systematically inverting each of
the n bits in the n-bit codeword formed from it. Thus, each of the 2^m
legal messages requires n + 1 bit patterns dedicated to it. Since the total
number of bit patterns is 2^n, we must have (n + 1) 2^m ≤ 2^n. Using n = m + r,
this requirement becomes (m + r + 1) ≤ 2^r. Given m, this puts a
lower limit on the number of check bits needed to correct single errors.
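A short loop finds this lower limit for any m (a sketch; note that m = 7
yields the 11-bit codewords used in Fig. 3-7, and m = 1000 yields the 10 check
bits used in the overhead example later in this section):

    def check_bits_needed(m: int) -> int:
        # Smallest r satisfying m + r + 1 <= 2**r.
        r = 1
        while m + r + 1 > 2 ** r:
            r += 1
        return r

    print(check_bits_needed(7))      # 4, so 7 data bits give an 11-bit codeword
    print(check_bits_needed(1000))   # 10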
This theoretical lower limit can, in
fact, be achieved using a method due to Hamming (1950). The bits of the
codeword are numbered consecutively, starting with bit 1 at the left end, bit 2
to its immediate right, and so on. The bits that are powers of 2 (1, 2, 4, 8,
16, etc.) are check bits. The rest (3, 5, 6, 7, 9, etc.) are filled up with the
m data bits. Each check bit forces the parity of some collection of bits,
including itself, to be even (or odd). A bit may be included in several parity
computations. To see which check bits the data bit in position k contributes
to, rewrite k as a sum of powers of 2. For example, 11 = 1 + 2 + 8 and 29 = 1 +
4 + 8 + 16. A bit is checked by just those check bits occurring in its
expansion (e.g., bit 11 is checked by bits 1, 2, and 8).
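The construction can be sketched directly in Python (positions numbered from 1
at the left, check bits at the powers of 2, even parity assumed):

    def hamming_encode(data: str) -> str:
        n = len(data)
        r = 0
        while 2 ** r < n + r + 1:     # number of check bits needed
            r += 1
        total = n + r
        code = [0] * (total + 1)      # index 0 unused; positions 1..total
        bits = iter(data)
        for pos in range(1, total + 1):
            if pos & (pos - 1):       # not a power of 2: a data position
                code[pos] = int(next(bits))
        # Each check bit p forces even parity over the positions whose
        # binary expansion contains p.
        for p in (2 ** i for i in range(r)):
            code[p] = sum(code[k] for k in range(1, total + 1) if k & p) % 2
        return "".join(map(str, code[1:]))

    print(hamming_encode("1001000"))  # 00110010000: 7-bit ASCII 'H' encoded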
When a codeword arrives, the
receiver initializes a counter to zero. It then examines each check bit, k (k =
1, 2, 4, 8, ...), to see if it has the correct parity. If not, the receiver
adds k to the counter. If the counter is zero after all the check bits have
been examined (i.e., if they were all correct), the codeword is accepted as
valid. If the counter is nonzero, it contains the number of the incorrect bit.
For example, if check bits 1, 2, and 8 are in error, the inverted bit is 11,
because it is the only one checked by bits 1, 2, and 8. Figure 3-7 shows some 7-bit ASCII characters
encoded as 11-bit codewords using a Hamming code. Remember that the data are
found in bit positions 3, 5, 6, 7, 9, 10, and 11.
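The receiver's procedure can be sketched the same way; flipping any single bit
of the codeword produced above makes the counter come out to exactly that
bit's position:

    def hamming_correct(codeword: str) -> str:
        n = len(codeword)
        bits = [0] + [int(b) for b in codeword]   # positions 1..n
        counter, p = 0, 1
        while p <= n:
            # Check bit p should give even parity over positions containing p.
            if sum(bits[k] for k in range(1, n + 1) if k & p) % 2:
                counter += p                      # wrong parity: add p
            p *= 2
        if counter:                               # nonzero: that bit is inverted
            bits[counter] ^= 1
        return "".join(map(str, bits[1:]))

    garbled = "0011001000" + "1"                  # bit 11 of 00110010000 inverted
    print(hamming_correct(garbled))               # restores 00110010000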
Hamming codes can only correct
single errors. However, there is a trick that can be used to permit Hamming
codes to correct burst errors. A sequence of k consecutive codewords are
arranged as a matrix, one codeword per row. Normally, the data would be
transmitted one codeword at a time, from left to right. To correct burst
errors, the data should be transmitted one column at a time, starting with the
leftmost column. When all k bits have been sent, the second column is sent, and
so on, as indicated in Fig. 3-7. When the frame arrives at the receiver,
the matrix is reconstructed, one column at a time. If a burst error of length k
occurs, at most 1 bit in each of the k codewords will have been affected, but
the Hamming code can correct one error per codeword, so the entire block can be
restored. This method uses kr check bits to make blocks of km data bits immune
to a single burst error of length k or less.
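A sketch of this column-wise arrangement (the rows below are illustrative
11-bit codewords; Python's zip does the transposition):

    rows = ["00110010000", "10111001001", "11101010101", "01101011010"]

    def interleave(rows):
        # Transmit the first bit of every codeword, then the second, and so on.
        return "".join("".join(col) for col in zip(*rows))

    def deinterleave(stream, k):
        # Receiver: rebuild the matrix one column at a time.
        cols = [stream[i:i + k] for i in range(0, len(stream), k)]
        return ["".join(bits) for bits in zip(*cols)]

    sent = interleave(rows)
    # A burst of length k in 'sent' lands in k consecutive positions, i.e.,
    # at most one bit per codeword, which the Hamming code can repair.
    assert deinterleave(sent, len(rows)) == rows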
Error-correcting codes are widely
used on wireless links, which are notoriously noisy and error prone when
compared to copper wire or optical fibers. Without error-correcting codes, it
would be hard to get anything through. However, over copper wire or fiber, the
error rate is much lower, so error detection and retransmission is usually more
efficient there for dealing with the occasional error.
As a simple example, consider a
channel on which errors are isolated and the error rate is 10^-6 per
bit. Let the block size be 1000 bits. To provide error correction for 1000-bit
blocks, 10 check bits are needed; a megabit of data would require 10,000 check
bits. To merely detect a block with a single 1-bit error, one parity bit per
block will suffice. Once every 1000 blocks, an extra block (1001 bits) will
have to be transmitted. The total overhead for the error detection +
retransmission method is only 2001 bits per megabit of data, versus 10,000 bits
for a Hamming code.
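The bookkeeping is easy to reproduce (a sketch; one megabit is taken as 1000
blocks of 1000 bits):

    blocks = 1000                          # one megabit of data

    # Error correction: 10 check bits on every 1000-bit block.
    print(blocks * 10)                     # 10,000 bits of overhead

    # Detection: 1 parity bit per block, plus one 1001-bit retransmission
    # per 1000 blocks (error rate 10**-6 means about one bad block in 1000).
    print(blocks * 1 + 1001)               # 2001 bits of overhead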
If a single parity bit is added to a
block and the block is badly garbled by a long burst error, the probability
that the error will be detected is only 0.5, which is hardly acceptable. The
odds can be improved considerably if each block to be sent is regarded as a
rectangular matrix n bits wide and k bits high, as described above. A parity
bit is computed separately for each column and affixed to the matrix as the
last row. The matrix is then transmitted one row at a time. When the block
arrives, the receiver checks all the parity bits. If any one of them is wrong,
the receiver requests a retransmission of the block. Additional retransmissions
are requested as needed until an entire block is received without any parity
errors.
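A sketch of this column-parity scheme:

    def add_parity_row(rows):
        # One even-parity bit per column, affixed as the last row.
        width = len(rows[0])
        parity = "".join(str(sum(int(r[i]) for r in rows) % 2)
                         for i in range(width))
        return rows + [parity]

    def block_ok(rows):
        # Receiver: every column must still have even parity.
        return all(sum(int(r[i]) for r in rows) % 2 == 0
                   for i in range(len(rows[0])))

    block = add_parity_row(["1011", "0110", "1100"])
    print(block_ok(block))                          # True
    garbled = [block[0], "1001"] + block[2:]        # a burst garbles one row
    print(block_ok(garbled))                        # False: request retransmission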
This method can detect a single
burst of length n, since only 1 bit per column will be changed. A burst of
length n + 1 will pass undetected, however, if the first bit is inverted, the
last bit is inverted, and all the other bits are correct. (A burst error does
not imply that all the bits are wrong; it just implies that at least the first
and last are wrong.) If the block is badly garbled by a long burst or by
multiple shorter bursts, the probability that any of the n columns will have
the correct parity, by accident, is 0.5, so the probability of a bad block
being accepted when it should not be is 2^-n.
Although the above scheme may
sometimes be adequate, in practice, another method is in widespread use: the polynomial
code, also known as a CRC (Cyclic Redundancy Check). Polynomial codes are based
upon treating bit strings as representations of polynomials with coefficients
of 0 and 1 only. A k-bit frame is regarded as the coefficient list for a
polynomial with k terms, ranging from x^(k-1) to x^0.
Such a polynomial is said to be of degree k - 1. The high-order (leftmost) bit
is the coefficient of x^(k-1); the next bit is the
coefficient of x^(k-2), and so on. For example, 110001 has
6 bits and thus represents a six-term polynomial with coefficients 1, 1, 0, 0,
0, and 1: x^5 + x^4 + x^0.
Polynomial arithmetic is done modulo
2, according to the rules of algebraic field theory. There are no carries for
addition or borrows for subtraction. Both addition and subtraction are
identical to exclusive OR. For example:

    10011011        11110000
  + 11001010      - 10100110
    --------        --------
    01010001        01010110
Long division is carried out the
same way as it is in binary except that the subtraction is done modulo 2, as
above. A divisor is said "to go into" a dividend if the dividend has as many
bits as the divisor.
When the polynomial code method is
employed, the sender and receiver must agree upon a generator polynomial, G(x),
in advance. Both the high- and low-order bits of the generator must be 1. To
compute the checksum for some frame with m bits, corresponding to the
polynomial M(x), the frame must be longer than the generator polynomial. The
idea is to append a checksum to the end of the frame in such a way that the
polynomial represented by the checksummed frame is divisible by G(x). When the
receiver gets the checksummed frame, it tries dividing it by G(x). If there is
a remainder, there has been a transmission error.
The algorithm for computing the
checksum is as follows:
- Let r be the degree of G(x). Append r zero bits to the low-order end of the frame so it now contains m + r bits and corresponds to the polynomial x^r M(x).
- Divide the bit string corresponding to G(x) into the bit string corresponding to x^r M(x), using modulo 2 division.
- Subtract the remainder (which is always r or fewer bits) from the bit string corresponding to x^r M(x) using modulo 2 subtraction. The result is the checksummed frame to be transmitted. Call its polynomial T(x).
Figure 3-8 illustrates the calculation for a
frame 1101011011 using the generator G(x) = x^4 + x + 1.
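The three steps translate directly into code. The following sketch reproduces
the Fig. 3-8 calculation (frame 1101011011, generator 10011, i.e.,
x^4 + x + 1):

    def crc_remainder(frame: str, generator: str) -> str:
        r = len(generator) - 1
        # Step 1: append r zero bits (multiply by x^r).
        bits = [int(b) for b in frame + "0" * r]
        g = [int(b) for b in generator]
        # Step 2: modulo 2 long division (subtraction is XOR, no borrows).
        for i in range(len(frame)):
            if bits[i]:
                for j, gb in enumerate(g):
                    bits[i + j] ^= gb
        # Step 3: the last r bits are the checksum.
        return "".join(map(str, bits[-r:]))

    def error_free(received: str, generator: str) -> bool:
        # Receiver: divide the whole frame by G(x); remainder 0 means no error.
        bits = [int(b) for b in received]
        g = [int(b) for b in generator]
        for i in range(len(bits) - len(g) + 1):
            if bits[i]:
                for j, gb in enumerate(g):
                    bits[i + j] ^= gb
        return not any(bits)

    frame, gen = "1101011011", "10011"
    rem = crc_remainder(frame, gen)          # 1110
    transmitted = frame + rem                # 11010110111110, as in Fig. 3-8
    print(error_free(transmitted, gen))      # True
    corrupted = transmitted[:11] + "0" + transmitted[12:]
    print(error_free(corrupted, gen))        # False: the error is caught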
It should be clear that T(x) is
divisible (modulo 2) by G(x). In any division problem, if you diminish the
dividend by the remainder, what is left over is divisible by the divisor. For
example, in base 10, if you divide 210,278 by 10,941, the remainder is 2399. By
subtracting 2399 from 210,278, what is left over (207,879) is divisible by
10,941.
Now let us analyze the power of this
method. What kinds of errors will be detected? Imagine that a transmission
error occurs, so that instead of the bit string for T(x) arriving, T(x) + E(x)
arrives. Each 1 bit in E(x) corresponds to a bit that has been inverted. If
there are k 1 bits in E(x), k single-bit errors have occurred. A single burst
error is characterized by an initial 1, a mixture of 0s and 1s, and a final 1,
with all other bits being 0.
Upon receiving the checksummed
frame, the receiver divides it by G(x); that is, it computes [T(x) + E(x)]/G(x).
T(x)/G(x) leaves no remainder, so the result of the computation is simply
the remainder of E(x)/G(x). Those
errors that happen to correspond to polynomials containing G(x) as a factor
will slip by; all other errors will be caught.
If there has been a single-bit
error, E(x) = x^i, where i determines which bit is in error. If G(x)
contains two or more terms, it will never divide E(x), so all single-bit errors
will be detected.
If there have been two isolated
single-bit errors, E(x) = x^i + x^j, where i > j.
Alternatively, this can be written as E(x) = x^j (x^(i-j) + 1). If we assume
that G(x) is not divisible by x, a sufficient
condition for all double errors to be detected is that G(x) does not divide
x^k + 1 for any k up to the maximum value of i - j (i.e., up to the maximum
frame length). Simple, low-degree polynomials that give protection to long
frames are known. For example, x^15 + x^14 + 1 will not divide x^k + 1 for any
value of k below 32,768.
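This claim can be spot-checked by brute force for modest frame lengths (a
sketch; polynomials are held as integer bit masks):

    def poly_mod(a: int, g: int) -> int:
        # Remainder of a(x) divided by g(x), coefficients modulo 2.
        while a.bit_length() >= g.bit_length():
            a ^= g << (a.bit_length() - g.bit_length())
        return a

    G = (1 << 15) | (1 << 14) | 1            # x^15 + x^14 + 1
    # The remainder of (x^k + 1)/G(x) is nonzero for every k tested.
    assert all(poly_mod((1 << k) | 1, G) for k in range(1, 4096))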
If there are an odd number of bits
in error, E(x) contains an odd number of terms (e.g., x^5 + x^2 + 1, but not
x^2 + 1). Interestingly, no polynomial with an odd
number of terms has x + 1 as a factor in the modulo 2 system. By making x + 1
a factor of G(x), we can catch all errors consisting of an odd number of
inverted bits.
To see that no polynomial with an
odd number of terms is divisible by x + 1, assume that E(x) has an odd number
of terms and is divisible by x + 1. Factor E(x) into (x + 1) Q(x). Now evaluate
E(1) = (1 + 1)Q(1). Since 1 + 1 = 0 (modulo 2), E(1) must be zero. If E(x) has
an odd number of terms, substituting 1 for x everywhere will always yield 1 as
the result. Thus, no polynomial with an odd number of terms is divisible by x +
1.
Finally, and most importantly, a
polynomial code with r check bits will detect all burst errors of length ≤ r.
A burst error of length k can be
represented by x^i (x^(k-1) + ... + 1), where i
determines how far from the right-hand end of the received frame the burst is
located. If G(x) contains an x^0 term, it will not have x^i
as a factor, so if the degree of the parenthesized expression is less than the
degree of G(x), the remainder can never be zero.
If the burst length is r + 1, the
remainder of the division by G(x) will be zero if and only if the burst is
identical to G(x). By definition of a burst, the first and last bits must be 1,
so whether it matches depends on the r - 1 intermediate bits. If all
combinations are regarded as equally likely, the probability of such an
incorrect frame being accepted as valid is (1/2)^(r-1).
It can also be shown that when an
error burst longer than r + 1 bits occurs or when several shorter bursts occur,
the probability of a bad frame getting through unnoticed is (1/2)^r,
assuming that all bit patterns are equally likely.
Certain polynomials have become
international standards. The one used in IEEE 802 is

    x^32 + x^26 + x^23 + x^22 + x^16 + x^12 + x^11 + x^10 + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1
Among other desirable properties, it
has the property that it detects all bursts of length 32 or less and all bursts
affecting an odd number of bits.
Although the calculation required to
compute the checksum may seem complicated, Peterson and Brown (1961) have shown
that a simple shift register circuit can be constructed to compute and verify
the checksums in hardware. In practice, this hardware is nearly always used.
Virtually all LANs use it and point-to-point lines do, too, in some cases.
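When dedicated hardware is not available, the same polynomial is routinely
computed in software as well. A minimal bit-at-a-time sketch in the reflected
(low-order-bit-first) form conventionally used on LANs, checked against
Python's zlib:

    import zlib

    def crc32_bitwise(data: bytes) -> int:
        # Bit-at-a-time CRC-32 over the IEEE 802 polynomial (reflected form).
        crc = 0xFFFFFFFF
        for byte in data:
            crc ^= byte
            for _ in range(8):
                crc = (crc >> 1) ^ (0xEDB88320 if crc & 1 else 0)
        return crc ^ 0xFFFFFFFF

    data = b"123456789"
    assert crc32_bitwise(data) == zlib.crc32(data) == 0xCBF43926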
For decades, it has been assumed
that frames to be checksummed contain random bits. All analyses of checksum
algorithms have been made under this assumption. Inspection of real data has
shown this assumption to be quite wrong. As a consequence, under some
circumstances, undetected errors are much more common than had been previously
thought (Partridge et al., 1995).