7.4.6 Introduction to Video
We have discussed the ear at length
now; time to move on to the eye (no, this section is not followed by one on the
nose). The human eye has the property that when an image appears on the retina,
the image is retained for some number of milliseconds before decaying. If a
sequence of images is drawn line by line at 50 images/sec, the eye does not
notice that it is looking at discrete images. All video (i.e., television)
systems exploit this principle to produce moving pictures.
To understand video, it is best to
start with simple, old-fashioned black-and-white television. To represent the
two-dimensional image in front of it as a one-dimensional voltage as a function
of time, the camera scans an electron beam rapidly across the image and slowly
down it, recording the light intensity as it goes. At the end of the scan,
called a frame, the beam retraces. This intensity as a function of time is
broadcast, and receivers repeat the scanning process to reconstruct the image.
The scanning pattern used by both the camera and the receiver is shown in Fig. 7-70. (As an aside, CCD cameras integrate
rather than scan, but some cameras and all monitors do scan.)
The exact scanning parameters vary
from country to country. The system used in North and South America and Japan
has 525 scan lines, a horizontal-to-vertical aspect ratio of 4:3, and 30
frames/sec. The European system has 625 scan lines, the same aspect ratio of
4:3, and 25 frames/sec. In both systems, the top few and bottom few lines are
not displayed (to approximate a rectangular image on the original round CRTs).
Only 483 of the 525 NTSC scan lines (and 576 of the 625 PAL/SECAM scan lines)
are displayed. The beam is turned off during the vertical retrace, so many
stations (especially in Europe) use this time to broadcast TeleText (text pages
containing news, weather, sports, stock prices, etc.).
While 25 frames/sec is enough to
capture smooth motion, at that frame rate many people, especially older ones,
will perceive the image to flicker (because the old image has faded off the
retina before the new one appears). Rather than increase the frame rate, which
would require using more scarce bandwidth, a different approach is taken.
Instead of the scan lines being displayed in order, first all the odd scan
lines are displayed, then the even ones are displayed. Each of these half
frames is called a field. Experiments have shown that although people notice
flicker at 25 frames/sec, they do not notice it at 50 fields/sec. This
technique is called interlacing. Noninterlaced television or video is called progressive.
Note that movies run at 24 fps, but each frame is fully visible for 1/24 sec.
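For readers who think in code, the split into fields is easy to picture. The short Python sketch below is our own illustration, not part of any television standard; it treats a frame as a list of scan lines and pulls out the odd and even fields.

    def split_into_fields(frame):
        """Split one frame (a list of scan lines) into two interlaced fields.

        The odd field holds lines 1, 3, 5, ... and the even field lines 2, 4, 6, ...
        Showing the two fields in turn doubles the refresh rate (e.g., 50 fields/sec
        from 25 frames/sec) without transmitting any extra picture information.
        """
        odd_field = frame[0::2]   # lines 1, 3, 5, ... (0-based indices 0, 2, 4, ...)
        even_field = frame[1::2]  # lines 2, 4, 6, ...
        return odd_field, even_field

    # Example: a 625-line frame yields a 313-line field and a 312-line field.
    frame = [f"scan line {i}" for i in range(1, 626)]
    odd, even = split_into_fields(frame)
    print(len(odd), len(even))   # 313 312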
Color video uses the same scanning
pattern as monochrome (black and white), except that instead of displaying the
image with one moving beam, it uses three beams moving in unison. One beam is
used for each of the three additive primary colors: red, green, and blue (RGB).
This technique works because any color can be constructed from a linear
superposition of red, green, and blue with the appropriate intensities.
However, for transmission on a single channel, the three color signals must be
combined into a single composite signal.
When color television was invented,
various methods for displaying color were technically possible, and different
countries made different choices, leading to systems that are still
incompatible. (Note that these choices have nothing to do with VHS versus
Betamax versus P2000, which are recording methods.) In all countries, a
political requirement was that programs transmitted in color had to be receivable
on existing black-and-white television sets. Consequently, the simplest scheme,
just encoding the RGB signals separately, was not acceptable. RGB is also not
the most efficient scheme.
The first color system was
standardized in the United States by the National Television System
Committee, which lent its acronym to the standard: NTSC. Color television was
introduced in Europe several years later, by which time the technology had
improved substantially, leading to systems with greater noise immunity and
better colors. These systems are called SECAM (SEquentiel Couleur Avec Memoire),
which is used in France and Eastern Europe, and PAL (Phase Alternating Line)
used in the rest of Europe. The difference in color quality between NTSC
and PAL/SECAM has led to an industry joke that NTSC really stands for Never
Twice the Same Color.
To allow color transmissions to be
viewed on black-and-white receivers, all three systems linearly combine the RGB
signals into a luminance (brightness) signal and two chrominance (color)
signals, although they all use different coefficients for constructing these
signals from the RGB signals. Oddly enough, the eye is much more sensitive to
the luminance signal than to the chrominance signals, so the latter need not be
transmitted as accurately. Consequently, the luminance signal can be broadcast
at the same frequency as the old black-and-white signal, so it can be received
on black-and-white television sets. The two chrominance signals are broadcast
in narrow bands at higher frequencies. Some television sets have controls
labeled brightness, hue, and saturation (or brightness, tint, and color) for
controlling these three signals separately. Understanding luminance and
chrominance is necessary for understanding how video compression works.
In the past few years, there has
been considerable interest in HDTV (High Definition TeleVision), which produces
sharper images by roughly doubling the number of scan lines. The United States,
Europe, and Japan have all developed HDTV systems, all different and all
mutually incompatible. Did you expect otherwise? The basic principles of HDTV
in terms of scanning, luminance, chrominance, and so on, are similar to the
existing systems. However, all three formats have a common aspect ratio of 16:9
instead of 4:3 to match them better to the format used for movies (which are
recorded on 35 mm film, which has an aspect ratio of 3:2).
The simplest representation of
digital video is a sequence of frames, each consisting of a rectangular grid of
picture elements, or pixels. Each pixel can be a single bit, to represent
either black or white. The quality of such a system is similar to what you get
by sending a color photograph by fax—awful. (Try it if you can; otherwise
photocopy a color photograph on a copying machine that does not rasterize.)
The next step up is to use 8 bits
per pixel to represent 256 gray levels. This scheme gives high-quality
black-and-white video. For color video, good systems use 8 bits for each of the
RGB colors, although nearly all systems mix these into composite video for
transmission. While using 24 bits per pixel limits the number of colors to
about 16 million, the human eye cannot even distinguish this many colors, let
alone more. Digital color images are produced using three scanning beams, one
per color. The geometry is the same as for the analog system of Fig. 7-70 except that the continuous scan lines
are now replaced by neat rows of discrete pixels.
To produce smooth motion, digital
video, like analog video, must display at least 25 frames/sec. However, since
good-quality computer monitors often rescan the screen from images stored in
memory 75 times per second or more, interlacing is not needed and
consequently is not normally used. Just repainting (i.e., redrawing) the same
frame three times in a row is enough to eliminate flicker.
In other words, smoothness of motion
is determined by the number of different images per second, whereas flicker is
determined by the number of times the screen is painted per second. These two
parameters are different. A still image painted at 20 frames/sec will not show
jerky motion, but it will flicker because one frame will decay from the retina
before the next one appears. A movie with 20 different frames per second, each
of which is painted four times in a row, will not flicker, but the motion will
appear jerky.
The significance of these two
parameters becomes clear when we consider the bandwidth required for
transmitting digital video over a network. Current computer monitors mostly use
the 4:3 aspect ratio so they can use inexpensive, mass-produced picture tubes
designed for the consumer television market. Common configurations are 1024 x
768, 1280 x 960, and 1600 x 1200. Even the smallest of these with 24 bits per
pixel and 25 frames/sec needs to be fed at 472 Mbps. It would take a SONET
OC-12 carrier to manage this, and running an OC-12 SONET carrier into
everyone's house is not exactly on the agenda. Doubling this rate to avoid
flicker is even less attractive. A better solution is to transmit 25 frames/sec
and have the computer store each one and paint it twice. Broadcast television
does not use this strategy because television sets do not have memory. And even
if they did have memory, analog signals cannot be stored in RAM without
conversion to digital form first, which requires extra hardware. As a consequence,
interlacing is needed for broadcast television but not for digital video.
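The 472-Mbps figure quoted above is easy to verify. The following Python fragment is just an illustration of that arithmetic: it computes the raw bit rate for the smallest of the three monitor configurations.

    def uncompressed_video_rate(width, height, bits_per_pixel, frames_per_sec):
        """Raw bit rate of uncompressed digital video, in bits per second."""
        return width * height * bits_per_pixel * frames_per_sec

    # 1024 x 768 pixels, 24 bits/pixel, 25 frames/sec.
    rate = uncompressed_video_rate(1024, 768, 24, 25)
    print(rate / 1e6)   # about 472 Mbps, i.e., most of a SONET OC-12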
It should be obvious by now that
transmitting uncompressed video is completely out of the question. The only
hope is that massive compression is possible. Fortunately, a large body of
research over the past few decades has led to many compression techniques and
algorithms that make video transmission feasible. In this section we will study
how video compression is accomplished.
All compression systems require two
algorithms: one for compressing the data at the source, and another for
decompressing it at the destination. In the literature, these algorithms are
referred to as the encoding and decoding algorithms, respectively. We will use
this terminology here, too.
These algorithms exhibit certain
asymmetries that are important to understand. First, for many applications, a
multimedia document, say, a movie, will be encoded only once (when it is stored
on the multimedia server) but will be decoded thousands of times (when it is
viewed by customers). This asymmetry means that it is acceptable for the
encoding algorithm to be slow and require expensive hardware provided that the
decoding algorithm is fast and does not require expensive hardware. After all,
the operator of a multimedia server might be quite willing to rent a parallel
supercomputer for a few weeks to encode its entire video library, but requiring
consumers to rent a supercomputer for 2 hours to view a video is not likely to be
a big success. Many practical compression systems go to great lengths to make
decoding fast and simple, even at the price of making encoding slow and
complicated.
On the other hand, for real-time
multimedia, such as video conferencing, slow encoding is unacceptable. Encoding
must happen on-the-fly, in real time. Consequently, real-time multimedia uses
different algorithms or parameters than storing videos on disk, often with
appreciably less compression.
A second asymmetry is that the
encode/decode process need not be invertible. That is, when compressing a file,
transmitting it, and then decompressing it, the user expects to get the
original back, accurate down to the last bit. With multimedia, this requirement
does not exist. It is usually acceptable to have the video signal after
encoding and then decoding be slightly different from the original. When the
decoded output is not exactly equal to the original input, the system is said
to be lossy. If the input and output are identical, the system is lossless.
Lossy systems are important because accepting a small amount of information
loss can give a huge payoff in terms of the compression ratio possible.
A video is just a sequence of images
(plus sound). If we could find a good algorithm for encoding a single image,
this algorithm could be applied to each image in succession to achieve video
compression. Good still image compression algorithms exist, so let us start our
study of video compression there. The JPEG (Joint Photographic Experts Group)
standard for compressing continuous-tone still pictures (e.g., photographs) was
developed by photographic experts working under the joint auspices of ITU, ISO,
and IEC, another standards body. It is important for multimedia because, to a
first approximation, the multimedia standard for moving pictures, MPEG, is just
the JPEG encoding of each frame separately, plus some extra features for
interframe compression and motion detection. JPEG is defined in International
Standard 10918.
JPEG has four modes and many
options. It is more like a shopping list than a single algorithm. For our
purposes, though, only the lossy sequential mode is relevant, and that one is
illustrated in Fig. 7-71. Furthermore, we will concentrate on
the way JPEG is normally used to encode 24-bit RGB video images and will leave
out some of the minor details for the sake of simplicity.
Step 1 of encoding an image with
JPEG is block preparation. For the sake of specificity, let us assume that the
JPEG input is a 640 x 480 RGB image with 24 bits/pixel, as shown in Fig. 7-72(a). Since using luminance and
chrominance gives better compression, we first compute the luminance, Y, and
the two chrominances, I and Q (for NTSC), according to the following formulas:

Y = 0.30R + 0.59G + 0.11B
I = 0.60R - 0.28G - 0.32B
Q = 0.21R - 0.52G + 0.31B

For PAL, the chrominances are called
U and V and the coefficients are different, but the idea is the same. SECAM is
different from both NTSC and PAL.
Separate matrices are constructed
for Y, I, and Q, each with elements in the range 0 to 255. Next, square blocks
of four pixels are averaged in the I and Q matrices to reduce them to 320 x
240. This reduction is lossy, but the eye barely notices it since the eye
responds to luminance more than to chrominance. Nevertheless, it compresses the
total amount of data by a factor of two. Now 128 is subtracted from each
element of all three matrices to put 0 in the middle of the range. Finally,
each matrix is divided up into 8 x 8 blocks. The Y matrix has 4800 blocks; the
other two have 1200 blocks each, as shown in Fig. 7-72(b).
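To make step 1 concrete, here is a small Python sketch of block preparation using NumPy. The function name, the channel-last array layout, and the simple 2 x 2 averaging are our own assumptions for illustration; the coefficients are the NTSC ones given above.

    import numpy as np

    def block_prepare(rgb):
        """JPEG step 1 (sketch): RGB -> Y, I, Q, subsample chrominance, cut into 8x8 blocks.

        rgb: array of shape (480, 640, 3) with values 0..255, channels last.
        """
        r = rgb[..., 0].astype(float)
        g = rgb[..., 1].astype(float)
        b = rgb[..., 2].astype(float)

        # Luminance and the two NTSC chrominance signals.
        y = 0.30 * r + 0.59 * g + 0.11 * b
        i = 0.60 * r - 0.28 * g - 0.32 * b
        q = 0.21 * r - 0.52 * g + 0.31 * b

        # Average each 2x2 square of the chrominance matrices: 640 x 480 -> 320 x 240.
        def subsample(m):
            return (m[0::2, 0::2] + m[0::2, 1::2] + m[1::2, 0::2] + m[1::2, 1::2]) / 4

        i, q = subsample(i), subsample(q)

        # Center the values around zero, then cut each matrix into 8x8 blocks.
        def to_blocks(m):
            m = m - 128
            h, w = m.shape
            return m.reshape(h // 8, 8, w // 8, 8).swapaxes(1, 2).reshape(-1, 8, 8)

        return to_blocks(y), to_blocks(i), to_blocks(q)   # 4800, 1200, and 1200 blocks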
Step 2 of JPEG is to apply a DCT (Discrete
Cosine Transformation) to each of the 7200 blocks separately. The output of
each DCT is an 8 x 8 matrix of DCT coefficients. DCT element (0, 0) is the
average value of the block. The other elements tell how much spectral power is
present at each spatial frequency. In theory, a DCT is lossless, but in
practice, using floating-point numbers and transcendental functions always
introduces some roundoff error that results in a little information loss.
Normally, these elements decay rapidly with distance from the origin, (0, 0),
as suggested by Fig. 7-73.
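The transformation itself can be written in a few lines. The sketch below is a direct (and deliberately slow) implementation of the 8 x 8 orthonormal DCT-II using NumPy; real encoders use fast factorizations, but the result is the same up to roundoff.

    import numpy as np

    def dct_2d(block):
        """2-D discrete cosine transform of one 8x8 block (direct, unoptimized sketch).

        Element (0, 0) of the result is proportional to the block average; the other
        elements give the spectral power at each spatial frequency.
        """
        n = 8
        # Orthonormal DCT-II basis matrix.
        c = np.array([[np.sqrt((1 if u == 0 else 2) / n) *
                       np.cos((2 * x + 1) * u * np.pi / (2 * n))
                       for x in range(n)] for u in range(n)])
        return c @ block @ c.T

    # A completely flat block transforms into a single nonzero (0, 0) coefficient.
    flat = np.full((8, 8), 100.0)
    coeffs = dct_2d(flat)
    print(round(coeffs[0, 0]), round(coeffs[3, 5]))   # 800 0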
Once the DCT is complete, JPEG moves
on to step 3, called quantization, in which the less important DCT coefficients
are wiped out. This (lossy) transformation is done by dividing each of the
coefficients in the 8 x 8 DCT matrix by a weight taken from a table. If all the
weights are 1, the transformation does nothing. However, if the weights
increase sharply from the origin, higher spatial frequencies are dropped
quickly.
An example of this step is given in Fig. 7-74. Here we see the initial DCT matrix,
the quantization table, and the result obtained by dividing each DCT element by
the corresponding quantization table element. The values in the quantization
table are not part of the JPEG standard. Each application must supply its own,
allowing it to control the loss-compression trade-off.
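In code, quantization is a single elementwise division followed by rounding. The table in the sketch below is a made-up one whose weights grow away from the origin; as noted above, the real table is chosen by each application, not by the standard.

    import numpy as np

    def quantize(dct_block, table):
        """JPEG step 3 (sketch): divide each DCT coefficient by its quantization weight
        and round. The rounding is where the information is actually lost; a decoder
        just multiplies by the same table to recover approximate coefficients."""
        return np.rint(dct_block / table).astype(int)

    # A hypothetical quantization table whose weights increase with spatial frequency.
    table = np.array([[1 + 2 * (u + v) for v in range(8)] for u in range(8)])

    # Usage: quantized = quantize(dct_2d(some_block), table)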
Step 4 reduces the (0, 0) value of
each block (the one in the upper-left corner) by replacing it with the amount
it differs from the corresponding element in the previous block. Since these
elements are the averages of their respective blocks, they should change
slowly, so taking the differential values should reduce most of them to small
values. No differentials are computed from the other values. The (0, 0) values
are referred to as the DC components; the other values are the AC components.
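Step 4 amounts to differential coding of one value per block. A minimal sketch follows; the function name and the in-place update are our own choices.

    def dc_differences(blocks):
        """JPEG step 4 (sketch): replace each block's (0, 0) value by its difference
        from the (0, 0) value of the previous block.

        Because neighboring blocks have similar averages, most of the differences are
        small numbers, which later turn into short Huffman codes.
        """
        previous = 0
        for block in blocks:
            dc = block[0][0]
            block[0][0] = dc - previous
            previous = dc
        return blocks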
Step 5 linearizes the 64 elements
and applies run-length encoding to the list. Scanning the block from left to
right and then top to bottom will not concentrate the zeros together, so a
zigzag scanning pattern is used, as shown in Fig. 7-75. In this example, the zigzag pattern
produces 38 consecutive 0s at the end of the matrix. This string can be reduced
to a single count saying there are 38 zeros, a technique known as run-length
encoding.
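The zigzag order and the run-length step can both be expressed compactly. In the Python sketch below, the output format, pairs of (zero run, nonzero value) plus a final count of trailing zeros, is a simplification of the real JPEG symbol format, but it captures the idea that the 38 trailing zeros of the example collapse into a single count.

    def zigzag_order(n=8):
        """Visiting order for the zigzag scan of an n x n block, in the usual JPEG
        direction: (0,0), (0,1), (1,0), (2,0), (1,1), (0,2), ..."""
        def key(rc):
            r, c = rc
            d = r + c
            return (d, -c) if d % 2 else (d, c)
        return sorted(((r, c) for r in range(n) for c in range(n)), key=key)

    def run_length_encode(block):
        """JPEG step 5 (sketch): linearize a block in zigzag order and collapse zeros."""
        values = [block[r][c] for r, c in zigzag_order(len(block))]
        encoded, zeros = [], 0
        for v in values:
            if v == 0:
                zeros += 1
            else:
                encoded.append((zeros, v))     # (zeros skipped, nonzero value)
                zeros = 0
        encoded.append(("trailing zeros", zeros))
        return encoded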
Now we have a list of numbers that
represent the image (in transform space). Step 6 Huffman-encodes the numbers for
storage or transmission, assigning common numbers shorter codes than uncommon
ones.
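Huffman coding itself is a standard algorithm. The sketch below builds a code table from symbol frequencies using a heap; real JPEG uses a specific symbol alphabet and predefined or per-image tables, so treat this purely as an illustration of the principle that common values get short codes.

    import heapq
    from collections import Counter

    def huffman_code(symbols):
        """Build a Huffman code for a list of symbols (sketch of JPEG step 6)."""
        freq = Counter(symbols)
        if len(freq) == 1:                         # degenerate case: one distinct symbol
            return {next(iter(freq)): "0"}
        # Heap entries: (frequency, tie-breaker, {symbol: partial codeword}).
        heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
        heapq.heapify(heap)
        counter = len(heap)
        while len(heap) > 1:
            f1, _, c1 = heapq.heappop(heap)        # two least frequent subtrees
            f2, _, c2 = heapq.heappop(heap)
            merged = {s: "0" + w for s, w in c1.items()}
            merged.update({s: "1" + w for s, w in c2.items()})
            heapq.heappush(heap, (f1 + f2, counter, merged))
            counter += 1
        return heap[0][2]

    # The most common value ends up with the shortest codeword.
    print(huffman_code([0, 0, 0, 0, 5, 5, -3]))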
JPEG may seem complicated, but that
is because it is complicated. Still, since it often produces a 20:1 compression
or better, it is widely used. Decoding a JPEG image requires running the
algorithm backward. JPEG is roughly symmetric: decoding takes as long as
encoding. This property is not true of all compression algorithms, as we shall
now see.
Finally, we come to the heart of the
matter: the MPEG (Motion Picture Experts Group) standards. These are the main
algorithms used to compress videos and have been international standards since
1993. Because movies contain both images and sound, MPEG can compress both
audio and video. We have already examined audio compression and still image
compression, so let us now examine video compression.
The first standard to be finalized
was MPEG-1 (International Standard 11172). Its goal was to produce
video-recorder-quality output (352 x 240 for NTSC) using a bit rate of 1.2
Mbps. A 352 x 240 image with 24 bits/pixel and 25 frames/sec requires 50.7
Mbps, so getting it down to 1.2 Mbps is not entirely trivial. A factor of 40
compression is needed. MPEG-1 can be transmitted over twisted pair transmission
lines for modest distances. MPEG-1 is also used for storing movies on CD-ROM.
The next standard in the MPEG family
was MPEG-2 (International Standard 13818), which was originally designed for
compressing broadcast-quality video into 4 to 6 Mbps, so it could fit in an NTSC
or PAL broadcast channel. Later, MPEG-2 was expanded to support higher
resolutions, including HDTV. It is very common now, as it forms the basis for
DVD and digital satellite television.
The basic principles of MPEG-1 and
MPEG-2 are similar, but the details are different. To a first approximation,
MPEG-2 is a superset of MPEG-1, with additional features, frame formats, and
encoding options. We will first discuss MPEG-1, then MPEG-2.
MPEG-1 has three parts: audio,
video, and system, which integrates the other two, as shown in Fig. 7-76. The audio and video encoders work
independently, which raises the issue of how the two streams get synchronized
at the receiver. This problem is solved by having a 90-kHz system clock that
outputs the current time value to both encoders. These values are 33 bits, to
allow films to run for 24 hours without wrapping around. These timestamps are
included in the encoded output and propagated all the way to the receiver,
which can use them to synchronize the audio and video streams.
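The choice of a 33-bit counter driven at 90 kHz is easy to check. A two-line calculation (ours, not part of the standard text) shows the wraparound time comfortably exceeds 24 hours:

    # A 90-kHz clock counted in 33 bits wraps only after 2**33 ticks.
    CLOCK_HZ = 90_000
    wrap_seconds = 2 ** 33 / CLOCK_HZ
    print(wrap_seconds / 3600)   # about 26.5 hours, so a 24-hour film never wraps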
Now let us consider MPEG-1 video
compression. Two kinds of redundancies exist in movies: spatial and temporal.
MPEG-1 uses both. Spatial redundancy can be utilized by simply coding each
frame separately with JPEG. This approach is occasionally used, especially when
random access to each frame is needed, as in editing video productions. In this
mode, a compressed bandwidth in the 8- to 10-Mbps range is achievable.
Additional compression can be
achieved by taking advantage of the fact that consecutive frames are often
almost identical. This effect is smaller than it might first appear since many
moviemakers cut between scenes every 3 or 4 seconds (time a movie and count the
scenes). Nevertheless, even a run of 75 highly similar frames offers the
potential of a major reduction over simply encoding each frame separately with
JPEG.
For scenes in which the camera and
background are stationary and one or two actors are moving around slowly,
nearly all the pixels will be identical from frame to frame. Here, just
subtracting each frame from the previous one and running JPEG on the difference
would do fine. However, for scenes where the camera is panning or zooming, this
technique fails badly. What is needed is some way to compensate for this
motion. This is precisely what MPEG does; it is the main difference between
MPEG and JPEG.
MPEG-1 output consists of four kinds
of frames:
- I (Intracoded) frames: Self-contained JPEG-encoded still pictures.
- P (Predictive) frames: Block-by-block difference with the last frame.
- B (Bidirectional) frames: Differences between the last and next frame.
- D (DC-coded) frames: Block averages used for fast forward.
I-frames are just still pictures
coded using a variant of JPEG, also using full-resolution luminance and
half-resolution chrominance along each axis. It is necessary to have I-frames
appear in the output stream periodically for three reasons. First, MPEG-1 can
be used for a multicast transmission, with viewers tuning in at will. If all
frames depended on their predecessors going back to the first frame, anybody
who missed the first frame could never decode any subsequent frames. Second, if
any frame were received in error, no further decoding would be possible. Third,
without I-frames, while doing a fast forward or rewind, the decoder would have
to calculate every frame passed over so it would know the full value of the one
it stopped on. For these reasons, I-frames are inserted into the output once or
twice per second.
P-frames, in contrast, code
interframe differences. They are based on the idea of macroblocks, which cover
16 x 16 pixels in luminance space and 8 x 8 pixels in chrominance space. A
macroblock is encoded by searching the previous frame for it or something only
slightly different from it.
An example of where P-frames would
be useful is given in Fig. 7-77. Here we see three consecutive frames
that have the same background, but differ in the position of one person. The
macroblocks containing the background scene will match exactly, but the macroblocks
containing the person will be offset in position by some unknown amount and
will have to be tracked down.
The MPEG-1 standard does not specify
how to search, how far to search, or how good a match has to be to count. This
is up to each implementation. For example, an implementation might search for a
macroblock at the current position in the previous frame, and all other
positions offset ±Δx in the x direction and ±Δy in the y direction. For each
position, the number of matches in the luminance matrix could be computed. The
position with the highest score would be declared the winner, provided it was
above some predefined threshold. Otherwise, the macroblock would be said to be
missing. Much more sophisticated algorithms are also possible, of course.
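For concreteness, here is one way such a search might look in Python with NumPy. Everything here is an implementation choice left open by the standard: the function name, the exhaustive ±8-pixel search window, the arbitrary threshold, and the use of the sum of absolute differences (lower is better) in place of the raw match count described above.

    import numpy as np

    def find_macroblock(previous, block, row, col, dx=8, dy=8, threshold=1000):
        """One plausible macroblock search of the kind MPEG-1 leaves to implementers.

        Scans positions up to +/- dx, +/- dy around (row, col) in the previous frame's
        luminance matrix and scores each candidate by the sum of absolute differences.
        Returns (motion_vector, matched_pixels), or None if no candidate scores below
        the (arbitrary, illustrative) threshold, in which case the macroblock counts
        as missing and would be coded intracoded, like an I-frame block.
        """
        h, w = block.shape                       # 16 x 16 in luminance space
        best = None
        for off_r in range(-dy, dy + 1):
            for off_c in range(-dx, dx + 1):
                r, c = row + off_r, col + off_c
                if r < 0 or c < 0 or r + h > previous.shape[0] or c + w > previous.shape[1]:
                    continue
                candidate = previous[r:r + h, c:c + w]
                score = np.abs(candidate.astype(int) - block.astype(int)).sum()
                if best is None or score < best[0]:
                    best = (score, (off_r, off_c), candidate)
        if best is None or best[0] > threshold:
            return None                          # macroblock "missing"
        return best[1], best[2]                  # motion vector and matched pixels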
If a macroblock is found, it is
encoded by taking the difference with its value in the previous frame (for
luminance and both chrominances). These difference matrices are then subject to
the discrete cosine transformation, quantization, run-length encoding, and
Huffman encoding, just as with JPEG. The value for the macroblock in the output
stream is then the motion vector (how far the macroblock moved from its
previous position in each direction), followed by the Huffman-encoded list of
numbers. If the macroblock is not located in the previous frame, the current
value is encoded with JPEG, just as in an I-frame.
Clearly, this algorithm is highly
asymmetric. An implementation is free to try every plausible position in the
previous frame if it wants to, in a desperate attempt to locate every last
macroblock, no matter where it moved to. This approach will minimize the
encoded MPEG-1 stream at the expense of very slow encoding. This approach might
be fine for a one-time encoding of a film library but would be terrible for
real-time videoconferencing.
Similarly, each implementation is
free to decide what constitutes a ''found'' macroblock. This freedom allows
implementers to compete on the quality and speed of their algorithms, but
always produce compliant MPEG-1. No matter what search algorithm is used, the
final output is either the JPEG encoding of the current macroblock or the JPEG
encoding of the difference between the current macroblock and one in the
previous frame at a specified offset from the current one.
So far, decoding MPEG-1 is
straightforward. Decoding I-frames is the same as decoding JPEG images.
Decoding P-frames requires the decoder to buffer the previous frame and then
build up the new one in a second buffer based on fully encoded macroblocks and
macroblocks containing differences from the previous frame. The new frame is
assembled macroblock by macroblock.
B-frames are similar to P-frames,
except that they allow the reference macroblock to be in either a previous
frame or in a succeeding frame. This additional freedom allows improved motion
compensation and is also useful when objects pass in front of, or behind, other
objects. To do B-frame encoding, the encoder needs to hold three decoded frames
in memory at once: the past one, the current one, and the future one. Although
B-frames give the best compression, not all implementations support them.
D-frames are only used to make it
possible to display a low-resolution image when doing a rewind or fast forward.
Doing the normal MPEG-1 decoding in real time is difficult enough. Expecting
the decoder to do it when slewing through the video at ten times normal speed
is asking a bit much. Instead, the D-frames are used to produce low-resolution
images. Each D-frame entry is just the average value of one block, with no
further encoding, making it easy to display in real time. This facility is
important to allow people to scan through a video at high speed in search of a
particular scene. The D-frames are generally placed just before the
corresponding I-frames so if fast forwarding is stopped, it will be possible to
start viewing at normal speed.
Having finished our treatment of
MPEG-1, let us now move on to MPEG-2. MPEG-2 encoding is fundamentally similar
to MPEG-1 encoding, with I-frames, P-frames, and B-frames. D-frames are not
supported, however. Also, the discrete cosine transformation uses a 10 x 10
block instead of an 8 x 8 block, to give 50 percent more coefficients, hence
better quality. Since MPEG-2 is targeted at broadcast television as well as
DVD, it supports both progressive and interlaced images, in contrast to MPEG-1,
which supports only progressive images. Other minor details also differ between
the two standards.
Instead of supporting only one
resolution level, MPEG-2 supports four: low (352 x 240), main (720 x 480),
high-1440 (1440 x 1152), and high (1920 x 1080). Low resolution is for VCRs and
backward compatibility with MPEG-1. Main is the normal one for NTSC
broadcasting. The other two are for HDTV. For high-quality output, MPEG-2
usually runs at 4–8 Mbps.
Video on demand is sometimes
compared to an electronic video rental store. The user (customer) selects any
one of a large number of available videos and takes it home to view. With
video on demand, however, the selection is made at home using the television set's
remote control, and the video starts immediately. No trip to the store is
needed. Needless to say, implementing video on demand is a wee bit more
complicated than describing it. In this section, we will give an overview of
the basic ideas and their implementation.
Is video on demand really like
renting a video, or is it more like picking a movie to watch from a 500-channel
cable system? The answer has important technical implications. In particular,
video rental users are used to the idea of being able to stop a video, make a
quick trip to the kitchen or bathroom, and then resume from where the video
stopped. Television viewers do not expect to put programs on pause.
If video on demand is going to
compete successfully with video rental stores, it may be necessary to allow
users to stop, start, and rewind videos at will. Giving users this ability
virtually forces the video provider to transmit a separate copy to each one.
On the other hand, if video on
demand is seen more as advanced television, then it may be sufficient to have
the video provider start each popular video, say, every 10 minutes, and run
these nonstop. A user wanting to see a popular video may have to wait up to 10
minutes for it to start. Although pause/resume is not possible here, a viewer
returning to the living room after a short break can switch to another channel
showing the same video but 10 minutes behind. Some material will be repeated,
but nothing will be missed. This scheme is called near video on demand. It
offers the potential for much lower cost, because the same feed from the video
server can go to many users at once. The difference between video on demand and
near video on demand is similar to the difference between driving your own car
and taking the bus.
Watching movies on (near) demand is but
one of a vast array of potential new services possible once wideband networking
is available. The general model that many people use is illustrated in Fig. 7-78. Here we see a high-bandwidth (national
or international) wide area backbone network at the center of the system.
Connected to it are thousands of local distribution networks, such as cable TV
or telephone company distribution systems. The local distribution systems reach
into people's houses, where they terminate in set-top boxes, which are, in
fact, powerful, specialized personal computers.
Attached to the backbone by
high-bandwidth optical fibers are numerous information providers. Some of these
will offer pay-per-view video or pay-per-hear audio CDs. Others will offer
specialized services, such as home shopping (letting viewers rotate a can of
soup and zoom in on the list of ingredients or view a video clip on how to
drive a gasoline-powered lawn mower). Sports, news, reruns of ''I Love Lucy,''
WWW access, and innumerable other possibilities will no doubt quickly become
available.
Also included in the system are
local spooling servers that allow videos to be placed closer to the users (in
advance), to save bandwidth during peak hours. How these pieces will fit
together and who will own what are matters of vigorous debate within the
industry. Below we will examine the design of the main pieces of the system:
the video servers and the distribution network.
To have (near) video on demand, we
need video servers capable of storing and outputting a large number of movies
simultaneously. The total number of movies ever made is estimated at 65,000
(Minoli, 1995). When compressed in MPEG-2, a normal movie occupies roughly 4 GB
of storage, so 65,000 of them would require something like 260 terabytes. Add
to this all the old television programs ever made, sports films, newsreels,
talking shopping catalogs, etc., and it is clear that we have an industrial-strength
storage problem on our hands.
The cheapest way to store large
volumes of information is on magnetic tape. This has always been the case and
probably always will be. An Ultrium tape can store 200 GB (50 movies) at a cost
of about $1–$2 per movie. Large mechanical tape servers that hold thousands of
tapes and have a robot arm for fetching any tape and inserting it into a tape
drive are commercially available now. The problem with these systems is the
access time (especially for the 50th movie on a tape), the transfer rate, and
the limited number of tape drives (to serve n movies at once, the unit would
need n drives).
Fortunately, experience with video
rental stores, public libraries, and other such organizations shows that not
all items are equally popular. Experimentally, when N movies are available, the
fraction of all requests being for the kth most popular one is approximately C/k.
Here C is computed to normalize the sum to 1, namely

C = 1/(1 + 1/2 + 1/3 + 1/4 + ... + 1/N)

Thus, the most popular movie is
seven times as popular as the number seven movie. This result is known as Zipf's
law (Zipf, 1949).
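Zipf's law is easy to play with numerically. The small sketch below (our own helper, with a hypothetical name) computes the request fraction for the kth most popular title and confirms the factor-of-seven claim:

    def zipf_fraction(n_movies, k):
        """Fraction of all requests that go to the k-th most popular of n_movies titles,
        under Zipf's law: C/k, with C chosen so the fractions sum to 1."""
        c = 1 / sum(1 / i for i in range(1, n_movies + 1))
        return c / k

    # Whatever the catalog size, movie 1 draws seven times the requests of movie 7.
    print(round(zipf_fraction(65_000, 1) / zipf_fraction(65_000, 7), 6))   # 7.0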
The fact that some movies are much
more popular than others suggests a possible solution in the form of a storage
hierarchy, as shown in Fig. 7-79. Here, the performance increases as one
moves up the hierarchy.
An alternative to tape is optical
storage. Current DVDs hold 4.7 GB, good for one movie, but the next generation
will hold two movies. Although seek times are slow compared to magnetic disks
(50 msec versus 5 msec), their low cost and high reliability make optical juke
boxes containing thousands of DVDs a good alternative to tape for the more
heavily used movies.
Next come magnetic disks. These have
short access times (5 msec), high transfer rates (320 MB/sec for SCSI 320), and
substantial capacities (> 100 GB), which makes them well suited to holding
movies that are actually being transmitted (as opposed to just being stored in
case somebody ever wants them). Their main drawback is the high cost for storing
movies that are rarely accessed.
At the top of the pyramid of Fig. 7-79 is RAM. RAM is the fastest storage
medium, but also the most expensive. When RAM prices drop to $50/GB, a 4-GB
movie will occupy $200 worth of RAM, so having 100 movies in RAM will
cost $20,000 for the 400 GB of memory. Still, for a video server feeding out 100
movies, just keeping all the movies in RAM is beginning to look feasible. And
if the video server has 100 customers but they are collectively watching only
20 different movies, it begins to look not only feasible, but a good design.
Since a video server is really just
a massive real-time I/O device, it needs a different hardware and software
architecture than a PC or a UNIX workstation. The hardware architecture of a
typical video server is illustrated in Fig. 7-80. The server has one or more
high-performance CPUs, each with some local memory, a shared main memory, a
massive RAM cache for popular movies, a variety of storage devices for holding
the movies, and some networking hardware, normally an optical interface to a
SONET or ATM backbone at OC-12 or higher. These subsystems are connected by an
extremely high speed bus (at least 1 GB/sec).
Now let us take a brief look at
video server software. The CPUs are used for accepting user requests, locating
movies, moving data between devices, customer billing, and many other
functions. Some of these are not time critical, but many others are, so some,
if not all, of the CPUs will have to run a real-time operating system, such as a
real-time microkernel. These systems normally break work up into small tasks,
each with a known deadline. The scheduler can then run an algorithm such as
nearest deadline next or the rate monotonic algorithm (Liu and Layland, 1973).
The CPU software also defines the
nature of the interface that the server presents to the clients (spooling
servers and set-top boxes). Two designs are popular. The first one is a
traditional file system, in which the clients can open, read, write, and close
files. Other than the complications introduced by the storage hierarchy and
real-time considerations, such a server can have a file system modeled after
that of UNIX.
The second kind of interface is
based on the video recorder model. The commands to the server request it to
open, play, pause, fast forward, and rewind files. The difference with the UNIX
model is that once a PLAY command is given, the server just keeps pumping out
data at a constant rate, with no new commands required.
The heart of the video server
software is the disk management software. It has two main jobs: placing movies
on the magnetic disk when they have to be pulled up from optical or tape
storage, and handling disk requests for the many output streams. Movie
placement is important because it can greatly affect performance.
Two possible ways of organizing disk
storage are the disk farm and the disk array. With the disk farm, each drive
holds some number of entire movies. For performance and reliability reasons,
each movie should be present on at least two drives, maybe more. The other
storage organization is the disk array or RAID (Redundant Array of Inexpensive
Disks), in which each movie is spread out over multiple drives, for example,
block 0 on drive 0, block 1 on drive 1, and so on, with block n - 1 on drive n
- 1. After that, the cycle repeats, with block n on drive 0, and so forth. This
organization is called striping.
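The striping rule is just modular arithmetic. A tiny sketch, assuming fixed-size blocks and n identical drives:

    def striped_location(block_number, n_drives):
        """Where block i of a striped movie lives: (drive, offset within that drive).

        Block 0 goes on drive 0, block 1 on drive 1, ..., block n-1 on drive n-1,
        and then the cycle repeats."""
        return block_number % n_drives, block_number // n_drives

    # With 4 drives, consecutive blocks land on drives 0, 1, 2, 3, 0, 1, ...
    print([striped_location(i, 4) for i in range(6)])
    # [(0, 0), (1, 0), (2, 0), (3, 0), (0, 1), (1, 1)]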
A striped disk array has several
advantages over a disk farm. First, all n drives can be running in parallel, increasing
the performance by a factor of n. Second, it can be made redundant by adding an
extra drive to each group of n, where the redundant drive contains the
block-by-block exclusive OR of the other drives, to allow full data recovery in
the event one drive fails. Finally, the problem of load balancing is solved
(manual placement is not needed to avoid having all the popular movies on the
same drive). On the other hand, the disk array organization is more complicated
than the disk farm and highly sensitive to multiple failures. It is also
ill-suited to video recorder operations such as rewinding or fast forwarding a
movie.
The other job of the disk software
is to service all the real-time output streams and meet their timing
constraints. Only a few years ago, this required complex disk scheduling
algorithms, but with memory prices so low now, a much simpler approach is
beginning to be possible. For each stream being served, a buffer of, say, 10
sec worth of video (5 MB) is kept in RAM. It is filled by a disk process and
emptied by a network process. With 500 MB of RAM, 100 streams can be fed
directly from RAM. Of course, the disk subsystem must have a sustained data
rate of 50 MB/sec to keep the buffers full, but a RAID built from high-end SCSI
disks can handle this requirement easily.
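The numbers behind this buffering scheme are worth checking once. The fragment below simply redoes the arithmetic of the paragraph above for a 4-Mbps MPEG-2 stream:

    # Back-of-the-envelope check of the per-stream buffering scheme described above.
    STREAM_RATE_MBPS = 4                   # one MPEG-2 movie
    BUFFER_SECONDS = 10
    STREAMS = 100

    buffer_mb = STREAM_RATE_MBPS * BUFFER_SECONDS / 8      # 5 MB of RAM per stream
    total_ram_mb = buffer_mb * STREAMS                     # 500 MB for 100 streams
    disk_rate_mb_per_sec = STREAM_RATE_MBPS / 8 * STREAMS  # 50 MB/sec sustained from disk
    print(buffer_mb, total_ram_mb, disk_rate_mb_per_sec)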
The distribution network is the set
of switches and lines between the source and destination. As we saw in Fig. 7-78, it consists of a backbone, connected
to a local distribution network. Usually, the backbone is switched and the
local distribution network is not.
The main requirement imposed on the
backbone is high bandwidth. It used to be that low jitter was also a
requirement, but with even the smallest PC now able to buffer 10 sec of
high-quality MPEG-2 video, low jitter is not a requirement anymore.
Local distribution is highly chaotic,
with different companies trying out different networks in different regions.
Telephone companies, cable TV companies, and new entrants, such as power
companies, are all convinced that whoever gets there first will be the big
winner. Consequently, we are now seeing a proliferation of technologies being
installed. In Japan, some sewer companies are in the Internet business, arguing
that they have the biggest pipe of all into everyone's house (they run an
optical fiber through it, but have to be very careful about precisely where it
emerges). The four main local distribution schemes for video on demand go by
the acronyms ADSL, FTTC, FTTH, and HFC. We will now explain each of these in
turn.
ADSL is the telephone
industry's first entrant in the local distribution sweepstakes. The idea is that
virtually every house in the United States, Europe, and Japan already has a
copper twisted pair going into it (for analog telephone service). If these
wires could be used for video on demand, the telephone companies could clean
up.
The problem, of course, is that
these wires cannot support even MPEG-1 over their typical 10-km length, let
alone MPEG-2. High-resolution, full-color, full motion video needs 4–8 Mbps,
depending on the quality desired. ADSL is not really fast enough except for
very short local loops.
The second telephone company design
is FTTC (Fiber To The Curb). In FTTC, the telephone company runs optical fiber
from the end office into each residential neighborhood, terminating in a device
called an ONU (Optical Network Unit). On the order of 16 copper local loops can
terminate in an ONU. These loops are now so short that it is possible to run
full-duplex T1 or T2 over them, allowing MPEG-1 and MPEG-2 movies,
respectively. In addition, videoconferencing for home workers and small businesses
is now possible because FTTC is symmetric.
The third telephone company solution
is to run fiber into everyone's house. It is called FTTH (Fiber To The Home).
In this scheme, everyone can have an OC-1, OC-3, or even higher carrier if that
is required. FTTH is very expensive and will not happen for years but clearly
will open a vast range of new possibilities when it finally happens. In Fig. 7-63 we saw how everybody could operate his
or her own radio station. What do you think about each member of the family
operating his or her own personal television station? ADSL, FTTC, and FTTH are all
point-to-point local distribution networks, which is not surprising given how
the current telephone system is organized.
A completely different approach is HFC
(Hybrid Fiber Coax), which is the preferred solution currently being installed
by cable TV providers. It is illustrated in Fig. 2-47(a). The story goes something like this.
The current 300- to 450-MHz coax cables are being replaced by 750-MHz coax
cables, upgrading the capacity from 50–75 6-MHz channels to 125 6-MHz
channels. Seventy-five of the 125 channels will be used for transmitting analog
television.
The 50 new channels will each be
modulated using QAM-256, which provides about 40 Mbps per channel, giving a
total of 2 Gbps of new bandwidth. The headends will be moved deeper into the
neighborhoods so that each cable runs past only 500 houses. Simple division
shows that each house can then be allocated a dedicated 4-Mbps channel, which
can handle an MPEG-2 movie.
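The simple division goes like this; the fragment just restates the paragraph's arithmetic with the figures above taken at face value:

    # HFC capacity arithmetic from the paragraph above.
    new_channels = 50
    mbps_per_channel = 40            # QAM-256 in a 6-MHz channel
    houses_per_cable = 500

    total_new_mbps = new_channels * mbps_per_channel      # 2000 Mbps = 2 Gbps
    per_house_mbps = total_new_mbps / houses_per_cable    # 4 Mbps: one MPEG-2 movie
    print(total_new_mbps, per_house_mbps)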
While this sounds wonderful, it does
require the cable providers to replace all the existing cables with 750-MHz
coax, install new headends, and remove all the one-way amplifiers—in short,
replace the entire cable TV system. Consequently, the amount of new
infrastructure here is comparable to what the telephone companies need for
FTTC. In both cases the local network provider has to run fiber into
residential neighborhoods. Again, in both cases, the fiber terminates at an
optoelectrical converter. In FTTC, the final segment is a point-to-point local
loop using twisted pairs. In HFC, the final segment is a shared coaxial cable.
Technically, these two systems are not really as different as their respective
proponents often make out.
Nevertheless, there is one real
difference that is worth pointing out. HFC uses a shared medium without
switching and routing. Any information put onto the cable can be removed by any
subscriber without further ado. FTTC, which is fully switched, does not have
this property. As a result, HFC operators want video servers to send out
encrypted streams so customers who have not paid for a movie cannot see it.
FTTC operators do not especially want encryption because it adds complexity,
lowers performance, and provides no additional security in their system. From
the point of view of the company running a video server, is it a good idea to
encrypt or not? A server operated by a telephone company or one of its
subsidiaries or partners might intentionally decide not to encrypt its videos,
claiming efficiency as the reason but really to cause economic losses to its
HFC competitors.
For all these local distribution
networks, it is possible that each neighborhood will be outfitted with one or
more spooling servers. These are, in fact, just smaller versions of the video
servers we discussed above. The big advantage of these local servers is that
they move some load off the backbone.
They can be preloaded with movies by
reservation. If people tell the provider which movies they want well in
advance, they can be downloaded to the local server during off-peak hours. This
observation is likely to lead the network operators to lure away airline
executives to do their pricing. One can envision tariffs in which movies
ordered 24 to 72 hours in advance for viewing on a Tuesday or Thursday evening
before 6 P.M. or after 11 P.M. get a 27 percent discount. Movies ordered on the
first Sunday of the month before 8 A.M. for viewing on a Wednesday afternoon on
a day whose date is a prime number get a 43 percent discount, and so on.