7.3 The World Wide Web
The World Wide Web is an
architectural framework for accessing linked documents spread out over millions
of machines all over the Internet. In 10 years, it went from being a way to
distribute high-energy physics data to the application that millions of people
think of as being ''The Internet.'' Its enormous popularity stems from the fact
that it has a colorful graphical interface that is easy for beginners to use,
and it provides an enormous wealth of information on almost every conceivable
subject, from aardvarks to Zulus.
The Web (also known as WWW) began in
1989 at CERN, the European center for nuclear research. CERN has several
accelerators at which large teams of scientists from the participating European
countries carry out research in particle physics. These teams often have
members from half a dozen or more countries. Most experiments are highly
complex and require years of advance planning and equipment construction. The
Web grew out of the need to have these large teams of internationally dispersed
researchers collaborate using a constantly changing collection of reports,
blueprints, drawings, photos, and other documents.
The initial proposal for a web of
linked documents came from CERN physicist Tim Berners-Lee in March 1989. The
first (text-based) prototype was operational 18 months later. In December 1991,
a public demonstration was given at the Hypertext '91 conference in San
Antonio, Texas.
This demonstration and its attendant
publicity caught the attention of other researchers, which led Marc Andreessen
at the University of Illinois to start developing the first graphical browser,
Mosaic. It was released in February 1993. Mosaic was so popular that a year
later, Andreessen left to form a company, Netscape Communications Corp., whose
goal was to develop clients, servers, and other Web software. When Netscape went
public in 1995, investors, apparently thinking this was the next Microsoft,
paid $1.5 billion for the stock. This record was all the more surprising
because the company had only one product, was operating deeply in the red, and
had announced in its prospectus that it did not expect to make a profit for the
foreseeable future. For the next three years, Netscape Navigator and
Microsoft's Internet Explorer engaged in a ''browser war,'' each one trying
frantically to add more features (and thus more bugs) than the other one. In
1998, America Online bought Netscape Communications Corp. for $4.2 billion,
thus ending Netscape's brief life as an independent company.
In 1994, CERN and M.I.T. signed an
agreement setting up the World Wide Web Consortium (sometimes abbreviated as W3C),
an organization devoted to further developing the Web, standardizing protocols,
and encouraging interoperability between sites. Berners-Lee became the
director. Since then, several hundred universities and companies have joined
the consortium. Although there are now more books about the Web than you can
shake a stick at, the best place to get up-to-date information about the Web is
(naturally) on the Web itself. The consortium's home page is at www.w3.org.
Interested readers are referred there for links to pages covering all of the
consortium's numerous documents and activities.
From the users' point of view, the
Web consists of a vast, worldwide collection of documents or Web pages, often
just called pages for short. Each page may contain links to other pages
anywhere in the world. Users can follow a link by clicking on it, which then
takes them to the page pointed to. This process can be repeated indefinitely.
The idea of having one page point to another, now called hypertext, was
invented by a visionary M.I.T. professor of electrical engineering, Vannevar
Bush, in 1945, long before the Internet was invented.
Pages are viewed with a program
called a browser, of which Internet Explorer and Netscape Navigator are two
popular ones. The browser fetches the page requested, interprets the text and
formatting commands on it, and displays the page, properly formatted, on the
screen. An example is given in Fig. 7-18(a). Like many Web pages, this one
starts with a title, contains some information, and ends with the e-mail
address of the page's maintainer. Strings of text that are links to other
pages, called hyperlinks, are often highlighted, by underlining, displaying
them in a special color, or both. To follow a link, the user places the mouse
cursor on the highlighted area, which causes the cursor to change, and clicks
on it. Although nongraphical browsers, such as Lynx, exist, they are not as
popular as graphical browsers, so we will concentrate on the latter.
Voice-based browsers are also being developed.
Users who are curious about the
Department of Animal Psychology can learn more about it by clicking on its
(underlined) name. The browser then fetches the page to which the name is
linked and displays it, as shown in Fig. 7-18(b). The underlined items here can also
be clicked on to fetch other pages, and so on. The new page can be on the same
machine as the first one or on a machine halfway around the globe. The user
cannot tell. Page fetching is done by the browser, without any help from the
user. If the user ever returns to the main page, the links that have already
been followed may be shown with a dotted underline (and possibly a different
color) to distinguish them from links that have not been followed. Note that
clicking on the Campus Information line in the main page does nothing. It is
not underlined, which means that it is just text and is not linked to another
page.
The basic model of how the Web works
is shown in Fig. 7-19. Here the browser is displaying a Web
page on the client machine. When the user clicks on a line of text that is
linked to a page on the abcd.com server, the browser follows the hyperlink by
sending a message to the abcd.com server asking it for the page. When the page
arrives, it is displayed. If this page contains a hyperlink to a page on the xyz.com
server that is clicked on, the browser then sends a request to that machine for
the page, and so on indefinitely.
Let us now examine the client side
of Fig. 7-19 in more detail. In essence, a browser
is a program that can display a Web page and catch mouse clicks to items on the
displayed page. When an item is selected, the browser follows the hyperlink and
fetches the page selected. Therefore, the embedded hyperlink needs a way to
name any other page on the Web. Pages are named using URLs (Uniform Resource
Locators). A typical URL is
http://www.abcd.com/products.html
We will explain URLs later in this
chapter. For the moment, it is sufficient to know that a URL has three parts:
the name of the protocol (http), the DNS name of the machine where the page is
located (www.abcd.com), and (usually) the name of the file containing the page
(products.html).
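To make the three parts concrete, here is a minimal Python sketch (illustrative only, not part of any browser) that splits the example URL using the standard urlparse function. The comments show which part of the URL each field corresponds to.

    from urllib.parse import urlparse

    # Split a URL into the three parts described above: the protocol,
    # the DNS name of the machine, and the name of the file on it.
    url = "http://www.abcd.com/products.html"
    parts = urlparse(url)

    print(parts.scheme)   # 'http'            -- the protocol
    print(parts.netloc)   # 'www.abcd.com'    -- the DNS name of the machine
    print(parts.path)     # '/products.html'  -- the file containing the page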
When a user clicks on a hyperlink,
the browser carries out a series of steps in order to fetch the page pointed
to. Suppose that a user is browsing the Web and finds a link on Internet
telephony that points to ITU's home page, which is http://www.itu.org/home/index.html.
Let us trace the steps that occur when this link is selected.
- The browser determines the URL (by seeing what was selected).
- The browser asks DNS for the IP address of www.itu.org.
- DNS replies with 156.106.192.32.
- The browser makes a TCP connection to port 80 on 156.106.192.32.
- It then sends over a request asking for file /home/index.html.
- The www.itu.org server sends the file /home/index.html.
- The TCP connection is released.
- The browser displays all the text in /home/index.html.
- The browser fetches and displays all images in this file.
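The middle steps of this list can be sketched in a few lines of Python using only the standard socket library. This is a rough illustration under the assumption that the example URL is still reachable; interpreting the HTML and fetching the images are the browser's job and are not shown.

    import socket

    host = "www.itu.org"           # DNS name taken from the example URL
    path = "/home/index.html"

    # Ask DNS for the IP address of the server.
    ip_address = socket.gethostbyname(host)

    # Make a TCP connection to port 80 on that address.
    sock = socket.create_connection((ip_address, 80))

    # Send a request asking for the file (HTTP/1.0 keeps the sketch simple:
    # the server closes the connection when the file has been sent).
    request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
    sock.sendall(request.encode("ascii"))

    # Receive the file the server sends back.
    reply = b""
    while chunk := sock.recv(4096):
        reply += chunk

    # The TCP connection is released.
    sock.close()

    print(reply[:200])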
Many browsers display which step
they are currently executing in a status line at the bottom of the screen. In
this way, when the performance is poor, the user can see if it is due to DNS
not responding, the server not responding, or simply network congestion during
page transmission.
To be able to display the new page
(or any page), the browser has to understand its format. To allow all browsers
to understand all Web pages, every Web page is written in a standardized
page-description language called HTML (HyperText Markup Language).
Although a browser is basically an
HTML interpreter, most browsers have numerous buttons and features to make it
easier to navigate the Web. Most have a button for going back to the previous
page, a button for going forward to the next page (only operative after the
user has gone back from it), and a button for going straight to the user's own
start page. Most browsers have a button or menu item to set a bookmark on a
given page and another one to display the list of bookmarks, making it possible
to revisit any of them with only a few mouse clicks. Pages can also be saved to
disk or printed. Numerous options are generally available for controlling the
screen layout and setting various user preferences.
In addition to having ordinary text
(not underlined) and hypertext (underlined), Web pages can also contain icons,
line drawings, maps, and photographs. Each of these can (optionally) be linked
to another page. Clicking on one of these elements causes the browser to fetch
the linked page and display it on the screen, the same as clicking on text.
With images such as photos and maps, which page is fetched next may depend on
what part of the image was clicked on.
Not all pages contain HTML. A page
may consist of a formatted document in PDF format, an icon in GIF format, a
photograph in JPEG format, a song in MP3 format, a video in MPEG format, or any
one of hundreds of other file types. Since standard HTML pages may link to any
of these, the browser has a problem when it encounters a page it cannot
interpret.
Rather than making the browsers
larger and larger by building in interpreters for a rapidly growing collection
of file types, most browsers have chosen a more general solution. When a server
returns a page, it also returns some additional information about the page.
This information includes the MIME type of the page (see Fig. 7-12). Pages of type text/html are just
displayed directly, as are pages in a few other built-in types. If the MIME
type is not one of the built-in ones, the browser consults its table of MIME
types to tell it how to display the page. This table associates a MIME type
with a viewer.
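The table can be pictured as a simple mapping from MIME type names to viewers, as in the sketch below. The type names are real MIME types; the viewer names are placeholders invented for illustration, not any particular browser's configuration.

    # A sketch of the browser's table associating MIME types with viewers.
    # Types marked "built-in" are displayed by the browser itself.
    mime_table = {
        "text/html":        "built-in",
        "image/jpeg":       "built-in",
        "image/gif":        "built-in",
        "application/pdf":  "pdf-viewer-plugin",
        "audio/mp3":        "media-player-helper",
        "video/mpeg":       "media-player-helper",
    }

    def choose_viewer(mime_type):
        # Consult the table; for an unknown type the browser might ask the
        # user to pick a helper or simply save the file to disk.
        return mime_table.get(mime_type, "ask-the-user")

    print(choose_viewer("application/pdf"))    # pdf-viewer-plugin
    print(choose_viewer("application/x-foo"))  # ask-the-user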
There are two possibilities:
plug-ins and helper applications. A plug-in is a code module that the browser
fetches from a special directory on the disk and installs as an extension to
itself, as illustrated in Fig. 7-20(a). Because plug-ins run inside the
browser, they have access to the current page and can modify its appearance.
After the plug-in has done its job (usually after the user has moved to a
different Web page), the plug-in is removed from the browser's memory.
Each browser has a set of procedures
that all plug-ins must implement so the browser can call the plug-in. For
example, there is typically a procedure the browser's base code calls to supply
the plug-in with data to display. This set of procedures is the plug-in's
interface and is browser specific.
In addition, the browser makes a set
of its own procedures available to the plug-in, to provide services to
plug-ins. Typical procedures in the browser interface are for allocating and
freeing memory, displaying a message on the browser's status line, and querying
the browser about parameters.
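In outline, the two interfaces might look like the following sketch. The class and procedure names are invented for illustration; every real browser defines its own, mutually incompatible, set of calls.

    class BrowserServices:
        """Procedures the browser makes available to its plug-ins (a sketch)."""

        def allocate_memory(self, nbytes):
            return bytearray(nbytes)

        def set_status_line(self, text):
            print(f"[status] {text}")

        def get_parameter(self, name):
            # e.g., the size of the rectangle the plug-in may draw in
            return {"width": 640, "height": 480}.get(name)

    class Plugin:
        """Procedures every plug-in must implement so the browser can call it."""

        def initialize(self, browser: BrowserServices):
            self.browser = browser

        def handle_data(self, data: bytes):
            # Called by the browser's base code to supply data to display.
            raise NotImplementedError

        def shutdown(self):
            # Called when the user moves to a different page and the plug-in
            # is removed from the browser's memory.
            pass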
Before a plug-in can be used, it
must be installed. The usual installation procedure is for the user to go to
the plug-in's Web site and download an installation file. On Windows, this is
typically a self-extracting zip file with extension .exe. When the zip file is
double clicked, a little program attached to the front of the zip file is
executed. This program unzips the plug-in and copies it to the browser's
plug-in directory. Then it makes the appropriate calls to register the
plug-in's MIME type and to associate the plug-in with it. On UNIX, the
installer is often a shell script that handles the copying and registration.
The other way to extend a browser is
to use a helper application. This is a complete program, running as a separate
process. It is illustrated in Fig. 7-20(b). Since the helper is a separate
program, it offers no interface to the browser and makes no use of browser
services. Instead, it usually just accepts the name of a scratch file where the
content file has been stored, opens the file, and displays the contents.
Typically, helpers are large programs that exist independently of the browser,
such as Adobe's Acrobat Reader for displaying PDF files or Microsoft Word. Some
programs (such as Acrobat) have a plug-in that invokes the helper itself.
Many helper applications use the
MIME type application. A considerable number of subtypes have been defined, for
example, application/pdf for PDF files and application/msword for Word files.
In this way, a URL can point directly to a PDF or Word file, and when the user
clicks on it, Acrobat or Word is automatically started and handed the name of a
scratch file containing the content to be displayed. Consequently, browsers can
be configured to handle a virtually unlimited number of document types with no
changes to the browser. Modern Web servers are often configured with hundreds
of type/subtype combinations, and new ones are added every time a new
program is installed.
Helper applications are not
restricted to using the application MIME type. Adobe Photoshop uses image/x-photoshop
and RealOne Player is capable of handling audio/mp3, for example.
On Windows, when a program is
installed on the computer, it registers the MIME types it wants to handle. This
mechanism leads to conflict when multiple viewers are available for some
subtype, such as video/mpg. What happens is that the last program to register
overwrites existing (MIME type, helper application) associations, capturing the
type for itself. As a consequence, installing a new program may change the way
a browser handles existing types.
On UNIX, this registration process
is generally not automatic. The user must manually update certain configuration
files. This approach leads to more work but fewer surprises.
Browsers can also open local files,
rather than fetching them from remote Web servers. Since local files do not
have MIME types, the browser needs some way to determine which plug-in or
helper to use for types other than its built-in types such as text/html and image/jpeg.
To handle local files, helpers can be associated with a file extension as well
as with a MIME type. With the standard configuration, opening foo.pdf will open
it in Acrobat and opening bar.doc will open it in Word. Some browsers use the
MIME type, the file extension, and even information taken from the file itself
to guess the MIME type. In particular, Internet Explorer relies more heavily on
the file extension than on the MIME type when it can.
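Python's standard mimetypes module performs exactly this extension-to-type mapping, as the short sketch below shows.

    import mimetypes

    # Guess the MIME type of a local file from its extension, as a browser
    # must do when a local file carries no MIME type of its own.
    for name in ["foo.pdf", "bar.doc", "photo.jpeg", "unknown.xyz"]:
        mime_type, _ = mimetypes.guess_type(name)
        print(f"{name:12s} -> {mime_type}")
    # foo.pdf      -> application/pdf
    # bar.doc      -> application/msword
    # photo.jpeg   -> image/jpeg
    # unknown.xyz  -> None  (the browser would have to guess some other way)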
Here, too, conflicts can arise since
many programs are willing, in fact, eager, to handle, say, .mpg. During
installation, programs intended for professionals often display checkboxes for
the MIME types and extensions they are prepared to handle to allow the user to
select the appropriate ones and thus not overwrite existing associations by
accident. Programs aimed at the consumer market assume that the user does not
have a clue what a MIME type is and simply grab everything they can without
regard to what previously installed programs have done.
The ability to extend the browser
with a large number of new types is convenient but can also lead to trouble.
When Internet Explorer fetches a file with extension exe, it realizes that this
file is an executable program and therefore has no helper. The obvious action
is to run the program. However, this could be an enormous security hole. All a
malicious Web site has to do is produce a Web page with pictures of, say, movie
stars or sports heroes, all of which are linked to a virus. A single click on a
picture then causes an unknown and potentially hostile executable program to be
fetched and run on the user's machine. To prevent unwanted guests like this,
Internet Explorer can be configured to be selective about running unknown
programs automatically, but not all users understand how to manage the
configuration.
On UNIX an analogous problem can
exist with shell scripts, but that requires the user to consciously install the
shell as a helper. Fortunately, this installation is sufficiently complicated
that nobody could possibly do it by accident (and few people can even do it
intentionally).
So much for the client side. Now let
us take a look at the server side. As we saw above, when the user types in a
URL or clicks on a line of hypertext, the browser parses the URL and interprets
the part between http:// and the next slash as a DNS name to look up. Armed
with the IP address of the server, the browser establishes a TCP connection to
port 80 on that server. Then it sends over a command containing the rest of the
URL, which is the name of a file on that server. The server then returns the
file for the browser to display.
To a first approximation, a Web
server is similar to the server of Fig. 6-6. That server, like a real Web server, is
given the name of a file to look up and return. In both cases, the steps that
the server performs in its main loop are:
- Accept a TCP connection from a client (a browser).
- Get the name of the file requested.
- Get the file (from disk).
- Return the file to the client.
- Release the TCP connection.
Modern Web servers have more
features, but in essence, this is what a Web server does.
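A minimal sketch of this loop in Python is given below. It serves files from the current directory on port 8080 (rather than the privileged port 80) and leaves out the error handling and security checks a real server performs.

    import socket

    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("", 8080))
    listener.listen(5)

    while True:
        conn, addr = listener.accept()              # accept a TCP connection
        request = conn.recv(4096).decode("ascii")
        filename = request.split()[1].lstrip("/")   # get the name of the file requested
        with open(filename, "rb") as f:             # get the file from disk
            data = f.read()
        conn.sendall(b"HTTP/1.0 200 OK\r\n\r\n" + data)  # return the file to the client
        conn.close()                                # release the TCP connection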
A problem with this design is that
every request requires making a disk access to get the file. The result is that
the Web server cannot serve more requests per second than it can make disk
accesses. A high-end SCSI disk has an average access time of around 5 msec,
which limits the server to at most 200 requests/sec, less if large files have
to be read often. For a major Web site, this figure is too low.
One obvious improvement (used by all
Web servers) is to maintain a cache in memory of the n most recently used
files. Before going to disk to get a file, the server checks the cache. If the
file is there, it can be served directly from memory, thus eliminating the disk
access. Although effective caching requires a large amount of main memory and
some extra processing time to check the cache and manage its contents, the
savings in time are nearly always worth the overhead and expense.
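Such a cache can be sketched in a few lines using a least recently used (LRU) replacement policy, as below. Real servers typically bound the cache by total bytes rather than by a fixed number of files; the parameter n here is just for illustration.

    from collections import OrderedDict

    class FileCache:
        """Keep the n most recently used files in memory (a simple LRU cache)."""

        def __init__(self, n):
            self.n = n
            self.files = OrderedDict()   # filename -> contents

        def get(self, filename):
            if filename in self.files:
                self.files.move_to_end(filename)   # mark as most recently used
                return self.files[filename]        # served from memory: no disk access
            with open(filename, "rb") as f:        # cache miss: go to disk
                data = f.read()
            self.files[filename] = data
            if len(self.files) > self.n:
                self.files.popitem(last=False)     # evict the least recently used file
            return data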
The next step for building a faster
server is to make the server multithreaded. In one design, the server consists
of a front-end module that accepts all incoming requests and k processing
modules, as shown in Fig. 7-21. The k + 1 threads all belong to the
same process so the processing modules all have access to the cache within the
process' address space. When a request comes in, the front end accepts it and
builds a short record describing it. It then hands the record to one of the
processing modules. In another possible design, the front end is eliminated and
each processing module tries to acquire its own requests, but a locking
protocol is then required to prevent conflicts.
The processing module first checks
the cache to see if the file needed is there. If so, the record is updated to
include a pointer to the cached copy. If it is not there, the processing
module starts a disk operation to read it into the cache (possibly discarding
some other cached files to make room for it). When the file comes in from the
disk, it is put in the cache and also sent back to the client.
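The structure of this design can be sketched as follows: one front-end thread accepts connections and hands short request records to k worker threads, all sharing one in-process cache. The port number and the default value of k are arbitrary choices for illustration.

    import queue
    import socket
    import threading

    cache = {}                       # shared by all processing modules (same process)
    cache_lock = threading.Lock()
    work = queue.Queue()             # request records handed over by the front end

    def processing_module():
        while True:
            conn, filename = work.get()             # a short record describing the request
            with cache_lock:
                data = cache.get(filename)
            if data is None:                        # not cached: a blocking disk read,
                with open(filename, "rb") as f:     # but only this thread blocks
                    data = f.read()
                with cache_lock:
                    cache[filename] = data
            conn.sendall(b"HTTP/1.0 200 OK\r\n\r\n" + data)
            conn.close()

    def front_end(k=8):
        for _ in range(k):                          # start the k processing modules
            threading.Thread(target=processing_module, daemon=True).start()
        listener = socket.socket()
        listener.bind(("", 8080))
        listener.listen(64)
        while True:                                 # accept all incoming requests
            conn, addr = listener.accept()
            request = conn.recv(4096).decode("ascii")
            filename = request.split()[1].lstrip("/")
            work.put((conn, filename))              # hand the record to a module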
The advantage of this scheme is that
while one or more processing modules are blocked waiting for a disk operation
to complete (and thus consuming no CPU time), other modules can be actively
working on other requests. Of course, to get any real improvement over the
single-threaded model, it is necessary to have multiple disks, so more than one
disk can be busy at the same time. With k processing modules and k disks, the
throughput can be as much as k times higher than with a single-threaded server
and one disk.
In theory, a single-threaded server
and k disks could also gain a factor of k, but the code and administration are
far more complicated since normal blocking READ system calls cannot be used to
access the disk. With a multithreaded server, they can be used since then a READ
blocks only the thread that made the call, not the entire process.
Modern Web servers do more than just
accept file names and return files. In fact, the actual processing of each
request can get quite complicated. For this reason, in many servers each processing
module performs a series of steps. The front end passes each incoming request
to the first available module, which then carries it out using some subset of
the following steps, depending on which ones are needed for that particular
request.
- Resolve the name of the Web page requested.
- Authenticate the client.
- Perform access control on the client.
- Perform access control on the Web page.
- Check the cache.
- Fetch the requested page from disk.
- Determine the MIME type to include in the response.
- Take care of miscellaneous odds and ends.
- Return the reply to the client.
- Make an entry in the server log.
Step 1 is needed because the
incoming request may not contain the actual name of the file as a literal
string. For example, consider the URL http://www.cs.vu.nl, which has an empty
file name. It has to be expanded to some default file name. Also, modern
browsers can specify the user's default language (e.g., Italian or English),
which makes it possible for the server to select a Web page in that language,
if available. In general, name expansion is not quite so trivial as it might at
first appear, due to a variety of conventions about file naming.
Step 2 consists of verifying the
client's identity. This step is needed for pages that are not available to the
general public.
Step 3 checks to see if there are
restrictions on whether the request may be satisfied given the client's
identity and location. Step 4 checks to see if there are any access
restrictions associated with the page itself. If a certain file (e.g., .htaccess)
is present in the directory where the desired page is located, it may restrict
access to the file to particular domains, for example, only users from inside
the company.
Steps 5 and 6 involve getting the
page. Step 6 needs to be able to handle multiple disk reads at the same time.
Step 7 is about determining the MIME
type from the file extension, first few words of the file, a configuration
file, and possibly other sources. Step 8 is for a variety of miscellaneous
tasks, such as building a user profile or gathering certain statistics.
Step 9 is where the result is sent
back and step 10 makes an entry in the system log for administrative purposes.
Such logs can later be mined for valuable information about user behavior, for
example, the order in which people access the pages.
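As a rough illustration, the ten steps might be strung together as in the sketch below. Every check is a deliberately trivial placeholder; the point is only the order of the steps and the division of labor among them.

    BLOCKED_CLIENTS = set()                      # placeholder access-control data
    RESTRICTED_PAGES = {"payroll.html"}

    def handle_request(filename, user, cache, disk, log):
        if filename == "":                       # 1. resolve the page name: an empty
            filename = "index.html"              #    name expands to a default file
        if user is None:                         # 2. authenticate the client
            return "401 Unauthorized"
        if user in BLOCKED_CLIENTS:              # 3. access control on the client
            return "403 Forbidden"
        if filename in RESTRICTED_PAGES:         # 4. access control on the page
            return "403 Forbidden"
        data = cache.get(filename)               # 5. check the cache
        if data is None:
            data = disk[filename]                # 6. fetch the page from "disk"
            cache[filename] = data
        mime = ("text/html" if filename.endswith(".html")
                else "application/octet-stream") # 7. determine the MIME type
        # 8. odds and ends (statistics, user profiles) omitted in this sketch
        reply = f"200 OK ({mime})\r\n\r\n{data}" # 9. build the reply for the client
        log.append((user, filename))             # 10. make an entry in the server log
        return reply

    # Example: a tiny in-memory "disk" stands in for real files.
    disk = {"index.html": "<html>Welcome</html>"}
    print(handle_request("", "alice", {}, disk, []))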
If too many requests come in each
second, the CPU will not be able to handle the processing load, no matter how
many disks are used in parallel. The solution is to add more nodes (computers),
possibly with replicated disks to avoid having the disks become the next
bottleneck. This leads to the server farm model of Fig. 7-22. A front end still accepts incoming
requests but sprays them over multiple CPUs rather than multiple threads to
reduce the load on each computer. The individual machines may themselves be
multithreaded and pipelined as before.
One problem with server farms is
that there is no longer a shared cache because each processing node has its own
memory—unless an expensive shared-memory multiprocessor is used. One way to
counter this performance loss is to have a front end keep track of where it
sends each request and send subsequent requests for the same page to the same
node. Doing this makes each node a specialist in certain pages so that cache
space is not wasted by having every file in every cache.
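A minimal sketch of such a front end is given below. The text describes a front end that remembers where it sent each request; hashing the page name, as done here, achieves the same sticky behavior without keeping an explicit table. The node names are placeholders.

    import hashlib

    NODES = ["node0", "node1", "node2", "node3"]   # the processing nodes in the farm

    def pick_node(filename):
        # Route all requests for the same page to the same node, so each
        # node's cache specializes in a subset of the pages.
        h = int(hashlib.md5(filename.encode()).hexdigest(), 16)
        return NODES[h % len(NODES)]

    print(pick_node("/products.html"))   # always the same node for this page
    print(pick_node("/index.html"))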
Another problem with server farms is
that the client's TCP connection terminates at the front end, so the reply must
go through the front end. This situation is depicted in Fig. 7-23(a), where the incoming request (1) and
outgoing reply (4) both pass through the front end. Sometimes a trick, called TCP
handoff, is used to get around this problem. With this trick, the TCP end point
is passed to the processing node so it can reply directly to the client, shown
as (3) in Fig. 7-23(b). This handoff is done in a way that
is transparent to the client.