URLs—Uniform Resource Locators
We have repeatedly said that Web
pages may contain pointers to other Web pages. Now it is time to see in a bit
more detail how these pointers are implemented. When the Web was first created,
it was immediately apparent that having one page point to another Web page
required mechanisms for naming and locating pages. In particular, three
questions had to be answered before a selected page could be displayed:
- What is the page called?
- Where is the page located?
- How can the page be accessed?
If every page were somehow assigned
a unique name, there would not be any ambiguity in identifying pages.
Nevertheless, the problem would not be solved. Consider a parallel between
people and pages. In the United States, almost everyone has a social security
number, which is a unique identifier, as no two people are supposed to have the
same one. Nevertheless, if you are armed only with a social security number,
there is no way to find the owner's address, and certainly no way to tell
whether you should write to the person in English, Spanish, or Chinese. The Web
has basically the same problems.
The solution chosen identifies pages
in a way that solves all three problems at once. Each page is assigned a URL (Uniform
Resource Locator) that effectively serves as the page's worldwide name. URLs
have three parts: the protocol (also known as the scheme), the DNS name of the
machine on which the page is located, and a local name uniquely indicating the
specific page (usually just a file name on the machine where it resides). As an
example, the Web site for the author's department contains several
videos about the university and the city of Amsterdam. The URL for the video
page is
http://www.cs.vu.nl/video/index-en.html
This URL consists of three parts:
the protocol (http), the DNS name of the host (www.cs.vu.nl), and the file name
(video/index-en.html), with certain punctuation separating the pieces. The file
name is a path relative to the default Web directory at cs.vu.nl.
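For readers who want to experiment, this split can be reproduced with Python's standard urllib.parse module; the short sketch below simply pulls the three parts out of the example URL.

from urllib.parse import urlparse

# Split the example URL into the three parts described above:
# the protocol (scheme), the DNS name of the host, and the file name (path).
url = "http://www.cs.vu.nl/video/index-en.html"
parts = urlparse(url)

print(parts.scheme)   # http
print(parts.netloc)   # www.cs.vu.nl
print(parts.path)     # /video/index-en.html (relative to the default Web directory)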
Many sites have built-in shortcuts
for file names. At many sites, a null file name defaults to the organization's
main home page. Typically, when the file named is a directory, this implies a
file named index.html. Finally, ~user/ might be mapped onto user's WWW
directory, and then onto the file index.html in that directory. Thus, the
author's home page can be reached at
http://www.cs.vu.nl/~ast/
even though the actual file name is index.html
in a certain default directory.
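These shortcuts are conventions configured on the server rather than part of the URL itself. The sketch below shows one way a server might apply them; the directory names (/var/www as the default Web directory, public_html as the per-user WWW directory) are assumptions chosen for illustration.

import os

DOCUMENT_ROOT = "/var/www"      # assumed default Web directory
USER_WEB_DIR = "public_html"    # assumed name of a user's WWW directory

def resolve(url_path):
    # Map the path part of a URL onto a file, applying the common shortcuts.
    if url_path.startswith("/~"):                      # ~user/ -> user's WWW directory
        user, _, rest = url_path[2:].partition("/")
        full = os.path.join("/home", user, USER_WEB_DIR, rest)
    else:                                              # otherwise relative to the Web root
        full = os.path.join(DOCUMENT_ROOT, url_path.lstrip("/"))
    if full.endswith("/") or os.path.isdir(full):      # a directory implies index.html
        full = os.path.join(full, "index.html")
    return full

print(resolve("/~ast/"))   # /home/ast/public_html/index.html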
Now we can see how hypertext works.
To make a piece of text clickable, the page writer must provide two items of
information: the clickable text to be displayed and the URL of the page to go
to if the text is selected.
When the text is selected, the
browser looks up the host name using DNS. Once it knows the host's IP address,
the browser establishes a TCP connection to the host. Over that connection, it
sends the file name using the specified protocol. Bingo. Back comes the page.
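Those steps can be carried out by hand with a few lines of Python: a DNS lookup, a TCP connection, and a GET request in the HTTP protocol. The sketch assumes the server still answers plain HTTP on port 80.

import socket

host = "www.cs.vu.nl"
path = "/video/index-en.html"

ip_address = socket.gethostbyname(host)             # DNS lookup
sock = socket.create_connection((ip_address, 80))   # TCP connection to port 80

# Ask for the file using the HTTP protocol.
request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
sock.sendall(request.encode("ascii"))

response = b""
while chunk := sock.recv(4096):                      # read until the server closes
    response += chunk
sock.close()

print(response.decode("iso-8859-1")[:200])           # status line, headers, start of page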
This URL scheme is open-ended in the
sense that it is straightforward to have browsers use multiple protocols to get
at different kinds of resources. In fact, URLs for various other common
protocols have been defined. Slightly simplified forms of the more common ones
are listed in Fig. 7-24.
Let us briefly go over the list. The
http protocol is the Web's native language, the one spoken by Web servers. HTTP
stands for HyperText Transfer Protocol.
The ftp protocol is used to access
files by FTP, the Internet's file transfer protocol. FTP has been around for more
than two decades and is well entrenched. Numerous FTP servers all over the
world allow people anywhere on the Internet to log in and download whatever
files have been placed on the FTP server. The Web does not change this; it just
makes obtaining files by FTP easier, as FTP has a somewhat arcane interface
(although it is more powerful than HTTP; for example, it allows a user on machine A
to transfer a file from machine B to machine C).
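As a rough illustration, the sketch below downloads a file over FTP using Python's standard ftplib module, much as a browser would for an ftp:// URL; the host and file name are placeholders.

from ftplib import FTP

with FTP("ftp.example.com") as ftp:                  # placeholder FTP server
    ftp.login()                                      # anonymous login
    with open("README", "wb") as out:
        ftp.retrbinary("RETR README", out.write)     # fetch the file and save it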
It is possible to access a local
file as a Web page, either by using the file protocol, or more simply, by just
naming it. This approach is similar to using FTP but does not require having a
server. Of course, it works only for local files, not remote ones.
Long before there was an Internet,
there was the USENET news system. It consists of about 30,000 newsgroups in
which millions of people discuss a wide variety of topics by posting and
reading articles related to the topic of the newsgroup. The news protocol can
be used to call up a news article as though it were a Web page. This means that
a Web browser is simultaneously a news reader. In fact, many browsers have
buttons or menu items to make reading USENET news even easier than using
standard news readers.
Two formats are supported for the news
protocol. The first format specifies a newsgroup and can be used to get a list
of articles from a preconfigured news site. The second one requires the
identifier of a specific news article to be given, in this case AA0134223112@cs.utah.edu.
The browser then fetches the given article from its preconfigured news site
using the NNTP (Network News Transfer Protocol). We will not study NNTP in this
book, but it is loosely based on SMTP and has a similar style.
The gopher protocol was used by the
Gopher system, which was designed at the University of Minnesota and named
after the school's athletic teams, the Golden Gophers (as well as being a slang
expression meaning ''go for'', i.e., go fetch). Gopher predates the Web by
several years. It was an information retrieval scheme, conceptually similar to
the Web itself, but supporting only text and no images. It is essentially
obsolete now and rarely used any more.
The last two protocols do not really
have the flavor of fetching Web pages, but are useful anyway. The mailto
protocol allows users to send e-mail from a Web browser. The way to do this is
to click on the OPEN button and specify a URL consisting of mailto: followed by
the recipient's e-mail address. Most browsers will respond by starting an
e-mail program with the address and some of the header fields already filled
in.
The telnet protocol is used to
establish an on-line connection to a remote machine. It is used the same way as
the telnet program, which is not surprising, since most browsers just call the
telnet program as a helper application.
In short, URLs have been designed not only to allow users to navigate
the Web, but also to deal with FTP,
news, Gopher, e-mail, and telnet as well, making all the specialized user
interface programs for those other services unnecessary and thus integrating
nearly all Internet access into a single program, the Web browser. If it were
not for the fact that this idea was thought of by a physics researcher, it
could easily pass for the output of some software company's advertising
department.
Despite all these nice properties,
the growing use of the Web has turned up an inherent weakness in the URL
scheme. A URL points to one specific host. For pages that are heavily
referenced, it is desirable to have multiple copies far apart, to reduce the
network traffic. The trouble is that URLs do not provide any way to reference a
page without simultaneously telling where it is. There is no way to say: I want
page xyz, but I do not care where you get it. To solve this problem and make it
possible to replicate pages, the IETF is working on a system of URNs (Uniform
Resource Names). A URN can be thought of as a generalized URL. This topic is
still the subject of research, although a proposed syntax is given in RFC 2141.
As we have seen repeatedly, the Web
is basically stateless. There is no concept of a login session. The browser
sends a request to a server and gets back a file. Then the server forgets that
it has ever seen that particular client.
At first, when the Web was just used
for retrieving publicly available documents, this model was perfectly adequate.
But as the Web started to acquire other functions, it caused problems. For
example, some Web sites require clients to register (and possibly pay money) to
use them. This raises the question of how servers can distinguish between
requests from registered users and everyone else. A second example is from
e-commerce. If a user wanders around an electronic store, tossing items into
her shopping cart from time to time, how does the server keep track of the
contents of the cart? A third example is customized Web portals such as Yahoo.
Users can set up a detailed initial page with only the information they want
(e.g., their stocks and their favorite sports teams), but how can the server
display the correct page if it does not know who the user is?
At first glance, one might think
that servers could track users by observing their IP addresses. However, this
idea does not work. First of all, many users work on shared computers,
especially at companies, and the IP address merely identifies the computer, not
the user. Second, and even worse, many ISPs use NAT, so all outgoing packets
from all users bear the same IP address. From the server's point of view, all
the ISP's thousands of customers use the same IP address.
To solve this problem, Netscape
devised a much-criticized technique called cookies. The name derives from
ancient programmer slang in which a program calls a procedure and gets
something back that it may need to present later to get some work done. In this
sense, a UNIX file descriptor or a Windows object handle can be considered as a
cookie. Cookies were later formalized in RFC 2109.
When a client requests a Web page,
the server can supply additional information along with the requested page.
This information may include a cookie, which is a small (at most 4 KB) file (or
string). Browsers store offered cookies in a cookie directory on the client's
hard disk unless the user has disabled cookies. Cookies are just files or
strings, not executable programs. In principle, a cookie could contain a virus,
but since cookies are treated as data, there is no official way for the virus
to actually run and do damage. However, it is always possible for some hacker
to exploit a browser bug to cause activation.
A cookie may contain up to five
fields, as shown in Fig. 7-25. The Domain tells where the cookie came
from. Browsers are supposed to check that servers are not lying about their
domain. Each domain may store no more than 20 cookies per client. The Path is a
path in the server's directory structure that identifies which parts of the
server's file tree may use the cookie. It is often /, which means the whole
tree.
The Content field takes the form name
= value. Both name and value can be anything the server wants. This field is
where the cookie's content is stored.
The Expires field specifies when the
cookie expires. If this field is absent, the browser discards the cookie when
it exits. Such a cookie is called a nonpersistent cookie. If a time and date
are supplied, the cookie is said to be persistent and is kept until it expires.
Expiration times are given in Greenwich Mean Time. To remove a cookie from a
client's hard disk, a server just sends it again, but with an expiration time
in the past.
Finally, the Secure field can be set
to indicate that the browser may only return the cookie to a secure server.
This feature is used for e-commerce, banking, and other secure applications.
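As a rough sketch of the mechanics, the function below assembles the five fields into a Set-Cookie header in the RFC 2109 style. The cookie name and value are made up for illustration rather than copied from Fig. 7-25.

from email.utils import formatdate

def make_set_cookie(name, value, domain, path="/", expires=None, secure=False):
    parts = [f"{name}={value}",              # the Content field (name = value)
             f"Domain={domain}",             # where the cookie came from
             f"Path={path}"]                 # which part of the file tree may use it
    if expires is not None:                  # persistent cookie; omit for a nonpersistent one
        parts.append("Expires=" + formatdate(expires, usegmt=True))
    if secure:
        parts.append("Secure")               # return only to a secure server
    return "Set-Cookie: " + "; ".join(parts)

# A persistent, secure cookie expiring in 2100 (the time is a UNIX timestamp).
print(make_set_cookie("CustomerID", "497793521", "toms-casino.com",
                      expires=4102444800, secure=True))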
We have now seen how cookies are acquired,
but how are they used? Just before a browser sends a request for a page to some
Web site, it checks its cookie directory to see if any cookies there were
placed by the domain the request is going to. If so, all the cookies placed by
that domain are included in the request message. When the server gets them, it
can interpret them any way it wants to.
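That check can be sketched as follows: gather every stored cookie whose Domain and Path cover the request and pack them into a single Cookie header. The stored cookies shown are hypothetical.

stored_cookies = [
    {"domain": "toms-casino.com", "path": "/", "content": "CustomerID=497793521"},
    {"domain": "joes-store.com",  "path": "/", "content": "Cart=1x00501,1x07031,2x13721"},
]

def cookie_header(host, path, cookies):
    # Return the Cookie header for a request to host/path, or None if nothing matches.
    matching = [c["content"] for c in cookies
                if host.endswith(c["domain"]) and path.startswith(c["path"])]
    return "Cookie: " + "; ".join(matching) if matching else None

print(cookie_header("www.toms-casino.com", "/slots", stored_cookies))
# Cookie: CustomerID=497793521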
Let us examine some possible uses
for cookies. In Fig. 7-25, the first cookie was set by toms-casino.com
and is used to identify the customer. When the client logs in next week to
throw away some more money, the browser sends over the cookie so the server
knows who it is. Armed with the customer ID, the server can look up the
customer's record in a database and use this information to build an
appropriate Web page to display. Depending on the customer's known gambling
habits, this page might consist of a poker hand, a listing of today's horse
races, or a slot machine.
The second cookie came from joes-store.com.
The scenario here is that the client is wandering around the store, looking for
good things to buy. When she finds a bargain and clicks on it, the server
builds a cookie containing the number of items and the product code and sends
it back to the client. As the client continues to wander around the store, the
cookie is returned on every new page request. As more purchases accumulate, the
server adds them to the cookie. In the figure, the cart contains three items,
the last of which is desired in duplicate. Finally, when the client clicks on PROCEED
TO CHECKOUT, the cookie, now containing the full list of purchases, is sent
along with the request. In this way the server knows exactly what has been
purchased.
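A possible sketch of the server's side of this exchange is shown below; the encoding of the cart inside the cookie's value (comma-separated quantity and product-code pairs) is an assumption made for illustration, not the format any particular store uses.

def add_to_cart(cart_value, product_code, quantity=1):
    # Parse the old Cart value, append the new item, and return the new value.
    items = cart_value.split(",") if cart_value else []
    items.append(f"{quantity}x{product_code}")
    return ",".join(items)

cart = ""
cart = add_to_cart(cart, "00501")        # first bargain
cart = add_to_cart(cart, "07031")        # second item
cart = add_to_cart(cart, "13721", 2)     # last item, desired in duplicate
print(f"Set-Cookie: Cart={cart}; Domain=joes-store.com; Path=/")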
The third cookie is for a Web
portal. When the customer clicks on a link to the portal, the browser sends
over the cookie. This tells the portal to build a page containing the stock
prices for Sun Microsystems and Oracle, and the New York Jets football results.
Since a cookie can be up to 4 KB, there is plenty of room for more detailed
preferences concerning newspaper headlines, local weather, special offers, etc.
Cookies can also be used for the
server's own benefit. For example, suppose a server wants to keep track of how
many unique visitors it has had and how many pages each one looked at before
leaving the site. When the first request comes in, there will be no
accompanying cookie, so the server sends back a cookie containing Counter = 1.
Subsequent clicks on that site will send the cookie back to the server. Each
time, the counter is incremented and sent back to the client. By keeping track
of the counters, the server can see how many people give up after seeing the
first page, how many look at two pages, and so on.
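The counter idea takes only a few lines: if the request carries no Counter cookie, the server starts at 1; otherwise it increments the value that came back.

def count_visit(cookie_value):
    # cookie_value is the Cookie header from the request, or None on the first visit.
    counter = 1
    if cookie_value and "Counter=" in cookie_value:
        current = cookie_value.split("Counter=")[1].split(";")[0]
        counter = int(current) + 1
    return f"Set-Cookie: Counter={counter}"

print(count_visit(None))           # first request  -> Set-Cookie: Counter=1
print(count_visit("Counter=1"))    # second click   -> Set-Cookie: Counter=2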
Cookies have also been misused. In
theory, cookies are only supposed to go back to the originating site, but
hackers have exploited numerous bugs in the browsers to capture cookies not
intended for them. Since some e-commerce sites put credit card numbers in
cookies, the potential for abuse is clear.
A controversial use of cookies is to
secretly collect information about users' Web browsing habits. It works like
this. An advertising agency, say, Sneaky Ads, contacts major Web sites and
places banner ads for its corporate clients' products on their pages, for which
it pays the site owners a fee. Instead of giving the site a GIF or JPEG file to
place on each page, it gives them a URL to add to each page. Each URL it hands
out contains a unique number in the file part, such as
http://www.sneaky.com/382674902342.gif
When a user first visits a page, P,
containing such an ad, the browser fetches the HTML file. Then the browser
inspects the HTML file and sees the link to the image file at www.sneaky.com,
so it sends a request there for the image. A GIF file containing an ad is
returned, along with a cookie containing a unique user ID, 3627239101 in Fig. 7-25. Sneaky records the fact that the user
with this ID visited page P. This is easy to do since the file requested (382674902342.gif)
is referenced only on page P. Of course, the actual ad may appear on thousands
of pages, but each time with a different file name. Sneaky probably collects a
couple of pennies from the product manufacturer each time it ships out the ad.
Later, when the user visits another
Web page containing any of Sneaky's ads, after the browser has fetched the HTML
file from the server, it sees the link to, say, http://www.sneaky.com/493654919923.gif
and requests that file. Since it already has a cookie from the domain sneaky.com,
the browser includes Sneaky's cookie containing the user ID. Sneaky now knows a
second page the user has visited.
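The bookkeeping on Sneaky's side might look like the sketch below: each unique image name maps to the page it was embedded in, and every hit is recorded against the user ID carried by the cookie. The names, URLs, and ID are hypothetical.

image_to_page = {
    "382674902342.gif": "http://some-site.com/page-P.html",
    "493654919923.gif": "http://other-site.com/page-Q.html",
}
profiles = {}                                # user ID -> list of pages visited

def record_hit(user_id, image_name):
    page = image_to_page.get(image_name)
    if page is not None:
        profiles.setdefault(user_id, []).append(page)

record_hit("3627239101", "382674902342.gif")
record_hit("3627239101", "493654919923.gif")
print(profiles["3627239101"])                # the user's browsing profile so far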
In due course of time, Sneaky can
build up a complete profile of the user's browsing habits, even though the user
has never clicked on any of the ads. Of course, it does not yet have the user's
name (although it does have his IP address, which may be enough to deduce the
name from other databases). However, if the user ever supplies his name to any
site cooperating with Sneaky, a complete profile along with a name is now
available for sale to anyone who wants to buy it. The sale of this information
may be profitable enough for Sneaky to place more ads on more Web sites and
thus collect more information. The most insidious part of this whole business
is that most users are completely unaware of this information collection and
may even think they are safe because they do not click on any of the ads.
And if Sneaky wants to be
supersneaky, the ad need not be a classical banner ad. An ''ad'' consisting of
a single pixel in the background color (and thus invisible) has exactly the
same effect as a banner ad: it requires the browser to go fetch the 1 x 1-pixel
GIF image and send along all cookies originating at the pixel's domain.
To maintain some semblance of
privacy, some users configure their browsers to reject all cookies. However,
this can cause problems with legitimate Web sites that use cookies. To solve
this problem, users sometimes install cookie-eating software. These are special
programs that inspect each incoming cookie upon arrival and accept or discard
it depending on choices the user has given it (e.g., about which Web sites can
be trusted). This gives the user fine-grained control over which cookies are
accepted and which are rejected. Modern browsers, such as Mozilla (www.mozilla.org),
have elaborate user controls over cookies built in.
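A minimal sketch of such a filter, assuming the user supplies a list of trusted domains, is shown below.

TRUSTED_DOMAINS = {"toms-casino.com", "joes-store.com"}   # the user's own trust list

def accept_cookie(cookie_domain):
    # Accept an incoming cookie only if its Domain field is on the trust list.
    return cookie_domain.lstrip(".") in TRUSTED_DOMAINS

print(accept_cookie("joes-store.com"))   # True  -> store the cookie
print(accept_cookie("sneaky.com"))       # False -> discard it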