Cookie theory

It might sound like an obscure branch of theoretical physics, and possibly have some similar-looking enthusiasts, but the mysteries of cookies are quick to penetrate.

Cookies enhance Web requests

The communication between web browser and web server is defined by the HTTP network protocol. That protocol says that each URL request and its response is a pair of messages independent of the past and the future. Without something fancy like JavaScript and frames, there is no method for maintaining information between requests, and no mechanism to let the web server receive or send any such information. Cookies are an enhancement to HTTP that let this happen. Originally proposed for standards consideration by Netscape Communications (their proposal is at http://www.netscape.com/newsref/std/cookie_spec.html), the most official specification is easily readable here: http://www.cis.ohio-state.edu/htbin/rfc/rfc2109.html. Most of the basic points of the specification are covered in the following sections.

Each URL or HTTP request made by a browser user is turned into lines of text called headers for sending to the web server. When the web server issues a response, the same happens. Cookies are just extra header lines containing cookie-style information. This is all invisible to the user. So user requests and web server responses may occur with or without invisible cookies riding piggyback.
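To make the header lines concrete, here is a minimal sketch of how a response and the request that follows it might carry a cookie. The function names and the sample cookie are illustrative assumptions, not part of any browser or server API; only the Set-Cookie and Cookie header names come from the cookie specifications.

```javascript
// Hypothetical helpers: build the header lines that carry cookies.

// Server -> browser: one Set-Cookie line per cookie, attributes appended.
function setCookieHeader(name, value, attrs) {
  var line = "Set-Cookie: " + name + "=" + value;
  for (var key in attrs) {
    line += "; " + key + "=" + attrs[key];
  }
  return line;
}

// Browser -> server: a single Cookie line listing name=value pairs only.
function cookieHeader(pairs) {
  var parts = [];
  for (var i = 0; i < pairs.length; i++) {
    parts.push(pairs[i].name + "=" + pairs[i].value);
  }
  return "Cookie: " + parts.join("; ");
}

var responseLine = setCookieHeader("user66", "abc123", { path: "/jscript" });
var requestLine = cookieHeader([{ name: "user66", value: "abc123" }]);
```

The response line carries the attributes along with the value, while the request line sends back only the name and value.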

A key point is that the piggybacked cookies in the web server response get stored in the browser once received. Although a browser also reports its current cookies to the server when it makes requests, the server generally doesn't save them. This is almost the reverse of HTML form submissions: the browser has the long-term responsibility for the data, not the server; the server says what to change, not the browser. However, the browser can use JavaScript to set its own cookies as well.

Anatomy of a Cookie

A cookie is much like a JavaScript variable, with a name and a value. However, unlike a variable, the existence of a cookie depends on several other attributes as well. This is because cookies can arrive at a browser from any web site in the world and need to be kept separate.

A cookie has the following attributes:

name

A cookie name is a string of characters. The rules are different from those for JavaScript variable names, but common sense applies: use alphanumerics and underscores. Avoid using '$'. RFC 2109 treats cookie names as case-insensitive, but browsers do not reliably follow it, so never rely on two names that differ only in case. To really understand the naming rules, read the HTTP 1.1 standard. 'fred', 'my_big_cookie' and 'user66' are all valid cookie names. There are no reserved words or variable name limits.

value

The value part of a cookie is a string of any characters. That string must follow the rules for URLs, which means that the escape() function should be applied when JavaScript sets a cookie and unescape() when it reads one back. The name and value together should be no more than 4096 bytes. There are no 'undefined' or 'null' values for cookies, but zero-length strings are possible.
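The encoding step might be sketched as follows. The helper names makeCookiePair and readCookieValue are hypothetical, the size constant is an assumption based on the RFC 2109 per-cookie figure, and the cookie string is assumed to use the "name=value; name=value" layout that a browser exposes through document.cookie.

```javascript
// Hypothetical helpers for encoding and decoding cookie values.
// escape() and unescape() are the built-in functions named in the text.
var MAX_COOKIE_SIZE = 4096; // assumed per-cookie limit, name plus value

function makeCookiePair(name, value) {
  var pair = name + "=" + escape(value);
  // Length is counted in characters; after escape() the value is plain ASCII.
  if (pair.length > MAX_COOKIE_SIZE) {
    throw new Error("cookie '" + name + "' is too large");
  }
  return pair;
}

function readCookieValue(cookieString, name) {
  var pairs = cookieString.split("; ");
  for (var i = 0; i < pairs.length; i++) {
    var eq = pairs[i].indexOf("=");
    if (eq !== -1 && pairs[i].substring(0, eq) === name) {
      return unescape(pairs[i].substring(eq + 1));
    }
  }
  return null; // no cookie of that name
}

var pair = makeCookiePair("my_big_cookie", "red; green, blue");
// In a browser the pair would be stored with: document.cookie = pair;
var roundTrip = readCookieValue("fred=1; " + pair, "my_big_cookie");
```

Encoding matters here because semicolons, commas and spaces would otherwise be read as separators rather than as part of the value.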

domain

If two different web sites are viewed in a browser, they shouldn't be able to affect each other's cookies. Cookies have a domain property that restricts their visibility to one or more web sites.

Consider an example URL http://www.altavista.yellowpages.com.au/index.html. Any cookies with domain 'www.altavista.yellowpages.com.au' are readable from this page. Domains are also hierarchical: cookies with the domains '.altavista.yellowpages.com.au', '.yellowpages.com.au' and '.com.au' could all be picked up by that URL in the browser. The leading full stop is required for partial addresses. To prevent bored university students from making a cookie visible to every web page in the world, at least two domain portions must be specified.
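The matching rule just described can be sketched as a small function. This is an illustration of the rule as stated here, not the algorithm of any particular browser, and the name domainMatches is hypothetical.

```javascript
// Hypothetical check: would a cookie with this domain be visible to this host?
function domainMatches(host, cookieDomain) {
  host = host.toLowerCase();
  cookieDomain = cookieDomain.toLowerCase();
  if (cookieDomain.charAt(0) !== ".") {
    return host === cookieDomain; // full host name: exact match only
  }
  // Partial address: at least two portions, so ".com.au" passes, ".au" fails.
  var portions = cookieDomain.substring(1).split(".");
  if (portions.length < 2) return false;
  // The host must end with the partial address, leading full stop included.
  var tail = host.substring(host.length - cookieDomain.length);
  return tail === cookieDomain;
}

var host = "www.altavista.yellowpages.com.au";
var seesOwnHost = domainMatches(host, "www.altavista.yellowpages.com.au");
var seesParent = domainMatches(host, ".yellowpages.com.au");
var seesTooBroad = domainMatches(host, ".au");
```

With the example host, the first two checks succeed and the last fails because '.au' has only one domain portion.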

In practice, the domain attribute isn't used much, because it defaults to the domain of the document it piggybacked into the browser on (very sensible), and because it's unlikely that you would want to share a cookie with another web site anyway.

path

In a similar manner to domains, the path attribute of a cookie restricts a cookie's visibility to a particular part of a web server's directory tree. A web page such as http://www.microsoft.com/jscript might have a cookie with path '/jscript', which is only relevant to the JScript pages of that site. If a second cookie with the same name and domain also exists, but with the path '' (equivalent to '/'), then both cookies match the page, but the first is listed ahead of the second because its path is a closer match to the URL's path, so a lookup by name normally finds only the first.

Paths represent directories, not individual files, so '/usr/local/tmp' is correct, but '/usr/local/tmp/myfile.htm' isn't. Forward slashes ('/' not '\') should be used. Trailing slashes as in '/usr/local/tmp/' should be avoided. That is why the top-level path is '' (a zero-length string), not '/'.

The name, domain and path combine to fully identify an individual cookie.
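As a rough sketch of that identity rule, the following keeps cookies in a plain object keyed by name, domain and path, and returns the cookies that apply to a page with the closest path first. The store layout and every function name are assumptions made for illustration; path matching is a simple prefix test.

```javascript
// Hypothetical in-memory cookie store keyed by name, domain and path.
function cookieKey(c) {
  return c.name + "|" + c.domain + "|" + c.path;
}

function storeCookie(store, c) {
  store[cookieKey(c)] = c; // same name, domain and path: overwrite
}

// A cookie path applies when the URL path starts with it ('' matches all).
function pathMatches(urlPath, cookiePath) {
  return urlPath.indexOf(cookiePath) === 0;
}

// Return the matching cookies, longest (closest) path first.
function cookiesFor(store, domain, urlPath) {
  var found = [];
  for (var key in store) {
    var c = store[key];
    if (c.domain === domain && pathMatches(urlPath, c.path)) found.push(c);
  }
  found.sort(function (a, b) { return b.path.length - a.path.length; });
  return found;
}

var store = {};
var site = "www.microsoft.com";
storeCookie(store, { name: "id", domain: site, path: "", value: "site" });
storeCookie(store, { name: "id", domain: site, path: "/jscript", value: "old" });
storeCookie(store, { name: "id", domain: site, path: "/jscript", value: "new" });
var visible = cookiesFor(store, site, "/jscript/index.htm");
```

The third call replaces the second because all three identifying attributes agree, so the store ends up holding two cookies named 'id', and a page that takes the first match sees the '/jscript' one.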

expiry time

The expiry time provides one of two cleanup mechanisms for cookies (see the next section for the other). Without such mechanisms, cookies might just build up in the browser forever, until the user's computer fills up.

The expiry time is optional. It is a moment in time. Without one, a cookie will survive only while the browser is running. With one, a cookie will survive even if the browser shuts down, but it will be discarded at the time dictated. If the time passes when the browser is down, the cookie is discarded when it next starts up. If the time dictated is zero or in the past, the cookie will be discarded immediately.
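These rules reduce to a three-way decision, sketched below. The function name cookieFate and its return strings are hypothetical; expiry times are assumed to be held as JavaScript Date objects, with null standing for "no expiry time".

```javascript
// Hypothetical classification of a cookie by its optional expiry time.
// expires: a Date, or null when the cookie has no expiry time.
// now: the current time as a Date.
function cookieFate(expires, now) {
  if (expires === null) return "session";                   // until browser exits
  if (expires.getTime() <= now.getTime()) return "discard"; // zero or in the past
  return "persistent";                                      // survives a restart
}

var now = new Date(Date.UTC(1997, 5, 1));
var noExpiry = cookieFate(null, now);
var zeroTime = cookieFate(new Date(0), now);
var future = cookieFate(new Date(Date.UTC(1999, 8, 22)), now);
```

A browser would apply the same test at start-up, which is how a cookie that expired while the browser was shut down gets discarded.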

secure flag

This is a true/false attribute, which hints whether the cookie is too private for plain URL requests. The browser should only make secure (SSL) URL requests when sending this cookie. This attribute is less commonly used.

Browser Cookie restrictions

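Browsers place restrictions on the number of cookies that can be held at any one time. The restrictions are:

300 cookies in total; 20 cookies per server or domain; 4096 bytes per cookie, counting the name and value together.

RFC 2109 treats these figures as minimums that a browser must support. Netscape's specification and browsers treat them as maximums, in an attempt to guarantee that all your disk space won't be consumed.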

If you rely heavily on cookies, you will soon exceed the Netscape limit of 20. In that case, that browser will throw out one of the 20 when the 21st arrives. This is a source of obscure bugs. It is better to use only one cookie, and pack multi-variable data into it via JavaScript utility routines—4096 bytes is quite a lot of space. Internet Explorer doesn't have the 20 cookies per domain limit.
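One possible shape for such utility routines is sketched here. The names packValues and unpackValues and the ':' and '&' separators are arbitrary choices for illustration; escape() encodes both separator characters, so they cannot collide with the data.

```javascript
// Hypothetical utility routines: pack several variables into one cookie value.
function packValues(obj) {
  var parts = [];
  for (var key in obj) {
    parts.push(escape(key) + ":" + escape(String(obj[key])));
  }
  return parts.join("&");
}

function unpackValues(packed) {
  var obj = {};
  if (packed === "") return obj;
  var parts = packed.split("&");
  for (var i = 0; i < parts.length; i++) {
    var sep = parts[i].indexOf(":");
    var key = unescape(parts[i].substring(0, sep));
    obj[key] = unescape(parts[i].substring(sep + 1)); // values come back as strings
  }
  return obj;
}

var packed = packValues({ user: "Fred & Co", visits: 3 });
var unpacked = unpackValues(packed);
```

The packed string contains no semicolons, commas or spaces, so it can be used directly as a cookie value. Numbers come back as strings and need converting by the caller.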

The Netscape file that the cookie data resides in when the browser is shut down is called cookies.txt on Windows and Unix, and resides in the Netscape installation area (under each user for Netscape Communicator). It is a plain text file, automatically generated by the browser on shutdown, similar to the prefs.js file. The user can always delete this file while the browser is shut down, which removes all cookies from their system. An example file:

# Netscape HTTP Cookie File
# http://www.netscape.com/newsref/std/cookie_spec.html
# This is a generated file!  Do not edit.
www.geocities.com   FALSE / FALSE      937972424   GeoId   2035695874900187870
.linkexchange.com   TRUE  / FALSE      942191819   SAFE_COOKIE   3425efc81808cebe
www.macromedia.com   FALSE    FALSE      877627211   plugs   yes

The large number in the middle is expiry time in seconds from 1 January 1970. From this example, you can see that most web sites set one cookie only, and then it only contains a unique ID. Web sites often use this ID to look up their own records on the visitor holding the ID.
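To show the expiry arithmetic, here is a minimal sketch that reads one data line from a file like the sample above. The parser name and the field names are assumptions; the code expects the seven whitespace-separated fields seen in the first sample line and ignores the second field, whose meaning the text does not state.

```javascript
// Hypothetical parser for one data line of a Netscape cookies.txt file.
// Assumed field order: domain, flag, path, secure, expiry, name, value.
function parseCookieLine(line) {
  var f = line.replace(/^\s+|\s+$/g, "").split(/\s+/);
  if (f.length !== 7) return null; // comment, blank or incomplete line
  return {
    domain: f[0],
    path: f[2],
    secure: f[3] === "TRUE",
    expires: new Date(Number(f[4]) * 1000), // seconds since 1970 -> milliseconds
    name: f[5],
    value: f[6] // kept as a string; the ID is too long for a JavaScript number
  };
}

var sample = "www.geocities.com   FALSE / FALSE      937972424   GeoId   2035695874900187870";
var cookie = parseCookieLine(sample);
var expiryText = cookie.expires.toISOString();
```

For the GeoId cookie, 937972424 seconds works out to 22 September 1999, 03:53:44 UTC. Lines that do not have exactly seven fields, including the comment lines, are skipped by returning null.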

The equivalent files for Internet Explorer are stored by default in the C:\WINDOWS\COOKIES directory in .TXT files with the user's name. These are almost readable. Ironically, if you copy the files to a Unix computer, they are easily readable.

© 1997 by Wrox Press. All rights reserved.