Appendix A: MIME Type Detection in Internet Explorer 4.0

The purpose of MIME type detection, or data sniffing, is to determine the MIME type (also known as content type or media type) of downloaded content using information from the following four sources:

The server-supplied MIME type, if available

An examination of the actual contents associated with a downloaded URL

The file name associated with the downloaded content (assumed to be derived from the associated URL)

Registry settings (file extension/MIME type associations or registered applications) in effect during the download

MIME type determination occurs in URL monikers through the FindMimeFromData method. Determining the MIME type allows URL monikers and other components to find and launch the correct object server or application to handle the associated content. This section provides a brief summary of the logic used in determining the MIME type from these sources, and discusses some of the issues involved.

FindMimeFromData currently contains hard-coded tests for 26 separate MIME types (see Known MIME Types). If a buffer contains data in the format of one of these MIME types, FindMimeFromData has a test designed to recognize that MIME type by scanning the buffer contents. A MIME type is known if it is one of these 26 MIME types. A MIME type is ambiguous if it is 'text/plain', 'application/octet-stream', an empty string, or null (that is, the server failed to provide it). A MIME type that is neither known nor ambiguous is termed unknown. The MIME types 'text/plain' and 'application/octet-stream' are termed ambiguous because they generally do not indicate clearly which application or CLSID should be associated as the content handler. A MIME type inferred from any one of the four possible sources falls into exactly one of these three classifications.
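The three classifications can be sketched as a small lookup. This is an illustration only, not the URL moniker implementation: the function name classify_mime is hypothetical, the known set is copied from the Known MIME Types list in this appendix, and the normalization (trimming whitespace, dropping any parameters after a semicolon, ignoring case) is an assumption.

```python
# The 26 MIME types listed under "Known MIME Types" in this appendix.
KNOWN_MIME_TYPES = frozenset({
    "text/richtext", "text/html",
    "audio/x-aiff", "audio/basic", "audio/wav",
    "image/gif", "image/jpeg", "image/pjpeg", "image/tiff", "image/x-png",
    "image/x-xbitmap", "image/bmp", "image/x-jg", "image/x-emf", "image/x-wmf",
    "video/avi", "video/mpeg",
    "application/postscript", "application/base64", "application/macbinhex40",
    "application/pdf", "application/x-compressed",
    "application/x-zip-compressed", "application/x-gzip-compressed",
    "application/java", "application/x-msdownload",
})

# An empty string is ambiguous; None (no type supplied) is handled below.
AMBIGUOUS_MIME_TYPES = frozenset({"text/plain", "application/octet-stream", ""})


def classify_mime(mime):
    """Return 'known', 'ambiguous', or 'unknown' for a MIME type (or None)."""
    if mime is None:
        return "ambiguous"
    # Assumption: compare case-insensitively and ignore parameters such as
    # "; charset=...". The appendix does not describe any normalization.
    normalized = mime.split(";", 1)[0].strip().lower()
    if normalized in AMBIGUOUS_MIME_TYPES:
        return "ambiguous"
    if normalized in KNOWN_MIME_TYPES:
        return "known"
    return "unknown"
```

Under these assumptions, 'image/gif' classifies as known, a missing type as ambiguous, and a type such as 'text/sgml' as unknown.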

FindMimeFromData typically receives three parameters when invoked: the cache file name (assumed to be derived from the associated URL), a pointer to a buffer containing up to the first 256 bytes of the content, and a "suggested" MIME type that typically corresponds to the server-provided MIME type (from the Content-Type header). The MIME type is determined as follows:

If the "suggested" (server-provided) MIME type is unknown (neither known nor ambiguous), FindMimeFromData immediately returns this MIME type as the final determination. The reason is that new MIME types are continually emerging, and their formats might be difficult to distinguish from the set of hard-coded MIME types for which tests exist. A good example is SGML, which can easily be misclassified as HTML because it contains many of the same tags. Rather than weakening the hard-coded tests, or risking the misclassification of new and as-yet-unknown MIME types as hard-coded known ones, priority is given to the server-supplied MIME type if it is unknown: such MIME types are both specific and likely uncommon, and no hard-coded test can positively identify them.

If the server-provided MIME type is either known or ambiguous, the buffer is scanned in an attempt to verify or obtain a MIME type from the actual content. If a positive match is found (one of the hard-coded tests succeeds), the matched MIME type is immediately returned as the final determination, overriding the server-provided MIME type. This behavior is necessary, for example, to identify a .gif file that is sent as text/html. The scan also determines whether the buffer is predominantly text or binary.

If no positive match is obtained during the data scan, and the server-provided MIME type is known, an attempt is made to determine whether the format (text or binary) of the known MIME type conflicts with the format (text or binary) determined from scanning the buffer. If no conflict exists (the data scan indicates primarily text and the server-provided MIME type has a text format, or the data scan indicates binary and the server-provided MIME type has a binary format), the server-provided MIME type is returned. The reasoning is that new formats of existing MIME types might be added over time (image/tiff is one example), and the hard-coded tests might not recognize these new formats because a different pattern match might be required. On the assumption that the basic format of a MIME type will not change over time from primarily text to binary or vice versa, it suffices that the format of the server-provided MIME type and the format found by scanning the data do not disagree. The format types for known MIME types are stored in a media information structure in URL monikers.

If no positive match is obtained during the data scan, and either the server-provided MIME type is ambiguous, or it is known but failed the data format agreement test in the previous step, an attempt is made to parse a file extension from the file name passed in. If this succeeds, the registry is searched for the MIME type associated with that file extension. The associated MIME type is returned as the final determination only if it is unknown. The reason for this added requirement is as follows: If the file extension yields an ambiguous MIME type, this adds no information to what was already obtained by scanning the data. If the file extension yields a known MIME type, that MIME type should have been found during scanning; since it was not, it is suspect and is rejected. An example is an arbitrary plain-text file returned through an ISAPI DLL, with the server returning 'text/plain' as the MIME type. Since the server-provided MIME type is ambiguous, the data is scanned, which only confirms that the data is plain text. Subsequently, the file name is parsed for an extension. Because the contents were downloaded using an ISAPI DLL, the URL, and hence the cache file name, has a .dll file extension, which has the MIME type 'application/x-msdownload' associated in the registry. This MIME type was already scanned for (application/x-msdownload is a known MIME type) and was not found, so it would be the wrong determination: it would result in a file download, whereas the desired behavior is to display the text in-pane.

If all of the preceding steps have failed to yield an unambiguous MIME type, a last check is made to see whether any application is associated in the registry with the file extension parsed from the file name, if an extension exists. If an associated application is found, the final determination is automatically set to 'application/octet-stream'. This default value ensures that the shell launches the registered application with the downloaded data, rather than the data being displayed in-pane. This is necessary, for example, when downloading .bat and .cmd files, which are plain text files, are frequently identified by the server as 'text/plain', and have no associated MIME type in the registry. Without the final check for an associated application, these files would be displayed in-pane, whereas the desired behavior is to launch the command interpreter. Other types of files, such as .reg files, behave similarly.

Finally, if no file extension is found, or one is found with no associated MIME type or registered application, the MIME type 'text/plain' is returned if the data scan indicated predominantly text, or 'application/octet-stream' if the data scan indicated binary, since this is the most specific determination that can still be made correctly.
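The order of the steps above can be summarized in a short sketch. It is a reading of the prose, not the actual FindMimeFromData code: the Evidence record, its field names, and the function determine_mime are all hypothetical, and the inputs (scan result, registry lookups, classification of each MIME type) are assumed to have been gathered beforehand.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    """Hypothetical record of what has been learned about one download."""
    suggested: Optional[str]          # server-provided MIME type, if any
    suggested_class: str              # 'known', 'ambiguous', or 'unknown'
    scan_is_text: bool                # data scan found predominantly text
    scan_match: Optional[str] = None  # MIME type from a positive hard-coded test
    server_is_text_format: Optional[bool] = None  # format of a known server type
    extension_mime: Optional[str] = None          # registry MIME type for extension
    extension_mime_class: Optional[str] = None    # classification of that type
    extension_has_app: bool = False   # an application is registered for extension


def determine_mime(ev: Evidence):
    # An unknown server-provided type is trusted immediately.
    if ev.suggested_class == "unknown":
        return ev.suggested
    # A positive match from the data scan overrides the server.
    if ev.scan_match is not None:
        return ev.scan_match
    # A known server type is kept if its format agrees with the scan.
    if ev.suggested_class == "known" and ev.server_is_text_format == ev.scan_is_text:
        return ev.suggested
    # A registry MIME type for the extension is accepted only if unknown.
    if ev.extension_mime is not None and ev.extension_mime_class == "unknown":
        return ev.extension_mime
    # A registered application forces a launch through the shell.
    if ev.extension_has_app:
        return "application/octet-stream"
    # Fall back on the text/binary result of the scan.
    return "text/plain" if ev.scan_is_text else "application/octet-stream"


# The ISAPI example: plain text served through a .dll URL. It is assumed
# here that no application is registered for the .dll extension.
isapi_text = Evidence("text/plain", "ambiguous", scan_is_text=True,
                      extension_mime="application/x-msdownload",
                      extension_mime_class="known")

# The .bat example: no registry MIME type, but a registered application.
batch_file = Evidence("text/plain", "ambiguous", scan_is_text=True,
                      extension_has_app=True)

# A .gif file sent as text/html: the scan match wins.
mislabeled_gif = Evidence("text/html", "known", scan_is_text=False,
                          scan_match="image/gif", server_is_text_format=True)
```

With these inputs, determine_mime returns 'text/plain' for isapi_text, 'application/octet-stream' for batch_file, and 'image/gif' for mislabeled_gif, matching the outcomes described in the text.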

Known MIME Types

URL monikers currently contain hard-coded tests for the following MIME types:

text/richtext

text/html

audio/x-aiff

audio/basic

audio/wav

image/gif

image/jpeg

image/pjpeg

image/tiff

image/x-png

image/x-xbitmap

image/bmp

image/x-jg

image/x-emf

image/x-wmf

video/avi

video/mpeg

application/postscript

application/base64

application/macbinhex40

application/pdf

application/x-compressed

application/x-zip-compressed

application/x-gzip-compressed

application/java

application/x-msdownload
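This appendix does not describe the hard-coded tests themselves. As a rough illustration of what a content scan of this kind might look like, the sketch below checks a buffer against a few widely documented file signatures (GIF, PNG, PDF, ZIP) and estimates whether the data is predominantly text. The names SIGNATURES, looks_like_text, and scan_buffer, the choice of signatures, and the 90 percent threshold are all assumptions, not the tests used by FindMimeFromData.

```python
# A few well-known leading byte signatures, mapped to MIME types from the
# list above. This is a small illustrative subset, not the real test table.
SIGNATURES = (
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\x89PNG\r\n\x1a\n", "image/x-png"),
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/x-zip-compressed"),
)

MAX_SCAN_BYTES = 256  # the appendix says up to the first 256 bytes are passed


def looks_like_text(head):
    """Assumed heuristic: at least 90% printable ASCII or common whitespace."""
    if not head:
        return True  # arbitrary choice for an empty buffer
    text_bytes = sum(1 for b in head if 32 <= b < 127 or b in (9, 10, 13))
    return text_bytes / len(head) >= 0.9


def scan_buffer(buffer):
    """Return (matched MIME type or None, True if predominantly text)."""
    head = buffer[:MAX_SCAN_BYTES]
    matched = next(
        (mime for magic, mime in SIGNATURES if head.startswith(magic)), None
    )
    return matched, looks_like_text(head)
```

A buffer that begins with a GIF signature yields 'image/gif' regardless of the type the server supplied, while ordinary ASCII text yields no match and a text indication.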

Registry Locations

Location used by FindMimeFromData to find the MIME type and ProgID from a file extension:

HKEY_CLASSES_ROOT\.***

Location used by FindMimeFromData to find the application from a ProgID:

HKEY_CLASSES_ROOT\<ProgId>\shell\open\command

Location used by URL monikers to find CLSIDs from MIME types:

HKEY_CLASSES_ROOT\MIME\Database\Content Type
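As a small aid for inspecting these locations, the following sketch composes the key paths for a given file extension, ProgID, or MIME type. The helper names are hypothetical, and the assumption that each MIME type appears as a subkey of the Content Type key is not stated in this appendix. The code only builds strings; it does not read the registry.

```python
HKCR = "HKEY_CLASSES_ROOT"


def extension_key(extension):
    """Key holding the MIME type and ProgID for a file extension."""
    if not extension.startswith("."):
        extension = "." + extension
    return "\\".join([HKCR, extension])


def open_command_key(prog_id):
    """Key holding the application command line for a ProgID."""
    return "\\".join([HKCR, prog_id, "shell", "open", "command"])


def content_type_key(mime):
    """Key for one MIME type (assumed to be a subkey named after the type)."""
    return "\\".join([HKCR, "MIME", "Database", "Content Type", mime])
```

For example, extension_key("bat") gives HKEY_CLASSES_ROOT\.bat, and open_command_key applied to the ProgID found there gives the key whose command the shell would launch.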

