IFilter Interface

The IFilter interface scans documents for plain text and properties (attributes). It extracts chunks of text from these documents, filtering out embedded formatting and retaining positional information. IFilter provides the foundation for building higher-level applications such as document indexers and application-independent viewers.

IFilter is designed to meet the specific needs of full text search engines. It is up to the search engine to break the result of a call to IFilter::GetText or IFilter::GetValue into words, and store the results in an index.

Filtering

Documents are typically stored in private file formats that are opaque to the system. Most content indexing systems don't understand these private file formats and consequently don't index them. Content filters can be used to filter these private file formats. A particular content filter can read a particular file format.

When an indexer begins working on a document, it determines the file type, and uses the appropriate content filter. The content filter extracts text chunks from the document that can be sent to a search engine in a recognizable format.

Besides extracting text chunks, content filters can also recognize language shifts, as in the case of multilingual documents. If a particular document format tags such shifts, the content filter can emit the tags with the corresponding text chunk. The indexer can then use these tags to load the appropriate word breaker for the language. However, content filters can do this only if some sort of tag is included in the on-disk file format.

Content filters also handle embedded objects. When such an object is encountered in a document, its type should be identified and the appropriate content filter activated.

Because knowledge of a particular file format is encapsulated within the content filter, new file formats can be indexed simply by providing a content filter for the format.
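The selection step described above can be pictured as a lookup from file type to content filter. The following is a minimal sketch, not the system's actual lookup mechanism: ContentFilter, PlainTextFilter, and FilterRegistry are hypothetical names, and the file extension is assumed to identify the file type.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a content filter. This is not the COM IFilter
// interface; it carries only enough to show the lookup.
struct ContentFilter {
    virtual ~ContentFilter() = default;
    virtual std::string FormatName() const = 0;
};

struct PlainTextFilter : ContentFilter {
    std::string FormatName() const override { return "plain text"; }
};

// Maps a file extension (assumed here to identify the file type) to a
// factory that creates the matching content filter.
class FilterRegistry {
public:
    using Factory = std::function<std::unique_ptr<ContentFilter>()>;

    void Register(const std::string& extension, Factory factory) {
        assert(factory && "a registered factory must be callable");
        factories_[extension] = std::move(factory);
    }

    // Returns a filter for the file, or nullptr when no filter is
    // registered for its extension (the document is then not indexed).
    std::unique_ptr<ContentFilter> CreateFor(const std::string& fileName) const {
        const std::string::size_type dot = fileName.rfind('.');
        if (dot == std::string::npos) return nullptr;
        const auto it = factories_.find(fileName.substr(dot));
        if (it == factories_.end()) return nullptr;
        return it->second();
    }

private:
    std::map<std::string, Factory> factories_;
};
```

Supporting a new format in this sketch means one more Register call; the indexing code that calls CreateFor does not change.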

Chunks

Each object can be asked to produce the chunks of Unicode text that it contains, converting ASCII to Unicode when necessary. Text within a chunk is intended to be linear and sequential, with the same attribute and locale. Two pieces of text that do not have such a relationship must be in different chunks. Separate text boxes in a graphics file, labels and titles on charts, and possibly even text in separate cells of a spreadsheet are all examples of text in separate chunks.

Each chunk is given a unique chunk identifier. These ULONG identifiers are guaranteed to remain constant until IFilter is released.

Repeated instantiations of IFilter, with the same initial parameters, will produce the same set of chunks. Multiple instantiations with different initial parameters may produce a different set of chunks. Changing the set of attributes (see Properties and Pseudoproperties) may re-partition the chunks of an object. A chunk identifier of 0 is invalid.

Chunks may overlap, but a specific attribute should be applied to a given character only once.
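The identifier rules above (unique within a filtering session, never 0) can be checked mechanically. The sketch below uses a hypothetical Chunk record; the field names and the use of std::uint32_t in place of ULONG are assumptions made for portability, not the layout of STAT_CHUNK.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// Hypothetical chunk record. All text in one chunk shares a single
// attribute and a single locale, as the rules above require.
struct Chunk {
    std::uint32_t id;       // stands in for the ULONG chunk identifier
    std::string attribute;  // illustrative attribute name
    std::uint32_t locale;   // locale identifier for the chunk's text
    std::u16string text;    // Unicode text of the chunk
};

// Returns true when every identifier is nonzero and no identifier is
// used twice within the same set of chunks.
bool ChunkIdsAreValid(const std::vector<Chunk>& chunks) {
    std::set<std::uint32_t> seen;
    for (const Chunk& chunk : chunks) {
        if (chunk.id == 0) return false;                   // 0 is invalid
        if (!seen.insert(chunk.id).second) return false;   // duplicate id
    }
    return true;
}
```

A filter author could run a check of this kind in a debug build over the chunks produced for a test document.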

Properties and Pseudoproperties

Text extracted by IFilter may be tagged with many attributes, but only one attribute at a time. When these attributes refer to text chunks, they are treated as properties by the content indexer (the search engine), but not by the system. They are known as pseudoproperties.

Pseudoproperties are not accessible through the standard COM IPropertyStorage interface. Pseudoproperties allow the user to search for documents based on the value of some internal field in the document that has not been exposed to the system as a property. For example, a spreadsheet describing monthly sales for an employee might export employee-id and total-sales pseudoproperties. This would enable a query for all spreadsheets (months) in which some employee sold more than x dollars.

Pseudoproperty names must follow COM property naming conventions. Each pseudoproperty must be specified as a property set\property pair. Failure to follow this naming convention will result in unpredictable query behavior. Specifying a pseudoproperty name that matches a true COM-style property name may also result in undefined query behavior.
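To make the property set\property pairing and the sales query concrete, here is a small sketch of how an indexer might key stored values. Everything in it is hypothetical: PropertyKey, IndexedDocument, and DocumentsWithValueAbove are illustrative names, and the property set is written as a readable string only for brevity, which is not how COM identifies property sets.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// A pseudoproperty name as a (property set, property) pair.
using PropertyKey = std::pair<std::string, std::string>;

// Hypothetical index entry: the numeric pseudoproperty values that a
// filter emitted for one document.
struct IndexedDocument {
    std::string name;
    std::map<PropertyKey, double> numericValues;
};

// Returns the names of all documents whose value for `key` is greater
// than `threshold`. Documents that lack the property never match.
std::vector<std::string> DocumentsWithValueAbove(
        const std::vector<IndexedDocument>& documents,
        const PropertyKey& key,
        double threshold) {
    // Both halves of the pair are required by the naming convention.
    assert(!key.first.empty() && !key.second.empty());
    std::vector<std::string> matches;
    for (const IndexedDocument& document : documents) {
        const auto it = document.numericValues.find(key);
        if (it != document.numericValues.end() && it->second > threshold) {
            matches.push_back(document.name);
        }
    }
    return matches;
}
```

With a total-sales key, calling DocumentsWithValueAbove with a threshold of x corresponds to the query for all monthly spreadsheets in which an employee sold more than x dollars.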

An IFilter implementation can also provide COM-style properties. These properties would be retrieved by calling IFilter::GetValue. Logically, they should be considered external annotations of a document. For example, this mechanism can be used to publish HTML anchors. If a class supports retrieval of COM properties through IPropertyStorage, an IFilter implementation can request that the caller of IFilter use IPropertyStorage to enumerate COM properties, either to replace or to supplement properties emitted by IFilter::GetValue.

Embedded and Linked Objects

An object must enumerate the chunks of text in its embedded objects. These nested chunks appear to the original caller as chunks of the outer object. The operating system provides no support for this operation. The implementation of IFilter is responsible for binding to the IFilter interface of embedded objects. (IFilter is implemented on the embedded object by the owner of that format, following standard COM containment.) If the current chunk is within an embedded object, all GetText and GetValue calls should be passed directly to the embedded object's IFilter, and the return values from the embedded object should be returned to the client. Other calls require some additional work. GetChunk, for example, may require renumbering chunk identifiers to make them unique.

An object can optionally be asked to enumerate the chunks of text contained in its linked objects. As with embedded objects, your implementation of IFilter is responsible for binding to the IFilter interface of linked objects, then renumbering the chunks of the linked objects so they will appear to the original caller as chunks of the outer object. The same rules that apply to an embedded object's chunks apply to a linked object's chunks.
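One way to picture the renumbering is a shared counter owned by the outer filter plus one translation table per embedded or linked object. This is a sketch under that assumption, not a prescribed design; ChunkIdAllocator and NestedIdMapper are hypothetical names, and std::uint32_t stands in for ULONG.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <map>

// Hands out identifiers for the outer object. The first value is 1, so
// the invalid identifier 0 is never produced.
class ChunkIdAllocator {
public:
    std::uint32_t Next() {
        assert(last_ != std::numeric_limits<std::uint32_t>::max());
        return ++last_;
    }

private:
    std::uint32_t last_ = 0;
};

// Translates the identifiers reported by one nested (embedded or linked)
// filter into identifiers that are unique within the outer object.
class NestedIdMapper {
public:
    explicit NestedIdMapper(ChunkIdAllocator& allocator)
        : allocator_(allocator) {}

    // The same inner identifier always maps to the same outer identifier,
    // so identifiers stay constant for the life of the filter.
    std::uint32_t ToOuter(std::uint32_t innerId) {
        const auto it = outerByInner_.find(innerId);
        if (it != outerByInner_.end()) return it->second;
        const std::uint32_t outerId = allocator_.Next();
        outerByInner_[innerId] = outerId;
        return outerId;
    }

private:
    ChunkIdAllocator& allocator_;
    std::map<std::uint32_t, std::uint32_t> outerByInner_;
};
```

In this sketch the outer filter draws identifiers for its own chunks from the same allocator, which is what keeps a nested object's chunk 1 from colliding with the container's chunk 1.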

The original source of a chunk (embedded, linked, or top-level container) is not exposed by IFilter.

Proposed Uses of IFilter

Although clients of IFilter can use the interface in any way they see fit, it was designed for two tasks: filtering and viewing (browsing/hit-highlighting).

Full Text Search

Full text search engines are the simpler of the two filter clients. They scan objects for plain text, pseudoproperties, and COM-style properties. They break the result of IFilter::GetText calls apart into words, normalize them, and then store the result in an index. The locale identifier, if specified with a text chunk, is used to perform proper language-specific word breaking.
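The scan, word-break, normalize, and store steps can be sketched as a nested loop over chunks and text pieces. SimpleFilter below is a simplified, portable stand-in and does not reproduce the COM signatures or return codes of IFilter::GetChunk and IFilter::GetText. Word breaking on whitespace and lowercasing as normalization are assumptions; the sketch ignores the locale and does not rejoin a word split across two text pieces.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

struct SimpleChunk {
    std::uint32_t id;
    std::vector<std::string> pieces;  // text is returned piece by piece
};

// Hypothetical stand-in for a filter: NextChunk plays the role of
// GetChunk and NextText the role of GetText.
class SimpleFilter {
public:
    explicit SimpleFilter(std::vector<SimpleChunk> chunks)
        : chunks_(std::move(chunks)) {}

    // Positions the filter on the next chunk; false when none remain.
    bool NextChunk(std::uint32_t& id) {
        if (next_ >= chunks_.size()) return false;
        current_ = next_++;
        piece_ = 0;
        positioned_ = true;
        id = chunks_[current_].id;
        return true;
    }

    // Returns the next piece of text in the current chunk; false when
    // the chunk is exhausted or no chunk has been selected yet.
    bool NextText(std::string& text) {
        if (!positioned_ || piece_ >= chunks_[current_].pieces.size()) return false;
        text = chunks_[current_].pieces[piece_++];
        return true;
    }

private:
    std::vector<SimpleChunk> chunks_;
    std::size_t next_ = 0, current_ = 0, piece_ = 0;
    bool positioned_ = false;
};

// Maps each normalized word to the chunks in which it occurs.
using InvertedIndex = std::map<std::string, std::set<std::uint32_t>>;

InvertedIndex BuildIndex(SimpleFilter& filter) {
    InvertedIndex index;
    std::uint32_t id = 0;
    while (filter.NextChunk(id)) {
        assert(id != 0 && "chunk identifier 0 is invalid");
        std::string text;
        while (filter.NextText(text)) {
            std::istringstream words(text);   // break on whitespace
            std::string word;
            while (words >> word) {
                for (char& ch : word) {       // normalize to lowercase
                    ch = static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));
                }
                index[word].insert(id);
            }
        }
    }
    return index;
}
```

A production indexer would instead hand each piece of text to a word breaker chosen from the chunk's locale identifier.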

Viewing

Document viewing displays the results of full text queries. A simplistic model of the viewing process is that documents matching a query will be indexed on-the-fly, and the resulting in-memory index will be searched to locate query hits. A document viewer highlights and navigates between these hits.
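As a rough illustration of locating query hits for highlighting, the sketch below records each hit as a chunk identifier plus an offset and length within that chunk's text. The Hit structure and FindHits function are hypothetical, and a plain case-sensitive substring search stands in for re-indexing the document on the fly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical description of one hit: which chunk it is in, where it
// starts within that chunk's text, and how long it is.
struct Hit {
    std::uint32_t chunkId;
    std::size_t start;
    std::size_t length;
};

// Finds every non-overlapping occurrence of `term` in each chunk.
// Each chunk is given as (chunk identifier, chunk text).
std::vector<Hit> FindHits(
        const std::vector<std::pair<std::uint32_t, std::string>>& chunks,
        const std::string& term) {
    std::vector<Hit> hits;
    if (term.empty()) return hits;  // an empty term matches nothing
    for (const auto& chunk : chunks) {
        std::size_t pos = chunk.second.find(term);
        while (pos != std::string::npos) {
            hits.push_back(Hit{chunk.first, pos, term.size()});
            pos = chunk.second.find(term, pos + term.size());
        }
    }
    return hits;
}
```

A viewer could walk the returned list in order to move between hits and use each offset and length to decide which characters to highlight.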

Methods in Vtable Order

IUnknown Methods    Description
QueryInterface      Returns pointers to supported interfaces.
AddRef              Increments reference count.
Release             Decrements reference count.

IFilter Methods     Description
Init                Initializes a filtering session.
GetChunk            Positions the filter at the beginning of the next chunk and returns its description.
GetText             Retrieves text from the current chunk.
GetValue            Retrieves non-text values from the current chunk.
BindRegion          Retrieves an interface representing a specified portion of the object. Currently reserved for future use.

See Also

IFILTER_INIT, IFILTER_FLAGS, CHUNKSTATE, CHUNK_BREAKTYPE, STAT_CHUNK, FILTERREGION