IFilter::GetChunk

[This is preliminary documentation and subject to change.]

SCODE IFilter::GetChunk( STAT_CHUNK * pStat );

GetChunk positions the filter at the beginning of the next chunk and returns a description of the chunk in pStat. After this call, the chunk described in pStat is the current chunk. The chunk descriptor is owned by the caller, but the property name pointer that may be set in the property specification is owned by the callee and should not be freed. Several operations (see below) can only be applied to the current chunk. Before GetChunk has been called for the first time, there is no current chunk. When the current chunk is the last chunk, additional calls to GetChunk return FILTER_E_END_OF_CHUNKS. If the next chunk is an embedding for which a filter is not available, this call returns FILTER_E_EMBEDDING_UNAVAILABLE. If the next chunk is in an unavailable link, this call returns FILTER_E_LINK_UNAVAILABLE. Access failure may also be reported with FILTER_E_PASSWORD and FILTER_E_ACCESS. After an error return code of anything other than FILTER_E_END_OF_CHUNKS, the next call to GetChunk still fetches the next chunk after the unavailable one.
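
The calling pattern described above can be sketched as a loop that stops only on the end-of-chunks status and treats every other error as "skip this chunk and continue". This is a minimal sketch and not the real interface: the Status enumeration, StatChunk, MockFilter, Summary, and EnumerateChunks are hypothetical stand-ins so that the example builds without the platform headers.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Stand-ins for the SCODE values named in the text. The real constants come
// from the platform headers; these enumerators are illustrative only.
enum class Status { Ok, EndOfChunks, EmbeddingUnavailable, LinkUnavailable, Password, Access };

// Reduced stand-in for STAT_CHUNK: only the identifier is modelled here.
struct StatChunk { unsigned long idChunk = 0; };

// Hypothetical mock filter that replays a scripted list of results.
class MockFilter {
public:
    struct Step { Status status; unsigned long id; };
    explicit MockFilter(std::vector<Step> steps) : steps_(std::move(steps)) {}

    Status GetChunk(StatChunk* pStat) {
        assert(pStat != nullptr);
        if (next_ >= steps_.size()) return Status::EndOfChunks;
        const Step& step = steps_[next_++];
        if (step.status == Status::Ok) pStat->idChunk = step.id;
        return step.status;
    }

private:
    std::vector<Step> steps_;
    std::size_t next_ = 0;
};

struct Summary {
    std::vector<unsigned long> ids;  // identifiers of chunks that were fetched
    int skipped = 0;                 // chunks that reported an error other than end-of-chunks
};

// Calls GetChunk until the end-of-chunks status is returned. Any other error
// skips one chunk; the next call moves on to the chunk after it.
Summary EnumerateChunks(MockFilter& filter) {
    Summary out;
    for (;;) {
        StatChunk stat;  // the descriptor is owned by the caller
        const Status status = filter.GetChunk(&stat);
        if (status == Status::EndOfChunks) break;
        if (status != Status::Ok) { ++out.skipped; continue; }
        out.ids.push_back(stat.idChunk);
    }
    return out;
}
```

With a script of one good chunk, one unavailable embedding, and another good chunk, EnumerateChunks collects the two good identifiers and counts one skipped chunk.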

A description of the current chunk is placed in *pStat. This structure is defined as follows:

typedef enum tagCHUNKSTATE
{
    CHUNK_TEXT       = 0x1,
    CHUNK_VALUE      = 0x2
} CHUNKSTATE;
 
typedef enum tagCHUNK_BREAKTYPE
{
    CHUNK_NO_BREAK = 0,
    CHUNK_EOW      = 1,
    CHUNK_EOS      = 2,
    CHUNK_EOP      = 3,
    CHUNK_EOC      = 4
} CHUNK_BREAKTYPE;
 
typedef struct tagSTAT_CHUNK
{
    ULONG              idChunk;
    CHUNK_BREAKTYPE    breakType;
    CHUNKSTATE         flags;
    LCID               locale;
    FULLPROPSPEC       attribute;
    ULONG              idChunkSource;
    ULONG              cwcStartSource;
    ULONG              cwcLenSource;
} STAT_CHUNK;

The chunk identifier for this chunk is returned in idChunk. It must be unique among all chunk identifiers returned by GetChunk during a single instantiation of IFilter. Chunk identifiers must be in increasing order. The order in which chunks are returned should correspond to the order of their text in the source document. Some search engines may take advantage of the inter-attribute proximity exposed between chunks of various attributes.
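
A filter author might check the identifier rules with a small test helper such as the one below. It is a sketch only: ChunkIdsAreValid is a hypothetical name, and rejecting 0 relies on the statement later in this topic that 0 is an invalid chunk identifier.

```cpp
#include <cassert>
#include <vector>

// Returns true if the identifiers are non-zero and strictly increasing.
// Strictly increasing identifiers are necessarily unique as well.
bool ChunkIdsAreValid(const std::vector<unsigned long>& ids) {
    unsigned long previous = 0;  // 0 is treated as an invalid chunk identifier
    for (unsigned long id : ids) {
        if (id <= previous) return false;
        previous = id;
    }
    return true;
}
```

Gaps between identifiers are accepted, because the text requires only that identifiers increase, not that they be consecutive.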

The breakType field contains the type of break that precedes this chunk. These are defined as follows:

CHUNK_NO_BREAK: This means that there is no break placed between this chunk and the previous chunk; the chunks are glued together. All of the information in pStat except for breakType and idChunk is taken from the most recent STAT_CHUNK that did not specify CHUNK_NO_BREAK. The other fields in pStat are not modified. On exit, they contain whatever value was in them on entry to GetChunk. Derived chunks cannot be glued using CHUNK_NO_BREAK. A single word cannot span more than two glued chunks.
CHUNK_EOW: This means that there is a word break placed between this chunk and the previous chunk that had the same attribute. Use of CHUNK_EOW should be minimized.
Clients of IFilter may choose a word breaking algorithm that is in conflict with CHUNK_EOW decisions made in an IFilter implementation. A content query returns optimal results when the word breaking algorithm used to split phrases in the user's query matches the algorithm used to split words in the documents. The former is always provided by the search engine. The search engine algorithm is also used to split words within a chunk, but many small chunks separated by CHUNK_EOW may affect its accuracy.
CHUNK_EOS: This means that there is a sentence break placed between this chunk and the previous chunk that had the same attribute.
CHUNK_EOP: This means that there is a paragraph break placed between this chunk and the previous chunk that had the same attribute.
CHUNK_EOC: This means that there is a chapter break placed between this chunk and the previous chunk that had the same attribute.

A change in attribute implies a word, sentence, paragraph or chapter break.

The flags field indicates whether this chunk should be treated as text (for example, a sequence of words) or value. If flags is CHUNK_TEXT then IFilter::GetText should be used to retrieve the contents of the chunk and parse it as a series of words. If flags is CHUNK_VALUE then IFilter::GetValue should be used to retrieve the value and treat it as a single property value. If the filter wishes the same text to be treated as both text and value it should be emitted twice in two different chunks.
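
The difference between the two treatments can be illustrated with a toy indexer. This is a sketch under stated assumptions: IndexTerms is a hypothetical helper, the chunk contents are passed in as a string instead of being read through IFilter::GetText or IFilter::GetValue, and splitting on white space stands in for whatever word breaker a real client would use.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Same values as the CHUNKSTATE enumeration shown earlier.
enum ChunkState { CHUNK_TEXT = 0x1, CHUNK_VALUE = 0x2 };

// Hypothetical helper: returns the terms an indexer might record for a chunk.
std::vector<std::string> IndexTerms(ChunkState flags, const std::string& contents) {
    std::vector<std::string> terms;
    if (flags == CHUNK_TEXT) {
        // Text is parsed as a series of words (white-space split, for illustration).
        std::istringstream words(contents);
        std::string word;
        while (words >> word) terms.push_back(word);
    } else {
        // A value is kept whole and treated as a single property value.
        terms.push_back(contents);
    }
    return terms;
}
```

Calling IndexTerms with CHUNK_TEXT and "Agatha Christie" yields two terms, while CHUNK_VALUE yields the single term "Agatha Christie". A filter that wants both behaviors emits the same text in two chunks, one of each kind.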

The locale field specifies the language and sub-language of this text. Chunk locale is used by document indexers to perform proper normalization of text. If the chunk is neither text nor a value of type VT_LPWSTR, VT_LPSTR or VT_BSTR, then this field is ignored.

The attribute field specifies which attribute should be applied to this chunk. If the filter wishes the same text to have more than one attribute, the filter needs to emit the text once for each attribute in separate chunks.

Take the following example that might appear in a book:

    The small detective exclaimed, "C'est finis!"
    
    Confessions
    
    The room was silent for several minutes. After thinking very hard
    about it, the young woman asked, "But how did you know?"

This section might be broken into chunks as follows:

id	Text	breakType	flags	locale	attribute
1	The small dete	N/A	CHUNK_TEXT	ENGLISH_UK	CONTENT
2	ctive exclaimed,	CHUNK_NO_BREAK	N/A	N/A	N/A
3	"C'est finis!"	CHUNK_EOW	CHUNK_TEXT	FRENCH_BELGIAN	CONTENT
4	Confessions	CHUNK_EOC	CHUNK_TEXT	ENGLISH_UK	CHAPTER_NAMES
5	Confessions	CHUNK_EOP	CHUNK_TEXT	ENGLISH_UK	CONTENT
6	The room was silent for several minutes.	CHUNK_EOP	CHUNK_TEXT	ENGLISH_UK	CONTENT
7	After thinking very hard about it, the young woman asked, "But how did you know?"	CHUNK_EOS	CHUNK_TEXT	ENGLISH_UK	CONTENT
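
The table can be read back into one text stream per attribute, which shows how glued chunks and break types interact. The sketch below is illustrative only: ChunkRow, SeparatorFor, and JoinAttribute are hypothetical names, the separators are arbitrary choices, and a row with CHUNK_NO_BREAK is assumed to inherit the attribute of the most recent row that did not specify CHUNK_NO_BREAK.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Same values as the CHUNK_BREAKTYPE enumeration shown earlier.
enum ChunkBreakType { CHUNK_NO_BREAK = 0, CHUNK_EOW = 1, CHUNK_EOS = 2, CHUNK_EOP = 3, CHUNK_EOC = 4 };

// Hypothetical flattened row: part of a chunk descriptor plus the chunk text.
struct ChunkRow {
    ChunkBreakType breakType;
    std::string attribute;  // ignored for CHUNK_NO_BREAK rows
    std::string text;
};

// Illustrative separators only; a real client applies its own word breaker.
const char* SeparatorFor(ChunkBreakType breakType) {
    switch (breakType) {
        case CHUNK_NO_BREAK: return "";
        case CHUNK_EOW:
        case CHUNK_EOS:      return " ";
        case CHUNK_EOP:      return "\n";
        case CHUNK_EOC:      return "\n\n";
    }
    return " ";
}

// Concatenates the text of every chunk that carries the wanted attribute.
std::string JoinAttribute(const std::vector<ChunkRow>& rows, const std::string& wanted) {
    std::string out;
    std::string currentAttribute;
    for (const ChunkRow& row : rows) {
        // A glued chunk keeps the attribute of the chunk it is glued to.
        if (row.breakType != CHUNK_NO_BREAK) currentAttribute = row.attribute;
        if (currentAttribute != wanted) continue;
        // The break applies relative to the previous chunk with the same attribute.
        if (!out.empty()) out += SeparatorFor(row.breakType);
        out += row.text;
    }
    return out;
}
```

For the first five rows of the table, joining on CONTENT glues chunks 1 and 2 back into "The small detective exclaimed," and joining on CHAPTER_NAMES returns only "Confessions". The break type of the first row is shown as N/A in the table, so any value can be supplied for it; the sketch never emits a separator before the first chunk of an attribute.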

If a GetChunk call to an IFilter implementation of an embedding or link returns FILTER_E_END_OF_CHUNKS, then it is the responsibility of the outer IFilter implementation to check to see if there are any more chunks outside of that embedding or link to be returned. For example, if a document has two embeddings and the first has returned FILTER_E_END_OF_CHUNKS, then the outer IFilter must call GetChunk on the IFilter for the next embedding.

In addition, before returning the results of a call to GetChunk of an embedded or linked object, the implementation must check to make sure that the chunk identifier is unique, and if it is not, renumber the chunk and keep a mapping of the new chunk identifier.
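
One way to keep identifiers unique across embeddings is for the outer filter to issue every exposed identifier from a single counter and remember how each embedded identifier was translated. The sketch below takes that simpler route and renumbers every embedded chunk, not only the colliding ones. ChunkIdRemapper and its member names are hypothetical, and using the mapping to translate values such as idChunkSource is an assumption, not something stated in the text.

```cpp
#include <cassert>
#include <map>

// Hypothetical helper owned by an outer filter.
class ChunkIdRemapper {
public:
    // Issues an identifier for a chunk produced by the outer filter itself.
    unsigned long NextOwnId() { return ++lastIssued_; }

    // Forgets the mapping of the previous embedding before a new one starts.
    void BeginEmbedding() { embeddedToOuter_.clear(); }

    // Issues a fresh identifier for a chunk reported by the embedded filter
    // and records the translation.
    unsigned long Remap(unsigned long embeddedId) {
        const unsigned long outerId = ++lastIssued_;
        embeddedToOuter_[embeddedId] = outerId;
        return outerId;
    }

    // Translates an embedded identifier seen earlier in the current embedding.
    // Returns 0, an invalid chunk identifier, if it is unknown.
    unsigned long Lookup(unsigned long embeddedId) const {
        const auto it = embeddedToOuter_.find(embeddedId);
        return it == embeddedToOuter_.end() ? 0 : it->second;
    }

private:
    unsigned long lastIssued_ = 0;
    std::map<unsigned long, unsigned long> embeddedToOuter_;
};
```

Because all identifiers come from one counter, the values seen by the caller stay unique and increasing even when two embeddings both number their chunks from 1.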

The fields

    ULONG    idChunkSource;
    ULONG    cwcStartSource;
    ULONG    cwcLenSource;

are used to describe the source of a derived chunk. If the text of the current non-contents chunk (pseudo-property or property) is derived from some contents chunk, idChunkSource is set to the identifier of the source chunk, cwcStartSource is set to the offset at which the source text for the chunk starts in the source chunk, and finally cwcLenSource is either set to zero or to the length of the source text from which the current chunk was derived. Zero signifies that there is character-by-character correspondence between the source text and the derived text. A non-zero value means that there is no such direct correspondence. This information is useful for the search engine when it wants to highlight the hits. If the query is done for a pseudo-property, the search engine highlights the original text from which the text of the property has been derived. For instance, for a C++ code filter, when searching for SampleFunction in a pseudo-property "function definitions," the browser highlights the function header inside the contents of a file. If the chunk is not derived, idChunkSource must be the same as idChunk. If the filter attributes specify a pseudo-property only, then there is no content chunk from which the current pseudo-property chunk is derived. In this case, idChunkSource must be set to 0, which is an invalid chunk identifier.
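
A client could turn these fields into a highlight range along the following lines. This is one possible reading, not a documented algorithm: DerivedChunk, SourceRange, and HighlightRange are hypothetical names, and the sketch assumes that with character-by-character correspondence a hit at offset hitOffset in the derived text maps to cwcStartSource + hitOffset in the source chunk, while without it the whole source span is highlighted.

```cpp
#include <cassert>

// Reduced stand-in for the three source fields of STAT_CHUNK.
struct DerivedChunk {
    unsigned long idChunkSource;
    unsigned long cwcStartSource;
    unsigned long cwcLenSource;
};

// Range of characters to highlight inside a contents chunk.
struct SourceRange {
    unsigned long idChunk;  // 0 means there is no contents chunk to highlight
    unsigned long start;
    unsigned long length;
};

// Maps a hit inside the derived text back to the source chunk.
SourceRange HighlightRange(const DerivedChunk& chunk,
                           unsigned long hitOffset,
                           unsigned long hitLength) {
    // No source chunk exists when only a pseudo-property was requested.
    if (chunk.idChunkSource == 0) return SourceRange{0, 0, 0};
    // Zero length: characters correspond one to one, so the hit can be shifted.
    if (chunk.cwcLenSource == 0) {
        return SourceRange{chunk.idChunkSource, chunk.cwcStartSource + hitOffset, hitLength};
    }
    // Non-zero length: no direct correspondence, so highlight the whole source span.
    return SourceRange{chunk.idChunkSource, chunk.cwcStartSource, chunk.cwcLenSource};
}
```

For a chunk derived from chunk 3 starting at offset 10, a six-character hit at derived offset 4 maps to offset 14 when cwcLenSource is 0, and to the full 25-character span starting at offset 10 when cwcLenSource is 25.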