[This is preliminary documentation and subject to change.]
SCODE IFilter::GetChunk( STAT_CHUNK * pStat );
GetChunk positions the filter at the beginning of the next chunk and returns a description of the chunk in pStat. After this call, the chunk described in pStat is the current chunk. The chunk descriptor is owned by the caller, but the property name pointer that may be set in the property specification is owned by the callee and should not be freed. Several operations (see below) can only be applied to the current chunk. Before GetChunk has been called for the first time, there is no current chunk.
When the current chunk is the last chunk, additional calls to GetChunk return FILTER_E_END_OF_CHUNKS. If the next chunk is an embedding for which a filter is not available, this call returns FILTER_E_EMBEDDING_UNAVAILABLE. If the next chunk is in an unavailable link, this call returns FILTER_E_LINK_UNAVAILABLE. Access failure may also be reported with FILTER_E_PASSWORD and FILTER_E_ACCESS. After any error return code other than FILTER_E_END_OF_CHUNKS, the next call to GetChunk still fetches the next chunk after the unavailable one.
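The calling pattern described above can be sketched as a loop that stops only on FILTER_E_END_OF_CHUNKS and keeps going after any other error. This is a minimal sketch, not the real COM interface: the Status enumeration, ChunkInfo, ScriptedFilter and DrainChunks are hypothetical stand-ins so that the example compiles without platform headers.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the SCODE values named in the text; the real
// numeric codes come from the platform headers and are not reproduced here.
enum class Status { Ok, EndOfChunks, EmbeddingUnavailable, LinkUnavailable, Password, Access };

// Reduced, hypothetical chunk descriptor: only the identifier is modelled.
struct ChunkInfo { unsigned long idChunk = 0; };

// Hypothetical filter that replays a scripted sequence of GetChunk results.
class ScriptedFilter {
public:
    explicit ScriptedFilter(std::vector<Status> script) : script_(std::move(script)) {}

    Status GetChunk(ChunkInfo* pStat) {
        assert(pStat != nullptr);  // the descriptor is owned by the caller
        if (next_ >= script_.size()) return Status::EndOfChunks;  // repeated calls keep returning this
        Status s = script_[next_++];
        if (s == Status::Ok) pStat->idChunk = ++lastId_;
        return s;
    }

private:
    std::vector<Status> script_;
    std::size_t next_ = 0;
    unsigned long lastId_ = 0;
};

struct DrainResult {
    std::vector<unsigned long> ids;  // identifiers of chunks that were fetched
    int skipped = 0;                 // chunks that were unavailable
};

// Calls GetChunk until the end-of-chunks status is returned. Any other error
// is counted and the loop continues, because the next call still fetches the
// chunk after the unavailable one.
DrainResult DrainChunks(ScriptedFilter& filter) {
    DrainResult result;
    for (;;) {
        ChunkInfo stat;
        Status s = filter.GetChunk(&stat);
        if (s == Status::EndOfChunks) break;
        if (s != Status::Ok) { ++result.skipped; continue; }
        result.ids.push_back(stat.idChunk);
    }
    return result;
}
```

A real client would do the same with IFilter::GetChunk and the FILTER_E_* codes, and would retrieve the chunk contents inside the loop before asking for the next chunk.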
A description of the active chunk is placed in *pStat. This structure is defined as follows:
typedef enum tagCHUNKSTATE
{
    CHUNK_TEXT = 0x1,
    CHUNK_VALUE = 0x2
} CHUNKSTATE;
typedef enum tagCHUNK_BREAKTYPE
{
    CHUNK_NO_BREAK = 0,
    CHUNK_EOW = 1,
    CHUNK_EOS = 2,
    CHUNK_EOP = 3,
    CHUNK_EOC = 4
} CHUNK_BREAKTYPE;
typedef struct tagSTAT_CHUNK
{
    ULONG idChunk;
    CHUNK_BREAKTYPE breakType;
    CHUNKSTATE flags;
    LCID locale;
    FULLPROPSPEC attribute;
    ULONG idChunkSource;
    ULONG cwcStartSource;
    ULONG cwcLenSource;
} STAT_CHUNK;
The chunk identifier for this chunk is returned in idChunk. It must be unique from every other chunk identifier returned by GetChunk during a single instantiation of IFilter. Chunk identifiers must be in increasing order. The order in which chunks are returned should correspond to the order of their text in the source document. Some search engines may take advantage of the inter-attribute proximity exposed between chunks of various attributes.
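The identifier rules above can be checked mechanically. The following is a minimal sketch under the assumption that identifiers are collected in the order GetChunk returned them; ChunkIdsAreValid is a hypothetical helper, not part of the interface. Requiring a strictly increasing sequence covers both uniqueness and ordering, and starting the comparison at zero also rejects 0, which this document describes as an invalid chunk identifier.

```cpp
#include <vector>

// Returns true when the identifiers are non-zero and strictly increasing,
// which implies that they are also unique.
bool ChunkIdsAreValid(const std::vector<unsigned long>& ids) {
    unsigned long previous = 0;  // 0 is never a valid chunk identifier
    for (unsigned long id : ids) {
        if (id <= previous) return false;  // zero, duplicate, or out of order
        previous = id;
    }
    return true;
}
```

A check like this could be useful in a test harness for a filter implementation, for example applied to the identifiers gathered over one complete pass through a document.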
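The breakType field contains the type of break that precedes this chunk. The possible values are those of the CHUNK_BREAKTYPE enumeration shown above: CHUNK_NO_BREAK (no break between this chunk and the previous one), CHUNK_EOW (word break), CHUNK_EOS (sentence break), CHUNK_EOP (paragraph break) and CHUNK_EOC (chapter break).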
Clients of IFilter may choose a word breaking algorithm that is in conflict with CHUNK_EOW decisions made in an IFilter implementation. A content query returns optimal results when the word breaking algorithm used to split phrases in the user's query matches the algorithm used to split words in the documents. The former is always provided by the search engine. The search engine algorithm is also used to split words within a chunk, but many small chunks separated by CHUNK_EOW may affect its accuracy.
A change in attribute implies a word, sentence, paragraph or chapter break.
The flags field indicates whether this chunk should be treated as text (for example, a sequence of words) or value. If flags is CHUNK_TEXT then IFilter::GetText should be used to retrieve the contents of the chunk and parse it as a series of words. If flags is CHUNK_VALUE then IFilter::GetValue should be used to retrieve the value and treat it as a single property value. If the filter wishes the same text to be treated as both text and value it should be emitted twice in two different chunks.
The locale field specifies the language and sub-language of this text. Chunk locale is used by document indexers to perform proper normalization of text. If the chunk is neither text nor a value of type VT_LPWSTR, VT_LPSTR or VT_BSTR, then this field is ignored.
The attribute field specifies which attribute should be applied to this chunk. If the filter wishes the same text to have more than one attribute, the filter needs to emit the text once for each attribute in separate chunks.
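As an illustration of emitting the same text once per attribute, here is a minimal sketch. It assumes a simplified chunk record in which the attribute is a plain string rather than a FULLPROPSPEC; EmittedChunk and EmitPerAttribute are hypothetical names introduced only for this example.

```cpp
#include <string>
#include <vector>

// Simplified, hypothetical chunk record: the attribute is a plain string
// here, whereas the real descriptor carries a FULLPROPSPEC.
struct EmittedChunk {
    unsigned long idChunk;
    std::string attribute;
    std::string text;
};

// Emits one chunk per attribute for the same text. nextId is advanced for
// every chunk so that identifiers stay unique and increasing.
std::vector<EmittedChunk> EmitPerAttribute(const std::string& text,
                                           const std::vector<std::string>& attributes,
                                           unsigned long& nextId) {
    std::vector<EmittedChunk> chunks;
    for (const std::string& attribute : attributes) {
        chunks.push_back({nextId++, attribute, text});
    }
    return chunks;
}
```

Called with a chapter title and two attribute names, the sketch yields two chunks that carry identical text, different attributes and consecutive identifiers.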
Take the following example that might appear in a book:
The small detective exclaimed, "C'est finis!"
Confessions
The room was silent for several minutes. After thinking very hard
about it, the young woman asked, "But how did you know?"
This section might be broken into chunks as follows:
id | Text | breakType | flags | locale | attribute |
---|---|---|---|---|---|
1 | The small dete | N/A | CHUNK_TEXT | ENGLISH_UK | CONTENT |
2 | ctive exclaimed, | CHUNK_NO_BREAK | N/A | N/A | N/A |
3 | "C'est finis!" | CHUNK_EOW | CHUNK_TEXT | FRENCH_BELGIAN | CONTENT |
4 | Confessions | CHUNK_EOC | CHUNK_TEXT | ENGLISH_UK | CHAPTER_NAMES |
5 | Confessions | CHUNK_EOP | CHUNK_TEXT | ENGLISH_UK | CONTENT |
6 | The room was silent for several minutes. | CHUNK_EOP | CHUNK_TEXT | ENGLISH_UK | CONTENT |
7 | After thinking very hard about it, the young woman asked, "But how did you know?" | CHUNK_EOS | CHUNK_TEXT | ENGLISH_UK | CONTENT |
If a GetChunk call to an IFilter implementation of an embedding or link returns FILTER_E_END_OF_CHUNKS, then it is the responsibility of the outer IFilter implementation to check to see if there are any more chunks outside of that embedding or link to be returned. For example, if a document has two embeddings and the first has returned FILTER_E_END_OF_CHUNKS, then the outer IFilter must call GetChunk on the IFilter for the next embedding.
In addition, before returning the results of a call to GetChunk of an embedded or linked object, the implementation must check to make sure that the chunk identifier is unique, and if it is not, renumber the chunk and keep a mapping of the new chunk identifier.
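One possible way to do this renumbering is sketched below. It is an assumption, not a prescribed algorithm: the outer filter keeps the last identifier it has returned, passes an embedded identifier through unchanged when it is larger than that value, and otherwise assigns the next free identifier and records the mapping. ChunkIdRenumberer and its member names are hypothetical.

```cpp
#include <map>

// Hypothetical helper for an outer filter that forwards chunks from an
// embedded filter while keeping identifiers unique and increasing.
class ChunkIdRenumberer {
public:
    // Identifier for a chunk produced by the outer filter itself.
    unsigned long IssueOwn() { return ++lastIssued_; }

    // Identifier to report for a chunk that the embedded filter numbered
    // embeddedId. The original number is kept when it does not collide.
    unsigned long Accept(unsigned long embeddedId) {
        unsigned long outerId = (embeddedId > lastIssued_) ? embeddedId : lastIssued_ + 1;
        lastIssued_ = outerId;
        if (outerId != embeddedId) renumbered_[outerId] = embeddedId;
        return outerId;
    }

    // Maps a reported identifier back to the one the embedded filter used.
    unsigned long OriginalId(unsigned long outerId) const {
        std::map<unsigned long, unsigned long>::const_iterator it = renumbered_.find(outerId);
        return it == renumbered_.end() ? outerId : it->second;
    }

private:
    unsigned long lastIssued_ = 0;
    std::map<unsigned long, unsigned long> renumbered_;
};
```

The sketch tracks a single embedded filter. An outer filter with several embeddings would also need to remember which embedding each reported identifier belongs to.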
The fields
ULONG idChunkSource;
ULONG cwcStartSource;
ULONG cwcLenSource;
are used to describe the source of a derived chunk. If the text of the current non-contents chunk (pseudo-property or property) is derived from some contents chunk, idChunkSource is set to the identifier of the source chunk, cwcStartSource is set to the offset at which the source text for the chunk starts in the source chunk, and cwcLenSource is set either to zero or to the length of the source text from which the current chunk was derived. Zero signifies that there is character-by-character correspondence between the source text and the derived text. A non-zero value means that there is no such direct correspondence.
This information is useful for the search engine when it wants to highlight the hits. If the query is done for a pseudo-property, the search engine highlights the original text from which the text of the property has been derived. For instance, for a C++ code filter, when searching for SampleFunction in a pseudo-property "function definitions," the browser highlights the function header inside the contents of a file.
If the chunk is not derived, idChunkSource must be the same as idChunk. If the filter attributes specify a pseudo-property only, then there is no content chunk from which the current pseudo-property chunk is derived. In this case, idChunkSource must be set to 0, which is an invalid chunk identifier.
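The way a search engine might turn these three fields into a highlight range can be sketched as follows. This is one reading of the rules above, under stated assumptions: with cwcLenSource equal to zero, a hit inside the derived text is mapped character by character onto the source chunk, and with a non-zero cwcLenSource the whole source span is highlighted because no finer mapping exists. SourceInfo, Highlight and HighlightFor are hypothetical names; only the four field names come from STAT_CHUNK.

```cpp
// Subset of the STAT_CHUNK fields that matter for highlighting.
struct SourceInfo {
    unsigned long idChunk;
    unsigned long idChunkSource;
    unsigned long cwcStartSource;
    unsigned long cwcLenSource;
};

// Hypothetical highlight range: a character span inside one chunk.
struct Highlight {
    unsigned long chunkId;
    unsigned long start;
    unsigned long length;
};

// Computes the span to highlight for a hit at [hitOffset, hitOffset + hitLength)
// in the text of the given chunk. Returns false when the chunk has no source
// chunk (idChunkSource == 0), in which case *out is left untouched.
bool HighlightFor(const SourceInfo& chunk, unsigned long hitOffset,
                  unsigned long hitLength, Highlight* out) {
    if (chunk.idChunkSource == 0) return false;  // pseudo-property without a content chunk
    if (chunk.cwcLenSource == 0) {
        // Character-by-character correspondence: shift the hit into the source.
        *out = Highlight{chunk.idChunkSource, chunk.cwcStartSource + hitOffset, hitLength};
    } else {
        // No direct correspondence: highlight the whole source span.
        *out = Highlight{chunk.idChunkSource, chunk.cwcStartSource, chunk.cwcLenSource};
    }
    return true;
}
```

For a chunk that is not derived, idChunkSource equals idChunk, so the same function highlights the hit inside the chunk itself, provided the filter left cwcStartSource and cwcLenSource at zero.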