IFilter: Sample of Indexing Document Content


The IFilter interface was designed primarily to provide a uniform mechanism for extracting character streams from formatted data. Its goal is to give independent software vendors (ISVs) an interface that extracts text as the first step in content indexing of document data. IFilter can be implemented over any document format, and the ISV can choose any API or interface to read that format. For example, a content filter can read data through the Win32 file APIs or through the OLE storage interfaces.
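The separation between the indexing system and the document format can be illustrated with a small, portable sketch. Everything in it (ToyChunk, ToyFilter, LineFilter, collect_text) is a hypothetical stand-in invented for illustration. It is not the IFilter COM interface and does not use any Win32 or OLE API. It only mirrors the idea that a consumer pulls text chunk by chunk through one interface while each format supplies its own implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins: these types are NOT the real IFilter COM interface.
struct ToyChunk {
    unsigned id = 0;      // nonzero identifier, unique within one filter instance
    std::string text;     // text carried by this chunk
};

class ToyFilter {
public:
    virtual ~ToyFilter() = default;
    // Fills `out` and returns true while chunks remain; returns false at the end.
    virtual bool next_chunk(ToyChunk& out) = 0;
};

// One format-specific implementation: every non-empty line of an in-memory
// plain-text buffer becomes one chunk.
class LineFilter : public ToyFilter {
public:
    explicit LineFilter(std::string data) : data_(std::move(data)) {}

    bool next_chunk(ToyChunk& out) override {
        while (pos_ < data_.size()) {
            std::size_t end = data_.find('\n', pos_);
            if (end == std::string::npos) end = data_.size();
            std::string line = data_.substr(pos_, end - pos_);
            pos_ = end + 1;                       // step past the newline
            if (!line.empty()) {
                out.id = ++next_id_;
                out.text = std::move(line);
                return true;
            }
        }
        return false;
    }

private:
    std::string data_;
    std::size_t pos_ = 0;
    unsigned next_id_ = 0;
};

// The consumer sees only the abstract interface, never the document format.
std::vector<std::string> collect_text(ToyFilter& filter) {
    std::vector<std::string> pieces;
    ToyChunk chunk;
    while (filter.next_chunk(chunk)) {
        assert(chunk.id != 0);                    // toy invariant for chunk ids
        pieces.push_back(chunk.text);
    }
    return pieces;
}
```

In this sketch, a filter for another format would derive from ToyFilter and read its data however it likes, and collect_text would not need to change.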

Any software author who stores textual data should consider implementing a content filter for the document format to allow content indexing systems to extract text.

The sample filter in this directory extracts text and properties from HTML pages. In addition to raw content, headings (levels 1 to 6), the title, and anchors are emitted as pseudo-properties. The title is also published as a full property available via IFilter::GetValue.
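The idea of separating raw content from title and heading text can be shown with a minimal, portable sketch. It is not the sample's scanner and does not implement IFilter. The names Kind, Chunk, tag_name, trim, and extract_chunks are hypothetical, and the parsing is deliberately naive: anchors, comments, scripts, character entities, and attribute values are not handled, and any tag other than title or h1 to h6 is simply skipped.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical chunk categories for this sketch only.
enum class Kind { Content, Title, Heading };

struct Chunk {
    Kind kind;
    int level;            // 1..6 for headings, 0 otherwise
    std::string text;
};

// Lower-cased tag name taken from the inside of a tag: "/H1 x=y" -> "/h1".
static std::string tag_name(const std::string& inside) {
    std::string name;
    for (char c : inside) {
        if (std::isspace(static_cast<unsigned char>(c))) break;
        name += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return name;
}

// Removes leading and trailing whitespace.
static std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    std::size_t b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    std::size_t e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

std::vector<Chunk> extract_chunks(const std::string& html) {
    std::vector<Chunk> chunks;
    Kind kind = Kind::Content;
    int level = 0;
    std::string text;

    // Emits the text gathered so far under the current kind, if any.
    auto flush = [&]() {
        assert(level >= 0 && level <= 6);
        std::string t = trim(text);
        if (!t.empty()) chunks.push_back(Chunk{kind, level, t});
        text.clear();
    };

    std::size_t i = 0;
    while (i < html.size()) {
        if (html[i] != '<') { text += html[i++]; continue; }
        std::size_t close = html.find('>', i);
        if (close == std::string::npos) break;    // unterminated tag: drop the tail
        std::string name = tag_name(html.substr(i + 1, close - i - 1));
        i = close + 1;

        const bool open_h = name.size() == 2 && name[0] == 'h' &&
                            name[1] >= '1' && name[1] <= '6';
        const bool close_h = name.size() == 3 && name[0] == '/' &&
                             name[1] == 'h' && name[2] >= '1' && name[2] <= '6';
        if (name == "title") {
            flush(); kind = Kind::Title; level = 0;
        } else if (open_h) {
            flush(); kind = Kind::Heading; level = name[1] - '0';
        } else if (name == "/title" || close_h) {
            flush(); kind = Kind::Content; level = 0;
        }
        // Any other tag is skipped; surrounding text stays in the current chunk.
    }
    flush();
    return chunks;
}
```

For an input such as `<title>My Page</title><H2>Intro</H2><p>Hello world</p>`, this sketch yields one Title chunk, one level-2 Heading chunk, and one Content chunk, in that order.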

Building SDK Samples

This sample uses the following keywords:

_buf; _bytesreadfrommmbuffer; _cattributes; _cb; _ccharsreadahead; _charsreadfromtranslatedbuffer; _chrefchars; _chrefcharsfiltered; _clsidmetainfo; _clsidscriptinfo; _cpropchars; _cpropcharsfiltered; _ctagcharsread; _ctextchars; _cvaluechars; _cvaluecharsfiltered; _cwctranslatedbuffer; _dbginfo; _estate; _etoktype; _etoktypenext; _ffiltercontent; _ffiltermetatag; _ffilterscripttag; _fnomoretext; _fnonhtmlfile; _fungotchar; _fwrite; _guidpropset; _hfile; _hkey; _hmap; _htmlelementbag; _htmlifilter; _idchunk; _lerror; _locale; _pattributes; _pcurchar; _phashentrynext; _phtmlelement; _pmmstream; _propspec; _pstream; _pwchrefbuf; _pwcpropbuf; _pwcvaluebuf; _pwszfilename; _scanner; _serialstream; _set_se_translator; _ucursize; _ulchunkid; _ulcodepage; _ulenhrefbuf; _ulenpropbuf; _ulentagbuf; _ulidcontentchunk; _umaxsize; _wch; _wcsicmp; _wcsnicmp; _wcspath; addelement; addref; bindregion; canchortag; casttostruct; catch; cexception; cfullpropspec; changestate; chtmlelembagentry; chtmlelement; chtmlelementbag; chtmlifilter; chtmlifilterbase; chtmlifiltercf; chtmlscanner; cimagetag; cinputtag; closehandle; cmemorymappedinputstream; cmetatag; cmmstream; cmmstreambuf; cmmstreamconsecbuf; commonpageround; comp##inlinedebugout; concatenateproperty; cotaskmemalloc; cotaskmemfree; cpropertytag; createfile; createfilea; createfilemapping; createfilemappinga; createinstance; cregaccess; cscripttag; cserialstream; cspecialcharhashentry; cspecialcharhashtable; cstartoffileelement; ctextelement; ctitletag; debugbreak; declare_debug; declare_infolevel; defined; delete; disablethreadlibrarycalls; dllcanunloadnow; dllgetclassobject; dllmain; eatblanks; eattag; eattext; eof; ffiltercontent; flush; get; getblockofchars; getchar; getchunk; getclassid; getcurfile; getcurhtmlelement; geterrorcode; gethtmlelement; getinfo; getlasterror; getlocale; getlocaleinfoa; getlocaleinfow; getname; getnextchunkid; getnexthashentry; getoleerror; getosversion; getpropertyname; getpropertypropid; 
getpropset; getpropspec; getsystemdefaultlcid; gettext; gettokentype; getvalue; getversionexa; getwidechar; growtagbuffer; hash; hresult_from_win32; htmldebugout; ifilter_addref; ifilter_bindregion; ifilter_bindregion_proxy; ifilter_bindregion_stub; ifilter_getchunk; ifilter_getchunk_proxy; ifilter_getchunk_stub; ifilter_gettext; ifilter_gettext_proxy; ifilter_gettext_stub; ifilter_getvalue; ifilter_getvalue_proxy; ifilter_getvalue_stub; ifilter_init; ifilter_init_proxy; ifilter_init_stub; ifilter_queryinterface; ifilter_release; initfilterregion; initosversion; initstatchunk; interlockeddecrement; interlockedincrement; iscreateexisting; isdirty; isempty; ismatchproperty; isnonhtmlfile; isoleerror; ispropertyname; ispropertypropid; isstarttoken; isstoptoken; isunicodenumber; isvalid; iswdigit; iswritable; iswspace; llfromuls; load; localetocodepage; lockserver; lookup; map; mbstowcs; memcmp; memcpy; memset; midl_user_allocate; midl_user_free; multibytetowidechar; newk; ok; open; outputdebugstringa; psheading1; psheading2; psheading3; psheading4; psheading5; psheading6; pstitle; queryelement; queryhtmlelement; queryinterface; readtagintobuffer; regclosekey; regmetainfo; regopenkey; regopenkeya; regqueryvalueex; regqueryvalueexa; regscriptinfo; release; rewind; rtlcopymemory; save; savecompleted; scantag; scantagbuffer; set; setbuf; setendoffile; sethtmlelement; setinfo; setlasterrorex; setlocale; setnexthashentry; setproperty; setpropid; setpropset; setsize; setstarttokenflag; settokentype; shrinkfromfront; size; sizehigh; sizelow; skipcharsuntilnextrelevanttoken; skipremainingtextandgotonextchunk; sprintf; stringtoclsid; succeeded; switch; switchtonexthtmlelement; systemexceptiontranslator; tagnametotoken; throw; towlower; ungetchar; unmap; unmapviewoffile; va_end; va_start; vsprintf; wcscpy; wcslen; wcsncmp; wcstombs; wcstoul; win4assert; win4assertex