Content Search Operators

[This is preliminary documentation and subject to change.]

The following operators describe search operations on columns containing text. These operators behave like any other Boolean operators. However, when they are used in conjunction with the DBOP_content_select operator, the resulting table includes additional special columns describing the hit count and rank.

DBOP_content_select

This operator is similar to the DBOP_select operator. It takes a table T and a Boolean expression (the selection condition) as inputs, and produces a table R as output. The rows in R are those rows of T that satisfy the selection condition. However, the columns of R include all the columns of T plus three additional columns: the hit count, a rank, and a rank vector, which together represent the quality of the match.

DBOP_content, DBOP_content_freetext

These operators are used to express content searches for words or phrases found in a column. The operators take one required input, a column name, specifying the column in which to search. There are two special column identifiers. The first, PROPID_STG_CONTENTS, means "look in the object contents" for data providers whose rows are actually objects. The second, PROPID_QUERY_ALL, means "look in all columns." The text search pattern is stored in the DBCONTENT structure within the node (see below).

DBOP_content_freetext differs from DBOP_content in that it instructs the query engine to apply all of the text-processing techniques available to it when evaluating the restriction.

The DBOP_content node may specify a generation method (this field is ignored for DBOP_content_freetext), which determines how the initial query term is expanded before the text is searched. GENERATE_METHOD_EXACT matches only the word or phrase specified in the content node. GENERATE_METHOD_PREFIX matches all words that begin with the text specified in the node. GENERATE_METHOD_INFLECT expands the specified word into its linguistic inflections. The English word "run", for example, may expand to "run", "running", "ran", "runs", and so on. Providers may support other generation methods (e.g., GENERATE_METHOD_MISSPELLING, GENERATE_METHOD_REGEX). The generation methods are mutually exclusive: a node specifies only one of them. Not all of the generation methods listed above are available for every language. If GENERATE_METHOD_PREFIX or GENERATE_METHOD_INFLECT is used with a phrase (multiple words), the expansion applies to each word separately. The exact method of splitting a phrase into words is implementation-specific.

These nodes also have a weight field, which a sophisticated consumer may use to give more or less weight in hit ranking to matches on this particular content restriction. The output of these operators is Boolean. The arguments internal to this node are specified using the following structure:

typedef struct tagDBCONTENT {
    DWORD      dwGenerateMethod;    // exact, prefix, inflect
    LONG       lWeight;             // weight of node
    LCID       lcid;                // locale
    LPWSTR     pwszPhrase;          // text
} DBCONTENT;
 
#define GENERATE_METHOD_EXACT     ( 0 )
#define GENERATE_METHOD_PREFIX    ( 1 )
#define GENERATE_METHOD_INFLECT   ( 2 )
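
As an illustration of how the generation method affects matching, the following is a minimal sketch, not the query engine's implementation. It fills a DBCONTENT structure and tests a single candidate word against it. The fixed-width typedefs stand in for the Windows types so the sketch compiles on its own, and the helper names make_content and content_matches_word are hypothetical. The sketch assumes a single-word pattern and case-sensitive comparison; it does not attempt inflection, which requires language-specific data.

```c
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Stand-in typedefs (assumption): on Windows these come from the SDK headers. */
typedef uint32_t DWORD;
typedef int32_t  LONG;
typedef uint32_t LCID;
typedef wchar_t *LPWSTR;

typedef struct tagDBCONTENT {
    DWORD  dwGenerateMethod;    /* exact, prefix, inflect */
    LONG   lWeight;             /* weight of node */
    LCID   lcid;                /* locale */
    LPWSTR pwszPhrase;          /* text */
} DBCONTENT;

#define GENERATE_METHOD_EXACT     ( 0 )
#define GENERATE_METHOD_PREFIX    ( 1 )
#define GENERATE_METHOD_INFLECT   ( 2 )

/* Hypothetical helper: fills the structure. The caller keeps ownership
   of the phrase buffer. */
DBCONTENT make_content(LPWSTR phrase, DWORD method, LONG weight, LCID lcid)
{
    DBCONTENT c;
    c.dwGenerateMethod = method;
    c.lWeight          = weight;
    c.lcid             = lcid;
    c.pwszPhrase       = phrase;
    return c;
}

/* Hypothetical helper: returns 1 if `word` matches the node's pattern,
   0 if it does not, and -1 if this sketch does not handle the method. */
int content_matches_word(const DBCONTENT *node, const wchar_t *word)
{
    size_t n = wcslen(node->pwszPhrase);

    switch (node->dwGenerateMethod) {
    case GENERATE_METHOD_EXACT:
        /* Only the exact word matches. */
        return wcscmp(word, node->pwszPhrase) == 0;
    case GENERATE_METHOD_PREFIX:
        /* Any word that begins with the pattern matches. */
        return wcsncmp(word, node->pwszPhrase, n) == 0;
    default:
        /* Inflection and provider-specific methods are out of scope here. */
        return -1;
    }
}
```

With the pattern "run", the prefix method in this sketch accepts "running" but rejects "ran", whereas the exact method accepts only "run". The weight value passed to make_content is arbitrary; the scale of lWeight is not defined in this section.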

PSGUID_QUERY is the GUID of the property set for special content columns. The following are some of the more useful columns in this set.

#define PROPID_QUERY_RANKVECTOR (0x2) // column used to return rank
                                      // values of the
                                      // content_vector_or operator
 
#define PROPID_QUERY_RANK (0x3) // column used to return the final
                                // rank of each row
 
#define PROPID_QUERY_HITCOUNT (0x4) // column used to return the
                                    // number of content hits found
                                    // in a row
 
#define PROPID_QUERY_ALL    (0x6) // search in all text associated
                                  // with a row
 
#define PROPID_STG_CONTENTS (0x13) // search inside the contents of
                                   // an object

DBOP_content_proximity

Proximity nodes are used to rank hits within queries that search for several different pieces of text. The rank of an object matched by a proximity node increases as the matched words and phrases occur closer together. The formula used to compute the rank is specific to a given implementation. Generally, a proximity node should be used as an n-ary AND node among multiple content restrictions. The internal arguments of this node are a proximity unit (DWORD), a proximity distance (ULONG), and a weight on the node (LONG), all specified with the DBCONTENTPROXIMITY structure.

typedef struct tagDBCONTENTPROXIMITY {
    DWORD dwProximityUnit;     // units
    ULONG ulProximityDistance; // how near is near?
    LONG  lWeight;             // node weight
} DBCONTENTPROXIMITY;

The following proximity units may be supported:

#define PROXIMITY_UNIT_WORD          ( 0 )
#define PROXIMITY_UNIT_SENTENCE      ( 1 )
#define PROXIMITY_UNIT_PARAGRAPH     ( 2 )
#define PROXIMITY_UNIT_CHAPTER       ( 3 )

The node takes two or more input subtrees, each of which should contain DBOP_content, DBOP_content_freetext, DBOP_content_proximity, DBOP_and, DBOP_or, or DBOP_not nodes. The output of the node is Boolean. It is logically similar to an AND, in the sense that, to produce a true value, all input subtrees must evaluate to true.
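
The following minimal sketch shows one way a consumer might fill in the DBCONTENTPROXIMITY arguments before attaching them to a proximity node. The fixed-width typedefs stand in for the Windows types, and init_proximity is a hypothetical helper. It rejects any unit outside the four listed above, which is an assumption of this sketch, since a provider may define further units.

```c
#include <stdint.h>

/* Stand-in typedefs (assumption): on Windows these come from the SDK headers. */
typedef uint32_t DWORD;
typedef uint32_t ULONG;
typedef int32_t  LONG;

typedef struct tagDBCONTENTPROXIMITY {
    DWORD dwProximityUnit;     /* units */
    ULONG ulProximityDistance; /* how near is near? */
    LONG  lWeight;             /* node weight */
} DBCONTENTPROXIMITY;

#define PROXIMITY_UNIT_WORD          ( 0 )
#define PROXIMITY_UNIT_SENTENCE      ( 1 )
#define PROXIMITY_UNIT_PARAGRAPH     ( 2 )
#define PROXIMITY_UNIT_CHAPTER       ( 3 )

/* Hypothetical helper: fills *p and returns 0, or returns -1 without
   touching *p when the unit is not one of the four units listed above. */
int init_proximity(DBCONTENTPROXIMITY *p, DWORD unit, ULONG distance, LONG weight)
{
    if (unit > (DWORD)PROXIMITY_UNIT_CHAPTER)
        return -1;

    p->dwProximityUnit     = unit;
    p->ulProximityDistance = distance;
    p->lWeight             = weight;
    return 0;
}
```

For example, init_proximity(&prox, PROXIMITY_UNIT_WORD, 5, 100) describes a restriction measured in words with a distance of 5. How the provider interprets the distance and the weight when ranking is implementation-specific, as noted above.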

DBOP_content_vector_or

This operator is used to support advanced content retrieval models. The node acts as an n-ary OR node, with two main differences. First, when a vector node is used, the programmer has the option of retrieving a rank vector for each object as a special, well-known column. This vector contains the rank of each child node of the vector node. Second, the formula used to compute the overall rank of the vector may be specified. The following ranking formulas may be supported:

Minimum/AND (VECTOR_RANK_MIN)

Maximum/OR (VECTOR_RANK_MAX)

Inner product (VECTOR_RANK_INNER)

Dice coefficient (VECTOR_RANK_DICE)

Jaccard coefficient (VECTOR_RANK_JACCARD)

Cosine (VECTOR_RANK_COSINE)

There may be at most one vector node in the tree. The internal arguments of this node are a ranking method (DWORD) and a weight on the node (LONG), specified within the DBCONTENTVECTOR structure.

typedef struct tagDBCONTENTVECTOR {
    DWORD dwRankingMethod;    // jaccard, cosine, etc.
    LONG  lWeight;            // weight of the vector node
} DBCONTENTVECTOR;

The node takes two or more child subtrees, of exactly the same types as those accepted by DBOP_content_proximity above. The output is Boolean. The node acts as an n-ary OR of the input subtrees.
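
To make the ranking formulas more concrete, the following sketch combines a rank vector into one overall rank using the textbook definitions of several of the measures named above. It is an illustration under stated assumptions, not the formula any provider is known to use. The enum values are sketch-only identifiers, because this section does not list numeric values for the VECTOR_RANK_* constants. The sketch takes r as the per-child ranks and w as the per-child weights, both as doubles. Minimum and maximum look only at r, and the other measures treat w and r as vectors. Cosine is omitted. Under the same assumptions it would divide the inner product by the square root of the product of the two sums of squares.

```c
#include <stddef.h>

/* Sketch-only identifiers (assumption), standing in for VECTOR_RANK_MIN,
   VECTOR_RANK_MAX, VECTOR_RANK_INNER, VECTOR_RANK_DICE, VECTOR_RANK_JACCARD. */
enum sketch_rank_method {
    SKETCH_RANK_MIN,
    SKETCH_RANK_MAX,
    SKETCH_RANK_INNER,
    SKETCH_RANK_DICE,
    SKETCH_RANK_JACCARD
};

/* Hypothetical helper: combines n child ranks r[] with child weights w[]
   into a single value. Returns 0.0 for an empty vector or a zero denominator. */
double combine_ranks(enum sketch_rank_method method,
                     const double *w, const double *r, size_t n)
{
    double lo, hi, denom;
    double wr = 0.0, ww = 0.0, rr = 0.0;
    size_t i;

    if (n == 0)
        return 0.0;

    lo = hi = r[0];
    for (i = 0; i < n; i++) {
        if (r[i] < lo) lo = r[i];
        if (r[i] > hi) hi = r[i];
        wr += w[i] * r[i];   /* inner product of weights and ranks */
        ww += w[i] * w[i];   /* sum of squared weights */
        rr += r[i] * r[i];   /* sum of squared ranks */
    }

    switch (method) {
    case SKETCH_RANK_MIN:
        return lo;
    case SKETCH_RANK_MAX:
        return hi;
    case SKETCH_RANK_INNER:
        return wr;
    case SKETCH_RANK_DICE:
        denom = ww + rr;
        return denom != 0.0 ? 2.0 * wr / denom : 0.0;
    case SKETCH_RANK_JACCARD:
        denom = ww + rr - wr;
        return denom != 0.0 ? wr / denom : 0.0;
    }
    return 0.0;
}
```

With weights {1, 2} and ranks {2, 1}, the inner product in this sketch is 4, the Dice value is 8/10, and the Jaccard value is 4/6. A real provider may scale ranks to integers or apply the child weights differently.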