The CONTAINS Predicate

CONTAINS Predicate

The CONTAINS predicate determines whether the content of files contains certain words and phrases. The CONTAINS predicate syntax is as follows:

property: The property to be searched. Its data type must be character-based. If property is not specified, the CONTENTS property is assumed. The value of the CONTENTS clause is the contents of the file after conversion to plain text (if necessary) by a filter. It is good practice to explicitly code the CONTENTS clause, rather than to accept it as a default.
simple_term: The term used to match the exact word or phrase being searched for. Its syntax is:

where word refers to one or more characters without spaces or punctuation and phrase refers to multiple words with spaces in between. Asian languages can have phrases made up of multiple words without any spaces in between.

Here are some examples of simple terms used in the context of the CONTAINS predicate in a WHERE clause:

WHERE CONTAINS( Contents, 'hockey' )

WHERE CONTAINS( Contents, ' "ice hockey" ')

In keeping with the standard for full-text products, the search for characters in a word or phrase is always case-insensitive, and noise words are not stored in the full-text index. For example, suppose there is one file with a value of “This is a dissertation on the use of ice-cream sandwiches as hockey pucks” and another file with the value “Dissertation on new ways of splitting the atom.” Since “this,” “is,” “a,” and so on are noise words, they are not stored in the full-text index, and the following two queries are effectively identical:

CONTAINS( Contents, ' "this is a dissertation" ' )

CONTAINS( Contents, 'dissertation' )

Both rows are returned as hits because in the first query, the noise words are removed before processing the query.

Combining Terms Using Boolean Operators

As with other SQL search conditions, more complex conditions can be specified by linking individual operands with the Boolean operators AND, OR, and AND NOT. In this case, the operands are any of the types of terms being discussed (simple_term, prefix_term, and so on). Except for the restriction that the OR NOT combination is not supported and that NOT cannot be specified before the first term, the rules are exactly the same as those used to combine individual predicates to form search conditions. For example, parentheses may be used to change the default priority order in which the operators are applied.

Here are some examples of simple terms being combined within a CONTAINS predicate in a WHERE clause.

WHERE CONTAINS( Contents, 'hockey OR curling' )

WHERE CONTAINS( Contents, 'hockey AND NOT field')

WHERE CONTAINS( Contents, ' ("ice hockey" OR curling) AND NOT Canada ')

prefix_term

prefix_term is used to match words or phrases that begin with specified text. Its syntax is:

A prefix term consists of a simple term appended with an asterisk (*) to activate prefix matching on a word or phrase. All text that starts with the material before the * is matched. The wildcard symbol (*) in this case is similar to the percent symbol (%) in the LIKE predicate in that it matches zero, one, or more characters of the root words in a word or phrase. In a phrase, each word within the phrase is considered to be a prefix; for example, the term “local bus*” matches “locality busy,” “local bush,” and “locale bust.”

Here are two examples of prefix terms used in the context of the CONTAINS predicate in a WHERE clause. The first matches values that contain the word “atom,” “atomic,” “atomism,” “atomy,” and so on. The second matches values that contain the word “wine,” “vine,” or words such as “winery,” “wines,” “vineyard,” or “vinegar.”

WHERE CONTAINS( Contents, ' "atom*" ' )

WHERE CONTAINS( Contents, ' "wine*" OR "vine*" ')

proximity_term

proximity_term is used when the words or phrases being searched for must be close to one another. Its syntax is:

A proximity term is similar to an AND operator in that more than one word or phrase must exist in the value being searched. It differs from AND because the relevance of the match increases as the words appear closer together.

The syntax is designed to be extensible for possible future support for specification of units of proximity such as words, sentences, paragraphs, chapters, and so on.

NEAR, NEAR() and ~ share the same meaning: the first word or phrase is close to the second word or phrase. “Close” is a purposefully vague term that can mean “within 50 words,” but the algorithm is complicated. While words within the same sentence are one word distance apart, larger distances are assigned between units such as sentences, paragraphs, and chapters. Even if words or phrases are very far apart, the query is still considered to be satisfied; the row just has a low (zero) rank value. This can be avoided by specifying RANK > 0 as one of the predicates in the WHERE clause.

It is possible to chain-code the proximity matching. For example,” a ~ b ~ c “ means that a should be near b, which should be near c. Because of the fuzzy nature of full-text searches, it is often desirable to see the rank values. This can be done by including the RANK property in the select list of the query.

Here are some examples of proximity terms used in the context of the CONTAINS predicate in a WHERE clause:

WHERE CONTAINS( Contents, ' hockey ~ player ' )

This matches values that contain the word “hockey” in proximity to the word “player.”

WHERE CONTAINS( Contents, ' hockey ~ "play*" ')

This matches values that contain the word “hockey” in proximity to a word that starts with “play.”

WHERE CONTAINS( Contents, '  "great*"  ~ "Mike Nash" ') AND Rank > 0

This matches values that contain words starting with “great” in proximity to the phrase “Mike Nash.” Values that meet the criteria but have a ranking of 0 do not have rows returned.

generation_term

generation_term is used when the words being searched for need to be expanded to include the variants of the original word. Its syntax is:

The INFLECTIONAL predicate means that plural and singular forms of nouns and the various tenses of verbs will be matched. A single term will not match both exclusive noun and exclusive verb forms. The syntax is designed to be extensible enough to handle other linguistically generated forms, such as derivational, soundex, and thesaurus.

Here is an example of a generation term used in the context of the CONTAINS predicate in a WHERE clause.

WHERE CONTAINS(' FORMSOF (INFLECTIONAL, skate) ' )

This matches values that contain words such as “skate,” “skates,” “skated,” and “skating.”

weighted_term

weighted_term is used for queries that match a list of words and phrases, each optionally given its own weighting. Matching values must match only one element in the list. Its syntax is as follows, where n.nnn represents a decimal constant from zero through one.

A row is returned if there is a match on any one of the ISABOUT elements.

Each component in the vector can be optionally weighted. The assigned weight forces a different measurement of the rank value that is assigned to each row that matches the query.

Here are some examples of weighted terms used in the context of the CONTAINS predicate in a WHERE clause.

WHERE CONTAINS( Contents, ' ISABOUT(hockey, puck, goalie) ' )

This matches article values that contain any of the words “hockey,” “puck,” or “goalie.” The better matches will contain more than one of the words.

WHERE CONTAINS( Contents, 'ISABOUT("Canadian ice hockey" WEIGHT(1.0),

                                   "ice hockey" WEIGHT(.5),

                                   hockey WEIGHT(.2) )

' )

This matches article values that may have information about Canadian ice hockey, with higher rank values assigned to articles that have more words from the phrase.