Guidelines for Using Speech Recognition

Platform SDK: Web Telephony Engine

The role of a speech-recognition (SR) engine is to "listen" to the words a caller speaks, translate them into text, and pass that text to the WTE. SR can be a convenient and natural way for callers to enter data into a WTE application. For example, instead of requiring the caller to press a DTMF key, you can use SR to let the caller say the digit, such as "one" or "two". You can also let callers speak the name of the item they want to select, such as "the shoe department" or "toys".
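The following Python sketch models the idea that a pressed DTMF key, a spoken digit word, or a spoken item name can all resolve to the same menu choice. It is a conceptual illustration only: `to_dtmf`, `select`, `SPOKEN_DIGITS`, and `MENU` are hypothetical names, not part of the WTE or SAPI, and the sketch assumes the SR engine has already turned the caller's speech into text.

```python
from typing import Optional

# Hypothetical mapping from spoken digit words to DTMF keys (not a WTE/SAPI table).
SPOKEN_DIGITS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}

# Hypothetical menu: DTMF key -> item name the caller may also speak.
MENU = {"1": "shoe department", "2": "toys"}


def to_dtmf(text: str) -> Optional[str]:
    """Return the DTMF digit for a pressed key or a spoken digit word, else None."""
    token = text.strip().lower()
    if len(token) == 1 and token in "0123456789":
        return token  # the caller pressed a digit key
    return SPOKEN_DIGITS.get(token)  # the caller spoke a digit word, or None


def select(text: str) -> Optional[str]:
    """Resolve caller input (digit, digit word, or item name) to a menu item."""
    key = to_dtmf(text)
    if key is not None:
        return MENU.get(key)  # None if no menu item is bound to that key
    token = text.strip().lower()
    # Otherwise accept the item name itself, e.g. "toys".
    for name in MENU.values():
        if token == name:
            return name
    return None


print(select("2"), select("Two"), select("toys"))  # toys toys toys
```

Routing both input styles through one lookup reflects the point that enabling SR does not take DTMF entry away from the caller.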

You may want to use SR for menus that offer many choices. For example, suppose you want the caller to select a city from a menu of 50 city names. Without SR, the caller would have to listen while Web telephony "speaks" each city's name along with the action needed to select it. With SR, your application can instead prompt the caller to say the name of the desired city, and the SR engine then tries to match the spoken name to one of the 50 possibilities.

To use SR, a SAPI 4.0a-compliant SR engine must be installed on the same computer as Web telephony. In addition, speech recognition must be enabled for the application by setting the WTEApplication.UseSR configuration parameter to TRUE. You can enable SR for an entire application or only for specific menus or entry fields. Even when SR is enabled, callers can still select menu items and enter data into entry fields by pressing DTMF keys.

SR engines have two basic modes of operation that you can use in a WTE application: menu mode and free-dictation mode. In menu mode, the application prompts the caller to select from a predefined list of menu items. When the caller speaks, the SR engine attempts to match the speech to one of the items in the menu. In free-dictation mode, there is no predefined list of items. Instead, the SR engine attempts to translate the caller's words into text as the caller speaks them.
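To make the difference between the two modes concrete, the sketch below models menu mode as choosing the closest entry from a predefined list and free-dictation mode as keeping whatever text was transcribed. A real SR engine matches audio against a grammar rather than text against text; Python's `difflib.get_close_matches` is used here only as a stand-in for that matching. `match_menu_item`, `free_dictation`, the 0.6 cutoff, and the short city list (standing in for a 50-city menu) are all hypothetical.

```python
import difflib
from typing import Optional, Sequence

# Hypothetical menu items; a real application might list 50 cities.
CITIES = ["Seattle", "Portland", "Spokane", "Boise"]


def match_menu_item(heard: str, items: Sequence[str], cutoff: float = 0.6) -> Optional[str]:
    """Menu mode (approximation): map what was heard onto one predefined item, or None."""
    by_lower = {item.lower(): item for item in items}
    # Text similarity is a stand-in for the engine's acoustic matching.
    matches = difflib.get_close_matches(heard.strip().lower(), list(by_lower), n=1, cutoff=cutoff)
    return by_lower[matches[0]] if matches else None


def free_dictation(heard: str) -> str:
    """Free-dictation mode: there is no list to match against; keep the transcribed text."""
    return heard.strip()


print(match_menu_item("portlan", CITIES))  # Portland
```

Because menu mode only has to pick among a few known items, a partially recognized word can still land on the right choice, which is one reason menu mode tends to be more accurate than free dictation.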

SR engines are most accurate in menu mode, when the choices the caller can make are well known and limited. Although free-dictation mode tends to be less accurate, it is useful for collecting small pieces of information that you cannot know in advance, or when a list of menu items would be too large to be practical. For example, you would use free-dictation mode to get the caller's age, address, telephone number, account number, and so on. For more information, see Taking Dictation.

If your application needs to gather large amounts of spoken input from the caller, such as spoken messages, consider recording the input to a wave file instead of using SR. For more information, see Recording a Spoken Message.

Few SR engines can accurately recognize all speech all the time. For this reason, it is important to ask the caller to confirm each result returned by the SR engine. The WTE provides a default method for confirming results. For more information, see Confirming the Selection.
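Because the WTE supplies its own default confirmation behavior, the following is not how the WTE implements it; it is a hypothetical Python sketch of the general confirm-or-retry pattern. `recognize` and `confirm` are assumed callbacks standing in for an SR prompt and a yes/no question to the caller, and the three-attempt limit is an arbitrary choice.

```python
from typing import Callable, Optional


def confirmed_result(
    recognize: Callable[[], Optional[str]],
    confirm: Callable[[str], bool],
    max_attempts: int = 3,
) -> Optional[str]:
    """Return an SR result only after the caller confirms it, or None."""
    for _ in range(max_attempts):
        result = recognize()  # hypothetical: prompt the caller and run SR
        if result is None:
            continue  # nothing recognized; prompt again
        if confirm(result):  # hypothetical: "You said <result>. Is that correct?"
            return result
    return None  # caller never confirmed; e.g. fall back to DTMF entry


# Simulated session: the engine mishears "Austin" as "Boston" the first time.
heard = iter(["Boston", "Austin"])
print(confirmed_result(lambda: next(heard), lambda r: r == "Austin"))  # Austin
```

Capping the number of attempts keeps a caller from being stuck in a loop when the engine repeatedly misrecognizes the same input.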