

MIND

Download the code (27KB)
Dino Esposito

The Microsoft Speech SDK
Science fiction movies frequently depict sophisticated computers that chat with humans. Until recently, talking computers in the real world existed only in advanced research labs and highly automated, mission-critical contexts. Adding speech capability to applications required special hardware and software, and the programming effort was considerable.
      Today, things are different. Writing a voice-enabled Win32® or Web application is a fairly routine task. The key to this change is the effort made in the last few years to produce the plumbing for speech synthesis and voice recognition, as well as high-level development tools.
      In this article, I'll examine some features of the Microsoft® Speech SDK 4.0, which is available with the most recent Platform SDK. I'll concentrate on the Text-to-Speech (TTS) engine, demonstrating how to employ it in Win32 applications and Web pages. I'll also provide a sample app based on an enhanced version of the Win32 MessageBox function, which can—using the same syntax as the original—repeat out loud the message shown on the screen.
      TTS is not the only type of voice-based software you can design and build with the Microsoft Speech SDK 4.0. There's also voice recognition, dictation, and telephony. Check out the SDK documentation to learn more.

What's New in Microsoft Speech SDK 4.0?

      This new release (in beta 2 at the time of this writing) includes support for telephony applications. This means you can mix the Telephony API (TAPI) and speech capability to program answering machines, voice mail, call routing systems, and in general any kind of application that works through a telephone.
      In addition, version 4.0 of the SDK boasts a number of improvements both in the raw API and the high-level wrappers. There are new functions for TTS and speech recognition, and new ActiveX® controls that facilitate the use of speech modules in Visual Basic® and Web pages. New tools, more sophisticated samples and support for continuous dictation are further enhancements.

Speech SDK Components

      The SDK is split into four basic engines: Speech Recognition (SR), Dictation Speech Recognition (DSR), Telephony (TEL), and Text-to-Speech (TTS). I'll briefly discuss the first three, and then turn to the TTS subsystem, building an example in C++.
Figure 1: SR
      Speech recognition is the ability of a computer to understand and interpret the spoken word. This engine lets you write applications that can treat the voice as an input device, just like the mouse, keyboard, and joystick. Figure 1 illustrates how the three-step process of speech recognition works. First, the digital sound captured by the sound card through a microphone is converted into a more manageable format. The converter translates the stream of amplitudes that form the digital sound wave into its frequency components. It's still a digital representation, but more akin to what a human ear really perceives. The next stage is the identification of phonemes—the elementary sounds that are the building blocks of words. Each frequency component of the sound is mapped to a specific phoneme. This process actually finishes the conversion from sounds to sentences. The final step is to analyze the string. A grammar, the list of words known to the program, lets the engine associate sets of phonemes with particular words.
      DSR is the process that converts sounds into strings while the user is speaking. DSR needs a large vocabulary that exists in context. A context-free grammar defines a specific set of words, while an in-context grammar involves a virtually endless list of words. Therefore, DSR needs more processing power than SR. Recognizing speech involves many variables, from the speaker's voice and accent to the type of application.
      There are two types of dictation-based recognition: discrete and continuous. The Speech SDK 4.0 supports continuous dictation, the more sophisticated form. In discrete dictation, the application has a defined vocabulary of recognizable words, and the user must pause for a few milliseconds between words when speaking. This limitation doesn't exist in continuous dictation, where the end of a word need not be clearly indicated by a pause. This means much more work for the engine, necessitating high optimization and computing power for good results.
      TEL applications use objects called "telephony controls" that work like the constituent controls of a dialog box. A telephony application is composed of questions and answers exchanged between the application and its users. Each of these fragments of conversation occurs under the supervision of a telephony control.
      Another way to look at telephony controls is to view the conversation as a multiple-step wizard, where each stage presents a standardized question for the user, such as: "You have three new messages. Do you want to listen to them?" Since the answers to these questions can take many forms, controls can expect yes or no answers as well as dates, times, or numbers.

Hardware and Software

      The great news about speech support is that it no longer requires special hardware. You just need a 16-bit sound card, a microphone, and (for TEL applications) a fax/modem with speech support. (The SDK documentation warns that not all sound card drivers work well with the various engines.) A Pentium-class processor usually provides sufficient speed.
      To write voice-enabled applications you must install the Speech SDK 4.0, which comes in two flavors: basic and suite. The suite configuration includes all the engines you need: SR, DSR, and TTS. However, you can always employ other vendors' engines. The basic configuration consists of only the files to compile and the binaries.
      The software you need to distribute with voice-enabled applications depends upon what the application does and what it assumes. If you need Microsoft's speech engines, the SDK is freely distributable subject to license terms.

The Text-to-Speech Engine

      The TTS engine does basically the reverse of what the SR and the DSR engines do. Its input is plain ASCII text, and its output is a mono, 8-bit, 11 kHz audio format described as Pulse Code Modulation (PCM). PCM is a commonly used method to obtain a digital representation of an analog voice signal. A PCM bit stream is composed of a sequence of numbers that are samples of the voice amplitude.
      As mentioned earlier, a basic element in any speech- related engine is the phoneme. A phoneme is an atomic unit of sound that can be used to form words. You can identify a spoken language by looking at the set of its phonemes. The TTS engine provides a way to convert from a string to its phoneme-based representation and then to an audible sound. The sound then can be played through a computer's speakers or saved to disk as a WAV file.
Figure 2: TTS
       Figure 2 shows how Microsoft's TTS engine processes its text input to produce a digital sound. Note that the engines of other vendors may work somewhat differently. The process can be broken into four steps. The first one is called text normalization, which defines how each word is to be spoken. Not all the words that appear in the plain ASCII text must be spoken the way they are written. Sometimes they need to be expanded or even replaced. Examples are numbers, abbreviations, quantities, currency, dates, times, and acronyms, which the normalizer treats as expandable macros. Engines differ in the quantity and quality of the rules applied at this step.
      When the normalizer finishes, the original text has been transformed into the collection of words that will be spoken. This set may be ambiguous since words can have different meanings and pronunciations depending upon the context. For example, the word "read" can be pronounced "reed" or "red." The second step, therefore, is to disambiguate words.
      Once the engine has an unambiguous set of words, it passes control to the module that associates each word with its constituent phonemes. This step can be accomplished in one of two ways. First, the module attempts to find the word in a pronunciation database. If the search is unsuccessful, it then tries to figure out phonemes by applying letter-to-sound rules. At this stage, the behavior of engines can differ. An enhanced engine like the Microsoft Speech SDK will try to tune and adjust the final sound by guessing the proper speed, timing, pitch, and volume of the voice.
      The final step in the process is the creation of a WAV file by converting phonemes and additional information into numbers representing voice amplitude samples. There are three techniques to convert a phoneme into an audible sound: word concatenation, synthesis, and subword concatenation. They are described in Figure 3. The Microsoft Speech SDK uses a synthesized voice.
      TTS has many possible applications. If you use it to vocalize online help, you can get rid of all the large sound files you may have used previously. Since the Speech SDK lets you vocalize text that's determined at runtime, you can employ TTS to enhance the user interface of an application. Using TTS, voice recognition and dictation is as easy as using an ActiveX control, a C++ class, or a COM object. The effort is minimal, but the payoff is high.

Developing with the Speech SDK

      The Speech SDK is available in three forms: ActiveX controls, C++ classes, and COM objects. My sample uses C++. For Visual Basic examples, refer to the Speech SDK documentation or the MSDN link provided at the end of this article. There is also a low-level API that lets you gain more control over the entire process. The SDK documentation provides complete information on the DirectTextToSpeech and DirectSpeechRecognition APIs.
      The Voice Text object uses TTS to produce synthesized speech output. To use it, you need to create a Voice Text object and register your application with the object. This object is shared globally among all applications. As mentioned earlier, the Microsoft Speech SDK creates sounds by using synthesized voices generated by algorithms that simulate the movements of the human vocal cords. Thus the properties of the Voice Text object include attributes like tongue position, lip tension, and mouth opening size. Figure 4 lists all the available properties by category. Figure 5 lists the methods and events.

A Speech-enabled MessageBox

      Now let's see how to turn all this theory into practice. The sample program (see Figure 6) is a dialog-based application that tests the main TTS characteristics of the Speech SDK. It employs a speech-enabled MessageBox function that you can easily add to any other Win32 application. The new routine is called SAPIMessageBox and uses exactly the same prototype as the original. The only difference is that SAPIMessageBox recognizes a new flag

 #define MB_ENABLESPEAK    0x00000800L
that causes the speech engine to play the text of the message upon dialog activation.
Figure 6: Speech Message Box

      MessageBox is a modal dialog that's impossible to subclass unless you enter its internal code through a WH_CBT hook. I've chosen the MB_ENABLESPEAK constant so that it does not conflict with the other MessageBox flags.
      To support this new flag, the program must hook into the system code and detect when the MessageBox's window will be displayed. Figure 7 shows the source code for the SAPIMessageBox routine, which you call exactly as you would call MessageBox:
 SAPIMessageBox(hWnd, szText, szTitle, 
     MB_ICONINFORMATION | MB_OK | MB_ENABLESPEAK);
      The structure of the function is simple. If MB_ENABLESPEAK is not specified, the code simply defaults to MessageBox. Otherwise, it installs a WH_CBT hook on the current thread and is notified of all the principal events concerning the windows that belong to that thread. Among others, the HCBT_ACTIVATE event is fired when the message box window is drawn on the screen. At this point, the code intercepts the event and asks the TTS engine to play the message.
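      To make the mechanism concrete, here is a minimal sketch of the hook plumbing. It assumes a TellMe helper that wraps the Speech API (described in the next section); the g_hHook and g_szSpeakText globals are illustrative names of mine, and the authoritative code is in Figure 7:
 // Sketch of the CBT hook mechanics (the real code is in Figure 7).
 // g_hHook and g_szSpeakText are illustrative globals; TellMe is the
 // speech wrapper described in the next section.
 static HHOOK g_hHook;
 static TCHAR g_szSpeakText[1024];

 LRESULT CALLBACK CBTProc(int nCode, WPARAM wParam, LPARAM lParam)
 {
     LRESULT lRet = CallNextHookEx(g_hHook, nCode, wParam, lParam);
     if (nCode == HCBT_ACTIVATE && g_hHook) {
         // The message box window is being activated: play its text,
         // then remove the hook, which has done its job.
         TellMe(g_szSpeakText);
         UnhookWindowsHookEx(g_hHook);
         g_hHook = NULL;
     }
     return lRet;
 }

 int SAPIMessageBox(HWND hWnd, LPCTSTR szText, LPCTSTR szTitle, UINT uType)
 {
     if (uType & MB_ENABLESPEAK) {
         // Remember the text and hook only the current thread, so the
         // hook sees the message box's window events.
         lstrcpyn(g_szSpeakText, szText, 1024);
         g_hHook = SetWindowsHookEx(WH_CBT, CBTProc, NULL, GetCurrentThreadId());
     }
     // Strip the custom flag before calling the real MessageBox
     return MessageBox(hWnd, szText, szTitle, uType & ~MB_ENABLESPEAK);
 }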

Playing Text with the TTS Engine

      At a certain point, SAPIMessageBox needs to play text. As shown in Figure 7, this is done via a function called TellMe, which is a wrapper for the Speech API. Let's look at this more closely. First of all, the source file includes the following header:

 #include <spchwrap.h>
This defines all the classes that wrap the TTS COM components. To play text, I need CVoiceText. Since it actually manipulates COM objects under the hood, the COM engine must be initialized for the current apartment. The initialization also requires a further step: registering the current application with the TTS engine. This is done through the Init method. The name you specify, which must be unique, serves only to identify the application to the engine. This code snippet illustrates how to prepare the TTS engine to play text:
 static CVoiceText *g_pSpeakObject;

 BOOL InitSpeech(void)
 {
     // Initialize COM for the current apartment
     CoInitialize(NULL);

     // Create the wrapper around the Voice Text object
     g_pSpeakObject = new CVoiceText;
     if (!g_pSpeakObject)
         return FALSE;

     // Register this application with the TTS engine
     // (a nonzero return value means failure)
     if (g_pSpeakObject->Init(L"TellMeMore"))
         return FALSE;

     return TRUE;
 }
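      Though not part of the snippet above, the application should also tear this down when it exits. A minimal counterpart might look like the following; CleanupSpeech is an illustrative name, and I'm assuming the CVoiceText destructor releases the underlying COM interfaces:
 void CleanupSpeech(void)
 {
     // Deleting the wrapper is assumed to release the underlying COM
     // interfaces; the COM library itself is then uninitialized.
     delete g_pSpeakObject;
     g_pSpeakObject = NULL;
     CoUninitialize();
 }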
Once the engine is ready, all I have to do is issue calls to the method Speak.
 HRESULT Speak (
     PCWSTR pszSpeak, 
     DWORD dwFlags = 0, 
     PCWSTR pszTags = NULL);
The text to speak must be a Unicode string that contains only the raw text. If you need to apply tags that modify how the text is pronounced, use the pszTags argument. The Speak method also supports a number of flags that indicate the type and priority of the text. When you send text, it is queued and played back as soon as possible; assigning it a higher priority speeds this up.
      To improve the prosody (the rhythm, stress, and intonation) of the text, you can specify tags that qualify the words within their range. The tags come from a predefined set and apply to single words or sentences within the text. Examples of tags are shown in Figure 8. The syntax of a tag is as follows
 \tagName=value\
where tagName is the name of the tag and value is its content, whether it's a number or a string. Note that a tag must be enclosed in backslashes. For example, the tags
 \vol=30000\ Microsoft 
 \vol=45000\ Internet
 \vol=65000\ Developer
play the name Microsoft Internet Developer with a progressively louder voice. As you can see, tags work on single words or strings throughout the text, so they must be embedded in the text to be spoken. This contrasts with the requirements of the Speak method where, unless you use the DirectTextToSpeech API, you can only specify general tags that apply to the text as a whole. The code needed to play text is:
 g_pSpeakObject->Speak(wszText);
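      Since the tags travel inside the string itself, embedding them from C++ is just a matter of escaping the backslashes. Here's a sketch using the same volume tags shown above:
 // Control tags are part of the string; note the doubled backslashes
 // required by C++ string literals.
 g_pSpeakObject->Speak(
     L"\\vol=30000\\ Microsoft \\vol=45000\\ Internet \\vol=65000\\ Developer");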
      When an application submits its text, the engine might be busy serving a previous request. You can accelerate the schedule for your text by using priorities, but this won't cancel the text that is currently playing. You can cancel the playback explicitly with the method StopSpeaking. Before doing so, you should use the method IsSpeaking to determine whether the engine is speaking or not:
 if (g_pSpeakObject->IsSpeaking())
     g_pSpeakObject->StopSpeaking();
      A similar tactic is also useful for SAPIMessageBox. If the text to speak is long and the user dismisses the dialog before the engine completes, you should stop playback when the window is destroyed. Similarly, if the dialog must appear while the engine is speaking, it should take precedence, interrupting the current task. Figure 9 shows the source code from the sample program that deals directly with speech; a simplified sketch of the interruption logic follows.
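      The code in Figure 9 is the definitive version; the fragment below merely sketches the idea on top of the earlier SAPIMessageBox outline, using only the IsSpeaking, StopSpeaking, and Speak methods described above:
 // Inside TellMe: silence any text still playing before speaking new text
 if (g_pSpeakObject->IsSpeaking())
     g_pSpeakObject->StopSpeaking();
 g_pSpeakObject->Speak(wszText);

 // Inside SAPIMessageBox: MessageBox is modal, so when it returns the
 // dialog has been dismissed; stop any playback that would outlive it
 int nRet = MessageBox(hWnd, szText, szTitle, uType & ~MB_ENABLESPEAK);
 if ((uType & MB_ENABLESPEAK) && g_pSpeakObject->IsSpeaking())
     g_pSpeakObject->StopSpeaking();
 return nRet;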

Modifying the Lexicon Database

      As you can see in Figure 5, there are a few system dialogs used to configure the TTS engine. The most important is the Lexicon dialog shown in Figure 10 that lets you modify the dictionary by adding new words, changing the pronunciation of words, and removing words.

Figure 10: The Lexicon of Dino

      Adding a new word is a two-step process. First you type in the word, then you define its phonemes. Let's use an example to see how this works. The very first words I asked the TTS to speak were "Dino Esposito." While the last name was spoken correctly, the first name was not. The engine pronounced it "Dy-no" as in "dinosaur." The correct pronunciation, of course, is "Dee-no" (see Figure 11).
Figure 11: Say it right, damn you!

      When you type in syllables to describe how the word should be pronounced, the phonemes edit box is updated automatically. Moreover, the dialog pronounces the text so you have immediate feedback. If you want a list of the available phonemes (English only), just click the button. The dialog shown in Figure 12 will appear. When you're finished, close the dialog and the new word will be stored for future use. Microsoft requests that you email new pronunciations to the development team for future releases of the engine.
Figure 12: Phonemes

The Sample Program

      The sample program shown in Figure 6 lets you test all the features discussed so far. It displays a text box where the user can enter text and play it back by clicking on the Speak button. The text box is filled automatically each time the user saves something to the clipboard. As shown in Figure 13, when the user presses Ctrl+C to copy text to the clipboard, the caption on the program's taskbar button changes, and it flashes for a few seconds. If you click to bring the application to the top, the window caption reverts back to the original title, and the text box displays the text you just copied.

Figure 13: Clipboard's Content Changed

      To obtain notification when the clipboard's content changes, the program must register a clipboard viewer window that will be notified of any change, whatever the format of the data. The key message to handle is WM_DRAWCLIPBOARD. Figure 14 illustrates how to check whether the data copied to the clipboard is text data, and if so, how to access the text and fill the text box with it. A sketch of the viewer plumbing is shown below.
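      Figure 14 contains the actual code; the fragment below only sketches the standard clipboard-viewer chain it relies on. The sample is dialog-based, so the real code lives in its dialog procedure; this sketch uses a plain window procedure for brevity, and ViewerWndProc, hwndNextViewer, and IDC_TEXT are illustrative names of mine:
 // Sketch of a clipboard viewer window procedure.
 static HWND hwndNextViewer;   // next window in the viewer chain

 LRESULT CALLBACK ViewerWndProc(HWND hWnd, UINT uMsg, WPARAM wParam, LPARAM lParam)
 {
     switch (uMsg)
     {
     case WM_CREATE:
         // Join the clipboard-viewer chain; from now on the window
         // receives WM_DRAWCLIPBOARD whenever the clipboard changes.
         hwndNextViewer = SetClipboardViewer(hWnd);
         return 0;

     case WM_CHANGECBCHAIN:
         // Keep the chain intact when another viewer removes itself
         if ((HWND)wParam == hwndNextViewer)
             hwndNextViewer = (HWND)lParam;
         else if (hwndNextViewer)
             SendMessage(hwndNextViewer, uMsg, wParam, lParam);
         return 0;

     case WM_DRAWCLIPBOARD:
         // New content: if it's text, copy it into the edit box
         if (IsClipboardFormatAvailable(CF_TEXT) && OpenClipboard(hWnd)) {
             HANDLE hData = GetClipboardData(CF_TEXT);
             if (hData) {
                 LPCSTR psz = (LPCSTR)GlobalLock(hData);
                 if (psz) {
                     SetDlgItemTextA(hWnd, IDC_TEXT, psz);
                     GlobalUnlock(hData);
                 }
             }
             CloseClipboard();
         }
         // Always forward the notification down the chain
         if (hwndNextViewer)
             SendMessage(hwndNextViewer, uMsg, wParam, lParam);
         return 0;

     case WM_DESTROY:
         // Leave the viewer chain before the window goes away
         ChangeClipboardChain(hWnd, hwndNextViewer);
         return 0;
     }
     return DefWindowProc(hWnd, uMsg, wParam, lParam);
 }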
      Note that the effect shown in Figure 13 occurs only under Windows® 98 and Windows 2000. These two operating systems provide a slightly different implementation of the SetForegroundWindow API function: rather than bringing the window to the foreground, they notify the user by flashing the program's taskbar button. Under Windows 95 and Windows NT® 4.0, the window is simply brought to the foreground.
Figure 15: The Demo Program

      The sample program also lets you compare the synthesized voice of the TTS engine with a recorded sound. By expanding the dialog as shown in Figure 15, you can listen to TTS speaking my name and then hear a WAV file I recorded. (The WAV file is a resource embedded in the executable.) Notice how the human voice, as you might expect, is more fluent and less mechanical.
      For the program to work properly, TTS speech must be installed on the target computer. If not, the program will load normally, but the user won't be able to play any text. For a better UI, you should gray out the appropriate buttons. The InitSpeech function defined in Figure 9 returns a Boolean value that indicates whether a TTS engine is present and working properly. Figure 16 summarizes five important considerations when designing speech-enabled applications.

Localized Engines

      A TTS engine is a separate component for a speech-enabled application. If you download and test the Microsoft Speech SDK 4.0, you'll get a TTS engine based on the English language. This poses a number of problems when it comes to localizing your software or developing speech-based software in non-English-speaking countries.
      There are ways to work around this. For example, you can assume that your customers already have a speech engine. TTS engines are often bundled with sound cards, so this is not unreasonable. Another possibility is to create a localized engine yourself. However, this is a difficult and time-consuming task that requires the skills of an expert linguist to design prosody rules and the sound of the language's phonemes.
      The best solution is to use one of the existing products available for the language of interest. Once you've chosen one of these, you can extend it to include specific words and abbreviations. For example, I was pleased to notice that Microsoft's TTS engine can correctly pronounce www, TTS, and @, and even recognizes the use of surrounding asterisks to emphasize a word (for example, I'm *very* pleased).

Speech-enabled Web Pages

      So far I've talked about how to use the TTS component to enhance the user interface of a Win32-based application. But is there a way to speech-enable Web pages? To vocalize messages and help files, use the HTML <OBJECT> tag and embed the ActiveX control provided with the SDK.
      More interesting is the voice recognition capability that the browser can provide by means of external, pluggable components. For example, the user could submit a form or move back and forth among pages by "telling" the browser what to do. The possibilities depend upon what this additional module (which you write yourself or buy) can do.
      Speech-enabled Web pages could let the user jump to links or click buttons simply by pronouncing the text of the element. When designing speech-enabled Web pages, you should make sure of a few things. First and foremost, the text must be unique to let the engine associate it unequivocally with a command. Second, it should be easily pronounced. You should use descriptive text rather than URLs or acronyms, produce unambiguous sounds, use short phrases (say, up to five words), and avoid any special formatting that could confuse the SR engine.

Summary

      The Speech SDK I've discussed in this article is under development at Microsoft. Late-breaking news can be found at http://research.microsoft.com. If you're an MSDN subscriber, the Speech SDK 4.0 is already in your hands on the additional SDK CD. Try it out. It can give you lots of fresh ideas for writing new applications or updating existing ones.

MSDN
http://research.microsoft.com/stg

From the February 1999 issue of Microsoft Internet Developer.