September 1998
They say New York is a summer festival, but sweating out the third summer in a row in our inner-city offices without a whisper of A/C has left us all too hot and tired to care any more. That's why you're about to experience our worst segue of the year.
Speaking of whispers: Whisper and Whistler are the code names for Microsoft's speech recognition and text-to-speech engines. The Speech API (SAPI) 4.0 SDK, now available in beta, exposes these technologies to developers. While light years beyond the early, Clapper-rivaling Voice Pilot, SAPI doesn't come without a price. With all the speech components included, it calls for a minimum of a 200MHz Pentium with an additional 42MB of RAM, on top of the usual multimedia hardware.

Programming with SAPI requires an understanding of the Speech object hierarchy, which is implemented on four levels. At the highest level, Voice, are the Voice Command, Dictation, and Text objects. The Sharing level below it allows the Voice-level objects to share speech engines. The DirectSpeechRecognition and DirectTextToSpeech objects occupy the next level, and the audio objects occupy the lowest level. The Voice APIs offer automatic resource and memory sharing between applications, while the Direct APIs provide full access to the speech engines: the DirectSpeechRecognition and DirectTextToSpeech objects load the engines in-process and take control of the speakers and microphone.

So how will your app listen? First, choose an audio input source. The speech-recognition engine acquires its data from an audio-source object created by the app. Input sources can be the multimedia wave-in device, a .WAV file, or a specialized hardware device (sometimes known as a microphone). The next choice is a speech-recognition engine that will operate on the audio data. Microsoft provides a speech-recognition enumerator object to locate one. The app then creates an instance of the engine object and passes it to the audio-source object. After the engine and audio-source objects agree on a common data format, the engine creates an audio-source notification sink and passes it to the audio-source object. From then on, the audio-source object submits digital audio data to the engine through that sink.

Somehow the app must be informed when the engine recognizes actual speech. This is done through an app-registered main notification sink. To tell the engine what to listen for, the app creates one or more grammar objects, along with a grammar-notification sink for each grammar object. When a grammar object recognizes a word or phrase, or has other grammar-specific information for the application, it calls functions in its grammar-notification sink, usually passing a string indicating what was spoken. The application responds to the notifications and takes whatever actions are necessary. A more sophisticated engine may also supply alternative phrases that might have been spoken, the timing, or even the identity of the person who uttered the phrase. (A simplified sketch of this sink-based flow appears at the end of this note.) Check out http://research.microsoft.com/research/srg for more on Whisper and Whistler.

Finally, not a whisper but a shout of congratulations to our Under the Hood columnist Matt Pietrek. This issue marks Matt's 60th monthly installment: five years of MSJ columns without a break! How did he do it? Necessity. Late or skipped columns stay on a writer's Equifax report for seven years. Going forward, Matt will be shifting to quarterly columns that will afford him more freedom to dig deeper and report on Windows NT 5.0 and Windows CE.

J.F.
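For the terminally curious, here is a bare-bones sketch of the grammar-notification pattern described above. Everything in it, interface names included, is a simplified stand-in invented for illustration, not the actual SAPI 4.0 declarations from the SDK headers; in the real API the engine, grammar, audio-source, and sink objects are COM objects wired together through the enumerator as described in the text.

// Simplified, hypothetical model of SAPI's grammar-notification flow.
// These types are illustrative stand-ins, not the real SAPI 4.0 COM
// interfaces; they exist only to show who calls whom.
#include <cstdio>

// The app implements this sink; the grammar object calls back into it.
struct IGramNotifySinkLike {
    virtual void PhraseRecognized(const char* phrase) = 0;
    virtual ~IGramNotifySinkLike() {}
};

// Stand-in for a grammar object that has been loaded into an engine.
struct GrammarLike {
    IGramNotifySinkLike* sink = nullptr;

    // In real SAPI the engine drives this from incoming audio; here we
    // simulate a recognition event by hand.
    void SimulateRecognition(const char* phrase) {
        if (sink) sink->PhraseRecognized(phrase);
    }
};

// The application's sink: respond to whatever was spoken.
struct MyCommandSink : IGramNotifySinkLike {
    void PhraseRecognized(const char* phrase) override {
        std::printf("App heard: %s\n", phrase);
        // ...dispatch to the appropriate command handler here...
    }
};

int main() {
    MyCommandSink sink;
    GrammarLike grammar;
    grammar.sink = &sink;   // register the grammar-notification sink

    // In a real app, audio flows from the audio source to the engine to
    // the grammar, and the engine fires this callback on a match.
    grammar.SimulateRecognition("open the pod bay doors");
    return 0;
}

The point is simply the direction of the calls: the app hands a sink to the grammar object up front, and the grammar object calls back into the app whenever something is recognized.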
From the September 1998 issue of Microsoft Systems Journal