
Microsoft Agent Speech Engine Support Requirements


ActiveX® technology for interactive software agents

Microsoft Corporation

October 1998


Contents
Introduction
Requirements for Text-To-Speech Engines
Requirements for Speech Recognition Engines
Requirements for Supporting the Microsoft Linguistic Information Sound Editing Tool

Introduction

Microsoft Agent uses the Microsoft Speech Application Programming Interface (SAPI) to support speech input (speech recognition, or SR) and speech output (text-to-speech, or TTS). Because Microsoft Agent supports this standard, its speech services can be provided by any compatible speech engine. This document describes the SAPI interfaces that Microsoft Agent requires.

Requirements for Text-To-Speech Engines

The engine must be fully SAPI 4.0-compliant. In addition, it must support the following SAPI interfaces for tagged text and bookmark notifications. These interfaces enable Microsoft Agent to pace the output of text in a character's word balloon and to lip-sync the character's mouth (or equivalent) with the spoken words.

ITTSCentralW

The engine must support TextData(), AudioReset(), Register(), Unregister(), and Inject().

ITTSNotifySinkW

The engine must call out through AudioStop(), AudioStart(), and Visual(). The Visual callback must provide IPA phonemes. (The International Phonetic Alphabet [IPA] is a universal notation for describing the phonetic content of spoken communication; all speakable phonemes have IPA representations. Details of IPA are in the Microsoft Speech API specification, part of the Speech SDK 4.0 download at http://www.microsoft.com/iit/.)

Although the Visual notification is fairly rich, Microsoft Agent uses only the cIPAPhoneme value to animate the mouth as the character speaks. Any Microsoft Agent-compatible engine must provide a closely synchronized stream of Visual notifications reflecting the phonetic content of the utterance being produced. In this case, "relatively timely" notification is not adequate, because viewers are quite sensitive to discrepancies between mouth position and acoustic content; Visual notifications must be returned promptly.
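As an illustration, the sketch below shows the shape of an ITTSNotifySinkW implementation that forwards each IPA phoneme to a mouth animator. The Visual parameter list shown follows the SAPI 4.0 documentation and should be verified against the Speech SDK headers; AnimateMouthForPhoneme is a hypothetical helper, not part of SAPI or Microsoft Agent.

// Sketch only: a notification sink that drives mouth animation from
// Visual callbacks. Verify the parameter list against SPEECH.H;
// AnimateMouthForPhoneme() is a hypothetical helper.
class CLipSyncNotifySink : public ITTSNotifySinkW
{
public:
    // ... IUnknown and the other ITTSNotifySinkW members omitted ...

    STDMETHODIMP Visual(QWORD qTimeStamp, WCHAR cIPAPhoneme,
                        WCHAR cEnginePhoneme, DWORD dwHints,
                        PTTSMOUTH pTTSMouth)
    {
        // Only the IPA phoneme matters here: it selects the mouth
        // frame displayed while this phoneme is being produced.
        AnimateMouthForPhoneme(cIPAPhoneme);
        return NOERROR;
    }
};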

ITTSBufNotifySinkW

The engine must call out through BookMark(). During preprocessing of speech output, Microsoft Agent code inserts bookmarks between "words" and uses the arrival of those bookmarks to drive the pacing of text in the word balloon. While SAPI does not require anything more than the arrival of those bookmarks at some time before the end of the utterance, the bookmarks must be returned in a relatively timely fashion to support Microsoft Agent.

Note that there is no strict concept of "word" in some languages, such as Japanese. Microsoft Agent's Speak method defines a "word" as a connected string of symbols that has a meaning and pronunciation in isolation. Microsoft Agent uses fairly simple parsing code to determine what a "word" is: it looks for symbols separated by white space. Thus, there are three "words" in the English string "The 101 Dalmatians": "the", "one hundred and one", and "dalmatians". (Text included in the Microsoft Agent Map tag is treated as a single "word" for display purposes.)
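The whitespace rule is simple enough to restate in a few lines of C++. The sketch below is illustrative only, not Microsoft Agent's actual parsing code:

// Illustrative only: split a string into "words" by white space,
// mirroring the simple rule described above.
#include <cwctype>
#include <string>
#include <vector>

std::vector<std::wstring> BalloonWords(const std::wstring &text)
{
    std::vector<std::wstring> words;
    std::wstring current;
    for (wchar_t ch : text) {
        if (std::iswspace(ch)) {
            if (!current.empty()) { words.push_back(current); current.clear(); }
        } else {
            current += ch;    // any non-space symbol joins the current "word"
        }
    }
    if (!current.empty()) words.push_back(current);
    return words;             // "The 101 Dalmatians" yields three "words"
}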

ITTSAttributesW

The engine must support pitch and speed attributes through the PitchSet(), PitchGet(), SpeedSet(), and SpeedGet() methods.
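For example, a client can read and adjust these attributes roughly as follows. This is a sketch: the parameter types shown (WORD for pitch, DWORD for speed) follow the SAPI 4.0 documentation and should be confirmed against the SDK headers.

// Sketch: lowering pitch and slowing speed through ITTSAttributesW.
// Confirm parameter types and units against the SAPI 4.0 headers.
void SlowAndLower(ITTSAttributesW *pAttr)
{
    WORD  wPitch  = 0;
    DWORD dwSpeed = 0;
    if (SUCCEEDED(pAttr->PitchGet(&wPitch)) && wPitch > 20)
        pAttr->PitchSet(wPitch - 20);
    if (SUCCEEDED(pAttr->SpeedGet(&dwSpeed)) && dwSpeed > 30)
        pAttr->SpeedSet(dwSpeed - 30);
}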

Requirements for Speech Recognition Engines

A speech recognition engine must be a fully compliant Command and Control (C&C) engine according to SAPI 4.0. It must support multiple grammars in the binary format described in the specification and allow those grammars to be activated and deactivated in real time.

Note that SAPI 4.0 requires speech recognition engines to support the wide-character, UNICODE interfaces. In supporting these interfaces, however, the engine should not depend on converting UNICODE data to ANSI, because the engine may then fail on some systems. For example, a Japanese engine that converts UNICODE to ANSI may not work on an English Windows 95 system.
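The failure mode is easy to demonstrate with the Win32 conversion API: converting to the system ANSI code page silently substitutes a default character for anything the code page cannot represent, which is exactly what happens to Japanese text on an English system. A minimal sketch:

// Sketch: detecting whether a UNICODE string would be corrupted by an
// ANSI round-trip on the current system. On an English system, CP_ACP
// cannot represent Japanese text, so fUsedDefault comes back TRUE.
#include <windows.h>

BOOL IsLossyOnThisSystem(LPCWSTR pszText)
{
    BOOL fUsedDefault = FALSE;
    WideCharToMultiByte(CP_ACP, 0, pszText, -1,
                        NULL, 0,              // size-only query
                        NULL, &fUsedDefault);
    return fUsedDefault;                      // TRUE => conversion is lossy
}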

In addition, to be considered Microsoft Agent-compliant, the engine must return results objects upon the successful recognition of a phrase (through ISRGramNotifySinkW::PhraseFinish). These results objects must support ISRResBasic, as the specification requires, and should also support ISRResScore. Although Microsoft Agent will run with an engine that supports only ISRResBasic, or even with an engine that returns no results objects whatsoever, performance will usually be significantly poorer with such engines, because many applications use the confidence values provided by the engine to control how they respond to commands.

Requirements for Supporting the Microsoft Linguistic Information Sound Editing Tool

The Microsoft Linguistic Information Sound Editing Tool uses a speech recognition engine to produce word-break and phonetic information for Windows wave sound (.wav) files. The 2.0 version of the tool supports use with other speech engines. Vendors that wish to support the sound editor must ensure that their engines fully comply with the SAPI 4.0 specification for context-free grammar engines and meet the following requirements.

ISRResGraphEx and IAttributes

The results objects returned by the engine must support the ISRResGraphEx interface and the IAttributes interface. Supporting the ISRResGraphEx interface alone is enough to let the sound editor provide word-break information, but it does not provide the necessary support for phoneme information.

SRATTR_PHONESEG

The sound editor also requires that engines support a special DWORD attribute, SRATTR_PHONESEG. The editor queries the engine for the IAttributes interface and attempts to set SRATTR_PHONESEG to 1. If that call succeeds, the sound editor assumes that the engine's results objects will support the gathering of phoneme-segmentation information.

#define SRATTR_PHONESEG    MAKELONG(1, SRVEN_MICROSOFT)
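Putting the handshake together, the editor-side probe looks roughly like the sketch below. The DWORDSet call is the one described above; the IID_IAttributes constant and the pUnkEngine pointer are illustrative assumptions.

// Sketch of the probe described above. IID_IAttributes and pUnkEngine
// are illustrative; DWORDSet(SRATTR_PHONESEG, 1) is the call the sound
// editor makes to discover phoneme-segmentation support.
BOOL EngineSupportsPhoneSeg(IUnknown *pUnkEngine)
{
    IAttributes *pAttr = NULL;
    BOOL fSupported = FALSE;
    if (SUCCEEDED(pUnkEngine->QueryInterface(IID_IAttributes, (void**)&pAttr)))
    {
        // Success here means the engine's results objects are expected
        // to honor SRGARC_PHONEMESEGMENTATION in DataGet.
        fSupported = SUCCEEDED(pAttr->DWORDSet(SRATTR_PHONESEG, 1));
        pAttr->Release();
    }
    return fSupported;
}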

When a results object is transmitted to the sound editor, the editor queries that object for its implementation of ISRResGraphEx. ISRResGraphEx contains a member function, DataGet, with the following signature:

HRESULT DataGet(DWORD dwID, GUID gAttrib, SDATA *psData);

where dwID is the identifier of the graph object, gAttrib is a GUID corresponding to the attribute sought, and psData is a pointer to an SDATA structure containing the returned data. The engine is responsible for allocating the data stored in psData through CoTaskMemAlloc; the calling application (in this case, the sound editor) is responsible for freeing it through CoTaskMemFree when it is finished.

DataGet is required to recognize three predefined GUIDs, which are listed in the SAPI documentation. An engine that returns a success code for the call DWORDSet(SRATTR_PHONESEG, 1) is also required to recognize a specific GUID, SRGARC_PHONEMESEGMENTATION, when called with a dwID that corresponds to an edge in the graph.

DEFINE_GUID(SRGARC_PHONEMESEGMENTATION, 0xd05405b0, 0x1db1, 0x11d2, 0x94, 0x2, 0x0, 0xc0, 0x4f, 0x8e, 0xf4, 0x8f);

When the call returns, psData should point to a packed array of DWORD-aligned structures of type SRPHONESEG, defined by:

typedef struct tagSRPHONESEG {
   DWORD dwSize;       // Size of the SRPHONESEG structure + phone data
   QWORD qwStartTime;  // SAPI timestamp of the start of the segment
   QWORD qwEndTime;    // SAPI timestamp of the end of the segment
   int   nScore;       // The segment's score
   WCHAR aPhones[0];   // Array of phone(s) making up the segment
} SRPHONESEG, *PSRPHONESEG;

qwStartTime and qwEndTime mark the beginning and end of each phoneme making up the word covered by the edge, and aPhones is an array of Unicode characters giving the IPA representation of the phones produced in this segment. (In some languages, there are phonemes that are written with more than one IPA symbol. In English, for instance, the "long I" sound in the word "live" is actually a diphthong, made up of two simpler phonemes concatenated together.) The aPhones array should be zero-terminated and padded at the end so that each SRPHONESEG structure in the array is an even multiple of four bytes long.

For example, suppose the word spoken on arc 4 was "made". Then the call DataGet(4, SRGARC_PHONEMESEGMENTATION, &sd) might return an array of three phoneme segments: /m/ running from qwStartTime=328434 bytes to qwEndTime=330354 bytes, /eɪ/ running from qwStartTime=330354 bytes to qwEndTime=344114 bytes, and /d/ running from qwStartTime=344114 bytes to qwEndTime=347314 bytes. These would be presented as a packed array of three SRPHONESEG structures of sizes 28, 32, and 28 bytes, respectively. Notice that there is some padding at the end of the middle SRPHONESEG structure, so that the next item in the array starts at a 4-byte boundary.
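A consumer walking that packed array must advance by each element's dwSize rather than by sizeof(SRPHONESEG), because each structure carries a variable number of phone characters plus padding. The sketch below shows the traversal, using the DataGet signature and the CoTaskMemFree ownership rule given above; pGraph is an illustrative ISRResGraphEx pointer, and SDATA's pData/dwSize members are assumed as defined in the SAPI headers.

// Sketch: retrieve and walk the packed phoneme-segmentation array for
// the word on arc 4. pGraph is an illustrative ISRResGraphEx pointer.
SDATA sd = {0};
if (SUCCEEDED(pGraph->DataGet(4, SRGARC_PHONEMESEGMENTATION, &sd)))
{
    BYTE *pb    = (BYTE *)sd.pData;
    BYTE *pbEnd = pb + sd.dwSize;
    while (pb < pbEnd)
    {
        PSRPHONESEG pSeg = (PSRPHONESEG)pb;
        // pSeg->aPhones holds the zero-terminated IPA characters;
        // pSeg->qwStartTime and pSeg->qwEndTime bracket the segment.
        pb += pSeg->dwSize;        // dwSize includes phones and padding
    }
    CoTaskMemFree(sd.pData);       // caller frees, per the contract above
}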

The SAPI 4.0 SDK includes a tool (SRFunc) for testing compliance with the SAPI 4.0 spec. Included in that tool is a test for compliance with this set of interfaces. The source code for that tool is a good place to start in order to understand how these interfaces will interact with the sound editor, and to debug the interfaces during development.





© 1999 Microsoft Corporation. All rights reserved. Terms of use.