The Future of Microsoft Telephony API (TAPI) in Advanced Media Processing and Control

Michelle Quinton
Microsoft Corporation

December 1999

Summary: TAPI 3.0, the version of Microsoft Telephony API (TAPI) that ships with Microsoft Windows 2000, is the latest step in the evolution of computer telephony. TAPI will continue to evolve into an even more compelling platform on which to develop telephony applications. This paper discusses the expanded media processing and control features planned for the next version of TAPI. (15 printed pages)

Introduction

Since its inception, Microsoft’s Telephony API (TAPI) has provided a robust and steadily evolving interface between the telephone and the computer. TAPI 3.0, the version of Microsoft’s Telephony API that ships with Microsoft® Windows® 2000, improves on previous versions of TAPI in several ways. By supporting IP Telephony, combining call and media control, and exposing COM objects, TAPI 3.0 makes IP and traditional telephony applications easy to develop. This article assumes the reader is generally familiar with TAPI 3.0; for general background, see the article “IP Telephony with TAPI 3.0.”

TAPI 3.1, the next version of TAPI, will continue to evolve TAPI into an even more compelling development platform by incorporating an expanded media processing and control API. This paper looks in depth at the plans for these features and shows how they will build on the foundations of TAPI 3.0.

Note: TAPI 3.1 is a working name for these features. This name may change in the future.

Telephony Scenarios

TAPI’s new advanced media processing and control support will enable simple development of many telephony scenarios and applications. The three significant scenarios we’ll discuss here are:

Interactive voice response (IVR) systems
Voice mail or unified messaging (UM) systems
Speech-enabled Web applications

IVR Systems

An IVR system is used to collect and provide information to a caller through an automated process. This allows the caller to be helped with minimal or no interaction with an actual human.

In an IVR system, a call is answered automatically when it is received, and a prerecorded message is played over the call. This message usually offers callers choices, such as “Press one to talk to customer service, press two to talk to accounting….” Callers can then press a touch-tone button on their phones, generating a tone that the IVR system can detect. The call is then routed according to the response from the caller.

An IVR system can also collect additional numeric information from the caller. For example, a customer may call her credit card company’s customer service line. Before she is transferred to a customer service agent, she enters her account number on her phone keypad. The IVR system can collect that information, look up information related to her account, and provide that information to the agent’s computer when the call arrives at the agent’s phone.

IVR systems can also use speech recognition for caller input. The prerecorded message played to the user could be something like “Say ‘yes’ if you would like to talk to a customer service agent.” The IVR system would then use speech recognition to listen for the caller’s response. This is very similar to the previous example; the only difference is the way the caller provides information to the system. A more advanced speech-recognition-enabled system may support more extensive and natural grammars, such as being able to interpret the caller asking, “What is my current balance and why did my check bounce?”

Text-to-speech features are also commonly incorporated into IVR systems. For example, a credit card company’s IVR system may have a feature that will read the caller’s current balance to them. That balance may be rendered as speech through a text-to-speech engine.

Figure 1. A traditional IVR scenario

Above is a representation of a traditional IVR scenario. The PBX and IVR system are both stand-alone hardware devices. The PBX handles routing the call to the requested extension, and the IVR system performs the media-processing-related tasks of playing outgoing messages and detecting input from the user. The PBX can transfer the call to the agent’s extension, but the agent’s desktop computer retrieves the information collected from the IVR.

The IVR scenario can also be implemented through IP telephony, as shown in Figure 2.

Figure 2. An IP telephony IVR scenario

In the IP Telephony implementation of the IVR scenario, a call from the PSTN goes through an IP/PSTN gateway and is handled as an IP telephony call throughout the rest of the system. The call router now takes the place of the PBX and routes calls as necessary. The IVR server performs the media processing tasks of playing outgoing messages and detecting input from the user. This media processing is done on the host computer. The call data store is used as a generic data store that the router and IVR server can write information to. The call router, IVR server, and call data store are all PCs that can run a variety of services. Additionally, the agent no longer needs a proprietary desktop phone extension that is connected to a proprietary PBX. Instead, she can receive the call through IP telephony, and use the standard sound card in her PC as her telephony audio device.

Unified Messaging Systems

A unified messaging system allows a user’s voice mail to be saved as a computer file, then accessed through that user’s e-mail system. In a common UM scenario, an incoming call rings at a corporate user’s extension. After four rings, the call is forwarded to the voice mail system. The voice mail system plays the outgoing message for that user, and then records the caller’s message. That message is then inserted into the user’s e-mail system. The user receives an e-mail message with the message saved as an attachment, and with a sound card is able to listen to the voice mail message.

Unified messaging systems can also incorporate many of the features from the IVR scenarios. For example, the UM system may allow a specific touch-tone to signal that the caller wants to stop the outgoing message and start recording a message right away. The UM system may support users calling into the system to listen to their voice mail messages as well as their e-mail, which can be rendered into speech through a text-to-speech engine. Users may navigate through the UM system by using touch-tones or through speech recognition. For example, users may be presented with the prompt, “Press 1 to listen to voice mail, press 2 to listen to new e-mail messages.”

Figure 3. A unified messaging scenario

Speech-Enabled Web Applications

The growth of the Internet and the World Wide Web, as well as advances in speech recognition and text-to-speech technologies, will enable a whole set of new telephony scenarios or new solutions to old problems. One intriguing scenario involves the use of a voice-specific markup language, such as the Windows Telephony Engine (WTE), which allows Web pages to have voice-interaction-specific tags. This would allow Web content and services to be accessed through telephony connections and rendered as speech rather than text or graphics.

For an example, let’s look at a Web site that provides movie times. A customer could use his Web browser to access this site, which would allow him to search for movies and find the times they are playing at a specific theater. With a speech-enabled site, the customer could call in to a machine running a “voice browser” that understands the voice tags that have been embedded in the Web site’s pages, much like a Web browser understands HTML tags on a Web page. This voice browser would then render the content of the page into speech, and use speech recognition to navigate through the site based on what the user says.

Figure 4. Speech-enabled Web applications

Media Processing and Control in the Three Scenarios

Telephony media processing tasks are already fairly well defined. When we analyze IVR, UM systems, and other telephony scenarios, the basic tasks break down into the following areas:

Speech recognition
Text-to-speech
Tone generation
Tone detection
File recording
File playback

In the sections that follow, we’ll take a general look at how TAPI 3.0 handles media processing, and then we’ll see how TAPI 3.1 will handle each of these tasks.

The TAPI 3.0 Object Model

In the TAPI 3.0 object model that is displayed in Figure 5, an application uses the TAPI object as an entry point into TAPI. From the TAPI object, the application can enumerate Address objects, each of which represents an endpoint from which calls can be made or received. From the Address object, the application can enumerate which Terminal objects can be used with calls on that address. Terminals are central to TAPI’s media model. They are used to indicate how the media on the call should be captured, rendered, or processed.

When an outgoing call is made or an incoming call is received, the application controls that call through the Call object. From the Call object, an application can enumerate Stream objects. A Stream object represents a media stream on the call based on media type and direction. For example, in a common phone call between two people (a full-duplex audio call), two streams happen simultaneously: audio capture and audio render.

Figure 5. The TAPI 3.0 object model

The application uses the Terminal and Stream objects to set up and control media streaming on a call. In the TAPI model, a Terminal object can be selected on a Stream object. TAPI then performs the media processing indicated by the Terminal that is on that Stream. In the example of a full-duplex audio call, a Microphone Terminal may be selected on the audio capture Stream, and a Speakers Terminal may be selected on the audio render Stream. TAPI sets up media streaming based on this selection, and the call goes through the microphone and speakers on the user’s machine.
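The relationships described above can be sketched in a few lines of Python. This is only a toy model of the object hierarchy, written for illustration: the class and method names (`Terminal`, `Stream`, `Call`, `select_terminal`) are assumptions made for this sketch and are not the TAPI 3.0 COM interfaces.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Toy model only: these classes mirror the concepts in the text
# (Call, Stream, Terminal). They are NOT the real TAPI COM interfaces.


@dataclass
class Terminal:
    name: str
    media_type: str  # "audio" or "video"
    direction: str   # "capture" or "render"


@dataclass
class Stream:
    media_type: str
    direction: str
    terminal: Optional[Terminal] = None

    def select_terminal(self, terminal: Terminal) -> None:
        # A terminal may only be selected on a stream with the same
        # media type and direction.
        if (terminal.media_type, terminal.direction) != (self.media_type, self.direction):
            raise ValueError(
                f"{terminal.name} does not match {self.media_type}/{self.direction}"
            )
        self.terminal = terminal


@dataclass
class Call:
    streams: List[Stream] = field(default_factory=list)


def full_duplex_audio_call() -> Call:
    # A two-party audio call has one capture stream and one render stream.
    return Call(streams=[Stream("audio", "capture"), Stream("audio", "render")])


call = full_duplex_audio_call()
terminals = [
    Terminal("Microphone", "audio", "capture"),
    Terminal("Speakers", "audio", "render"),
]

# Select the matching terminal on each stream of the call.
for stream in call.streams:
    for terminal in terminals:
        if (terminal.media_type, terminal.direction) == (stream.media_type, stream.direction):
            stream.select_terminal(terminal)

selected = [(s.direction, s.terminal.name) for s in call.streams]
print(selected)
```

In this sketch the application only pairs terminals with streams; everything that happens to the media afterward is left to the infrastructure, which is the division of labor the TAPI model describes.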

Media Support with TAPI 3.0

TAPI 3.0 supports media processing through its terminal object abstraction. In the TAPI 3.0 object model, a terminal represents a source or sink of media on a call. For example, a microphone would be a source of audio and a set of speakers would be a sink for audio on a full-duplex interactive audio call.

Interactive Voice and Video Terminals

For interactive voice and video calls, TAPI 3.0 defines four terminal object types: microphones, speakers, video capture devices, and video windows. Each terminal has an interface or set of interfaces that abstracts its functionality for the TAPI application. Each terminal object supports the ITTerminal interface, which gives basic information about the terminal, such as its type (for example, a speaker or a microphone), whether it is a capture or render device, and what type of media it supports. Additionally, audio terminals all support the ITBasicAudioTerminal interface, which has methods to set and get volume, gain, and other audio properties. TAPI 3.0 also defines interfaces specific to video terminals.

Using these terminals, it is very simple to set up an interactive call in TAPI 3.0. The application can either enumerate through the terminals that exist on the system and choose the ones it is interested in, or request a default terminal based on the media type and direction of media stream. It then takes these terminals and selects them on a call. After that, the TAPI infrastructure will take care of all the media streaming associated with that call, based on the address the call is on and the terminals selected.

Media Streaming Terminals

TAPI 3.0 defines a fifth terminal object called the Media Streaming Terminal (MST). The MST allows the developer to fetch media samples from, and inject them into, the media stream, and has a set of interfaces that allow the application developer a fine level of control over the media stream itself. Using the MST, the developer must set up the media format, buffer size, and control streaming programmatically. This allows approximately the same degree of control over the media stream on a call that a wave device would. The MST does not provide the same media processing abstraction as the other terminals defined in TAPI 3.0. However, it does allow the developer to directly access the media stream on a call, and therefore can be used to do any media processing that the application developer chooses to implement.

Media Support with TAPI 3.1

TAPI 3.0 handles media streaming on interactive calls well and provides support for generic media processing through the MST. However, it does not provide an abstraction of common telephony media processing tasks. These media processing tasks are necessary to implement the telephony scenarios described at the beginning of this article.

The TAPI 3.1 media model will address these issues in the TAPI 3.0 media model by defining new terminal objects based on specific media processing tasks. By defining these new terminal objects, TAPI 3.1 will make it easier to develop applications with complex media control requirements. Note that the list of new terminals to be defined is identical to our list of basic media tasks:

Speech recognition
Text-to-speech
Tone detection
Tone generation
File recording
File playback

Speech Recognition

With recent advances in automatic speech recognition (ASR) technology and ever increasing computer-processing power, ASR has become a feature that is accessible to many applications.

TAPI’s ASR terminal will be based on Microsoft’s Speech API (SAPI) interfaces and objects. The interfaces that the ASR terminal exposes will be similar to the SAPI interfaces, allowing simple integration of TAPI applications with SAPI ASR engines. However, TAPI can also support ASR terminals that are not based on SAPI. This capability is discussed in the MSP Base Classes section of this document.

Events are central to the ASR terminal’s functionality. Whenever the ASR terminal recognizes a word or phrase, an event is fired to the application through the standard TAPI 3.0 event mechanism, informing the application of what was recognized. The application then processes the event and performs the required actions based on what was recognized.

The ASR terminal can be in dictation mode, in which it will try to recognize free-form speech, or specific grammars can be loaded which will limit the words that the terminal will recognize. The specific grammar model fits very well into the IVR scenario. For example, if the outgoing prompt is “Say ‘balance’ to retrieve your balances, or ‘agent’ to talk to an agent”, then the terminal is only interested in listening for the words “balance” and “agent”. The application developer would load a grammar in the terminal, which indicates that it is only interested in those two words, and this makes it much easier for the terminal to process the incoming voice stream.
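The restricted-grammar idea above can be illustrated with a small sketch. It models a "grammar" as the set of words the application cares about and dispatches a handler when one of them is heard. The `GrammarListener` class, its methods, and the action names are hypothetical; a real ASR engine works on audio rather than on a list of words, and this is not the SAPI or TAPI interface.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical sketch: a "grammar" is modeled as a mapping from the words
# the application cares about to handlers. A real ASR engine recognizes
# audio; here the recognized words are passed in directly.


class GrammarListener:
    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[], str]] = {}

    def load_grammar(self, handlers: Dict[str, Callable[[], str]]) -> None:
        # Restrict recognition to the given words (case-insensitive).
        self._handlers = {word.lower(): fn for word, fn in handlers.items()}

    def on_utterance(self, words: List[str]) -> Optional[str]:
        # Fire the handler for the first in-grammar word and ignore the rest.
        for word in words:
            handler = self._handlers.get(word.lower())
            if handler is not None:
                return handler()
        return None


listener = GrammarListener()
listener.load_grammar({
    "balance": lambda: "play_balance",
    "agent": lambda: "transfer_to_agent",
})

print(listener.on_utterance(["I", "want", "my", "balance"]))
print(listener.on_utterance(["hello"]))
```

Because only two words are in the grammar, anything else the caller says is simply ignored, which is what makes the recognition task easier than free-form dictation.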

Text-to-Speech

Text-to-speech (TTS) is conceptually the opposite of ASR, taking text and rendering it into speech. TAPI’s TTS terminal will be based on SAPI 5.0 interfaces in the same way that the ASR terminal’s interfaces are. Since the speech generated by TTS engines does not sound as good as prerecorded speech, TTS is usually used in scenarios where a prerecorded message cannot be used. An example of this would be when a user can call into a UM system and have their e-mail read to them.

The TTS terminal will have a simple interface with two main methods—loading the string of text to render into speech, and choosing the voice that the speech will be generated in. When a TTS terminal is selected, the application developer simply provides the string to speak, and the terminal handles sending the generated speech out the media stream.

Tone Detection

IVR systems have always been based on tone detection. Dual-tone multifrequency tones (DTMFs) are more commonly known as the tones generated by pressing the buttons of a touch-tone phone. They have always been used as input to an IVR system, although ASR is slowly becoming more common for user input as it becomes more reliable and more accessible to applications.

Accurately detecting DTMFs in a media stream is a very processor-intensive activity, and is almost always handled off the host computer, usually on a digital signal processor (DSP). When a DTMF is detected, the hardware signals an event to the host computer indicating which DTMF was detected.

In addition to DTMFs, some systems need to detect other well-known tones, such as dial tone or busy tone, as well as generic tones that can be described as a collection of frequencies, volumes, durations, and cadences. Again, tone detection is not normally handled on the host computer, and not all systems can handle detecting tones other than DTMFs.

The TAPI tone detection terminal can be used to request tone detection on a call. The terminal will allow the application to request DTMF, a well-known tone, or generic tone detection. Like ASR, tone detection is event driven—that is, when a tone is detected, the terminal generates an event, and the application processes the event and bases its actions on the event.
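To make the detection step concrete, here is a minimal host-based sketch that uses the Goertzel algorithm, one common technique for measuring the energy at a handful of known frequencies. The source does not say which technique a TAPI terminal or DSP would use, so the choice of Goertzel, the function names, and the `min_share` threshold are all assumptions made for illustration. The sketch handles a single clean block of samples and is not tuned for noisy lines.

```python
import math
from typing import List, Optional, Sequence

# Standard DTMF row and column frequencies (Hz) and the matching keypad.
ROW_FREQS = (697.0, 770.0, 852.0, 941.0)
COL_FREQS = (1209.0, 1336.0, 1477.0, 1633.0)
KEYPAD = ("123A", "456B", "789C", "*0#D")


def goertzel_power(samples: Sequence[float], rate: int, freq: float) -> float:
    """Return the signal power at `freq` using the Goertzel recurrence."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / rate)
    s1 = s2 = 0.0
    for x in samples:
        s0 = x + coeff * s1 - s2
        s2, s1 = s1, s0
    return s1 * s1 + s2 * s2 - coeff * s1 * s2


def detect_dtmf(samples: Sequence[float], rate: int = 8000,
                min_share: float = 0.1) -> Optional[str]:
    """Return the DTMF key present in `samples`, or None.

    `min_share` is an illustrative threshold (an assumption): both the
    strongest row tone and the strongest column tone must carry at least
    this share of the normalized block energy.
    """
    energy = sum(x * x for x in samples)
    if energy == 0.0:
        return None  # silence or an empty block
    scale = len(samples) * energy
    rows = [goertzel_power(samples, rate, f) / scale for f in ROW_FREQS]
    cols = [goertzel_power(samples, rate, f) / scale for f in COL_FREQS]
    r = max(range(4), key=lambda i: rows[i])
    c = max(range(4), key=lambda i: cols[i])
    if rows[r] < min_share or cols[c] < min_share:
        return None
    return KEYPAD[r][c]


# Usage: synthesize 50 ms of the "5" key (770 Hz + 1336 Hz) and detect it.
rate = 8000
block: List[float] = [
    0.5 * math.sin(2 * math.pi * 770 * n / rate)
    + 0.5 * math.sin(2 * math.pi * 1336 * n / rate)
    for n in range(400)
]
detected = detect_dtmf(block, rate)
print(detected)
```

An application built on an event-driven terminal would not run this loop itself; it would simply receive the detected key in an event and act on it.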

Tone Generation

IVR and UM systems use tone generation in two ways. First, the user calling into the system usually has to have some way of generating tones to communicate their selections to the system. Traditionally, this has simply been pressing the buttons on a touch-tone phone. As computers become more integrated with telephony, however, users may be making calls from an application running on their computers that controls their phones, and therefore the application would need some way of generating the necessary tones. Second, and much more common, is that existing voice mail systems, PBXs, and IVRs use DTMFs to communicate with each other. As a call is transferred between the systems and information needs to be passed back and forth, the systems need a reliable way to communicate information, and DTMFs are traditionally used for this purpose.

As computer telephony becomes more popular, and the telephony intelligence moves into computer-based software, tone generation will become more important. In one interesting scenario, a user’s PC has a “dumb” phone device attached to it. The phone device can only stream audio to and from the PC, and an application on the PC controls all the streaming. When the user picks up the phone device, he would expect to hear a dial tone as he does with a traditional phone today. Using tone generation, the application controlling the streaming to the phone could stream dial tone to the phone to complete the user experience.

TAPI’s tone generation terminal will allow applications to generate DTMFs, well-known tones like a dial tone or busy tone, and generic tones on a call.
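As a rough illustration of what generating these tones involves, the sketch below synthesizes raw audio samples for a DTMF key and for a dial tone. The function names and default parameters are assumptions made for this example, and the 350 Hz + 440 Hz dial tone is the North American convention; other regions use different frequencies. It does not show how a TAPI terminal would be driven.

```python
import math
from typing import Dict, List, Sequence, Tuple

ROW_FREQS = (697.0, 770.0, 852.0, 941.0)
COL_FREQS = (1209.0, 1336.0, 1477.0, 1633.0)
KEYPAD = ("123A", "456B", "789C", "*0#D")

# Map each key to its (row, column) frequency pair.
DTMF_FREQS: Dict[str, Tuple[float, float]] = {
    key: (row, col)
    for row, keys in zip(ROW_FREQS, KEYPAD)
    for col, key in zip(COL_FREQS, keys)
}

# North American dial tone (an assumption about the target region).
DIAL_TONE = (350.0, 440.0)


def generate_tone(freqs: Sequence[float], duration_s: float,
                  rate: int = 8000, amplitude: float = 0.8) -> List[float]:
    """Mix equal-level sine waves; the peak never exceeds `amplitude`."""
    if not freqs:
        raise ValueError("at least one frequency is required")
    n = int(round(duration_s * rate))
    per_tone = amplitude / len(freqs)
    return [
        sum(per_tone * math.sin(2.0 * math.pi * f * i / rate) for f in freqs)
        for i in range(n)
    ]


def generate_dtmf(key: str, duration_s: float = 0.1, rate: int = 8000) -> List[float]:
    """Return samples for one DTMF key press."""
    return generate_tone(DTMF_FREQS[key], duration_s, rate)


five = generate_dtmf("5")
dial = generate_tone(DIAL_TONE, 1.0)
print(len(five), len(dial))
```

The "dumb" phone scenario above would stream samples like `dial` to the handset until the user starts dialing.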

File Recording

File recording is very important in the UM system scenario. When a call enters a voice mail system, the system has to record the message that the caller is leaving, and then store that file either in the e-mail system or some separate voice mail storage system.

The TAPI File terminal will make writing UM systems on TAPI simple. The application can use the terminal to create a new file to record to, overwrite an existing file, or append to an existing file. It will be able to set the file format. Although the scenarios described in this document are audio-based, the file terminal will allow an application to save audio and video files, such as files in the .avi or .asf formats.

Additionally, TAPI will define an XML schema for storing information about files, and will provide support for saving that XML data in the same file as the media itself. This information will include basic information, such as the time of call, caller ID, length of call, and so forth. It may also include additional information, such as a transcript of the call or some sort of tagging of the call, to indicate a harassment call, for example. Because an XML schema will define the data, it will be easy to extend the schema for vendor-specific information.
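The schema itself is not defined in this paper, so the following is only a guess at what such call metadata might look like. Every element name (`CallInfo`, `TimeOfCall`, `CallerId`, `LengthSeconds`, `VendorData`) and every sample value is hypothetical. The sketch shows the basic fields listed above plus an extension point for vendor-specific data.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional


def build_call_metadata(time_of_call: str, caller_id: str, length_seconds: int,
                        vendor_fields: Optional[Dict[str, str]] = None) -> str:
    """Serialize call metadata as XML.

    All element names here are hypothetical; the real schema is not
    specified in the source document.
    """
    root = ET.Element("CallInfo")
    ET.SubElement(root, "TimeOfCall").text = time_of_call
    ET.SubElement(root, "CallerId").text = caller_id
    ET.SubElement(root, "LengthSeconds").text = str(length_seconds)
    if vendor_fields:
        # Vendor-specific extensions live under their own element.
        vendor = ET.SubElement(root, "VendorData")
        for name, value in vendor_fields.items():
            ET.SubElement(vendor, name).text = value
    return ET.tostring(root, encoding="unicode")


# Illustrative values only.
metadata_xml = build_call_metadata(
    "1999-12-01T09:30:00", "555-0100", 42, {"Tag": "harassment"}
)
print(metadata_xml)
```

Keeping vendor data in a separate element is one way an extensible schema could let unrelated tools read the basic fields while ignoring extensions they do not understand.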

File Playback

File playback is a necessary component in both IVR and UM systems—it is used to play an outgoing message over a call. In an IVR, this message usually contains a series of choices for the caller to choose from: “Press ‘1’ for…”, while in UM, this message is often the outgoing voice mail message of the person receiving the call: “Hi, this is Jane. I’m not here right now…”

In IVR systems, file playback often involves playing several files sequentially over the call. For example, credit card company IVR systems are often used to inform callers of their current balances. The balance can be generated by TTS, as described above, or can be generated by concatenating several prerecorded files. In this example, the IVR system would have files that contain the words “one,” “two,” “three,” and so on, as well as words like “thousand,” “hundred,” “dollars,” and “cents.” The application can easily determine which files need to be played to inform the user of their balance.
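The concatenation approach described above amounts to turning a number into a sequence of word files. A minimal sketch follows; the function names, the `.wav` extension, and the file naming convention are assumptions, and the sketch ignores details such as singular "dollar" for a balance of one.

```python
from typing import List

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> List[str]:
    """Spell out 0..999,999 as a list of words, one per prerecorded file."""
    if not 0 <= n <= 999_999:
        raise ValueError("supported range is 0 to 999,999")
    if n < 20:
        return [ONES[n]]
    if n < 100:
        words = [TENS[n // 10]]
        if n % 10:
            words.append(ONES[n % 10])
        return words
    if n < 1000:
        words = [ONES[n // 100], "hundred"]
        if n % 100:
            words += number_to_words(n % 100)
        return words
    words = number_to_words(n // 1000) + ["thousand"]
    if n % 1000:
        words += number_to_words(n % 1000)
    return words


def balance_prompt_files(dollars: int, cents: int, ext: str = ".wav") -> List[str]:
    """Return the ordered list of prompt files that speak a balance."""
    if not 0 <= cents <= 99:
        raise ValueError("cents must be between 0 and 99")
    words = (number_to_words(dollars) + ["dollars"]
             + number_to_words(cents) + ["cents"])
    return [word + ext for word in words]


playlist = balance_prompt_files(1234, 56)
print(playlist)
```

The resulting list is the kind of "set of files" an application would hand to a playback component to be played back to back.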

The TAPI File Playback terminal will make file playback in a TAPI application easy. The terminal will allow the application to specify a file, set of files, or an IStream COM object as an input. The terminal will automatically handle opening the file, reading the contents, and sending the data into the outgoing media stream.

Telephony Scenarios Revisited

Let’s take a look at the scenarios described at the beginning of the document again, in light of the TAPI 3.1 features.

Remember that an IVR system collects information from a caller based on DTMF detection and ASR. Information is provided to the caller through file playback and TTS. Through the TAPI terminal object and media processing architecture, writing these complex applications is simplified.

With the new file terminal, the developer writing the application does not need to know how to open a file, read buffers from a file, or how to stream media over a call. The developer simply creates the terminal object, indicates what the file is, associates the terminal with the call, and uses the Play() method to start the message playing over the media stream.

Table 1. IVR scenario

IVR System	TAPI 3.1 usage
The customer calls a toll-free number for support.	The IVR system is notified of a call through TAPI.
The call is answered automatically, and the IVR system plays a message indicating the customer’s options.	The TAPI file terminal is used to play the message.
The customer chooses the option to get her account balance.	The TAPI tone detection terminal is used to listen for the customer’s selection.
The IVR system asks the user to input her account number.	The TAPI file playback terminal is used to play the prompt.
The customer enters her account number.	The TAPI tone detection terminal is used to listen for the customer’s account number.
The IVR system looks up the account, and finds the balance.
The IVR system uses TTS to provide the balance to the customer.	The TAPI TTS terminal is used to convert the account balance to speech.
The customer hangs up.
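The flow in Table 1 can be read as a small event-driven state machine. The sketch below models it in plain Python: each detected digit is an event, and the session answers with the terminal action to perform next. The class name, the prompt file names, the use of "1" for the balance option, and "#" as the end-of-account marker are all assumptions made for illustration, not part of TAPI.

```python
from typing import Dict, List, Tuple

Action = Tuple[str, str]  # (terminal to use, argument for that terminal)


class IvrSession:
    """Hypothetical model of the Table 1 flow, driven by digit events."""

    def __init__(self, balances: Dict[str, str]) -> None:
        self.balances = balances
        self.state = "menu"
        self.digits = ""

    def on_call_answered(self) -> List[Action]:
        # Play the message that lists the customer's options.
        return [("file_playback", "main_menu.wav")]

    def on_digit(self, digit: str) -> List[Action]:
        if self.state == "menu":
            if digit == "1":  # assumed option number for "account balance"
                self.state = "account"
                return [("file_playback", "enter_account.wav")]
            return [("file_playback", "main_menu.wav")]
        if self.state == "account":
            if digit != "#":  # assumed terminator for the account number
                self.digits += digit
                return []
            account, self.digits = self.digits, ""
            balance = self.balances.get(account)
            if balance is None:
                # Unknown account: prompt again.
                return [("file_playback", "enter_account.wav")]
            self.state = "done"
            return [("tts", f"Your balance is {balance}")]
        return []


session = IvrSession({"1234": "52 dollars"})
actions = session.on_call_answered()
for key in "11234#":
    actions += session.on_digit(key)
print(actions)
```

Each tuple names the kind of terminal from the right-hand column of the table, so the application logic stays independent of how the media is actually processed.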

Table 2. UM scenario

UM System	TAPI 3.1 Usage
The caller calls a business associate at his or her extension.	A phone control application running on the business associate’s desktop is notified of a call.
The call rings 4 times and is forwarded to a voice mail system.	The UM system issues a TAPI command to transfer the call.
The voice mail system looks up who the call was for originally, finds his or her outgoing message, and plays the outgoing message to the caller.	The TAPI file playback terminal is used to play the outgoing message.
The caller presses a DTMF as soon as the message starts to indicate that he or she wants to start recording a message right away.	The TAPI tone detection terminal is used to listen for the DTMF. The TAPI file playback terminal is stopped.
The playback stops and the system begins to record the message.	The TAPI file record terminal is used to record the message.
The user finishes the message and hangs up.	Call disconnect is indicated through TAPI, and file record is stopped.
The UM system saves the file, attaches it to an e-mail message, and sends the e-mail message to the user’s e-mail account.	TAPI file terminal saves the file.

Table 3. Speech-enabled Web application scenario

Speech Web Application	TAPI 3.1 Usage
User calls into speech Web access point.	TAPI is used to answer the call.
Voice browser renders information to user.	Voice browser uses TTS terminal to render text to user.
User makes request.	Voice browser uses ASR terminal to recognize user’s request, and navigates browser based on request.
Two previous steps repeated until user hangs up.	TAPI reports call disconnected to voice browser.

TAPI Development Tools

MSP Base Classes

Telephony Service Providers (TSPs) are hardware drivers that expose telephony devices through the TAPI infrastructure. Every TSP can have a Media Stream Provider (MSP) that provides support for media streaming in TAPI 3.0. It is the MSP’s responsibility to implement the terminal objects that TAPI defines.

To make MSPs easy to create, Microsoft provides the MSP base classes, from which MSP developers can derive their MSPs. The MSP base classes provide the default implementation of all the terminals defined by TAPI.

For TAPI 3.0, the MSP base classes implement the five terminals described earlier in this article: microphone terminal, speakers terminal, video capture device terminal, video window terminal, and the media streaming terminal. The MSP base classes use Microsoft’s DirectShow® technology for all media streaming.

To create an MSP derived from the MSP base classes, the MSP developer must create a DirectShow filter that handles media streaming to and from its telephony device. Once that is done, the MSP developer can modify the sample MSP provided in the SDK to use the correct filter and fill in whatever communication is needed between the MSP and TSP components.

For TAPI 3.1, Microsoft will continue to support the MSP base classes, and will add a default implementation of all the terminals described in this article. The default implementation of the ASR and TTS terminals will use SAPI 5.0 to do the speech recognition and text to speech. As long as the terminals conform to the objects and interfaces defined by TAPI, MSP developers can override the default implementation of any terminal to implement their own versions.

Additionally, MSPs need not be derived from the MSP base classes. MSP developers can write their own MSPs from scratch. Developers would most likely choose this option if they feel their media streaming software could not fit into the DirectShow framework.

Plug-in Terminals

TAPI 3.1 will also define a plug-in terminal framework. This framework will allow third parties to provide their own terminals and terminal implementations that will work with any MSPs that are derived from the MSP base classes.

Again, plug-in terminals must be based on DirectShow and must follow the TAPI terminal object and interface definitions. Plug-in terminals could be used if a company defined a new file format that is not supported by the default file terminal. This company could develop its own file terminal, and make that terminal available to any MSP. On the other hand, a company may be transitioning applications from TAPI 2.x to TAPI 3.1 and may have a lot of existing code that manipulates streams directly to do ASR or file playback. They can encapsulate that current code in a TAPI terminal to ease the transition and to leverage their existing work.

TAPI 3.1 Media Architecture

The architecture for TAPI 3.1 builds on the TAPI 3.0 architecture. TAPI 3.0 TSPs and MSPs provide the call and media control for TAPI applications. The new TAPI 3.1 terminals will appear as plug-in terminals and will simply be inserted into the DirectShow filter graph that controls the media streaming for the call. Third parties can provide their own plug-in terminals that will work with any MSP created from the TAPI 3.1 MSP base classes. The MSP will not need to be recompiled for the plug-in terminal to work.

Figure 6 shows an example of the TAPI 3.1 architecture. Here the H.323 TSP and MSP work with the new TAPI 3.1 Speech Recognition Terminal. The H.323 TSP sets up and controls the call. The H.323 MSP sets up and controls the media streaming. The MSP inserts whatever terminal has been selected by the application into its media stream.

Figure 6. The new TAPI 3.1 architecture

Host Processing vs. DSP Processing

In existing computer telephony environments, most media processing is done on telephony boards, rather than on the host computer. As computer-processing power continues to grow, more media processing can be done on the host computer. Many companies today have computer telephony solutions that do some or all media processing on the host computer.

Scalability through IP Telephony

Using IP Telephony, it is simple to build a scalable system that performs host-processing-intensive media tasks. For example, at the beginning of this article, we looked at an IP Telephony-based IVR system. To make that system scalable, the system would simply need more IVR server PCs and the call router would need to be updated to know how to route calls to an IVR server based on the current load on each server. With this logic in place, the system can be scaled incrementally as needed.

Figure 7. A scalable IP telephony-based IVR scenario
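The routing logic described above can be sketched as follows. The paper only says that calls are routed based on the current load on each server; choosing the server with the fewest active calls is one simple policy, assumed here for illustration. The `CallRouter` class and the server names are hypothetical.

```python
from typing import Dict, Iterable


class CallRouter:
    """Hypothetical load-aware router: picks the least-loaded IVR server."""

    def __init__(self, servers: Iterable[str]) -> None:
        self.active: Dict[str, int] = {name: 0 for name in servers}

    def add_server(self, name: str) -> None:
        # Scaling out is just a matter of registering another server.
        self.active.setdefault(name, 0)

    def route(self) -> str:
        if not self.active:
            raise RuntimeError("no IVR servers available")
        # Fewest active calls wins; ties are broken by name for determinism.
        server = min(self.active, key=lambda name: (self.active[name], name))
        self.active[server] += 1
        return server

    def release(self, server: str) -> None:
        if self.active[server] == 0:
            raise ValueError(f"{server} has no active calls")
        self.active[server] -= 1


router = CallRouter(["ivr-1", "ivr-2"])
first_three = [router.route() for _ in range(3)]
router.add_server("ivr-3")
after_scale_out = router.route()
print(first_three, after_scale_out)
```

Adding a server requires no change to the routing policy, which is the incremental scaling property the section describes.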

MSP Base Class Support

The MSP base classes can only be used for media processing on the host computer. Additionally, SAPI only supports host-based ASR and TTS. However, the TAPI object model itself does not enforce host-based processing. The MSP is responsible for all media processing and implements the terminal object. Therefore, any MSP can be written to proxy all media processing requests to the telephony hardware. The MSP developer cannot take advantage of the MSP base classes to do this, but this option is available.

Conclusion

TAPI continues to evolve to meet the needs of telephony system developers. By analyzing traditional and emerging telephony scenarios, we have seen how the plans for media processing and control in TAPI 3.1 will help telephony developers easily write applications that implement these scenarios. Because TAPI provides a compelling telephony and media model abstraction, TAPI applications that are written today for existing telephony platforms can run on new IP telephony-based systems without modification.

--------------------------------------------

THIS IS PRELIMINARY DOCUMENTATION. The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication.

This BETA document is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS DOCUMENT.

The BackOffice logo, Microsoft, Windows, and Windows NT are registered trademarks of Microsoft Corporation.

Other product or company names mentioned herein may be the trademarks of their respective owners.