A High-Level Look at Text-to-Speech via the Microsoft Voice Text Object

Robert Coleridge
MSDN Content Development Group

Introduction

This article looks at the Microsoft® Speech SDK and, in particular, the Voice Text object as it is used in Microsoft Visual Basic®. This object uses text-to-speech technology to produce computer-generated speech. A complete discussion of computer-generated speech is beyond the scope of this article. For information on obtaining the Speech SDK, see the section of this article titled "Obtaining Microsoft Text-to-Speech."

The Voice Text object is available in two forms: a standard COM interface IVoiceText and companion interfaces, and also an ActiveX® COM object, VtxtAuto.dll. Since there are not many samples showing how to use this object in Visual Basic, this article will focus on the VTxtAuto ActiveX COM object and use it in the Visual Basic sample, Speak2Me.

Text-to-Speech

Editor's note: The information in this section is adapted from the Speech API SDK documentation.

The term "Text-to-Speech," or TTS for short, refers to the process by which plain text is converted into digital audio and then "spoken." This "speaking" can be in the form of actually sending the audio through a computer's speakers (or other capable device), or simply saving the converted audio for later playback.

For the most part, TTS conversion engines fall into one of three groups, according to the method they use to convert phonemes (the smallest phonetic unit in a language that is capable of conveying a distinction in meaning, such as the m of mat and the b of bat in English) into audible sound. The supplied Microsoft Speech engines use the second method. The three methods are described in the following paragraphs.

Concatenation of Words

This method works by joining previously recorded phrases and words to construct a complete sentence. Electronically it is the easiest method to perform, and as such it is the most commonly used method today. Most electronic voice-mail systems use this approach. For example, the voice message "You have [two] new messages" is a three-part message that consists of two standard parts, "You have" and "new messages," and one selected part, the "two," chosen from a list of prerecorded numbers.
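The voice-mail example can be sketched in a few lines of Visual Basic. This is only an illustration of the idea, not code from the Speech SDK: the function names and the clip names are hypothetical, and a real system would map each clip name to a prerecorded audio file and play the clips in order.

```vb
Option Explicit

' Hypothetical lookup of the prerecorded "number" clip for a message count.
Private Function NumberClip(ByVal nCount As Long) As String
    Select Case nCount
        Case 0: NumberClip = "no"
        Case 1: NumberClip = "one"
        Case 2: NumberClip = "two"
        Case 3: NumberClip = "three"
        Case Else: NumberClip = "several"
    End Select
End Function

' Returns the ordered list of clips to play, separated by "|":
' two fixed parts with one selected part between them.
Public Function BuildMessagePrompt(ByVal nCount As Long) As String
    BuildMessagePrompt = "You have" & "|" & NumberClip(nCount) & "|" & "new messages"
End Function
```

The sketch also shows the method's main limitation: only phrases that were anticipated and recorded in advance can ever be played.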

Synthesis of Words

This method generates "synthetic" words by electronically applying mathematical algorithms to simulate throat length, mouth cavity, lip shape, and tongue position. The speech generated by this method sounds mechanical, but by proper application of the various algorithms, the sound can be made to seem like a human voice.

Diphone Concatenation

This method concatenates short digital-audio segments and mathematically smooths out the gaps between them to produce a continuous sound. Each segment contains two sounds, one that leads into the sound, and one that finishes the sound. A good example is the word "hello." The word "hello" consists of four phonemes: h eh l œ. Each segment spans the transition from one of these sounds to the next.
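As a rough illustration of how a phoneme sequence maps to two-sound segments, the following sketch pairs each phoneme with its neighbor. It makes several assumptions that are not from the Speech SDK: the function name is hypothetical, "_" stands for silence at the start and end of the word, "oe" is used as a plain-text stand-in for the final vowel, and the Split function requires Visual Basic 6.0. A real engine would look up a recorded audio segment for each pair and smooth the joins.

```vb
Option Explicit

' Builds the list of two-sound segments for a space-separated phoneme string.
' "_" marks silence before and after the word (an assumption of this sketch).
Public Function DiphonesFor(ByVal sPhonemes As String) As String
    Dim vParts As Variant
    Dim sPrev As String
    Dim sResult As String
    Dim i As Long

    vParts = Split(Trim$(sPhonemes), " ")
    sPrev = "_"
    For i = LBound(vParts) To UBound(vParts)
        ' Pair the previous sound with the current one
        sResult = sResult & sPrev & "-" & vParts(i) & " "
        sPrev = vParts(i)
    Next i
    ' Close the word with a transition into silence
    DiphonesFor = sResult & sPrev & "-_"
End Function
```

For example, DiphonesFor("h eh l oe") returns "_-h h-eh eh-l l-oe oe-_".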

For a more detailed explanation of these methods, please see the documentation that comes with the Microsoft Speech SDK (available on the MSDN Library at: SDK Documentation/Additional SDK Documentation/Speech API SDK 3.0).

Why Use Text-to-Speech?

TTS is a technology that can be used when an application needs to give some audible response or message to the user and where full digital recordings are not appropriate. There are two good reasons why digital recordings would be inappropriate. The first is that digital recordings take up a relatively large amount of storage for the gain in intelligibility they provide. The second is that unless the developer knows in advance every phrase or sentence that might be required, there is no way to record and store what is needed.

Possible Applications for Text-To-Speech

TTS also offers a number of other benefits. In general, TTS is most useful for short phrases or for situations when a prerecorded phrase is not practical. TTS has the following practical uses:

Reading dynamic text. TTS is useful for phrases that vary too much to record and store using all possible alternatives. This is what my sample application, "Speak2Me," is designed to show: when the contents of the clipboard change, the text is spoken as it changes.
Proofreading. Audible proofreading of text can help users catch errors missed by visual proofreading.
Conserving storage space. TTS is useful for phrases that would take up too much storage space if they were prerecorded in a digital-audio format.
Event notification and audible feedback. TTS works well for informational messages. For example, a cashier is entering a sale into a cash register and enters an obviously unlikely quantity. A TTS-enabled application could inform the cashier of the mistake and proceed or cancel it depending on the cashier's reaction. This type of notification should only be used in conjunction with a visible message in case the user turns the sound off or is out of hearing range.
Telephony. One of the major up-and-coming uses for TTS is in the area of telephony, the use of computers and telephones. Imagine an answering machine that could tailor its outgoing answer depending on the caller ID. Read the Telephony section of the Microsoft Speech API documentation for a full description of telephony.
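The cashier scenario described under event notification might look something like the following sketch. It assumes that oVoice is a VTxtAuto.VTxtAuto object that has already been registered (the object is described later in this article); CheckQuantity and MAX_LIKELY_QUANTITY are hypothetical names, and the threshold is arbitrary. The spoken message is always paired with a visible message box, as recommended above.

```vb
' Assumes oVoice is a registered VTxtAuto.VTxtAuto object declared elsewhere.
' The threshold and the routine name are hypothetical.
Private Const MAX_LIKELY_QUANTITY As Long = 100

' Returns True if the sale should proceed, False if the cashier cancels it.
Private Function CheckQuantity(ByVal nQuantity As Long) As Boolean
    Dim sWarning As String

    CheckQuantity = True
    If nQuantity > MAX_LIKELY_QUANTITY Then
        sWarning = "The quantity " & nQuantity & " looks unusually high. Continue?"
        'Audible notification...
        oVoice.Speak sWarning, vtxtst_QUESTION
        '...backed by a visible message in case the sound is off
        CheckQuantity = (MsgBox(sWarning, vbYesNo + vbExclamation) = vbYes)
    End If
End Function
```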

Obtaining Microsoft Text-to-Speech

The Microsoft TTS engine can be obtained from the Microsoft Research Web site at http://research.microsoft.com/. Navigate to the download section and there you will find a number of free downloads. Look for the Speech SDK version 3.0 or higher. This SDK contains the Text-to-Speech as well as several other Microsoft voice technologies, which I will cover briefly in a moment.

There are two options for downloading the Speech SDK. The first is to download just the Speech SDK without the Speech engines. This download is for those who have already downloaded the full SDK or have their own engines. This option, at the time of writing, was about 3 MB in size.

The second option is to download the entire Speech SDK and the Microsoft Speech engines. Although this option is greater than 13 MB in size, it contains all of the tools necessary for doing voice work. Note that the Speak2Me sample files associated with this article will not function without a speech engine.

Microsoft Speech SDK

Although this article deals with Microsoft TTS only, the Speech SDK contains two other very powerful tools, which are out of the scope of this article. They are the Microsoft Voice Command system, which allows you to control your computer by voice command, and the Microsoft Voice Dictation system, which is, in essence, the reverse of TTS. You use a microphone to speak to your computer and this package converts your words into text. Both of these technologies are very powerful and demand a significant investment in understanding. For further details I would suggest reading the documentation that comes with the Speech SDK. More up-to-date documentation can be found at the Microsoft Research site at http://research.microsoft.com/.

The Microsoft VTxtAuto Object

The following methods and properties are part of the VTxtAuto object.

Parameterless methods

General syntax: object.method()

AudioFastForward	This method advances audio playback by approximately one sentence or phrase.
AudioPause	This method pauses output for the current site. This affects all applications using the site, so the application should resume speech output as soon as possible by calling the AudioResume method.
AudioResume	This method resumes the text-to-speech engine after it has been paused by the AudioPause method.
AudioRewind	This method backs up audio playback by approximately one sentence or phrase.
StopSpeaking	This method interrupts the text that is currently being spoken by the voice-text object and deletes any buffers containing text waiting to be spoken.
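Because AudioPause affects every application that uses the site, it is worth pairing each pause with a resume. The following is a minimal sketch of one way to do that. It assumes that oVoice is a registered VTxtAuto.VTxtAuto object and that the form contains a check box named chkPause; the check box is a hypothetical control and is not part of the Speak2Me sample.

```vb
' Assumes oVoice is a registered VTxtAuto.VTxtAuto object declared elsewhere.
' chkPause is a hypothetical check box control.
Private Sub chkPause_Click()
    If chkPause.Value = vbChecked Then
        oVoice.AudioPause
    Else
        oVoice.AudioResume
    End If
End Sub

Private Sub Form_Unload(Cancel As Integer)
    'Do not leave the shared site paused when the application exits
    If chkPause.Value = vbChecked Then oVoice.AudioResume
End Sub
```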

Parameterized methods

Register

This method registers an application with the voice-text object. This method must be used prior to any other method being invoked.

Syntax: object.Register(sSitename as String, sApplicationname as String)

Parameters:

sSitename	Name of the site/location to connect to, such as "Phone Line 1." To use the default site ("Local PC"), specify an empty string.
sApplicationname	Name of the application. For example, "Microsoft Word." An application can use this information to display the source of text. This parameter cannot be NULL.

Example: Call Vtxt.Register("", "VB Test App")

Speak

Renders a text string into speech.

Syntax: object.Speak(sTextToSpeak as String, lFlags as Long)

Parameters:

sTextToSpeak	A string of text to be spoken.
lFlags	Type of speech and its priority. The type of speech can be one of these values:
	Value	Description
	vtxtst_COMMAND	An instruction to the user, such as "Insert the disk."
	vtxtst_NUMBERS	Text that is numeric and that should be read in numeric style.
	vtxtst_QUESTION	A question, such as "Do you want to save the file?"
	vtxtst_READING	Text that is being read from a document, such as an e-mail message.
	vtxtst_SPREADSHEET	Text that is being read from a spreadsheet, such as columns of numbers.
	vtxtst_STATEMENT	A neutral, informative statement, such as "You have new messages." This is the default type of speech.
	vtxtst_WARNING	A warning, such as "Your printer is out of paper."

The priority can be one of these values:

Value	Description
vtxtsp_VERYHIGH	Play the text immediately, interrupting text that is currently being spoken, if any. The interrupted text resumes playing as soon as the very high priority text is finished, although the interrupted text may not be correctly synchronized.
vtxtsp_HIGH	Play the text as soon as possible, after text that is currently being spoken but before any other text in the queue.
vtxtsp_NORMAL	Add the text to the end of the queue. This is the default priority.

Example: Call Vtxt.Speak(ESpeakBuf.TEXT, vtxtst_STATEMENT)
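Because lFlags carries both the type of speech and its priority, a call can specify both. The sketch below assumes that the two values are combined with the Or operator, as is usual for flag parameters; the tables above do not state this explicitly, so treat it as an assumption to verify against the Speech SDK documentation. Vtxt is assumed to be a registered VTxtAuto.VTxtAuto object, and the routine name is hypothetical.

```vb
' Assumes Vtxt is a registered VTxtAuto.VTxtAuto object declared elsewhere.
' Assumption: speech type and priority are combined with Or.
Private Sub AnnounceStatus()
    'Routine message: added to the end of the queue
    Vtxt.Speak "You have new messages.", vtxtst_STATEMENT Or vtxtsp_NORMAL
    'Urgent message: interrupts whatever is currently being spoken
    Vtxt.Speak "Your printer is out of paper.", vtxtst_WARNING Or vtxtsp_VERYHIGH
End Sub
```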

Properties

Callback

This property specifies the name of a class module containing methods that receive notifications from the voice-text automation object. The class module must contain a SpeakingStarted method and a SpeakingDone method.

Example:

You have a class module called VTxtCallback, which contains the following code:

Option Explicit
Function SpeakingStarted()
    Debug.Print "Started"
End Function
Function SpeakingDone()
    Debug.Print "Stopped"
End Function

And somewhere in your initialization routine, you have the following:

Set oVoice = New VTxtAuto.VTxtAuto
oVoice.Register App.Title, App.EXEName
oVoice.Callback = "Speak2Me.VTxtCallback"

This code will set up the callback routine in the VTxtAuto object to point to your class module.

Enabled

This property enables or disables voice text. Disabling a site prevents the engine from playing text through the site. If Enabled is TRUE, voice text is enabled. If Enabled is FALSE, voice text is disabled.

Syntax: object.Enabled
Example: Vtxt.Enabled = True

IsSpeaking

This property indicates whether the voice-text object is in the process of speaking text for the voice-text site. A site is global and may be in use by another application. If IsSpeaking is TRUE, the voice-text object is currently speaking text. If IsSpeaking is FALSE, the object is not speaking. The IsSpeaking property is read-only; an application cannot set this property.

Syntax: object.IsSpeaking
Example: If Vtxt.IsSpeaking Then Vtxt.StopSpeaking

Speed

This property sets or gets the speed at which text is spoken, in words per minute. Setting the Speed property to 0 causes speech to be spoken at the slowest speed; setting it to 1 causes speech to be spoken at the fastest speed. A "safe" range for this property is 30–270. Values outside this range (other than 0 and 1) will work but do not necessarily generate good speech.

Syntax: object.Speed
Example: Vtxt.Speed = ESpeed.TEXT
MySpeed = Vtxt.Speed
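When the speed comes from user input, as in the example above, it may be worth limiting it to the "safe" range before assigning it. The following helper is a sketch of one way to do this; SafeSpeed is a hypothetical name and not part of the VTxtAuto object. It passes the special values 0 and 1 through unchanged.

```vb
Option Explicit

' Limits a requested rate to the 30-270 range described above.
' The special values 0 (slowest) and 1 (fastest) are returned unchanged.
Public Function SafeSpeed(ByVal nRequested As Long) As Long
    Const MIN_SAFE As Long = 30
    Const MAX_SAFE As Long = 270

    If nRequested = 0 Or nRequested = 1 Then
        SafeSpeed = nRequested
    ElseIf nRequested < MIN_SAFE Then
        SafeSpeed = MIN_SAFE
    ElseIf nRequested > MAX_SAFE Then
        SafeSpeed = MAX_SAFE
    Else
        SafeSpeed = nRequested
    End If
End Function
```

It could then be used as Vtxt.Speed = SafeSpeed(Val(ESpeed.Text)).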

A Sample Application: Speak2Me

This sample application was written to demonstrate just how easy it is to use the VTxtAuto ActiveX COM object. The sample does one of two things. The user can type in text and have that text spoken when they click on a button, or the application can be set to monitor the clipboard and speak its contents whenever it changes.

The application's single form is shown in Figure 1.

Figure 1. The clipboard speaker form

The form has only five "moving parts": the editbox, the Speak button, the Exit button, the Monitor clipboard checkbox, and an invisible timer control. Whenever the Speak button is clicked, the text in the editbox is spoken via the TTS engine.

If the Monitor clipboard checkbox is checked, then whenever the contents of the clipboard change the contents are placed into the editbox and then "spoken." This feature demonstrates a possible use for vision-impaired people. With this sample application running in the background, all they would have to do is highlight a piece of text in a word processor, transfer it to the clipboard, and hear it spoken. The rate at which the clipboard is sampled is determined by the interval set on the timer control.

Design

The code for the sample is very straightforward and should be self-explanatory. The only part that requires a more detailed explanation is the setup of the callback class module. I'll examine the core code of the sample first, and then go into detail about the callback code.

The form (frmMain) contained in the file frmMain.frm contains the following code (some incidental code has either been eliminated or reduced for brevity).

Option Explicit
Dim oVoice As VTxtAuto.VTxtAuto   'Voice object
Dim bTimerLoop As Boolean   'Looping constraint
Dim bMonitorClipboard As Boolean     'Toggle for clipboard monitoring

The code below is invoked when the Speak button is clicked. This code is used in two ways: to start the speaking or to end the speaking. The caption on the button toggles between "Speak" and "Stop." The callback routine is the actual code that changes the caption. The bMonitorClipboard variable is used to determine whether to speak via the timer code or simply to speak the current contents of the editbox.

The code checks to see if the VTxtAuto control is speaking, in which case the control is told to stop speaking. If the control is not already speaking, either the timer loop is enabled for speaking or the contents of the editbox are spoken.

Private Sub btnSpeak_Click()
   If bMonitorClipboard Then
      'Speak or stop speaking, depending on state
      If oVoice.IsSpeaking Or bTimerLoop Then
         'Disable all constraints and redo button caption
         bTimerLoop = False
         tmrSpeak.Enabled = False
         oVoice.StopSpeaking
      Else
         'Indicate we're speaking and start
         tmrSpeak.Enabled = True
      End If
   Else
      'Speak or stop speaking, depending on state
      If oVoice.IsSpeaking Then
         'Stop speaking; the callback redoes the button caption
         oVoice.StopSpeaking
      Else
         'Indicate we're speaking and start SPEAKING
         oVoice.Speak txtTextToSpeak, vtxtst_READING
      End If
   End If
End Sub

The code below is the form initialization and termination code. In the initialization code we create a new instance of the VTxtAuto control, register it, and connect the callback code to the object. In the termination code we stop any outstanding speaking and clean up our resources.

Private Sub Form_Load()
   'Get instance of Voice object and register it
   Set oVoice = New VTxtAuto.VTxtAuto
   oVoice.Register App.Title, App.EXEName
   oVoice.Callback = "Speak2Me.VTxtCallback"
   'Greet user
   oVoice.Speak txtTextToSpeak, vtxtst_READING
   'Set for non looping speech
   bTimerLoop = False
End Sub
Private Sub Form_Unload(Cancel As Integer)
   'Stop any speaking and clean up
   If oVoice.IsSpeaking Then
      oVoice.StopSpeaking
   End If
   Set oVoice = Nothing
End Sub

The code below is the timer code. It checks to see if the object is already speaking. If it is, the routine is merely exited. The next step is to compare the existing editbox contents with the contents of the clipboard. If they are the same, the routine is exited.

If the two previous checks pass, then the clipboard contents are stored for future comparisons and the text is spoken via the VTxtAuto object.

Private Sub tmrSpeak_Timer()
   Dim sText As String
   'Exit if already speaking, to prevent overlap
   If oVoice.IsSpeaking Then
      Exit Sub
   End If
   'Get clipboard text and exit if not changed
   sText = Clipboard.GetText
   If sText = txtTextToSpeak Then
      Exit Sub
   End If
   'Set up text and looping constraint
   txtTextToSpeak = sText
   bTimerLoop = True
   'Speak to user
   oVoice.Speak txtTextToSpeak, vtxtst_READING
End Sub
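The timer code above assumes that the clipboard always holds text. If the user copies a bitmap or some other non-text data, a check such as the following might be useful. This is a sketch rather than part of the Speak2Me sample: GetSpeakableClipboardText is a hypothetical helper, built on the Clipboard object's GetFormat method and the vbCFText constant in Visual Basic.

```vb
' Hypothetical helper: returns the clipboard text, or an empty string
' if the clipboard does not currently hold text.
Private Function GetSpeakableClipboardText() As String
    'GetFormat reports whether the clipboard holds data of the given type
    If Clipboard.GetFormat(vbCFText) Then
        GetSpeakableClipboardText = Clipboard.GetText(vbCFText)
    End If
End Function
```

The timer routine could then call this helper in place of Clipboard.GetText and exit early when the result is an empty string.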

The callback class (VTxtCallback) contained in the file vtxtclbk.cls is the code that the instantiated VTxtAuto object calls when it wants to raise an event. The Visual Basic specification states that code like this must be contained within a class module, and not just an ordinary .bas module. Accordingly, the .cls module contains just the code that is necessary for the two specified events. The following code demonstrates the two events that the VTxtAuto object can interface with.

Option Explicit
Function SpeakingStarted()
   frmMain.btnSpeak.Caption = "&Stop"
End Function
Function SpeakingDone()
   frmMain.btnSpeak.Caption = "&Speak"
End Function

Conclusion

Performing text-to-speech conversion used to be an expensive and complicated process that most developers were not knowledgeable enough to code. With the Microsoft Speech SDK, and in particular, the Text-to-Speech VTxtAuto ActiveX COM object, this complexity is no longer an issue. With a few simple method calls, such as Register and Speak, any developer can create a TTS-enabled application.

No longer are applications limited to plain old dialog box messages. Now the application can speak to you!