Amiga® RKM Devices: 8 Narrator Device

This chapter describes the narrator device which, together with the
translator library, provides all of the Amiga's text-to-speech functions.
The narrator device is used to produce high-quality human-like speech in
real time.


   Feature        Description         Function
   -------        -----------         --------
   NDB_NEWIORB    Flag                Use V37 features
   NDB_WORDSYNC   Flag                Synchronize speech/mouth on words
   NDB_SYLSYNC    Flag                Synchronize speech/mouth on syllables
   F0enthusiasm   narrator_rb field   F0 excursion factor
   F0perturb      narrator_rb field   Amount of F0 perturbation
   F1adj          narrator_rb field   F1 adjustment in +/-5% steps
   F2adj          narrator_rb field   F2 adjustment in +/-5% steps
   F3adj          narrator_rb field   F3 adjustment in +/-5% steps
   A1adj          narrator_rb field   A1 adjustment in decibels
   A2adj          narrator_rb field   A2 adjustment in decibels
   A3adj          narrator_rb field   A3 adjustment in decibels
   articulate     narrator_rb field   Transition time multiplier
   centralize     narrator_rb field   Degree of vowel centralization
   centphon       narrator_rb field   Pointer to central ASCII phon
   AVbias         narrator_rb field   Amplitude of voicing bias
   AFbias         narrator_rb field   Amplitude of frication bias
   priority       narrator_rb field   Priority while speaking

   Compatibility Warning:
   The new features for the 2.0 narrator device are not backwards compatible.

 Narrator Device Commands and Functions 
 Device Interface 
 Writing to the Narrator Device 
 Reading from the Narrator Device 
 How to Write Phonetically for Narrator 
 A More Technical Explanation 
 Example Speech and Mouth Movement Program 
 Additional Information on the Narrator Device 

8 Narrator Device / Narrator Device Commands and Functions

Command         Operation
-------         ---------
CMD_FLUSH       Purge all active and queued requests for the narrator
                device.

CMD_READ        Read mouth shapes associated with an active write from the
                narrator device.

CMD_RESET       Reset the narrator port to its initialized state. All
                active and queued I/O requests will be aborted.  Restarts
                the device if it has been stopped.

CMD_START       Restart the currently active speech (if any) and resume
                queued I/O requests.

CMD_STOP        Stop any currently active speech and prevent queued I/O
                requests from starting.

CMD_WRITE       Write a stream of characters to the narrator device and
                generate mouth movement data for reads.

Exec Functions as Used in This Chapter
AbortIO()       Abort a command to the narrator device. If the command is
                in progress, it is stopped immediately.  If it is queued,
                it is removed from the queue.

BeginIO()       Initiate a command and return immediately (asynchronous
                request).  This is used to minimize the amount of system
                overhead.

CloseDevice()   Relinquish use of the narrator device.  All requests must
                be complete.

CheckIO()       Return the status of an I/O request.

CloseLibrary()  Relinquish use of a previously opened library.

DoIO()          Initiate a command and wait for completion (synchronous
                request). Should be used with care because it will not
                return control if the request does not complete.

OpenDevice()    Obtain use of the narrator device.

OpenLibrary()   Obtain use of a library.

SendIO()        Initiate a command and return immediately (asynchronous
                request).

WaitIO()        Wait for the completion of an asynchronous request. When
                the request is complete the message will be removed from
                reply port.

Exec Support Functions as Used in This Chapter
CreateExtIO()   Create an extended I/O request structure of type
                narrator_rb.  This structure will be used to communicate
                commands to the narrator device.

CreatePort()    Create a signal message port for reply messages from the
                narrator device.  Exec will signal a task when a message
                arrives at the port.

DeleteExtIO()   Delete an extended I/O request structure created by
                CreateExtIO().

DeletePort()    Delete the message port created by CreatePort().

8 Narrator Device / Device Interface

The narrator device operates like all other Amiga devices.  To use the
narrator device, you must first open it.  This initializes certain global
areas, opens the audio device, allocates audio channels, and performs
other housekeeping functions.  Once open, the device is ready to receive
I/O commands (most typically CMD_WRITE and CMD_READ). Finally, when
finished, the user should close the device.  This will free some buffers
and allow the entire device to be expunged should the system require
memory.  See the Introduction to Amiga System Devices chapter for general
information on device usage.

The narrator device uses two extended I/O request structures: narrator_rb
for write commands (to produce speech output) and mouth_rb for read
commands (to receive mouth shape changes and word/syllable synchronization
events).  Both I/O request structures have been expanded (in a backwards
compatible fashion) for the V37 narrator device with several new fields

    struct narrator_rb
    {
       struct IOStdReq  message; /* Standard IORequest Block       */
       UWORD   rate;             /* Speaking rate (words/minute)   */
       UWORD   pitch;            /* Baseline pitch in Hertz        */
       UWORD   mode;             /* Pitch mode                     */
       UWORD   sex;              /* Sex of voice                   */
       UBYTE   *ch_masks;        /* Pointer to audio allocation maps  */
       UWORD   nm_masks;         /* Number of audio allocation maps   */
       UWORD   volume;           /* Volume. 0 (off) thru 64        */
       UWORD   sampfreq;         /* Audio sampling frequency       */
       UBYTE   mouths;           /* If non-zero, generate mouths   */
       UBYTE   chanmask; /* Which ch mask used (internal - do not modify)*/
       UBYTE   numchan;  /* Num ch masks used (internal- do not modify) */
       UBYTE   flags;            /* New feature flags              */
       UBYTE   F0enthusiasm;     /* F0 excursion factor            */
       UBYTE   F0perturb;        /* Amount of F0 perturbation      */
       BYTE    F1adj;            /* F1 adjustment in +- 5% steps   */
       BYTE    F2adj;            /* F2 adjustment in +- 5% steps   */
       BYTE    F3adj;            /* F3 adjustment in +- 5% steps   */
       BYTE    A1adj;            /* A1 adjustment in decibels      */
       BYTE    A2adj;            /* A2 adjustment in decibels      */
       BYTE    A3adj;            /* A3 adjustment in decibels      */
       UBYTE   articulate;       /* Transition time multiplier     */
       UBYTE   centralize;       /* Degree of vowel centralization */
       char    *centphon;        /* Pointer to central ASCII phon  */
       BYTE    AVbias;           /* Amplitude of voicing bias      */
       BYTE    AFbias;           /* Amplitude of frication bias    */
       BYTE    priority;         /* Priority while speaking        */
       BYTE    pad1;             /* For alignment                  */
    };

    struct mouth_rb
    {
       struct  narrator_rb voice;/* Speech IORequest Block         */
       UBYTE   width;            /* Mouth width (returned value)   */
       UBYTE   height;           /* Mouth height (returned value)  */
       UBYTE   shape;            /* Internal use, do not modify    */
       UBYTE   sync;             /* Returned sync events           */
    };

Details on the meaning of the various fields of the two I/O request blocks
can be found in the Writing to the Narrator Device and
Reading from the Narrator Device sections later in this chapter.  See
the include file devices/narrator.h for the complete structure definitions.

 The Amiga Speech System 
 Opening The Narrator Device 
 Closing The Narrator Device 

8 / Device Interface / The Amiga Speech System

The speech system on the Amiga is divided into two subsystems:

   *  The translator library, consisting of a single function:
      Translate(), which converts an English string into its phonetic
      representation, and

   *  The narrator device, which uses the phonetic representation
      (generated either manually or by the translator library) as input to
      generate human-like speech and play it out via the audio device.

The two subsystems can be used either together or individually. Generally,
hand coding phonetic text will produce better quality speech than using
the translator library, but this requires the programmer to "hard code"
the phonetic text in the program or otherwise restrict the input to
phonetic text only.  If the program must handle arbitrary English input,
the translator library should be used.

Below is an example of how you would use the translator library to
translate a string for the narrator device.

    #define BUFLEN 500

    APTR EnglStr;                   /* pointer to sample input string */
    LONG EnglLen;                   /* input length */
    UBYTE PhonBuffer[BUFLEN];       /* place to put the translation */
    LONG rtnCode;                   /* return code from function */

    struct narrator_rb *VoiceIO;    /* speaking I/O request block */
    struct mouth_rb *MouthIO;       /* mouth movement I/O request block */

    EnglStr = "This is Amiga speaking.";    /* a test string */
    EnglLen = strlen(EnglStr);
    rtnCode = Translate(EnglStr, EnglLen, (APTR)&PhonBuffer[0], BUFLEN);

    VoiceIO->message.io_Command = CMD_WRITE;
    VoiceIO->message.io_Offset  = 0;
    VoiceIO->message.io_Data    = PhonBuffer;
    VoiceIO->message.io_Length  = strlen(PhonBuffer);
    DoIO((struct IORequest *)VoiceIO);

This chapter discusses only the narrator device; refer to the "Translator
Library" chapter of the Amiga ROM Kernel Reference Manual: Libraries for
more information on the translator library.

While the narrator device on the Amiga supports all of the major device
commands (see the Narrator Device Commands and Functions section), two of
these commands do most of the work in the device.  They are:

   *  CMD_WRITE - This command is used to send a phonetic string to the
      device to be spoken.  The narrator_rb I/O request block also contains
      several parameters which can be set to control various aspects of
      the speech, such as pitch, speaking rate, male/female voice, and so
      on. Some of the options are rather arcane.  See the
      Writing to the Narrator Device section for a complete list
      of options and their descriptions.

   *  CMD_READ - The narrator device can be told to generate various
      synchronization events which the user can query.  These events are:
      mouth shape changes, word sync, and/or syllable sync.  The events can
      be generated singly or in any combination, as requested by the user.
      Word and syllable synchronization events are new to system 2.0 and
      later (V37 and later of the narrator device).  See the
      Reading from the Narrator Device section for more details.

8 / Device Interface / Opening The Narrator Device

Three primary steps are required to open the narrator device:

   *  Create a message port using CreatePort(). Reply messages from the
      device must be directed to a message port.

   *  Create an extended I/O request structure of type narrator_rb.  The
      narrator_rb structure is created by the CreateExtIO() function.

   *  Open the narrator device.  Call OpenDevice() passing the I/O request.

    struct MsgPort *VoiceMP;
    struct narrator_rb *VoiceIO;

    if (VoiceMP = CreatePort("speech_write",0))
        if (VoiceIO = (struct narrator_rb *)
                        CreateExtIO(VoiceMP,sizeof(struct narrator_rb)))
            /* OpenDevice() returns non-zero on failure */
            if (OpenDevice("narrator.device", 0,
                           (struct IORequest *)VoiceIO, 0))
                    printf("narrator.device did not open\n");

When the narrator device is first opened, it initializes certain fields in
the user's narrator_rb I/O request structure.  In order to maintain
backwards compatibility with older versions of the narrator device, a
mechanism was needed for the device to ascertain whether it was being
opened with a V37 or pre-V37 style I/O request structure. The pad field in
the pre-V37 narrator_rb I/O request structure (which no one should have
ever touched!) has been replaced by the flags field in the V37 narrator_rb
structure, and is our path to upward compatibility.  The device checks to
see if a bit is set in this flags field.  This bit must be set before
opening the device if V37 or later features of the narrator device are to
be used.  There are two defined constants in the include file, NDB_NEWIORB
and NDF_NEWIORB. NDB_NEWIORB specifies the bit which must be set in the
flags field; NDF_NEWIORB is the field definition of the bit
(1 << NDB_NEWIORB).
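
The relationship between a bit number (NDB_) and its field definition
(NDF_) can be sketched in portable C.  This is only an illustration: the
bit numbers below are assumed placeholders, not values taken from this
chapter, and build_v37_flags() is a hypothetical helper.  Real programs
should include devices/narrator.h and use its definitions.

```c
/* Assumed bit numbers, for illustration only.  The real values come
 * from devices/narrator.h; the #ifndef guards let that header win. */
#ifndef NDB_NEWIORB
#define NDB_NEWIORB   0
#define NDB_WORDSYNC  1
#define NDB_SYLSYNC   2
#endif

/* An NDF_ mask is 1 shifted left by the matching NDB_ bit number. */
#ifndef NDF_NEWIORB
#define NDF_NEWIORB   (1 << NDB_NEWIORB)
#define NDF_WORDSYNC  (1 << NDB_WORDSYNC)
#define NDF_SYLSYNC   (1 << NDB_SYLSYNC)
#endif

/* Hypothetical helper: build a value for the flags field. */
unsigned char build_v37_flags(int want_word_sync, int want_syl_sync)
{
    unsigned char flags = NDF_NEWIORB;   /* always mark a V37 request */

    if (want_word_sync) flags |= NDF_WORDSYNC;
    if (want_syl_sync)  flags |= NDF_SYLSYNC;
    return flags;
}
```

The result would be stored in the flags field of the narrator_rb before
the call to OpenDevice(), since the device checks the bit when it is
opened.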

Once the device is opened, the mouth_rb (read) I/O request structure can
be set up.  Each CMD_READ request must be matched with an associated
CMD_WRITE request.  This is necessary for the device to match the various
sync events with a particular utterance.  The read I/O request structure
is easily set up as follows:

   *  Create a read message port using the CreatePort() function.

   *  Allocate memory for the mouth_rb extended I/O request structure using
      AllocMem().

   *  Copy the narrator_rb I/O request structure used to open the device
      into the voice field of the mouth_rb I/O request structure. This will
      set the fields necessary for the device to make the correct
      correspondence between read and write requests.

   *  Copy the pointer to the read message port returned from CreatePort()
      into the voice.message.io_Message.mn_ReplyPort field of the mouth_rb
      structure.

The following code fragment, in conjunction with the OpenDevice() code
fragment above, shows how to set up the mouth_rb structure:

    struct  MsgPort   *MouthMP;
    struct  mouth_rb  *MouthIO;

    if (MouthMP = CreatePort("narrator_read", 0))
        {
        if (MouthIO = (struct mouth_rb *)
                 AllocMem(sizeof(struct mouth_rb),MEMF_PUBLIC|MEMF_CLEAR))
            {
            /* Copy I/O request used in OpenDevice */
            MouthIO->voice = *VoiceIO;
            /* Set port */
            MouthIO->voice.message.io_Message.mn_ReplyPort = MouthMP;
            }
        else
            printf("AllocMem failed\n");
        }
    else
        printf("CreatePort failed\n");

8 / Device Interface / Closing The Narrator Device

Each OpenDevice() must eventually be matched by a call to CloseDevice().
This is necessary to allow the system to expunge the device in low memory
conditions.  As long as any task has the device open, or has forgotten to
close it before terminating, the narrator device will not be expunged.

All I/O requests must have completed before the task can close the device.
If any requests are still pending, the user must abort them before closing
the device.

    if (!(CheckIO((struct IORequest *)VoiceIO)))
        AbortIO((struct IORequest *)VoiceIO); /* Abort queued or in progress request */
    WaitIO((struct IORequest *)VoiceIO); /* Wait for abort to do its job */
    CloseDevice((struct IORequest *)VoiceIO); /* Close the device */

8 Narrator Device / Writing to the Narrator Device

You write to the narrator device by passing a narrator_rb I/O request to
the device with CMD_WRITE set in io_Command, the number of bytes to be
written set in io_Length and the address of the write buffer set in
io_Data.

    VoiceIO->message.io_Command = CMD_WRITE;
    VoiceIO->message.io_Offset  = 0;
    VoiceIO->message.io_Data    = PhonBuffer;
    VoiceIO->message.io_Length  = strlen(PhonBuffer);
    DoIO((struct IORequest *)VoiceIO);

You can control several characteristics of the speech, as indicated in the
narrator_rb struct shown in the Device Interface section.

Generally, the narrator device attempts to speak in a non-regional dialect
of American English.  With pre-V37 versions of the device, the user could
change only a few of the more basic aspects of the speaking voice such as
pitch, male/female, speaking rate, etc.  With the V37 and later versions
of the narrator device, the user can now change many more aspects of the
speaking voice.  In addition, in the pre-V37 device, only mouth shape
changes could be queried by the user.  With the V37 device, the user can
also receive start of word and start of syllable synchronization events.
These events can be generated independently, giving the user much greater
flexibility in synchronizing voice to animation or other effects.

The following describes the fields of the narrator_rb structure:

io_Data
    Points to a NULL-terminated ASCII phonetic input string.  For
    backwards compatibility, the string may also be terminated
    with a "#" symbol.  See the How to Write Phonetically for Narrator
    section of this chapter for details.

io_Length
    Length of the input string.  The narrator device will parse the
    input string until either a NULL or a "#" is encountered, or until
    io_Length characters have been processed.
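
The stopping rule described above can be modelled in a few lines of
portable C.  This is a sketch of the stated rule only, not the device's
actual parser; phon_effective_length() is a hypothetical name.

```c
#include <stddef.h>

/* Hypothetical helper: number of characters that would be consumed
 * from buf, stopping at a NUL, a '#', or after io_length characters,
 * whichever comes first. */
size_t phon_effective_length(const char *buf, size_t io_length)
{
    size_t n = 0;

    while (n < io_length && buf[n] != '\0' && buf[n] != '#')
        n++;
    return n;
}
```

Under this model a buffer holding "AB#CD" with an io_Length of 100 is
read only up to the "#", while "ABCDE" with an io_Length of 3 is cut off
after three characters.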

rate
    The speaking rate in words/minute.  Range is from 40 to 400 wpm.

pitch
    The baseline pitch of the speaking voice.  Range is 65 to 320 Hertz.

mode
    The F0 (pitch) mode.  ROBOTICF0 produces a monotone pitch, NATURALF0
    produces a normal pitch contour, and MANUALF0 (new for V37 and later)
    gives the user more explicit control over the pitch contour by
    creative use of accent numbers.  In MANUALF0 mode, a given accent
    number will have the same effect on the pitch regardless of its
    position in the sentence and its relation to other accented
    syllables. In NATURALF0 mode, accent numbers have a reduced effect
    towards the end of sentences (especially long ones).  In addition,
    the proximity of other accented syllables, the number of syllables in
    the word, and the number of phrases and words in the sentence all
    affect the pitch contour.  In MANUALF0 mode these things are ignored
    and it's up to the user to do the controlling.  This has the
    advantage of being able to have the pitch be more expressive.  The
    F0enthusiasm field will scale the effect.

sex
    Controls the sex of the speaking voice (MALE or FEMALE).  In
    actuality, only the formant targets are changed.  The user must still
    change the pitch and speaking rate of the voice to get the correct
    sounding sex.  See the include files for default pitch and rate
    settings.

ch_masks
    Pointer to a set of audio allocation maps.  See the "Audio Device"
    chapter for details.

nm_masks
    Number of audio allocation maps.  See the "Audio Device" chapter
    for details.

volume
    Sets the volume of the speaking voice.  Range 0 - 64.

sampfreq
    The synthesizer is "tuned" to a sampling frequency of 22,200 Hz.
    Changing sampfreq affects pitch and formant tunings and can be used
    to create unusual vocal effects.  For V37 and later, it is
    recommended that F1adj, F2adj, and F3adj be used instead to achieve
    this effect.

mouths
    If set to a non-zero value, directs the narrator device to
    generate mouth shape changes and send this data to the user in
    response to read requests.  See the Reading from the Narrator Device
    section for more details.

chanmask
    Used internally by the narrator device.  The user should not modify
    this field.

numchan
    Used internally by the narrator device.  The user should not modify
    this field.

flags (V37)
    Used to specify V37 features of the device.  Possible bit settings
    are: NDB_NEWIORB - I/O request block  uses V37 features. NDB_WORDSYNC
    - Device should generate start of word sync events. NDB_SYLSYNC -
    Device should generate start of syllable sync events. These bit
    definitions and their corresponding field definitions (NDF_NEWIORB,
    NDF_WORDSYNC, and NDF_SYLSYNC) can be found in the include files.

F0enthusiasm (V37)
    The value of this field controls the scaling of pitch (F0) excursions
    used on accented syllables and has the effect of making the narrator
    device sound more or less "enthusiastic" about what it is saying.
    It is calibrated in 1/32s with unity (32) being the default value.
    Higher values cause more F0 variation, lesser values cause less.
    This feature is most useful in manual F0 mode.
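
Because the field is calibrated in units of 1/32, converting a desired
scaling factor to a field value is a matter of multiplying by 32.  The
following portable C sketch shows the arithmetic;
f0enthusiasm_from_factor() is a hypothetical helper, and how the device
applies the value internally is not covered here.

```c
/* Hypothetical helper: convert a scaling factor (1.0 = default) to an
 * F0enthusiasm value in units of 1/32, clamped to the UBYTE range. */
unsigned char f0enthusiasm_from_factor(double factor)
{
    double units = factor * 32.0 + 0.5;   /* round to nearest unit */

    if (units < 0.0)   return 0;
    if (units > 255.0) return 255;
    return (unsigned char)units;          /* truncation finishes rounding */
}
```

A factor of 1.0 gives the default of 32, 1.5 gives 48, and 0.5 gives 16.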

F0perturb (V37)
    Non-zero values in this field cause varying amounts of random
    low-frequency modulation of the pitch (F0).  In other words, the
    pitch shakes in much the same way as an elderly person's voice does.
    Range is 0 to 255.

F1adj, F2adj, F3adj (V37)
    Changes the tuning of the formant frequencies. A formant is a major
    vocal tract resonance, and the frequencies of these formants move
    continuously as we speak.  Traditionally, they have been given the
    abbreviations of F1, F2, F3... with F1 being the one lowest in
    frequency.  Moving these formants away from their normal positions
    causes drastic changes in the sound of the voice and is a very
    powerful tool in the creation of character voices.  This adjustment
    is in +/-5% steps.  Positive values raise the formant frequencies and
    vice versa. The default is zero.  Use these adjustments instead of
    changing sampfreq.

A1adj, A2adj, A3adj (V37)
    In a parallel formant synthesizer, the amplitudes of the formants
    need to be specified along with their frequencies.  These fields bias
    the amplitudes computed by the narrator device.  This is useful for
    creating different tonal balances (bass or treble), and listening to
    formants in isolation for educational purposes.  The adjustments are
    calibrated directly in +/-1db (decibel) steps.  Using negative values
    will cause no problems; use of positive numbers can cause clipping.
    If you want to raise an amplitude, try cutting the others the same
    relative amount, then bring them all up equally until clipping is
    heard, then back them off.  This should produce an optimum setting.
    These fields have a +31 to -32 db range and the value -32db is
    equivalent to -infinity, shutting that formant off completely.
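
The advice to raise one formant by cutting the others can be expressed
as a small portable C sketch.  Both helpers below are hypothetical and
only compute candidate values for A1adj, A2adj, and A3adj; the final
step of bringing all three up until clipping is heard still has to be
done by ear.

```c
#define AMP_BIAS_OFF  (-32)   /* lowest value: formant shut off  */
#define AMP_BIAS_MAX  31      /* highest value in the range      */

/* Hypothetical helper: clamp a requested bias in db to the documented
 * +31 to -32 range so that it fits an amplitude adjustment field. */
signed char clamp_amp_bias(int db)
{
    if (db < AMP_BIAS_OFF) return AMP_BIAS_OFF;
    if (db > AMP_BIAS_MAX) return AMP_BIAS_MAX;
    return (signed char)db;
}

/* Hypothetical helper: emphasize one formant (which = 0, 1, or 2 for
 * A1, A2, or A3) by db decibels relative to the other two, without
 * using positive biases: the chosen formant stays at 0 and the other
 * two are cut by db. */
void emphasize_formant(int which, int db, signed char adj[3])
{
    int i;

    for (i = 0; i < 3; i++)
        adj[i] = (i == which) ? 0 : clamp_amp_bias(-db);
}
```

For example, emphasizing the first formant by 6 db yields adjustments of
0, -6, and -6.  Note that a request of 32 db or more clamps the other
two formants to -32, which turns them off.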

articulate (V37)
    According to the popular theories of speech production, we move our
    articulators (jaw, tongue, lips, etc.) smoothly from one "target"
    position to the next.  These articulatory targets correspond to
    acoustic targets specified by the narrator device for each phoneme.
    The device calculates the time it should take to get from one target
    to the next and this field allows you to intervene in that process.
    Values larger than the default will cause the transitions to be
    proportionately longer and vice versa.  This field is calibrated in
    percent with 100 being the default.  For example, a value of 50 will
    cause the transitions to take half the normal time, with the result
    being "sharper", more deliberate sounding speech (not necessarily
    more natural).  A value of 200  will cause the transitions to be
    twice as long, slurring the speech.  Zero is a special value: the
    narrator device will take special measures to create no transitions
    at all and each phoneme will simply be abutted to the next.
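
As a rough model of the percentage calibration, the following portable C
sketch scales a nominal transition time by the articulate value.  The
function name and the time unit are assumptions made for illustration;
the device's own calculation is not documented here.

```c
/* Hypothetical model: scale a nominal transition time by the
 * articulate field (percent, 100 = default, 0 = no transition). */
unsigned long scaled_transition(unsigned long nominal,
                                unsigned char articulate)
{
    return nominal * articulate / 100UL;
}
```

With a nominal time of 40 units, values of 50, 100, and 200 give 20, 40,
and 80 units respectively, and 0 gives no transition at all.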

centralize (V37)
    This field together with centphon can be used to create regional
    accent effects by modifying vowel sounds.  centralize specifies the
    degree (in percent) to which vowel targets are "pulled" towards the
    targets of the vowel specified by centphon.   The default value of 0%
    indicates that each vowel in the utterance retains its own target
    values.  The maximum value of 100% indicates that each vowel's
    targets are replaced by the targets of the specified vowel.
    Intermediate values control the degree of interpolation between the
    utterance vowel's targets and the targets of the vowel specified by
    centphon.
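
The effect of centralize can be pictured as an interpolation between two
target values.  The portable C sketch below assumes the interpolation is
linear, which this chapter does not state; centralized_target() and the
sample numbers are hypothetical.

```c
/* Hypothetical model: pull a vowel target toward the target of the
 * centphon vowel by 'centralize' percent (0 = unchanged, 100 =
 * replaced).  Linear interpolation is an assumption. */
long centralized_target(long vowel, long central, unsigned centralize)
{
    if (centralize > 100)
        centralize = 100;               /* 100% is the maximum */
    return vowel + (central - vowel) * (long)centralize / 100;
}
```

With a vowel target of 700 and a central target of 500, a centralize
value of 0 leaves 700, 50 gives 600, and 100 gives 500.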

centphon (V37)
    Pointer to an ASCII string specifying the vowel whose targets are
    used in the interpolation specified by centralize.  The vowels which
    can be specified are: IY, IH, EH, AE, AA, AH, AO, OW, UH, ER, UW.
    Specifying other than these will result in an error code being
    returned.

AVbias, AFbias (V37)
    Controls the relative amplitudes of the voiced and unvoiced speech
    sounds.  Voiced sounds are those made with the vocal cords vibrating,
    such as vowels and some consonants like y, r, w, and m.  Unvoiced
    sounds are made without the vocal cords vibrating and use the sound
    of turbulent air, such as s, t, sh, and f.  Some sounds are
    combinations of both such as z and v.  AVbias and AFbias change the
    default amplitude of the voiced and unvoiced components of the sounds
    respectively.  (AV stands for Amplitude of Voicing and AF stands for
    Amplitude of Frication).  These fields are calibrated in +/-1db steps
    and have the same range as the other amplitude biases, namely +31 to
    -32 db.  Again, positive values may cause clipping.  Negative values
    are the most useful.

priority (V37)
    Task priority while speaking.  When the narrator device begins to
    synthesize a sentence, the task priority remains unchanged while it
    is calculating acoustic parameters.  However, when speech begins at
    the end of this process, the priority is bumped to 100 (the default
    value). If you wish, you may change this to anything you want.
    Higher values will tend to lock out most anything while speech is
    going on, and lower values may cause audible breaks in the speech
    output.

The following example shows how to issue a write request to the narrator
device.  The first write is done with the default parameter settings.  The
second write is done after modifying the first and third formant loudness
and using the centralization feature.


8 Narrator Device / Reading from the Narrator Device

All read requests to the narrator device must be matched to an associated
write request.  This is done by copying the narrator_rb structure used in
the OpenDevice() call into the voice field of the mouth_rb I/O request
structure.  You must do this after the call to OpenDevice().  Matching the
read and write requests allows the narrator device to coordinate I/O
requests across multiple uses of the device.

In pre-V37 versions of the narrator device, only mouth shape changes can
be queried from the device.  This is done by setting the mouths field of
the narrator_rb I/O request structure (the write request) to a non-zero
value.  The write request is then sent asynchronously to the device and
while it is in progress, synchronous read requests are sent to the device
using the mouth_rb I/O request structure.  When the mouth shape has
changed, the device will return the read request to the user with bit 0
set in the sync field of the mouth_rb.  The fields width and height of the
mouth_rb structure will contain byte values which are proportional to the
actual width and height of the mouth for the phoneme currently being
spoken.  Read requests sent to the narrator device are not returned to the
user until one of two things happens: either the mouth shape has changed
(this prevents the user from having to constantly redraw the same mouth
shape), or the speech has completed.  The user can check io_Error to
determine if the mouth shape has changed (a return code of 0) or if the
speech has completed (return code of ND_NoWrite).

In addition to returning mouth shapes, reads to the V37 narrator device
can also perform two new functions: word and syllable sync. To generate
word and/or syllable sync events, the user must specify several bits in
the flags field of the write request (narrator_rb structure).  The bits
are NDB_WORDSYNC and NDB_SYLSYNC, for start of word and start of syllable
synchronization events, respectively, and, of course, NDB_NEWIORB, to
indicate that the V37 I/O request is required.

NDB_WORDSYNC and NDB_SYLSYNC tell the device to expect read requests and
to generate the appropriate event(s). As with mouth shape change events,
the write request is sent asynchronously to the device and, while it is in
progress, synchronous read requests are sent to the device.  The sync
field of the mouth_rb structure will contain flags indicating which events
(mouth shape changes, word sync, and/or syllable sync) have occurred.

The returned sync field flags are:

    bit 0 (0x01) -> mouth shape change event
    bit 1 (0x02) -> start-of-word synchronization event
    bit 2 (0x04) -> start-of-syllable synchronization event

One or more flags may be set for any particular read.
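
Because several flags can arrive in one read, each bit has to be tested
independently rather than with a switch or an else-if chain.  The
following portable C sketch tallies the events in a returned sync value.
The SYNC_ names and the sync_counts structure are hypothetical; only the
bit values come from the list above.

```c
/* Hypothetical names for the documented sync bits. */
#define SYNC_MOUTH     0x01   /* mouth shape change event  */
#define SYNC_WORD      0x02   /* start-of-word event       */
#define SYNC_SYLLABLE  0x04   /* start-of-syllable event   */

/* Hypothetical running totals kept by the caller. */
struct sync_counts
{
    unsigned mouths;
    unsigned words;
    unsigned syllables;
};

/* Test every bit: a single read may report any combination. */
void tally_sync(unsigned char sync, struct sync_counts *c)
{
    if (sync & SYNC_MOUTH)    c->mouths++;
    if (sync & SYNC_WORD)     c->words++;
    if (sync & SYNC_SYLLABLE) c->syllables++;
}
```

A sync value of 0x05, for instance, counts as one mouth shape change and
one syllable event from the same read.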

As with mouth shape changes, read requests will not return until the
requested event(s) have occurred, and the user must test the io_Error
field of the mouth_rb structure to tell when the speech has completed (an
error return of ND_NoWrite).

Several read events can be compressed into a single event.  This can occur
in two ways: first when two dissimilar events occur between two successive
read requests.  For example, a single read may return both a mouth change
and a syllable sync event.  This should not present a problem if the user
checks for all events. The second is when multiple events of the same type
occur between successive read requests.  This is of no great concern in
dealing with mouth shape changes because, presumably, mouth events are
used to drive animation, and the animation procedure will simply draw the
current mouth shape.

    Watch Those Sync Events.
    When word or syllable sync is desired, the narrator device may
    compress multiple sync events into a single sync event.  Missing a
    word or syllable sync may cause word highlighting (for example) to
    lose sync with the speech output.  A future version of the device
    will include an extension to the mouth_rb I/O request structure which
    will contain word and syllable counts and, possibly, other
    synchronization methods.

The following code fragment shows the basics of how to perform reads from
the narrator device.  For a more complete example, see the sample program
at the end of this chapter.  For this fragment, take the code of the
previous write example as a starting point.  Then the following code would
need to be added:

    struct  mouth_rb  *MouthIO;     /* Pointer to read IORequest block */
    struct  MsgPort   *MouthMP;     /* Pointer to read message port    */

/*
 * (1) Create a message port for the read request.
 */
    if (!(MouthMP = CreatePort("narrator_read", 0L)))
        BellyUp("Read CreatePort failed");

/*
 * (2) Create an extended IORequest of type mouth_rb.
 */
    if (!(MouthIO = (struct mouth_rb *)
                     CreateExtIO(MouthMP, sizeof(struct mouth_rb))))
        BellyUp("Read CreateExtIO failed");

/*
 * (3) Set up the read IORequest. Do this after the call to OpenDevice().
 *     We assume that the write IORequest (SpeakIO) has already been set
 *     up and that OpenDevice() has succeeded.  Copying it gives the read
 *     request the same device and unit, so only the reply port and the
 *     command need to be changed.
 */
    MouthIO->voice  =  *SpeakIO;
    MouthIO->voice.message.io_Message.mn_ReplyPort = MouthMP;
    MouthIO->voice.message.io_Command = CMD_READ;

/*
 * (4) Set the flags field of the narrator_rb write request to return the
 *     desired sync events.  If mouth shape changes are required, then the
 *     mouths field of the IORequest should be set to a non-zero value.
 */
    SpeakIO->mouths = 1;            /* Generate mouth shape changes */
    SpeakIO->flags = NDF_NEWIORB  | /* Indicates V37 style IORequest */
                     NDF_WORDSYNC | /* Request start-of-word sync events */
                     NDF_SYLSYNC;   /* Request start-of-syll sync events */

/*
 * (5) Issue asynchronous write request. The driver initiates the write
 *     request and returns immediately.
 */
    SendIO((struct IORequest *)SpeakIO);

/*
 * (6) Issue synchronous read requests. For each request we check the sync
 *     field to see which events have occurred.  Since any combination of
 *     events can be returned in a single read, we must check all
 *     possibilities.  We continue looping until the read request returns
 *     an error of ND_NoWrite, which indicates that the write request has
 *     completed.
 */
    for (DoIO((struct IORequest *)MouthIO);
         MouthIO->voice.message.io_Error != ND_NoWrite;
         DoIO((struct IORequest *)MouthIO))
    {
        if (MouthIO->sync & 0x01)  DoMouthShape();
        if (MouthIO->sync & 0x02)  DoWordSync();
        if (MouthIO->sync & 0x04)  DoSyllableSync();
    }

/*
 * (7) Finally, we must perform a WaitIO() on the original write request.
 */
    WaitIO((struct IORequest *)SpeakIO);
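
The bit tests in step (6) can be tried out away from the Amiga.  The
following is a minimal, portable sketch of that decoding only; it does no
device I/O.  The mask values are the ones used in the fragment, but the
macro names, the SyncCounts structure, and tally_sync() are hypothetical
and are not part of devices/narrator.h.

```c
/* Mask values as used in the read loop; the names are made up here. */
#define SYNC_MOUTH     0x01
#define SYNC_WORD      0x02
#define SYNC_SYLLABLE  0x04

/* Hypothetical running totals of the events seen so far. */
struct SyncCounts {
    unsigned mouth;
    unsigned word;
    unsigned syllable;
};

/*
 * Test every bit independently, because a single read may report any
 * combination of events.  Returns the number of events found in sync.
 */
int tally_sync(unsigned char sync, struct SyncCounts *counts)
{
    int found = 0;

    if (sync & SYNC_MOUTH)    { counts->mouth++;    found++; }
    if (sync & SYNC_WORD)     { counts->word++;     found++; }
    if (sync & SYNC_SYLLABLE) { counts->syllable++; found++; }
    return found;
}
```

In a real program the body of each test would call the mouth, word, or
syllable handler instead of incrementing a counter.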


8 Narrator Device / How to Write Phonetically for Narrator

This section describes in detail the procedure used to specify phonetic
strings to the narrator speech synthesizer. No previous experience with
phonetics is required. The only thing you may need is a good pronunciation
dictionary for those times when you doubt your own ears. You do not have
to learn a foreign language or computer language. You are just going to
learn how to write down the English that comes out of your own mouth. In
writing phonetically you do not have to know how a word is spelled, just
how it is said.

                            TABLE OF PHONEMES


          Phoneme      Example              Phoneme     Example
          -------      -------              -------     --------
          IY           beet, eat            IH          bit, in
          EH           bet, end             AE          bat, ad
          AA           bottle, on           AH          but, up
          AO           ball, awl            UH          book, soot
          ER           bird, early          OH          border
          AX*          about, calibrate     IX*         solid, infinite

         * AX and IX should never be used in stressed syllables.


          Phoneme      Example              Phoneme     Example
          -------      -------              -------     --------
          EY           bay, aid             AY          bide, I
          OY           boy, oil             AW          bound, owl
          OW           boat, own            UW          brew, boolean


          Phoneme      Example              Phoneme     Example
          -------      -------              -------     --------
          R            red                  L           long
          W            wag                  Y           yellow, comp(Y)uter
          M            men                  N           no
          NX           sing                 SH          shy
          S            soon                 TH          thin
          F            fed                  ZH          pleasure
          Z            has, zoo             DH          then
          V            very                 WH          when
          CH           check                J           judge
          /H           hole                 /C          loch
          B            but                  P           put
          D            dog                  T           toy
          K            keg, copy            G           guest

                             Special Symbols

          Phoneme               Example             Explanation
          -------               -------             -----------
          DX                    pity                tongue flap
          Q                     kitt(Q)en           glottal stop
          QX                                        silent vowel

                         Contractions (see text)

                               UL = AXL
                               IL = IXL
                               UM = AXM
                               IM = IXM
                               UN = AXN
                               IN = IXN

                          Digits and Punctuation

    Digits 1-9    Syllabic stress, ranging from secondary through emphatic
    .             Period - sentence final character.
    ?             Question mark - sentence final character
    -             Dash - phrase delimiter
    ,             Comma - clause delimiter
    ()            Parentheses - noun phrase delimiters (see text)

The narrator device works on utterances at the sentence level.  Even if
you want to say only one word, it will treat it as a complete sentence.
Therefore, narrator wants one of two punctuation marks to appear at the
end of every sentence - a period or a question mark.  The period is used
for almost all utterances and will cause a final fall in pitch to occur at
the end of a sentence.  The question mark is used at the end of yes/no
questions only, and results in a final rise in pitch.

For example, the question "Do you enjoy using your Amiga?" would take a
question mark at the end, while the question "What is your favorite
color?" should be followed (in the phonetic transcription) by a period.
If no punctuation appears at the end of a string, narrator will append a
dash to it, which will result in a short pause.  Narrator recognizes
other punctuation marks as well, but these are left for later discussion.
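
A program that builds phonetic strings can apply the end-of-sentence rule
itself rather than rely on the appended dash.  The sketch below is one
way to do that in plain C.  The function name terminate_utterance() and
its yes_no_question argument are hypothetical; deciding whether a
sentence is a yes/no question is left to the caller.

```c
#include <stddef.h>
#include <string.h>

/*
 * Make sure the phonetic string in buf ends with '.' or '?'.
 * yes_no_question selects '?' (final rise) instead of '.' (final fall).
 * size is the capacity of buf in bytes.
 * Returns 0 on success, -1 if there is no room for the extra character.
 */
int terminate_utterance(char *buf, size_t size, int yes_no_question)
{
    size_t len = strlen(buf);

    /* Ignore trailing blanks when looking for the final character. */
    while (len > 0 && buf[len - 1] == ' ')
        len--;

    if (len > 0 && (buf[len - 1] == '.' || buf[len - 1] == '?')) {
        buf[len] = '\0';            /* already terminated; trim blanks */
        return 0;
    }
    if (len + 2 > size)             /* need one character plus the NUL */
        return -1;

    buf[len] = yes_no_question ? '?' : '.';
    buf[len + 1] = '\0';
    return 0;
}
```

A string that already ends in a period or question mark is left alone,
so the caller keeps full control over the final pitch movement.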

 Phonetic Spelling            Hints For Intelligibility
 Stress And Intonation        Example Of English And Phonetic Texts
 Punctuation                  Concluding Remarks

8 / How to Write Phonetically for Narrator / Phonetic Spelling

Utterances are usually written phonetically using an alphabet of symbols
known as IPA (International Phonetic Alphabet).  This alphabet is found at
the front of most good dictionaries.  The symbols can be hard to learn and
were not readily available on computer keyboards, so the Advanced Research
Projects Agency (ARPA) came up with the ARPABET, a way of representing
each symbol using one or two upper case letters. Narrator uses an expanded
version of the ARPABET to specify phonetic sounds.

A phonetic sound, or phoneme, is a basic speech sound, a speech atom.
Working backwards: sentences can be broken into words, words into
syllables, and syllables into phonemes.  The word cat has three letters
and (coincidentally) three phonemes.  Looking at the table of phonemes we
find the three sounds that make up the word cat. They are the phonemes K,
AE, and T, written as KAET. The word cent translates as SEHNT. Notice that
both words begin with the letter c, but because they are pronounced
differently they have different phonetic spellings. These examples
introduce a very important concept of phonetic spelling: spell it like it
sounds, not like it looks.
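
To see how a spelling such as KAET breaks down into phoneme codes, the
following sketch splits a string using the codes from the table of
phonemes.  It assumes that a two-character code is always preferred over
a one-character code at the same position; that rule, and the name
split_phonemes(), are assumptions made for illustration and say nothing
about how narrator itself parses its input.  Stress digits and
punctuation are not handled here.

```c
#include <stddef.h>
#include <string.h>

/* Two-character codes from the table of phonemes, contractions included. */
static const char *const two_char_codes[] = {
    "IY", "IH", "EH", "AE", "AA", "AH", "AO", "UH", "ER", "OH", "AX", "IX",
    "EY", "AY", "OY", "AW", "OW", "UW",
    "NX", "SH", "TH", "ZH", "DH", "WH", "CH", "/H", "/C",
    "DX", "QX", "UL", "IL", "UM", "IM", "UN", "IN"
};

/* One-character codes from the same table. */
static const char one_char_codes[] = "RLWYMNSFZVJBPDTKGQ";

static int is_two_char_code(const char *s)
{
    size_t i;

    for (i = 0; i < sizeof two_char_codes / sizeof two_char_codes[0]; i++)
        if (strncmp(s, two_char_codes[i], 2) == 0)
            return 1;
    return 0;
}

/*
 * Split text into phoneme codes, copying each one into out[n].
 * Returns the number of codes, or -1 if a character is not part of any
 * known code or if more than max_codes codes are present.
 */
int split_phonemes(const char *text, char out[][3], int max_codes)
{
    int n = 0;

    while (*text != '\0') {
        int len;

        if (text[1] != '\0' && is_two_char_code(text))
            len = 2;
        else if (strchr(one_char_codes, *text) != NULL)
            len = 1;
        else
            return -1;

        if (n == max_codes)
            return -1;
        memcpy(out[n], text, (size_t)len);
        out[n][len] = '\0';
        n++;
        text += len;
    }
    return n;
}
```

Given KAET the function yields the three codes K, AE, and T.  Given the
English spelling CAT it fails, because C is not a phoneme code.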

 Choosing the Right Vowel 
 Choosing the Right Consonant 
 Contractions and Special Symbols 

8 / Phonetic Spelling / Choosing the Right Vowel

Phonemes, like letters, are divided into two categories: vowels and
consonants.  Loosely defined, a vowel is a continuous sound made with the
vocal cords vibrating and air exiting the mouth (as opposed to the nose).
A consonant is any other sound, such as those made by rushing air (like S
or TH), or by interruptions in the air flow by the lips or tongue (B or
T).  All vowels use a two letter ASCII phonetic code while consonants use
a one or two letter code.

In English we write with only five vowels: a, e, i, o, and u.  It would be
easy if we only said five vowels.  However, we say more than 15 vowels.
Narrator provides for most of them.  Choose the proper vowel by listening:
Say the word aloud, perhaps extending the vowel sound you want to hear,
and then compare the sound you are making to the sounds made by the vowels
in the examples on the phoneme list.  For example, the a in apple sounds
the same as the a in cat, not like the a in Amiga, talk, or made.  Notice
also that some of the example words in the list do not even use any of the
same letters contained in the phoneme code; for example AA as in bottle.

Vowels are divided into two groups: those that maintain the same sound
throughout their durations and those that change their sound.  The ones
that change are called diphthongs.  Some of us were taught the terms long
and short to describe vowel sounds.  Diphthongs fall into the long
category, but these two terms are inadequate to fully differentiate
between vowels and should be avoided.  The diphthongs are the last six
vowels listed in the table.  Say the word made out loud very slowly.
Notice how the a starts out like the e in bet but ends up like the e in
beet.  The a, therefore, is a diphthong in this word and we would use EY
to represent it.  Some speech synthesis systems require you to specify the
changing sounds in diphthongs as separate elements, but narrator takes
care of the assembly of diphthongal sounds for you.

8 / Phonetic Spelling / Choosing the Right Consonant

Consonants are divided into many categories by phoneticians, but we need
not concern ourselves with most of them.  Picking the correct consonant is
very easy if you pay attention to just two categories: voiced and
unvoiced.  A voiced consonant is made with the vocal cords vibrating, and
an unvoiced one is made when the vocal cords are silent. Sometimes English
uses the same letter combinations to represent both. Compare the th in
thin with the th in then. Notice that the first is made with air rushing
between the tongue and upper teeth.  In the second, the vocal cords are
vibrating also.  The voiced th phoneme is DH and the unvoiced one is TH.
Therefore, thin is phonetically spelled as THIHN while the word then is
spelled DHEHN.

A sound that is particularly subject to mistakes is the s sound, which
can be voiced or unvoiced: phonemes Z and S, respectively.  Clearly the
word bats ends with an S and the word has ends with a Z.  But how do you
spell close?  If you say "What time do you close?", you spell it with a
Z, and if you are saying "I love to be close to you", you use an S.

Another sound that causes some confusion is the r sound.  There are two
different r-like phonemes in the Narrator alphabet:  R under the
consonants and ER under the vowels.  Use ER if the r sound is the vowel
sound in the syllable like in bird, absurd, and flirt.  Use the R if the r
sound precedes or follows another vowel sound in that syllable as in car,
write, and craft.

8 / Phonetic Spelling / Contractions and Special Symbols

There are several phoneme combinations that appear very often in English
words.  Some of these are caused by our laziness in pronunciation.  Take
the word connector for example.  The o in the first syllable is almost
swallowed out of existence. You would not use the AA phoneme; you would
use the AX phoneme instead. It is because of this relaxation of vowels
that we find ourselves using AX and IX very often.  Since this relaxation
frequently occurs before l, m, and n, narrator has a shortcut for typing
these combinations. Instead of personal being spelled PERSIXNAXL, we can
spell it PERSINUL, making it a little more readable.  Anomaly goes from
AXNAAMAXLIY to UNAAMULIY, and KAAMBIXNEYSHIXN becomes KAAMBINEYSHIN for
combination.  It may be hard to decide whether to use the AX or IX brand
of relaxed vowel.  The only way to find out is to use both and see which
sounds best.
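
The contraction table can also be applied mechanically.  The sketch below
rewrites AX or IX followed by L, M, or N into the shorter form, turning
PERSIXNAXL into PERSINUL.  The function name is hypothetical, and the
check that leaves AX or IX alone before NX is an assumption added here so
that the NX phoneme is not split; the source text does not discuss that
case.

```c
static int is_l_m_or_n(char c)
{
    return c == 'L' || c == 'M' || c == 'N';
}

/*
 * Copy src to dst, replacing AXL, AXM, AXN with UL, UM, UN and
 * IXL, IXM, IXN with IL, IM, IN.  The output is never longer than the
 * input, so dst needs room for strlen(src) + 1 bytes.
 */
void contract_relaxed_vowels(const char *src, char *dst)
{
    while (*src != '\0') {
        if ((src[0] == 'A' || src[0] == 'I') && src[1] == 'X' &&
            is_l_m_or_n(src[2]) &&
            !(src[2] == 'N' && src[3] == 'X')) {   /* keep NX intact */
            *dst++ = (src[0] == 'A') ? 'U' : 'I';
            *dst++ = src[2];
            src += 3;
        } else {
            *dst++ = *src++;
        }
    }
    *dst = '\0';
}
```

Because narrator accepts both forms, a step like this is purely for
readability of the phonetic text.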

Other special symbols are used internally by narrator.  Sometimes they are
inserted into or substituted for part of your input sentence.  You can
type them in directly if you wish.  The most useful is probably the Q or
glottal stop, an interruption of air flow in the glottis.  The word
Atlantic has one between the t and the l. Narrator knows there should be a
glottal stop there and saves you the trouble of typing it. But narrator is
only close to perfect, so sometimes a word or word pair might slip by that
would have sounded better with a Q stuck in someplace.

8 / How to Write Phonetically for Narrator / Stress And Intonation

It is not enough to tell narrator what you want said.  For the best
results you must also tell narrator how you want it said.  In this way you
can alter a sentence's meaning, stress important words, and specify the
proper accents in polysyllabic words.  These things improve the
naturalness and thus the intelligibility of the spoken output.

Stress and intonation are specified by the single digits 1-9 following a
vowel phoneme code.  Stress and intonation are two different things, but
are specified by a single number.

Stress is, among other things, the elongation of a syllable.  A syllable
is either stressed or not, so the presence of a number after the vowel in
a syllable indicates stress on that syllable.  The value of the number
indicates the intonation. These numbers are referred to here as stress
marks, but keep in mind that they also affect intonation.

Intonation here means the pitch pattern or contour of an utterance. The
higher the stress mark, the higher the potential for an accent in pitch.
A sentence's basic contour consists of a quickly rising pitch gesture
up to the first stressed syllable in the sentence, followed by a slowly
declining tone throughout the sentence, and finally, a quick fall to a low
pitch on the last syllable.  The presence of additional stressed syllables
causes the pitch to break its slow, declining pattern with rises and falls
around each stressed syllable.  Narrator uses a very sophisticated
procedure to generate natural pitch contours based on how you mark the
stressed syllables.

 How and Where to Put the Stress Marks 
 Which Stress Value Do I Use? 

8 / / Stress And Intonation / How and Where to Put the Stress Marks

The stress marks go immediately to the right of vowel phoneme codes. The
word cat has its stress marked after the AE, e.g., KAE5T. You generally
have no choice about the location of a number; there is definitely a right
and wrong location.  A number should either go after a vowel or it should
not.  Narrator will not flag an error if you forget to put a stress mark
in or if you place it on the wrong vowel. It will only tell you if a
stress mark has been put after a non-vowel, i.e., consonant or punctuation.
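
The one rule narrator does enforce, that a stress mark must follow a
vowel, can be checked before a string is sent to the device.  The sketch
below scans a phonetic string and reports the first digit that does not
directly follow a vowel code.  It assumes two-character codes are matched
before one-character codes; that scanning rule and the function name
find_bad_stress_mark() are illustrative assumptions, not a description of
narrator's own parser.

```c
#include <stddef.h>
#include <string.h>

/* Vowel codes from the table of phonemes (diphthongs included). */
static const char *const vowel_codes[] = {
    "IY", "IH", "EH", "AE", "AA", "AH", "AO", "UH", "ER", "OH", "AX", "IX",
    "EY", "AY", "OY", "AW", "OW", "UW"
};

/* Remaining two-character codes; none of these may carry a stress mark. */
static const char *const other_codes[] = {
    "NX", "SH", "TH", "ZH", "DH", "WH", "CH", "/H", "/C",
    "DX", "QX", "UL", "IL", "UM", "IM", "UN", "IN"
};

static int in_list(const char *s, const char *const *list, size_t count)
{
    size_t i;

    for (i = 0; i < count; i++)
        if (strncmp(s, list[i], 2) == 0)
            return 1;
    return 0;
}

/*
 * Returns the index of the first stress digit (1-9) that does not
 * directly follow a vowel code, or -1 if every stress mark is placed
 * after a vowel.
 */
int find_bad_stress_mark(const char *text)
{
    const size_t nv = sizeof vowel_codes / sizeof vowel_codes[0];
    const size_t no = sizeof other_codes / sizeof other_codes[0];
    int prev_was_vowel = 0;
    size_t i = 0;

    while (text[i] != '\0') {
        if (text[i] >= '1' && text[i] <= '9') {
            if (!prev_was_vowel)
                return (int)i;
            prev_was_vowel = 0;     /* only one mark per vowel */
            i++;
        } else if (text[i + 1] != '\0' && in_list(text + i, vowel_codes, nv)) {
            prev_was_vowel = 1;
            i += 2;
        } else if (text[i + 1] != '\0' && in_list(text + i, other_codes, no)) {
            prev_was_vowel = 0;
            i += 2;
        } else {
            prev_was_vowel = 0;     /* single consonant, space, punctuation */
            i++;
        }
    }
    return -1;
}
```

KAE5T passes this check, while KAET5 is rejected at the position of the
digit because the 5 follows the consonant T.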

The rules for placing stress marks are as follows:

   *  Always place a stress mark in a content word.  A content word is one
      that contains some meaning.  Nouns, verbs, and adjectives are all
      content words; they tell the listener what you are talking about.
      Words like but, if, and the are not content words. They do not convey
      any real world meaning, but are required to make the sentence
      function, so they are given the name function words.

   *  Always place a stress mark on the accented syllable(s) of
      polysyllabic words, whether they are content or function words.  A
      polysyllabic word is any word of more than one syllable.  Commodore
      has its stress (often called accent) on the first syllable and would
      be spelled KAA5MAXDOHR, while computer is stressed on the second
      syllable: KUMPYUW5TER.

If you are in doubt about which syllable gets the stress, look up the word
in a dictionary and you will find an accent mark over the stressed
syllable.  If more than one syllable in a word receives stress, they
usually are not of equal value.  These are referred to as primary and
secondary stresses.  The word understand has its first and last syllables
stressed, with the syllable stand getting the primary stress and the
syllable un getting the secondary stress. This produces the phonetic
representation AH1NDERSTAE4ND.  Syllables with secondary stress should be
marked with a value of only 1 or 2.

Compound words (words with more than one root) such as baseball, software,
and lunchwagon can be written as one word, but should be thought of as
separate words when marking stress.  Thus, lunchwagon would be spelled
LAH5NCHWAE2GIN.  Notice that the lunch got a higher stress mark than the
wagon.  This is common in compound words: the first word usually receives
the primary stress.

8 / / Stress And Intonation / Which Stress Value Do I Use?

If you get the spelling and stress mark positions correct, you are 95
percent of the way to a good sounding sentence.  The next thing to do is
decide on the stress mark values.  They can be roughly related to parts of
speech, and you can use the table shown below as a guide to assigning
values.

                        RECOMMENDED STRESS VALUES

                  Part of Speech         Stress Value
                  --------------         ------------
                  Exclamations               9
                  Adverbs                    7
                  Quantifiers                7
                  Nouns                      5
                  Adjectives                 5
                  Verbs                      4
                  Pronouns                   3
                  Secondary stress           1 or 2
                  Everything else            None

The above values merely suggest a range.  If you want attention directed
to a certain word, raise its value.  If you want to downplay a word, lower
it.  Sometimes even a function word can be the focus of a sentence.  It is
quite conceivable that the word to in the sentence "Please deliver this to
Mr. Smith." could receive a stress mark of 9. This would add focus to the
word, indicating that the item should be delivered to Mr. Smith in person.
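
The recommended values lend themselves to a small lookup.  The sketch
below encodes the table and adds a helper for raising or lowering a
value while keeping it inside the legal range of 1 through 9.  The enum,
the function names, and the use of 0 to mean "no stress mark" are all
hypothetical; secondary stress (1 or 2) is a per-syllable choice and is
not covered.

```c
/* Parts of speech from the recommended stress values table. */
enum PartOfSpeech {
    POS_EXCLAMATION,
    POS_ADVERB,
    POS_QUANTIFIER,
    POS_NOUN,
    POS_ADJECTIVE,
    POS_VERB,
    POS_PRONOUN,
    POS_OTHER           /* function words and everything else */
};

/* Returns the suggested primary stress digit, or 0 for no stress mark. */
int recommended_stress(enum PartOfSpeech pos)
{
    switch (pos) {
    case POS_EXCLAMATION: return 9;
    case POS_ADVERB:      return 7;
    case POS_QUANTIFIER:  return 7;
    case POS_NOUN:        return 5;
    case POS_ADJECTIVE:   return 5;
    case POS_VERB:        return 4;
    case POS_PRONOUN:     return 3;
    default:              return 0;
    }
}

/* Raise or lower a stress value by delta, clamped to the range 1-9. */
int adjust_stress(int value, int delta)
{
    int result = value + delta;

    if (result < 1) result = 1;
    if (result > 9) result = 9;
    return result;
}
```

The table is only a starting point, so adjust_stress() gives the caller a
way to direct attention to, or away from, a particular word.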

8 / How to Write Phonetically for Narrator / Punctuation

In addition to the period or question mark that is required at the end of
a sentence, Narrator also recognizes dashes, commas, and parentheses.

The comma goes where you would normally put a comma in an English
sentence.  It causes narrator to pause with a slightly rising pitch,
indicating that there is more to come.  The use of additional commas (that
is, more than would be required for written English) is often helpful.
They serve to set clauses off from one another.  There is a tendency for a
listener to lose track of the meaning of a sentence if the words run
together.  Read your sentence aloud while pretending to be a newscaster.
The locations for additional commas should leap out at you.

The dash serves almost the same purpose as the comma, except that the dash
does not cause the pitch to rise so severely.  A rule of thumb is: Use
dashes to divide phrases and commas to divide clauses.

Parentheses provide additional information to narrator's intonation
function.  They should be put around noun phrases of two or more content
words.  This means that the noun phrase "a giant yacht" should be
surrounded with parentheses because it contains two content words, giant
and yacht.  The phrase "my friend" should not have parentheses around it
because it contains only one content word.  Noun phrases can get fairly
large, like "the best time I've ever had" or "a big basket of fruit and
nuts".  The parentheses are most effective around these large phrases;
the smaller ones can sometimes go without.  The effect of parentheses is
subtle, and in some sentences you might not notice their presence.  In
sentences of great length, however, they help provide for a very natural
contour.

8 / How to Write Phonetically / Hints For Intelligibility

There are a few tricks you can use to improve the intelligibility of a
sentence.  Often, a polysyllabic word is more recognizable than a
monosyllabic word.  For instance, instead of saying huge, say enormous.
The longer version contains information in every syllable, thus giving the
listener a greater chance to hear it correctly.

Another good practice is to keep sentences to an optimal length. Writing
for reading and writing for speaking are two different things. Try not to
write a sentence that cannot be easily spoken in one breath. Such a
sentence tends to give the impression that the speaker has an infinite
lung capacity and sounds unnatural.  Try to keep sentences confined to one
main idea; run-on sentences tend to lose their meaning.

New terms should be highly stressed the first time they are heard. This
gives the listener something to cue on, and can aid in comprehension.

The insertion of the glottal stop phoneme Q at the end of a word can
sometimes help prevent slurring of one word into another.  When we speak,
we do not pause at the end of each word, but instead transition smoothly
between words.  This can sometimes reduce intelligibility by eliminating
word boundary cues.  Placing a Q (not the silent vowel QX) at the end of
a word results in some phonological effects taking place which can restore
the word boundary cues.

8 / Writing Phonetically / English And Phonetic Text Example

Cardiomyopathy.  I had never heard of it before,
but there it was listed as the form of heart disease
that felled not one or two but all three of the
artificial heart recipients.  A little research
produced some interesting results.  According to an
article in the Nov. 8, 1984, New
England Journal of Medicine, cigarette smoking causes
this lethal disease that weakens the heart's pumping
power.  While the exact mechanism is not clear, Dr.
Arthur J. Hartz speculated that nicotine or carbon
monoxide in the smoke somehow poisons the heart and
leads to heart failure.


8 / How to Write Phonetically for Narrator / Concluding Remarks

This guide should get you off to a good start in phonetic writing for
Narrator.  The only way to get really proficient is to practice.  Many
people become good at it in as little as one day.  Others make continual
mistakes because they find it hard to let go of the rules of English
spelling, so trust your ears.

8 Narrator Device / A More Technical Explanation

The narrator speech synthesizer is a computer model of the human speech
production process.  It attempts to produce accurately spoken utterances
of any English sentence, given only a phonetic representation as input.
Another program in the Amiga speech system, the translator library, derives
the required phonetic spelling from English text. Timing and pitch
contours are produced automatically by the synthesizer software.

In humans, the physical act of producing speech sounds begins in the
lungs.  To create a voiced sound, the lungs force air through the vocal
folds (commonly called the vocal cords), which are held under tension and
which periodically interrupt the flow of air, thus creating a buzz-like
sound.  This buzz, which has a spectrum rich in harmonics, then passes
through the vocal tract and out the lips and nose, which alters its
spectrum drastically.  This is because the vocal tract acts as a frequency
filter, selectively reinforcing some harmonics and suppressing others.  It
is this filtering that gives a speech sound its identity.  The amplitude
versus frequency graph of the filtering action is called the vocal tract
transfer function.  Changing the shape of the throat, tongue, and mouth
retunes the filter system to accentuate different frequencies.

The sound travels as a pressure wave through the air, and it causes the
listener's eardrum to vibrate.  The ear and brain of the listener decode
the incoming frequency pattern.  From this the listener can subconsciously
make a judgement about what physical actions were performed by the speaker
to make the sound.  Thus the speech chain is completed, the speaker having
encoded his physical actions on a buzz via selective filtering and the
listener having turned the sound into guesses about physical actions by
frequency decoding.

Now that we know how humans produce speech, how does the Amiga do it? It
turns out that the vocal tract transfer function is not random, but tends
to accentuate energy in narrow bands called formants.  The formant
positions move fairly smoothly as we speak, and it is the formant
frequencies to which our ears are sensitive.  So, luckily, we do not have
to model throat, tongue, teeth and lips with our computer; we can imitate
formant actions instead.

A good representation of speech requires up to five formants, but only the
lowest three are required for intelligibility.  The pre-V37 Narrator had
only three formants, while the V37 Narrator has five formants for a more
natural sounding voice.  We begin with an oscillator that produces a
waveform similar to that which is produced by the vocal folds, and we pass
it through a series of resonators, each tuned to a different formant
frequency.  By controlling the volume and pitch of the oscillator and the
frequencies of the resonators, we can produce highly intelligible and
natural-sounding speech.  Of course the better the model the better the
speech; but more importantly, experience has shown that the better the
control of the model's parameters, the better the speech.

Oscillators, volume controls, and resonators can all be simulated
mathematically in software, and it is by this method that the narrator
system operates.  The input phonetic string is converted into a series of
target values for the various parameters.  A system of rules then operates
on the string to determine things such as the duration of each phoneme and
the pitch contour.  Transitions between target values are created and
smoothed to produce natural, continuous changes from one sound to the next.

New values are computed for each parameter for every 8 milliseconds of
speech, which produces about 125 acoustic changes per second.  These
values drive a mathematical model of the speech synthesizer.  The accuracy
of this simulation is quite good.  Human speech has more formants than the
narrator model, but they are high in frequency and low in energy content.
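
The idea of moving a parameter from one target value to the next in
8 millisecond steps can be sketched in a few lines of C.  The example
uses straight-line interpolation purely as a stand-in; the smoothing that
narrator actually applies is not described in this chapter.  The function
names are hypothetical, and the frequencies mentioned afterwards are
arbitrary illustration values, not narrator data.

```c
#define FRAME_MS 8      /* parameter update interval given in the text */

/* Number of 8 ms frames needed to cover duration_ms, rounding up. */
int frames_for_duration(int duration_ms)
{
    return (duration_ms + FRAME_MS - 1) / FRAME_MS;
}

/*
 * Fill out[0..frames-1] with values that move in equal steps from start
 * toward target, so that the last frame lands exactly on target.
 * Linear interpolation is an assumption made for this sketch.
 */
void interpolate_parameter(int start, int target, int frames, int *out)
{
    int i;

    for (i = 0; i < frames; i++)
        out[i] = start + (target - start) * (i + 1) / frames;
}
```

For instance, a 32 millisecond transition occupies four frames, and a
formant moving from 500 Hz to 700 Hz over those frames would take the
values 550, 600, 650, and 700.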

The human speech production mechanism is a complex and wonderful thing.
The more we learn about it, the better we can make our computer
simulations.  Meanwhile, we can use synthetic speech as yet another
computer output device to enhance the man/machine dialogue.

8 Narrator Device / Additional Information on the Narrator Device

Additional programming information on the narrator device can be found in
the include files and the Autodocs for the narrator device and the
Autodocs for the translator library.  All are contained in the Amiga ROM
Kernel Reference Manual: Includes and Autodocs.

                    Narrator Device Information
                 INCLUDES        devices/narrator.h

                 AUTODOCS        narrator.doc

Converted on 22 Apr 2000 with RexxDoesAmigaGuide2HTML 2.1 by Michael Ranner.