Using Prosodic Characteristics in Czech Dialog System

Jana Kleckova and Vaclav Matousek



Faculty of Applied Sciences, University of West Bohemia in Pilsen

Univerzitni 22 CZ-306 14 Plzen

Czech Republic

kleckova@kiv.zcu.cz

 

ABSTRACT This paper deals with the use of a database of prosodic attributes that is being created for the Czech dialog system. A prototype of the recognition and understanding system was developed in the Department of Computer Science within the framework of the Copernicus Joint Research Project COP-1694 "SQEL" (Spoken Queries in European Languages). In Czech, a language with free word order, prosody carries critical information for the recognition and understanding system. For some sentences the intonation is essential for determining the core of the communication, because the speaker uses it to emphasize the meaning of the sentence. The prosodic characteristics of the sentence (features describing the fundamental frequency F0, the voice energy, the length of the pause before and after each word, the speaking rate, and flags indicating word finality and lexical word accent) are stored in the database and subsequently exploited by the linguistic module as additional information for recognizing and understanding spontaneous speech. Once the characteristics have been processed by standard statistical methods, the database can also be used to generate answers in the dialog system. The module was implemented in the C language and is supported by the ORACLE database.

 

KEYWORDS spontaneous speech, free-word-ordering, prosody, fundamental frequency, voice energy, recognition and understanding system



1. INTRODUCTION

A prototype of the dialog system developed in the Department of Computer Science carries out continuously spoken human-machine dialogs using speech input and output techniques (Kleckova 1995). The main components of the system prototype are the speech input/output interface, the acoustic-phonetic recognizer, the linguistic processor, and the dialog manager.

In Czech, with its free word order, intonation carries critical information for the recognition and understanding system. For some sentences the intonation is essential for determining the core of the communication, because the speaker uses it to emphasize the meaning of the sentence. The design of the module for suprasegmental processing is based on partitioning the speech into sentences. In such a system the prosodic attributes are determined by the acoustic-phonetic module. The time course of the voice energy and of the fundamental frequency is monitored over the duration of a single sentence. The length of each pause is determined, as are the flags indicating word finality and lexical word accent. This information is then used to associate the sentence with a certain type. The attributes determined by this procedure are used as the second input to the linguistic module.
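The attributes listed above can be pictured as one record per word, grouped by sentence. The following is only a minimal sketch of such a record; the class and field names (WordProsody, SentenceProsody, f0_hz and so on) are hypothetical and are not taken from the implemented C module.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WordProsody:
    """Prosodic attributes of one word (field names are illustrative)."""
    word: str
    f0_hz: List[float]        # fundamental frequency samples over the word
    energy: List[float]       # voice energy samples over the word
    pause_before_ms: float    # length of the pause before the word
    pause_after_ms: float     # length of the pause after the word
    is_final: bool            # flag: word finality
    has_lexical_accent: bool  # flag: lexical word accent


@dataclass
class SentenceProsody:
    """All words of one sentence, the unit the prosody module works on."""
    words: List[WordProsody] = field(default_factory=list)

    def f0_contour(self) -> List[float]:
        # Time course of F0 over the whole sentence.
        return [v for w in self.words for v in w.f0_hz]

    def energy_contour(self) -> List[float]:
        # Time course of the voice energy over the whole sentence.
        return [v for w in self.words for v in w.energy]
```

Concatenating the per-word samples gives the sentence-level contours that the later processing steps operate on.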




2. PROSODIC PROPERTIES

The pitch movement in a spoken utterance that is not related to differences in the meaning of the words is called intonation. Intonation often conveys information of a broad semantic nature. For example, the falling pitch we hear at the end of a Czech statement such as "Vlak uz odjel." (The train has already gone.) indicates that the utterance is complete. In contrast, the Czech question "Vlak uz odjel?" (Has the train already gone?) has rising intonation. For this reason, falling intonation at the end of an utterance is called a terminal intonation contour. Rising or level intonation, on the other hand, often indicates incompleteness (Palkova 1994). However, Czech sentences that contain question words such as kdy, co, kdo, jak (when, what, who, how) usually do not have rising intonation. It is likely that the question words themselves suffice to indicate that an answer is expected. The fact that rising or level intonation is correlated with incompleteness and falling intonation with completeness allows other uses of intonation. One of them is to help resolve the interpretation of potentially ambiguous utterances (Noth 1993).

Prosody is a very complex subject. Besides intonation, the hierarchy of pauses is very important. Pauses of a uniform standard length at the punctuation marks between syntactic units sound bizarre in spontaneous speech. After several experiments, a three-tier pause hierarchy seems acceptable for Czech.

Pause   Duration of pause [ms] for speech rate   Classification of punctuation marks
P1      8 - 10                                   {,}
P2      80 - 100                                 {- : }
P3      200 - 240                                {; . ? !}

Table 1. Three-tier pause hierarchy

Examples:

...(P3) Prosim Vas, (P1) muzete mi rict, (P2) kdy jede vlak do Prahy? (P3)...

(Please, can you tell me when the train to Prague leaves?)

...(P3) Jak vidite, (P1) nezbyva nam mnoho casu. (P3)...

(As you see, there’s not much time left.)

...(P3) Uz se prilis nezdrzuj, (P1) Vaclave, (P2) a pokus se vlak dobehnout. (P3)...

(Don’t dawdle any longer, Vaclav, and try to catch the train.)
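The three-tier hierarchy of Table 1 can be expressed as a simple lookup. The sketch below is an illustration only: the names PAUSE_TIERS, PUNCT_TO_TIER, classify_pause and annotate_pauses are hypothetical, and a plain punctuation lookup is only a first approximation, since the examples above sometimes assign P2 to a comma.

```python
# Duration ranges in milliseconds, copied from Table 1.
PAUSE_TIERS = {"P1": (8, 10), "P2": (80, 100), "P3": (200, 240)}

# Punctuation classes, copied from Table 1.
PUNCT_TO_TIER = {
    ",": "P1",
    "-": "P2", ":": "P2",
    ";": "P3", ".": "P3", "?": "P3", "!": "P3",
}


def classify_pause(duration_ms):
    """Return the tier whose range contains the duration, or None."""
    for tier, (lo, hi) in PAUSE_TIERS.items():
        if lo <= duration_ms <= hi:
            return tier
    return None


def annotate_pauses(text):
    """Append a (Pn) marker after every punctuation mark listed in Table 1."""
    out = []
    for ch in text:
        out.append(ch)
        tier = PUNCT_TO_TIER.get(ch)
        if tier is not None:
            out.append(f" ({tier})")
    return "".join(out)
```

Durations that fall between the tabulated ranges are deliberately left unclassified, because the table does not say how they should be assigned.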

A finer distinction of pauses would require taking into account the semantic relations between the units in the dialog.
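Returning to the terminal intonation contour described at the start of this section, the distinction between falling, rising and level sentence endings can be illustrated by fitting a straight line to the last F0 values of the utterance. This is a minimal sketch under stated assumptions, not the method of the prosody module: the 10 ms frame step, the 30-frame tail and the 20 Hz/s threshold are illustrative values, and unvoiced frames are assumed to be marked with F0 = 0.

```python
def f0_slope(f0_hz, frame_s=0.01):
    """Least-squares slope in Hz/s over the voiced frames (F0 > 0)."""
    pts = [(i * frame_s, f) for i, f in enumerate(f0_hz) if f > 0]
    if len(pts) < 2:
        return 0.0
    n = len(pts)
    mean_t = sum(t for t, _ in pts) / n
    mean_f = sum(f for _, f in pts) / n
    num = sum((t - mean_t) * (f - mean_f) for t, f in pts)
    den = sum((t - mean_t) ** 2 for t, _ in pts)
    return num / den


def final_contour(f0_hz, tail_frames=30, threshold_hz_per_s=20.0):
    """Label the end of an utterance as 'falling', 'rising' or 'level'."""
    slope = f0_slope(f0_hz[-tail_frames:])
    if slope > threshold_hz_per_s:
        return "rising"
    if slope < -threshold_hz_per_s:
        return "falling"
    return "level"
```

A falling label would correspond to the terminal contour of a statement such as "Vlak uz odjel.", while a rising label would correspond to the question "Vlak uz odjel?". Questions with question words would not be separated by this test alone, since they usually lack rising intonation.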



3. PROSODY MODULE

3.1 Structure of Prosody Module

The design of the prosody module is based on partitioning the speech into sentences. The sentences are processed using the following method:

The whole sentence is represented by 2n features. The quality of the pattern recognition depends on the choice of the type and the number of features. To classify sentences according to prosody, the prosodic characteristics must be computed and then treated as features. The features must, however, be normalized and their number reduced to simplify the subsequent recognition. Because of the properties of the neural network used for recognition, the number of features must be the same for every sentence. We propose 40 as the optimal number of features (n = 20, that is, 20 energy features and 20 frequency features). In our experience, a larger number does not improve the recognition. The attributes determined by this procedure form another input to the linguistic module, see Fig. 3.1. The structure of the modules for prosodic information processing is shown in Fig. 3.2.
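One way to obtain a fixed number of 2n normalized features from contours of arbitrary length is sketched below. The paper does not specify how the reduction and normalization are carried out, so linear resampling and min-max scaling are assumptions chosen for illustration, and the function names are hypothetical.

```python
def resample(contour, n):
    """Reduce or stretch a contour to n points by linear interpolation."""
    if n < 2:
        raise ValueError("n must be at least 2")
    if not contour:
        return [0.0] * n
    if len(contour) == 1:
        return [float(contour[0])] * n
    out = []
    last = len(contour) - 1
    for k in range(n):
        pos = k * last / (n - 1)   # fractional index into the contour
        i = int(pos)
        frac = pos - i
        j = min(i + 1, last)
        out.append(contour[i] * (1.0 - frac) + contour[j] * frac)
    return out


def normalize(values):
    """Min-max scale to [0, 1]; a flat contour maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def sentence_features(energy, f0, n=20):
    """Return 2n features: n energy values followed by n F0 values."""
    return normalize(resample(energy, n)) + normalize(resample(f0, n))
```

With the default n = 20 every sentence yields a vector of 40 values, regardless of its duration, which is what a network with a fixed input layer requires.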

3.2 Comparison of Other Approaches in Recognition by Neural Nets

A multilayer neural network with the back-propagation learning algorithm is used. The activation function in the hidden layers and in the output layer is a sigmoid. A neural network model based on self-organization is also capable of performing the classification and of representing the prosodic properties. In this approach the Kohonen clustering network is used as a semantic map. The algorithm passes through two stages, self-organization and testing. The model has the flexibility to accept linguistic input and can provide output decisions in terms of membership values. The input vector incorporates a partial class membership during self-organization. An index of disorder was used as a measure of the ordering of the output space and controlled the number of sweeps required in the process. Unlike Kohonen's conventional model, the proposed net was capable of producing a fuzzy partitioning of the output space and could thereby provide a more faithful representation of ill-defined or fuzzy data with overlapping classes. Incorporating fuzziness into the input and output of the proposed model was seen to give better performance than both the original Kohonen model and the hard version (Pal 1977).
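The forward pass of a multilayer network with sigmoid activations in the hidden and output layers can be sketched as follows. This is an illustration only: the hidden layer size of 10, the four output classes and the random initial weights are assumptions, and the back-propagation training step is omitted.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (untrained placeholder values)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def layer_output(inputs, weights, biases):
    """One fully connected layer followed by the sigmoid activation."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(features, layers):
    """Propagate a feature vector through all layers in order."""
    activation = features
    for weights, biases in layers:
        activation = layer_output(activation, weights, biases)
    return activation


# Hypothetical topology: 40 prosodic features, 10 hidden units, 4 sentence types.
rng = random.Random(0)
network = [init_layer(40, 10, rng), init_layer(10, 4, rng)]
```

Calling forward(features, network) on a 40-element feature vector returns four values between 0 and 1, one per output unit; after training, the largest of them would indicate the recognized sentence type.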



4. EXPERIMENTAL RESULTS

The experiments reported in this paper were performed on a subset of Czech sentences. The sentences were generated using the ERBA templates (Kleckova 1998). The results are summarized in Tab. 2 and Tab. 3; from them we can state that:

  1. we are able to detect the types of the sentences.
  2. a set of 80 features (40 energy features and 40 frequency features) has also been tried, but the results are unconvincing.
  3. a greater number of features than proposed above does not improve the recognition.

 

In the near future we shall use a neural network with different sets of prosodic features, such as the duration of words, syllables and syllable nuclei.

Type of the sentence     Corpus of sentences   Correct - Number   Correct - Percentage   Incorrect - Number   Incorrect - Percentage
announcement             100                   71                 71                     29                   29
question (query)         100                   89                 89                     11                   11
order                    50                    40                 80                     10                   20
investigation question   100                   92                 92                     8                    8
TOTAL                    350                   292                83                     58                   17

Table 2. Assignment of sentences with respect to the speaker’s standpoint
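The TOTAL row of Table 2 follows directly from the per-type counts, as the short check below shows. The dictionary simply restates the corpus sizes and correct counts from the table; the names TABLE_2 and summarize are hypothetical.

```python
# (corpus size, correctly assigned) per sentence type, copied from Table 2.
TABLE_2 = {
    "announcement": (100, 71),
    "question (query)": (100, 89),
    "order": (50, 40),
    "investigation question": (100, 92),
}


def summarize(table):
    """Return (total sentences, total correct, percentage correct)."""
    total = sum(size for size, _ in table.values())
    correct = sum(ok for _, ok in table.values())
    return total, correct, 100.0 * correct / total


total, correct, percent = summarize(TABLE_2)
```

The sums give 350 sentences, of which 292 are correct and 58 incorrect, so the overall rate is 292/350, which rounds to the 83 % reported in the table.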

 

Type of the sentence                                   Corpus of sentences   Correct - Number   Correct - Percentage   Incorrect - Number   Incorrect - Percentage
Complex sentence - all clauses are significant         100                   82                 82                     18                   18
Complex sentence - first clause is insignificant       100                   75                 75                     15                   15
Single sentence                                        100                   79                 79                     11                   11
TOTAL                                                  300                   256                78.6                   44                   21.4

Table 3. Differentiating a complex sentence from a single sentence

 

5. CONCLUSION

The subsystem uses prosodic information in the form of the fundamental voice frequency and energy. The prosodic information, processed by a specially arranged neural network, forms an important attribute of an n-element array that is taken as an image of the analyzed sentence and passed to the linguistic analyzer for further processing (Kleckova 1997). The system described above operates over a flexible database built from components of the ORACLE database system and the SQL language. Some other alternatives considered for further development of the system have been tested experimentally.
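Storing the prosodic attributes in a relational table and querying them with SQL can be illustrated as follows. This is a sketch only: it uses SQLite from the Python standard library as a stand-in for ORACLE, and the table name word_prosody, its columns and the helper functions are hypothetical, not the schema of the described system.

```python
import sqlite3


def open_db(path=":memory:"):
    """Create the (hypothetical) table of per-word prosodic attributes."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS word_prosody (
               sentence_id     INTEGER,
               position        INTEGER,
               word            TEXT,
               mean_f0_hz      REAL,
               mean_energy     REAL,
               pause_before_ms REAL,
               pause_after_ms  REAL,
               is_final        INTEGER,
               has_accent      INTEGER,
               PRIMARY KEY (sentence_id, position)
           )"""
    )
    return conn


def store_word(conn, row):
    """Insert one word record; row must have nine values in column order."""
    conn.execute("INSERT INTO word_prosody VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", row)


def mean_pause_after(conn, final_only=True):
    """Simple statistic: average pause after final (or non-final) words."""
    cur = conn.execute(
        "SELECT AVG(pause_after_ms) FROM word_prosody WHERE is_final = ?",
        (1 if final_only else 0,),
    )
    return cur.fetchone()[0]
```

Aggregates of this kind are one example of the statistical processing that could make the stored characteristics usable when generating answers.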

 

 

REFERENCES

Kleckova, J., Krutisova, J., Matousek, V. and Ocelikova, J. (1995) "An Automatic Creation of the Language Model for the Spontaneous Czech Speech Recognizer." In: Proceedings of the European Conference EUROSPEECH’95, Madrid, September 1995, 1185-1195.

Kleckova, J. (1997) "Creation of the Language Model for the Spontaneous Czech Speech Recognizer: Sentence Formulas." In: Proceedings of the 4th International Workshop on Systems, Signals and Image Processing, Poznan, May 1997, 93-96.

Kleckova, J. and Matousek, V. (1998) "Detection of Sentence Types by the Integrated Prosody Module." In: Proceedings TSD’98, Brno, September 1998, 235-240.

Noth, E. (1993) "Prosodische Information in der automatischen Spracherkennung Berechnung und Anwendung." Dissertation, Erlangen 1990.

Palkova, Z. (1994) "Fonetika a fonologie cestiny." Univerzita Karlova, Praha, 1994.

Pal, S.K. and Dutta Majumder, D. (1977) "Fuzzy sets and decision making approaches in vowel and speaker recognition." In: IEEE Trans. Syst., Man, Cybern., vol. 7, pp. 625-629, 1977.

SNNS Stuttgart Neural Network Simulator (1995) User Manual, Version 4.1, June 1995.