Graz Language Database Improves Automatic Speech Recognition of Austrian German

12/12/2024 | TU Graz news | Research

By Falko Schoklitsch

With the “Graz corpus of read and spontaneous speech”, researchers at TU Graz have developed new methods for speech recognition of Austrian German using speech data from 38 people.

Second-language speakers who come to Austria with a good knowledge of German usually find it difficult to understand the local dialects. Similarly, speech recognition systems often fail to decode regional word choice and accented pronunciation. Barbara Schuppler from the Signal Processing and Speech Communication Laboratory at Graz University of Technology (TU Graz), together with researchers from the Know Center and the University of Graz, has investigated the complexity of conversational speech, built a database of conversations in Austrian German and gained new insights into how to improve speech recognition. The results were recently published in the paper “What’s so complex about conversational speech?” in the journal Computer Speech & Language. The project was funded by the Austrian Science Fund FWF.

Free-flowing conversations in the recording studio

One of the main aims of the project was to improve the accuracy of automatic speech recognition (ASR) systems in spontaneous conversations with speakers from Austria. The team focused on the challenges posed by spontaneity, short sentences, overlapping speakers and dialectal accent in everyday conversations. To obtain suitable data, the researchers set up the GRASS database (Graz corpus of read and spontaneous speech). It contains recordings of 38 speakers, covering both read texts and spontaneous conversations in which two people who knew each other well spoke freely for an hour in the recording studio without being given a topic. Since the same speakers were recorded in both speaking styles, the research team was able to rule out speaker identity and recording quality as explanations for differences in ASR performance.
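The benefit of recording the same speakers in both styles can be illustrated with a small sketch. The article does not name the evaluation metric or the analysis procedure, so the following assumes word error rate (WER), the standard ASR metric, and a simple per-speaker paired difference. The function names and the data layout are hypothetical and are not taken from the project.

```python
from statistics import mean


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far (minus the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def paired_wer_gap(samples: dict) -> float:
    """Mean of WER(spontaneous) - WER(read), computed per speaker.

    `samples` maps a speaker ID to a dict with the keys "read" and
    "spontaneous", each holding a (reference, hypothesis) pair.
    This layout is an assumption made for illustration only.
    """
    gaps = []
    for styles in samples.values():
        read_wer = word_error_rate(*styles["read"])
        spontaneous_wer = word_error_rate(*styles["spontaneous"])
        gaps.append(spontaneous_wer - read_wer)
    return mean(gaps)
```

Because each difference is taken within one speaker, voice characteristics and studio conditions are the same on both sides of the subtraction, and the remaining gap can be attributed to speaking style.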

Based on the database, the team compared various ASR architectures, including long-established hidden Markov models (HMMs) and the relatively new transformer-based models. The comparison showed that transformer-based models, such as the Whisper speech recognition system, work very well for longer sentences with a lot of context, but have problems with the short, fragmentary sentences that frequently occur in conversations. Traditional HMM-based systems that were explicitly trained with pronunciation variations proved to be more robust for short sentences and dialectal language. The researchers therefore want to pursue a hybrid approach that combines the strengths of both architectures. They have already combined a transformer model with a knowledge-based lexicon and a statistical language model, thereby achieving significant improvements.
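The article does not describe how the transformer model and the statistical language model were combined. One common way to make such a combination is n-best rescoring, in which each candidate transcript from the recogniser receives an additional language model score. The sketch below illustrates that general idea with an add-one smoothed bigram model. It is not the researchers' method, and `BigramLM`, `rescore` and `lm_weight` are hypothetical names chosen for this example.

```python
import math
from collections import Counter


class BigramLM:
    """Add-one smoothed bigram language model over whitespace tokens."""

    def __init__(self, sentences):
        self.history_counts = Counter()
        self.bigram_counts = Counter()
        vocab = set()
        for sentence in sentences:
            tokens = ["<s>"] + sentence.split() + ["</s>"]
            vocab.update(tokens)
            # Every token except the last one serves as a bigram history.
            self.history_counts.update(tokens[:-1])
            self.bigram_counts.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(vocab)

    def log_prob(self, sentence: str) -> float:
        """Natural-log probability of a sentence under the bigram model."""
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        total = 0.0
        for history, word in zip(tokens, tokens[1:]):
            numerator = self.bigram_counts[(history, word)] + 1
            denominator = self.history_counts[history] + self.vocab_size
            total += math.log(numerator / denominator)
        return total


def rescore(nbest, lm: BigramLM, lm_weight: float) -> str:
    """Return the hypothesis text with the best combined score.

    `nbest` is a list of (text, asr_log_score) pairs; the combined score
    is asr_log_score + lm_weight * lm.log_prob(text).
    """
    best_text, _ = max(
        nbest,
        key=lambda item: item[1] + lm_weight * lm.log_prob(item[0]),
    )
    return best_text
```

With a language model trained on in-domain conversational text, a hypothesis that the recogniser slightly prefers but that contains an unlikely word sequence can be overruled by a more plausible alternative. The weight `lm_weight` would normally be tuned on held-out data.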

Possible use in medical diagnostics

The team also analysed how characteristics such as speech rate, intonation and word choice influence the accuracy of speech recognition. These findings can contribute to the development of ASR systems that better understand human speech in all its nuances. The team plans to continue research in these areas and incorporate the findings into the development of new, more robust speech recognition systems. However, the results of the project also have interesting potential applications beyond this, particularly in the fields of medical diagnostics and human-computer interaction. In the future, ASR systems could be used to recognise dementia or epilepsy based on speech patterns in spontaneous conversations or to make interaction with social robots more natural.

“Spontaneous speech, especially in dialogue, has completely different characteristics compared to recited or read speech,” says Barbara Schuppler. “By analysing human-human communication in particular, we have gained important insights in our project that also help us technically and open up new areas of application. Together with partners from the PMU Salzburg, Med Uni Graz and Med Uni Vienna, we are already working on follow-up projects to create socially relevant applications based on the foundations we have created in the Austrian Science Fund project.”

Information

Publication:
What’s so complex about conversational speech? A comparison of HMM-based and transformer-based ASR architectures
Authors: Julian Linke, Bernhard C. Geiger, Gernot Kubin, Barbara Schuppler
In: Computer Speech & Language, Volume 90, March 2025
DOI: https://doi.org/10.1016/j.csl.2024.101738

Contact

Barbara SCHUPPLER
Ass.Prof. Mag.rer.nat. Dr.
TU Graz | Signal Processing and Speech Communication Laboratory
Phone: +43 316 873 4366
b.schupplernoSpam@tugraz.at