Hey Siri, what’s the weather going to be like? More and more people are talking to their phones. But speech recognition doesn’t work for everyone. Research by the Multimedia Computing Group is now cautiously starting to change that.
Matthijs Valkering was studying to become a doctor and was doing his residency when he had an accident. He ended up in a wheelchair and has difficulty speaking due to nerve damage. An automatic speech recognizer developed especially for him by Yuanyuan Zhang and the Delft Inclusive Speech Communication Lab is designed to support him in his teaching. (Photo: Edda Heinsman)
It’s a rainy, windy autumn morning, nine o’clock, yet the lecture hall is full for the question and answer session on descriptive statistics and probability theory at the Faculty of Health Sciences (VU). At the end of the lecture, Matthijs Valkering takes the floor. He conjures up a Mentimeter on the screen, a kind of interactive quiz. The room is silent. Do the students have any questions? ‘No’ appears in large letters on the screen. It’s good that the question is displayed, because Valkering is difficult to understand. He has dysarthria, a speech disorder.
More than 400,000 people in the Netherlands have dysarthria. This speech disorder can occur after damage to the nervous system or, for example, in Parkinson’s disease. “People with dysarthria speak more slowly and softly than average and articulate less clearly. This makes their speech much more difficult to understand,” explains Odette Scharenborg. Since this summer, she has held the Delft chair in inclusive speech communication (Multimedia Computing Group). “If someone with dysarthria tries automatic speech recognition, forget it. You can get error rates of up to 300 percent.”
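An error rate above 100 percent sounds impossible, but automatic speech recognition is usually scored by word error rate (WER), which also counts words the recognizer wrongly inserts, so the metric has no upper bound. A minimal Python sketch, with made-up sentences for illustration, shows how insertions push WER past 100 percent:

```python
# Minimal word error rate (WER) sketch:
# WER = (substitutions + deletions + insertions) / number of reference words.
# Because inserted words are counted too, WER can exceed 100 percent.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i              # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j              # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub,                # substitution (or match)
                          d[i - 1][j] + 1,    # deletion
                          d[i][j - 1] + 1)    # insertion
    return d[len(ref)][len(hyp)] / len(ref)

# A recognizer that misreads two words and hallucinates four extra ones:
print(wer("turn on the radio", "turning on some radio oh dear yes no"))
# -> 1.5, i.e. a word error rate of 150 percent
```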
‘Systems such as Azure or Siri recognize people with an Achterhoek accent much less well than those with a Randstad accent’
And Scharenborg is going to do something about that. “We want to develop speech technology for everyone, regardless of how you speak, who you are, or what language you speak.”
Metropolitan accent
Filling in your calendar, turning on the radio, or dictating a WhatsApp message with your voice: handy! More and more people and companies are happily using automatic speech-to-text technology. For people who have difficulty typing due to a disability, it is a godsend. But, as Scharenborg explains, automatic speech recognition software does not currently work well for many people. “I myself come from the Achterhoek region. Systems such as Azure or Siri recognize people with an Achterhoek accent much less well than those with a Randstad accent. If you come from Limburg, you really have a problem.”
The same goes for children, and even skin color plays a role, Scharenborg notes, citing a study from the US. Automatic speech recognition models require a lot of speech data to train. “Our Dutch models have been trained on data from the Spoken Dutch Corpus, a collection of 900 hours of speech fragments: interviews, telephone conversations, television programs, all free to use. The speakers are standard adult native speakers with no deviations in accent or speech production, preferably younger than sixty.”
Although the corpus has since been extended with Jasmin, which adds speech from children, the elderly, and non-native speakers, recognition remains problematic.
“Even if you train a model specifically on this data, it remains less effective. The variability in non-standard speaker groups is much greater. My hypothesis is that current models cannot cope well with this greater variability.”
Speech recognizer better than humans
Current models are good at understanding ‘average’ or ‘standard’ speech, but humans are still better at it. With speech disorders, however, the situation is different. Scharenborg: “We now have data that shows the opposite for the first time: that the automatic speech recognizer is better than humans.”
And that’s where Matthijs Valkering comes in. Despite his speech disorder, the teaching assistant at VU University Amsterdam wants to focus on teaching. “I use Abilia Lightwriter and Google Translate a lot, which are nice to work with. But there are no automatic speech models that understand me well.” He contacted Odette Scharenborg: “I wanted to see if she had a way to give me more support for teaching with current technology.” Valkering’s request in 2023 came at just the right time. Scharenborg: “We wanted to start working with inclusive speech models, but we didn’t have any data.” TU Delft PhD student Yuanyuan Zhang started creating a model tailored precisely to Valkering’s voice.
Practice, practice, practice
How do you teach a model to understand someone with dysarthria? “Practice, practice, practice,” says Valkering. The recording sessions made a big impression on Zhang. “He would come all the way from Amsterdam by train, riding his wheelchair through the rain. Matthijs recorded hours of speech, both in Dutch and English, with a view to his academic career. We recorded spontaneous speech and speech related to Matthijs’ work. I was impressed by how dedicated he was, so patient and enthusiastic. That makes you want to work extra hard, and you can get frustrated if the research isn’t progressing fast enough.”
Valkering also remembers the practice sessions in the soundproof studio as intensive. “Talking is very tiring. I speak slowly, and you can see that…” Valkering moves his head to the side and swallows. “I take occasional breaks to swallow.” Days of recording ultimately yielded more than eight hours of usable material. Zhang was satisfied and got to work with the new dataset: DysOne. It is the first dataset with video and audio in English and Dutch, created in direct collaboration with someone with dysarthria.
Inspiring teacher
Back at VU University Amsterdam, in the lecture hall, students Izdihar Elorufi and Elaha Haqpal approach Matthijs Valkering after class with a question. He explains calmly; the students listen attentively. What is it like to be taught by Valkering? Elorufi: “It’s different from normal. But he tries very hard and is very active on Canvas (a kind of Brightspace, where you can see your homework and watch videos about the course – ed.). He answers questions well and has just helped us a lot.” Fellow student Haqpal agrees: “During the first lecture, he shared his story with us, which was very inspiring. Despite certain physical limitations, he continues to pursue his passion. I think that’s wonderful.”
“That’s how it always goes,” Valkering says, referring to his explanation to the two students. He prefers explaining and teaching in small groups; for now, he finds it difficult to stand in front of a larger audience. “Ultimately, I want to integrate speech models into education. For example, I want to be automatically subtitled while I’m lecturing, so that people who don’t understand me well can still follow me.” Zhang looks further ahead: “A text-to-speech or speech-to-speech system with voice conversion, so that you hear your own – unslurred – voice, would be fantastic. That’s an idea for the future.”

Promising speech model
And now there is a model specially trained on Valkering’s voice. “The model now also runs locally on the computer, so nothing goes through the cloud,” says Zhang. “This makes it more practical and cheaper. The fact that the data is stored locally also helps protect privacy.” It will be a while before the speech recognition software is finished and Valkering can actually start using it, but the initial results are promising. Zhang: “Last month, we conducted tests during the Speech Science festival in Ahoy, Rotterdam. The audience scored 35.5 percent on average, meaning they got 35.5 out of every 100 words right. Our system, on the other hand, is already at 86.4 percent correct. Although that accuracy is still lower than what commercial systems achieve on standard speech, it can offer real benefits for people who have to understand dysarthric speech. These test results were obtained with the data we collected with Matthijs; we plan to conduct further tests with his current speech, provided he agrees.”
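The article does not name the model behind this, so purely as an illustration: a local, offline speech-to-text setup of the kind Zhang describes could look like the sketch below, which assumes an open-source Whisper-style model from Hugging Face. The model name and audio file are placeholders, not the lab’s actual system trained on DysOne.

```python
# Hypothetical sketch of local, offline speech-to-text.
# Assumes the Hugging Face transformers library and ffmpeg are installed.
from transformers import pipeline

# The model weights are downloaded once; inference then runs entirely
# on the user's own machine, so no recordings leave the computer.
# "openai/whisper-small" is an illustrative choice, not the model
# actually fine-tuned on Valkering's recordings.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# "lecture_fragment.wav" is a placeholder file name.
result = asr("lecture_fragment.wav", return_timestamps=True)
print(result["text"])
```

Because nothing is sent to a cloud service, there are no per-request costs and the audio stays on the device, which is the practicality and privacy benefit Zhang points to.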
‘I expect that within a few years, speech recognition will work for everyone, including people with speech disorders’
Valkering cannot yet use the model in practice in his lessons. But he is already quite satisfied: “I want to contribute to the development of speech models for people with dysarthria.” And there is potentially a great need for this. Zhang: “I had great difficulty understanding my own grandfather at the end of his life. I didn’t know what it was at the time, but now I recognize it; it was probably dysarthria too.” The dataset of Valkering’s voice will be made available for other scientific research. Valkering is hopeful: “I expect that within a few years, speech recognition will work for everyone, including people with speech disorders.”
The experiment is already bearing fruit for Valkering personally: “It’s a win-win situation for me. Not only does the dataset help others, but my own speech has improved dramatically thanks to all the talking I do, because I keep challenging myself.” Valkering used to carry a speech computer, a device that converts typed words into spoken language, with him at all times, but now he increasingly leaves it at home. “It’s only useful in the pub, where there’s often a lot of noise, making me even harder to understand.”
Do you have a question or comment about this article?
E.Heinsman@tudelft.nl
