I'm working on a modified speech-to-text feature that should take in a user's speech and convert it to text, but I want the output text to be exactly what the user says. This means I want to detect word disfluencies such as stammers like “sstttop” and “pppplease”. I've already written a Java program that does the speech-to-text conversion, but I need to know whether it's possible to modify it to detect speech disfluency. Any input and help would be much appreciated.
Question posted in Android Studio
The official documentation can be found here.
3 Answers
I think it's better to improve the structure of the text produced from stammered speech.
My first guess would be that you would have to analyze how long the user spends producing each specific sound. For example, a single 's' could correspond to the 's' sound held for half a second, whereas two 's's could correspond to the sound held for one second. I understand that this is not completely accurate, but it is the best guess I can think of.
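A rough sketch of that duration idea in Java, since that is what the question uses. It assumes you can already get per-sound timings from somewhere (many speech-to-text services only give word-level timestamps, so this input is an assumption). `StammerSketch`, `TimedSound`, and the 500 ms baseline are hypothetical names and values chosen to match the half-second example above, not part of any real API.

```java
import java.util.List;

// Hypothetical sketch: turns per-sound durations into repeated letters.
final class StammerSketch {

    /** One recognized sound and how long it was held, in milliseconds (assumed input). */
    static final class TimedSound {
        final String letter;
        final long durationMs;

        TimedSound(String letter, long durationMs) {
            this.letter = letter;
            this.durationMs = durationMs;
        }
    }

    // Assumed baseline: one letter per 500 ms, as in the half-second example.
    static final long BASELINE_MS = 500;

    /** Renders each sound once per baseline interval it was held, at least once. */
    static String render(List<TimedSound> sounds) {
        StringBuilder out = new StringBuilder();
        for (TimedSound s : sounds) {
            // Round to the nearest whole repeat, but never drop a sound entirely.
            long repeats = Math.max(1, Math.round((double) s.durationMs / BASELINE_MS));
            out.append(s.letter.repeat((int) repeats));
        }
        return out.toString();
    }
}
```

For example, feeding it 's' held for 1000 ms, 't' for 1500 ms, 'o' for 100 ms, and 'p' for 400 ms renders "sstttop". A fixed baseline is crude: normal sound length varies by speaker and by sound, so a per-user or per-sound baseline would likely work better.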
As someone who has used a lot of speech-to-text APIs, I'd say what you are looking for is a little difficult to achieve. However, there is a feature that may help you. Depending on the provider you are using, look for custom vocabulary: it allows you to specify some words to keep in mind when transcribing the audio, with a boosting value. As for disfluencies, I believe the behavior is closely tied to the provider you are using. Some of them, like AssemblyAI, will completely remove filler words except for some values; others, like Microsoft, will give you several transcriptions, some with word disfluencies and others without. Please take a look at this link: Microsoft Disfluency removal
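If your provider does return several candidate transcriptions, one possible approach is to score each one and keep the most verbatim. The sketch below works on plain strings only and calls no real provider API. `DisfluencyPicker`, the filler list, and the scoring rule are all hypothetical and would need tuning for your language and users.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical sketch: picks the candidate transcript that kept the most disfluencies.
final class DisfluencyPicker {

    // Assumed filler words; extend as needed.
    static final Set<String> FILLERS = Set.of("uh", "um", "er", "hmm");

    // A letter repeated three or more times in a row, as in "pppplease".
    static final Pattern STRETCHED = Pattern.compile("([a-z])\\1{2,}");

    /** Counts filler words, immediately repeated words, and stretched letters. */
    static int score(String transcript) {
        int score = 0;
        String previous = "";
        for (String word : transcript.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
            if (word.isEmpty()) continue;                 // leading punctuation yields ""
            if (FILLERS.contains(word)) score++;          // "um", "uh"
            if (word.equals(previous)) score++;           // "I I want"
            if (STRETCHED.matcher(word).find()) score++;  // "sstttop"
            previous = word;
        }
        return score;
    }

    /** Returns the highest-scoring candidate, or null if the list is empty. */
    static String mostVerbatim(List<String> candidates) {
        String best = null;
        int bestScore = -1;
        for (String candidate : candidates) {
            int s = score(candidate);
            if (s > bestScore) {
                best = candidate;
                bestScore = s;
            }
        }
        return best;
    }
}
```

This only helps when the service has kept the disfluencies in at least one candidate. If every candidate has been cleaned up, there is nothing left in the text to detect.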