
I’m working on a modified speech-to-text feature that should take in a user’s speech and convert it to text, but I want the output text to be exactly what the user is saying. This means I want to detect word disfluencies such as stammers like “sstttop” and “pppplease”. I’ve already written a Java program that does the speech-to-text, but I need to know if it’s possible to modify it to detect speech disfluency. Any input and help would be much appreciated.

3 Answers


  1. I think it’s better to improve the structure of the text produced from the stammered speech, rather than in the speech itself.

  2. My first guess would be that you would have to analyze the time the user spends producing each specific sound. For example, a single ‘s’ could be the ‘s’ sound held for half a second, whereas two ‘s’s could be represented by the user producing the sound for a full second. I understand this is not completely accurate, but it’s the best guess I can think of.

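    A hedged sketch of that timing heuristic in Java, assuming your recognizer exposes per-word start/end timestamps (the `Word` record and the 150 ms-per-letter threshold below are illustrative assumptions, not part of any real speech API):

    ```java
    import java.util.ArrayList;
    import java.util.List;

    /** Flags words whose spoken duration is unusually long for their length,
     *  which can indicate a prolonged sound such as "ssstop". */
    public class ProlongationDetector {

        /** A recognized word with timestamps in milliseconds
         *  (illustrative type, not from any particular speech API). */
        public record Word(String text, long startMs, long endMs) {}

        // Rough assumption: normal speech averages well under ~150 ms per
        // letter; far above that suggests a stretched sound. Tune to your data.
        private static final double MAX_MS_PER_CHAR = 150.0;

        public static List<Word> findProlonged(List<Word> words) {
            List<Word> flagged = new ArrayList<>();
            for (Word w : words) {
                double msPerChar =
                    (double) (w.endMs() - w.startMs()) / w.text().length();
                if (msPerChar > MAX_MS_PER_CHAR) {
                    flagged.add(w);
                }
            }
            return flagged;
        }

        public static void main(String[] args) {
            List<Word> words = List.of(
                new Word("please", 0, 600),   // ~100 ms per letter: normal pace
                new Word("stop", 700, 2300)   // ~400 ms per letter: likely "ssstop"
            );
            System.out.println(findProlonged(words)); // flags only "stop"
        }
    }
    ```

    A flagged word tells you only *that* a sound was prolonged, not which letter to repeat in the output, so this would need to be combined with lower-level phoneme timing to reconstruct spellings like “sstttop”.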
  3. As someone who has used a lot of speech-to-text APIs: what you are looking for is a little difficult to achieve. However, there is a feature that may help you. Depending on the provider you are using, look for custom vocabulary; it allows you to specify words, each with a boosting value, that the service should keep in mind when transcribing the audio.

    As for disfluencies, I believe the behavior is closely tied to the provider you are using. Some of them, like AssemblyAI, will completely remove filler words except for certain values; others, like Microsoft, will give you multiple transcriptions, some with word disfluencies and others without. Please take a look at this link: microsoft Disfluency removal

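    If the provider does return several candidate transcriptions, one simple post-processing idea is to prefer the hypothesis that retains the most disfluency-like tokens. A minimal sketch, assuming the candidates are already available as plain strings; the repeated-letter heuristic is my own assumption, not part of any vendor API:

    ```java
    import java.util.Arrays;
    import java.util.Comparator;
    import java.util.List;

    /** Picks, from N-best transcription candidates, the one retaining the
     *  most disfluency-like tokens (e.g. "sstttop", "pppplease"). */
    public class DisfluencyPicker {

        /** Heuristic: three or more identical letters in a row rarely occur
         *  in normal English words, so such a token is likely a stammer. */
        static boolean looksDisfluent(String token) {
            return token.toLowerCase().matches(".*([a-z])\\1\\1.*");
        }

        /** Counts disfluency-like tokens in one candidate transcription. */
        static long disfluencyCount(String transcription) {
            return Arrays.stream(transcription.split("\\s+"))
                         .filter(DisfluencyPicker::looksDisfluent)
                         .count();
        }

        /** Returns the candidate that keeps the most disfluencies. */
        public static String pick(List<String> candidates) {
            return candidates.stream()
                             .max(Comparator.comparingLong(DisfluencyPicker::disfluencyCount))
                             .orElseThrow();
        }

        public static void main(String[] args) {
            List<String> nBest = List.of(
                "please stop",
                "pppplease sstttop"
            );
            System.out.println(pick(nBest)); // prints "pppplease sstttop"
        }
    }
    ```

    This keeps the selection logic entirely on the client side, so it works with any provider that exposes N-best hypotheses.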