I’m using Azure Speech to Text to find timestamps of utterances in a WAV file.
The problem I’m encountering is that if the user has recorded numbers, for instance "I’m going to count to three. One, two, three, here I come", the numbers are omitted from the output. This happens both for English and for other languages.
I can understand utterances like ‘eh’ and ‘ah’ being omitted, but numbers? Why is that the default?
I’m using:
- speechConfig.OutputFormat = OutputFormat.Detailed;
- the default language model.
Can I somehow configure the SpeechRecognizer differently so it also outputs numbers?
2 Answers
I found the reason my results were missing numbers: it was in my own code. In my post-processing I was stripping punctuation marks from the result, and in doing so I was also accidentally removing the numbers.
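For anyone hitting the same thing, here is a minimal sketch of a punctuation-stripping step that leaves digits alone. The original post-processing code was not shown, so the class and method names below (`TranscriptCleaner`, `StripPunctuation`) are hypothetical and only illustrate the idea: filter on character class rather than on a letters-only pattern, which would also drop digits.

```csharp
using System;
using System.Linq;

// Hypothetical helper (not from the original post): removes punctuation
// from a recognized transcript while preserving letters AND digits.
public static class TranscriptCleaner
{
    public static string StripPunctuation(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return string.Empty;
        }

        // Keep letters, digits, whitespace, and apostrophes (straight and
        // typographic) so that "I'm" survives. Everything else is dropped.
        var kept = text.Where(c =>
            char.IsLetterOrDigit(c) ||
            char.IsWhiteSpace(c) ||
            c == '\'' ||
            c == '\u2019');

        return new string(kept.ToArray());
    }
}
```

With this sketch, `TranscriptCleaner.StripPunctuation("I'm going to count to 3. 1, 2, 3, here I come!")` returns `I'm going to count to 3 1 2 3 here I come`. A letters-only filter such as the regex `[^a-zA-Z\s]` would have removed the digits as well, which is one way the symptom in the question can arise.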
.wav audio file to text without the loss of data.
But apparently there is a bug in the conversion model: if there is a pause between "I'm going to count to three." and "One, two, three, here I come", the model will omit the "One, two, three, here I come" sentence from the audio file.

Also, I couldn’t find anything in this MSDOC on the audio config class to configure the audio settings regarding this issue.