I use OpenAI’s Whisper Python library for speech recognition. How can I get word-level timestamps?
To transcribe with OpenAI’s Whisper (tested on Ubuntu 20.04 x64 LTS with an Nvidia GeForce RTX 3090):
conda create -y --name whisperpy39 python==3.9
conda activate whisperpy39
pip install git+https://github.com/openai/whisper.git
sudo apt update && sudo apt install ffmpeg
whisper recording.wav
whisper recording.wav --model large
If using an Nvidia GeForce RTX 3090, add the following after conda activate whisperpy39:
pip install -f https://download.pytorch.org/whl/torch_stable.html
conda install pytorch==1.10.1 torchvision torchaudio cudatoolkit=11.0 -c pytorch
2 Answers
https://openai.com/blog/whisper/ only mentions "phrase-level timestamps"; I infer from this that word-level timestamps are not obtainable without adding more code.
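Those phrase-level (segment-level) timestamps are available directly from the Python API; a minimal sketch (the model size and file name are placeholders):

import whisper

model = whisper.load_model("base")
result = model.transcribe("recording.wav")

# Out of the box, Whisper only returns segment-level ("phrase-level") timestamps,
# one start/end pair per segment, nothing finer-grained.
for segment in result["segments"]:
    print(f'[{segment["start"]:.2f}s -> {segment["end"]:.2f}s]{segment["text"]}')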
From one of the Whisper authors:
https://github.com/jianfch/stable-ts (MIT License):
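Typical usage looks roughly like this (a sketch based on the stable-ts README; the export method and its word_level argument are from the 2.x API and may differ in other versions, so treat them as assumptions to check against the version you install):

import stable_whisper

model = stable_whisper.load_model("base")
result = model.transcribe("recording.wav")

# Export word-level timestamps, e.g. as an SRT file with one entry per word.
result.to_srt_vtt("recording.srt", word_level=True)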
Note that:
Another option is to use a word-level forced-alignment tool. For example, Lhotse (Apache-2.0 license) has integrated both Whisper ASR and wav2vec forced alignment:
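A rough sketch of that Lhotse pipeline follows; the workflow names (annotate_with_whisper, align_with_torchaudio) and manifest helpers are quoted from memory of Lhotse's workflows module and may differ between versions, so check them against the Lhotse documentation:

from lhotse import RecordingSet
from lhotse.workflows import annotate_with_whisper, align_with_torchaudio

# Transcribe every recording with Whisper, then force-align the transcript
# word by word with a wav2vec2 model via torchaudio.
recordings = RecordingSet.from_dir("audio/", pattern="*.wav")
cuts = annotate_with_whisper(recordings, language="en")
cuts_aligned = align_with_torchaudio(cuts)
for cut in cuts_aligned:
    for supervision in cut.supervisions:
        print(supervision.alignment)  # word-level items with start time and duration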
I created a repo to recover word-level timestamps (and confidence), and also more accurate segment timestamps:
https://github.com/Jeronymous/whisper-timestamped
It is built on the cross-attention weights of Whisper, as in this notebook in the Whisper repo. I tuned the approach a bit to get more accurate locations, and added the possibility of computing the cross-attention on the fly, so there is no need to run the Whisper model twice. There is also no memory issue when processing long audio.
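Typical usage, following the repo's README (the file name, model size, and language below are placeholders):

import json
import whisper_timestamped as whisper

audio = whisper.load_audio("recording.wav")
model = whisper.load_model("base", device="cpu")
result = whisper.transcribe(model, audio, language="en")

# Each segment in the result carries a "words" list with per-word
# "start", "end" and "confidence" fields.
print(json.dumps(result, indent=2, ensure_ascii=False))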
Note: I first tried the approach of using a wav2vec model to realign Whisper's transcribed words with the input audio. It works reasonably well, but it has several drawbacks: it requires handling a separate (wav2vec) model, running another inference over the full signal, having one wav2vec model per language, and normalizing the transcribed text so that its character set matches the wav2vec model's (e.g. converting numbers to words, and symbols like "%", currencies…). Also, the alignment can struggle with disfluencies that Whisper usually removes (so part of what the wav2vec model would recognize is missing, such as the beginnings of sentences that get reformulated).