
I use OpenAI’s Whisper Python library for speech recognition. How can I get word-level timestamps?


To transcribe with OpenAI’s Whisper (tested on Ubuntu 20.04 x64 LTS with an Nvidia GeForce RTX 3090):

conda create -y --name whisperpy39 python==3.9
conda activate whisperpy39
pip install git+https://github.com/openai/whisper.git 
sudo apt update && sudo apt install ffmpeg
whisper recording.wav
whisper recording.wav --model large
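
A minimal sketch of the same transcription through the Python library (out of the box, the result only carries phrase-level segment timestamps; word-level timestamps are what the answers below address):

import whisper

# load one of the available model sizes ("base", "small", "medium", "large", ...)
model = whisper.load_model("large")

# transcribe; the result is a dict with the full "text" and a list of "segments"
result = model.transcribe("recording.wav")
print(result["text"])

# each segment carries phrase-level start/end timestamps in seconds
for segment in result["segments"]:
    print(f"[{segment['start']:.2f} -> {segment['end']:.2f}]{segment['text']}")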

If using an Nvidia GeForce RTX 3090, add the following after conda activate whisperpy39:

pip install torch torchvision torchaudio -f https://download.pytorch.org/whl/torch_stable.html
conda install pytorch==1.10.1 torchvision torchaudio cudatoolkit=11.0 -c pytorch
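
To check that the CUDA build of PyTorch is actually picked up before running Whisper:

import torch

# should print True and the GPU name (e.g. an RTX 3090) when the CUDA build is installed
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))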

2 Answers


  1. Chosen as BEST ANSWER

    https://openai.com/blog/whisper/ only mentions "phrase-level timestamps", so I infer that word-level timestamps are not obtainable without adding more code.

    From one of the Whisper authors:

    Getting word-level timestamps are not directly supported, but it could be possible using the predicted distribution over the timestamp tokens or the cross-attention weights.

    https://github.com/jianfch/stable-ts (MIT License):

    This script modifies methods of Whisper's model to gain access to the predicted timestamp tokens of each word without needing additional inference. It also stabilizes the timestamps down to the word level to ensure chronology.
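
    A rough usage sketch, assuming the API described in the stable-ts README (the exact function names have changed across versions, so treat them as assumptions and check the repo):

        import stable_whisper

        # stable_whisper.load_model wraps Whisper's model loading and patches it
        # so that word-level timestamps are kept in the result
        model = stable_whisper.load_model("base")
        result = model.transcribe("recording.wav")

        # write the stabilized timestamps out as subtitles
        result.to_srt_vtt("recording.srt")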

    Another option: use some word-level forced alignment program. E.g., Lhotse (Apache-2.0 license) has integrated both Whisper ASR and Wav2vec forced alignment:

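    A rough sketch of that Lhotse workflow (annotate_with_whisper and align_with_torchaudio come from Lhotse's workflows module; treat the exact signatures as assumptions and check the Lhotse documentation):

        from lhotse import CutSet, Recording, RecordingSet
        from lhotse.workflows import annotate_with_whisper, align_with_torchaudio

        # describe the audio to process as a CutSet
        recordings = RecordingSet.from_recordings([Recording.from_file("recording.wav")])
        cuts = CutSet.from_manifests(recordings=recordings)

        # transcribe with Whisper, then force-align word timings with a wav2vec2 model;
        # wrapped in CutSet.from_cuts in case the workflows return lazy iterators
        cuts = CutSet.from_cuts(annotate_with_whisper(cuts))
        cuts = CutSet.from_cuts(align_with_torchaudio(cuts))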


  2. I created a repo to recover word-level timestamps (and confidence), and also more accurate segment timestamps:
    https://github.com/Jeronymous/whisper-timestamped

    It is based on Whisper's cross-attention weights, as in this notebook in the Whisper repo. I tuned the approach a bit to get more accurate word locations, and added the possibility of computing the cross-attention on the fly, so there is no need to run the Whisper model twice and there is no memory issue when processing long audio.
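
    For example, following the repo's README (function names per that README; check it for the current API):

        import json
        import whisper_timestamped as whisper

        audio = whisper.load_audio("recording.wav")
        model = whisper.load_model("small")
        result = whisper.transcribe(model, audio)

        # each segment now contains a "words" list with per-word start/end times and confidence
        print(json.dumps(result, indent=2, ensure_ascii=False))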

    Note: I first tried the approach of using a wav2vec model to realign Whisper's transcribed words to the input audio. It works reasonably well, but it has many drawbacks: it requires handling a separate (wav2vec) model, performing another inference over the full signal, having one wav2vec model per language, and normalizing the transcribed text so that its character set matches the wav2vec model's (e.g. converting numbers to words, and symbols like "%", currencies…). The alignment can also have trouble with disfluencies, which Whisper usually removes, so part of what the wav2vec model would recognize is missing (such as the beginnings of sentences that are reformulated).
