Does Azure speech_synthesizer.speak_text_async() really execute asynchronously?

brentlyjr
March 18, 2024
2 Answers

I am using the Azure SpeechSynthesizer library in Python. I have written code that converts some text into speech. I am finding that you need to call get() on the result before it actually does any speech synthesis, but this get() call is essentially blocking.

pull_stream = speechsdk.audio.PullAudioOutputStream()
stream_config = speechsdk.audio.AudioOutputConfig(stream=pull_stream)
speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=stream_config)

result = speech_synthesizer.speak_text_async(text)
result.get()
del speech_synthesizer

If I don’t call result.get(), I am unable to pull any data from the stream. But when I call result.get(), it blocks for several seconds while it converts the text to speech. I have also run this with an AudioOutputConfig that takes a filename, so that it just saves to a WAV file, and the timing is about the same. So I know it is doing the same work regardless of whether I get the output as a stream or a file.

Any pointers on how to get this to actually work asynchronously, so I can pull from the stream while it is synthesizing and not have to wait until it completes?

Answers

Chosen as BEST ANSWER

Using Dasani's code, I was able to modify it and get it to work. I had to convert the PCM data to WAV format before saving it to a file. I also needed a really weird hack: removing the first part of each buffer I get in the synthesizing callback. See the code for details. I played around with various sizes, and 46 bytes seems like the right amount.

import azure.cognitiveservices.speech as speechsdk
import time
import wave
import io

subscription_key = "<speech_key>"
region = "<speech_region>"
text = "Hello Kamali, welcome."
output_wave = "output.wav"
audio_buffer = io.BytesIO()
still_synthesizing = True

def synthesis_callback(evt):
    global audio_buffer

    header_offset = 46  # bytes to drop from the front of each chunk (found by trial and error)

    if evt.result.reason == speechsdk.ResultReason.SynthesizingAudio:
        # Slice from the front so a chunk of exactly 46 bytes yields no data
        audio_buffer.write(evt.result.audio_data[header_offset:])
    elif evt.result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
        print("Speech synthesis completed.")

def completed_callback(evt):
    global still_synthesizing

    print("Synthesis completed")
    still_synthesizing = False

pull_stream = speechsdk.audio.PullAudioOutputStream()
stream_config = speechsdk.audio.AudioOutputConfig(stream=pull_stream)
speech_config = speechsdk.SpeechConfig(subscription=subscription_key, region=region)
speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=stream_config)

speech_synthesizer.synthesis_started.connect(lambda evt: print("Synthesis started"))
speech_synthesizer.synthesizing.connect(synthesis_callback)
speech_synthesizer.synthesis_completed.connect(completed_callback)

result = speech_synthesizer.speak_text_async(text)

# No need to make a .get() call
# No need to remove speech_synthesizer
# del speech_synthesizer

# Give it time to work asynchronously
while still_synthesizing:
    time.sleep(.1)

# Save the PCM data to a WAV file
audio_buffer.seek(0)
with wave.open(output_wave, 'wb') as wav_file:
    wav_file.setnchannels(1)  # Mono
    wav_file.setsampwidth(2)   # 2 bytes per sample (16-bit)
    wav_file.setframerate(16000)  # Sample rate
    wav_file.writeframes(audio_buffer.getvalue())

print(f"Audio saved to {output_wave}")
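
A note on the 46-byte offset above: it is consistent with a RIFF/WAVE header whose fmt chunk is 18 bytes long (12 + 8 + 18 + 8 = 46), although nothing in this thread confirms that this is what the SDK puts at the front of each chunk. Assuming it is, a less fragile approach might be to locate the data chunk instead of hard-coding the size. The helper names below (find_pcm_offset, strip_wav_header) are made up for this sketch and are not part of the Speech SDK.

```python
import struct


def find_pcm_offset(chunk: bytes) -> int:
    """Return the index where PCM samples start, or 0 if there is no RIFF header."""
    if len(chunk) < 12 or chunk[:4] != b"RIFF" or chunk[8:12] != b"WAVE":
        return 0  # no header: treat the whole buffer as raw PCM
    pos = 12  # first sub-chunk follows "RIFF" + size + "WAVE"
    while pos + 8 <= len(chunk):
        chunk_id = chunk[pos:pos + 4]
        (size,) = struct.unpack_from("<I", chunk, pos + 4)
        if chunk_id == b"data":
            return pos + 8  # samples begin right after the id and size fields
        pos += 8 + size + (size & 1)  # sub-chunks are padded to an even length
    return 0  # header without a data chunk: leave the bytes untouched


def strip_wav_header(chunk: bytes) -> bytes:
    """Drop a leading RIFF/WAVE header if one is present."""
    return chunk[find_pcm_offset(chunk):]
```

In the synthesizing callback, this would replace the fixed slice with something like audio_buffer.write(strip_wav_header(evt.result.audio_data)), again assuming each chunk really does start with such a header.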

(Edit)

I tried the following code, which calls speech_synthesizer.speak_text_async(text).get() and appends the audio to a .wav file from the synthesizing callback. It converted the text to speech successfully.

Code :

import azure.cognitiveservices.speech as speechsdk
import threading

subscription_key = "<speech_key>"
region = "<speech_region>"
text = "Hello Kamali,welcome."
output_file = "output.wav"  

def synthesis_callback(evt):
    """
    Callback function to handle speech synthesis events.
    """
    if evt.result.reason == speechsdk.ResultReason.SynthesizingAudio:
        audio_data = evt.result.audio_data
        with open(output_file, "ab") as f:
            f.write(audio_data)
    elif evt.result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
        print("Speech synthesis completed.")

pull_stream = speechsdk.audio.PullAudioOutputStream()
stream_config = speechsdk.audio.AudioOutputConfig(stream=pull_stream)
speech_config = speechsdk.SpeechConfig(subscription=subscription_key, region=region)
speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=stream_config)

speech_synthesizer.synthesis_started.connect(lambda evt: print("Synthesis started"))
speech_synthesizer.synthesizing.connect(synthesis_callback)
speech_synthesizer.synthesis_completed.connect(lambda evt: print("Synthesis completed"))

result = speech_synthesizer.speak_text_async(text).get()  

del speech_synthesizer
print(f"Audio saved to {output_file}")

Output :

The code above successfully converted the text to speech, with the following output.

C:\Users\xxxxxxxxx\kamali> python test.py
Synthesis started
Audio saved to output.wav
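
Both answers still wait for the whole synthesis to finish, one by polling a flag and the other by blocking in .get(). For the original goal of handling audio while it is being produced, one option is to have the callbacks push chunks onto a queue.Queue and consume them from the main thread. The following is only a sketch of that pattern: fake_synthesizer is a hypothetical stand-in that emits chunks from a worker thread, so the example runs without the Speech SDK. With the real SDK, the assumption is that the synthesizing callback would call chunks.put(evt.result.audio_data) and the synthesis_completed callback would put the sentinel.

```python
import queue
import threading
import time

_DONE = object()  # sentinel that marks the end of synthesis


def consume_chunks(chunks: queue.Queue, handle_chunk) -> int:
    """Pass each chunk to handle_chunk as it arrives; return the total byte count."""
    total = 0
    while True:
        item = chunks.get()  # blocks only until the next chunk is available
        if item is _DONE:
            return total
        handle_chunk(item)
        total += len(item)


def fake_synthesizer(chunks: queue.Queue, pieces, delay: float = 0.01) -> None:
    """Hypothetical stand-in for the SDK: emits chunks from another thread."""
    for piece in pieces:
        time.sleep(delay)
        chunks.put(piece)  # what a synthesizing callback would do
    chunks.put(_DONE)      # what a synthesis_completed callback would do


if __name__ == "__main__":
    chunks = queue.Queue()
    received = []
    producer = threading.Thread(
        target=fake_synthesizer, args=(chunks, [b"ab", b"cd", b"ef"])
    )
    producer.start()
    total = consume_chunks(chunks, received.append)  # handles chunks as they arrive
    producer.join()
    print(f"Received {total} bytes in {len(received)} chunks")
```

Compared with the time.sleep(.1) loop, the consumer wakes up as soon as a chunk is queued, and no global flag is needed.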
