
I am using the Azure SpeechSynthesizer libraries in Python and have written code that converts some text into speech. I am finding that I need to call get() on the result of speak_text_async() before any speech synthesis actually happens, but this get() call is blocking.

pull_stream = speechsdk.audio.PullAudioOutputStream()
stream_config = speechsdk.audio.AudioOutputConfig(stream=pull_stream)
speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=stream_config)

result = speech_synthesizer.speak_text_async(text)
del speech_synthesizer

If I don’t call result.get(), I am unable to pull any data from the stream. But when I call result.get(), it blocks for several seconds while it converts the text to speech. I have also run this with an AudioOutputConfig that takes a filename, so that it just saves to a WAV file, and the timing is about the same. So I know it is doing the same work regardless of whether I get the output as a stream or a file.

Any pointers on how to get this to actually work asynchronously so I can pull from the stream as it is translating, and not have to wait until it completes?



  1. Chosen as BEST ANSWER

    Using Dasani's code, I was able to modify it and get it to work. I had to convert the PCM data to WAV format before saving it to a file. I also needed a really weird hack: removing the first part of each buffer I receive in the synthesizing callback. See the code for details. I played around with various sizes, and 46 bytes seems to be the right amount.

    import azure.cognitiveservices.speech as speechsdk
    import time
    import wave
    import io
    subscription_key = "<speech_key>"
    region = "<speech_region>"
    text = "Hello Kamali, welcome."
    output_wave = "output.wav"
    audio_buffer = io.BytesIO()
    still_synthesizing = True
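    def synthesis_callback(evt):
        global audio_buffer
        # Each chunk delivered to this callback starts with 46 bytes that are
        # not audio samples (found by trial and error), so skip them.
        header_offset = 46
        chunk_size = len(evt.result.audio_data) - header_offset
        if evt.result.reason == speechsdk.ResultReason.SynthesizingAudio:
            if chunk_size > 0:
                audio_buffer.write(evt.result.audio_data[header_offset:])
        elif evt.result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
            print("Speech synthesis completed.")
    def completed_callback(evt):
        global still_synthesizing
        print("Synthesis completed")
        still_synthesizing = False
    pull_stream = speechsdk.audio.PullAudioOutputStream()
    stream_config = speechsdk.audio.AudioOutputConfig(stream=pull_stream)
    speech_config = speechsdk.SpeechConfig(subscription=subscription_key, region=region)
    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=stream_config)
    speech_synthesizer.synthesis_started.connect(lambda evt: print("Synthesis started"))
    # Receive audio chunks while synthesis is still running
    speech_synthesizer.synthesizing.connect(synthesis_callback)
    speech_synthesizer.synthesis_completed.connect(completed_callback)
    result = speech_synthesizer.speak_text_async(text)
    # No need to make a .get() call
    # No need to remove speech_synthesizer
    # del speech_synthesizer
    # Give it time to work asynchronously
    while still_synthesizing:
        time.sleep(0.1)
    # Save the PCM data to a WAV file
    with wave.open(output_wave, 'wb') as wav_file:
        wav_file.setnchannels(1)  # Mono
        wav_file.setsampwidth(2)   # 16-bit samples (2 bytes each)
        wav_file.setframerate(16000)  # Sample rate
        wav_file.writeframes(audio_buffer.getvalue())
    print(f"Audio saved to {output_wave}")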

  2. I tried the following code to convert text to speech using result = speech_synthesizer.speak_text_async(text).get() with a .wav file and successfully converted the text to speech.

    Code :

    import azure.cognitiveservices.speech as speechsdk
    import threading
    subscription_key = "<speech_key>"
    region = "<speech_region>"
    text = "Hello Kamali,welcome."
    output_file = "output.wav"  
    def synthesis_callback(evt):
        """Callback function to handle speech synthesis events."""
        if evt.result.reason == speechsdk.ResultReason.SynthesizingAudio:
            audio_data = evt.result.audio_data
            # Append each chunk to the output file as it arrives
            with open(output_file, "ab") as f:
                f.write(audio_data)
        elif evt.result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
            print("Speech synthesis completed.")
    pull_stream = speechsdk.audio.PullAudioOutputStream()
    stream_config = speechsdk.audio.AudioOutputConfig(stream=pull_stream)
    speech_config = speechsdk.SpeechConfig(subscription=subscription_key, region=region)
    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=stream_config)
    speech_synthesizer.synthesizing.connect(synthesis_callback)
    speech_synthesizer.synthesis_started.connect(lambda evt: print("Synthesis started"))
    speech_synthesizer.synthesis_completed.connect(lambda evt: print("Synthesis completed"))
    result = speech_synthesizer.speak_text_async(text).get()
    del speech_synthesizer
    print(f"Audio saved to {output_file}")

    Output :

    The code above successfully converted the text to speech, with the following output.

    C:\Users\xxxxxxxxx\kamali> python
    Synthesis started
    Audio saved to output.wav
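
For readers who want to try the pattern from the two answers without an Azure key, here is a minimal sketch that replaces the SDK with a stand-in worker thread. It shows the same three steps: collect chunks in a callback, drop the leading bytes of each chunk, and wrap the remaining PCM in a WAV container. It also waits on a `threading.Event` instead of polling a flag in a sleep loop. `ChunkCollector` and `fake_synthesizer` are hypothetical names invented for this illustration, and the 46-byte offset is simply the value reported in the accepted answer, not a documented SDK constant.

```python
import io
import threading
import wave

HEADER_BYTES = 46  # assumption: value taken from the accepted answer


class ChunkCollector:
    """Collects streamed audio chunks and signals when synthesis is done."""

    def __init__(self, header_bytes=HEADER_BYTES):
        self.header_bytes = header_bytes
        self.buffer = io.BytesIO()
        self.done = threading.Event()

    def on_chunk(self, chunk):
        # Keep only the payload that follows the leading header bytes.
        if len(chunk) > self.header_bytes:
            self.buffer.write(chunk[self.header_bytes:])

    def on_completed(self):
        # Wake up whoever is waiting for synthesis to finish.
        self.done.set()

    def to_wav_bytes(self, sample_rate=16000, channels=1, sample_width=2):
        # Wrap the raw PCM in a WAV container (16 kHz, mono, 16-bit by default).
        out = io.BytesIO()
        with wave.open(out, "wb") as wav_file:
            wav_file.setnchannels(channels)
            wav_file.setsampwidth(sample_width)
            wav_file.setframerate(sample_rate)
            wav_file.writeframes(self.buffer.getvalue())
        return out.getvalue()


def fake_synthesizer(collector, chunks):
    """Hypothetical stand-in for the SDK: delivers chunks, then completes."""
    for chunk in chunks:
        collector.on_chunk(chunk)
    collector.on_completed()


collector = ChunkCollector()
pcm = b"\x01\x02" * 800  # 800 fake 16-bit samples per chunk
chunks = [b"\x00" * HEADER_BYTES + pcm, b"\x00" * HEADER_BYTES + pcm]

worker = threading.Thread(target=fake_synthesizer, args=(collector, chunks))
worker.start()
finished = collector.done.wait(timeout=5)  # blocks without busy-polling
worker.join()

wav_bytes = collector.to_wav_bytes()
```

With the real SDK, the same collector could presumably be wired up by connecting `on_chunk` to the synthesizer's `synthesizing` signal (passing `evt.result.audio_data`) and `on_completed` to `synthesis_completed`, as the callbacks in the answers do.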
