
I am using the Azure Speech SDK's SpeechSynthesizer in Python. I have written code that converts text to speech, but I am finding that I need to call get() on the result before any synthesis actually happens, and that get() call is essentially blocking.

pull_stream = speechsdk.audio.PullAudioOutputStream()
stream_config = speechsdk.audio.AudioOutputConfig(stream=pull_stream)
speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=stream_config)

result = speech_synthesizer.speak_text_async(text)
result.get()
del speech_synthesizer

If I don’t call result.get(), I am unable to pull any data from the stream. But when I call result.get(), it blocks for several seconds while it translates the text to speech. I have run this with an AudioOutputConfig of filename to have it just save to a wave file, and the timing is about the same. So I know it is doing the same work regardless of whether I get the output as a stream or a file.

Any pointers on how to get this to actually work asynchronously so I can pull from the stream as it is translating, and not have to wait until it completes?

2 Answers


  1. Chosen as BEST ANSWER

    Using Dasani's code as a starting point, I was able to modify it and get it to work. I had to convert the PCM data to WAV format before saving it to a file. I also needed a rather odd hack: dropping the first part of each buffer I receive in the synthesizing callback (see the code). I experimented with various sizes, and 46 bytes seems to be the right amount.

    import azure.cognitiveservices.speech as speechsdk
    import time
    import wave
    import io
    
    subscription_key = "<speech_key>"
    region = "<speech_region>"
    text = "Hello Kamali, welcome."
    output_wave = "output.wav"
    audio_buffer = io.BytesIO()
    still_synthesizing = True
    
    def synthesis_callback(evt):
        global audio_buffer
    
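        # Hack: the first 46 bytes of each chunk are not audio samples
        # (value found by trial and error).
        header_offset = 46
    
        if evt.result.reason == speechsdk.ResultReason.SynthesizingAudio:
            # Slice from the front so a chunk of 46 bytes or fewer contributes nothing
            audio_buffer.write(evt.result.audio_data[header_offset:])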
        elif evt.result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
            print("Speech synthesis completed.")
    
    def completed_callback(evt):
        global still_synthesizing
    
        print("Synthesis completed")
        still_synthesizing = False
    
    pull_stream = speechsdk.audio.PullAudioOutputStream()
    stream_config = speechsdk.audio.AudioOutputConfig(stream=pull_stream)
    speech_config = speechsdk.SpeechConfig(subscription=subscription_key, region=region)
    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=stream_config)
    
    speech_synthesizer.synthesis_started.connect(lambda evt: print("Synthesis started"))
    speech_synthesizer.synthesizing.connect(synthesis_callback)
    speech_synthesizer.synthesis_completed.connect(completed_callback)
    
    result = speech_synthesizer.speak_text_async(text)
    
    # No need to make a .get() call
    # No need to remove speech_synthesizer
    # del speech_synthesizer
    
    # Give it time to work asynchronously
    while still_synthesizing:
        time.sleep(.1)
    
    # Save the PCM data to a WAV file
    audio_buffer.seek(0)
    with wave.open(output_wave, 'wb') as wav_file:
        wav_file.setnchannels(1)      # Mono
        wav_file.setsampwidth(2)      # 2 bytes per sample (16-bit)
        wav_file.setframerate(16000)  # Sample rate in Hz
        wav_file.writeframes(audio_buffer.getvalue())
    
    print(f"Audio saved to {output_wave}")
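One possible explanation for the 46-byte trim, offered here as an assumption that nothing in this thread confirms, is that each chunk begins with a RIFF/WAVE header: a header whose fmt sub-chunk is 18 bytes long comes to exactly 46 bytes. Under that assumption, a minimal sketch that locates the start of the audio instead of hard-coding the offset might look like the following. `strip_riff_header` and `make_header` are hypothetical helper names, and `make_header` exists only to build test input without calling the service.

```python
import struct


def strip_riff_header(chunk: bytes) -> bytes:
    """Return the PCM payload, dropping a leading RIFF/WAVE header if present."""
    if len(chunk) < 12 or chunk[:4] != b"RIFF" or chunk[8:12] != b"WAVE":
        return chunk  # No header found: treat the whole chunk as raw PCM.
    pos = 12
    while pos + 8 <= len(chunk):
        chunk_id = chunk[pos:pos + 4]
        (size,) = struct.unpack_from("<I", chunk, pos + 4)
        if chunk_id == b"data":
            # Ignore the declared size: streamed headers often leave it at 0.
            return chunk[pos + 8:]
        pos += 8 + size + (size & 1)  # Sub-chunks are padded to an even length.
    return b""  # Header present but no data sub-chunk in this buffer.


def make_header(fmt_size: int) -> bytes:
    """Hypothetical test helper: build a 16 kHz, 16-bit, mono WAV header."""
    fmt = struct.pack("<HHIIHH", 1, 1, 16000, 32000, 2, 16)
    fmt += b"\x00" * (fmt_size - 16)  # Extra bytes, e.g. the 2-byte cbSize field.
    return (b"RIFF" + struct.pack("<I", 0) + b"WAVE"
            + b"fmt " + struct.pack("<I", fmt_size) + fmt
            + b"data" + struct.pack("<I", 0))


pcm = b"\x01\x02\x03\x04"
print(len(make_header(18)))                         # 46
print(strip_riff_header(make_header(18) + pcm))     # b'\x01\x02\x03\x04'
```

In the callback above, this would replace the fixed slice, e.g. `audio_buffer.write(strip_riff_header(evt.result.audio_data))`. If the chunks turn out not to carry a header at all, the function returns them unchanged.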
    

  2. I tried the following code, which calls speech_synthesizer.speak_text_async(text).get() and writes the audio to a .wav file, and it successfully converted the text to speech.

    Code :

    import azure.cognitiveservices.speech as speechsdk
    
    subscription_key = "<speech_key>"
    region = "<speech_region>"
    text = "Hello Kamali,welcome."
    output_file = "output.wav"  
    
    def synthesis_callback(evt):
        """
        Callback function to handle speech synthesis events.
        """
        if evt.result.reason == speechsdk.ResultReason.SynthesizingAudio:
            audio_data = evt.result.audio_data
            with open(output_file, "ab") as f:
                f.write(audio_data)
        elif evt.result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
            print("Speech synthesis completed.")
    pull_stream = speechsdk.audio.PullAudioOutputStream()
    stream_config = speechsdk.audio.AudioOutputConfig(stream=pull_stream)
    speech_config = speechsdk.SpeechConfig(subscription=subscription_key, region=region)
    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=stream_config)
    
    speech_synthesizer.synthesis_started.connect(lambda evt: print("Synthesis started"))
    speech_synthesizer.synthesizing.connect(synthesis_callback)
    speech_synthesizer.synthesis_completed.connect(lambda evt: print("Synthesis completed"))
    
    result = speech_synthesizer.speak_text_async(text).get()  
    
    del speech_synthesizer
    print(f"Audio saved to {output_file}")
    

    Output :

    The code above successfully converted the text to speech, with the following console output:

    C:Usersxxxxxxxxxkamali> python test.py
    Synthesis started
    Audio saved to output.wav
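Because this version still calls .get(), the main thread blocks until synthesis finishes. To consume audio while it is still being produced, one option is to have the synthesizing callback push each chunk onto a queue that another thread drains. The sketch below shows only that hand-off pattern. `fake_synthesizer` is a hypothetical stand-in for the SDK firing its events from a background thread, and `on_synthesizing`, `on_completed`, and `consume` are illustrative names rather than SDK functions.

```python
import queue
import threading

_DONE = object()        # Sentinel that marks the end of synthesis.
chunks = queue.Queue()  # Thread-safe hand-off between callback and consumer.


def on_synthesizing(audio_data: bytes) -> None:
    """With the real SDK, call this from the synthesizing event handler."""
    chunks.put(audio_data)


def on_completed() -> None:
    """With the real SDK, call this from the synthesis_completed event handler."""
    chunks.put(_DONE)


def consume(sink: bytearray) -> None:
    """Drain chunks as they arrive; blocks only until the next chunk is ready."""
    while True:
        item = chunks.get()
        if item is _DONE:
            break
        sink.extend(item)


def fake_synthesizer(parts) -> None:
    """Hypothetical stand-in for the SDK: fires the callbacks in order."""
    for part in parts:
        on_synthesizing(part)
    on_completed()


received = bytearray()
producer = threading.Thread(target=fake_synthesizer, args=([b"ab", b"cd", b"ef"],))
producer.start()
consume(received)       # Runs concurrently with the producer thread.
producer.join()
print(bytes(received))  # b'abcdef'
```

With the real synthesizer, the handlers would be connected along the lines of `speech_synthesizer.synthesizing.connect(lambda evt: on_synthesizing(evt.result.audio_data))`, and the sentinel should also be posted from the `synthesis_canceled` event so the consumer does not wait forever after a failure.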
    
