
I’m using the F1 (non-free) tier of text-to-speech on Azure, with the OpenAI Neural non-HD voices, through the Python API. I’m getting deterministic partial completions: the audio rendering stops mid-word with an ‘Internal Server Error’ and a ‘partial data received’ message. And yet the same SSML works flawlessly using the same TTS instance through Speech Studio.

input SSML xml file: demo.xml

standalone python API code: demo.py

log file output: log.txt (you can see synthesis timing out)

  • The SSML works in Speech Studio
  • The same SSML fails to fully render using the Python code
  • But the SSML partially renders, so the Speech SDK config is correct

Log File Excerpt

[405035]: 35806ms SPX_TRACE_VERBOSE:  synthesizer_timeout_management.cpp:85 IsTimeout: synthesis might timeout, current RTF: 0.77 (threshold: 2.00), frame interval 9967 ms (threshold 3000ms)
[405035]: 35856ms SPX_TRACE_WARNING: synthesizer_timeout_management.cpp:80 IsTimeout: synthesis timed out, current RTF: 0.78 (threshold: 2.00), frame interval 10017 ms (threshold 3000ms)
[405035]: 35857ms SPX_DBG_TRACE_VERBOSE:  usp_tts_engine_adapter.cpp:376 StopSpeaking
[405035]: 35857ms SPX_DBG_TRACE_VERBOSE:  usp_tts_engine_adapter.cpp:1040 Response: On Error: Code:6, Message: Timeout while synthesizing. Current RTF: 0.775118 (threshold 2), frame interval 10018ms (threshold 3000ms)..
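These SDK timeout traces follow a fixed format, so they are easy to pull apart mechanically. Here is a minimal sketch (the `scan_log` helper and its regex are my own, written against the excerpt above — not part of the SDK) that extracts the RTF and frame-interval figures from a log:

```python
import re

# Matches Speech SDK timeout trace lines such as:
# "IsTimeout: synthesis timed out, current RTF: 0.78 (threshold: 2.00),
#  frame interval 10017 ms (threshold 3000ms)"
PATTERN = re.compile(
    r"IsTimeout: synthesis (?P<state>might timeout|timed out), "
    r"current RTF: (?P<rtf>[\d.]+).*?"
    r"frame interval (?P<interval>\d+) ?ms"
)

def scan_log(text: str):
    """Yield (state, rtf, frame_interval_ms) for each timeout trace line."""
    for m in PATTERN.finditer(text):
        yield m.group("state"), float(m.group("rtf")), int(m.group("interval"))

sample = (
    "[405035]: 35856ms SPX_TRACE_WARNING: synthesizer_timeout_management.cpp:80 "
    "IsTimeout: synthesis timed out, current RTF: 0.78 (threshold: 2.00), "
    "frame interval 10017 ms (threshold 3000ms)"
)
for state, rtf, interval in scan_log(sample):
    print(state, rtf, interval)
```

Running something like this over log.txt shows how close each frame gets to the 3000 ms interval threshold before the synthesizer gives up.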

2 Answers


  1. Chosen as BEST ANSWER

    Cool! What we learned is that the Python API breaks if SSML elements are indented by spaces. I'd call that a bug, but I haven't read the SSML spec to know better.

    Thanks to Suresh for suggesting the SSML may be to blame, even if some speech services accept it okay.
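For anyone hitting the same thing, a minimal sketch of the workaround (the `flatten_ssml` helper name is my own, not an SDK function): strip the leading indentation from each line of the SSML string before passing it to `speak_ssml_async`.

```python
def flatten_ssml(ssml: str) -> str:
    """Remove leading whitespace from every line of an SSML string.

    The markup itself is unchanged; only the indentation that appears
    to trip up the Python Speech SDK is stripped.
    """
    return "\n".join(line.lstrip() for line in ssml.splitlines())


# Example: indented SSML in, flat SSML out.
indented = (
    '<speak version="1.0" xml:lang="en-US">\n'
    '  <voice name="en-US-ShimmerMultilingualNeural">\n'
    '    Hello world.\n'
    '  </voice>\n'
    '</speak>'
)
print(flatten_ssml(indented))
```

This keeps the source file readable while sending the SDK a flat string.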


  2. The Python API fails to handle SSML that is indented with spaces: when the SSML elements were indented, synthesis broke partway through, leaving incomplete audio.

    Here’s the version of the SSML that works without indentation:

    <speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="en-US">
      <voice name="en-US-ShimmerMultilingualNeural">
        Lorem ipsum dolor sit amet! Consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
    
        <break time="800ms"/>
        <prosody contour="(1%, +19%) (45%, -12%) (100%, -36%)">
          A TEMPOR INCIDIDUNT
        </prosody>
        <break time="800ms"/>
    
        Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
      </voice>
    </speak>
    

    Code sample:

    import azure.cognitiveservices.speech as speechsdk
    
    def render_ssml_to_file(ssml: str, filename: str):
        SPEECH_KEY = "YourAzureTTSSubscriptionKey"  # Replace with your actual Azure Speech Key
        SPEECH_REGION = "northcentralus"
        speech_config = speechsdk.SpeechConfig(subscription=SPEECH_KEY, region=SPEECH_REGION)
        
        speech_config.set_speech_synthesis_output_format(speechsdk.SpeechSynthesisOutputFormat.Audio16Khz32KBitRateMonoMp3)
        audio_config = speechsdk.audio.AudioOutputConfig(filename=filename)
        # Note: the <voice> element inside the SSML overrides this default voice.
        speech_config.speech_synthesis_voice_name = 'en-US-EmmaNeural'
        speech_config.enable_audio_logging()
        speech_config.set_property(speechsdk.PropertyId.Speech_LogFilename, "./log.txt")
    
        speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)
        speech_synthesis_result = speech_synthesizer.speak_ssml_async(ssml).get()
    
        if speech_synthesis_result.reason == speechsdk.ResultReason.Canceled:
            cancellation_details = speech_synthesis_result.cancellation_details
            print(f"Speech synthesis canceled: {cancellation_details.reason}")
            if cancellation_details.error_details:
                print(f"Error details: {cancellation_details.error_details}")
    
    # Read SSML and render it to MP3
    try:
        with open('demo.xml', 'r', encoding='utf-8') as file:
            text = file.read()
        render_ssml_to_file(text, 'demo.mp3')
    except Exception as e:
        print(f"Error reading SSML file: {e}")
    
    • Removing the indentation from the SSML resolved the issue. This behavior points to a likely bug in the Azure Python SDK’s SSML handling.

