I’m using the F1 (non-free) tier of Azure text-to-speech with the OpenAI neural non-HD voices, through the Python API. I’m getting deterministic partial completions: the audio rendering ends mid-word with an ‘Internal Server Error’ and a ‘partial data received’ message. Yet the same SSML works flawlessly on the same TTS instance through Speech Studio.
- Input SSML XML file: demo.xml
- Standalone Python API code: demo.py
- Log file output: log.txt (you can see synthesis timing out)
- The SSML works in Speech Studio.
- The SSML fails to fully render through the Python code.
- But the SSML partially renders, so the Speech SDK configuration is correct.
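For context, a minimal sketch of what a standalone demo.py-style script looks like, assuming the `azure-cognitiveservices-speech` package and that the key and region come from environment variables (the variable names and file names here are illustrative, not the actual attachment):

```python
# Minimal standalone SSML synthesis sketch (demo.py-style), not the original
# attachment. Assumes azure-cognitiveservices-speech is installed and that
# SPEECH_KEY / SPEECH_REGION are set in the environment.
import os


def load_ssml(path: str) -> str:
    """Read an SSML document from disk."""
    with open(path, encoding="utf-8") as f:
        return f.read()


def synthesize(ssml: str, key: str, region: str) -> bytes:
    # Imported lazily so load_ssml is usable without the SDK installed.
    import azure.cognitiveservices.speech as speechsdk

    config = speechsdk.SpeechConfig(subscription=key, region=region)
    # audio_config=None returns audio bytes instead of playing to the speaker.
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=config,
                                              audio_config=None)
    result = synthesizer.speak_ssml_async(ssml).get()
    if result.reason != speechsdk.ResultReason.SynthesizingAudioCompleted:
        raise RuntimeError(f"Synthesis did not complete: {result.reason}")
    return result.audio_data


if __name__ == "__main__" and "SPEECH_KEY" in os.environ:
    audio = synthesize(load_ssml("demo.xml"),
                       os.environ["SPEECH_KEY"], os.environ["SPEECH_REGION"])
    with open("demo.wav", "wb") as f:
        f.write(audio)
```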
Log File Excerpt
[405035]: 35806ms SPX_TRACE_VERBOSE: synthesizer_timeout_management.cpp:85 IsTimeout: synthesis might timeout, current RTF: 0.77 (threshold: 2.00), frame interval 9967 ms (threshold 3000ms)
[405035]: 35856ms SPX_TRACE_WARNING: synthesizer_timeout_management.cpp:80 IsTimeout: synthesis timed out, current RTF: 0.78 (threshold: 2.00), frame interval 10017 ms (threshold 3000ms)
[405035]: 35857ms SPX_DBG_TRACE_VERBOSE: usp_tts_engine_adapter.cpp:376 StopSpeaking
[405035]: 35857ms SPX_DBG_TRACE_VERBOSE: usp_tts_engine_adapter.cpp:1040 Response: On Error: Code:6, Message: Timeout while synthesizing. Current RTF: 0.775118 (threshold 2), frame interval 10018ms (threshold 3000ms)..
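Note what the log is actually reporting: the real-time factor (0.78) is well under its 2.00 threshold, so the timeout is triggered by the other condition, the ~10-second gap between audio frames. A small illustration of that two-condition check (this is an illustration of the log's logic, not the SDK source):

```python
# Illustration of the two timeout conditions visible in the log: synthesis is
# flagged when EITHER the real-time factor (synthesis time / audio duration)
# exceeds 2.0, OR the gap between received audio frames exceeds 3000 ms.
RTF_THRESHOLD = 2.0
FRAME_INTERVAL_THRESHOLD_MS = 3000


def is_timeout(rtf: float, frame_interval_ms: float) -> bool:
    return rtf > RTF_THRESHOLD or frame_interval_ms > FRAME_INTERVAL_THRESHOLD_MS


# Values from the log: RTF 0.78 is fine, but the 10017 ms frame gap trips it.
print(is_timeout(0.78, 10017))  # → True
```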
2 Answers
Cool! What we learned is that the Python API breaks if the SSML elements are indented by spaces: with the indentation in place, synthesis stops partway through. I'd call that a bug, but I haven't read the SSML spec closely enough to know better.

Thanks to Suresh for suggesting the SSML itself might be to blame, even though some speech services accept it fine.
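If you'd rather keep the source file readable, one workaround is to flatten the document before handing it to the synthesizer. A minimal sketch, assuming per-line indentation is the only problem (the helper name is mine, not from the SDK):

```python
# Sketch of a workaround: strip the leading indentation from every line of an
# SSML document before passing it to the Speech SDK, so the on-the-wire SSML
# has no indented elements.
def flatten_ssml(ssml: str) -> str:
    """Remove leading whitespace from each line of an SSML document."""
    return "\n".join(line.lstrip() for line in ssml.splitlines())


indented = (
    '<speak version="1.0" xml:lang="en-US">\n'
    '    <voice name="en-US-AlloyTurboMultilingualNeural">\n'
    '        Hello, world.\n'
    '    </voice>\n'
    '</speak>'
)
print(flatten_ssml(indented))
```

The original indented file stays in version control; only the string sent to `speak_ssml_async` is flattened.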
Here’s the version of the SSML that works without indentation:
Code sample:
Result:
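The original code sample and result aren't shown here, but as a hedged illustration (the voice name is a placeholder, not necessarily the one from the post), a flattened SSML document simply starts every element at column one:

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
<voice name="en-US-AlloyTurboMultilingualNeural">
No element on any line is preceded by spaces.
</voice>
</speak>
```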