
My final objective is to use TTS to convert some Indic text into audio and pass that audio to a messaging system that accepts MP3 and OGG. OGG is preferred.

I am on Ubuntu, and my flow for getting the audio string is something like this:

  1. Text in an Indic language is passed to an API.
  2. The API returns JSON containing a key called audioContent: audioString = response.json()['audio'][0]['audioContent']
  3. The decoded bytes are obtained with decode_string = base64.b64decode(audioString)
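
Steps 2 and 3 can be sketched entirely in memory. This is a minimal sketch, assuming the API payload is a Base64-encoded WAV file; `make_fake_response` is a hypothetical helper that stands in for the real API call so the snippet runs on its own.

```python
import base64
import io
import wave


def make_fake_response(n_frames: int = 1600, rate: int = 16000) -> dict:
    """Hypothetical stand-in for response.json(): silent mono 16-bit WAV, Base64-encoded."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * n_frames)
    encoded = base64.b64encode(buf.getvalue()).decode("ascii")
    return {"audio": [{"audioContent": encoded}]}


payload = make_fake_response()
audio_string = payload["audio"][0]["audioContent"]
wav_bytes = base64.b64decode(audio_string)  # raw WAV bytes, never written to disk

# Inspect the WAV header from an in-memory file object.
with wave.open(io.BytesIO(wav_bytes), "rb") as w:
    info = (w.getnchannels(), w.getframerate(), w.getnframes())
```

Here `wav_bytes` plays the role of decode_string, and `io.BytesIO(wav_bytes)` can be handed to any library that accepts a file-like object.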

I am currently converting it to MP3. As you can see, I write the WAV file first and then convert it to MP3.

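import base64
from pydub import AudioSegment  # AudioSegment.from_wav/export come from pydub

# Decode the Base64 payload and write it to a WAV file.
decode_string = base64.b64decode(audioString)
with open("output.wav", "wb") as wav_file:  # closes (and flushes) the file before it is read back
    wav_file.write(decode_string)

# Convert the WAV file to MP3.
print('mp3file')
song = AudioSegment.from_wav("output.wav")
song.export("temp.mp3", format="mp3")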

Is there a way to convert audioString directly to an OGG file without the intermediate file I/O?

I’ve tried torchaudio and pyffmpeg to load audioString and do the conversion, but neither seems to work.

2 Answers


  1. We can write the WAV data to the FFmpeg stdin pipe and read the encoded OGG data from the FFmpeg stdout pipe.
    Another answer of mine describes how to do this with video, and the same solution applies to audio.


    Piping architecture:

     --------------------  Encoded      ---------  Encoded      ------------
    | Input WAV encoded  | WAV data    | FFmpeg  | OGG data    | Store to   |
    | stream             | ----------> | process | ----------> | BytesIO    |
     --------------------  stdin PIPE   ---------  stdout PIPE  ------------
    

    The implementation is equivalent to the following shell command:
    cat input.wav | ffmpeg -y -f wav -i pipe: -acodec libopus -f ogg pipe: > test.ogg


    According to Wikipedia, common audio codecs for OGG format are Vorbis, Opus, FLAC, and OggPCM (I selected Opus audio codec).

    The example uses the ffmpeg-python module, but it is only a binding that runs FFmpeg as a sub-process (the FFmpeg CLI must be installed and must be on the execution path).
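
Because the binding only launches the CLI, it can help to check up front that the executable is reachable. A minimal sketch using the standard library; `require_executable` is a hypothetical helper name, not part of ffmpeg-python.

```python
import shutil


def require_executable(name: str = "ffmpeg") -> str:
    """Return the full path of `name` if it is on PATH, else raise FileNotFoundError."""
    path = shutil.which(name)
    if path is None:
        raise FileNotFoundError(f"{name!r} was not found on PATH; install it or adjust PATH")
    return path


# Non-raising variant: report whether FFmpeg is available on this machine.
ffmpeg_path = shutil.which("ffmpeg")
status = "available" if ffmpeg_path else "missing"
print(f"ffmpeg is {status}")
```

Calling `require_executable()` before starting the pipeline gives a clear error message instead of a failure deep inside the sub-process launch.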


    Execute FFmpeg sub-process with stdin pipe as input and stdout pipe as output:

    ffmpeg_process = (
        ffmpeg
        .input('pipe:', format='wav')
        .output('pipe:', format='ogg', acodec='libopus')
        .run_async(pipe_stdin=True, pipe_stdout=True)
    )
    

    The input format is set to wav, the output format is set to ogg and the selected encoder is libopus.


    Assuming the audio file is relatively large, we can’t write the entire WAV data at once: writing it without "draining" the stdout pipe lets the pipe buffers fill up, and the program hangs (deadlock).

    Instead, we write the WAV data (in chunks) in a separate thread and read the encoded data in the main thread.

    Here is a sample for the "writer" thread:

    def writer(ffmpeg_proc, wav_bytes_arr):
        chunk_size = 1024  # Chunk size in bytes (the exact size is not important).
        n_chunks = len(wav_bytes_arr) // chunk_size  # Number of full chunks (excluding the smaller remainder chunk at the end).
        remainder_size = len(wav_bytes_arr) % chunk_size  # Remainder bytes (zero when the total size is a multiple of chunk_size).
    
        for i in range(n_chunks):
            ffmpeg_proc.stdin.write(wav_bytes_arr[i*chunk_size:(i+1)*chunk_size])  # Write one chunk to the stdin pipe of the FFmpeg sub-process.
    
        if remainder_size > 0:
            ffmpeg_proc.stdin.write(wav_bytes_arr[chunk_size*n_chunks:])  # Write the remaining bytes to the stdin pipe of the FFmpeg sub-process.
    
        ffmpeg_proc.stdin.close()  # Close the stdin pipe: this signals end of input, so FFmpeg finishes encoding and exits.
    

    The "writer" thread writes the WAV data in small chunks.
    The last chunk is smaller when the total length is not a multiple of the chunk size.

    At the end, the stdin pipe is closed.
    Closing stdin signals end of input, so FFmpeg finishes encoding the data and the sub-process exits.
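
The chunking logic can also be written with a stepped range, which handles the smaller last chunk without a separate remainder branch. This is a sketch only: `FakeStdin` and `FakeProcess` are hypothetical stand-ins for the FFmpeg sub-process so the loop can be tried without FFmpeg, and `write_in_chunks` is an illustrative name.

```python
class FakeStdin:
    """Hypothetical stand-in for ffmpeg_proc.stdin: records every write."""

    def __init__(self):
        self.chunks = []
        self.closed = False

    def write(self, data):
        self.chunks.append(bytes(data))
        return len(data)

    def close(self):
        self.closed = True


class FakeProcess:
    """Hypothetical stand-in for the object returned by run_async()."""

    def __init__(self):
        self.stdin = FakeStdin()


def write_in_chunks(proc, data: bytes, chunk_size: int = 1024) -> None:
    view = memoryview(data)  # slicing a memoryview does not copy the bytes
    try:
        for start in range(0, len(view), chunk_size):
            proc.stdin.write(view[start:start + chunk_size])
    finally:
        proc.stdin.close()  # always signal end of input, even if a write fails


proc = FakeProcess()
wav_like = b"x" * 2500
write_in_chunks(proc, wav_like)
chunk_sizes = [len(c) for c in proc.stdin.chunks]
```

With 2500 bytes and a chunk size of 1024, the loop writes two full chunks followed by one 452-byte chunk and then closes stdin.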


    In the main thread, we start the writer thread and read the encoded OGG data from the stdout pipe (in chunks):

    thread = threading.Thread(target=writer, args=(ffmpeg_process, wav_bytes_array))
    thread.start()
    
    while thread.is_alive():
        ogg_chunk = ffmpeg_process.stdout.read(1024)  # Read chunk with arbitrary size from stdout pipe
        out_stream.write(ogg_chunk)  # Write the encoded chunk to the "in-memory file".
    

    For reading the remaining data, we may use ffmpeg_process.communicate():

    # Read the last encoded chunk.
    ogg_chunk = ffmpeg_process.communicate()[0]
    out_stream.write(ogg_chunk)  # Write the encoded chunk to the "in-memory file".
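
When the whole input already sits in memory, the standard library's Popen.communicate(input=...) is a possible alternative to the manual writer thread: it feeds stdin and drains stdout together, at the cost of buffering all the output in memory. The sketch below uses a child Python process that copies stdin to stdout as a stand-in for FFmpeg (an assumption, so it runs without FFmpeg installed); `pipe_through` is a hypothetical helper name.

```python
import subprocess
import sys

# Stand-in for the FFmpeg command line: a child process that echoes stdin to stdout.
STAND_IN_CMD = [
    sys.executable,
    "-c",
    "import sys, shutil; shutil.copyfileobj(sys.stdin.buffer, sys.stdout.buffer)",
]


def pipe_through(cmd, payload: bytes) -> bytes:
    """Send payload to the child's stdin and return everything it writes to stdout."""
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    out, _ = proc.communicate(input=payload)  # writes and reads without deadlocking
    if proc.returncode != 0:
        raise RuntimeError(f"child process exited with code {proc.returncode}")
    return out


payload = b"\x00\x01" * 500_000  # about 1 MB, far larger than a typical pipe buffer
result = pipe_through(STAND_IN_CMD, payload)
```

For the real conversion, the command list would be replaced by the FFmpeg arguments from the shell command shown earlier in this answer.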
    

    Complete code sample:

    import ffmpeg
    import base64
    from io import BytesIO
    import threading
    
    # Equivalent shell command
    # cat input.wav | ffmpeg -y -f wav -i pipe: -acodec libopus -f ogg pipe: > test.ogg
    
    # Writer thread: writes the WAV data to the FFmpeg stdin pipe in small chunks of 1 KiB.
    def writer(ffmpeg_proc, wav_bytes_arr):
        chunk_size = 1024  # Chunk size in bytes (the exact size is not important).
        n_chunks = len(wav_bytes_arr) // chunk_size  # Number of full chunks (excluding the smaller remainder chunk at the end).
        remainder_size = len(wav_bytes_arr) % chunk_size  # Remainder bytes (zero when the total size is a multiple of chunk_size).
    
        for i in range(n_chunks):
            ffmpeg_proc.stdin.write(wav_bytes_arr[i*chunk_size:(i+1)*chunk_size])  # Write one chunk to the stdin pipe of the FFmpeg sub-process.
    
        if remainder_size > 0:
            ffmpeg_proc.stdin.write(wav_bytes_arr[chunk_size*n_chunks:])  # Write the remaining bytes to the stdin pipe of the FFmpeg sub-process.
    
        ffmpeg_proc.stdin.close()  # Close the stdin pipe: this signals end of input, so FFmpeg finishes encoding and exits.
    
    
    # The example reads the WAV data from a file; in the question it would be: wav_bytes_array = base64.b64decode(audioString)
    with open('input.wav', 'rb') as f:
        wav_bytes_array = f.read()
    
    # Encode as Base64 and decode it again - the encoded and decoded data are assumed to be bytes objects (not UTF-8 strings).
    dat = base64.b64encode(wav_bytes_array)  # Encode as Base64 (used for testing - not part of the solution).
    wav_bytes_array = base64.b64decode(dat)  # wav_bytes_array corresponds to "decode_string" (from the question).
         
    # Execute FFmpeg sub-process with stdin pipe as input and stdout pipe as output.
    ffmpeg_process = (
        ffmpeg
        .input('pipe:', format='wav')
        .output('pipe:', format='ogg', acodec='libopus')
        .run_async(pipe_stdin=True, pipe_stdout=True)
    )
    
    # Open in-memory file for storing the encoded OGG file
    out_stream = BytesIO()
    
    # Start a thread that writes the WAV data in small chunks.
    # We need the thread because writing too much data to the stdin pipe at once causes a deadlock.
    thread = threading.Thread(target=writer, args=(ffmpeg_process, wav_bytes_array))
    thread.start()
    
    # Read encoded OGG data from stdout pipe of FFmpeg, and write it to out_stream
    while thread.is_alive():
        ogg_chunk = ffmpeg_process.stdout.read(1024)  # Read chunk with arbitrary size from stdout pipe
        out_stream.write(ogg_chunk)  # Write the encoded chunk to the "in-memory file".
    
    # Read the last encoded chunk.
    ogg_chunk = ffmpeg_process.communicate()[0]
    out_stream.write(ogg_chunk)  # Write the encoded chunk to the "in-memory file".
    out_stream.seek(0)  # Seek to the beginning of out_stream
    ffmpeg_process.wait() # Wait for FFmpeg sub-process to end
    
    # Write out_stream to file - just for testing:
    with open('test.ogg', "wb") as f:
        f.write(out_stream.getbuffer())
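
To sanity-check the in-memory result without writing a file, the first bytes can be inspected: every Ogg page begins with the capture pattern "OggS". A minimal sketch; `looks_like_ogg` is a hypothetical helper, and the sample bytes below are fabricated placeholders rather than real encoder output.

```python
from io import BytesIO


def looks_like_ogg(data: bytes) -> bool:
    """Return True if the data starts with the Ogg page capture pattern."""
    return data[:4] == b"OggS"


# Placeholder streams standing in for out_stream from the sample above.
fake_ogg_stream = BytesIO(b"OggS" + b"\x00" * 23)
fake_wav_stream = BytesIO(b"RIFF" + b"\x00" * 40)

ogg_ok = looks_like_ogg(fake_ogg_stream.getvalue())
wav_ok = looks_like_ogg(fake_wav_stream.getvalue())
```

In the sample above, the same check would be `looks_like_ogg(out_stream.getvalue())`, and `out_stream.getvalue()` is also the bytes object to hand to the messaging system.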
    
  2. You can do this with TorchAudio in the following manner.

    A couple of caveats:

    1. OPUS support is available via either libsox (not available on Windows) or ffmpeg (available on Linux/macOS/Windows).
    2. On the latest stable release (v0.13), torchaudio.save can encode the OPUS format using libsox. However, the underlying libsox implementation is buggy, so using torchaudio.save for OPUS is not recommended.
    3. Instead, it is recommended to use StreamWriter from torchaudio.io, which is available as of v0.13. (You need to install ffmpeg>=4.1,<5.)
    4. OPUS only supports 48 kHz.
    5. OPUS only supports monaural audio. Specifying a num_channels value other than 1 does not raise an error, but it produces wrong audio data.

    import io
    import base64
    
    from torchaudio.io import StreamReader, StreamWriter
    
    
    # 0. Generate test data
    with open("foo.wav", "rb") as file:
        data = file.read()
    data = base64.b64encode(data)
    
    # 1. Decode base64
    data = base64.b64decode(data)
    
    # 2. Load with torchaudio
    reader = StreamReader(io.BytesIO(data))
    reader.add_basic_audio_stream(
        frames_per_chunk=-1,  # Decode all the data at once
        format="s16p",  # Use signed 16-bit integer
        sample_rate=48000,  # Resample to 48 kHz so the data matches the writer below
    )
    reader.process_all_packets()  # Decode all the data
    waveform, = reader.pop_chunks()  # Get the waveform
    
    # 3. Save to OPUS.
    writer = StreamWriter("output.opus")
    writer.add_audio_stream(
        sample_rate=48000,  # OPUS only supports 48000 Hz
        num_channels=1,  # OPUS only supports monaural
        format="s16",
        encoder_option={"strict": "experimental"},
    )
    with writer.open():
        writer.write_audio_chunk(0, waveform)
    