I want to do speech recognition in my Watch app, displaying a live transcription. Since SFSpeechRecognizer isn’t available on watchOS, I set the app up to stream audio to the iOS companion using WatchConnectivity. Before attempting this, I tried the same code on the iPhone alone, without involving the Watch – it works there.

With my streaming attempt, the companion receives audio chunks and doesn’t throw any errors, but it won’t transcribe any text either. I suspect I did something wrong when converting from AVAudioPCMBuffer to Data and back, but I can’t quite put my finger on it, as I lack experience working with raw data and pointers.

Now, the whole thing works as follows:

  1. User presses button, triggering Watch to ask iPhone to set up a recognitionTask
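The button action itself just sends a request message, something like this (simplified sketch – the exact message case differs):
// Watch side – simplified sketch of the button action
func recordButtonTapped() {
    // ask the iPhone to create the recognitionTask;
    // the .ok / .error reply comes back through WCManager
    // (.start is a placeholder for the actual request case)
    WCManager.shared.sendWatchMessage(.speechRecognitionRequest(.start))
}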
  2. iPhone sets up recognitionTask and answers with ok or some error:
guard let speechRecognizer = self.speechRecognizer else {
    WCManager.shared.sendWatchMessage(.speechRecognitionRequest(.error("no speech recognizer")))
    return
}
recognitionRequest = SFSpeechAudioBufferRecognitionRequest()
guard let recognitionRequest = recognitionRequest else {
    WCManager.shared.sendWatchMessage(.speechRecognitionRequest(.error("speech recognition request denied by ios")))
    return
}
recognitionRequest.shouldReportPartialResults = true
if #available(iOS 13, *) {
    recognitionRequest.requiresOnDeviceRecognition = true
}

recognitionTask = speechRecognizer.recognitionTask(with: recognitionRequest) { result, error in
    if let result = result {
        let t = result.bestTranscription.formattedString
        WCManager.shared.sendWatchMessage(.recognizedSpeech(t))
    }
    
    if error != nil {
        self.recognitionRequest = nil
        self.recognitionTask = nil
        WCManager.shared.sendWatchMessage(.speechRecognition(.error("?")))
    }
}
WCManager.shared.sendWatchMessage(.speechRecognitionRequest(.ok))
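
self.speechRecognizer is created and authorized beforehand, along these lines (slightly simplified):
import Speech

// roughly how the recognizer is set up beforehand (locale and error handling simplified)
private let speechRecognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))

func requestSpeechAuthorization() {
    SFSpeechRecognizer.requestAuthorization { status in
        // recognition tasks can only be created once status == .authorized
        print("speech authorization status: \(status)")
    }
}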
  3. Watch sets up an audio session, installs a tap on the audio engine’s input node and sends the audio format to the iPhone:
do {
    try startAudioSession()
} catch {
    self.state = .error("couldn't start audio session")
    return
}

let inputNode = audioEngine.inputNode
let recordingFormat = inputNode.outputFormat(forBus: 0)
inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat)
    { (buffer: AVAudioPCMBuffer, when: AVAudioTime) in
        // copy the raw samples out of the tap buffer so they can be sent over WatchConnectivity
        let audioBuffer = buffer.audioBufferList.pointee.mBuffers
        let data = Data(bytes: audioBuffer.mData!, count: Int(audioBuffer.mDataByteSize))
        if self.state == .running {
            WCManager.shared.sendWatchMessage(.speechRecognition(.chunk(data, frameCount: Int(buffer.frameLength))))
        }
    }
audioEngine.prepare()

do {
    let data = try NSKeyedArchiver.archivedData(withRootObject: recordingFormat, requiringSecureCoding: true)
    WCManager.shared.sendWatchMessage(.speechRecognition(.audioFormat(data)),
        errorHandler: { _ in
            self.state = .error("iphone unavailable")
    })
    self.state = .sentAudioFormat
} catch {
    self.state = .error("could not convert audio format")
}
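
startAudioSession() is essentially just the standard AVAudioSession setup (sketched, slightly simplified):
// sketch of startAudioSession() – category/mode may differ
func startAudioSession() throws {
    let session = AVAudioSession.sharedInstance()
    try session.setCategory(.record, mode: .spokenAudio, options: [])
    try session.setActive(true, options: .notifyOthersOnDeactivation)
}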
  4. iPhone saves the audio format and returns .ok or .error():
guard let format = try? NSKeyedUnarchiver.unarchivedObject(ofClass: AVAudioFormat.self, from: data) else {
    // ...send back .error, destroy the recognitionTask
}
self.audioFormat = format
// ...send back .ok
  5. Watch starts the audio engine:
try audioEngine.start()
  6. iPhone receives audio chunks and appends them to the recognitionRequest:
guard let pcm = AVAudioPCMBuffer(pcmFormat: audioFormat, frameCapacity: AVAudioFrameCount(frameCount)) else {
    // ...send back .error, destroy the recognitionTask
}

let channels = UnsafeBufferPointer(start: pcm.floatChannelData, count: Int(pcm.format.channelCount))
let data = chunk as NSData
data.getBytes(UnsafeMutableRawPointer(channels[0]), length: data.length)
recognitionRequest.append(pcm)

Any ideas are highly appreciated. Thanks for taking the time!

2 Answers


  1. Chosen as BEST ANSWER

    I forgot to update the AVAudioPCMBuffer.frameLength after copying the memory. It works flawlessly now, without any noticeable delay :)

    // ...
    data.getBytes(UnsafeMutableRawPointer(channels[0]), length: data.length)
    pcm.frameLength = AVAudioFrameCount(frameCount)
    // ...
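
    For context, the complete receive-side snippet from the question then becomes:

    guard let pcm = AVAudioPCMBuffer(pcmFormat: audioFormat, frameCapacity: AVAudioFrameCount(frameCount)) else {
        // ...send back .error, destroy the recognitionTask
        return
    }
    let channels = UnsafeBufferPointer(start: pcm.floatChannelData, count: Int(pcm.format.channelCount))
    let data = chunk as NSData
    data.getBytes(UnsafeMutableRawPointer(channels[0]), length: data.length)
    pcm.frameLength = AVAudioFrameCount(frameCount)   // the line that was missing
    recognitionRequest.append(pcm)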
    

  2. I would strongly suspect the problem is that you’re not even close to keeping up with real-time because of how slow the link is. You’re appending tiny (maybe as short as 20ms) samples of sound separated by long silences. That’s not going to be recognizable, even to human ears.

    I’d start by exploring CMSampleBuffers since you can set their timestamps. That will let the recognizer know when this buffer was recorded and remove the silence.
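
    Something along these lines might work (untested sketch – the helper is illustrative, and you’d have to send a timestamp, e.g. derived from the tap’s AVAudioTime, along with each chunk):

    import AVFoundation
    import CoreMedia

    // Untested sketch: wrap an AVAudioPCMBuffer in a CMSampleBuffer that carries
    // a presentation timestamp, so the recognizer knows when it was recorded.
    func makeSampleBuffer(from pcm: AVAudioPCMBuffer, presentationTime: CMTime) -> CMSampleBuffer? {
        let asbd = pcm.format.streamDescription

        var formatDescription: CMAudioFormatDescription?
        guard CMAudioFormatDescriptionCreate(allocator: kCFAllocatorDefault,
                                             asbd: asbd,
                                             layoutSize: 0,
                                             layout: nil,
                                             magicCookieSize: 0,
                                             magicCookie: nil,
                                             extensions: nil,
                                             formatDescriptionOut: &formatDescription) == noErr,
              let format = formatDescription else { return nil }

        var timing = CMSampleTimingInfo(
            duration: CMTime(value: 1, timescale: CMTimeScale(asbd.pointee.mSampleRate)),
            presentationTimeStamp: presentationTime,
            decodeTimeStamp: .invalid)

        var sampleBuffer: CMSampleBuffer?
        guard CMSampleBufferCreate(allocator: kCFAllocatorDefault,
                                   dataBuffer: nil,
                                   dataReady: false,
                                   makeDataReadyCallback: nil,
                                   refcon: nil,
                                   formatDescription: format,
                                   sampleCount: CMItemCount(pcm.frameLength),
                                   sampleTimingEntryCount: 1,
                                   sampleTimingArray: &timing,
                                   sampleSizeEntryCount: 0,
                                   sampleSizeArray: nil,
                                   sampleBufferOut: &sampleBuffer) == noErr,
              let sb = sampleBuffer else { return nil }

        // Attach the PCM samples to the sample buffer.
        guard CMSampleBufferSetDataBufferFromAudioBufferList(sb,
                                                             blockBufferAllocator: kCFAllocatorDefault,
                                                             blockBufferMemoryAllocator: kCFAllocatorDefault,
                                                             flags: 0,
                                                             bufferList: pcm.audioBufferList) == noErr else { return nil }
        return sb
    }

    On the phone you’d then call recognitionRequest.appendAudioSampleBuffer(_:) instead of append(_:).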

    If that doesn’t work, you’ll need to do buffering to accumulate enough AVAudioPCMBuffers to perform the analysis on. That’s going to be a lot more complicated, so hopefully CMSampleBuffers will work instead.
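
    If you do end up buffering, the accumulation itself doesn’t have to be huge – a rough sketch (the one-second target and the names are arbitrary):

    import AVFoundation
    import Speech

    // Rough sketch: collect incoming chunks into ~1 s buffers before appending.
    final class ChunkAccumulator {
        private var pending: AVAudioPCMBuffer?

        func append(_ chunk: AVAudioPCMBuffer,
                    to request: SFSpeechAudioBufferRecognitionRequest) {
            let format = chunk.format
            let targetFrames = AVAudioFrameCount(format.sampleRate)   // ~1 second

            if pending == nil {
                pending = AVAudioPCMBuffer(pcmFormat: format, frameCapacity: targetFrames)
                pending?.frameLength = 0
            }
            guard let acc = pending,
                  let src = chunk.floatChannelData,
                  let dst = acc.floatChannelData else { return }

            // Copy the new frames behind the ones already accumulated.
            let frames = min(chunk.frameLength, acc.frameCapacity - acc.frameLength)
            for channel in 0..<Int(format.channelCount) {
                memcpy(dst[channel] + Int(acc.frameLength),
                       src[channel],
                       Int(frames) * MemoryLayout<Float>.size)
            }
            acc.frameLength += frames

            // Hand a full buffer to the recognizer and start a fresh one.
            if acc.frameLength == acc.frameCapacity {
                request.append(acc)
                pending = nil
            }
        }
    }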

    In either case you might also consider transferring the data in a compressed format. I’m not sure what formats watchOS supports, but you could dramatically reduce your bandwidth requirements between the watch and phone. Just be careful not to overwhelm the watch’s CPU. You want easy-to-compute compression, not the tightest compression you can get.

    Also, I don’t see what sampling frequency you’re configuring here. Make sure it’s low. Probably 8kHz. There is absolutely no reason to record CD-quality sounds just to do speech transcription. It’s actually worse because it includes so many frequencies that aren’t in the human voice range.
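
    One way to get there is to run each tap buffer through an AVAudioConverter into an 8 kHz mono format on the watch before sending it – a sketch (names and the exact format are illustrative):

    import AVFoundation

    // Sketch: convert the mic format to 8 kHz mono Float32 before sending.
    let targetFormat = AVAudioFormat(commonFormat: .pcmFormatFloat32,
                                     sampleRate: 8_000,
                                     channels: 1,
                                     interleaved: false)!
    let converter = AVAudioConverter(from: recordingFormat, to: targetFormat)!

    func downsample(_ buffer: AVAudioPCMBuffer) -> AVAudioPCMBuffer? {
        let ratio = targetFormat.sampleRate / buffer.format.sampleRate
        let capacity = AVAudioFrameCount(Double(buffer.frameLength) * ratio) + 1
        guard let out = AVAudioPCMBuffer(pcmFormat: targetFormat, frameCapacity: capacity) else {
            return nil
        }

        var delivered = false
        var error: NSError?
        let status = converter.convert(to: out, error: &error) { _, inputStatus in
            if delivered {
                inputStatus.pointee = .noDataNow   // only feed this one buffer
                return nil
            }
            delivered = true
            inputStatus.pointee = .haveData
            return buffer
        }
        return (status == .haveData && error == nil) ? out : nil
    }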
