
I am trying to develop an Android app that takes an audio clip and classifies it using the YAMNet model:

https://tfhub.dev/google/lite-model/yamnet/classification/tflite/1

During my research, I found the following solution:

Add these dependencies:

// to run the yamnet.tflite model
implementation 'org.tensorflow:tensorflow-lite-task-audio:0.2.0'
// Guava's LittleEndianDataInputStream is used to read the converted audio samples
implementation 'com.google.guava:guava:31.0.1-android'
// FFmpeg is used to convert the input file to 16 kHz mono
implementation 'com.arthenica:ffmpeg-kit-full:4.5'

Run this code to prepare the input .wav file and feed it to the model:

import com.arthenica.ffmpegkit.FFmpegKit
import com.google.common.io.LittleEndianDataInputStream
import org.tensorflow.lite.task.audio.classifier.AudioClassifier
import java.io.EOFException
import java.io.File
import java.io.FileInputStream

val srcFile = File("src_file_path")

// load and prepare the model (MODEL_FILE is the yamnet.tflite asset/file name)
val classifier = AudioClassifier.createFromFile(this, MODEL_FILE)
val audioTensor = classifier.createInputTensorAudio()

// temp file for the converted audio
val tempFile = File.createTempFile(System.currentTimeMillis().toString(), ".wav")
if (!tempFile.exists()) {
    tempFile.createNewFile()
}

// convert the input file to what the model expects: 16 kHz, mono
FFmpegKit.execute("-i $srcFile -ar 16000 -ac 1 -y ${tempFile.absolutePath}")

// read the converted file as 16-bit little-endian samples
val musicList = ArrayList<Short>()
val dis = LittleEndianDataInputStream(FileInputStream(tempFile))
while (true) {
    try {
        val d = dis.readShort()
        musicList.add(d)
    } catch (e: EOFException) {
        break
    }
}

// The input must be normalized to floats between -1 and 1.
// To normalize, divide every value by 2**15, i.e. MAX_ABS_INT16 = 32768.
val floatsForInference = FloatArray(musicList.size)
for ((index, value) in musicList.withIndex()) {
    floatsForInference[index] = value / 32768F
}

audioTensor.load(floatsForInference)
val output = classifier.classify(audioTensor)

I tried this solution, but every time the output I get is the category "Silence" with about 80% confidence, which means it isn't actually recognizing the given input audio.
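
For reference, classify() returns a List<Classifications>, and the top category can be read out roughly like this (a sketch, not part of the original solution):

// Pick the single highest-scoring category across the returned results.
val topCategory = output
    .flatMap { it.categories }
    .maxByOrNull { it.score }
Log.d("YAMNet", "Top: ${topCategory?.label} (${topCategory?.score})")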

For example, if I use this audio clip as input, the expected output category is cough, not silence:
https://storage.googleapis.com/audioset/yamalyzer/audio/cough.wav

How can I fix the issue with the code?

2 Answers

  1. I won’t have time to test the full code, but what I think is happening is that YAMNet has a very specific configuration for the input audio (e.g. sample rate, bit depth, number of channels). When that conversion is not done correctly, the model tends to give essentially random results.

    I’d suggest you follow this tutorial: https://www.tensorflow.org/lite/examples/audio_classification/overview

    It uses the TFLite Task Library, which does all the transformations correctly, so you will not need to convert the audio yourself.
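
    As a rough, untested sketch of that approach (recording from the microphone, as in the tutorial; the "yamnet.tflite" asset name is an assumption):

    // The Task Library creates an AudioRecord that already matches the
    // model's required sample rate and channel count, so no manual
    // conversion is needed (requires the RECORD_AUDIO permission).
    val classifier = AudioClassifier.createFromFile(context, "yamnet.tflite")
    val tensor = classifier.createInputTensorAudio()

    val record = classifier.createAudioRecord()
    record.startRecording()

    // After recording for a moment, load the captured samples and classify.
    tensor.load(record)
    val results = classifier.classify(tensor)
    for (category in results[0].categories) {
        Log.d("YAMNet", "${category.label}: ${category.score}")
    }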

    Another cool resource for testing the model is this one: https://www.tensorflow.org/hub/tutorials/yamnet

    I know it’s Python (not Kotlin), but it’s simple, swapping in your own audio file is easy, and it can give you some insight into what the model is seeing.

    I hope this helps

  2. This is happening because YAMNet looks at your audio in roughly one-second windows with about half a second of overlap. If you have, for example, a 5-second audio clip, you get nine 1-second samples. If your sound is, say, 1 second long, it will only appear in 2 or 3 of those samples. The probability is calculated for each sample individually, so 6 or 7 of the samples will rightly be identified as silence and only 2 or 3 as your sound. The algorithm then concludes that silence is the most prevalent sound in the clip, and it’s kind of right!

    I don’t know how to fix your code specifically, but I can show you what to do in this TensorFlow example code. In the following, I have changed np.mean to np.max. This means that instead of classifying based on the average over your 1-second samples, it looks at the highest-rated window for each sound in your file. You will still get a high score for silence, but it will no longer drown out the score for your sound.

    # Predict YAMNet classes.
    scores, embeddings, spectrogram = yamnet(waveform)
    # Scores is a matrix of (time_frames, num_classes) classifier scores.
    # Take the per-class maximum over time (instead of the mean) so a short
    # sound is not averaged away by the surrounding silence.
    prediction = np.max(scores, axis=0) ### use np.max not np.mean ###
    

    Alternatively, if you want to save on processing, trim out the silent portions beforehand and only feed audio that actually contains sound into your classifier.
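
    In the Kotlin code from the question, the same idea can be sketched by classifying the clip window by window and keeping the highest score per label. This is untested: the 15600-sample window is YAMNet's 0.975 s input at 16 kHz, the 50% hop mirrors the overlap described above, and classifier / floatsForInference are the variables from the question:

    // Sketch: classify each ~1 s window separately and keep, per label,
    // the best score seen in any window (the Kotlin analogue of np.max).
    val windowSize = 15600          // YAMNet input: 0.975 s at 16 kHz (assumption)
    val hop = windowSize / 2        // ~50% overlap between windows
    val bestScore = mutableMapOf<String, Float>()

    var start = 0
    while (start + windowSize <= floatsForInference.size) {
        val window = floatsForInference.copyOfRange(start, start + windowSize)
        val tensor = classifier.createInputTensorAudio()
        tensor.load(window)
        for (classifications in classifier.classify(tensor)) {
            for (category in classifications.categories) {
                val previous = bestScore[category.label] ?: 0f
                if (category.score > previous) bestScore[category.label] = category.score
            }
        }
        start += hop
    }

    // The best-scoring label across all windows; "Silence" may still score
    // high, but it no longer drowns out a short event such as a cough.
    val top = bestScore.maxByOrNull { it.value }
    Log.d("YAMNet", "Top label: ${top?.key} (${top?.value})")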
