Keyword Spotting with TensorFlow Lite
Introduction
This project demonstrates how to use TensorFlow Lite for keyword spotting on the ReSpeaker 2-Mics Pi HAT v2. Keyword spotting allows for real-time detection of predefined words from audio input, enabling applications such as voice-controlled devices and interactive systems. We will guide you through the steps to train a TensorFlow Lite model, deploy it on the ReSpeaker HAT, and run speech recognition locally.
Hardware and Software Requirements
- Hardware: Raspberry Pi with ReSpeaker 2-Mics Pi HAT v2
- Software: TensorFlow Lite, Google Colab, Python, and supporting libraries
Applications
Keyword spotting can be applied in:
- Smart home devices
- Voice-controlled robots
- Interactive kiosks
What is TensorFlow Lite?
TensorFlow Lite is a lightweight version of TensorFlow designed for mobile and embedded devices. It enables machine learning inference with low latency and small binary sizes, making it ideal for running models on edge devices like Raspberry Pi.
Train and Get TensorFlow Lite Model
Dataset
We will use a subset of the Speech Commands dataset for training. The dataset contains WAV audio files of people saying different words, collected by Google and released under a CC BY license. The dataset can be downloaded from here. For more information on datasets, refer to this guide.
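For reference, the subset used by the notebook (the mini Speech Commands dataset described in the steps below) is distributed as a single zip archive. It is typically fetched with `tf.keras.utils.get_file`, roughly as in the sketch below, which follows TensorFlow's simple audio recognition tutorial; the exact cell in the notebook may differ:

```
import tensorflow as tf

# Download and extract the mini Speech Commands dataset
# (8 keywords, roughly 1,000 one-second clips per keyword).
data_dir = tf.keras.utils.get_file(
    'mini_speech_commands.zip',
    origin='http://storage.googleapis.com/download.tensorflow.org/data/mini_speech_commands.zip',
    extract=True,
    cache_dir='.',
    cache_subdir='data')
```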
Why Use Google Colab?
Google Colab is a cloud-based platform for running Jupyter notebooks. It provides free access to GPU resources, making it an excellent choice for training machine learning models without requiring local computation power.
Steps
Now we will use a Google Colab notebook to perform the data training and generate a TensorFlow Lite model in `.tflite` format.
Step 1. Open this Python Notebook
By default, it will load the mini Speech Commands dataset which is a smaller version of the Speech Commands dataset. The original dataset consists of over 105,000 audio files in the WAV (Waveform) audio file format of people saying 35 different words. This data was collected by Google and released under a CC BY license.
Step 2. Connect to a new runtime by selecting Change runtime type -> CPU -> Save, then click Connect.
Step 3. Navigate to Runtime > Run all to run all the code cells. This process will take about 10 minutes to complete.

Step 4. Once all the code cells are executed, append a new cell and run the following code to generate the `.tflite` model file.

```
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

with open('model.tflite', 'wb') as f:
    f.write(tflite_model)
```

Step 5. Right-click the generated `model.tflite` file and select Download to save the file to your computer.
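Optionally, the converter can also apply post-training quantization, which usually shrinks the `.tflite` file and can speed up inference on the Pi's CPU. This is not required for the rest of this guide; the sketch below only illustrates the idea, and the `model_quant.tflite` filename is just an example:

```
# Optional: enable post-training dynamic-range quantization before converting.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant_model = converter.convert()

with open('model_quant.tflite', 'wb') as f:
    f.write(tflite_quant_model)
```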
Local Inference
Running the Inference Script
The script `inference.py` performs the following steps:
- Loads the trained TensorFlow Lite model.
- Processes input audio into a spectrogram suitable for inference.
- Runs the inference and outputs the detected keyword along with confidence scores for each label.
Steps to Run
Step 1. Upload the `model.tflite` model file to your Pi. In this example, we put it in `~/speech_recognition/model.tflite`.

Step 2. Save the following script as `~/speech_recognition/inference.py`:
```
import numpy as np
from scipy import signal
from tflite_runtime.interpreter import Interpreter
import soundfile as sf

MODEL_PATH = 'model.tflite'
LABELS = ['no', 'yes', 'down', 'go', 'left', 'up', 'right', 'stop']


def get_spectrogram(waveform, expected_time_steps=124, expected_freq_bins=129):
    # Short-time Fourier transform with the same parameters used during training
    _, _, Zxx = signal.stft(
        waveform,
        fs=16000,
        nperseg=255,
        noverlap=124,
        nfft=256
    )
    spectrogram = np.abs(Zxx)
    # Pad to the fixed input size the model was trained on
    if spectrogram.shape[0] != expected_freq_bins:
        spectrogram = np.pad(spectrogram, ((
            0, expected_freq_bins - spectrogram.shape[0]), (0, 0)), mode='constant')
    if spectrogram.shape[1] != expected_time_steps:
        spectrogram = np.pad(spectrogram, ((
            0, 0), (0, expected_time_steps - spectrogram.shape[1])), mode='constant')
    if spectrogram.shape != (expected_freq_bins, expected_time_steps):
        raise ValueError(
            f"Invalid spectrogram shape. Got {spectrogram.shape}, expected ({expected_freq_bins}, {expected_time_steps})."
        )
    # The model expects (time, frequency), so transpose the (frequency, time) STFT output
    spectrogram = np.transpose(spectrogram)
    return spectrogram


def preprocess_audio(file_path):
    waveform, sample_rate = sf.read(file_path)
    if sample_rate != 16000:
        raise ValueError("Expected sample rate is 16 kHz")
    # Keep only the first channel if the recording is not mono
    if len(waveform.shape) > 1:
        waveform = waveform[:, 0]
    spectrogram = get_spectrogram(waveform)
    # Add channel and batch dimensions: (1, time, frequency, 1)
    spectrogram = spectrogram[..., np.newaxis]
    spectrogram = spectrogram[np.newaxis, ...]
    return spectrogram


def run_inference(file_path):
    spectrogram = preprocess_audio(file_path)
    # Load the TensorFlow Lite model and allocate its tensors
    interpreter = Interpreter(MODEL_PATH)
    interpreter.allocate_tensors()
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()
    input_shape = input_details[0]['shape']
    if spectrogram.shape != tuple(input_shape):
        raise ValueError(
            f"Expected input shape {input_shape}, got {spectrogram.shape}"
        )
    # Run inference on the spectrogram
    interpreter.set_tensor(
        input_details[0]['index'], spectrogram.astype(np.float32))
    interpreter.invoke()
    output_data = interpreter.get_tensor(output_details[0]['index'])[0]
    prediction = np.argmax(output_data)
    # Convert the raw model outputs (logits) to probabilities with a softmax
    confidence = np.exp(output_data) / np.sum(np.exp(output_data))
    print(f"command: {LABELS[prediction].upper()}")
    for label, conf in zip(LABELS, confidence):
        print(f"{label}: {conf:.2%}")


if __name__ == "__main__":
    audio_file_path = 'test_audio.wav'
    run_inference(audio_file_path)
```
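The script relies on `numpy`, `scipy`, `soundfile`, and the `tflite_runtime` package. If any of these are missing on your Pi, they can usually be installed with pip, for example `pip3 install numpy scipy soundfile tflite-runtime`.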
Step 3. Record a sound using the following command. The available keywords are `no`, `yes`, `down`, `go`, `left`, `up`, `right`, and `stop`.

```
$ arecord -D "plughw:2,0" -f S16_LE -r 16000 -d 1 -t wav ~/speech_recognition/test_audio.wav
```
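The script expects a mono, 16 kHz, roughly one-second recording, so a shape or sample-rate error usually means the recording does not match. If needed, the clip can be checked with a short `soundfile` snippet like the sketch below (run from `~/speech_recognition`):

```
import soundfile as sf

# Print basic properties of the recorded clip to verify it matches
# what inference.py expects (16 kHz, mono, about 1 second).
waveform, sample_rate = sf.read('test_audio.wav')
print(f"sample rate: {sample_rate} Hz")
print(f"duration:    {len(waveform) / sample_rate:.2f} s")
print(f"channels:    {1 if waveform.ndim == 1 else waveform.shape[1]}")
```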
Step 4. Execute the script from the `~/speech_recognition` directory:

```
$ python3 inference.py
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
command: YES
no: 8.74%
yes: 21.10%
down: 5.85%
go: 14.57%
left: 11.02%
up: 8.25%
right: 10.53%
stop: 19.94%
```
Interpreting the Results
The script outputs the detected command (e.g., YES) and the confidence scores for all labels. This provides insights into the model’s predictions and allows you to evaluate its performance.
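The same building blocks can be combined into a simple listen-and-classify loop for hands-free testing. The sketch below is only illustrative: it assumes it is run from `~/speech_recognition` next to `inference.py`, and that the ReSpeaker HAT is card 2 as in the `arecord` command above; it alternates between recording one second of audio and classifying it:

```
import subprocess
from inference import run_inference

WAV_PATH = 'test_audio.wav'

while True:
    # Record 1 second of 16 kHz mono audio from the ReSpeaker 2-Mics Pi HAT (card 2).
    subprocess.run(
        ['arecord', '-D', 'plughw:2,0', '-f', 'S16_LE',
         '-r', '16000', '-d', '1', '-t', 'wav', WAV_PATH],
        check=True)
    # Classify the new recording and print the detected keyword.
    run_inference(WAV_PATH)
```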
Tech Support & Product Discussion
Thank you for choosing our products! We are here to provide you with a range of support options to ensure that your experience with our products is as smooth as possible. We offer several communication channels to cater to different preferences and needs.