
TensorFlow Lite on ReSpeaker XVF3800

Introduction

In this tutorial, we guide you through creating a custom voice recognition system using TensorFlow Lite Micro (TFLM) on the Seeed XIAO ESP32 with the XVF3800 ReSpeaker. You will learn how to collect and label audio data, preprocess it for training, and split it into training and validation sets. Next, we train a custom keyword spotting model tailored to your dataset, convert it into TFLite format, and finally deploy it as a hex file onto the ESP32 for real-time voice command recognition. By the end, you’ll have a fully functional microcontroller-based system capable of accurately classifying spoken commands.


Dependencies

To follow this tutorial, you need to install the following Arduino libraries:

Make sure to install these libraries in your Arduino IDE. Each GitHub repository contains a guide on how to install and configure the libraries properly.

Collect the Data

We will record short voice samples (10 seconds each) and split them into 1-second clips. To use the XVF3800 ReSpeaker, you may need to install the USB firmware first.

Firmware Guide: Seeed Studio XVF3800 Firmware Flash


Step 1: Find the Device ID

Use the following Python script to list all audio devices connected to your PC and find the correct device index for the ReSpeaker:

import sounddevice as sd

# List all available audio devices
devices = sd.query_devices()

# Print each device with its index and input channel count
for i, device in enumerate(devices):
    print(f"Device {i}: {device['name']} (input channels: {device['max_input_channels']})")

Note: Update DEVICE_INDEX in the next script according to the printed device number for the ReSpeaker.
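
If you prefer not to scan the list manually, a small helper like the one below can look the index up for you. This is only a convenience sketch; the "XVF3800" name hint is an assumption, so adjust it to match whatever name the device shows on your system.

import sounddevice as sd

def find_respeaker_index(name_hint="XVF3800"):
    # Return the index of the first input device whose name contains name_hint
    for i, device in enumerate(sd.query_devices()):
        if device['max_input_channels'] > 0 and name_hint.lower() in device['name'].lower():
            return i
    raise RuntimeError(f"No input device matching '{name_hint}' found")

print("ReSpeaker device index:", find_respeaker_index())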

Step 2: Collect Audio Samples

This Python script collects audio samples organized by person name and label. A folder is created for each person, and the WAV files are saved inside it, named after the corresponding label.

import os
import sounddevice as sd
from scipy.io.wavfile import write

# === Settings ===
SAMPLERATE = 16000
CHANNELS = 1       # ReSpeaker 4-Mic Array
DURATION = 10      # seconds
DEVICE_INDEX = 2   # Set to your ReSpeaker device index


def record_audio(filename, samplerate=SAMPLERATE, channels=CHANNELS, duration=DURATION, device=DEVICE_INDEX):
    print(f"Recording '{filename}' for {duration} seconds...")
    recording = sd.rec(int(duration * samplerate),
                       samplerate=samplerate,
                       channels=channels,
                       dtype='int16',
                       device=device)
    sd.wait()
    write(filename, samplerate, recording)
    print(f"Saved: {filename}")


def get_next_filename(directory, label):
    existing = [f for f in os.listdir(directory) if f.startswith(label) and f.endswith('.wav')]
    index = len(existing) + 1
    return os.path.join(directory, f"{label}.{index}.wav")


def collect_samples():
    while True:
        sample_name = input("Enter sample name (e.g., PersonA): ").strip()
        if not sample_name:
            print("Sample name cannot be empty.")
            continue

        sample_dir = os.path.join(os.getcwd(), sample_name)
        os.makedirs(sample_dir, exist_ok=True)
        print(f"Directory created: {sample_dir}")

        while True:
            label = input("Enter sound/voice to record (e.g., yes, no): ").strip()
            if not label:
                print("Label cannot be empty.")
                continue

            while True:
                filename = get_next_filename(sample_dir, label)
                record_audio(filename)

                cont = input("Record another sample for this label? (yes/no): ").strip().lower()
                if cont != 'yes':
                    break

            next_label = input("Do you want to record a different label? (yes/no): ").strip().lower()
            if next_label != 'yes':
                break

        next_sample = input("Do you want to create a new sample? (yes/no): ").strip().lower()
        if next_sample != 'yes':
            print("Audio collection completed.")
            break


if __name__ == "__main__":
    collect_samples()

How it works:

  • Creates a folder for each person.
  • Prompts for labels (e.g., "yes", "no") and saves corresponding audio files.
  • Records 10-second audio clips which can later be split into 1-second segments for training.

Data Preprocessing

After collecting your 10-second audio samples, the next step is to split them into 1-second clips for training. I used Edge Impulse to visualize and split the recordings easily.

Audio File Format

All audio files must meet the following requirements:

  • Format: WAV (.wav)
  • Sample Rate: 16 kHz
  • Channels: Mono (1 channel)
  • Bit Depth: 16-bit PCM
  • Duration: 1 second (1000 ms)

Note: Edge Impulse can help automatically split longer recordings into these 1-second segments.
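
If you prefer to split the recordings locally instead of using Edge Impulse, a short script along the lines of the sketch below will cut each 10-second WAV into consecutive 1-second clips. The raw/hi_speaker and split/hi_speaker folder names are placeholders for this example; point them at your own recordings.

import os
import numpy as np
from scipy.io import wavfile

SAMPLERATE = 16000            # expected sample rate of the recordings
CLIP_SAMPLES = SAMPLERATE     # 1 second of audio per clip

def split_wav(in_path, out_dir):
    # Cut a long mono 16-bit WAV into consecutive 1-second clips
    rate, data = wavfile.read(in_path)
    assert rate == SAMPLERATE, f"{in_path}: expected {SAMPLERATE} Hz, got {rate}"
    if data.ndim > 1:          # keep only the first channel if the file is not mono
        data = data[:, 0]
    os.makedirs(out_dir, exist_ok=True)
    base = os.path.splitext(os.path.basename(in_path))[0]
    for i in range(len(data) // CLIP_SAMPLES):
        clip = data[i * CLIP_SAMPLES:(i + 1) * CLIP_SAMPLES]
        wavfile.write(os.path.join(out_dir, f"{base}_{i}.wav"), SAMPLERATE, clip.astype(np.int16))

# Example: split every recording in raw/hi_speaker/ into split/hi_speaker/
for fname in os.listdir("raw/hi_speaker"):
    if fname.endswith(".wav"):
        split_wav(os.path.join("raw/hi_speaker", fname), "split/hi_speaker")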

Target Labels

  • Each folder name is treated as a class label.

  • Examples:

    • hi_speaker → Model recognizes “hi speaker”
    • seeed → Model recognizes “seeed”
  • You can add more classes as needed, but folder names must match the WANTED_WORDS list used during training.
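
In the training notebook, these classes are typically passed as a comma-separated string. Below is a minimal sketch of what this looks like for the two example keywords above; the variable name follows the micro_speech notebook convention.

# Keywords to spot; each entry must match a folder name in your dataset
WANTED_WORDS = "hi_speaker,seeed"

# Folders not listed here are folded into the "unknown" class, and the
# _background_noise_ recordings are used for the "silence" class.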

Unknown / Other

  • The other/ folder should contain random words not in your target list. This helps the model classify unknown words correctly.

Silence / Noise

  • The _background_noise_/ folder should include ambient sounds such as:

    • Office noise
    • Street noise
    • Keyboard typing
    • Silence recordings (mic on but no speaking)

Proper preprocessing ensures the model learns to distinguish between target commands, unknown words, and background noise.

dataset_dir/
├── hi_speaker/             # All audio samples for the "hi_speaker" keyword
│   ├── audio_0.wav
│   ├── audio_1.wav
│   └── ...
├── seeed/                  # All audio samples for the "seeed" keyword
│   ├── audio_2.wav
│   ├── audio_3.wav
│   └── ...
├── other/                  # Random speech or non-target words
│   ├── audio_4.wav
│   ├── audio_5.wav
│   └── ...
└── _background_noise_/     # Background noise samples
    ├── noise_0.wav
    ├── noise_1.wav
    └── ...
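
Before training, it is worth verifying that every clip actually matches the format above. A quick check along these lines (assuming the dataset_dir/ layout shown here) will flag any file with the wrong sample rate, channel count, bit depth, or length:

import os
import wave

EXPECTED_RATE = 16000
EXPECTED_CHANNELS = 1
EXPECTED_SAMPWIDTH = 2        # 16-bit PCM = 2 bytes per sample

for root, _, files in os.walk("dataset_dir"):
    for fname in files:
        if not fname.endswith(".wav"):
            continue
        path = os.path.join(root, fname)
        with wave.open(path, "rb") as wav:
            rate = wav.getframerate()
            channels = wav.getnchannels()
            sampwidth = wav.getsampwidth()
            duration = wav.getnframes() / rate
        if (rate, channels, sampwidth) != (EXPECTED_RATE, EXPECTED_CHANNELS, EXPECTED_SAMPWIDTH):
            print(f"Format mismatch: {path} ({rate} Hz, {channels} ch, {sampwidth * 8}-bit)")
        elif abs(duration - 1.0) > 0.05:
            print(f"Unexpected duration: {path} ({duration:.2f} s)")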

Data Training

To train your custom voice recognition model, we recommend an x86 PC running Ubuntu. You will also need the xxd tool, which can be installed with:

sudo apt-get install xxd

Step 1: Install Anaconda

  • Download and install Anaconda Navigator
  • Create a new environment in Anaconda for this project.

Step 2: Set Up the Environment

Install the required packages in the environment:

  • Deep Learning Framework: TensorFlow 1.15
  • Programming Language: Python 3.7

This setup ensures compatibility with TensorFlow Lite Micro for microcontroller deployment.

Step 3: Run the Training Notebook

  • Download the Jupyter notebook: train_micro_speech_model.ipynb
  • Open the notebook in Jupyter and follow the instructions.
  • Once completed, the notebook converts the trained model to TensorFlow Lite format and produces a C source file named model.cc, which contains the model data as a hex array ready for deployment to the ESP32.

The model.cc file can then be included in your Arduino project to run real-time keyword spotting on the XIAO ESP32 with the XVF3800 ReSpeaker.
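
The hex dump step in the notebook is essentially what xxd -i does. For reference, here is a rough Python equivalent, assuming the quantized model was saved as model.tflite and that your Arduino sketch expects an array named g_model (rename it to match whatever your notebook actually generated):

# Convert model.tflite into a C array, similar to `xxd -i model.tflite > model.cc`
with open("model.tflite", "rb") as f:
    data = f.read()

rows = [", ".join(f"0x{b:02x}" for b in data[i:i + 12]) for i in range(0, len(data), 12)]

with open("model.cc", "w") as f:
    # alignas(8) keeps the model buffer aligned, as TFLM expects
    f.write("alignas(8) const unsigned char g_model[] = {\n  ")
    f.write(",\n  ".join(rows))
    f.write("\n};\nconst unsigned int g_model_len = %d;\n" % len(data))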

Inference on XIAO ESP32 with XVF3800

Once your model.cc file is ready, you can deploy it on the XIAO ESP32 for real-time voice command recognition. Because the XVF3800 outputs 32-bit audio samples, we need to convert them to 16-bit for TensorFlow Lite Micro. We also configure the I2S pins, sample rate, and channels to match the model requirements.

Arduino Code Example

#include "AudioTools.h"
#include "AudioTools/AudioLibs/TfLiteAudioStream.h"
#include "model.h"  // Replace with your generated model.cc

I2SStream i2s;
TfLiteAudioStream tfl;
StreamCopy copier(tfl, i2s);

const char* kCategoryLabels[] = {
    "silence",
    "unknown",
    "hi_respeaker",  // change to the keyword that you trained
    "seeed"          // change to the keyword that you trained
};

void respondToCommand(const char* found_command, uint8_t score, bool is_new_command) {
  if (is_new_command) {
    Serial.printf("Detected: %s (score: %d)\n", found_command, score);
  }
}

// Temp buffers for 32-bit I2S samples and their 16-bit conversion
int32_t i2s_buffer[512];
int16_t conv_buffer[512];

void setup() {
  Serial.begin(115200);
  AudioLogger::instance().begin(Serial, AudioLogger::Warning);

  // XVF3800 I2S input configuration
  auto cfg = i2s.defaultConfig(RX_MODE);
  cfg.sample_rate = 16000;
  cfg.channels = 1;          // Mono
  cfg.bits_per_sample = 32;  // XVF3800 streams 32-bit samples
  cfg.pin_bck = 8;
  cfg.pin_ws = 7;
  cfg.pin_data = 44;
  cfg.pin_data_rx = 43;
  cfg.is_master = true;
  i2s.begin(cfg);

  // TensorFlow Lite configuration
  auto tcfg = tfl.defaultConfig();
  tcfg.setCategories(kCategoryLabels);
  tcfg.sample_rate = 16000;
  tcfg.channels = 1;
  tcfg.kTensorArenaSize = 15 * 1024;
  tcfg.respondToCommand = respondToCommand;
  tcfg.model = g_model;  // Replace with your model.cc
  tfl.begin(tcfg);
}

void loop() {
  // Read 32-bit audio from XVF3800
  size_t n = i2s.readBytes((uint8_t*)i2s_buffer, sizeof(i2s_buffer));

  if (n > 0) {
    size_t samples = n / sizeof(int32_t);

    // Convert 32-bit -> 16-bit
    for (size_t i = 0; i < samples; i++) {
      conv_buffer[i] = (int16_t)(i2s_buffer[i] >> 16);
    }

    // Feed converted data into TensorFlow
    tfl.write((uint8_t*)conv_buffer, samples * sizeof(int16_t));
  }
}

Key Notes

  • Make sure g_model matches the array name defined in your generated model.cc file.

  • XVF3800 outputs 32-bit stereo by default; we convert to 16-bit mono to match the model.
  • TensorFlow Lite Micro reads the audio data continuously and triggers respondToCommand() whenever a recognized command is detected.

With this setup, your XIAO ESP32 can now recognize custom voice commands in real time using the XVF3800 microphone array.

Tech Support & Product Discussion

Thank you for choosing our products! We are here to provide you with various kinds of support to ensure that your experience with our products is as smooth as possible. We offer several communication channels to cater to different preferences and needs.
