Deploy Whisper on NVIDIA Jetson Orin for Real-time Speech-to-Text

Introduction

Real-time speech-to-text (STT) systems play a vital role in modern applications, from voice assistants to transcription services. Popular STT systems include Whisper, Riva, DeepSpeech, the Google Cloud Speech-to-Text API, Microsoft Azure Speech Service, IBM Watson Speech to Text, Kaldi, and Wit.ai. The NVIDIA Jetson Orin, known for its high performance and energy efficiency, offers a promising platform for deploying such demanding applications at the edge.

Whisper, an advanced STT system leveraging deep learning, excels in accuracy and efficiency. Riva is a comprehensive, multimodal conversational AI framework developed by NVIDIA. By deploying Whisper or Riva on the Jetson Orin, developers can harness its powerful GPU and CPU cores, along with hardware acceleration technologies such as Tensor Cores, to achieve real-time STT with low latency.

In this wiki, we introduce real-time Whisper on Jetson. Running the model on the device enables speech processing without constant network connectivity and enhances privacy and security. We also compare the inference speed of Whisper and Riva deployed on the same Jetson Orin device. Ultimately, deploying Whisper on the Jetson Orin empowers developers to build robust, efficient STT applications that deliver high accuracy and low latency in domains ranging from smart homes to industrial automation.

Hardware Setup

Hardware components

reComputer (or other Jetson-based devices)
reSpeaker (or other USB microphones)

Hardware connection

(Image: hardware connection between the reComputer and the reSpeaker)

Prepare the runtime environment

Step 1. Install dependencies:

git clone https://github.com/LJ-Hao/Deploy-Whisper-on-NVIDIA-Jetson-Orin-for-Real-time-Speech-to-Text.git
cd Deploy-Whisper-on-NVIDIA-Jetson-Orin-for-Real-time-Speech-to-Text
pip install -r requirements.txt
sudo apt update && sudo apt install ffmpeg
arecord -D hw:2,0 --dump-hw-params # check that the microphone supports a 16000 Hz sample rate
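
If you prefer to check the device from Python, the minimal sketch below lists input devices and verifies 16 kHz mono capture. It assumes the sounddevice package is available (install it with pip if it is not already part of requirements.txt).

# Minimal sketch: list audio input devices and confirm 16 kHz mono capture works.
# Assumes the sounddevice package is installed (pip install sounddevice).
import sounddevice as sd

for idx, dev in enumerate(sd.query_devices()):
    if dev["max_input_channels"] > 0:
        print(f"{idx}: {dev['name']} (default rate: {dev['default_samplerate']} Hz)")

# Raises an exception if the default input device cannot record mono audio at 16 kHz.
sd.check_input_settings(channels=1, samplerate=16000)
print("16 kHz mono capture is supported.")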

Step 2. Test environment:

python test.py

If you see the following information printed in the terminal, it means you have successfully installed the necessary libraries.

(Screenshot: test.py output confirming the libraries are installed)
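
The exact checks live in test.py in the repository; as a rough, hypothetical sketch, an environment test of this kind could simply confirm that PyTorch sees the Jetson GPU and that Whisper can load a model.

# Hypothetical environment check (the repository's test.py may differ).
import torch
import whisper

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

# Load the smallest Whisper model as a quick sanity check.
model = whisper.load_model("tiny")
print("Whisper model loaded:", model.dims)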

In your terminal (Ctrl+Alt+T), run ffmpeg -version. If the output looks like the following, ffmpeg is installed.

(Screenshot: ffmpeg -version output)

Let's run it

python main.py
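
The repository's main.py is the reference implementation; as a simplified, hedged sketch (assuming the whisper and sounddevice packages), the core loop records short chunks from the microphone and passes them to Whisper:

# Simplified sketch of a chunk-by-chunk transcription loop (not the exact main.py code).
import sounddevice as sd
import whisper

SAMPLE_RATE = 16000   # Whisper expects 16 kHz audio, matching the microphone setting above
CHUNK_SECONDS = 5     # amount of audio transcribed per iteration

model = whisper.load_model("base")

while True:
    # Record a mono chunk from the default input device as float32 samples in [-1, 1].
    audio = sd.rec(int(CHUNK_SECONDS * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                   channels=1, dtype="float32")
    sd.wait()  # block until the chunk has been recorded
    # Whisper accepts a 1-D float32 NumPy array sampled at 16 kHz.
    result = model.transcribe(audio.flatten(), fp16=False)  # fp16=False is also safe on CPU
    print(result["text"].strip())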

Riva vs Whisper

Riva, known for its advanced AI-driven speech recognition and natural language processing, empowers users with real-time transcription, translation, and analysis of spoken conversations.

Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. Moreover, it enables transcription in multiple languages, as well as translation from those languages into English.
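
With the openai-whisper Python package, for instance, translation into English is selected via the task argument; this is a minimal sketch, and the audio file name below is only a placeholder.

import whisper

model = whisper.load_model("base")
# task="translate" transcribes the speech and translates it into English.
result = model.transcribe("speech_in_any_language.wav", task="translate")
print(result["text"])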

In the upcoming comparative video, we compare the speech-to-text capabilities of Riva and Whisper running on the NVIDIA Jetson.

Project Outlook

In this project, we use Whisper to transcribe microphone input in real time and display the result on a webpage. In the future, we will enhance Whisper's real-time processing to further reduce latency, improve recognition accuracy, and explore integration with other AI services or APIs to extend the application's functionality.
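
As a rough illustration of the webpage idea (a hypothetical sketch, not the project's implementation, assuming Flask is installed), the latest transcript could be exposed like this:

# Hypothetical sketch: serve the most recent transcription result on a small webpage.
from flask import Flask

app = Flask(__name__)
latest_text = "waiting for speech..."  # would be updated by the transcription loop

@app.route("/")
def index():
    return f"<h1>Live transcript</h1><p>{latest_text}</p>"

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)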
