How to Run a VLM with Voice Interaction on reComputer Jetson
Introduction
This guide explains in detail how to run a multimodal vision language model (VLM) with voice interaction on a reComputer Nvidia Jetson device. The system combines the computing power of the official Nvidia Jetson platform with Alibaba's open-source speech-to-text model SenseVoice and coqui-ai's text-to-speech (TTS) model to perform complex multimodal tasks. By following this guide, you will be able to install and operate the system, giving it both visual recognition and voice interaction capabilities and providing a smarter solution for your projects.
Introduction to VLM (Vision Language Model)
The Vision Language Model (VLM) is a multimodal model optimized for the Nvidia Jetson platform. It combines visual and language processing to handle complex tasks such as object recognition and generating descriptive language. VLMs suit fields such as autonomous driving, intelligent surveillance, and smart homes, offering intelligent and intuitive solutions.

Introduction to SenseVoice
SenseVoice is an open-source model focused on high-accuracy multilingual speech recognition, speech emotion recognition, and audio event detection. Trained on more than 400,000 hours of data, it supports over 50 languages and outperforms the Whisper model. The SenseVoice-Small model delivers very low latency, processing 10 seconds of audio in only 70 ms. It also offers convenient fine-tuning and supports deployment in multiple languages, including Python, C++, HTML, Java, and C#.

Introduction to TTS (Text-to-Speech)
The TTS library is a high-performance deep learning toolkit for text-to-speech tasks. It includes various models such as Tacotron2 and vocoders such as MelGAN and WaveRNN. It supports multi-speaker TTS and efficient training, and provides tools for dataset curation and model testing. Its modular codebase makes it easy to implement new features.

Prerequisites
- A reComputer Jetson device with at least 16GB of memory, such as the reComputer Jetson AGX Orin 64GB or the reComputer Jetson J4012 (16GB).
- A driver-free USB speaker-microphone.
- An IP camera that can output an RTSP stream address. We also include instructions on using the NVIDIA NVStreamer tool to convert local videos into RTSP streams.
We have verified this wiki on the reComputer Orin NX 16GB and the AGX Orin 64GB Developer Kit.

Installation
Initialize the System Environment
- After flashing the system with JetPack 6 (JP6), check that CUDA and the other libraries are installed. You can verify and install them by running:
sudo apt-get install nvidia-jetpack
- Install python3-pip, jtop, and docker-ce.
- Install the necessary dependencies by running the following commands (subprocess and wave ship with Python's standard library, so they do not need to be installed separately):
sudo apt-get install libportaudio2 libportaudiocpp0 portaudio19-dev
sudo pip3 install pyaudio playsound keyboard
sudo pip3 install --upgrade setuptools
sudo pip3 install sudachipy==0.5.2
- Check that audio input and output work on the USB speaker-microphone (the sketch below can help list the available devices), and make sure the network connection is stable.
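The following is a minimal sketch of our own (not part of the repository) for checking the audio setup. It lists every device PyAudio can see, using the same pyaudio calls that vlm_voice.py makes later, so the index you note here is the one you will select when the main program runs:
import pyaudio

# List all audio devices visible to PyAudio, with their input/output channel counts.
audio = pyaudio.PyAudio()
for i in range(audio.get_device_count()):
    info = audio.get_device_info_by_index(i)
    print(f"Device {i}: {info['name']} "
          f"(inputs: {info['maxInputChannels']}, outputs: {info['maxOutputChannels']})")
audio.terminate()
Your USB speaker-microphone should appear with at least one input channel and one output channel.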
Install the VLM
The core capability of this project is the vision language model (VLM). We provide a guide on how to use the VLM on reComputer Nvidia Jetson; please refer to that link for installation and usage instructions. Before continuing with the steps below, make sure you fully understand how to run inference in the VLM with text descriptions.
Install PyTorch and Torchaudio
We offer Nvidia Jetson AI courses for beginners, which include instructions on installing PyTorch, Torchaudio, and Torchvision. Please download and install these packages according to your system environment; you can then verify them with the sketch below.
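Once they are installed, a quick sanity check such as the sketch below (our addition, not part of the course) confirms that the packages import correctly and that PyTorch can see the Jetson GPU:
import torch
import torchaudio
import torchvision

# Print the installed versions and confirm CUDA is usable.
print("torch:", torch.__version__)
print("torchaudio:", torchaudio.__version__)
print("torchvision:", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())
If CUDA available prints False, the TTS in vlm_voice.py will fall back to the CPU and run much more slowly.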
Install Speech_vlm (Based on SenseVoice)
- Clone the Speech_vlm package:
cd ~/
git clone https://github.com/ZhuYaoHui1998/speech_vlm.git
- Install the Speech_vlm environment:
cd ~/speech_vlm
sudo pip3 install -r requirements.txt
Install TTS (Based on Coqui-ai)
cd ~/speech_vlm/TTS
sudo pip3 install .[all]
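To confirm the TTS install before wiring everything together, a minimal smoke test like the one below can be used. It is our own sketch; it loads the same tts_models/en/ljspeech/glow-tts model that vlm_voice.py uses later, so the first run also pre-downloads the model weights:
import torch
from TTS.api import TTS

# Load the glow-tts model on the GPU if available, then synthesize a test file.
device = "cuda" if torch.cuda.is_available() else "cpu"
tts = TTS("tts_models/en/ljspeech/glow-tts").to(device)
tts.tts_to_file(text="Text to speech is working.", file_path="tts_test.wav")
print("Wrote tts_test.wav")
You can play tts_test.wav on the USB speaker-microphone to confirm audio output as well.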
Usage
The speech_vlm repository is structured as follows:
speech_vlm/
├── TTS/                 # Coqui-ai TTS program
├── config               # VLM config
├── README.md            # Project introduction
├── requirements.txt     # SenseVoice required environment libraries
├── compose.yaml         # VLM Docker Compose startup file
├── delete_id.sh         # Script to delete a camera ID
├── example_1.wav        # Audio feedback tone template (replaceable)
├── model.py             # SenseVoice main program
├── set_alerts.sh        # Set up camera alerts
├── set_describe.sh      # Text input to have the VLM describe the current scene
├── set_streamer_id.sh   # Add an RTSP camera to the VLM
├── view_rtsp.py         # View the RTSP stream with OpenCV
└── vlm_voice.py         # Multimodal main program
- Start the VLM:
cd ~/speech_vlm
sudo docker compose up -d
- Add the RTSP stream to the VLM
View the contents of set_streamer_id.sh in the speech_vlm repository:
#!/bin/bash
curl --location 'http://0.0.0.0:5010/api/v1/live-stream' \
--header 'Content-Type: application/json' \
--data '{"liveStreamUrl": "RTSP stream address"}'
Replace 0.0.0.0 with your Jetson device's IP address, and replace RTSP stream address with your camera's RTSP stream address.
For example:
#!/bin/bash
curl --location 'http://192.168.49.227:5010/api/v1/live-stream' \
--header 'Content-Type: application/json' \
--data '{"liveStreamUrl": "rtsp://admin:[email protected]:554//Streaming/Channels/1"}'
If you do not have an RTSP camera, we provide instructions on how to use NVStreamer to stream local videos over RTSP and add them to the VLM.
Run set_streamer_id.sh:
cd ~/speech_vlm
sudo chmod +x ./set_streamer_id.sh
./set_streamer_id.sh
This returns a camera ID. The ID is important and needs to be recorded, as shown below:

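If you prefer to register the stream from Python rather than the shell script, the sketch below performs the same POST as set_streamer_id.sh and prints the service's reply. This is our own sketch, not part of the repository: it assumes the reply is JSON and that the camera ID is carried in an "id" field, so if your service uses a different field name, copy the ID out of the printed full response instead:
import json
import urllib.request

JETSON_IP = "192.168.49.227"  # replace with your Jetson device's IP address
RTSP_URL = "rtsp://admin:[email protected]:554//Streaming/Channels/1"  # your camera's RTSP URL

# Same request as set_streamer_id.sh: POST the RTSP URL to the live-stream endpoint.
req = urllib.request.Request(
    f"http://{JETSON_IP}:5010/api/v1/live-stream",
    data=json.dumps({"liveStreamUrl": RTSP_URL}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())

print("Full response:", body)
print("Camera ID (assumed 'id' field):", body.get("id"))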
- Run vlm_voice.py
You need to replace 0.0.0.0 in the following two lines of the Python code:
API_URL = 'http://0.0.0.0:5010/api/v1/chat/completions' # API endpoint
REQUEST_ID = "" # Request ID
Use your Jetson's IP address, and set REQUEST_ID to the camera ID returned in the previous step.
vlm_voice.py
import pyaudio
import wave
import keyboard
import subprocess
import json
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess
import time
import torch
from TTS.api import TTS
import os

# Get device
device = "cuda" if torch.cuda.is_available() else "cpu"

# Init TTS
api = TTS("tts_models/en/ljspeech/glow-tts").to(device)

# Configuration parameters
FORMAT = pyaudio.paInt16        # 16-bit resolution
CHANNELS = 1                    # Mono channel
CHUNK = 1024                    # Number of samples per chunk
OUTPUT_FILENAME = "output.wav"  # Output file name
API_URL = 'http://192.168.49.227:5010/api/v1/chat/completions'  # API endpoint
REQUEST_ID = "1388b691-3b9f-4bda-9d70-0ff0696f80f4"             # Request ID

# Initialize PyAudio
audio = pyaudio.PyAudio()

# Prepare the list to store recording data
frames = []

# Initialize microphone sample rate
print("Available audio input devices:")
for i in range(audio.get_device_count()):
    info = audio.get_device_info_by_index(i)
    print(f"Device {i}: {info['name']} - {info['maxInputChannels']} channels")

device_index = int(input("Please select the device index for your USB microphone: "))
device_info = audio.get_device_info_by_index(device_index)

supported_sample_rates = [8000, 16000, 32000, 44100, 48000]
supported_rate = 0
for rate in supported_sample_rates:
    try:
        if audio.is_format_supported(rate,
                                     input_device=device_index,
                                     input_channels=1,
                                     input_format=pyaudio.paInt16):
            supported_rate = rate
            print(f"{rate} Hz is supported.")
    except ValueError:
        print(f"{rate} Hz is not supported.")

# Initialize the model
model = "./SenseVoiceSmall"
model = AutoModel(
    model=model,
    vad_model="./speech_fsmn_vad_zh-cn-16k-common-pytorch",
    vad_kwargs={"max_single_segment_time": 30000},
    trust_remote_code=True,
    disable_log=True
)

def extract_content(json_response):
    try:
        # Parse the JSON string
        data = json.loads(json_response)
        # Extract the content field
        content = data["choices"][0]["message"]["content"]
        print(f"{content}")
        return content
    except KeyError as e:
        print(f"Key error: {e}")
    except json.JSONDecodeError as e:
        print(f"JSON decode error: {e}")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")

def start_recording():
    global frames
    frames = []
    try:
        stream = audio.open(format=FORMAT, channels=CHANNELS,
                            rate=supported_rate, input=True,
                            frames_per_buffer=CHUNK, input_device_index=device_index)
        print("Recording started... Press '2' to stop recording.")
        while True:
            if keyboard.is_pressed('2'):
                print("Recording stopped.")
                break
            data = stream.read(CHUNK)
            frames.append(data)
        stream.stop_stream()
        stream.close()
    except Exception as e:
        print(f"An error occurred during recording: {e}")

def save_recording():
    try:
        waveFile = wave.open(OUTPUT_FILENAME, 'wb')
        waveFile.setnchannels(CHANNELS)
        waveFile.setsampwidth(audio.get_sample_size(FORMAT))
        waveFile.setframerate(supported_rate)
        waveFile.writeframes(b''.join(frames))
        waveFile.close()
        print(f"Recording saved as {OUTPUT_FILENAME}")
    except Exception as e:
        print(f"An error occurred while saving the recording: {e}")

def send_alert(text):
    # Construct the JSON payload
    payload = {
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful AI assistant."
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "stream",
                        "stream": {
                            "stream_id": REQUEST_ID
                        }
                    },
                    {
                        "type": "text",
                        "text": text
                    }
                ]
            }
        ],
        "min_tokens": 1,
        "max_tokens": 128
    }
    # Convert the payload to a JSON string
    json_payload = json.dumps(payload)
    # Execute the curl command using subprocess
    curl_command = [
        'curl', '--location', API_URL,
        '--header', 'Content-Type: application/json',
        '--data', json_payload
    ]
    stream = None
    wf = None
    try:
        result = subprocess.run(curl_command, check=True, capture_output=True, text=True)
        # Extract the reply text from the VLM response
        content_result = extract_content(result.stdout)
        # Synthesize speech from the reply
        api.tts_to_file(
            str(content_result),
            speaker_wav="./example_1.wav",
            file_path="speech.wav"
        )
        # Resample the audio to the device's supported sample rate
        subprocess.run(['ffmpeg', '-i', 'speech.wav', '-ar', str(supported_rate), 'speech1.wav', '-y'])
        # Play the audio
        wf = wave.open('./speech1.wav', 'rb')
        stream = audio.open(format=pyaudio.paInt16,
                            channels=1,
                            rate=supported_rate,
                            output=True,
                            output_device_index=device_index)
        data = wf.readframes(1024)
        while data:
            stream.write(data)
            data = wf.readframes(1024)
        # print(f"Alert sent successfully: {result.stdout}")
    except subprocess.CalledProcessError as e:
        print(f"An error occurred while sending the alert: {e.stderr}")
    finally:
        # Even if an error occurs, close the stream and remove the temporary files
        if stream is not None:
            if stream.is_active():
                stream.stop_stream()
            stream.close()
        if wf is not None:
            wf.close()  # Close the wave file as well
        for tmp in ('speech.wav', 'speech1.wav'):
            if os.path.exists(tmp):
                os.remove(tmp)

print("Welcome to the Recording and Speech-to-Text System!")
print("Press '1' to start recording, '2' to stop recording.")
while True:
    if keyboard.is_pressed('1'):
        print("Preparing to start recording...")
        start_recording()
        save_recording()
        print("Processing the recording file, please wait...")
        try:
            res = model.generate(
                input=f"./{OUTPUT_FILENAME}",
                cache={},
                language="auto",  # "zh", "en", "yue", "ja", "ko", "nospeech"
                use_itn=True,
                batch_size_s=60,
                merge_vad=True,
                merge_length_s=15,
            )
            text = rich_transcription_postprocess(res[0]["text"])
            print(f"Speech-to-Text Result:\n{text}")
            # Send the transcription result as an alert
            send_alert(text)
        except Exception as e:
            print(f"An error occurred while processing the recording: {e}")
    time.sleep(0.1)  # Reduce CPU usage
Run the Python script:
cd ~/speech_vlm
sudo python3 vlm_voice.py
After the program starts, it scans all audio input and output devices, and you need to manually select the index of the audio device you want to use. The program is then ready: press 1 to start recording and press 2 to stop and send.

- View the results
We provide a view_rtsp.py script for viewing the output. You need to replace the IP portion of rtsp_url = "rtsp://0.0.0.0:5011/out" with your Jetson device's IP address.
view_rtsp.py
import cv2

rtsp_url = "rtsp://192.168.49.227:5011/out"
cap = cv2.VideoCapture(rtsp_url)

if not cap.isOpened():
    print("Cannot open RTSP stream")
    exit()

while True:
    ret, frame = cap.read()
    if not ret:
        print("Failed to retrieve frame")
        break
    height, width = frame.shape[:2]
    frame_resized = cv2.resize(frame, (int(width // 1.1), int(height // 1.1)))
    cv2.imshow('RTSP Stream', frame_resized)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()
Install OpenCV if it is not already available, then run the script:
sudo pip3 install opencv-python
cd ~/speech_vlm
sudo python3 view_rtsp.py

Demo
Tech Support & Product Discussion
Thank you for choosing our products! We are here to provide you with different levels of support to ensure your experience with our products is as smooth as possible. We offer several communication channels to cater to different preferences and needs.