Setting Up a Speech Recognition Service and VAD Detection with FunASR

article 2025/2/28 9:38:48

Tutorial: Setting Up an ASR Speech Recognition Service (with VAD Detection)

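In this article I walk through setting up an ASR (automatic speech recognition) service based on FunASR, with VAD (voice activity detection) integrated. The service uses models from Alibaba DAMO Academy and supports SSL connections, 2pass mode, and hotwords. We go step by step through starting the service, sending recognition requests with the Python client, and tuning the VAD parameters.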

1. Environment Setup

First, make sure Docker is installed on your server.
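As a quick sanity check before continuing, the sketch below tests whether a docker executable is on the PATH and responds to `docker --version`. The helper name `docker_available` is hypothetical and only for illustration; it does not check whether the Docker daemon is running or whether your user may talk to it.

```python
import shutil
import subprocess


def docker_available() -> bool:
    """Return True if a `docker` executable is on PATH and answers `--version`."""
    if shutil.which("docker") is None:
        return False
    try:
        result = subprocess.run(
            ["docker", "--version"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("docker found" if docker_available() else "docker not found")
```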

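You also need to download the speech recognition model, the VAD model, the punctuation model, and the other models from Alibaba Cloud. These models are published by DAMO Academy; the specific model directories are given in the launch commands below.
Official tutorial: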
https://github.chat.carlife.host/modelscope/FunASR/blob/main/runtime/docs/SDK_advanced_guide_online_zh.md

2. Starting the ASR Service

Starting the image
Pull and start the Docker image of the FunASR runtime package with the following commands:

sudo docker pull \
  registry.cn-hangzhou.aliyuncs.com/funasr_repo/funasr:funasr-runtime-sdk-online-cpu-0.1.10
mkdir -p ./funasr-runtime-resources/models
sudo docker run -p 10096:10095 -it --privileged=true \
  -v $PWD/funasr-runtime-resources/models:/workspace/models \
  registry.cn-hangzhou.aliyuncs.com/funasr_repo/funasr:funasr-runtime-sdk-online-cpu-0.1.10

Once you are inside the container, first change to the working directory:

cd /workspace/FunASR/runtime

When starting the service there are two options: with SSL or without SSL.

2.1 Without SSL

If you do not need SSL, set certfile to 0. Note that clients can then connect only over the ws protocol, not wss. The launch command is:

nohup bash run_server_2pass.sh \
  --model-dir damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-onnx \
  --online-model-dir damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online-onnx \
  --vad-dir damo/speech_fsmn_vad_zh-cn-16k-common-onnx \
  --punc-dir damo/punc_ct-transformer_zh-cn-common-vad_realtime-vocab272727-onnx \
  --lm-dir damo/speech_ngram_lm_zh-cn-ai-wesp-fst \
  --itn-dir thuduj12/fst_itn_zh \
  --certfile 0 \
  --keyfile ../../../ssl_key/server.key \
  --hotword ../../hotwords.txt > log.txt 2>&1 &

2.2 With SSL

To protect the connection with SSL, provide an SSL certificate and key. The launch command is:

nohup bash run_server_2pass.sh \
  --model-dir damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-onnx \
  --online-model-dir damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online-onnx \
  --vad-dir damo/speech_fsmn_vad_zh-cn-16k-common-onnx \
  --punc-dir damo/punc_ct-transformer_zh-cn-common-vad_realtime-vocab272727-onnx \
  --lm-dir damo/speech_ngram_lm_zh-cn-ai-wesp-fst \
  --itn-dir thuduj12/fst_itn_zh \
  --certfile ../../../ssl_key/server.crt \
  --keyfile ../../../ssl_key/server.key \
  --hotword ../../hotwords.txt > log.txt 2>&1 &
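The two launch commands differ only in the value of --certfile. If you start the server from a script, that difference can be captured in one small helper. The following is a minimal sketch: the function name `build_server_command` is hypothetical, and the flag values are copied from the commands above.

```python
from typing import List, Optional

# Flag values copied from the launch commands above.
DEFAULT_ARGS = {
    "--model-dir": "damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-onnx",
    "--online-model-dir": "damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online-onnx",
    "--vad-dir": "damo/speech_fsmn_vad_zh-cn-16k-common-onnx",
    "--punc-dir": "damo/punc_ct-transformer_zh-cn-common-vad_realtime-vocab272727-onnx",
    "--lm-dir": "damo/speech_ngram_lm_zh-cn-ai-wesp-fst",
    "--itn-dir": "thuduj12/fst_itn_zh",
    "--keyfile": "../../../ssl_key/server.key",
    "--hotword": "../../hotwords.txt",
}


def build_server_command(certfile: Optional[str] = None) -> List[str]:
    """Build the argument list for run_server_2pass.sh.

    certfile=None disables SSL by passing `--certfile 0` (section 2.1);
    any other value is passed through as the certificate path (section 2.2).
    """
    cmd = ["bash", "run_server_2pass.sh"]
    for flag, value in DEFAULT_ARGS.items():
        cmd += [flag, value]
    cmd += ["--certfile", certfile if certfile is not None else "0"]
    return cmd


if __name__ == "__main__":
    print(" ".join(build_server_command()))
    print(" ".join(build_server_command("../../../ssl_key/server.crt")))
```

The returned list can be handed to `subprocess.Popen` with `cwd="/workspace/FunASR/runtime"` and the output redirected to log.txt, which mirrors the nohup commands above.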

3. Python Request Example

Once the service is running, you can send requests with the Python client. The following example commands use funasr_wss_client.py:

If SSL is enabled (wss):

python funasr_wss_client.py --host "xx.xx.xx.xx" --port 10096 --mode 2pass

If SSL is not enabled (ws):

python funasr_wss_client.py --host "xx.xx.xx.xx" --port 10096 --mode 2pass --ssl 0

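Make sure the host and port passed to the client are correct (10096 is the host port that the docker run command above maps to port 10095 in the container) and that the mode is 2pass.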
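To make the ws/wss distinction and the shape of a streaming session more concrete, here is a rough sketch of the pieces a client prepares before opening the connection. It is not taken from funasr_wss_client.py. The helper names are hypothetical, and the JSON field names (mode, chunk_size, chunk_interval, wav_name, is_speaking) reflect my understanding of the FunASR WebSocket protocol, so treat them as assumptions and check the official protocol document before relying on them. No network code is included.

```python
import json
from typing import Iterator


def build_uri(host: str, port: int, use_ssl: bool) -> str:
    """Use wss:// when the server has a certificate, ws:// when --certfile is 0."""
    scheme = "wss" if use_ssl else "ws"
    return f"{scheme}://{host}:{port}"


def build_start_message(wav_name: str = "demo", mode: str = "2pass") -> str:
    """First JSON message of a session (field names are assumptions)."""
    return json.dumps({
        "mode": mode,
        "chunk_size": [5, 10, 5],
        "chunk_interval": 10,
        "wav_name": wav_name,
        "is_speaking": True,
    })


def build_end_message() -> str:
    """Final JSON message telling the server that the audio has ended."""
    return json.dumps({"is_speaking": False})


def split_pcm(pcm: bytes, chunk_ms: int = 60, sample_rate: int = 16000) -> Iterator[bytes]:
    """Yield fixed-duration chunks of 16-bit mono PCM; the last may be shorter."""
    chunk_bytes = sample_rate * 2 * chunk_ms // 1000
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


if __name__ == "__main__":
    print(build_uri("xx.xx.xx.xx", 10096, use_ssl=False))
    print(build_start_message())
    one_second_of_silence = b"\x00" * 32000
    print(len(list(split_pcm(one_second_of_silence))), "chunks")
```

A real client would send the start message, then each audio chunk as a binary frame, then the end message, while reading recognition results from the same connection.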

4. Tuning the VAD Parameters

4.1 Locating the VAD model's configuration

The VAD model in FunASR is FSMN-VAD. Its parameter class is VADXOptions, which can be found at the following path:

/workspace/FunASR/runtime/python/onnxruntime/funasr_onnx/utils/e2e_vad.py

The VADXOptions class defines a number of VAD parameters. Some of the common ones, with their default values:

class VADXOptions:
    sample_rate: int = 16000
    detect_mode: int = VadDetectMode.kVadMutipleUtteranceDetectMode.value
    snr_mode: int = 0
    max_end_silence_time: int = 800
    max_start_silence_time: int = 3000
    do_start_point_detection: bool = True
    do_end_point_detection: bool = True
    window_size_ms: int = 200
    sil_to_speech_time_thres: int = 150
    speech_to_sil_time_thres: int = 150
    speech_2_noise_ratio: float = 1.0
    do_extend: int = 1
    lookback_time_start_point: int = 200
    lookahead_time_end_point: int = 100
    max_single_segment_time: int = 60000

These parameters control, among other things, silence detection and the speech-to-noise ratio. The main ones mean the following:

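max_single_segment_time: maximum duration of a single speech segment, default 60000 ms (1 minute).
max_end_silence_time: maximum duration of trailing silence before the end of a segment is declared, default 800 ms.
max_start_silence_time: maximum duration of leading silence, default 3000 ms.
sil_to_speech_time_thres: time threshold for switching from silence to speech, default 150 ms.
speech_to_sil_time_thres: time threshold for switching from speech to silence, default 150 ms.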
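To build some intuition for how max_end_silence_time and max_single_segment_time affect segmentation, here is a deliberately simplified toy endpointer. It is not the FSMN-VAD algorithm: it takes ready-made per-frame speech flags and ignores the smoothing window, the sil_to_speech/speech_to_sil thresholds, and the lookback/lookahead padding. The function name `toy_segments` and the 10 ms frame length are assumptions made for illustration.

```python
from typing import List, Tuple


def toy_segments(
    is_speech: List[bool],
    frame_ms: int = 10,
    max_end_silence_time: int = 800,
    max_single_segment_time: int = 60000,
) -> List[Tuple[int, int]]:
    """Split per-frame speech flags into (start_ms, end_ms) segments.

    A segment opens at the first speech frame. It closes once the trailing
    silence reaches max_end_silence_time (end_ms is then the end of the last
    speech frame), or once it has lasted max_single_segment_time (forced cut).
    """
    segments = []
    start = None          # start time of the open segment, in ms
    last_speech_end = 0   # end time of the most recent speech frame, in ms
    for i, speech in enumerate(is_speech):
        t0, t1 = i * frame_ms, (i + 1) * frame_ms
        if start is None:
            if speech:
                start, last_speech_end = t0, t1
            continue
        if speech:
            last_speech_end = t1
        if t1 - last_speech_end >= max_end_silence_time:
            segments.append((start, last_speech_end))
            start = None
        elif t1 - start >= max_single_segment_time:
            segments.append((start, t1))
            start = None
    if start is not None:
        # Input ended while a segment was still open.
        segments.append((start, last_speech_end))
    return segments


if __name__ == "__main__":
    # 0.5 s speech, 0.3 s pause, 0.5 s speech, 1 s silence
    flags = [True] * 50 + [False] * 30 + [True] * 50 + [False] * 100
    print(toy_segments(flags, max_end_silence_time=800))
    print(toy_segments(flags, max_end_silence_time=200))
```

With the default 800 ms the 0.3 s pause is bridged and the audio comes out as one segment; with 200 ms the same pause splits it into two. That is the practical trade-off when lowering max_end_silence_time: faster sentence endings, but more splits at natural pauses.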

4.2 Modifying the VAD configuration

The configuration the VAD model actually uses is read from the config.yaml file in the model directory. You can find config.yaml at the following path:

/workspace/models/damo/speech_fsmn_vad_zh-cn-16k-common-onnx/config.yaml

The model_conf field in config.yaml holds the detailed configuration of the VAD model:

model: FsmnVADStreaming
model_conf:
    sample_rate: 16000
    detect_mode: 1
    snr_mode: 0
    max_end_silence_time: 800
    max_start_silence_time: 3000
    do_start_point_detection: True
    do_end_point_detection: True
    window_size_ms: 200
    sil_to_speech_time_thres: 150
    speech_to_sil_time_thres: 150
    speech_2_noise_ratio: 1.0
    do_extend: 1
    lookback_time_start_point: 200
    lookahead_time_end_point: 100
    max_single_segment_time: 60000
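Editing the file by hand works fine. If you would rather script the change, the sketch below rewrites a single value in the text of config.yaml using only the standard library, leaving every other line untouched. The helper name `set_model_conf_value` is hypothetical, and it assumes each parameter sits on its own indented `key: value` line, as in the listing above.

```python
import re


def set_model_conf_value(config_text: str, key: str, value) -> str:
    """Return config_text with the indented `key: ...` line set to `value`.

    Only the first matching line is changed; ordering, indentation, and all
    other lines are preserved. Raises KeyError if no indented entry matches.
    """
    pattern = re.compile(rf"^([ \t]+{re.escape(key)}:[ \t]*).*$", re.MULTILINE)
    new_text, count = pattern.subn(
        lambda m: f"{m.group(1)}{value}", config_text, count=1
    )
    if count == 0:
        raise KeyError(key)
    return new_text


if __name__ == "__main__":
    sample = (
        "model: FsmnVADStreaming\n"
        "model_conf:\n"
        "    max_end_silence_time: 800\n"
        "    max_start_silence_time: 3000\n"
    )
    # Shorten the trailing-silence threshold from 800 ms to 500 ms.
    print(set_model_conf_value(sample, "max_end_silence_time", 500))
```

To apply it, read /workspace/models/damo/speech_fsmn_vad_zh-cn-16k-common-onnx/config.yaml, pass its text through the function, and write the result back. The running server most likely has to be restarted for the change to take effect, since the configuration is presumably read when the model is loaded; this is an assumption rather than something verified here.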