
Tencent Websocket

February 19, 2024
柏拉文
The harder you work, the luckier you get

1. Overview


According to the Tencent ASR documentation, Tencent ASR speech recognition works over the WebSocket protocol: it recognizes a real-time audio stream and returns results synchronously, so text appears while the user is still speaking. First, after obtaining the user media stream or the Canvas composite stream, we initialize an AudioContext. Inside that audio context, audioContext.createMediaStreamSource(stream) creates the input source for audio processing, also called the source node; this interface extracts audio data from the given MediaStream and exposes it as a source node inside the AudioContext. Next we create the audio processing node. We prefer AudioWorklet/AudioWorkletNode, which runs on the audio rendering thread: the rendering thread posts the processed audio data to the main thread via postMessage, and the main thread receives it via onmessage. When AudioWorklet is not supported, we fall back to ScriptProcessorNode, which processes audio on the main thread through its onaudioprocess event. Finally, the source node is connected to the processing node, which in turn is connected to the audio output node (the speakers). This way, the audio signal from the source passes through our custom processor before it reaches the speakers. In the audio processing step we implement 16 kHz downsampling, audio loudness (RMS), audio intensity (dB), volume percentage, 16-bit PCM format conversion, and voice activity detection (VAD). The full flow is as follows:

Step 1. Get the user media stream: either the user media stream obtained above or the Canvas composite stream works. In the getUserMedia audio constraints, also enable noiseSuppression: true (noise suppression), echoCancellation: true (echo cancellation), and autoGainControl: true (automatic gain control), as in the sketch below.
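
A minimal sketch of requesting the microphone with these constraints; support for the individual constraints varies by browser, so treat this as an assumption about the target environment:

// Minimal sketch: request the microphone with noise suppression,
// echo cancellation, and automatic gain control enabled.
async function getMicrophoneStream() {
  try {
    return await navigator.mediaDevices.getUserMedia({
      audio: {
        noiseSuppression: true,
        echoCancellation: true,
        autoGainControl: true,
      },
    });
  } catch (e) {
    console.log("getUserMedia failed:", e);
    return null;
  }
}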

Step 2. Create the audio context: instantiate a new AudioContext, which controls audio operations and provides the working environment for audio processing and synthesis.

Step 3. Create the source node (the input source for audio processing): audioContext.createMediaStreamSource(stream) creates a node that extracts audio data from the given MediaStream and exposes it as a source node inside the AudioContext, as sketched below.
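
A minimal sketch covering Steps 2 and 3; stream is the MediaStream obtained in Step 1:

// Minimal sketch for Steps 2 and 3: create the AudioContext and
// wrap the MediaStream from Step 1 in a source node.
const audioContext = new AudioContext();
const streamSource = audioContext.createMediaStreamSource(stream);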

Step 4. Create the audio processing node: we prefer AudioWorklet/AudioWorkletNode, which runs on the audio rendering thread; the rendering thread sends the processed audio data to the main thread via postMessage, and the main thread receives it via onmessage. When AudioWorklet is not supported we fall back to ScriptProcessorNode, which runs on the main thread and processes audio in its onaudioprocess event. Why AudioWorklet first: AudioWorklet runs on the Audio Rendering Thread, decoupled from the main thread, so audio is not dropped or stuttering when the main thread is blocked. ScriptProcessorNode runs on the main thread and is affected by UI rendering, network requests, and other tasks, which increases processing latency and can even drop audio frames, especially under high CPU load, hurting real-time behavior and user experience.

Step 5. Connect the source node to the processing node and then to the audio output node (the speakers): this way, the audio signal from the source passes through our custom processor before it reaches the speakers, as in the sketch below.
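
A minimal sketch of Steps 4 and 5, assuming a worklet module registered as my-audio-worklet-processor has already been written (processorModuleURL is a placeholder; the full implementation is in section 2.4):

// Minimal sketch for Steps 4 and 5 (inside an async function): load the
// worklet module, create the node, and wire source -> processor -> speakers.
// processorModuleURL and the processor name are assumptions; see section 2.4.
await audioContext.audioWorklet.addModule(processorModuleURL);
const workletNode = new AudioWorkletNode(audioContext, "my-audio-worklet-processor");
workletNode.port.onmessage = (event) => {
  // Processed audio data posted from the rendering thread.
  console.log(event.data);
};
streamSource.connect(workletNode).connect(audioContext.destination);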

Step 6. Audio processing inside AudioWorklet/AudioWorkletNode or ScriptProcessorNode: 16 kHz downsampling, audio loudness (RMS), audio intensity (dB), volume percentage, 16-bit PCM format conversion, and voice activity detection (VAD). A per-frame sketch of how these pieces chain together follows below.
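
A minimal sketch of how one audio frame flows through these steps, using the helper names implemented later in section 2.4 (to16kHz, rmsToDb, getVolumeShownPercent, to16BitPCM):

// Minimal per-frame pipeline sketch for Step 6, using helpers from
// section 2.4. `samples` is a Float32Array frame from the processing node
// and `sampleRate` is the audio context sample rate.
function processFrame(samples, sampleRate) {
  const downsampled = to16kHz(samples, sampleRate); // 16 kHz downsampling
  const rms = Math.sqrt(
    samples.reduce((acc, s) => acc + s * s, 0) / samples.length
  ); // loudness (RMS)
  const db = rmsToDb(rms); // intensity (dB)
  const volume = getVolumeShownPercent(db); // volume percentage
  const pcm = to16BitPCM(downsampled); // 16-bit PCM
  return { rms, db, volume, buffer: pcm };
}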

Step 7. After processing, draw the volume and run incremental ASR: during incremental ASR, if the volume percentage is greater than or equal to 0, send the resulting 16-bit PCM data to each ASR vendor via socket.send and handle the returned data accordingly. For example, Tencent ASR has sentence-begin and sentence-end logic and also returns speech-rate related information. The Tencent ASR workflow is as follows (a message-handling sketch appears after this list):

  1. Handshake: the front end defines part of the request URL parameters and sends them to the backend; the backend generates a request URL carrying a signature, the wss:// prefix is prepended, and a WebSocket connection is then opened with this URL. The server returns a text message whose content is a JSON-serialized string.

  2. Recognition: send PCM audio data at a 16 kHz sample rate and receive recognition results; the server returns text messages whose content is a JSON-serialized string.

  3. Result handling: after the JSON-serialized string is parsed with JSON.parse(), it contains fields such as code, message, voice_id, final, and result:

    • code: return code; 0 means success, any other value means failure

    • result: the latest recognition result

      • slice_type: 0 = sentence recognition started, 1 = sentence recognition in progress, 2 = sentence recognition finished

      • voice_text_str: text of the current sentence, UTF-8 encoded

      • start_time: start time of the current sentence within the whole audio stream

      • end_time: end time of the current sentence within the whole audio stream

      • voice_id: unique identifier of the audio stream

      • word_size: number of word results in the current sentence

      • word_list: word list of the current sentence

    • final: when this field is 1, recognition of the whole audio stream has finished
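
A minimal sketch of handling these messages on the client, following the fields described above; socket is an open WebSocket and the on* callbacks are placeholders:

// Minimal sketch of handling a Tencent ASR text message. The on* callbacks
// are placeholders for the application's own handlers.
socket.onmessage = (event) => {
  const response = JSON.parse(event.data);

  if (response.code !== 0) {
    console.log("Tencent ASR error:", response.message);
    return;
  }
  if (response.final === 1) {
    onComplete(); // the whole audio stream has been recognized
    return;
  }
  if (response.result) {
    const { slice_type, voice_text_str } = response.result;
    if (slice_type === 0) onSentenceBegin();
    else if (slice_type === 2) onSentenceEnd(voice_text_str);
    else onSentenceChange(voice_text_str);
  }
};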

2. Implementation


2.1 index.js

// Import paths follow the file layout of sections 2.2 and 2.4.
import { ASR } from "./asr/index.js";
import { AudioProcessor } from "./audio-processor.js";

export class AudioRecorder {
  constructor(options) {
    const { asrURL, asrType } = options;

    this.asrURL = asrURL;
    this.asrType = asrType;

    this.asr = null;
    this.audioProcessor = null;

    this.run();
  }

  run() {
    this.initASR();
    this.initAudio();
  }

  initASR() {
    this.asr = new ASR({
      asrURL: this.asrURL,
      asrType: this.asrType,
      asrChangeCallback: this.asrChangeCallback,
      asrCompleteCallback: this.asrCompleteCallback,
    });
  }

  async initAudio() {
    // getUserMedia is assumed to be a project helper wrapping
    // navigator.mediaDevices.getUserMedia.
    const stream = await getUserMedia({ audio: true });

    if (stream) {
      this.audioProcessor = new AudioProcessor({
        stream,
        processCallback: this.audioProcessCallback,
      });
    }
  }

  startRecord = () => {
    this.asr.start();
    this.audioProcessor.start();
  };

  pauseRecord = () => {
    this.audioProcessor.pause();
  };

  resumeRecord = () => {
    this.audioProcessor.resume();
  };

  stopRecord = () => {
    this.audioProcessor.stop();
  };

  asrChangeCallback = (result) => {
    console.log("asrChangeCallback result", result);
  };

  asrCompleteCallback = (result) => {
    console.log("asrCompleteCallback result", result);
  };

  audioProcessCallback = (data) => {
    const { volume, buffer } = data;

    // videoRecorder and audioVolumeCanvas come from the surrounding recorder
    // project; only draw and send audio while it is recording.
    if (this.videoRecorder.status !== "recording") {
      return;
    }

    this.audioVolumeCanvas.draw(volume);
    this.asr.send(buffer);
  };
}
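
A minimal usage sketch, assuming the signed asrURL has already been fetched from the backend signing step described above (the URL value below is a placeholder):

// Minimal usage sketch; asrURL is assumed to be the signed wss:// URL
// returned by the backend.
const recorder = new AudioRecorder({
  asrURL: "wss://...", // placeholder for the signed request URL
  asrType: "tencent",
});

recorder.startRecord();
// ... later
recorder.stopRecord();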

2.2 asr/index.js

import { TencentASR } from "./tencent/index.js";
// Assumed path: the GoogleASR implementation is not shown in this post.
import { GoogleASR } from "./google/index.js";

const asrTypeMap = {
  google: "google",
  tencent: "tencent",
};

export class ASR {
  constructor(options) {
    const { asrType } = options;

    if (asrType === asrTypeMap.google) {
      this.asrInstance = new GoogleASR(options);
    } else {
      this.asrInstance = new TencentASR(options);
    }
  }

  send(data) {
    this.asrInstance.send(data);
  }

  stop() {
    this.asrInstance.stop();
  }

  start() {
    this.asrInstance.start();
  }
}

2.3 asr/tencent.js

export class TencentASR {
  constructor(options) {
    const { asrURL, asrChangeCallback, asrCompleteCallback } = options;

    this.socket = null;
    this.asrURL = asrURL;
    this.totalSentences = [];

    this.asrChangeCallback = asrChangeCallback;
    this.asrCompleteCallback = asrCompleteCallback;

    // Whether a sentence has started
    this.isSentenceBegin = false;
    // Whether recognition of the whole stream has finished
    this.isRecognizeComplete = false;

    this.isFinish = false;
    this.isCanSendData = false;
    this.isSignSuccess = false;
    this.stopSendAudio = false;
  }

  send(data) {
    if (!this.socket || this.socket.readyState !== 1) {
      return;
    }
    this.socket.send(data);
  }

  stop() {
    this.stopSendAudio = true;

    if (this.socket && this.socket.readyState === 1) {
      // Tell the server that the audio stream has ended.
      this.socket.send(JSON.stringify({ type: "end" }));
    } else if (this.socket) {
      this.socket.close(4000, "");
    }

    // If the final message never arrives, force completion after 3 s.
    setTimeout(() => {
      this.onASRComplete();
    }, 3000);
  }

  async start() {
    if (!this.asrURL) {
      return;
    }

    this.socket = new WebSocket(this.asrURL);

    this.socket.onmessage = (e) => {
      const response = JSON.parse(e.data);

      if (response.code !== 0) {
        console.log("TencentASR Recognizer Error: ", response.message);
        this.socket.close();
        return;
      }

      if (!this.isSignSuccess) {
        this.onASRStart();
        this.isSignSuccess = true;
      }

      if (response.final === 1) {
        this.isRecognizeComplete = true;
        this.onASRComplete(response);
        return;
      }

      if (response.result) {
        const result = {
          ...response.result,
          voice_id: response.voice_id,
        };

        if (response.result.slice_type === 0) {
          this.isSentenceBegin = true;
          this.onSentenceBegin(result);
        } else if (response.result.slice_type === 2) {
          if (!this.isSentenceBegin) {
            this.onSentenceBegin(result);
          }

          this.onSentenceEnd(result);
        } else {
          this.onASRChange(result);
        }
      }
    };

    this.socket.onerror = () => {
      // The close reason must be a string, so only the close code is passed.
      this.socket.close(4000);
    };

    this.socket.onclose = () => {};
  }

  onSentenceBegin() {
    console.log("Sentence recognition started!");
  }

  onSentenceEnd(result) {
    this.totalSentences.push(result);
  }

  onASRStart() {
    this.isCanSendData = true;
  }

  onASRChange(result) {
    const { voice_text_str } = result;
    this.asrChangeCallback?.(voice_text_str);
  }

  onASRComplete() {
    if (this.isFinish) {
      return;
    }
    this.isFinish = true;
    this.isCanSendData = false;
    this.asrCompleteCallback?.(this.totalSentences);
  }
}

2.4 audio-processor.js

function isSupportAudioWorklet(audioContext) {
  return (
    audioContext.audioWorklet &&
    typeof audioContext.audioWorklet.addModule === "function" &&
    typeof AudioWorkletNode !== "undefined"
  );
}

function isSupportCreateScriptProcessor(audioContext) {
  return typeof audioContext.createScriptProcessor === "function";
}

function calculateRMS(input) {
  const frames = new Int16Array(input);

  // Convert the Int16Array to an Int32Array
  const audioData = new Int32Array(frames.length);
  for (let i = 0; i < frames.length; i++) {
    audioData[i] = frames[i];
  }

  // Sum the squares of each sample
  let sumOfSquares = 0;
  for (let i = 0; i < audioData.length; i++) {
    sumOfSquares += audioData[i] ** 2;
  }

  // Mean of the squares
  const meanOfSquares = sumOfSquares / audioData.length;

  // RMS value
  const rms = Math.sqrt(meanOfSquares);

  // Reference pressure converted to the corresponding 16-bit reference value
  const refPressure = 0.00002;
  const refValue = refPressure * 32767;

  // Decibel value
  let dB = 0;
  if (rms !== 0) {
    dB = 20 * Math.log10(rms / refValue);
    dB = Math.round(dB);
  }

  return dB;
}

/**
 * @description: Convert an RMS gain value to decibels
 * @param {*} gain
 */
function rmsToDb(gain) {
  return 20 * Math.log10(gain);
}

export function to16kHz(audioBuffer, sampleRate = 44100) {
  const data = new Float32Array(audioBuffer);
  const fitCount = Math.round(data.length * (16000 / sampleRate));
  const newData = new Float32Array(fitCount);
  const springFactor = (data.length - 1) / (fitCount - 1);

  function cubicInterpolate(p0, p1, p2, p3, x) {
    return (
      p1 +
      0.5 *
        x *
        (p2 -
          p0 +
          x * (2 * p0 - 5 * p1 + 4 * p2 - p3 + x * (3 * (p1 - p2) + p3 - p0)))
    );
  }

  for (let i = 0; i < fitCount; i++) {
    const tmp = i * springFactor;
    const index = Math.floor(tmp);
    const x = tmp - index;

    // Take four neighbouring samples, clamping at the boundaries
    const p0 = data[Math.max(0, index - 1)];
    const p1 = data[index];
    const p2 = data[Math.min(index + 1, data.length - 1)];
    const p3 = data[Math.min(index + 2, data.length - 1)];

    newData[i] = cubicInterpolate(p0, p1, p2, p3, x);
  }

  return newData;
}

function to16BitPCM(input) {
  const dataLength = input.length * (16 / 8);
  const dataBuffer = new ArrayBuffer(dataLength);
  const dataView = new DataView(dataBuffer);
  let offset = 0;
  for (let i = 0; i < input.length; i++, offset += 2) {
    const s = Math.max(-1, Math.min(1, input[i]));
    dataView.setInt16(offset, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return dataView;
}

function getVolumeShownPercent(dbValue) {
  const minDb = -80;

  if (dbValue < minDb) {
    return 0;
  }
  if (dbValue > 1) {
    return 1;
  }

  const volumePercent = (Math.abs(minDb) - Math.abs(dbValue)) / Math.abs(minDb);

  return volumePercent;
}

function convertFloat32ToInt16(float32Array) {
  const int16Array = new Int16Array(float32Array.length);
  for (let i = 0; i < float32Array.length; i++) {
    // Map floats in [-1, 1] to integers in [-32768, 32767]
    int16Array[i] = Math.floor(float32Array[i] * 32767);
  }
  return int16Array;
}

class VAD {
  constructor(sampleRate, threshold = 0.01) {
    this.threshold = threshold;
    this.sampleRate = sampleRate;
  }

  isSpeech(samples) {
    const energy =
      samples.reduce((sum, val) => sum + val * val, 0) / samples.length;
    return energy > this.threshold;
  }
}

// The ${...} interpolations below inline the source code of the helper
// functions/classes into the worklet module string, since the worklet runs
// in a separate global scope and cannot reference them directly.
const myAudioWorkletProcessorCode = `
class MyAudioWorkletProcessor extends AudioWorkletProcessor {
  constructor(options) {
    super(options);
    this.vad = new ${VAD}(sampleRate);
  }

  get intervalInFrames() {
    return 40 / 1000 * sampleRate;
  }

  process(inputs) {
    const input = inputs[0];
    if (!input || input.length === 0) {
      // Keep the processor alive even when there is no input this block.
      return true;
    }

    const samples = input[0];
    const vad = this.vad.isSpeech(samples);
    const output = ${to16kHz}(samples, sampleRate);

    const sum = samples.reduce((acc, curr) => acc + curr * curr, 0);
    const rms = Math.sqrt(sum / samples.length);
    const db = ${rmsToDb}(rms);
    const volume = ${getVolumeShownPercent}(db);

    const audioBuffer = ${to16BitPCM}(output);
    const aiDb = ${calculateRMS}(${convertFloat32ToInt16}(samples));

    this.port.postMessage({ rms, db, vad, aiDb, volume, audioBuffer });

    return true;
  }
}

registerProcessor("my-audio-worklet-processor", MyAudioWorkletProcessor);
`;

export class AudioProcessor {
  constructor(options) {
    const {
      stream,
      processCallback,
      enableVAD = false,
      scriptProcessorOptions,
    } = options;

    this.stream = stream;
    this.enableVAD = enableVAD;
    this.processCallback = processCallback;
    this.audioContext = new AudioContext();
    this.scriptProcessorOptions = scriptProcessorOptions || {
      bufferSize: 1024,
      numberOfInputChannels: 1,
      numberOfOutputChannels: 1,
    };
    this.streamSource = this.audioContext.createMediaStreamSource(this.stream);

    this.init();
  }

  init() {
    if (isSupportAudioWorklet(this.audioContext)) {
      this.audioWorkletNodeDealAudioData();
    } else {
      this.scriptNodeDealAudioData();
    }

    this.audioContext?.suspend();
  }

  stop() {
    this.audioContext?.close();
    this.streamSource?.disconnect();
    this.scriptProcessor?.disconnect();
  }

  start() {
    this.audioContext?.resume();
  }

  pause() {
    this.audioContext?.suspend();
  }

  resume() {
    this.audioContext?.resume();
  }

  scriptNodeDealAudioData() {
    if (!isSupportCreateScriptProcessor(this.audioContext)) {
      return;
    }

    try {
      this.scriptProcessor = this.audioContext.createScriptProcessor(
        this.scriptProcessorOptions.bufferSize,
        this.scriptProcessorOptions.numberOfInputChannels,
        this.scriptProcessorOptions.numberOfOutputChannels
      );
      this.streamSource.connect(this.scriptProcessor);
      this.scriptProcessor.connect(this.audioContext.destination);

      this.scriptProcessor.onaudioprocess = (event) => {
        const samples = event.inputBuffer.getChannelData(0);
        const output = to16kHz(samples, this.audioContext.sampleRate);

        const sum = samples.reduce((acc, curr) => acc + curr * curr, 0);
        const rms = Math.sqrt(sum / samples.length);
        const db = rmsToDb(rms);
        const volume = getVolumeShownPercent(db);
        const audioBuffer = to16BitPCM(output);
        const aiDb = calculateRMS(convertFloat32ToInt16(samples));

        const data = {
          db,
          rms,
          aiDb,
          volume,
          buffer: audioBuffer,
        };

        this.processCallback?.(data);
      };
    } catch (e) {
      console.log("AudioProcessor scriptNodeDealAudioData error:", e);
    }
  }

  async audioWorkletNodeDealAudioData() {
    if (!isSupportAudioWorklet(this.audioContext)) {
      return;
    }

    try {
      const myAudioWorkletProcessorBlobURL = window.URL.createObjectURL(
        new Blob([myAudioWorkletProcessorCode], { type: "text/javascript" })
      );

      await this.audioContext.audioWorklet.addModule(
        myAudioWorkletProcessorBlobURL
      );

      const myAudioWorkletNode = new AudioWorkletNode(
        this.audioContext,
        "my-audio-worklet-processor",
        {
          numberOfInputs: 1,
          numberOfOutputs: 1,
          channelCount: 1,
        }
      );

      this.streamSource
        .connect(myAudioWorkletNode)
        .connect(this.audioContext.destination);

      myAudioWorkletNode.onprocessorerror = () => {
        // Fall back to ScriptProcessorNode if the worklet crashes.
        this.scriptNodeDealAudioData();
        return false;
      };

      myAudioWorkletNode.port.onmessageerror = () => {
        this.scriptNodeDealAudioData();
        return false;
      };

      myAudioWorkletNode.port.onmessage = (event) => {
        const { db, rms, vad, volume, audioBuffer } = event.data;

        console.log("vad", vad);

        const data = {
          db,
          rms,
          vad,
          volume,
          buffer: audioBuffer,
        };

        this.processCallback?.(data);
      };
    } catch (e) {
      console.log("AudioProcessor audioWorkletNodeDealAudioData error:", e);
    }
  }
}
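
For completeness, a minimal standalone usage sketch of AudioProcessor, assuming stream is a MediaStream already obtained via getUserMedia:

// Minimal standalone usage sketch of AudioProcessor.
const processor = new AudioProcessor({
  stream,
  processCallback: ({ db, volume, buffer }) => {
    console.log("volume:", volume, "dB:", db, "pcm bytes:", buffer.byteLength);
  },
});

processor.start(); // resume the AudioContext and begin processing
// ... later
processor.stop(); // close the context and disconnect the nodes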