Tencent WebSocket
1. Overview
Tencent ASR documentation. Tencent ASR speech recognition uses the WebSocket protocol to recognize a real-time audio stream and return results synchronously, so text appears while the user is still speaking. First, once we have a user media stream or a Canvas composite stream, we initialize an AudioContext. Inside the audio context, audioContext.createMediaStreamSource(stream) creates the input source for audio processing, also called the source node; this interface extracts audio data from the given MediaStream object and exposes it as a source node inside the AudioContext. Next we create the audio-processing node. We prefer AudioWorklet/AudioWorkletNode, which processes audio on the audio rendering thread; the rendering thread sends the processed audio data to the main thread via postMessage, and the main thread receives it via onmessage. When AudioWorklet is not supported we fall back to ScriptProcessorNode, which processes audio on the main thread through its onaudioprocess event. Finally, the source node is connected to the processing node, which is connected to the audio destination (the speakers), so the audio signal from the source passes through our custom processor before it reaches the speakers. In the processing step we implement 16 kHz downsampling, loudness as RMS (root mean square), intensity as a dB value, a volume percentage, conversion to 16-bit PCM, and voice activity detection (VAD).
Step 1. Get the user media stream: either the user media stream obtained above or a Canvas composite stream will do. In the audio constraints passed to getUserMedia, enable noiseSuppression: true (noise suppression), echoCancellation: true (echo cancellation) and autoGainControl: true (automatic gain control), as in the sketch below.
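A minimal sketch of that request; the getUserMedia wrapper below is an assumption standing in for the helper the later code imports, which is not shown in this article:

async function getUserMedia(constraints) {
  try {
    return await navigator.mediaDevices.getUserMedia(constraints);
  } catch (e) {
    console.log("getUserMedia failed:", e);
    return null;
  }
}

// Inside an async context:
const stream = await getUserMedia({
  audio: {
    noiseSuppression: true, // noise suppression
    echoCancellation: true, // echo cancellation
    autoGainControl: true, // automatic gain control
  },
});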
Step 2. Create the audio context: instantiate a new AudioContext, which processes and controls audio operations and provides the working environment for audio processing and synthesis.
Step 3. Create the source node (the input to audio processing): audioContext.createMediaStreamSource(stream) creates an interface that extracts audio data from the given MediaStream object and exposes it as a source node inside the AudioContext, as sketched below.
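A minimal sketch of steps 2 and 3 together:

// Step 2: the AudioContext is the working environment for all further processing.
const audioContext = new AudioContext();
// Step 3: turn the MediaStream into a source node inside this context.
const sourceNode = audioContext.createMediaStreamSource(stream);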
Step 4. Create the audio-processing node: we prefer AudioWorklet/AudioWorkletNode, which processes audio on the audio rendering thread; the rendering thread sends the processed audio data to the main thread via postMessage, and the main thread receives it via onmessage. When AudioWorklet is not supported we fall back to ScriptProcessorNode, which processes audio on the main thread through its onaudioprocess event. Why prefer AudioWorklet? It runs on the Audio Rendering Thread, decoupled from the main thread, which avoids dropped or stuttering audio when the main thread is blocked. ScriptProcessorNode runs on the main thread, so UI rendering, network requests and other tasks increase audio-processing latency and can even drop audio frames, especially under high CPU load, hurting real-time behaviour and the user experience.
Step 5. Connect the source node to the processing node and then to the audio destination (the speakers): this way the audio signal from the source passes through the custom audio processor before it is sent to the speakers. A combined sketch of steps 4 and 5 follows.
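A minimal sketch of steps 4 and 5, run inside an async function; workletModuleURL is a placeholder for the Blob URL of the processor module built in section 2.4, and the processor name matches the one registered there:

if (audioContext.audioWorklet && typeof AudioWorkletNode !== "undefined") {
  // Preferred path: processing runs on the audio rendering thread.
  await audioContext.audioWorklet.addModule(workletModuleURL);
  const workletNode = new AudioWorkletNode(audioContext, "my-audio-worklet-processor");
  workletNode.port.onmessage = (event) => {
    // Processed audio arrives on the main thread here.
  };
  sourceNode.connect(workletNode).connect(audioContext.destination);
} else {
  // Fallback: ScriptProcessorNode processes audio on the main thread.
  const scriptNode = audioContext.createScriptProcessor(1024, 1, 1);
  scriptNode.onaudioprocess = (event) => {
    const samples = event.inputBuffer.getChannelData(0);
    // Process samples here.
  };
  sourceNode.connect(scriptNode);
  scriptNode.connect(audioContext.destination);
}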
Step 6. Audio processing inside AudioWorklet/AudioWorkletNode or ScriptProcessorNode: 16 kHz downsampling, loudness as RMS (root mean square), intensity as a dB value, a volume percentage, conversion to 16-bit PCM, and voice activity detection (VAD). The core loudness math is sketched below.
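A minimal sketch of the per-frame loudness math (the full implementation is in audio-processor.js, section 2.4):

// samples is a Float32Array frame with values in [-1, 1].
const rms = Math.sqrt(samples.reduce((sum, s) => sum + s * s, 0) / samples.length);
// 0 dB corresponds to full scale; silence approaches -Infinity.
const db = 20 * Math.log10(rms);
// Map dB onto a 0..1 volume percentage, clamping everything below -80 dB to 0.
const minDb = -80;
const volume = db < minDb ? 0 : db > 1 ? 1 : (Math.abs(minDb) - Math.abs(db)) / Math.abs(minDb);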
Step 7. After processing, draw the volume and run incremental ASR: during incremental recognition, when the volume percentage is greater than or equal to 0, the resulting 16-bit PCM data is sent to the ASR vendor over Socket.send, and the returned data is handled accordingly. Tencent ASR, for example, has sentence-begin and sentence-end logic and also returns speed-related information. The Tencent ASR workflow is as follows (a sketch of the server-side URL signing step appears after this list):
- Handshake stage: the frontend defines part of the request URL parameters and sends them to the backend; the backend generates a request URL carrying a signature, the wss:// prefix is prepended, and a WebSocket connection is opened with that URL. The server replies with a text message whose content is a JSON-serialized string.
- Recognition stage: send PCM audio data at a 16 kHz sample rate and receive recognition results as text messages whose content is a JSON-serialized string.
- Result handling: after JSON.parse(), the message contains fields such as code, message, voice_id, final, and result:
  - code: return code; 0 means success, any other value means failure
  - result: latest recognition result
    - slice_type: 0 = a sentence has started, 1 = the sentence is being recognized, 2 = the sentence has finished
    - voice_text_str: text of the current sentence, UTF-8 encoded
    - start_time: start time of the current sentence within the whole audio stream
    - end_time: end time of the current sentence within the whole audio stream
    - voice_id: unique identifier of the audio stream
    - word_size: number of word results in the current sentence
    - word_list: word list of the current sentence
  - final: a value of 1 means the whole audio stream has finished being recognized
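Because the SecretKey must never reach the browser, the signature in the handshake stage is generated server-side. Below is a rough Node.js sketch of what such a signing step might look like; the parameter names and the HMAC-SHA1-over-the-URL scheme are written from memory of the Tencent Cloud real-time ASR documentation and should be verified against it, and appid, secretId and secretKey are placeholders:

import { createHmac } from "node:crypto";

// Hypothetical helper: builds the signed wss:// URL the frontend connects to.
function buildTencentAsrURL({ appid, secretId, secretKey, voiceId }) {
  const now = Math.floor(Date.now() / 1000);
  const params = {
    secretid: secretId,
    timestamp: now,
    expired: now + 3600,
    nonce: Math.floor(Math.random() * 1e9),
    engine_model_type: "16k_zh", // 16 kHz Chinese engine
    voice_id: voiceId,
    voice_format: 1, // PCM
  };
  const query = Object.keys(params)
    .sort()
    .map((key) => `${key}=${params[key]}`)
    .join("&");
  const unsignedURL = `asr.cloud.tencent.com/asr/v2/${appid}?${query}`;
  // Signature: HMAC-SHA1 over the unsigned URL, Base64-encoded, then URL-encoded.
  const signature = encodeURIComponent(
    createHmac("sha1", secretKey).update(unsignedURL).digest("base64")
  );
  return `wss://${unsignedURL}&signature=${signature}`;
}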
2. Implementation
2.1 index.js
export class AudioRecorder {
constructor(options) {
const {
asrURL,
asrType,
} = options;
this.asrURL = asrURL;
this.asrType = asrType;
this.asr = null;
this.audioProcessor = null;
this.run();
}
run() {
this.initASR();
    this.initAudio();
}
initASR() {
this.asr = new ASR({
asrURL: this.asrURL,
asrType: this.asrType,
asrChangeCallback: this.asrChangeCallback,
asrCompleteCallback: this.asrCompleteCallback,
});
}
  async initAudio() {
    // getUserMedia here is a project helper wrapping navigator.mediaDevices.getUserMedia
    // (see step 1); noise suppression, echo cancellation and auto gain are enabled.
    const stream = await getUserMedia({
      audio: {
        noiseSuppression: true,
        echoCancellation: true,
        autoGainControl: true,
      },
    });
    if (stream) {
      this.audioProcessor = new AudioProcessor({
        stream,
        processCallback: this.audioProcessCallback,
      });
    }
  }
startRecord = () => {
this.asr.start();
this.audioProcessor.start();
};
pauseRecord = () => {
this.audioProcessor.pause();
};
resumeRecord = () => {
this.audioProcessor.resume();
};
stopRecord = () => {
this.audioProcessor.stop();
};
asrChangeCallback = (result) => {
console.log("asrChangeCallback result", result);
};
asrCompleteCallback = (result) => {
console.log("asrCompleteCallback result", result);
};
  audioProcessCallback = (data) => {
    const { volume, buffer } = data;
    // videoRecorder and audioVolumeCanvas are created elsewhere in the full project;
    // audio is only drawn and forwarded to ASR while a recording is in progress.
    if (this.videoRecorder.status !== "recording") {
      return;
    }
    this.audioVolumeCanvas.draw(volume);
    this.asr.send(buffer);
  };
}
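A minimal usage sketch, assuming signedURL is the signed wss:// URL fetched from your own backend (see the handshake sketch above) and the button IDs are hypothetical:

const recorder = new AudioRecorder({ asrURL: signedURL, asrType: "tencent" });
document.querySelector("#start").onclick = recorder.startRecord;
document.querySelector("#pause").onclick = recorder.pauseRecord;
document.querySelector("#resume").onclick = recorder.resumeRecord;
document.querySelector("#stop").onclick = recorder.stopRecord;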
2.2 asr/index.js
import { TencentASR } from "./tencent/index.js";
// GoogleASR is used below but not shown in this article; its path here is an assumption.
import { GoogleASR } from "./google/index.js";
const asrTypeMap = {
google: "google",
tencent: "tencent",
};
export class ASR {
constructor(options) {
const { asrType } = options;
if (asrType === asrTypeMap.google) {
this.asrInstance = new GoogleASR(options);
} else {
this.asrInstance = new TencentASR(options);
}
}
send(data) {
this.asrInstance.send(data);
}
stop() {
this.asrInstance.stop();
}
start() {
this.asrInstance.start();
}
}
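The wrapper can also be used on its own; a small sketch, again assuming signedURL comes from your backend:

const asr = new ASR({
  asrURL: signedURL,
  asrType: "tencent",
  asrChangeCallback: (text) => console.log("partial:", text),
  asrCompleteCallback: (sentences) =>
    console.log("final:", sentences.map((s) => s.voice_text_str).join("")),
});
asr.start();
// Later: asr.send(pcm16Buffer); ... asr.stop();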
2.3 asr/tencent.js
export class TencentASR {
constructor(options) {
const { asrURL, asrChangeCallback, asrCompleteCallback } = options;
this.socket = null;
this.asrURL = asrURL;
this.totalSentences = [];
this.asrChangeCallback = asrChangeCallback;
this.asrCompleteCallback = asrCompleteCallback;
    // Whether a sentence has started
    this.isSentenceBegin = false;
    // Whether the current recognition has finished
    this.isRecognizeComplete = false;
this.isFinish = false;
this.isCanSendData = false;
this.isSignSuccess = false;
this.stopSendAudio = false;
}
send(data) {
if (!this.socket || this.socket.readyState !== 1) {
return;
}
this.socket.send(data);
}
  stop() {
    this.stopSendAudio = true;
    if (this.socket && this.socket.readyState === 1) {
      // Connection still open: tell the server the audio stream has ended.
      this.socket.send(JSON.stringify({ type: "end" }));
    } else if (this.socket) {
      // Otherwise just close whatever is left of the connection.
      this.socket.close(4000, "");
    }
    // If no final message arrives within 3 seconds, finish anyway.
    setTimeout(() => {
      this.onASRComplete();
    }, 3000);
  }
async start() {
if (!this.asrURL) {
return;
}
this.socket = new WebSocket(this.asrURL);
this.socket.onmessage = (e) => {
const response = JSON.parse(e.data);
if (response.code !== 0) {
console.log("TencentASR Recognizer Error: ", response.message);
this.socket.close();
return;
}
if (!this.isSignSuccess) {
this.onASRStart();
this.isSignSuccess = true;
}
if (response.final === 1) {
this.isRecognizeComplete = true;
this.onASRComplete(response);
return;
}
if (response.result) {
const result = {
...response.result,
voice_id: response.voice_id,
};
if (response.result.slice_type === 0) {
this.isSentenceBegin = true;
this.onSentenceBegin(result);
} else if (response.result.slice_type === 2) {
if (!this.isSentenceBegin) {
this.onSentenceBegin(result);
}
this.onSentenceEnd(result);
} else {
this.onASRChange(result);
}
}
};
    this.socket.onerror = (error) => {
      // The close reason must be a short string, so do not pass the error event itself.
      this.socket.close(4000, "error");
    };
    this.socket.onclose = () => {};
}
  onSentenceBegin() {
    console.log("Sentence recognition started!");
  }
onSentenceEnd(result) {
this.totalSentences.push(result);
}
onASRStart() {
this.isCanSendData = true;
}
onASRChange(result) {
const { voice_text_str } = result;
this.asrChangeCallback?.(voice_text_str);
}
onASRComplete() {
if (this.isFinish) {
return;
}
this.isFinish = true;
this.isCanSendData = false;
this.asrCompleteCallback?.(this.totalSentences);
}
}
2.4 audio-processor.js
function isSupportAudioWorklet(audioContext) {
return (
audioContext.audioWorklet &&
typeof audioContext.audioWorklet.addModule === "function" &&
typeof AudioWorkletNode !== "undefined"
);
}
function isSupportCreateScriptProcessor(audioContext) {
return typeof audioContext.createScriptProcessor === "function";
}
function calculateRMS(input) {
  // Despite the name, this returns a dB value computed from the RMS of 16-bit samples.
  const frames = new Int16Array(input);
  // Copy the Int16Array into an Int32Array.
  const audioData = new Int32Array(frames.length);
  for (let i = 0; i < frames.length; i++) {
    audioData[i] = frames[i];
  }
  // Sum the square of every sample.
  let sumOfSquares = 0;
  for (let i = 0; i < audioData.length; i++) {
    sumOfSquares += audioData[i] ** 2;
  }
  // Mean of the squares.
  const meanOfSquares = sumOfSquares / audioData.length;
  // RMS.
  const rms = Math.sqrt(meanOfSquares);
  // Reference pressure (20 µPa) mapped onto the int16 full-scale value.
  const refPressure = 0.00002;
  const refValue = refPressure * 32767;
  // Decibel value.
  let dB = 0;
  if (rms !== 0) {
    dB = 20 * Math.log10(rms / refValue);
    dB = Math.round(dB);
  }
  return dB;
}
/**
 * @description Convert an RMS value (gain) to decibels.
 * @param {*} gain
 */
function rmsToDb(gain) {
  return 20 * Math.log10(gain);
}
export function to16kHz(audioBuffer, sampleRate = 44100) {
const data = new Float32Array(audioBuffer);
const fitCount = Math.round(data.length * (16000 / sampleRate));
const newData = new Float32Array(fitCount);
const springFactor = (data.length - 1) / (fitCount - 1);
function cubicInterpolate(p0, p1, p2, p3, x) {
return (
p1 +
0.5 *
x *
(p2 -
p0 +
x * (2 * p0 - 5 * p1 + 4 * p2 - p3 + x * (3 * (p1 - p2) + p3 - p0)))
);
}
for (let i = 0; i < fitCount; i++) {
const tmp = i * springFactor;
const index = Math.floor(tmp);
const x = tmp - index;
    // Take four neighbouring data points, handling the boundary cases.
const p0 = data[Math.max(0, index - 1)];
const p1 = data[index];
const p2 = data[Math.min(index + 1, data.length - 1)];
const p3 = data[Math.min(index + 2, data.length - 1)];
newData[i] = cubicInterpolate(p0, p1, p2, p3, x);
}
return newData;
}
function to16BitPCM(input) {
const dataLength = input.length * (16 / 8);
const dataBuffer = new ArrayBuffer(dataLength);
const dataView = new DataView(dataBuffer);
let offset = 0;
for (let i = 0; i < input.length; i++, offset += 2) {
const s = Math.max(-1, Math.min(1, input[i]));
dataView.setInt16(offset, s < 0 ? s * 0x8000 : s * 0x7fff, true);
}
return dataView;
}
function getVolumeShownPercent(dbValue) {
const minDb = -80;
if (dbValue < minDb) {
return 0;
}
if (dbValue > 1) {
return 1;
}
const volumePercent = (Math.abs(minDb) - Math.abs(dbValue)) / Math.abs(minDb);
return volumePercent;
}
function convertFloat32ToInt16(float32Array) {
const int16Array = new Int16Array(float32Array.length);
for (let i = 0; i < float32Array.length; i++) {
    // Map floats in [-1, 1] onto the 16-bit integer range [-32768, 32767].
int16Array[i] = Math.floor(float32Array[i] * 32767);
}
return int16Array;
}
// Simple energy-based voice activity detection: a frame counts as speech when its
// mean squared amplitude exceeds the threshold.
class VAD {
  constructor(sampleRate, threshold = 0.01) {
this.threshold = threshold;
this.sampleRate = sampleRate;
}
isSpeech(samples) {
const energy =
samples.reduce((sum, val) => sum + val * val, 0) / samples.length;
return energy > this.threshold;
}
}
// The worklet module is loaded from a Blob URL and runs in its own global scope, so it cannot
// see this module's bindings. Interpolating ${VAD}, ${to16kHz}, etc. inlines the source text of
// those classes/functions directly into the generated worklet code.
const myAudioWorkletProcessorCode = `
class MyAudioWorkletProcessor extends AudioWorkletProcessor {
  constructor(options) {
    super(options);
    this.vad = new ${VAD}(sampleRate);
  }
  get intervalInFrames() {
    return 40 / 1000 * sampleRate;
  }
  process(inputs) {
    const input = inputs[0];
    if (!input || input.length === 0) {
      // Keep the processor alive even when no input is available for this render quantum.
      return true;
    }
    const samples = input[0];
    const vad = this.vad.isSpeech(samples);
    const output = ${to16kHz}(samples, sampleRate);
    const sum = samples.reduce((acc, curr) => acc + curr * curr, 0);
    const rms = Math.sqrt(sum / samples.length);
    const db = ${rmsToDb}(rms);
    const volume = ${getVolumeShownPercent}(db);
    const audioBuffer = ${to16BitPCM}(output);
    const aiDb = ${calculateRMS}(${convertFloat32ToInt16}(samples));
    this.port.postMessage({ rms, db, vad, aiDb, volume, audioBuffer });
    return true;
  }
}
registerProcessor("my-audio-worklet-processor", MyAudioWorkletProcessor);
`;
export class AudioProcessor {
constructor(options) {
const {
stream,
processCallback,
enableVAD = false,
scriptProcessorOptions,
} = options;
this.stream = stream;
this.enableVAD = enableVAD;
this.processCallback = processCallback;
this.audioContext = new AudioContext();
this.scriptProcessorOptions = scriptProcessorOptions || {
bufferSize: 1024,
numberOfInputChannels: 1,
numberOfOutputChannels: 1,
};
this.streamSource = this.audioContext.createMediaStreamSource(this.stream);
this.init();
}
init() {
if (isSupportAudioWorklet(this.audioContext)) {
this.audioWorkletNodeDealAudioData();
} else {
this.scriptNodeDealAudioData();
}
this.audioContext?.suspend();
}
stop() {
this.audioContext?.close();
this.streamSource?.disconnect();
this.scriptProcessor?.disconnect();
}
start() {
this.audioContext?.resume();
}
pause() {
this.audioContext?.suspend();
}
resume() {
this.audioContext?.resume();
}
scriptNodeDealAudioData() {
if (!isSupportCreateScriptProcessor(this.audioContext)) {
return;
}
try {
this.scriptProcessor = this.audioContext.createScriptProcessor(
this.scriptProcessorOptions.bufferSize,
this.scriptProcessorOptions.numberOfInputChannels,
this.scriptProcessorOptions.numberOfOutputChannels
);
this.streamSource.connect(this.scriptProcessor);
this.scriptProcessor.connect(this.audioContext.destination);
this.scriptProcessor.onaudioprocess = (event) => {
const samples = event.inputBuffer.getChannelData(0);
const output = to16kHz(samples, this.audioContext.sampleRate);
const sum = samples.reduce((acc, curr) => acc + curr * curr, 0);
const rms = Math.sqrt(sum / samples.length);
const db = rmsToDb(rms);
const volume = getVolumeShownPercent(db);
const audioBuffer = to16BitPCM(output);
const aiDb = calculateRMS(convertFloat32ToInt16(samples));
const data = {
db,
rms,
aiDb,
volume,
buffer: audioBuffer,
};
this.processCallback?.(data);
};
} catch (e) {
console.log("AudioProcessor scriptNodeDealAudioData 错误原因:", e);
}
}
async audioWorkletNodeDealAudioData() {
if (!isSupportAudioWorklet(this.audioContext)) {
return;
}
try {
      const myAudioWorkletProcessorBlobURL = window.URL.createObjectURL(
        new Blob([myAudioWorkletProcessorCode], { type: "text/javascript" })
      );
      await this.audioContext.audioWorklet.addModule(
        myAudioWorkletProcessorBlobURL
      );
const myAudioWorkletNode = new AudioWorkletNode(
this.audioContext,
"my-audio-worklet-processor",
{
numberOfInputs: 1,
numberOfOutputs: 1,
channelCount: 1,
}
);
this.streamSource
.connect(myAudioWorkletNode)
.connect(this.audioContext.destination);
      // If the worklet processor errors at runtime, fall back to ScriptProcessorNode.
      myAudioWorkletNode.onprocessorerror = () => {
        this.scriptNodeDealAudioData();
      };
      myAudioWorkletNode.port.onmessageerror = () => {
        this.scriptNodeDealAudioData();
      };
myAudioWorkletNode.port.onmessage = (event) => {
const { db, rms, vad, volume, audioBuffer } = event.data;
console.log("vad", vad);
const data = {
db,
rms,
vad,
volume,
buffer: audioBuffer,
};
this.processCallback?.(data);
};
} catch (e) {
console.log("AudioProcessor audioWorkletNodeDealAudioData 错误原因:", e);
}
}
}