Convert text to speech with Cognitive Services Speech in an Express.js app

01/18/2024

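In this tutorial, you add Cognitive Services Speech to an existing Express.js app so that it can convert text to speech with the Cognitive Services Speech service. Converting text to speech lets you provide audio without having to generate the audio manually.

This tutorial shows 3 different ways to convert text to speech with Azure Cognitive Services Speech:

Client JavaScript gets audio directly
Server JavaScript gets audio from a file (*.MP3)
Server JavaScript gets audio from an in-memory arrayBuffer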

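Application architecture

This tutorial takes a minimal Express.js app and adds functionality by using a combination of:

A new route for the server API that converts text to speech and returns an MP3 stream
A new route for an HTML form where you enter your information
A new HTML form, with JavaScript, that provides a client-side call to the Speech service

The application provides three different calls that convert text to speech:

The first server call creates a file on the server and then returns it to the client. You would typically use this for longer text, or for text that you need to serve more than once.
The second server call is for shorter text, which is held in memory before it is returned to the client.
The client call demonstrates a direct call to the Speech service by using the SDK. You might make this call if you have a client-only application without a server.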

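Prerequisites

Node.js LTS - installed on your local machine.
Visual Studio Code - installed on your local machine.
The Azure App Service extension for VS Code (installed from within VS Code).
Git - used to push to GitHub, which activates the GitHub action.
Use the Bash environment in Azure Cloud Shell.
If you prefer, install the Azure CLI to run CLI reference commands.
- If you're using a local installation, sign in with the Azure CLI by using the az login command. To finish the authentication process, follow the steps displayed in your terminal. For more sign-in options, see Sign in with the Azure CLI.
- When you're prompted, install Azure CLI extensions on first use. For more information about extensions, see Use extensions with the Azure CLI.
- Run az version to find the version and dependent libraries that are installed. To upgrade to the latest version, run az upgrade.

Download the sample Express.js repo

Using git, clone the Express.js sample repo to your local computer.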
```
git clone https://github.com/Azure-Samples/js-e2e-express-server
```
Change to the new directory for the sample.
```
cd js-e2e-express-server
```
Open the project in Visual Studio Code.
```
code .
```
Open a new terminal in Visual Studio Code and install the project dependencies.
```
npm install
```

Install the Cognitive Services Speech SDK for JavaScript

From the Visual Studio Code terminal, install the Azure Cognitive Services Speech SDK.

npm install microsoft-cognitiveservices-speech-sdk

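Create a Speech module for the Express.js app

To integrate the Speech SDK into the Express.js application, create a file named azure-cognitiveservices-speech.js in the src folder.

Add the following code to pull in the dependencies and create a function that converts text to speech.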

// azure-cognitiveservices-speech.js

const sdk = require('microsoft-cognitiveservices-speech-sdk');
const { Buffer } = require('buffer');
const { PassThrough } = require('stream');
const fs = require('fs');

/**
 * Node.js server code to convert text to speech
 * @returns stream
 * @param {*} key your resource key
 * @param {*} region your resource region
 * @param {*} text text to convert to audio/speech
 * @param {*} filename optional - best for long text - temp file for converted speech/audio
 */
const textToSpeech = async (key, region, text, filename)=> {
    
    // convert callback function to promise
    return new Promise((resolve, reject) => {
        
        const speechConfig = sdk.SpeechConfig.fromSubscription(key, region);
        speechConfig.speechSynthesisOutputFormat = 5; // mp3
        
        let audioConfig = null;
        
        if (filename) {
            audioConfig = sdk.AudioConfig.fromAudioFileOutput(filename);
        }
        
        const synthesizer = new sdk.SpeechSynthesizer(speechConfig, audioConfig);

        synthesizer.speakTextAsync(
            text,
            result => {
                
                const { audioData } = result;

                synthesizer.close();
                
                if (filename) {
                    
                    // return stream from file
                    const audioFile = fs.createReadStream(filename);
                    resolve(audioFile);
                    
                } else {
                    
                    // return stream from memory
                    const bufferStream = new PassThrough();
                    bufferStream.end(Buffer.from(audioData));
                    resolve(bufferStream);
                }
            },
            error => {
                synthesizer.close();
                reject(error);
            }); 
    });
};

module.exports = {
    textToSpeech
};

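Parameters - The file pulls in the dependencies for using the SDK, streams, buffers, and the file system (fs). The textToSpeech function takes four parameters. If a file name with a local path is sent, the text is converted to an audio file. If a file name isn't sent, an in-memory audio stream is created.
Speech SDK method - The Speech SDK method synthesizer.speakTextAsync returns different types, based on the configuration it receives. The result differs based on what the method was asked to do:
- Create a file
- Create an in-memory stream as an array of buffers
Audio format - The audio format selected is MP3, but other formats exist, along with other audio configuration methods.

The local method, textToSpeech, wraps the SDK callback function and converts it into a promise.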
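
The callback-to-promise wrapping and the in-memory stream step can be tried without an Azure key. The following is a minimal sketch that uses only Node.js built-ins. The names createFakeSynthesizer, speakToStream, and streamToBuffer are hypothetical, and the fake object is not the Speech SDK: it only mimics the (text, onSuccess, onError) shape of a callback-style method and treats the text's UTF-8 bytes as the "audio".

```javascript
const { PassThrough } = require('stream');

// Hypothetical stand-in for an SDK object with a callback-style method.
// It is NOT the Speech SDK; it only mimics the (text, onSuccess, onError) shape.
const createFakeSynthesizer = () => ({
    closed: false,
    speakTextAsync(text, onSuccess, onError) {
        if (typeof text !== 'string' || text.length === 0) {
            onError(new Error('text is required'));
            return;
        }
        // pretend the "audio" is simply the UTF-8 bytes of the text
        const bytes = Uint8Array.from(Buffer.from(text, 'utf8'));
        onSuccess({ audioData: bytes.buffer });
    },
    close() {
        this.closed = true;
    }
});

// Wrap the callback pair in a promise that resolves with a readable stream.
const speakToStream = (synthesizer, text) =>
    new Promise((resolve, reject) => {
        synthesizer.speakTextAsync(
            text,
            result => {
                synthesizer.close();
                const stream = new PassThrough();
                // end() writes the only chunk and marks the stream as finished
                stream.end(Buffer.from(result.audioData));
                resolve(stream);
            },
            error => {
                synthesizer.close();
                reject(error);
            }
        );
    });

// Collect a stream into one Buffer so the result is easy to inspect.
const streamToBuffer = async stream => {
    const chunks = [];
    for await (const chunk of stream) {
        chunks.push(chunk);
    }
    return Buffer.concat(chunks);
};
```

Because the wrapper closes the synthesizer on both the success and the error path, a caller can use await with try/catch and does not need to manage the synthesizer's lifetime.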

Create a new route for the Express.js app

Open the src/server.js file.
Add the azure-cognitiveservices-speech.js module as a dependency at the top of the file:
```
const { textToSpeech } = require('./azure-cognitiveservices-speech');
```

Add a new API route to call the textToSpeech method created in the previous section of this tutorial. Add this code after the /api/hello route.

// creates a temp file on the server, then streams it to the client
app.get('/text-to-speech', async (req, res, next) => {

    const { key, region, phrase, file } = req.query;

    // stop here when a required parameter is missing
    if (!key || !region || !phrase) {
        return res.status(400).send('Invalid query string');
    }

    let fileName = null;

    // stream from file or memory; query string values arrive as strings
    if (file === 'true') {
        // the ./temp folder must already exist
        fileName = `./temp/stream-from-file-${Date.now()}.mp3`;
    }

    try {
        const audioStream = await textToSpeech(key, region, phrase, fileName);
        res.set({
            'Content-Type': 'audio/mpeg',
            'Transfer-Encoding': 'chunked'
        });
        audioStream.pipe(res);
    } catch (err) {
        // pass synthesis errors to the Express error handler
        next(err);
    }
});

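The method takes the required and optional parameters for the textToSpeech method from the query string. If a file needs to be created, a unique file name is generated. The textToSpeech method is called asynchronously, and the result is piped to the response (res) object.

Update the client web page with a form

Update the client HTML web page with a form that collects the required parameters. The optional parameter is passed in based on which audio control the user selects. Because this tutorial provides a mechanism to call the Azure Speech service from the client, that JavaScript is also provided.

Open the /public/client.html file and replace its contents with the following: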
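
The parameter handling that the route depends on can be sketched in isolation, as a hedged illustration. The helper name parseSpeechQuery and its return shape are hypothetical, and the file naming scheme is an assumption modeled on the route. The point it shows is that query string values are always strings, so the optional file flag has to be compared with the string 'true'.

```javascript
// Hypothetical helper: turn a raw query string into the arguments the route needs.
const parseSpeechQuery = (rawQuery, now = Date.now()) => {
    const params = new URLSearchParams(rawQuery);
    const key = params.get('key');
    const region = params.get('region');
    const phrase = params.get('phrase');

    // all three values are required
    if (!key || !region || !phrase) {
        return { valid: false };
    }

    // query string values are always strings, so compare with 'true', not true
    const useFile = params.get('file') === 'true';

    // a timestamp keeps file names distinct in most cases (assumed naming scheme)
    const fileName = useFile ? `./temp/stream-from-file-${now}.mp3` : null;

    return { valid: true, key, region, phrase, fileName };
};
```

Keeping this logic in a pure function makes it possible to check the file and memory branches without starting the server or calling the Speech service.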

<!DOCTYPE html>
<html lang="en">

<head>
  <title>Microsoft Cognitive Services Demo</title>
  <meta charset="utf-8" />
</head>

<body>

  <div id="content" style="display:none">
    <h1 style="font-weight:500;">Microsoft Cognitive Services Speech </h1>
    <h2>npm: microsoft-cognitiveservices-speech-sdk</h2>
    <table width="100%">
      <tr>
        <td></td>
        <td>
          <a href="https://docs.microsoft.com/azure/cognitive-services/speech-service/get-started" target="_blank">Azure
            Cognitive Services Speech Documentation</a>
        </td>
      </tr>
      <tr>
        <td align="right">Your Speech Resource Key</td>
        <td>

          <input id="resourceKey" type="text" size="40" placeholder="Your resource key (32 characters)" value=""
            onblur="updateSrc()">
        </td>
      </tr>
      <tr>
        <td align="right">Your Speech Resource region</td>
        <td>
          <input id="resourceRegion" type="text" size="40" placeholder="Your resource region" value="eastus"
            onblur="updateSrc()">

        </td>
      </tr>
      <tr>
        <td align="right" valign="top">Input Text (max 255 char)</td>
        <td><textarea id="phraseDiv" style="display: inline-block;width:500px;height:50px" maxlength="255"
            onblur="updateSrc()">all good men must come to the aid</textarea></td>
      </tr>
      <tr>
        <td align="right">
          Stream directly from Azure Cognitive Services
        </td>
        <td>
          <div>
            <button id="clientAudioAzure" onclick="getSpeechFromAzure()">Get directly from Azure</button>
          </div>
        </td>
      </tr>

      <tr>
        <td align="right">
          Stream audio from file on server</td>
        <td>
          <audio id="serverAudioFile" controls preload="none" onerror="DisplayError()">
          </audio>
        </td>
      </tr>

      <tr>
        <td align="right">Stream audio from buffer on server</td>
        <td>
          <audio id="serverAudioStream" controls preload="none" onerror="DisplayError()">
          </audio>
        </td>
      </tr>
    </table>
  </div>

  <!-- Speech SDK reference sdk. -->
  <script
    src="https://cdn.jsdelivr.net/npm/microsoft-cognitiveservices-speech-sdk@latest/distrib/browser/microsoft.cognitiveservices.speech.sdk.bundle-min.js">
    </script>

  <!-- Speech SDK USAGE -->
  <script>
    // status fields and start button in UI
    var phraseDiv;
    var resultDiv;

    // subscription key and region for speech services.
    var resourceKey = null;
    var resourceRegion = "eastus";
    var authorizationToken;
    var SpeechSDK;
    var synthesizer;

    var phrase = "all good men must come to the aid"
    var queryString = null;

    var audioType = "audio/mpeg";
    var serverSrc = "/text-to-speech";

    document.getElementById('serverAudioStream').disabled = true;
    document.getElementById('serverAudioFile').disabled = true;
    document.getElementById('clientAudioAzure').disabled = true;

    // update src URL query string for Express.js server
    function updateSrc() {

      // input values
      resourceKey = document.getElementById('resourceKey').value.trim();
      resourceRegion = document.getElementById('resourceRegion').value.trim();
      phrase = document.getElementById('phraseDiv').value.trim();

      // server control - by file
      var serverAudioFileControl = document.getElementById('serverAudioFile');
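      // encode the values so spaces and reserved characters survive in the URL
      const fileQueryString = `file=true&region=${encodeURIComponent(resourceRegion)}&key=${encodeURIComponent(resourceKey)}&phrase=${encodeURIComponent(phrase)}`;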
      serverAudioFileControl.src = `${serverSrc}?${fileQueryString}`;
      console.log(serverAudioFileControl.src)
      serverAudioFileControl.type = "audio/mpeg";
      serverAudioFileControl.disabled = false;

      // server control - by stream
      var serverAudioStreamControl = document.getElementById('serverAudioStream');
      const streamQueryString = `region=${encodeURIComponent(resourceRegion)}&key=${encodeURIComponent(resourceKey)}&phrase=${encodeURIComponent(phrase)}`;
      serverAudioStreamControl.src = `${serverSrc}?${streamQueryString}`;
      console.log(serverAudioStreamControl.src)
      serverAudioStreamControl.type = "audio/mpeg";
      serverAudioStreamControl.disabled = false;

      // client control
      var clientAudioAzureControl = document.getElementById('clientAudioAzure');
      clientAudioAzureControl.disabled = false;

    }

    function DisplayError(error) {
      window.alert(JSON.stringify(error));
    }

    // Client-side request directly to Azure Cognitive Services
    function getSpeechFromAzure() {

      // authorization for Speech service
      var speechConfig = SpeechSDK.SpeechConfig.fromSubscription(resourceKey, resourceRegion);

      // new Speech object
      synthesizer = new SpeechSDK.SpeechSynthesizer(speechConfig);

      synthesizer.speakTextAsync(
        phrase,
        function (result) {

          // Success function

          // display status
          if (result.reason === SpeechSDK.ResultReason.SynthesizingAudioCompleted) {

            // load client-side audio control from Azure response
            audioElement = document.getElementById("clientAudioAzure");
            const blob = new Blob([result.audioData], { type: "audio/mpeg" });
            const url = window.URL.createObjectURL(blob);

          } else if (result.reason === SpeechSDK.ResultReason.Canceled) {
            // display Error
            throw (result.errorDetails);
          }

          // clean up
          synthesizer.close();
          synthesizer = undefined;
        },
        function (err) {

          // Error function

          // clean up before surfacing the error
          synthesizer.close();
          synthesizer = undefined;

          throw (err);
        });

    }

    // Initialization
    document.addEventListener("DOMContentLoaded", function () {

      var clientAudioAzureControl = document.getElementById("clientAudioAzure");
      var resultDiv = document.getElementById("resultDiv");

      resourceKey = document.getElementById('resourceKey').value;
      resourceRegion = document.getElementById('resourceRegion').value;
      phrase = document.getElementById('phraseDiv').value;
      if (!!window.SpeechSDK) {
        SpeechSDK = window.SpeechSDK;
        clientAudioAzureControl.disabled = false;

        document.getElementById('content').style.display = 'block';
      }
    });

  </script>
</body>

</html>

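Highlighted lines in the file:

Line 74: The Azure Speech SDK is pulled into the client library, using the cdn.jsdelivr.net site to deliver the NPM package.
Line 102: The updateSrc method updates the audio controls' src URL with the query string, including the key, region, and text.
Line 137: If a user selects the Get directly from Azure button, the web page calls directly to Azure from the client page and processes the result.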
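
One way to build such a src URL so that any text survives the trip through the query string is sketched here. The helper name buildAudioSrc is hypothetical and is not part of the sample; it relies on the standard URLSearchParams class, which is available in browsers and in Node.js.

```javascript
// Hypothetical helper: build the audio control's src URL with encoded values.
const buildAudioSrc = (serverSrc, { key, region, phrase, file = false }) => {
    const params = new URLSearchParams();
    if (file) {
        params.set('file', 'true');
    }
    params.set('region', region);
    params.set('key', key);
    params.set('phrase', phrase);
    // URLSearchParams percent-encodes reserved characters such as & and #
    return `${serverSrc}?${params.toString()}`;
};
```

For example, a phrase such as "salt & pepper" is encoded as salt+%26+pepper, so the ampersand is not mistaken for a parameter separator when the server parses the query string.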

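Create a Cognitive Services Speech resource

Create the Speech resource with Azure CLI commands in Azure Cloud Shell.

Sign in to Azure Cloud Shell. This requires you to authenticate in a browser with an account that has permission on a valid Azure subscription.

Create a resource group for your Speech resource.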

az group create \
    --location eastus \
    --name tutorial-resource-group-eastus

Create a Speech resource in the resource group.

az cognitiveservices account create \
    --kind SpeechServices \
    --location eastus \
    --name tutorial-speech \
    --resource-group tutorial-resource-group-eastus \
    --sku F0

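This command fails if your one and only free Speech resource has already been created.

Use the command to get the key values for the new Speech resource.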

az cognitiveservices account keys list \
    --name tutorial-speech \
    --resource-group tutorial-resource-group-eastus \
    --output table

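Copy one of the keys.

You use the key by pasting it into the web form of the Express app to authenticate to the Azure Speech service.

Run the Express.js app to convert text to speech

Start the app with the following bash command.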
```
npm start
```
Open the web app in a browser.
```
http://localhost:3000    
```
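Paste your Speech key into the highlighted text box.
Optionally, change the text to something new.
Select one of the three controls to begin the conversion to the audio format:
- Get directly from Azure - client-side call to Azure
- Audio control for audio from file
- Audio control for audio from buffer
You may notice a small delay between selecting the control and the audio playing.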

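Create a new Azure App Service in Visual Studio Code

From the command palette (Ctrl+Shift+P), type "create web" and select Azure App Service: Create New Web App...Advanced. You use the advanced command to have full control over the deployment, including the resource group, App Service plan, and operating system, rather than using the Linux defaults.
Respond to the prompts as follows:
- Select your Subscription account.
- For "Enter a globally unique name", enter a name such as my-text-to-speech-app.
  - Enter a name that's unique across all of Azure. Use only alphanumeric characters ("A-Z", "a-z", and "0-9") and hyphens ("-").
- Select tutorial-resource-group-eastus for the resource group.
- Select a runtime stack of a version that includes Node and LTS.
- Select the Linux operating system.
- Select "Create a new App Service plan" and provide a name such as my-text-to-speech-app-plan.
- Select the F1 free pricing tier. If your subscription already has a free web app, select the Basic tier.
- Select "Skip for now" for the Application Insights resource.
- Select the eastus location.
After a short time, Visual Studio Code notifies you that creation is complete. Close the notification with the X button.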

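Deploy the local Express.js app to the remote App Service in Visual Studio Code

With the web app in place, deploy your code from the local computer. Select the Azure icon to open the Azure App Service explorer, expand your subscription node, right-click the name of the web app you just created, and select "Deploy to Web App".
If there are deployment prompts, select the root folder of the Express.js app, select your subscription account again, and then select the name of the web app you created earlier, my-text-to-speech-app.
If you're prompted to run npm install when deploying to Linux, select "Yes" when asked to update your configuration to run npm install on the target server.
After the deployment is complete, select "Browse Website" in the prompt to view your freshly deployed web app.
(Optional): You can make changes to your code files, then use Deploy to Web App in the Azure App Service extension to update the web app.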

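Stream remote service logs in Visual Studio Code

View (tail) any output that the running app generates through calls to console.log. This output appears in the Output window in Visual Studio Code.

In the Azure App Service explorer, right-click your new app node and select "Start Streaming Logs".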
```
 Starting Live Log Stream ---
 
```
Refresh the web page a few times in the browser to see additional log output.
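
To have something to watch in the log stream, the app needs to write to console.log on each request. The following is a minimal sketch of an Express-style middleware that does so; the name createRequestLogger is hypothetical and is not part of the sample repo. It logs req.path rather than the full URL because, in this tutorial, the query string carries the resource key.

```javascript
// Hypothetical Express-style middleware: logs one line per request.
// The (req, res, next) signature is what Express passes to middleware.
const createRequestLogger = (log = console.log) => (req, res, next) => {
    // req.path omits the query string, which here would contain the resource key
    log(`${new Date().toISOString()} ${req.method} ${req.path}`);
    next();
};
```

Assuming the Express application object in src/server.js is named app, the middleware could be registered with app.use(createRequestLogger()) before the routes are defined.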

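Clean up resources by removing the resource group

After you complete this tutorial, remove the resource group that contains the resource so that you aren't billed for any further usage.

In Azure Cloud Shell, use the Azure CLI command to delete the resource group: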

az group delete --name tutorial-resource-group-eastus  -y

This command might take a few minutes.

Next steps

Deploy an Express.js MongoDB app to App Service
