Paper notes: Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback
Abstract
In recent years, text-to-speech (TTS) technology has witnessed impressive advancements, particularly with large-scale training datasets, showcasing human-level speech quality and impressive zero-shot capabilities on unseen speakers. However, despite human subjective evaluations, such as the mean opinion score (MOS), remaining the gold standard for assessing the quality of synthetic speech, even state-of-the-art TTS approaches have kept human feedback isolated from training, resulting in mismatched training objectives and evaluation metrics. In this work, we investigate a novel topic of integrating subjective human evaluation into the TTS training loop. Inspired by the recent success of reinforcement learning from human feedback, we propose a comprehensive sampling-annotating-learning framework tailored to TTS optimization, namely uncertainty-aware optimization (UNO). Specifically, UNO eliminates the need for a reward model or preference data by directly maximizing the utility of speech generations while considering the uncertainty that lies in the inherent variability in subjective human speech perception and evaluations. Experimental results of both subjective and objective evaluations demonstrate that UNO considerably improves the zero-shot performance of TTS models in terms of MOS, word error rate, and speaker similarity. Additionally, we show that UNO can seamlessly and flexibly adapt to the desired speaking style in emotional TTS.
(Reader's note: in the usual pipeline, MOS is only collected at evaluation time after training has finished; it is never used as a training signal, which is exactly the training/evaluation mismatch this paper targets.)
1 Introduction
Artificial intelligence-generated content (AIGC) has attracted a surge of interest in both academia and industry, revolutionizing the way we acquire and generate information [11]. In this context, learning from human feedback plays a pivotal role in calibration—it aims to align the generative models with human preferences. For instance, the reinforcement learning from human feedback (RLHF) technique can effectively help large language models (LLMs) to avoid generating harmful and toxic content [1, 5], which is crucial to the success of helpful systems like ChatGPT. Similar methods have recently been employed in text-to-image generation [37, 62].
As an important task of AIGC, text-to-speech (TTS) synthesis technology is undergoing rapid development driven by deep learning models [6, 49]. Recent TTS works [35, 30, 47, 56] with extensive text-speech training pairs exhibit remarkable zero-shot capacity, generating high-quality speech for speakers unseen during training. However, unlike the widespread application of RLHF in LLMs' calibration, aligning synthetic speech generation with human preferences remains challenging and has not yet been adopted in practice [71]. This contradicts the use of human subjective evaluation, such as the mean opinion score (MOS), as the gold standard for assessing TTS performance, and results in a clear mismatch between TTS training and evaluation. Motivated by this, we raise our basic research question: Can we integrate human feedback into the TTS learning loop?
The technical barrier of this topic stems from both sides of “TTS” and “human”. (1) First, existing TTS models typically learn a mapping function from input word/phoneme sequences to either mel-spectrograms or discrete audio tokens followed by high-resolution waveform generation based on supervised training [56]. However, this learned mapping hardly provides diverse yet plausible generations based on the same text and speech prompt, which hinders the formulation of pairwise preference data required by widely used LLMs alignment methods such as direct preference optimization (DPO) [48] (more discussion in Appendix B). (2) Second, human evaluations of speech quality are subjective and inherently personalised. Each individual’s acoustic perception is unique and influenced by their physical state and cognitive biases, thereby resulting in uncertainty in the human feedback of each sample.
To address the above challenges, we propose a pioneering method named uncertainty-aware optimization (UNO), which aims to enhance zero-shot TTS performance with human feedback. Inspired by the recent success of RLHF, UNO encompasses a sampling-annotating-learning pipeline, but its original design tailored to TTS lies in these three sub-steps:
- Sampling with diversity. To obtain representative training examples, UNO performs zero-shot TTS sampling with different speech prompts. It significantly contributes to the diversity in self-generated speech samples, thereby reducing the potential bias that arises from unrepresentative or skewed data collection.
- Annotating with uncertainty. UNO eliminates the dependency on preference data based on the same input. Instead, it allows a more flexible and tolerant data annotation: only a binary signal is required for whether the generated speech is desirable or not. Furthermore, since speech evaluation is subjective and personalised, this step encompasses the uncertainty caused by the individual differences among human annotators.
- UNO treats human feedback as a form of supervision with inconsistent labels to mitigate the mismatch between the TTS training objectives and MOS-like subjective evaluation metrics. This learning approach directly maximizes the utility of generations from TTS sampling, instead of relying on a reward model or maximizing the log-likelihood of preferences.
1. Diversity sampling
Problem with the conventional approach:
A conventional TTS model usually produces only one, or very similar, outputs for the same text.
UNO's solution:
Perform zero-shot TTS sampling with different speech prompts to obtain diverse training samples.
Concrete example:
Suppose the target text is "What a beautiful morning."
UNO generates diverse samples by pairing it with different speech prompts:
- Sample 1: young female voice prompt → bright, lively speech
- Sample 2: middle-aged male voice prompt → calm, low-pitched speech
- Sample 3: elderly voice prompt → slow, gentle speech
- Sample 4: child voice prompt → energetic, high-pitched speech
This diversity sampling reduces data bias and ensures the model learns a much wider range of vocal renditions.
2. Annotation with uncertainty
Problem with the conventional approach:
Conventional preference learning requires comparing and ranking different outputs of the same input (e.g., A is better than B).
UNO's solution:
Use a more flexible binary annotation while explicitly accounting for the subjective uncertainty of human evaluation.
Concrete example:
For the samples generated above, different raters may give different judgements:
Sample 1 (young female voice):
- Rater A: satisfied (clear and natural)
- Rater B: satisfied (likes the energy of the voice)
- Rater C: unsatisfied (intonation feels exaggerated)
Sample 2 (middle-aged male voice):
- Rater A: satisfied (authoritative and trustworthy)
- Rater B: unsatisfied (speaking rate too slow)
- Rater C: satisfied (likes the steadiness)
Instead of simply averaging these votes away, UNO keeps the disagreement as an intrinsic property of the data, yielding annotations that carry an uncertainty value (see the sketch below).
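A minimal sketch (not from the paper) of how such disagreement could be turned into a binary label plus an uncertainty value; the majority-vote rule and the rescaling are illustrative assumptions:

```python
import numpy as np

def annotate_with_uncertainty(votes):
    """votes: list of 0/1 decisions (1 = desirable) from several raters.

    Returns a binary label (majority vote) and an uncertainty value,
    here simply the variance of the votes rescaled to [0, 1]."""
    votes = np.asarray(votes, dtype=float)
    label = int(votes.mean() >= 0.5)          # majority decision
    uncertainty = float(votes.var()) * 4.0    # Bernoulli variance <= 0.25, rescale to [0, 1]
    return label, uncertainty

# Example: 2 of 3 raters found the sample desirable.
label, u = annotate_with_uncertainty([1, 1, 0])
print(label, round(u, 2))   # 1, 0.89 -> desirable, but with high disagreement
```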
3. Utility-maximization learning
Problem with the conventional approach:
Conventional methods either need a separately trained reward model or maximize the likelihood of preference pairs.
UNO's solution:
Directly maximize the utility of the sampled TTS generations, treating human feedback as supervision with inconsistent labels.
Concrete example:
Suppose the annotated data with uncertainty looks like:
- Sample 1: 70% of raters satisfied
- Sample 2: 60% of raters satisfied
- Sample 3: 40% of raters satisfied
- Sample 4: 90% of raters satisfied
UNO does not simply imitate Sample 4 (the highest satisfaction); instead it:
- takes the whole distribution of satisfaction into account, and
- learns a model that maximizes overall utility.
For instance, when synthesizing a news broadcast the model may favor a voice like Sample 2 (authoritative and trustworthy), even though it is not the single most preferred sample, while for a children's story it may lean toward a voice like Sample 4.
Experimental results show that UNO comprehensively enhances the performance of zero-shot TTS models, including speaker similarity (SIM), word error rate (WER), and pseudo-MOS estimated by three pre-trained models. For validation, we conduct subjective human listening tests in the form of naturalness MOS scoring and side-by-side A/B testing. The results of human evaluation confirm that the TTS model optimized through our method significantly outperforms the baseline (3.38 → 4.06) and sounds equally good compared to the ground truth speech. Through both token-level and utterance-level visualization, it is observed that UNO provides effective supervised signals, resulting in the distribution of generated content being closer to the ground truth distribution. Furthermore, UNO exhibits flexible scalability through adjusting optimization objectives. By changing the selection criteria during sampling, the method can be seamlessly extended to emotional TTS.
Our contributions are summarised as follows: (1) We present a comprehensive framework tailored to zero-shot TTS optimization, where human feedback is taken into account in the training objectives to alleviate the mismatch between training and evaluation. (2) By delving into the characteristics of speech synthesis, UNO eliminates the dependence on preference data and accommodates the uncertainty in human subjective evaluations. This provides a new perspective for high-dimensional modality generation and alignment. (3) Intensive experiments demonstrate that UNO brings remarkable performance gains to zero-shot TTS models, especially in avoiding most failed cases. Additionally, UNO can be seamlessly extended to emotional TTS, demonstrating its scalability and practical value.
2 Related Work
Text-to-speech as language modelling. Inspired by the success of LLMs [9], formatting the TTS task as next-token prediction has gained remarkable popularity in recent years [8, 50, 70]. Under this setup, the prior step is to convert the speech waveform into a sequence of learnable and discrete units based on vector quantization [21, 67]. SpeechTokenizer [72] and RepCodec [27] enhance the semantic tokenization by adding self-supervised embedding prediction-based losses [44]. With discrete acoustic tokens, TortoiseTTS [6] pioneers combining a speech-code language model with a diffusion decoder to achieve few-shot TTS. VALL-E [56] and Spear-TTS [62] scale up to 60k hours of training data using a pre-trained neural codec model [21], exhibiting remarkable zero-shot capacity to synthesize speech for unseen speakers given a speech prompt. VALL-E-X [73] and VioLa [57] extend this framework to cross-lingual TTS, and later works [41, 29, 39, 64] control the style of speech synthesis based on the neural codec. More recent work [47] extends the TTS framework to address speech editing tasks. BaseTTS [55] built the first one-billion-parameter TTS model based on a decoder-only structure. RALL-E [61] presents a robust language modeling approach for zero-shot TTS.
Learning from Human Feedback. Human feedback has been widely used in language models for NLP tasks, such as text translation [33], instruction-following [45], and summarization [52]. Advancements in the RLHF framework have contributed to training helpful and harmless AI agents aligned with human preference in recent years [16, 5, 1, 19]. Moreover, recent works like DPO [48] shift in favour of closed-form losses that directly operate on preference data. Different from typical preference-based RL [28, 10], DPO-style works remove the explicit reward model learning and provide the same alignment effect as typical RLHF [75, 3, 69, 40]. Moreover, [14, 66] propose to calibrate LLMs in a "self-rewarding" manner, and KTO [23] relieves the dependence on preference data and optimises LLMs using prospect theory [31].
Summary. To align TTS systems with human preference, we propose to consider the uncertainty of human subjective evaluations [42]: assessing speech involves more perspectives and subjectivity than assessing text [54, 58]. Moreover, UNO eliminates the dependence on pairwise preference data and does not require any corresponding ground truth speech (different from SpeechAlign [71]). Combining these factors, we regard UNO as a step towards incorporating human evaluation signals in TTS training and contributing to developing more powerful and versatile speech synthesis.
3 Background
Neural Codec Language Modeling for TTS
Speech synthesis aims to convert a sequence of transcript $t$ into a corresponding speech waveform $s$. This synthesis mapping can be formally expressed using a function $\pi$ parameterized by $\theta$ as $s = \pi_\theta(t)$. In this work, we regard the TTS problem as a conditional speech-codec-based language modelling task as proposed in [56]. $s \in \mathbb{R}^{L \times n}$ is tokenized into sequences of discrete acoustic units with a neural codec encoder, where $L$ is the downsampled utterance length and $n$ is the number of residual vector quantization (RVQ) codebooks. After quantization, the neural codec decoder is able to reconstruct the waveform.
Zero-shot TTS
Zero-shot TTS extends conventional TTS by enabling speech synthesis using voices unseen during model training. Given the target transcript $t$ and a short speech prompt $p$ as reference, zero-shot TTS is framed as a transcript-conditioned speech continuation task that uses a well-trained model $\theta = \theta_1 \cup \theta_2$ to predict the first layer of codebook $s_L^{(1)}$ with the corresponding content and speaker's voice using parameters $\theta_1$ in an autoregressive manner, and then predict the other codebooks $s_L^{(2:n)}$ using parameters $\theta_2$ in a non-autoregressive manner. This hierarchical structure is denoted as:
$$
\begin{aligned}
s_l^{(1)} &= \pi_{\theta_1}\big(p^{(1)},\, s_{<l}^{(1)},\, t\big), \quad l \in \{1, 2, \ldots, L\} \quad (1) \\
s_L^{(n)} &= \pi_{\theta_2}\big(p^{(1:n)},\, s_L^{(1:n-1)},\, t\big) \quad (2)
\end{aligned}
$$
where $p$ can be viewed as a prefix sequence during decoding, and $s_{<t}$ is the history predicted by the model. Typically, $s_L^{(1)}$ represents acoustic properties like speech content, while $s_L^{(2:n)}$ recovers fine acoustic details.
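As a rough illustration of this hierarchy, the sketch below decodes the first codebook autoregressively and then fills in the remaining codebooks in one non-autoregressive pass; `ar_model`, `nar_model`, and their call signatures are hypothetical placeholders, not the interface of any specific codec language model:

```python
import torch

def zero_shot_tts(ar_model, nar_model, prompt_codes, text_ids, max_len=500, eos_id=0):
    """prompt_codes: (T_p, n) codec tokens of the speech prompt;
    text_ids: token ids of the target transcript.

    Stage 1 (theta_1): autoregressively predict the first RVQ codebook.
    Stage 2 (theta_2): non-autoregressively predict codebooks 2..n."""
    first_layer = prompt_codes[:, 0].tolist()          # prefix = prompt's first codebook
    for _ in range(max_len):
        logits = ar_model(text_ids, torch.tensor(first_layer))   # next-token logits
        nxt = int(logits[-1].argmax())
        if nxt == eos_id:
            break
        first_layer.append(nxt)

    gen_first = torch.tensor(first_layer[prompt_codes.size(0):])  # drop the prompt prefix
    # Stage 2: predict remaining codebooks in parallel, conditioned on codebook 1
    rest = nar_model(text_ids, prompt_codes, gen_first)           # (T_gen, n-1)
    return torch.cat([gen_first.unsqueeze(-1), rest], dim=-1)     # (T_gen, n) codec tokens
```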
RLHF with Preference Data
Given a dataset $\mathcal{D}$ with preference data points $(x, y_w, y_l)$, where $y_w$ and $y_l$ are the win-loss generations based on the same input $x$, it is assumed that the probability that $y_w$ is preferred to $y_l$ can be captured by a "true" reward function $R^*$. Since obtaining $R^*$ from humans would be intractably expensive, prior RLHF work employs a reward model $R_\phi$ as a proxy trained by minimizing the negative log-likelihood of the human-annotated data:

$$\mathcal{L}_{R_\phi} = \mathbb{E}_{x, y_w, y_l \sim \mathcal{D}} \left[ -\log \sigma\big(R_\phi(x, y_w) - R_\phi(x, y_l)\big) \right], \quad (3)$$

where $\sigma$ is the logistic function. Furthermore, a reference model $\pi_{\text{ref}}$ with a KL divergence penalty is introduced to prevent the model $\pi_\theta$ from making radical updates; the maximization objective is:

$$\mathbb{E}_{x \in \mathcal{D},\, y \sim \pi_\theta} [R_\phi(x, y)] - \beta\, D_{\text{KL}}\big(\pi_\theta(y|x) \,\|\, \pi_{\text{ref}}(y|x)\big), \quad (4)$$

where $\beta$ is a balancing weight. Since the first term is non-differentiable for backpropagation, an RL algorithm like PPO is required to pursue maximum reward and optimize the policy network $\pi_\theta$.
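A minimal PyTorch sketch of the pairwise reward-model loss in Eq. (3); `reward_model` is a placeholder scoring network, not an implementation from the paper:

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, x, y_win, y_lose):
    """Eq. (3): negative log-likelihood of the Bradley-Terry preference model.
    reward_model(x, y) returns a scalar score per example."""
    r_w = reward_model(x, y_win)    # (batch,)
    r_l = reward_model(x, y_lose)   # (batch,)
    # -log sigmoid(r_w - r_l), averaged over the batch
    return -F.logsigmoid(r_w - r_l).mean()
```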
Variability in Human Evaluations
Variability is a unique aspect of real-world human evaluation. Individual variations in physical states, cognitive biases, and personal experiences can lead to subjectivity in perceptual quality assessment (e.g. TTS quality assessment). Instead of solely relying on mean opinions, we propose incorporating the variability present in human evaluation, which helps mitigate potential biases and promotes fairness and inclusivity. Prior approaches for modelling variability in human annotations can be broadly grouped into two types. The first approach explicitly models the behaviours of different annotators using different individual models [24, 15, 20], which is not scalable when the number of annotators increases. The second approach approximates subjective probability distributions using Markov chain Monte Carlo with people [51, 26], which requires human annotators to be dynamically involved in the process. In this work, a meta-learning framework is adopted for zero-shot human annotation distribution estimation. Given a synthesized utterance $s_i$ and a set of $M_i$ human annotations $\mathcal{D}_i = \{y_i^{(m)}\}_{m=1}^{M_i}$ associated with $s_i$, the simulator aims to model the conditional annotation distribution $p(y_i | s_i)$. For an unseen test utterance $s_*$, the simulator can then predict $p(y_* | s_*)$ to simulate human-like annotations $\mathcal{D}_* = \{y_*^{(m)}\}_{m=1}^{M_*}$ in a way that reflects how it would be labeled by human annotators. The framework involves meta-learning a deep neural network model to estimate $p(y_i | s_i)$ across all training data $\mathcal{D} = \{(s_i, \mathcal{D}_i)\}_{i=1}^N$, where $N$ is the number of training samples. The deep neural network model then serves as a distribution estimator to allow efficient generation of human-like annotations.
1. Neural codec language modelling (the "translator" of TTS)
Core task: convert text into a speech waveform.
Analogy: translating an English book into Chinese, where you first break it down into paragraphs, sentences and words, and then convert it layer by layer.
Step breakdown:
- Encoding (text → discrete units):
  - The text "你好" → converted by the encoder into a series of numeric codes (e.g. [12, 45, 78]), each corresponding to some aspect of the speech (such as pitch or timbre).
  - Similar to breaking "你好" into the pinyin "nǐ hǎo" and then mapping it to telegraph codes.
- Decoding (discrete units → speech):
  - The decoder turns the codes [12, 45, 78] back into the corresponding speech waveform.
  - Similar to recovering the sound signal from the telegraph codes.
Why it matters:
- Discretized codes make it easier for the model to learn the hierarchical structure of speech (decide the content first, then fill in the details).
2. Zero-shot TTS (a voice "impersonation show")
Core task: given only a short speech prompt from an unseen speaker (e.g. a 3-second recording), synthesize arbitrary text in that speaker's voice.
Example:
- Input:
  - Target text: "The weather is really nice today."
  - Reference speech prompt: a 3-second recording of a stranger saying "Hello, I'm Xiao Ming."
- Model operation:
  - Autoregressive content layer ($\theta_1$):
    - Based on the encoded reference and Xiao Ming's voice characteristics, generate the "skeleton" of the speech frame by frame (pronunciation rhythm, basic pitch).
    - Like drawing the outline of a sketch first.
  - Non-autoregressive detail layers ($\theta_2$):
    - Given the skeleton, fill in the details in parallel (breathing sounds, subtle timbre variations).
    - Like adding color and shading to the sketch.
Effect:
- The generated "The weather is really nice today." sounds as if Xiao Ming were speaking, even though the model was never trained on long recordings of Xiao Ming.
3. RLHF with preference data (a speech "beauty contest")
Core task: use human preference data (e.g. speech A is better than speech B) to train the model to generate speech that better matches human taste.
Example workflow:
- Collect preference data:
  - Generate two versions of the same text:
    - Version A: clear but with flat intonation.
    - Version B: slightly emotional but with a bit of noise.
  - A human annotator judges B to be better than A.
- Train the reward model:
  - Goal: teach the reward model to predict human preference (score B high and A low).
  - Eq. (3) essentially pushes the score margin "B minus A" to be as large as possible.
- Optimize the policy model:
  - Use reinforcement learning (e.g. PPO) to adjust the TTS parameters so that its outputs score higher under the reward model.
  - The KL penalty in Eq. (4) prevents the model from producing bizarre speech just to chase a high score (e.g. wildly exaggerated emotion).
Key points:
- The reward model is the "judge": it scores quickly in place of humans, but it depends on the quality of the annotated data.
- The KL penalty is the "safety rope": it stops the model from suddenly turning from a news anchor into a stage actor.
4. Variability in human evaluation (the "you can't please everyone" problem)
Core issue: different people may give opposite judgements of the same utterance.
Example:
- Speech samples: the sentence "This movie is fantastic!" rendered with two intonations:
  - Version C: excited and high-pitched.
  - Version D: mild and restrained.
- Human feedback:
  - Younger listeners: 70% rate C highly ("engaging"), 30% find D "flat".
  - Older listeners: 60% rate D highly ("not harsh"), 40% find C "noisy".
Solution:
- Meta-learn a predictor of the annotation distribution:
  - Training phase:
    - Collect multiple human ratings of the same utterance (e.g. C's ratings: 4.5 from younger listeners, 3.0 from older listeners).
    - Train a neural network that takes speech features as input and outputs a probability distribution over scores (rather than a single number).
  - Prediction phase:
    - For a new utterance, the model predicts something like "if 100 people rated this, about 30 would give 4, 50 would give 3, ..."
    - Similar to a weather forecast predicting a distribution of rainfall.
Advantages:
- Avoids letting the "average score" hide individual differences (a MOS of 3.5 may hide a polarized split).
- Provides a richer optimization signal (the model should satisfy the majority).
How the pieces connect: from codec to human-preference optimization
- Generation stage:
  - The neural codec turns text into discrete codes → zero-shot TTS produces diverse speech.
- Evaluation stage:
  - The meta-learned model predicts the distribution of human ratings → identifies which outputs are more likely to be preferred by most listeners.
- Optimization stage:
  - RLHF-style training adjusts the generation policy to favor speech that is likely to be preferred, while the KL penalty keeps the model stable.
Summary
- The neural codec is the "translation backbone" of TTS, and zero-shot capability makes it a skilled "impersonator".
- RLHF brings in "human judges" to guide the optimization, but data efficiency and bias still have to be addressed.
- Variability modelling lets the system understand that "tastes differ" and pursue a probabilistic optimum rather than an absolute one.
Together, these techniques push TTS from "being able to speak" toward "speaking pleasantly and naturally".
4 Methodology
4.1 Data Sampling and Annotating
In order to acquire representative data, we introduce a simple sampling strategy that encourages more diversified zero-shot TTS generation. Specifically, for each target transcript $t$, we sample a batch of speech prompts $\{p_1, p_2, \ldots, p_b\}$ of size $b$ from an unseen-speaker pool. Then $t$ is alternately combined with the $k$-th reference $p_k$ from the batch to form a different input to the TTS model via $s_k = \pi_\theta(t, p_k),\ k \in \{1, 2, \ldots, b\}$. Note that $p_k$ acts as a prefix sequence in autoregressive decoding and significantly contributes to the diversity of the target speech $s_k$.
After completing $b$ inferences for a batch, the desirable and undesirable samples are distinguished by humans and respectively stored in the $\mathcal{P}_{\text{pos}}$ and $\mathcal{P}_{\text{neg}}$ pools. Moreover, since human evaluations of generated speech are less intuitive than those of text, we record the uncertainty $u$ associated with each data point. When multiple annotators provide distinct decisions, $u$ can be the variance of the evaluations for this assessment. After sampling $K$ times, we finally obtain two pools with $I$ desirable and $J$ undesirable samples, respectively:
$$\mathcal{P}_{\text{pos}} = \{(t_i, p_i, s_i; u_i) \mid s_i \sim \pi_{\text{ref}}(t_i, p_i),\ u_i \in [0, 1],\ i = 1, 2, \ldots, I\}$$
$$\mathcal{P}_{\text{neg}} = \{(t_j, p_j, s_j; u_j) \mid s_j \sim \pi_{\text{ref}}(t_j, p_j),\ u_j \in [0, 1],\ j = 1, 2, \ldots, J\}$$
where $I$ and $J$ may not be equal. This indicates that the samples in $\mathcal{P}_{\text{pos}}$ and $\mathcal{P}_{\text{neg}}$ are not pairwise preference data, since they are based on diversified speech prompts. This significantly helps to increase the diversity of generated speech (see more analysis in Appendix C), thus potentially reducing the bias of data collection for the subsequent optimization process.
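A minimal sketch of this sampling-and-pooling step, assuming a callable TTS policy `tts`, a prompt pool, and an `annotate` function that returns a binary judgement plus an uncertainty (all placeholder names, not the authors' code):

```python
import random

def build_pools(tts, transcripts, prompt_pool, annotate, b=4):
    """For each transcript, synthesize with b different speech prompts,
    then route each (t, p, s, u) tuple into the positive or negative pool."""
    pos_pool, neg_pool = [], []
    for t in transcripts:
        prompts = random.sample(prompt_pool, b)        # diverse unseen-speaker prompts
        for p in prompts:
            s = tts(t, p)                              # zero-shot generation s_k = pi_theta(t, p_k)
            desirable, u = annotate(s)                 # binary feedback + uncertainty in [0, 1]
            (pos_pool if desirable else neg_pool).append((t, p, s, u))
    return pos_pool, neg_pool
```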
Considering the substantial human-resource consumption required for annotation, this paper utilizes anthropomorphic annotation simulators trained on the real human-labeled SOMOS dataset [42] for efficient generation of evaluation labels while simulating the variability of human opinions. Both a discriminative simulator, EDL [60], and a generative simulator, I-CNF [59], are used to simulate human decisions with uncertainty. EDL makes a Gaussian assumption on the conditional annotation distribution $p(y|s)$ and places a normal inverse-gamma (NIG) prior over the Gaussian likelihood to learn a higher-order prior distribution, also called the evidential distribution [4]:
$$\{y^{(m)}\}_{m=1}^M \sim \mathcal{N}(\mu, \sigma^2), \quad \mu \sim \mathcal{N}(\gamma, \sigma^2 v^{-1}), \quad \sigma^2 \sim \Gamma^{-1}(\alpha, \beta)$$
A deep neural network model is trained to predict the hyperparameters $\Omega = \{\gamma, v, \alpha, \beta\}$ of the NIG prior by maximizing the marginal likelihood of sampling from all possible Gaussians:
$$p(y|\Omega) = \int p(y|\Psi)\, p(\Psi|\Omega)\, \mathrm{d}\Psi = \mathrm{St}_{2\alpha}\!\left(y \,\Big|\, \gamma, \frac{\beta(1 + v)}{v\alpha}\right)$$
where $\Psi = \{\mu, \sigma\}$ are the parameters of the Gaussian likelihood and $\mathrm{St}_\nu(t|r, s)$ is the Student's t-distribution evaluated at $t$ with location parameter $r$, scale parameter $s$, and $\nu$ degrees of freedom. The predicted mean and variance can be computed analytically as $\mathbb{E}[y] = \gamma$ and $\operatorname{Var}[y] = \beta(1 + v)/\big(v(\alpha - 1)\big)$. The predicted variance is then used as the uncertainty. I-CNF removes the Gaussian assumption of EDL by meta-learning a conditional normalizing flow $p_\phi(y|s) = \int p_\phi(y|z)\, p_\Lambda(z|s)\, \mathrm{d}z$, where $z$ is a latent variable sampled from a Gaussian distribution conditioned on the input $s$. The mean and variance of the conditional Gaussian prior are parameterized by a neural network with parameters $\Lambda$ as $p_\Lambda(z|s) = \mathcal{N}\big(z \,|\, \mu_\Lambda(s), \operatorname{diag}(\sigma_\Lambda^2(s))\big)$. The simulated evaluation $y$ is obtained by a deterministic invertible transformation $p_\phi(y|z) = \delta(y - f_\phi(z))$, where $f_\phi(z)$ is parameterized by an invertible neural network $\phi$, and $\delta(\cdot)$ is the multivariate Dirac delta function. That is,
$$p(y|s) = \int \delta\big(y - f_\phi(z)\big)\, p_\Lambda(z|s)\, \mathrm{d}z = p_\Lambda\big(f_\phi^{-1}(y) \,\big|\, s\big)\left|\det\left(\frac{\partial f_\phi^{-1}(y)}{\partial y}\right)\right|$$
where $\det(\cdot)$ denotes the determinant operator and $\partial f_\phi^{-1}(y) / \partial y$ denotes the Jacobian matrix of $f_\phi^{-1}(y)$. This modelling choice has the advantage of a tractable marginal likelihood as in Eqn. (10) while not restricting the intermediate variable $y$ to a specific type of distribution. At test time, I-CNF can simulate human-like annotations for an unseen, unlabeled utterance $s_*$ by first drawing $\{z_*^{(m)}\}_{m=1}^{M_*} \sim p_\Lambda(z|s_*)$ from the conditional prior, and then applying the transformation $y_*^{(m)} = f_\phi(z_*^{(m)})$. The uncertainty can be computed as the variance of $\{y_*^{(m)}\}_{m=1}^{M_*}$. Since the sampling process can be batch processed, I-CNF allows efficient simulation of human evaluations.
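A minimal sketch (assuming already-trained simulator networks with placeholder names) of how each simulator yields a score and an uncertainty: EDL gives them analytically from the predicted NIG parameters, while I-CNF samples latent variables and takes the variance of the transformed samples:

```python
import torch

def edl_score_and_uncertainty(nig_head, speech_feat):
    """nig_head(speech_feat) -> (gamma, v, alpha, beta), the predicted NIG parameters."""
    gamma, v, alpha, beta = nig_head(speech_feat)
    mean = gamma                                   # E[y] = gamma
    var = beta * (1 + v) / (v * (alpha - 1))       # Var[y] = beta(1+v)/(v(alpha-1))
    return mean, var

def icnf_score_and_uncertainty(prior_net, flow, speech_feat, n_samples=32):
    """prior_net(speech_feat) -> (mu, sigma) of the conditional Gaussian prior;
    flow(z) -> simulated human score y = f_phi(z)."""
    mu, sigma = prior_net(speech_feat)
    z = mu + sigma * torch.randn(n_samples, *mu.shape)   # z ~ N(mu, diag(sigma^2))
    y = flow(z)                                          # simulated annotations
    return y.mean(dim=0), y.var(dim=0)                   # mean score, uncertainty
```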
4.2 Uncertainty-aware Learning for TTS
Why is DPO Eliminated? To optimize the KL-constrained RLHF objective given in Eqn. (4), DPO [48] presents a mathematically equivalent approach that maximizes the margin between the preferred and unpreferred generations based on the same input. Specifically, the training criterion can be written as follows under the TTS formulation:
$$\mathcal{L}_{\text{DPO-TTS}}(\pi_\theta, \pi_{\text{ref}}) = \mathbb{E}\left[-\log \sigma\left(\beta \log \frac{\pi_\theta(s_w|t, v)}{\pi_{\text{ref}}(s_w|t, v)} - \beta \log \frac{\pi_\theta(s_l|t, v)}{\pi_{\text{ref}}(s_l|t, v)}\right)\right] \quad (11)$$
where $s_w$ and $s_l$ are supposed to be the preferred and unpreferred speech obtained using the same transcript and speech prompt $(t, v)$. The sampling strategy introduced in Sec. 4.1 fails to provide the pairwise $s_w$ and $s_l$ required by Eqn. (11), as the data points in both $\mathcal{P}_{\text{pos}}$ and $\mathcal{P}_{\text{neg}}$ are generated using different speech prompts. In fact, due to the lack of diversity, existing TTS models struggle to generate paired $s_w$ and $s_l$ using fixed target transcripts and speech prompts. A solution recently proposed in [71] recalls ground-truth speech as $s_w$ while treating the speech generated by the model as $s_l$. A difficulty of this approach lies in the fact that the TTS model's outputs are not necessarily unpreferred, and the ground truth may not always be accessible in practice.
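For contrast, a minimal PyTorch sketch of the pairwise DPO-style loss above; `logp_policy` and `logp_ref` are placeholder functions returning sequence log-probabilities under the current and frozen reference models:

```python
import torch.nn.functional as F

def dpo_tts_loss(logp_policy, logp_ref, t, v, s_win, s_lose, beta=0.1):
    """Requires a preferred/unpreferred pair generated from the SAME (t, v),
    which is exactly what UNO's diversified sampling does not provide."""
    margin = beta * (logp_policy(s_win, t, v) - logp_ref(s_win, t, v)) \
           - beta * (logp_policy(s_lose, t, v) - logp_ref(s_lose, t, v))
    return -F.logsigmoid(margin).mean()
```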
Uncertainty-aware Optimization. To remove the dependence on preference data, a promising solution is to anchor a "reference optimization point" that is added or subtracted to get the relative gain or loss, respectively. To this end, we utilize the KL term $Z_{\text{ref}}$ introduced in KTO [23], which is defined as:
$$Z_{\text{ref}} = \mathbb{E}_{(t', p', s') \sim \mathcal{P}_{\text{pos}} \cup \mathcal{P}_{\text{neg}}}\big[\mathrm{KL}\big(\pi_\theta(s'|t', p') \,\|\, \pi_{\text{ref}}(s'|t', p')\big)\big]$$
where $(t', p', s')$ is sampled from each batch during training. $Z_{\text{ref}}$ is not involved in the backpropagation process, but it makes training more stable, playing a role like the baseline in REINFORCE [53]. With $Z_{\text{ref}}$ in place, we can directly maximize the utility of generations from $\mathcal{P}_{\text{pos}}$ and $\mathcal{P}_{\text{neg}}$ as follows, using a value function $V_{\text{TTS}}$ with the logistic function $\sigma$:
$$V_{\text{TTS}}(t, p, s; u) = \begin{cases} \sigma\big(u^{-1} \cdot R(t, p, s) - Z_{\text{ref}}\big), & \text{if } (t, p, s; u) \sim \mathcal{P}_{\text{pos}} \\ \sigma\big(Z_{\text{ref}} - u^{-1} \cdot R(t, p, s)\big), & \text{if } (t, p, s; u) \sim \mathcal{P}_{\text{neg}} \end{cases}$$
$$R(t, p, s) = \log \frac{\pi_\theta(s|t, p)}{\pi_{\text{ref}}(s|t, p)}$$
where $R(t, p, s)$ is the implicit reward under the RLHF objective in Eqn. (4), and the normalized inverse uncertainty $u^{-1}$ controls the magnitude of updates of the model $\pi_\theta$ away from $\pi_{\text{ref}}$, replacing the original hyper-parameter $\beta$ in DPO. The motivation behind this design is that reward allocation should consider the uncertainty of the human feedback associated with each sample. Intuitively, the model is allowed to update more aggressively given a desirable generation with low uncertainty, and conversely, the updates are more conservative when the uncertainty is high. Based on this, the optimization loss is written as:
$$\mathcal{L}_{\text{TTS}}(\pi_\theta, \pi_{\text{ref}}) = \mathbb{E}_{(t, p, s; u) \sim \mathcal{P}_{\text{pos}} \cup \mathcal{P}_{\text{neg}}}\big[1 - V_{\text{TTS}}(t, p, s; u)\big]$$
If $s$ is a desirable data point sampled from $\mathcal{P}_{\text{pos}}$, then its probability under $\pi_\theta$ is boosted to minimize the loss, but $Z_{\text{ref}}$ also increases. This forces the model to learn exactly what makes an output desirable, without dispensable updates away from $\pi_{\text{ref}}$.
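A minimal PyTorch sketch of this uncertainty-aware objective, assuming per-sample sequence log-probabilities under the policy and the frozen reference model are already available; the function and tensor names, the batch-level estimate of $Z_{\text{ref}}$ (in the spirit of KTO), and the epsilon used to normalize the inverse uncertainty are all assumptions of this sketch, not the authors' code:

```python
import torch

def uno_loss(logp_policy, logp_ref, desirable, uncertainty, eps=1e-3):
    """logp_policy, logp_ref: (batch,) log-probs of generated codec sequences
    under pi_theta and pi_ref; desirable: (batch,) bools from P_pos / P_neg;
    uncertainty: (batch,) values in [0, 1] from the annotation step."""
    r = logp_policy - logp_ref                       # implicit reward R(t, p, s)
    # Z_ref: batch-level reference point, detached so it does not receive gradients
    z_ref = (logp_policy - logp_ref).detach().mean().clamp(min=0)
    inv_u = 1.0 / (uncertainty + eps)                # low uncertainty -> larger update
    v_pos = torch.sigmoid(inv_u * r - z_ref)         # desirable samples
    v_neg = torch.sigmoid(z_ref - inv_u * r)         # undesirable samples
    v = torch.where(desirable, v_pos, v_neg)
    return (1.0 - v).mean()                          # L_TTS = E[1 - V_TTS]
```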
Core idea
The key to UNO is generating diverse speech samples instead of relying on paired preference data. This strategy has several important properties:
- Goal: obtain representative, diverse data samples
- Method: synthesize multiple versions of the same text with different speech prompts
- Benefit: more data diversity, less data-collection bias
Procedure
- Input preparation:
  - Target transcript: $t$ (e.g. "The weather is really nice today")
  - Speaker pool: speech samples from speakers the model has never seen
- Batch sampling:
  - Randomly pick $b$ different speech prompts $\{p_1, p_2, ..., p_b\}$ from the speaker pool
  - Each prompt can be a short clip from a different speaker
- Combined generation:
  - Pair the same transcript $t$ with each speech prompt $p_k$
  - Generate the corresponding output with the model: $s_k = \pi_\theta(t, p_k)$
  - This yields $b$ different speech versions that share the content but differ in voice characteristics
- Role of the prefix sequence:
  - The speech prompt $p_k$ acts as a "prefix sequence" during autoregressive decoding
  - The model therefore continues generation conditioned on that prompt's voice characteristics
  - Different prompts steer the model toward outputs with different voices
Concrete example
Suppose we want to synthesize the text "This is a test":
- Pick three speech prompts from the speaker pool:
  - $p_1$: a young woman saying "hello"
  - $p_2$: a middle-aged man saying "good morning"
  - $p_3$: an elderly person saying "welcome"
- Generate three versions:
  - $s_1 = \pi_\theta(t, p_1)$: "This is a test" in the young woman's voice
  - $s_2 = \pi_\theta(t, p_2)$: "This is a test" in the middle-aged man's voice
  - $s_3 = \pi_\theta(t, p_3)$: "This is a test" in the elderly voice
- These generated samples are then judged by humans and routed into the $\mathcal{P}_{\text{pos}}$ (good) or $\mathcal{P}_{\text{neg}}$ (bad) pool.
The advantage of this scheme is that it does not require paired preference data (good/bad outputs generated from the same prompt); instead, diversified prompts naturally produce samples of varying quality, which better matches practical usage.
Anthropomorphic annotation simulators
To reduce the cost of human annotation, two simulators trained on the SOMOS dataset are used:
Discriminative simulator (EDL)
- Probabilistic model
  Human ratings $y$ of an utterance $s$ are assumed to be Gaussian, but the parameters $\mu, \sigma^2$ are themselves uncertain. A normal inverse-gamma (NIG) distribution is placed over $(\mu, \sigma^2)$ as a joint prior:
  $$\mu \sim \mathcal{N}(\gamma, \sigma^2 v^{-1}), \quad \sigma^2 \sim \Gamma^{-1}(\alpha, \beta)$$
  where the hyperparameters $\Omega = \{\gamma, v, \alpha, \beta\}$ are predicted by a neural network.
- Marginal likelihood
  Integrating out $\mu, \sigma^2$ gives a location-scale Student's t marginal for $y$:
  $$p(y|\Omega) = \mathrm{St}_{2\alpha}\!\left(y \,\Big|\, \gamma, \frac{\beta(1 + v)}{v\alpha}\right)$$
  with mean $\mathbb{E}[y] = \gamma$ and variance $\operatorname{Var}[y] = \frac{\beta(1 + v)}{v(\alpha - 1)}$. Training maximizes this marginal likelihood, so the predicted $\gamma$ approaches the true mean rating while $\operatorname{Var}[y]$ reflects the evaluation uncertainty.
Generative simulator (I-CNF)
- Generative model
  A conditional normalizing flow models $p(y|s)$ directly, avoiding EDL's Gaussian assumption:
  - Latent variable: $z \sim \mathcal{N}\big(\mu_\Lambda(s), \operatorname{diag}(\sigma_\Lambda^2(s))\big)$, parameterized by a neural network $\Lambda$.
  - Invertible transform: the invertible function $y = f_\phi(z)$ maps $z$ to the rating space, so the density transforms as
    $$p(y|s) = p_\Lambda(z|s)\left|\det \frac{\partial f_\phi^{-1}(y)}{\partial y}\right|, \quad z = f_\phi^{-1}(y)$$
    where the Jacobian determinant corrects for the change of volume between the two spaces.
- Uncertainty computation
  For an unseen sample $s_*$, draw $\{z_*^{(m)}\}$ from $p_\Lambda(z|s_*)$, transform them via $y_*^{(m)} = f_\phi(z_*^{(m)})$ to obtain simulated ratings $\{y_*^{(m)}\}$, and take
  $$u_* = \operatorname{Var}\big(\{y_*^{(m)}\}\big)$$
  as the uncertainty.
Practical advantages
- EDL's analytic convenience
  - Thanks to the conjugacy of the NIG prior, the marginal likelihood is available in closed form, making training stable and efficient.
  - The predicted $\gamma$ serves directly as the expected rating, and $\operatorname{Var}[y]$ naturally quantifies uncertainty.
- I-CNF's flexibility
  - The invertibility of the flow gives exact density computation without approximation error.
  - The conditional latent distribution can capture complex dependencies, supporting multi-modal or asymmetric (non-Gaussian) rating distributions.
- Efficiency
  - EDL is cheap and works well when the rating distribution is close to Gaussian.
  - I-CNF simulates ratings efficiently via batched sampling while remaining expressive.
5 Experimental Setup
TTS Data
The data used in our experiments includes three parts: supervised pre-training for the TTS model, optimization with UNO, and evaluation. There are no overlapping speakers between them.
- The GigaSpeech dataset [12] is used as training data to train the supervised TTS model from scratch, which contains 9k hours of audiobooks, podcasts, and YouTube videos at a 16kHz audio sampling rate.
- The LibriTTS [68] dataset, which has no speaker overlap with GigaSpeech, is used for UNO. More specifically, we sample a pool of speech prompts consisting of audio files around 3 seconds (commonly used in zero-shot TTS studies), and then perform zero-shot TTS generation based on other target transcripts of more than 6 words. Notably, this process does not require the ground-truth speech of the target transcript.
- For evaluation, we use a subset from LibriSpeech test-clean [46] with the audio lengths between 4 and 10 seconds (keeping consistency with [12]), and select the 3-second audio files as speech prompt according to their speaker identities.
Models
We employ VoiceCraft [27] as the baseline model due to its demonstrated superior zero-shot TTS capability, where both the base (330M) and large (830M) pre-trained models are considered as starting points in subsequent experiments. The speech tokenizer is a pre-trained EnCodec with 4 RVQ codebooks and a vocabulary of size 2048. More details are introduced in Appendix D.
Objective Evaluation
Following prior studies, the metrics of WER and SIM are used in this work, which are calculated using pre-trained Whisper-medium.en and WavLM-TDCNN speech and speaker recognition models respectively. Furthermore, we use the MOSNet to estimate an objective MOS for reference, which is reported to have good generalization capability to out-of-domain data.
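A rough sketch of how such objective metrics could be computed with public checkpoints; the use of openai-whisper, jiwer, and the Hugging Face WavLM x-vector model here is an assumption of this sketch, not necessarily the exact tooling used by the paper:

```python
import torch, whisper, jiwer
from transformers import AutoFeatureExtractor, WavLMForXVector

asr = whisper.load_model("medium.en")
fe = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus-sv")
spk = WavLMForXVector.from_pretrained("microsoft/wavlm-base-plus-sv")

def wer(wav_path, reference_text):
    # WER between the reference transcript and the Whisper transcription
    hyp = asr.transcribe(wav_path)["text"]
    return jiwer.wer(reference_text.lower(), hyp.lower())

def speaker_sim(wav_a, wav_b, sr=16000):
    # SIM as cosine similarity between speaker embeddings of two waveforms
    inputs = fe([wav_a, wav_b], sampling_rate=sr, return_tensors="pt", padding=True)
    with torch.no_grad():
        emb = spk(**inputs).embeddings
    return torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=-1).item()
```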
Human Evaluation
We randomly sample 40 listening examples from each of the 4-6 second, 6-8 second, and 8-10 second ranges, in order to cover generations of different lengths. Then, these 120 synthetic speech samples are assessed by ten listeners for naturalness MOS evaluation. Listeners were tasked with rating the naturalness of each audio sample on a 5-point Likert scale, ranging from 1 (very unnatural) to 5 (completely natural). Furthermore, these 120 speech samples are randomly assigned to ten listeners for side-by-side A/B testing (12 samples per person). After listening to two samples with the same speech content, listeners were asked to decide which one sounded more natural, or to indicate a tie if they were too close to call.
Baselines
In addition to the well-trained VoiceCraft model obtained by typical supervised learning, we reproduce the following optimization approaches based on the VoiceCraft system for comparison:
- SpeechAlign-DPO: Proposed by [48], it adapts the DPO algorithm to the TTS task and achieves better performance than other alignment methods.
- SpeechAlign-ODPO: [3] presents an enhanced version of DPO that considers an offset. We use the difference between the estimated MOS of the ground truth and the MOS of the generated speech as the "offset" to achieve ODPO optimization.
- PPO-SDP: We apply PPO optimization by directly employing MOSNet as the reward model and the mean of MOS as the reward signal. Furthermore, as standard deviation is available, we implement the Standard Deviation-Based Penalty method proposed in [63].
- GroundTruth: Since ground truth waveforms of the evaluation set are available, we calculate their corresponding metrics for TTS reference.
Notably, SpeechAlign-DPO and SpeechAlign-ODPO require ground truth to serve as the positive samples $y_w$, thus resulting in an unfair comparison with our approach.
Figure 2 visualizes the effect of UNO (uncertainty-aware optimization) before and after optimization. t-SNE is used to project two kinds of embeddings of the generated speech (audio embeddings and codec embeddings) into the same space, so that the shift in the generation distribution can be seen directly.
- The audio embedding is the feature vector extracted by the TTS model's embedding layer; it mainly captures semantic information, mapping the speech signal to a low-dimensional vector so that semantically similar segments lie close together.
- The codec embedding is the feature vector extracted from the codec model; it carries both the content and the acoustic details of the speech, such as timbre and intonation.
Layout of the figure
- Token-level alignment: the distribution of generated speech at the token level; each point is a generated speech segment.
- Utterance-level alignment: the distribution at the utterance level; each point is a complete utterance.
- Audio emb.: extracted from the TTS model's embedding layer, focusing on semantics.
- Codec emb.: merges all RVQ embeddings from the codec, containing both content and acoustic detail.
Colors and markers
- Yellow (VoiceCraft): speech generated by the baseline model.
- Red (UNO): speech generated after UNO optimization.
- Blue (GroundTruth): real speech, used as the reference.
- Red circle: marks failure cases of zero-shot TTS.
Key observations
- Optimization effect: the red points (UNO) lie closer to the blue points (GroundTruth) than the yellow points (VoiceCraft) do, i.e. the distribution of UNO-optimized speech is closer to that of real speech.
- Fewer failure cases: in the utterance-level plot, a cluster of yellow VoiceCraft points falls inside the red circle, representing failure cases; after UNO optimization this cluster disappears, indicating improved robustness of zero-shot TTS.
Summary
Figure 2 gives an intuitive picture of UNO's effect: the t-SNE visualization shows that UNO not only pushes the generated speech closer to the ground truth but also removes most failure cases, improving the model's stability, which matters for deploying zero-shot TTS in practice.
We validate the performance gains with human listeners via naturalness mean-opinion-score (MOS) scoring and A/B testing, and report the MOS results in Table 2. Overall, the human evaluation agrees with the MOSNet estimates, which confirms the effectiveness of UNO. More importantly, we observe that UNO markedly improves the robustness of TTS by avoiding most failure cases. In addition, we add a control group "UNO-Human", in which humans perform both the annotation (otherwise done by I-CNF and EDL) and the evaluation; more details are in Appendix E. The MOS results show that UNO can effectively align TTS with these annotators' preferences.
From the table, UNO-ICNF clearly outperforms VoiceCraft and UNO-null in MOS and approaches the GroundTruth score, while its uncertainty is lower than VoiceCraft's, indicating that its generations are more stable.
6.4 Extension to Emotional TTS
Beyond MOS, we extend UNO to align with other human preferences, allowing TTS synthesis to be customized for different emotions. In practice, we build the optimization targets for emotional TTS by manipulating the samples placed in $\mathcal{P}_{\text{pos}}$ and $\mathcal{P}_{\text{neg}}$, based on an emotional-state model [43].
Valence: first, we use valence $v \in (0, 1)$ as an indicator to steer the TTS model toward speech with a "happy" emotion, which amounts to maximizing valence at generation time while keeping the MOS ($m$) high. Concretely, the corresponding experimental adjustments have two parts: (1) speech prompts are sampled from the "happy" (high $v$) and "sad" (low $v$) categories of the emotional ESD dataset [74], to encourage diversity of $v$ in the sampled generations; (2) samples with high valence and high MOS ($v^+, m^+$) are placed in $\mathcal{P}_{\text{pos}}$, and samples with low valence and high MOS ($v^-, m^+$) are placed in $\mathcal{P}_{\text{neg}}$. More details are given in Appendix F. The corresponding statistics of $v$ and $m$ are reported in the left part of Table 4; evaluated with "happy" prompts, UNO achieves an absolute gain of 0.12 (0.55 → 0.67). Surprisingly, 0.67 is even higher than the average $\bar{v}$ in $\mathcal{P}_{\text{pos}}$ (0.65), which suggests that UNO captures the desired speaking style to match human preference.
Arousal: we also use arousal $a$ to steer the model toward speech with a "surprised" emotion, where speech prompts are sampled from the "surprise" and "neutral" categories of the ESD dataset. Since these are not opposite emotions, the average $\bar{a}$ in $\mathcal{P}_{\text{neg}}$ is only 0.48. Nevertheless, as shown in the right part of Table 4, UNO effectively raises $a$ from 0.62 to 0.71 in the evaluation.
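A minimal sketch of this pool-selection rule for the valence case; `valence_of`, `mos_of`, and the thresholds are illustrative assumptions rather than values from the paper:

```python
def build_emotion_pools(samples, valence_of, mos_of,
                        v_hi=0.6, v_lo=0.4, m_min=4.0):
    """samples: list of (t, p, s) tuples from diversified sampling.
    High-valence + high-MOS -> P_pos; low-valence + high-MOS -> P_neg,
    so optimization pushes toward 'happy' speech without sacrificing quality."""
    pos_pool, neg_pool = [], []
    for t, p, s in samples:
        v, m = valence_of(s), mos_of(s)
        if m >= m_min and v >= v_hi:
            pos_pool.append((t, p, s))
        elif m >= m_min and v <= v_lo:
            neg_pool.append((t, p, s))
    return pos_pool, neg_pool
```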
Table structure
Table 4 has two parts, for valence (EmotionTTS-Valence) and arousal (EmotionTTS-Arousal). Each part reports the following key quantities:
- $\mathcal{P}_{\text{pos}}$ and $\mathcal{P}_{\text{neg}}$: the positive and negative pools.
- $\bar{v}$ and $\bar{a}$: the average valence and arousal.
- $\bar{m}$: the average MOS.
- before and after: results before and after optimization.
Results analysis
Valence
- Positive pool ($\mathcal{P}_{\text{pos}}$):
  - $\bar{v}$: 0.65 (average valence)
  - $\bar{m}$: 4.08 (average MOS)
- Negative pool ($\mathcal{P}_{\text{neg}}$):
  - $\bar{v}$: 0.36 (average valence)
  - $\bar{m}$: 4.04 (average MOS)
- Optimization result:
  - before: 0.55 (valence before optimization)
  - after: 0.67 (valence after optimization)
UNO clearly improves valence, from 0.55 to 0.67. This gain not only exceeds the negative pool's average (0.36) but even surpasses the positive pool's average (0.65), showing that UNO captures the preferred speaking style rather than merely imitating the positive pool.
Arousal
- Positive pool ($\mathcal{P}_{\text{pos}}$):
  - $\bar{a}$: 0.69 (average arousal)
  - $\bar{m}$: 4.05 (average MOS)
- Negative pool ($\mathcal{P}_{\text{neg}}$):
  - $\bar{a}$: 0.48 (average arousal)
  - $\bar{m}$: 4.20 (average MOS)
- Optimization result:
  - before: 0.62 (arousal before optimization)
  - after: 0.71 (arousal after optimization)
For arousal, UNO again performs well, raising arousal from 0.62 to 0.71. Even though the average arousal of the negative pool is lower (0.48), UNO still effectively increases the arousal of the generated speech, demonstrating its ability to strengthen emotional expression.
Summary
The results in Table 4 show that UNO brings clear gains in emotional speech synthesis:
- For valence, UNO improves it from 0.55 to 0.67, exceeding the positive pool's average (0.65).
- For arousal, UNO improves it from 0.62 to 0.71, indicating stronger emotional expressiveness.
These results confirm UNO's effectiveness and flexibility for emotional TTS: the system can be steered toward different emotions according to human preference.