Bayesian Speech Synthesisers Can Learn from Multiple Teachers

Anonymous Institution(s)
Anonymous submission for review
Text-to-Speech (TTS) is inherently a "one-to-many" mapping with intrinsic uncertainty, yet current paradigms often oversimplify it into a deterministic regression task. While continuous-valued autoregressive (AR) models have recently emerged as a promising alternative to discrete codec-based approaches, they typically rely on a fixed-variance prior, fundamentally constraining generation to a static point estimate that ignores the dynamic variability of natural speech. To bridge this gap, we propose BELLE (Bayesian evidential learning with language modelling), a framework that shifts from deterministic prediction to principled Bayesian inference without increasing model parameters or inference latency. By modeling the acoustic target as a Normal-Inverse-Gamma distribution, BELLE captures data-dependent aleatoric uncertainty. To enable accurate variance estimation on standard single-reference datasets, we introduce a "one-to-many" training strategy that uses synthetic samples from multiple teacher models as a statistical support set, allowing the model to learn robust distributional properties rather than merely imitate teacher artifacts. Experiments demonstrate that BELLE, trained on only ~5k hours of data, outperforms leading open-source models trained on 50k hours (a 25.8% relative WER reduction) and naturally supports high-quality streaming generation.
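To make the abstract concrete, below is a minimal sketch of what a Normal-Inverse-Gamma (NIG) output head and its evidential negative log-likelihood look like, following the standard deep evidential regression formulation (Amini et al., 2020). The class name, the mel dimension, the activation choices, and the one_to_many_loss batching convention are illustrative assumptions, not BELLE's actual implementation.

```python
# Hedged sketch: an evidential (NIG) regression head of the kind the abstract
# describes. Names and hyperparameters are hypothetical; the NLL is the
# standard deep evidential regression loss (Amini et al., 2020).
import torch
import torch.nn as nn
import torch.nn.functional as F

class NIGHead(nn.Module):
    """Maps a hidden state to the four NIG parameters (gamma, nu, alpha, beta)."""
    def __init__(self, hidden_dim: int, mel_dim: int = 80):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, 4 * mel_dim)

    def forward(self, h: torch.Tensor):
        gamma, log_nu, log_alpha, log_beta = self.proj(h).chunk(4, dim=-1)
        nu = F.softplus(log_nu)             # nu > 0
        alpha = F.softplus(log_alpha) + 1.0 # alpha > 1 so E[sigma^2] is finite
        beta = F.softplus(log_beta)         # beta > 0
        return gamma, nu, alpha, beta

def nig_nll(y, gamma, nu, alpha, beta):
    """Per-element NLL of y under the Student-t predictive induced by the NIG prior."""
    omega = 2.0 * beta * (1.0 + nu)
    return (0.5 * torch.log(torch.pi / nu)
            - alpha * torch.log(omega)
            + (alpha + 0.5) * torch.log(nu * (y - gamma) ** 2 + omega)
            + torch.lgamma(alpha) - torch.lgamma(alpha + 0.5))

def one_to_many_loss(y_set, gamma, nu, alpha, beta):
    """Hedged sketch of the "one-to-many" idea: score several synthetic renditions
    y_1..y_K of the same text under one predicted NIG distribution, so variance is
    estimated from a support set rather than a single reference."""
    # y_set: (K, T, mel_dim) support set; NIG parameters: (T, mel_dim).
    losses = torch.stack([nig_nll(y, gamma, nu, alpha, beta).mean() for y in y_set])
    return losses.mean()

# Aleatoric (data) uncertainty is read off the same forward pass at no extra
# inference cost: E[sigma^2] = beta / (alpha - 1); epistemic: beta / (nu * (alpha - 1)).
```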

Data Pipeline & Model

Data Examples

Text-to-Speech Generation Comparison

Text 1: when she used to tell me about him i always wondered whether she wasn't a little in love with him
Audio samples: GroundTruth, MaskGCT, F5-TTS, MELLE, BELLE, BELLE Stream
Text 2: there was a unanimous groan at this and much reproach after which in his preoccupied way he explained
Audio samples: GroundTruth, MaskGCT, F5-TTS, MELLE, BELLE, BELLE Stream
Text 3: we will go out together to the bower there is a way down to the court from my window
Audio samples: GroundTruth, MaskGCT, F5-TTS, MELLE, BELLE, BELLE Stream