Flow Matching VAE Latent Normalization for TTS Training | Deep Dive

A rigorous analysis of VAE latent statistics and channel-wise normalization for Flow Matching training in TTS, with practical insights from VoxFlash-TTS.

A technical post on CNBlogs examines an engineering problem in training Flow Matching models for text-to-speech: how the statistics of VAE latent representations affect training stability and output quality. The author derives the mean and variance of the input and velocity field distributions under Optimal Transport Conditional Flow Matching (OT-CFM) paths, then analyzes how the weight on the VAE's KL divergence term influences the dispersion of the latent point cloud. By analogy with the SNR mismatch theory from image generation, the post argues that per-channel normalization of the latents is a practical way to improve Flow Matching training. The analysis is grounded in a real TTS system, VoxFlash-TTS, which makes it directly relevant to practitioners building generative audio models. It is not a beginner tutorial: it is a theoretical and practical deep dive aimed at engineers and researchers working on flow-based generative models for speech and audio.
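To make the normalization argument concrete, here is a minimal NumPy sketch. It is not taken from the post or from VoxFlash-TTS. It assumes latents shaped (batch, channels, frames), the common OT-CFM convention with sigma_min = 0 (x_t = (1 - t) x_0 + t x_1, target velocity x_1 - x_0), Gaussian noise x_0 drawn independently of the data, and synthetic latents with made-up per-channel offsets and scales. All function names are hypothetical. Under those assumptions, the velocity target for a channel with data variance s^2 has variance about 1 + s^2, so unnormalized channels present the model with very different target scales.

```python
import numpy as np

def channel_stats(latents):
    """Per-channel mean and std over the batch and frame axes.

    latents: array of shape (N, C, T). Returns arrays of shape (1, C, 1).
    """
    mean = latents.mean(axis=(0, 2), keepdims=True)
    std = latents.std(axis=(0, 2), keepdims=True)
    return mean, std

def normalize(latents, mean, std, eps=1e-6):
    """Shift and scale each channel to roughly zero mean, unit variance."""
    return (latents - mean) / (std + eps)

def denormalize(z, mean, std, eps=1e-6):
    """Invert normalize() before handing latents to the VAE decoder."""
    return z * (std + eps) + mean

def ot_cfm_pair(x1, t, rng):
    """Sample (x_t, u_t) on the straight OT-CFM path, assuming sigma_min = 0."""
    x0 = rng.standard_normal(x1.shape)   # noise endpoint at t = 0
    xt = (1.0 - t) * x0 + t * x1         # linear interpolation
    ut = x1 - x0                         # constant target velocity
    return xt, ut

rng = np.random.default_rng(0)

# Synthetic stand-in for VAE latents: 4 channels with different
# offsets and scales (illustrative values, not measured from any model).
offsets = np.array([0.0, 2.0, -1.0, 0.5]).reshape(1, 4, 1)
scales = np.array([0.1, 1.0, 3.0, 0.5]).reshape(1, 4, 1)
latents = offsets + scales * rng.standard_normal((256, 4, 200))

mean, std = channel_stats(latents)
z = normalize(latents, mean, std)

# Per-channel variance of the velocity target, before and after normalization.
_, ut_raw = ot_cfm_pair(latents, 0.5, rng)
_, ut_norm = ot_cfm_pair(z, 0.5, rng)
raw_target_var = ut_raw.var(axis=(0, 2))    # roughly 1 + scales**2 per channel
norm_target_var = ut_norm.var(axis=(0, 2))  # roughly 2 for every channel

print("raw velocity variance per channel:       ", raw_target_var)
print("normalized velocity variance per channel:", norm_target_var)
```

In a real pipeline the statistics would presumably be computed once over the training set, stored alongside the model, and reused at inference, with `denormalize` applied to the sampled latents before decoding.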