End-to-End Adversarial Text-to-Speech


DeepMind: In our new paper we propose EATS: End-to-End Adversarial Text-to-Speech, which allows for speech synthesis directly from text or phonemes without the need for multi-stage training pipelines or additional supervision.

Sander Dieleman: Our latest work on GANs for text-to-speech, from characters/phonemes to waveforms with a single model. Learning varying alignment without teacher forcing is tricky, but we found dynamic time warping (DTW) to be very effective.
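
For readers unfamiliar with DTW, here is a minimal sketch of the textbook algorithm in Python. It is an illustration only, not the loss used in the EATS paper: the function name `dtw_cost`, the L1 frame distance, and the toy inputs are all assumptions made for this example. It shows the core idea, which is that two sequences of different lengths can be compared by finding the cheapest monotonic alignment between their frames.

```python
import numpy as np


def dtw_cost(a, b):
    """Classic DTW cost between two sequences (hypothetical helper).

    a, b: sequences of scalars or feature vectors, possibly of
    different lengths. Returns the total cost of the cheapest
    monotonic alignment, using L1 distance between frames
    (an assumption for this sketch).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Treat 1-D inputs as sequences of 1-dimensional frames.
    if a.ndim == 1:
        a = a[:, None]
    if b.ndim == 1:
        b = b[:, None]
    n, m = len(a), len(b)

    # Pairwise L1 distance between every frame of a and every frame of b.
    dist = np.abs(a[:, None, :] - b[None, :, :]).sum(axis=-1)

    # acc[i, j] = cheapest cost of aligning a[:i] with b[:j].
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j],      # advance in a only
                acc[i, j - 1],      # advance in b only
                acc[i - 1, j - 1],  # advance in both
            )
    return float(acc[n, m])


# The second sequence holds the value 2 for one extra step;
# DTW absorbs the timing difference at no cost.
print(dtw_cost([1, 2, 3], [1, 2, 2, 3]))
```

Because the cost depends only on the best alignment path, a prediction that is slightly early or late relative to the target is not penalised for the timing mismatch itself, which is the property that makes DTW attractive when alignment is not given in advance.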

Sander Dieleman: We've updated the EATS paper on arXiv. 'End-to-end' has many possible interpretations – Table 5 in the appendix (p. 21) describes some of the many ways in which the TTS pipeline has been factorised into stages in the literature, for easier comparison.


