Phrase break prediction for Text-to-Speech Systems

This page presents synthesized audio samples of children's stories accompanying the repository anandaswarup/phrase_break_prediction, which contains code to train speaker-independent phrasing models for English Text-to-Speech (TTS) systems. In text, phrase breaks are usually represented by punctuation; typically, a TTS system inserts a phrase break in the synthesized speech whenever it encounters a comma in the input text. The codebase currently supports two models:

  1. A BLSTM token-classification model using task-specific word embeddings trained from scratch
  2. A fine-tuned BERT model with a token-classification head
Given unpunctuated text as input, these models insert commas, and the punctuated text is then passed to the TTS system for synthesis. The models are trained on the LibriTTS alignments provided at kan-bayashi/LibriTTSLabel: the train-clean-360 split is used for training, while the dev-clean and test-clean splits are used for validation and testing respectively.
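As a rough illustration of the first model, the sketch below shows what a BLSTM token-classification architecture for this task might look like in PyTorch: each word is embedded with task-specific embeddings learned from scratch, passed through a bidirectional LSTM, and classified as "comma follows" or "no comma follows". All class and parameter names here are hypothetical, not taken from the repository.

```python
# Hypothetical sketch (not the repository's actual code) of a BLSTM
# token-classification model for phrase-break prediction.
import torch
import torch.nn as nn

class BLSTMPhraseBreakTagger(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256, num_labels=2):
        super().__init__()
        # Task-specific word embeddings trained from scratch
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.blstm = nn.LSTM(embed_dim, hidden_dim,
                             batch_first=True, bidirectional=True)
        # Per-token head: predict comma / no comma after each word
        self.classifier = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) word indices
        embedded = self.embedding(token_ids)
        hidden, _ = self.blstm(embedded)
        return self.classifier(hidden)  # (batch, seq_len, num_labels)

model = BLSTMPhraseBreakTagger(vocab_size=10000)
logits = model(torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # one logit pair per token: (2, 12, 2)
```

The fine-tuned BERT variant replaces the embedding and BLSTM layers with a pretrained BERT encoder, keeping the same per-token classification head.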
The samples presented on this page are synthesized from text using an End-to-End TTS system based on the Tacotron2 model, with the following modifications:
  1. A Tacotron2 model with Dynamic Convolutional Attention, which modifies the hybrid location-sensitive attention mechanism to be purely location-based, resulting in better generalization on long utterances. This model takes text (as a character sequence) as input and predicts a sequence of mel-spectrogram frames as output.
  2. A WaveRNN-based vocoder, which takes the mel-spectrogram predicted in the previous step as input and generates a waveform as output.
The End-to-End TTS system was trained on the LJSpeech dataset; code for training it can be found at anandaswarup/TTS. In all samples below, no punctuation refers to unpunctuated text synthesized by the TTS system described above, while blstm and bert refer to text punctuated by the blstm and bert models respectively and then synthesized by the same TTS system.
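The overall flow described above can be sketched as follows: the phrasing model emits one break/no-break decision per word, commas are inserted accordingly, and the punctuated string is handed to the TTS system. The function and variable names below are illustrative only, and the prediction vector is made up for the example.

```python
# Hypothetical sketch of the punctuate-then-synthesize flow.
def insert_phrase_breaks(words, break_after):
    # break_after[i] == 1 means the phrasing model predicted a comma
    # after words[i]; the punctuated text then goes to the TTS system.
    out = []
    for word, brk in zip(words, break_after):
        out.append(word + ("," if brk else ""))
    return " ".join(out)

words = "once upon a time there lived a king".split()
predicted = [0, 0, 0, 1, 0, 0, 0, 0]  # made-up model output
punctuated = insert_phrase_breaks(words, predicted)
print(punctuated)  # once upon a time, there lived a king
```

In the no punctuation condition, this step is skipped and the raw word sequence is synthesized directly.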

Audio samples: no punctuation · blstm · bert