Eric Chu

Adaptive Speech Synthesis for Tutoring Children

Motivation and Overview

Using phones for literacy learning is an empowering application of mobile technology, but current apps have yet to replicate some elements of a human tutor. For example, when reading a story, a tutor is likely to be more expressive and colorful in tone. When encountering a new word, a tutor might emphasize the vowel phoneme or stress a consonant pair the child has yet to master. By modeling speech with deep neural networks, our speech synthesizer will be able to interpolate between speaking styles, switching from ‘normal’ mode to ‘tutor’ mode as needed.
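To make the idea of interpolating between speaking styles concrete, here is a minimal sketch. It assumes each style is represented by a fixed-length vector that conditions the synthesizer. The project description does not say how styles are encoded, so the function name, the vector values, and the linear blend are all illustrative assumptions rather than the actual model.

```python
from typing import List, Sequence


def interpolate_style(
    normal: Sequence[float], tutor: Sequence[float], alpha: float
) -> List[float]:
    """Linearly blend two style vectors.

    alpha = 0.0 returns the 'normal' style, alpha = 1.0 the 'tutor' style.
    (Hypothetical helper; the real system may condition on style differently.)
    """
    if len(normal) != len(tutor):
        raise ValueError("style vectors must have the same length")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [(1.0 - alpha) * n + alpha * t for n, t in zip(normal, tutor)]


# Made-up 3-dimensional style codes, for illustration only.
NORMAL_STYLE = [0.0, 0.0, 1.0]
TUTOR_STYLE = [1.0, 2.0, 0.0]

# A style halfway between 'normal' and 'tutor'.
halfway = interpolate_style(NORMAL_STYLE, TUTOR_STYLE, 0.5)
```

Under this assumption, raising alpha on a new or difficult word would push the voice toward the more expressive tutor style, and lowering it would return to ordinary reading.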

A short presentation covering the problem, approach, and evaluation.

Collected Dataset

As part of this project, I am working with a voice talent to record speech in different speaking styles. We have collected over three hours of speech.

Examples can be found here

Model

Frontend: Grapheme-to-phoneme LSTM duration model

Backend approach 1: LSTM acoustic model generating inputs for the WORLD vocoder
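The WORLD vocoder synthesizes a waveform from three per-frame parameter streams: fundamental frequency (F0), spectral envelope, and aperiodicity, with an F0 of zero marking unvoiced frames. The sketch below shows one way a single acoustic-model output frame could be split into those streams. The frame layout (log-F0, a voiced flag, then spectral and aperiodicity coefficients) and the name `split_frame` are assumptions made for illustration; the feature set actually used in this project is not stated here.

```python
import math
from typing import Dict, List, Sequence, Union


def split_frame(
    frame: Sequence[float], n_spec: int, n_aper: int
) -> Dict[str, Union[float, List[float]]]:
    """Split one predicted frame into vocoder parameter streams.

    Assumed (hypothetical) layout:
    [log_f0, voiced_flag, spectral coefficients..., aperiodicity coefficients...]
    """
    expected = 2 + n_spec + n_aper
    if len(frame) != expected:
        raise ValueError(f"expected {expected} values, got {len(frame)}")
    log_f0, voiced = frame[0], frame[1]
    # An F0 of 0.0 marks an unvoiced frame.
    f0 = math.exp(log_f0) if voiced > 0.5 else 0.0
    return {
        "f0": f0,
        "spectral": list(frame[2:2 + n_spec]),
        "aperiodicity": list(frame[2 + n_spec:]),
    }


# Made-up frame: a voiced frame at 220 Hz with 3 spectral and 1 aperiodicity value.
voiced_frame = [math.log(220.0), 1.0, 0.1, 0.2, 0.3, 0.9]
params = split_frame(voiced_frame, n_spec=3, n_aper=1)
```

In a full pipeline, the per-frame streams would be stacked over time and, if compressed features are used, decoded back to the full spectral envelope and aperiodicity before being passed to the vocoder.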

Original recording:

Synthesized recording (trained on 1 hour):