Using phones for literacy learning is an empowering application of mobile technology, but there are elements of the human tutor that have yet to be replicated in current apps. For example, when reading a story, a tutor is likely to use a more expressive and colorful tone. When encountering a new word, a tutor might emphasize the vowel phoneme or stress a consonant pair the child has yet to master. By modeling speech with deep neural networks, our speech synthesizer will be able to interpolate between speaking styles, switching from ‘normal’ mode to ‘tutor’ mode as needed.
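One way to picture interpolating between speaking styles is as a weighted blend of two style vectors. The sketch below is illustrative only: representing each style as a plain list of numbers, the function name `interpolate_style`, and the example values are all assumptions made for this example, not details of our synthesizer.

```python
def interpolate_style(normal, tutor, alpha):
    """Linearly blend two style vectors.

    alpha = 0.0 returns the 'normal' style, alpha = 1.0 returns the
    'tutor' style, and values in between mix the two.
    (Hypothetical helper; the vector representation is an assumption.)
    """
    if len(normal) != len(tutor):
        raise ValueError("style vectors must have the same length")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return [(1.0 - alpha) * n + alpha * t for n, t in zip(normal, tutor)]


# Made-up 3-dimensional style vectors, purely for illustration.
normal_style = [0.0, 1.0, 0.5]
tutor_style = [1.0, 0.0, 0.5]

# A style halfway between 'normal' and 'tutor'.
halfway = interpolate_style(normal_style, tutor_style, 0.5)
```

In a neural synthesizer, a blended vector like `halfway` would typically be passed to the model as a conditioning input, so that `alpha` acts as a dial between the two modes.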
A short presentation covering the problem, approach, and evaluation.
As part of this work, I am collaborating with a voice talent to record speech in different speaking styles. We have collected over three hours of speech.
Original recording:
Synthesized recording (trained on 1 hour):