Voice transformation: making new voices for speech synthesis

Voice transformation is the process of transforming the characteristics of speech uttered by a source speaker, such that a listener would believe the speech was uttered by a target speaker.

One of the main applications of voice transformation is in text-to-speech synthesis. Modern speech synthesisers require a large database of speech, which is very expensive and time-consuming to collect. A voice transformation system would allow new voices to be created with much less effort and cost by transforming existing voices, using information gathered from a relatively small sample of speech from the target speaker.

In this talk I will address the problem of transforming just the spectral characteristics of speech. My system has two main parts, corresponding to the two components of the source-filter model. The first part transforms the spectral envelope, as represented by a linear prediction (LPC) model. The transformation is achieved using a Gaussian mixture model, which is trained on aligned speech from the source and target speakers.
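
As an illustration of this kind of mapping, the following is a minimal sketch of a standard joint-density Gaussian mixture regression, in which each converted frame is a posterior-weighted sum of per-component linear regressions. It is not necessarily the exact formulation used in the system: the function name, the argument names and the assumption that the model parameters have already been estimated from aligned source-target frames are all hypothetical.

```python
import numpy as np

def gmm_convert(x, weights, mu_x, mu_y, cov_xx, cov_yx):
    """Map one source feature vector x to the target space.

    Hypothetical parameter layout (M components, D dimensions):
      weights: (M,)       mixture weights
      mu_x:    (M, D)     source means
      mu_y:    (M, D)     target means
      cov_xx:  (M, D, D)  source covariances
      cov_yx:  (M, D, D)  target-source cross-covariances
    """
    x = np.asarray(x, dtype=float)
    n_components = len(weights)

    # Log posterior of each component given x (up to a shared constant,
    # which cancels when the posteriors are normalised).
    log_post = np.empty(n_components)
    for i in range(n_components):
        diff = x - mu_x[i]
        _, logdet = np.linalg.slogdet(cov_xx[i])
        maha = diff @ np.linalg.solve(cov_xx[i], diff)
        log_post[i] = np.log(weights[i]) - 0.5 * (logdet + maha)

    post = np.exp(log_post - log_post.max())
    post /= post.sum()

    # Posterior-weighted sum of the per-component linear regressions.
    y = np.zeros(np.asarray(mu_y).shape[1], dtype=float)
    for i in range(n_components):
        diff = x - mu_x[i]
        y += post[i] * (mu_y[i] + cov_yx[i] @ np.linalg.solve(cov_xx[i], diff))
    return y
```

With a single component the function reduces to ordinary linear regression; with several components, the posteriors blend the local regressions smoothly across the feature space.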

The second part of the system predicts the spectral detail from the transformed LPC parameters. I propose a novel approach which is based on a classifier and residual codebooks.
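
To make the idea of a classifier combined with residual codebooks concrete, here is a deliberately simplified sketch. It assumes a nearest-centroid classifier over LPC parameter vectors and a single stored residual per class; both choices are stand-ins for illustration, and the names `select_residual`, `centroids` and `codebook` are hypothetical rather than taken from the system described here.

```python
import numpy as np

def select_residual(lpc, centroids, codebook):
    """Pick a residual for a transformed LPC vector.

    Assumed layout (K classes, D LPC dimensions, N residual samples):
      lpc:       (D,)    transformed LPC parameter vector
      centroids: (K, D)  one centroid per class (stand-in classifier)
      codebook:  (K, N)  one stored residual per class
    Returns the chosen class index and its codebook entry.
    """
    lpc = np.asarray(lpc, dtype=float)
    # Nearest-centroid classification by Euclidean distance.
    dists = np.linalg.norm(np.asarray(centroids, dtype=float) - lpc, axis=1)
    k = int(np.argmin(dists))
    return k, np.asarray(codebook)[k]
```

In this simplified form the classifier only decides which codebook entry to use; a practical system would need to choose a distance measure suited to the LPC representation and might store several residuals per class.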

Through the use of objective tests, I will show that the proposed system outperforms existing methods in a number of respects. I will demonstrate the performance of the system by playing some sample output.