Producing Synthetic Speech from Facial Movements
Arthur R. Toth – email@example.com
2414 Shady Ave.
Pittsburgh, PA 15217 USA
Cognitive Systems Lab
Karlsruhe, Baden-Württemberg 76131 Germany
Szu-Chen Stan Jou
Industrial Technology Research Institute
Chutung, Hsinchu, Taiwan 31040 R.O.C.
Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA 15213 USA
Dept. of Electrical and Computer Engineering
Georgia Institute of Technology
Atlanta, GA 30332 USA
Mitsubishi Electric Research Laboratories
Cambridge, MA 02139 USA
Popular version of paper 2aSC5 presented at the 2010 159th ASA Meeting in Baltimore, Maryland.
New approaches suggest the possibility of artificially producing speech based on measuring facial movements with either attached facial probes or ultrasound. Such techniques could allow people to communicate through speech in places where it would otherwise be difficult or impossible. In noisy places, their facial motions could be measured, and the corresponding artificial speech could be played on noise-canceling headphones or on speakers in a different, quieter location. In places where a person does not want to be heard, these techniques could allow silent speech where words are only mouthed. If a person can silently produce the same facial movements as in typical speech, these could be used to produce speech on headphones or on speakers in another location where it would not be obtrusive or public.
In our experiments, we used Surface Electromyography (EMG) and Acoustic Doppler Sonar (ADS) to measure facial movements. EMG uses probes attached to a person's face to measure the electrical impulses that control various muscles used while speaking. Different muscle configurations are necessary to place and move the articulators, which are the parts of a person's anatomy used to produce speech sounds. Our measurements are based on the electrical signals coming from five sets of probes placed to detect activity for specific muscles (see Figure 1).
Figure 1. A person wearing EMG probes.
Acoustic Doppler Sonar (ADS) is based on bouncing ultrasound off a speaking person's face and measuring the changes that the velocities of the articulators make to the ultrasound wave. An ultrasound wave with a frequency of 40 kilohertz (humans are typically unable to hear frequencies above 20 kilohertz) is transmitted towards a speaker's face from the distance of a typical desktop microphone, and a sensor is used to receive the reflected, modified ultrasound wave. Figure 2 shows the ADS hardware.
Figure 2. ADS hardware.
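As a rough illustration of the scale of the Doppler effect that ADS relies on, the following minimal sketch computes the frequency shift of a 40 kilohertz tone reflected off a moving surface. It is not the processing used in our experiments. The speed of sound (343 m/s), the example velocity (0.1 m/s), the treatment of the transmitter and sensor as being in the same place, and the function name are all assumptions made for illustration.

```python
# Illustrative only: two-way Doppler shift for a tone reflected off a moving surface.
# Assumptions (not from the article): speed of sound of 343 m/s, transmitter and
# sensor at the same position, and motion directly along the line of sight.

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at roughly room temperature


def reflected_doppler_shift_hz(carrier_hz, velocity_m_s,
                               speed_of_sound_m_s=SPEED_OF_SOUND_M_S):
    """Return the shift (Hz) of a reflected tone.

    velocity_m_s is positive when the reflecting surface moves toward the
    transmitter and negative when it moves away.
    """
    # Standard result for a reflector moving at velocity v: f_r = f0 * (c + v) / (c - v)
    received_hz = carrier_hz * (speed_of_sound_m_s + velocity_m_s) / (
        speed_of_sound_m_s - velocity_m_s
    )
    return received_hz - carrier_hz


if __name__ == "__main__":
    # Hypothetical articulator velocity of 0.1 m/s toward the sensor.
    shift = reflected_doppler_shift_hz(40_000.0, 0.1)
    print(f"Shift for 0.1 m/s at 40 kHz: {shift:.2f} Hz")
```

Under these assumptions the shift is on the order of tens of hertz, which is small compared to the 40 kilohertz carrier. Surfaces moving away from the sensor lower the reflected frequency, and those moving toward it raise the frequency.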
After using one of these techniques to measure facial movements, there is still the question of how to use this information to produce speech. Trying to construct a full physical model that would account for all the placements and motions of the articulators would be very difficult and error-prone. Instead, we adapted a statistical technique from the field of Voice Transformation to construct a generalized correspondence between the facial movement measurements and microphone-recorded speech.
In its original conception, Voice Transformation is the process of taking speech from one person, called the source speaker, and making it sound as if it were spoken by another person, called the target speaker. One popular technique for accomplishing this task is called Gaussian Mixture Model Mapping. In this technique, the source and target speakers are recorded reading the same sentences, and these recordings are used to determine parameters of a statistical model, called a Gaussian Mixture Model. This model can be used to predict how the target speaker would say something based on how the source speaker says it. Then, when the source speaker says a new utterance that has not been recorded by the target speaker, the same statistical model can be applied to transform it to sound like the target speaker. Due to the nature of the statistical model, it can generalize beyond the original data used to estimate the model's parameters and predict target speaker speech even when the source speaker's speech includes new phenomena that were not found in the original data.
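The prediction step of such a mapping can be sketched in a few lines. The example below shows one common formulation, in which a Gaussian Mixture Model describes source and target features jointly and the predicted target feature is the expected value given the source feature. It is a simplified sketch, not necessarily the exact variant used in our experiments: each feature is a single number rather than a vector, the model parameters are made up rather than estimated from recordings, and all names are hypothetical.

```python
# Illustrative only: prediction with a joint Gaussian Mixture Model over
# scalar source (x) and target (y) features. Real systems use feature vectors
# and estimate the parameters from parallel recordings; the numbers below are
# invented for demonstration.
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gmm_map(x, components):
    """Return the expected target feature given source feature x.

    Each component is a dict with keys: weight, mean_x, mean_y, var_x, cov_xy.
    """
    # How strongly each mixture component explains the observed source value.
    likelihoods = [
        c["weight"] * gaussian_pdf(x, c["mean_x"], c["var_x"]) for c in components
    ]
    total = sum(likelihoods)
    if total == 0.0:
        raise ValueError("source value is too far from every component")

    estimate = 0.0
    for c, likelihood in zip(components, likelihoods):
        posterior = likelihood / total
        # Within one component, the target is a linear function of the source.
        conditional_mean = c["mean_y"] + c["cov_xy"] / c["var_x"] * (x - c["mean_x"])
        estimate += posterior * conditional_mean
    return estimate


# Hypothetical two-component model.
EXAMPLE_COMPONENTS = [
    {"weight": 0.5, "mean_x": 0.0, "mean_y": 10.0, "var_x": 1.0, "cov_xy": 0.5},
    {"weight": 0.5, "mean_x": 10.0, "mean_y": -10.0, "var_x": 1.0, "cov_xy": 0.5},
]

if __name__ == "__main__":
    for source_value in (0.0, 2.0, 5.0, 10.0):
        print(source_value, gmm_map(source_value, EXAMPLE_COMPONENTS))
```

Because the output blends the per-component predictions according to how well each component explains the input, the mapping can produce reasonable values for inputs that never appeared in the training data.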
From a more general viewpoint, Voice Transformation is a way of taking one type of data and converting it to speech, as long as this data is from a process that is correlated with the produced speech. As this is indeed the case with EMG and ADS data, we used this approach to convert from EMG and ADS data to speech.
For both EMG and ADS, we collected facial movement measurements and recorded speech simultaneously. For EMG, we had a person read 500 sentences, of which 380 were used to determine a correspondence between facial movements and speech and 120 were used for evaluation. For ADS, we had a person read 188 sentences, of which 170 were used to determine a correspondence and 18 were used for evaluation. Evaluation consisted of taking the facial movement measurements, using them to produce synthetic speech, and comparing the synthetic speech with the actual recorded speech. The comparison used two types of measures, referred to as objective and subjective.
Objective measures are based on mathematical formulas and are convenient in that they can be computed automatically. Their disadvantage is that none are known to correspond perfectly to human perception. Due to differences in opinions among humans, a perfect correspondence might not even be possible. Nevertheless, some objective measures, such as the Mel-Cepstral Distortion (MCD) used in our experiments, have been shown to generally correlate with human perception of speech quality and are useful as rough indicators of performance. In our best trials, the average MCD between speech synthesized from EMG and the actual recorded speech was 6.37. For the best ADS trials, the MCD was 6.69. An MCD of 0 would indicate a perfect match. For comparison, typical Voice Transformation from speech to speech, which produces intelligible though somewhat unnatural-sounding speech, often has MCDs in the range of 5-7.
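For readers curious about how such a measure is computed, the sketch below implements one widely used definition of MCD: for each short frame of speech, the differences between the reference and synthesized mel-cepstral coefficients are squared and summed, scaled by a constant, and the per-frame values are averaged. Whether our experiments used exactly this variant (for example, skipping the first coefficient, which mostly reflects loudness) is not stated here, so treat the details and the function name as assumptions.

```python
# Illustrative only: a common definition of Mel-Cepstral Distortion (MCD).
# Assumptions: frames are already time-aligned, each frame is a list of
# mel-cepstral coefficients, and coefficient 0 (mostly overall energy) is
# excluded from the comparison.
import math


def mel_cepstral_distortion(reference_frames, synthesized_frames):
    """Return the average per-frame MCD between two aligned coefficient sequences."""
    if not reference_frames or len(reference_frames) != len(synthesized_frames):
        raise ValueError("sequences must be non-empty and of equal length")

    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref, syn in zip(reference_frames, synthesized_frames):
        # Skip coefficient 0 and sum squared differences over the rest.
        squared_error = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
        total += scale * math.sqrt(2.0 * squared_error)
    return total / len(reference_frames)


if __name__ == "__main__":
    # Made-up coefficient values for two frames.
    reference = [[1.0, 0.5, -0.2], [0.9, 0.4, -0.1]]
    synthesized = [[1.1, 0.3, 0.0], [0.8, 0.5, -0.3]]
    print(mel_cepstral_distortion(reference, synthesized))
```

Identical inputs give a value of 0, matching the description above of a perfect match, and larger differences in the coefficients give larger values.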
Subjective measures are based on collecting human opinions. In many subjective measures, a number of humans are asked to score a quality of speech, such as naturalness, on a numerical scale, and the scores are then averaged. The advantage of subjective measures is that the goal often is related to human perception, and these are the most direct methods to measure it. The disadvantage is that conducting experiments with humans can be a time-consuming, expensive process. For that reason, it is common to first use an objective measure and informally listen to synthesized examples to determine whether the results may be good enough to warrant a formal subjective evaluation.
After listening to synthetic speech both partially predicted from facial movements and completely predicted from facial movements, we decided that the intelligibility of the artificial speech was not yet high enough to warrant a formal subjective evaluation. The partially synthesized examples for both EMG and ADS were somewhat intelligible, but the examples fully synthesized from EMG were not. We were, however, able to occasionally recognize words and phrases in the examples fully synthesized from ADS. We also found these examples to be speech-like, both in terms of how they sounded and in the similarities in the spectrograms, which are diagrams that display different sound frequency contributions over time.
Sound example 1 is a recorded sentence from our ADS evaluation set. Sound example 2 is the same sentence partially resynthesized from ADS data. The power and fundamental frequency were taken from the recording, but the spectral envelopes were estimated from the ADS data. Finally, Sound example 3 is the same utterance synthesized solely from the ADS data. Due to potential difficulties with predicting fundamental frequencies, we chose to use noise excitation to produce whisper-like speech for this experiment. Although more work is necessary to improve the intelligibility, our preliminary results are encouraging.
For more information on these topics, including links to publications, news articles, and example sound files, please visit http://www.cs.cmu.edu/~atoth/ASA2010.html.
Sound example 1
Sound example 2
Sound example 3