This is the demo webpage for the paper ‘Face-based Voice Conversion: Learning the Voice behind a Face’.
Outline
1. FaceVC demo
2. Voice style interpolation
3. Conversion results of each stage
4. Necessity of Stage I
FaceVC demo
Speaker photo
FaceVC (voice style from face; trained on LRS3 and VCTK)
AutoVC (voice style from speech; trained on LRS3)
Ground Truth
Voice style interpolation
Since FaceVC applies the reparameterization trick, speaker embeddings can be interpolated.
The following samples show the voice styles interpolated between two specified speakers; a minimal code sketch of the interpolation is given below.
Note that the voice styles are generated from facial characteristics, so the interpolated voice styles may not lie exactly along the vocal-feature axes.
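Concretely, the interpolation is a convex combination of two speaker embeddings. The sketch below illustrates the ratio sweep used in the tables that follow; the embedding dimension, the placeholder embeddings, and the helper name `interpolate_speaker_embeddings` are assumptions for illustration, and the FaceVC encoder/decoder wiring is not shown.

```python
import torch

# Minimal sketch of speaker-embedding interpolation. Only the convex
# combination of two embeddings comes from the demo; the embedding size
# (256) is an assumed placeholder.
def interpolate_speaker_embeddings(emb_a: torch.Tensor, emb_b: torch.Tensor,
                                   ratio_a: float) -> torch.Tensor:
    """Return ratio_a * A + (1 - ratio_a) * B."""
    return ratio_a * emb_a + (1.0 - ratio_a) * emb_b

# Placeholders standing in for the face-derived embeddings of speakers A and B.
emb_a = torch.randn(256)
emb_b = torch.randn(256)

# Sweep the same mixing ratios shown in the tables below.
for ratio in [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]:
    mixed = interpolate_speaker_embeddings(emb_a, emb_b, ratio)
    # `mixed` would be fed to the FaceVC decoder together with content features.
    print(f"{ratio:.1f}A + {1.0 - ratio:.1f}B -> embedding norm {mixed.norm():.3f}")
```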
Speaker A
Speaker B
Ratio
Audio
0.0A + 1.0B
0.2A + 0.8B
0.4A + 0.6B
0.6A + 0.4B
0.8A + 0.2B
1.0A + 0.0B
Speaker C
Speaker D
Ratio
Audio
0.0C + 1.0D
0.2C + 0.8D
0.4C + 0.6D
0.6C + 0.4D
0.8C + 0.2D
1.0C + 0.0D
Conversion results of each stage
Stage I (voice style from face + content from LRS3)
Stage II (voice style from speech + content from VCTK)
Inference (voice style from face + content from VCTK)