This is the demo webpage for the paper ‘Face-based Voice Conversion: Learning the Voice behind a Face’.


1. FaceVC demo

2. Voice style interpolation

3. Conversion results of each stage

4. Necessity of Stage I

FaceVC demo


Speaker photo


(voice style comes from face)
(trained on LRS3 and VCTK)


(voice style comes from speech)
(trained on LRS3)


Ground Truth


Voice style interpolation

Since FaceVC applies the reparameterization trick, speaker embeddings can be interpolated.

The following samples show the interpolated voice styles between two specified speakers.

Note that the voice styles are generated from facial characteristics, so the interpolated voice styles may not lie exactly on the vocal feature axes.


Speaker A


Speaker B







0.0A + 1.0B


0.2A + 0.8B


0.4A + 0.6B


0.6A + 0.4B


0.8A + 0.2B


1.0A + 0.0B



Speaker C


Speaker D







0.0C + 1.0D


0.2C + 0.8D


0.4C + 0.6D


0.6C + 0.4D


0.8C + 0.2D


1.0C + 0.0D
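The interpolation above amounts to a convex combination of the two speakers' embedding vectors. A minimal sketch of this idea (the function name and the 256-dimensional embedding size are illustrative assumptions, not taken from the paper's code):

```python
import numpy as np

def interpolate_embeddings(emb_a, emb_b, alpha):
    """Convex combination alpha*A + (1 - alpha)*B of two speaker embeddings."""
    return alpha * emb_a + (1 - alpha) * emb_b

# Stand-in speaker embeddings; in FaceVC these would come from the face encoder.
rng = np.random.default_rng(0)
emb_a = rng.standard_normal(256)
emb_b = rng.standard_normal(256)

# The same mixing ratios as the samples above, from 0.0A + 1.0B to 1.0A + 0.0B.
for alpha in [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]:
    mixed = interpolate_embeddings(emb_a, emb_b, alpha)
    # 'mixed' would replace the single speaker embedding fed to the decoder.
```

Because the reparameterized embedding space is continuous, points along this line correspond to plausible intermediate voice styles rather than arbitrary vectors.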


Conversion results of each stage


Stage I
(voice style from face + content from LRS3)


Stage II
(voice style from speech + content from VCTK)


(voice style from face + content from VCTK)




Necessity of Stage I


Face Encoder pretrained in Stage I


Face Encoder trained from scratch in Stage III


Ground Truth