FaceVC

This is the demo web page for the paper "Face-based Voice Conversion: Learning the Voice behind a Face".

Outline

1. FaceVC demo

2. Voice style interpolation

3. Conversion results of each stage

4. Necessity of Stage I

FaceVC demo

 

Speaker photo

 
 

FaceVC
(voice style comes from face)
(trained on LRS3 and VCTK)

 
 

AutoVC
(voice style comes from speech)
(trained on LRS3)

 
 

Ground Truth

 


Voice style interpolation

Since FaceVC applies the reparameterization trick, its speaker embeddings can be interpolated.

The following samples present interpolated voice styles between two specified speakers.

Note that the voice styles are generated from facial characteristics, so the interpolated voice styles may not lie exactly along the vocal feature axes.
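The interpolation itself amounts to a linear mix of two speaker embeddings. A minimal sketch is shown below; the function name, the 256-dimensional embedding size, and the randomly sampled embeddings are illustrative assumptions, not details from the paper:

```python
import numpy as np

def interpolate_embeddings(emb_a, emb_b, ratio_a):
    """Linearly mix two speaker embeddings: ratio_a * A + (1 - ratio_a) * B."""
    return ratio_a * emb_a + (1.0 - ratio_a) * emb_b

# Hypothetical 256-dim speaker embeddings (stand-ins for the face-derived
# embeddings sampled via the reparameterization trick).
rng = np.random.default_rng(0)
emb_a = rng.standard_normal(256)
emb_b = rng.standard_normal(256)

# Sweep the same mixing ratios used in the demo tables.
interpolated = {
    ratio: interpolate_embeddings(emb_a, emb_b, ratio)
    for ratio in [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
}
```

Each interpolated embedding would then be fed to the decoder in place of a single speaker's embedding to synthesize the corresponding audio sample.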

 

Speaker A

 
 

Speaker B

 


 

Ratio

 
 

Audio

 
 

0.0A + 1.0B

 
  
 

0.2A + 0.8B

 
  
 

0.4A + 0.6B

 
  
 

0.6A + 0.4B

 
  
 

0.8A + 0.2B

 
  
 

1.0A + 0.0B

 
  


 

Speaker C

 
 

Speaker D

 


 

Ratio

 
 

Audio

 
 

0.0C + 1.0D

 
  
 

0.2C + 0.8D

 
  
 

0.4C + 0.6D

 
  
 

0.6C + 0.4D

 
  
 

0.8C + 0.2D

 
  
 

1.0C + 0.0D

 
  

Conversion results of each stage

 

Stage I
(voice style from face + content from LRS3)

 
 

Stage II
(voice style from speech + content from VCTK)

 
 

Inference
(voice style from face + content from VCTK)

 
 


 


 


Necessity of Stage I

 

Face Encoder pretrained in Stage I

 
 

Face Encoder trained from scratch in Stage III

 
 

Ground Truth