FaceVC

This is the demo web page for the paper "Face-based Voice Conversion: Learning the Voice behind a Face".

Outline

1. FaceVC demo

2. Voice style interpolation

3. Conversion results of each stage

4. Necessity of Stage I

FaceVC demo

 

Speaker photo

 
 

FaceVC
(voice style comes from face)
(trained on LRS3 and VCTK)

 
 

AutoVC
(voice style comes from speech)
(trained on LRS3)

 
 

Ground Truth

 


Voice style interpolation

Since FaceVC applies the reparameterization trick, its speaker embeddings can be interpolated.

The following samples present interpolated voice styles between two specified speakers.

Note that the voice styles are generated from facial characteristics, so the interpolated voice styles may not lie exactly along the vocal feature axes.
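The interpolation itself amounts to a linear mix of two speaker embeddings. A minimal sketch is shown below; the function name, the 256-dimensional embedding size, and the randomly sampled embeddings are illustrative assumptions, not details from the paper:

```python
import numpy as np

def interpolate_embeddings(emb_a, emb_b, ratio_a):
    """Linearly mix two speaker embeddings: ratio_a * A + (1 - ratio_a) * B."""
    return ratio_a * emb_a + (1.0 - ratio_a) * emb_b

# Hypothetical 256-dim speaker embeddings (stand-ins for the face-derived
# embeddings sampled via the reparameterization trick).
rng = np.random.default_rng(0)
emb_a = rng.standard_normal(256)
emb_b = rng.standard_normal(256)

# Sweep the same mixing ratios used in the demo tables.
interpolated = {
    ratio: interpolate_embeddings(emb_a, emb_b, ratio)
    for ratio in [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
}
```

Each interpolated embedding would then be fed to the decoder in place of a single speaker's embedding to synthesize the corresponding audio sample.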

 

Speaker A

 
 

Speaker B

 


 

Ratio

 
 

Audio

 
 

0.0A + 1.0B

 
  
 

0.2A + 0.8B

 
  
 

0.4A + 0.6B

 
  
 

0.6A + 0.4B

 
  
 

0.8A + 0.2B

 
  
 

1.0A + 0.0B

 
  


 

Speaker C

 
 

Speaker D

 


 

Ratio

 
 

Audio

 
 

0.0C + 1.0D

 
  
 

0.2C + 0.8D

 
  
 

0.4C + 0.6D

 
  
 

0.6C + 0.4D

 
  
 

0.8C + 0.2D

 
  
 

1.0C + 0.0D

 
  

Conversion results of each stage

 

Stage I
(voice style from face + content from LRS3)

 
 

Stage II
(voice style from speech + content from VCTK)

 
 

Inference
(voice style from face + content from VCTK)

 
 


 


 


Necessity of Stage I

 

Face Encoder pretrained in Stage I

 
 

Face Encoder trained from scratch in Stage III

 
 

Ground Truth