Audio demos

Voice Cloning Experiment I

The multi-speaker model and the speaker encoder model were trained on 84 VCTK speakers (48 kHz sampling rate); voice cloning was performed on other VCTK speakers (48 kHz sampling rate). The average duration of a cloning sample is 3.7 seconds. Boldface indicates the best results.
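To put the sample counts used in the demos below in perspective, the following sketch estimates how much audio each cloning setting uses. It is only an approximation: it multiplies the sample count by the 3.7-second average stated above, so actual totals will vary with the individual clips. The names `AVERAGE_SAMPLE_SECONDS`, `SAMPLE_COUNTS`, and `approximate_total_seconds` are illustrative and do not come from the original experiments.

```python
# Rough estimate of total cloning audio per setting.
# Assumption: every sample has the average duration of 3.7 s,
# which real clips will only approximate.
AVERAGE_SAMPLE_SECONDS = 3.7
SAMPLE_COUNTS = [1, 5, 10, 20, 50, 100]


def approximate_total_seconds(sample_count, average_seconds=AVERAGE_SAMPLE_SECONDS):
    """Return the approximate total audio duration in seconds."""
    return sample_count * average_seconds


totals = {n: approximate_total_seconds(n) for n in SAMPLE_COUNTS}

for n, seconds in totals.items():
    print(f"{n:>3} samples: about {seconds:.1f} s ({seconds / 60:.1f} min)")
```

Under this assumption, even the largest setting of 100 samples corresponds to only a few minutes of speech from the target speaker.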

	Speaker 0 (female)
Original speech
Cloned speech (speaker embedding adaptation with 1 sample)
Cloned speech (speaker embedding adaptation with 5 samples)
Cloned speech (speaker embedding adaptation with 10 samples)
Cloned speech (speaker embedding adaptation with 20 samples)
Cloned speech (speaker embedding adaptation with 50 samples)
Cloned speech (speaker embedding adaptation with 100 samples)
Cloned speech (whole model adaptation with 1 sample)
Cloned speech (whole model adaptation with 5 samples)
Cloned speech (whole model adaptation with 10 samples)
Cloned speech (whole model adaptation with 20 samples)
Cloned speech (whole model adaptation with 50 samples)
Cloned speech (whole model adaptation with 100 samples)

	Speaker 1 (female)
Original speech
Cloned speech (speaker embedding adaptation with 1 sample)
Cloned speech (speaker embedding adaptation with 5 samples)
Cloned speech (speaker embedding adaptation with 10 samples)
Cloned speech (speaker embedding adaptation with 20 samples)
Cloned speech (speaker embedding adaptation with 50 samples)
Cloned speech (speaker embedding adaptation with 100 samples)
Cloned speech (whole model adaptation with 1 sample)
Cloned speech (whole model adaptation with 5 samples)
Cloned speech (whole model adaptation with 10 samples)
Cloned speech (whole model adaptation with 20 samples)
Cloned speech (whole model adaptation with 50 samples)
Cloned speech (whole model adaptation with 100 samples)

	Speaker 2 (male)
Original speech
Cloned speech (speaker embedding adaptation with 1 sample)
Cloned speech (speaker embedding adaptation with 5 samples)
Cloned speech (speaker embedding adaptation with 10 samples)
Cloned speech (speaker embedding adaptation with 20 samples)
Cloned speech (speaker embedding adaptation with 50 samples)
Cloned speech (speaker embedding adaptation with 100 samples)
Cloned speech (whole model adaptation with 1 sample)
Cloned speech (whole model adaptation with 5 samples)
Cloned speech (whole model adaptation with 10 samples)
Cloned speech (whole model adaptation with 20 samples)
Cloned speech (whole model adaptation with 50 samples)
Cloned speech (whole model adaptation with 100 samples)

	Speaker 3 (male)
Original speech
Cloned speech (speaker embedding adaptation with 1 sample)
Cloned speech (speaker embedding adaptation with 5 samples)
Cloned speech (speaker embedding adaptation with 10 samples)
Cloned speech (speaker embedding adaptation with 20 samples)
Cloned speech (speaker embedding adaptation with 50 samples)
Cloned speech (speaker embedding adaptation with 100 samples)
Cloned speech (whole model adaptation with 1 sample)
Cloned speech (whole model adaptation with 5 samples)
Cloned speech (whole model adaptation with 10 samples)
Cloned speech (whole model adaptation with 20 samples)
Cloned speech (whole model adaptation with 50 samples)
Cloned speech (whole model adaptation with 100 samples)

Voice Cloning Experiment II

The multi-speaker model and the speaker encoder model were trained on LibriSpeech speakers (16 kHz sampling rate); voice cloning was performed on VCTK speakers (downsampled to a 16 kHz sampling rate). The average duration of a cloning sample is 3.7 seconds. Boldface indicates the best results.

	Speaker 0 (female)
Original speech
Cloned speech (speaker embedding adaptation with 1 sample)
Cloned speech (speaker embedding adaptation with 5 samples)
Cloned speech (speaker embedding adaptation with 10 samples)
Cloned speech (speaker embedding adaptation with 20 samples)
Cloned speech (speaker embedding adaptation with 50 samples)
Cloned speech (speaker embedding adaptation with 100 samples)
Cloned speech (whole model adaptation with 1 sample)
Cloned speech (whole model adaptation with 5 samples)
Cloned speech (whole model adaptation with 10 samples)
Cloned speech (whole model adaptation with 20 samples)
Cloned speech (whole model adaptation with 50 samples)
Cloned speech (whole model adaptation with 100 samples)
Cloned speech (speaker encoder without fine-tuning with 1 sample)
Cloned speech (speaker encoder without fine-tuning with 5 samples)
Cloned speech (speaker encoder without fine-tuning with 10 samples)
Cloned speech (speaker encoder with fine-tuning with 1 sample)
Cloned speech (speaker encoder with fine-tuning with 5 samples)
Cloned speech (speaker encoder with fine-tuning with 10 samples)

	Speaker 1 (female)
Original speech
Cloned speech (speaker embedding adaptation with 1 sample)
Cloned speech (speaker embedding adaptation with 5 samples)
Cloned speech (speaker embedding adaptation with 10 samples)
Cloned speech (speaker embedding adaptation with 20 samples)
Cloned speech (speaker embedding adaptation with 50 samples)
Cloned speech (speaker embedding adaptation with 100 samples)
Cloned speech (whole model adaptation with 1 sample)
Cloned speech (whole model adaptation with 5 samples)
Cloned speech (whole model adaptation with 10 samples)
Cloned speech (whole model adaptation with 20 samples)
Cloned speech (whole model adaptation with 50 samples)
Cloned speech (whole model adaptation with 100 samples)
Cloned speech (speaker encoder without fine-tuning with 1 sample)
Cloned speech (speaker encoder without fine-tuning with 5 samples)
Cloned speech (speaker encoder without fine-tuning with 10 samples)
Cloned speech (speaker encoder with fine-tuning with 1 sample)
Cloned speech (speaker encoder with fine-tuning with 5 samples)
Cloned speech (speaker encoder with fine-tuning with 10 samples)

	Speaker 2 (male)
Original speech
Cloned speech (speaker embedding adaptation with 1 sample)
Cloned speech (speaker embedding adaptation with 5 samples)
Cloned speech (speaker embedding adaptation with 10 samples)
Cloned speech (speaker embedding adaptation with 20 samples)
Cloned speech (speaker embedding adaptation with 50 samples)
Cloned speech (speaker embedding adaptation with 100 samples)
Cloned speech (whole model adaptation with 1 sample)
Cloned speech (whole model adaptation with 5 samples)
Cloned speech (whole model adaptation with 10 samples)
Cloned speech (whole model adaptation with 20 samples)
Cloned speech (whole model adaptation with 50 samples)
Cloned speech (whole model adaptation with 100 samples)
Cloned speech (speaker encoder without fine-tuning with 1 sample)
Cloned speech (speaker encoder without fine-tuning with 5 samples)
Cloned speech (speaker encoder without fine-tuning with 10 samples)
Cloned speech (speaker encoder with fine-tuning with 1 sample)
Cloned speech (speaker encoder with fine-tuning with 5 samples)
Cloned speech (speaker encoder with fine-tuning with 10 samples)

	Speaker 3 (male)
Original speech
Cloned speech (speaker embedding adaptation with 1 sample)
Cloned speech (speaker embedding adaptation with 5 samples)
Cloned speech (speaker embedding adaptation with 10 samples)
Cloned speech (speaker embedding adaptation with 20 samples)
Cloned speech (speaker embedding adaptation with 50 samples)
Cloned speech (speaker embedding adaptation with 100 samples)
Cloned speech (whole model adaptation with 1 sample)
Cloned speech (whole model adaptation with 5 samples)
Cloned speech (whole model adaptation with 10 samples)
Cloned speech (whole model adaptation with 20 samples)
Cloned speech (whole model adaptation with 50 samples)
Cloned speech (whole model adaptation with 100 samples)
Cloned speech (speaker encoder without fine-tuning with 1 sample)
Cloned speech (speaker encoder without fine-tuning with 5 samples)
Cloned speech (speaker encoder without fine-tuning with 10 samples)
Cloned speech (speaker encoder with fine-tuning with 1 sample)
Cloned speech (speaker encoder with fine-tuning with 5 samples)
Cloned speech (speaker encoder with fine-tuning with 10 samples)

Manipulation of speaker embeddings estimated by the speaker encoder

	Apply male-->female vector to convert British male to British female
Original speaker (British male)
Synthesized speaker (sample 1)
Synthesized speaker (sample 2)

	Apply British-->American vector to convert British male to American male
Original speaker (British male)
Synthesized speaker (sample 1)
Synthesized speaker (sample 2)
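The conversions above apply an attribute vector (male to female, or British to American) to a speaker embedding. The page does not say how these vectors are obtained, so the sketch below assumes one common construction: the difference between the mean embeddings of two speaker groups, added to the embedding of the speaker to convert. The function names, the `scale` parameter, and the 3-dimensional toy embeddings are all hypothetical; real embeddings come from the speaker encoder and have many more dimensions.

```python
from statistics import fmean


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [fmean(column) for column in zip(*vectors)]


def attribute_direction(source_group, target_group):
    """Difference of group means, pointing from source to target.

    Assumption: the attribute vector is the difference between the
    average embeddings of the two groups (not confirmed by the page).
    """
    source_mean = mean_vector(source_group)
    target_mean = mean_vector(target_group)
    return [t - s for s, t in zip(source_mean, target_mean)]


def shift_embedding(embedding, direction, scale=1.0):
    """Move an embedding along the attribute direction."""
    return [e + scale * d for e, d in zip(embedding, direction)]


# Toy embeddings standing in for speaker encoder outputs (hypothetical values).
male_embeddings = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
female_embeddings = [[2.0, 4.0, 2.0], [2.0, 6.0, 2.0]]

male_to_female = attribute_direction(male_embeddings, female_embeddings)
converted = shift_embedding([1.0, 0.5, 2.0], male_to_female)

print(male_to_female)  # direction from the male mean to the female mean
print(converted)       # embedding that would be passed to the synthesizer
```

In this sketch the shifted embedding would then condition the multi-speaker model in place of the original one; `scale` is included only to show that the strength of the shift could be varied.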