Check the CHANGELOG file to have a global overview of the latest modifications!
```
├── architectures            : utilities for model architectures
│   ├── layers               : custom layer implementations
│   ├── transformers         : transformer architecture implementations
│   ├── tacotron2_arch.py    : Tacotron-2 synthesizer architecture (+ multi-speaker variant)
│   └── waveglow_arch.py     : WaveGlow vocoder architecture
├── custom_train_objects
│   └── losses
│       └── tacotron_loss.py : custom Tacotron2 loss
├── example_outputs          : some pre-computed audios (cf. the `text_to_speech` notebook)
├── loggers
├── models
│   ├── tts
│   │   ├── sv2tts_tacotron2.py : SV2TTS main class
│   │   ├── tacotron2.py        : Tacotron2 main class
│   │   └── waveglow.py         : WaveGlow main class (both PyTorch and TensorFlow)
│   └── weights_converter.py    : utilities to convert weights between different models
├── pretrained_models
├── tests                    : unit and integration tests for model validation
├── utils                    : utility functions for data processing and visualization
├── LICENCE                  : project license file
├── README.md                : this file
├── requirements.txt         : required packages
└── text_to_speech.ipynb     : notebook demonstrating model creation + TTS features
```
Check the main project for more information about the unextended modules / structure / main classes.
Check the encoders project for more information about the `models/encoder` module.

- Text-To-Speech (module `models.tts`):
| Feature        | Function / class | Description |
| :------------- | :--------------- | :---------- |
| Text-To-Speech | `tts`            | perform TTS on the text you want with the model you want |
| stream         | `stream`         | perform TTS on text you enter |
The `text_to_speech` notebook provides a concrete demonstration of the `tts` function.
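For a quick idea of what this looks like in code, here is a minimal sketch; the import path and the `model` / `directory` keyword arguments are assumptions for illustration only, the notebook shows the actual API:

```python
# Hypothetical quick-start sketch: the import path and keyword names below are
# assumptions, not the verified API; refer to the `text_to_speech` notebook.
from models.tts import tts

# Synthesize a sentence with one of the pretrained models listed below and
# store the generated audio in an output directory.
tts(
    'Bonjour tout le monde !',
    model     = 'sv2tts_siwis',        # name of a model in `pretrained_models/`
    directory = 'example_outputs'      # where the generated audio is saved
)
```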
Available architectures:
- Synthesizer: `Tacotron2` (+ the `SV2TTSTacotron2` multi-speaker variant)
- Vocoder: `WaveGlow`
The SV2TTS models are fine-tuned from pretrained Tacotron2 models using the partial transfer learning procedure (see below for details), which speeds up training significantly.
| Name | Language | Dataset | Synthesizer | Vocoder | Speaker Encoder | Trainer | Weights |
| :--- | :------- | :------ | :---------- | :------ | :-------------- | :------ | :------ |
| `pretrained_tacotron2` | en | LJSpeech | Tacotron2 | WaveGlow | / | NVIDIA | Google Drive |
| `tacotron2_siwis` | fr | SIWIS | Tacotron2 | WaveGlow | / | me | Google Drive |
| `sv2tts_tacotron2_256` | fr | SIWIS, VoxForge, CommonVoice | SV2TTSTacotron2 | WaveGlow | Google Drive | me | Google Drive |
| `sv2tts_siwis` | fr | SIWIS, VoxForge, CommonVoice | SV2TTSTacotron2 | WaveGlow | Google Drive | me | Google Drive |
| `sv2tts_tacotron2_256_v2` | fr | SIWIS, VoxForge, CommonVoice | SV2TTSTacotron2 | WaveGlow | Google Drive | me | Google Drive |
| `sv2tts_siwis_v2` | fr | SIWIS | SV2TTSTacotron2 | WaveGlow | Google Drive | me | Google Drive |
Models must be unzipped in the `pretrained_models/` directory!

Important Note: These links will be updated in a future version, and the converted Keras weights of `WaveGlow` will also be added.
- Clone this repository: `git clone https://github.com/yui-mhcp/text_to_speech.git`
- Go to the root of this repository: `cd text_to_speech`
- Install requirements: `pip install -r requirements.txt`
- Open the `text_to_speech` notebook and follow the instructions!

You may have to install `ffmpeg` for audio loading/saving.
- Make the TO-DO list
- Comment the code
- Add `batch_size` support for vocoder inference
- Add pretrained `SV2TTS` weights
- Add document parsing to perform `TTS` on documents (in progress)
- Train a `SV2TTS` model based on an encoder trained with the `GE2E` loss
- Add support for long text inference
- Add support for streaming inference
- Update the pretrained model links + add the `WaveGlow` model
- Update the `models/encoder` module
- Update the `Google Colab` demo
- Update the training notebooks
- Train new `SV2TTS` models based on the optimized pre-processing code
- Add dedicated document-based `TTS`
There are multiple ways to enable multi-speaker speech synthesis:

- Use a speaker ID that is embedded by a learnable `Embedding` layer. The speaker embedding is then learned during training (a minimal sketch follows this list).
- Use a `Speaker Encoder (SE)` to embed audio from the reference speaker. This is often referred to as zero-shot voice cloning, as it only requires a sample from the speaker (without training).
- Recently, a new prompt-based strategy has been proposed to control the speech with prompts.
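To make the first strategy concrete, here is a minimal Keras sketch of a speaker ID mapped to a learnable embedding; the number of speakers and the embedding size are illustrative assumptions, not the project's actual configuration:

```python
import keras
from keras import layers

NUM_SPEAKERS  = 64    # assumed number of speakers in the training set
EMBEDDING_DIM = 256   # assumed speaker embedding size

# Each integer speaker ID is mapped to a learnable d-dimensional vector that is
# trained jointly with the synthesizer.
speaker_id  = keras.Input(shape = (), dtype = 'int32', name = 'speaker_id')
speaker_emb = layers.Embedding(NUM_SPEAKERS, EMBEDDING_DIM, name = 'speaker_embedding')(speaker_id)

speaker_lookup = keras.Model(speaker_id, speaker_emb, name = 'speaker_lookup')
speaker_lookup.summary()   # (batch, ) int32 -> (batch, EMBEDDING_DIM) float32
```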
Note: In the next paragraphs, `encoder` refers to the `Tacotron Encoder` part (that encodes the input text), while `SE` refers to a speaker encoder model (detailed below).
The Speaker Encoder-based Text-To-Speech approach is inspired by the "From Speaker Verification To Text-To-Speech (SV2TTS)" paper. The authors proposed an extension of the Tacotron-2 architecture that includes information about the speaker's voice.

Here is a short overview of the proposed procedure:
- Train a model to identify speakers based on short audio samples: the speaker verification model. This model takes as input an audio sample (5-10 sec) from a speaker and encodes it into a d-dimensional vector, named the `embedding`. This embedding aims to capture relevant information about the speaker's voice (e.g., frequencies, rhythm, pitch).
- This pre-trained `Speaker Encoder (SE)` is then used to encode the voice of the speaker to clone.
- The produced embedding is then concatenated with the output of the `Tacotron-2` encoder part, such that the `Decoder` has access to both the encoded text and the speaker embedding.
The objective is that the `Decoder` will learn to use the `speaker embedding` to reproduce the speaker's prosody, intonation, etc., and thus read the text with that speaker's voice.
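The concatenation step can be sketched as follows with `keras.ops`; the feature dimensions are illustrative assumptions, and this is a standalone toy example rather than the project's actual Tacotron-2 code:

```python
import numpy as np
from keras import ops

TEXT_ENC_DIM  = 512   # assumed Tacotron-2 encoder output dimension
EMBEDDING_DIM = 256   # assumed speaker embedding dimension

# Dummy encoded text (batch, time, features) and speaker embedding (batch, features).
encoder_out = ops.convert_to_tensor(np.random.normal(size = (2, 37, TEXT_ENC_DIM)).astype('float32'))
speaker_emb = ops.convert_to_tensor(np.random.normal(size = (2, EMBEDDING_DIM)).astype('float32'))

# Repeat the speaker embedding along the time axis, then concatenate it with every
# encoder timestep so the decoder sees both the encoded text and the speaker identity.
n_frames   = ops.shape(encoder_out)[1]
tiled_emb  = ops.tile(ops.expand_dims(speaker_emb, axis = 1), [1, n_frames, 1])
decoder_in = ops.concatenate([encoder_out, tiled_emb], axis = -1)

print(ops.shape(decoder_in))   # (2, 37, TEXT_ENC_DIM + EMBEDDING_DIM)
```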
There are some limitations with the above approach:
- Perfect generalization to new speakers is very difficult, as it would require large datasets with many speakers.
- The audio should not have any noise/artifacts, to avoid noisy synthetic audio.
- The `Speaker Encoder` has to correctly separate speakers and encode their voices in a meaningful way for the synthesizer.
To tackle these limitations, the proposed solution is to perform a 2-step training:
- First train a low-quality multi-speaker model on the `CommonVoice` database. This is one of the largest multilingual audio databases, at the cost of noisy / variable-quality recordings. It is therefore not suitable for training high-quality models, although pre-processing still helps to obtain intelligible audio.
- Once a multi-speaker model is trained, a single-speaker database with a limited amount of good-quality data can be used to fine-tune the model on a single speaker. This allows the model to learn faster, with only a limited amount of good-quality data, and to produce really good-quality audio!
The SE part should be able to differentiate speakers and embed them (i.e., encode them into a 1-D vector) in a meaningful way.

The model used in the paper is a 3-layer `LSTM` model with a normalization layer, trained with the GE2E loss. Its major limitation is that training is really slow: it took 2 weeks on 4 GPUs in CorentinJ's master thesis (cf. his GitHub).

This project proposes a simpler architecture based on Convolutional Neural Networks (CNNs), which is much faster to train than `LSTM` networks. Furthermore, the Euclidean distance is used rather than the cosine metric, which has shown faster convergence. Additionally, a custom cache-based generator is proposed to speed up audio processing. These modifications allowed training a model to 99% accuracy within 2-3 hours on a single RTX 3090 GPU!
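As an illustration only, here is a minimal sketch of what a CNN-based speaker encoder could look like in Keras; the layer sizes, the mel-spectrogram input shape and the final normalization are assumptions for the example, not the project's actual architecture:

```python
import keras
from keras import layers

N_MEL_CHANNELS = 80    # assumed number of mel-spectrogram channels
EMBEDDING_DIM  = 256   # assumed speaker embedding size

def build_cnn_speaker_encoder():
    """ Maps a mel-spectrogram (time, mels) to a fixed-size speaker embedding. """
    inputs = keras.Input(shape = (None, N_MEL_CHANNELS), name = 'mel_spectrogram')

    x = inputs
    for filters in (64, 128, 256):
        x = layers.Conv1D(filters, kernel_size = 3, padding = 'same')(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation('relu')(x)
        x = layers.MaxPooling1D(pool_size = 2)(x)

    # Collapse the (variable-length) time axis, then project to the embedding space.
    x = layers.GlobalAveragePooling1D()(x)
    x = layers.Dense(EMBEDDING_DIM)(x)
    # Normalization layer (an assumption for this sketch), so embeddings are directly
    # comparable with a simple distance such as the Euclidean one.
    outputs = layers.UnitNormalization(name = 'embedding')(x)

    return keras.Model(inputs, outputs, name = 'cnn_speaker_encoder')

encoder = build_cnn_speaker_encoder()
encoder.summary()
```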
In order to avoid training an SV2TTS model from scratch, which would be impossible on a single GPU, a new partial transfer learning procedure is proposed.
This procedure takes a pre-trained model with a slightly different architecture and transfers all the common weights (like in regular transfer learning). For the layers with different weight shapes, only the common part is transferred, while the remaining weights are initialized to zeros. This results in a new model with different weights that mimics the behavior of the original model.
In the `SV2TTS` architecture, the speaker embedding is passed to the recurrent layer of the `Tacotron2 decoder`. This results in a different input shape, and therefore a different weight matrix for this layer. Partial transfer learning allows us to initialize the model such that it replicates the behavior of the original single-speaker `Tacotron2` model!
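A minimal sketch of this partial weight transfer is shown below; `partial_transfer` is a hypothetical helper written for illustration (the project's actual utilities live in `models/weights_converter.py`), and it assumes both models expose their weights in the same order:

```python
import numpy as np
import keras

def partial_transfer(pretrained_model, new_model):
    """ Hypothetical sketch: copy the overlapping part of each weight tensor,
        and leave the remaining (new) weights at zero. """
    for old_var, new_var in zip(pretrained_model.weights, new_model.weights):
        old_w = old_var.numpy()
        new_w = np.zeros(new_var.shape, dtype = old_w.dtype)

        # Copy the sub-tensor shared by both shapes; the extra rows/columns stay at
        # zero, so the new model initially ignores its additional inputs.
        common = tuple(slice(0, min(o, n)) for o, n in zip(old_w.shape, new_w.shape))
        new_w[common] = old_w[common]

        new_var.assign(new_w)
    return new_model

# Tiny demo: a Dense layer whose input grows from 4 to 6 features (e.g., encoded
# text + speaker embedding). As long as the 2 extra inputs are zero, the new layer
# behaves exactly like the pretrained one.
old = keras.Sequential([keras.Input(shape = (4, )), keras.layers.Dense(3)])
new = keras.Sequential([keras.Input(shape = (6, )), keras.layers.Dense(3)])
partial_transfer(old, new)
```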
The code for this project is a mixture of multiple GitHub projects, in order to have a fully modular Tacotron-2 implementation:

- NVIDIA's tacotron2 / waveglow repositories: the base pretrained models are inspired by these repositories.
- The TFTTS project: some inference optimizations are inspired by their dynamic decoder implementation, which has now been optimized and updated to be Keras 3 compatible.
- CorentinJ's Real-Time Voice Cloning project: the provided SV2TTS architecture is inspired by this repository, with small differences and optimizations.
- Tacotron 2: the original Tacotron-2 paper
- WaveGlow: the original WaveGlow paper
- Transfer Learning from Speaker Verification to Text-To-Speech: the original paper for the SV2TTS variant
- Generalized End-to-End Loss for Speaker Verification: the GE2E loss paper (used to train the speaker encoder in the SV2TTS architecture)
Contacts:
- Mail: yui-mhcp@tutanota.com
- Discord: yui0732
This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0). See the LICENSE file for details.
This license allows you to use, modify, and distribute the code, as long as you include the original copyright and license notice in any copy of the software/source. Additionally, if you modify the code and distribute it, or run it on a server as a service, you must make your modified version available under the same license.
For more information about the AGPL-3.0 license, please visit the official website.
If you find this project useful in your work, please add this citation to give it more visibility!
@misc{yui-mhcp,
author = {yui},
title = {A Deep Learning projects centralization},
year = {2021},
publisher = {GitHub},
howpublished = {\url{https://github.com/yui-mhcp}}
}