15 Commits

Author SHA1 Message Date
Vega
724194a4de Add code to control finetune layers (#154) 2021-10-23 10:25:43 +08:00
babysor00
31bc6656c3 Fix bug of importing GST and add more parameters in toolbox 2021-10-21 00:40:00 +08:00
洛竹
aa35fb3139 docs: this repo -> 本代码库 (#157) 2021-10-20 22:54:31 +08:00
Co-authored-by: 洛竹 <youngjuning@aliyun.com>
babysor00
727eafc51b Merge branch 'main' of https://github.com/babysor/Realtime-Voice-Clone-Chinese 2021-10-20 00:27:19 +08:00
babysor00
d328ecba81 Reconstruct UI of toolbox 2021-10-20 00:27:13 +08:00
Vega
fad574118c Update README-CN.md 2021-10-18 13:50:19 +08:00
babysor00
b0c156a537 Add new dataset support to preprocess parameter 2021-10-17 17:21:49 +08:00
Vega
724809abf4 Update README.md 2021-10-15 14:34:29 +08:00
Vega
05cd1a54ea Add new pretrain model with gst 2021-10-14 01:26:23 +08:00
李子
245099c740 Support the data_aishell (SLR33) dataset (#141) 2021-10-12 23:40:27 +08:00
* Support the data_aishell (SLR33) dataset
* Update readme
babysor00
6dd2af49fe Merge branch 'main' of https://github.com/babysor/Realtime-Voice-Clone-Chinese 2021-10-12 20:02:05 +08:00
babysor00
8b43ec9a64 Fix bug pre-processing magicdata 2021-10-12 20:01:37 +08:00
Vega
2a99f0ff05 Add gst (#137) 2021-10-12 19:43:29 +08:00
* Commit with working GST
* Make it backward compatible
* Add readme
babysor00
a824b54122 Add preprocessing documentation 2021-10-12 09:22:10 +08:00
weida wang
81befb91b0 Update ui.py (#136) 2021-10-11 17:17:36 +08:00
Add minimize and maximize buttons to the window
17 changed files with 343 additions and 93 deletions

.vscode/launch.json (vendored, 16 lines changed)

@@ -17,7 +17,7 @@
"request": "launch",
"program": "vocoder_preprocess.py",
"console": "integratedTerminal",
"args": ["..\\..\\chs1"]
"args": ["..\\audiodata"]
},
{
"name": "Python: Vocoder Train",
@@ -25,7 +25,7 @@
"request": "launch",
"program": "vocoder_train.py",
"console": "integratedTerminal",
"args": ["dev", "..\\..\\chs1"]
"args": ["dev", "..\\audiodata"]
},
{
"name": "Python: Demo Box",
@@ -33,7 +33,15 @@
"request": "launch",
"program": "demo_toolbox.py",
"console": "integratedTerminal",
"args": ["-d","..\\..\\chs"]
}
"args": ["-d","..\\audiodata"]
},
{
"name": "Python: Synth Train",
"type": "python",
"request": "launch",
"program": "synthesizer_train.py",
"console": "integratedTerminal",
"args": ["my_run", "..\\"]
},
]
}
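The added "Python: Synth Train" entry simply debugs the synthesizer training script; a minimal sketch of the equivalent call from Python, reusing the placeholder arguments from the config above:

```python
import subprocess

# Mirrors the "Python: Synth Train" launch configuration:
# python synthesizer_train.py my_run ..\
subprocess.run(["python", "synthesizer_train.py", "my_run", "..\\"], check=True)
```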

README-CN.md

@@ -5,10 +5,10 @@
### [English](README.md) | 中文
### [DEMO VIDEO](https://www.bilibili.com/video/BV17Q4y1B7mY/)
### [DEMO VIDEO](https://www.bilibili.com/video/BV17Q4y1B7mY/) | [Wiki教程](https://github.com/babysor/MockingBird/wiki/Quick-Start-(Newbie)) [训练教程](https://vaj2fgg8yn.feishu.cn/docs/doccn7kAbr3SJz0KM0SIDJ0Xnhd)
## 特性
🌍 **中文** 支持普通话并使用多种中文数据集进行测试:aidatatang_200zh, magicdata, aishell3, biaobei, MozillaCommonVoice 等
🌍 **中文** 支持普通话并使用多种中文数据集进行测试:aidatatang_200zh, magicdata, aishell3, biaobei, MozillaCommonVoice, data_aishell
🤩 **PyTorch** 适用于 pytorch,已在 1.9.0 版本(最新于 2021 年 8 月)中测试,GPU Tesla T4 和 GTX 2060
@@ -18,6 +18,7 @@
🌍 **Webserver Ready** 可伺服你的训练结果,供远程调用
## 开始
### 1. 安装要求
> 按照原始存储库测试您是否已准备好所有环境。
**Python 3.7 或更高版本** 需要运行工具箱。
@@ -34,8 +35,10 @@
#### 2.1 使用数据集自己训练合成器模型(与2.2二选一)
* 下载数据集并解压:确保您可以访问 *train* 文件夹中的所有音频文件(如 .wav)
* 进行音频和梅尔频谱图预处理:
`python pre.py <datasets_root>`
传入参数 --dataset `{dataset}` 支持 aidatatang_200zh, magicdata, aishell3
`python pre.py <datasets_root> -d {dataset} -n {number}`
可传入参数:
* -d `{dataset}` 指定数据集,支持 aidatatang_200zh, magicdata, aishell3, data_aishell,不传默认为 aidatatang_200zh
* -n `{number}` 指定并行数,CPU 11770k + 32GB 实测 10 没有问题
> 假如你下载的 `aidatatang_200zh` 文件放在 D 盘,`train` 文件路径为 `D:\data\aidatatang_200zh\corpus\train`,你的 `datasets_root` 就是 `D:\data\`
* 训练合成器:
@@ -48,7 +51,7 @@
| 作者 | 下载链接 | 效果预览 | 信息 |
| --- | ----------- | ----- | ----- |
| 作者 | https://pan.baidu.com/s/1VHSKIbxXQejtxi2at9IrpA [百度盘链接](https://pan.baidu.com/s/1VHSKIbxXQejtxi2at9IrpA ) 提取码:i183 | | 200k steps 只用aidatatang_200zh
| 作者 | https://pan.baidu.com/s/11FrUYBmLrSs_cQ7s3JTlPQ [百度盘链接](https://pan.baidu.com/s/11FrUYBmLrSs_cQ7s3JTlPQ) 提取码:gdn5 | | 25k steps 用3个开源数据集混合训练
|@FawenYo | https://drive.google.com/file/d/1H-YGOUHpmqKxJ9FRc6vAjPuqQki24UbC/view?usp=sharing [百度盘链接](https://pan.baidu.com/s/1vSYXO4wsLyjnF3Unl-Xoxg) 提取码:1024 | [input](https://github.com/babysor/MockingBird/wiki/audio/self_test.mp3) [output](https://github.com/babysor/MockingBird/wiki/audio/export.wav) | 200k steps 台湾口音
|@miven| https://pan.baidu.com/s/1PI-hM3sn5wbeChRryX-RCQ 提取码:2021 | https://www.bilibili.com/video/BV1uh411B7AD/ | 150k steps 旧版需根据[issue](https://github.com/babysor/MockingBird/issues/37)修复
@@ -119,8 +122,9 @@
| URL | Designation | 标题 | 实现源码 |
| --- | ----------- | ----- | --------------------- |
| [1803.09017](https://arxiv.org/abs/1803.09017) | GlobalStyleToken (synthesizer)| Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis | 本代码库 |
| [2010.05646](https://arxiv.org/abs/2010.05646) | HiFi-GAN (vocoder)| Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis | 本代码库 |
|[**1806.04558**](https://arxiv.org/pdf/1806.04558.pdf) | **SV2TTS** | **Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis** | This repo |
|[**1806.04558**](https://arxiv.org/pdf/1806.04558.pdf) | SV2TTS | Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis | 本代码库 |
|[1802.08435](https://arxiv.org/pdf/1802.08435.pdf) | WaveRNN (vocoder) | Efficient Neural Audio Synthesis | [fatchord/WaveRNN](https://github.com/fatchord/WaveRNN) |
|[1703.10135](https://arxiv.org/pdf/1703.10135.pdf) | Tacotron (synthesizer) | Tacotron: Towards End-to-End Speech Synthesis | [fatchord/WaveRNN](https://github.com/fatchord/WaveRNN)
|[1710.10467](https://arxiv.org/pdf/1710.10467.pdf) | GE2E (encoder)| Generalized End-To-End Loss for Speaker Verification | 本代码库 |
@@ -132,6 +136,7 @@
| aidatatang_200zh | [OpenSLR](http://www.openslr.org/62/) | [Google Drive](https://drive.google.com/file/d/110A11KZoVe7vy6kXlLb6zVPLb_J91I_t/view?usp=sharing) |
| magicdata | [OpenSLR](http://www.openslr.org/68/) | [Google Drive (Dev set)](https://drive.google.com/file/d/1g5bWRUSNH68ycC6eNvtwh07nX3QhOOlo/view?usp=sharing) |
| aishell3 | [OpenSLR](https://www.openslr.org/93/) | [Google Drive](https://drive.google.com/file/d/1shYp_o4Z0X0cZSKQDtFirct2luFUwKzZ/view?usp=sharing) |
| data_aishell | [OpenSLR](https://www.openslr.org/33/) | |
> 解壓 aidatatang_200zh 後,還需將 `aidatatang_200zh\corpus\train`下的檔案全選解壓縮
#### 2.`<datasets_root>`是什麼意思?
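The preprocessing options documented above can also be driven from a short script; a minimal sketch using `subprocess`, assuming `pre.py` is run from the repository root and `datasets_root` is laid out as in the note above (the path is a placeholder):

```python
import subprocess
from pathlib import Path

# Placeholder datasets_root containing aidatatang_200zh\corpus\train, as described above
datasets_root = Path(r"D:\data")

# Equivalent to: python pre.py <datasets_root> -d aidatatang_200zh -n 10
subprocess.run(
    ["python", "pre.py", str(datasets_root), "-d", "aidatatang_200zh", "-n", "10"],
    check=True,
)
```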

README.md

@@ -6,7 +6,7 @@
> English | [中文](README-CN.md)
## Features
🌍 **Chinese** supported mandarin and tested with multiple datasets: aidatatang_200zh, magicdata, aishell3, and etc.
🌍 **Chinese** supported mandarin and tested with multiple datasets: aidatatang_200zh, magicdata, aishell3, data_aishell, and etc.
🤩 **PyTorch** worked for pytorch, tested in version of 1.9.0(latest in August 2021), with GPU Tesla T4 and GTX 2060
@@ -36,7 +36,7 @@ You can either train your models or use existing ones:
* Download dataset and unzip: make sure you can access all .wav in folder
* Preprocess with the audios and the mel spectrograms:
`python pre.py <datasets_root>`
Allowing parameter `--dataset {dataset}` to support aidatatang_200zh, magicdata, aishell3, etc.
Allowing parameter `--dataset {dataset}` to support aidatatang_200zh, magicdata, aishell3, data_aishell, etc. If this parameter is not passed, the default dataset will be aidatatang_200zh.
* Train the synthesizer:
`python synthesizer_train.py mandarin <datasets_root>/SV2TTS/synthesizer`
@@ -49,7 +49,7 @@ Allowing parameter `--dataset {dataset}` to support aidatatang_200zh, magicdata,
| author | Download link | Preview Video | Info |
| --- | ----------- | ----- |----- |
| @myself | https://pan.baidu.com/s/1VHSKIbxXQejtxi2at9IrpA [Baidu](https://pan.baidu.com/s/1VHSKIbxXQejtxi2at9IrpA ) code: i183 | | 200k steps only trained by aidatatang_200zh
|@FawenYo | https://drive.google.com/file/d/1H-YGOUHpmqKxJ9FRc6vAjPuqQki24UbC/view?usp=sharing [Baidu Pan](https://pan.baidu.com/s/1vSYXO4wsLyjnF3Unl-Xoxg) Code: 1024 | [input](https://github.com/babysor/MockingBird/wiki/audio/self_test.mp3) [output](https://github.com/babysor/MockingBird/wiki/audio/export.wav) | 200k steps with local accent of Taiwan
|@FawenYo | https://drive.google.com/file/d/1H-YGOUHpmqKxJ9FRc6vAjPuqQki24UbC/view?usp=sharing https://u.teknik.io/AYxWf.pt | [input](https://github.com/babysor/MockingBird/wiki/audio/self_test.mp3) [output](https://github.com/babysor/MockingBird/wiki/audio/export.wav) | 200k steps with local accent of Taiwan
|@miven| https://pan.baidu.com/s/1PI-hM3sn5wbeChRryX-RCQ code: 2021 | https://www.bilibili.com/video/BV1uh411B7AD/
#### 2.3 Train vocoder (Optional)
@@ -77,6 +77,7 @@ You can then try the toolbox:
| URL | Designation | Title | Implementation source |
| --- | ----------- | ----- | --------------------- |
| [1803.09017](https://arxiv.org/abs/1803.09017) | GlobalStyleToken (synthesizer)| Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis | This repo |
| [2010.05646](https://arxiv.org/abs/2010.05646) | HiFi-GAN (vocoder)| Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis | This repo |
|[**1806.04558**](https://arxiv.org/pdf/1806.04558.pdf) | **SV2TTS** | **Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis** | This repo |
|[1802.08435](https://arxiv.org/pdf/1802.08435.pdf) | WaveRNN (vocoder) | Efficient Neural Audio Synthesis | [fatchord/WaveRNN](https://github.com/fatchord/WaveRNN) |
@@ -90,6 +91,7 @@ You can then try the toolbox:
| aidatatang_200zh | [OpenSLR](http://www.openslr.org/62/) | [Google Drive](https://drive.google.com/file/d/110A11KZoVe7vy6kXlLb6zVPLb_J91I_t/view?usp=sharing) |
| magicdata | [OpenSLR](http://www.openslr.org/68/) | [Google Drive (Dev set)](https://drive.google.com/file/d/1g5bWRUSNH68ycC6eNvtwh07nX3QhOOlo/view?usp=sharing) |
| aishell3 | [OpenSLR](https://www.openslr.org/93/) | [Google Drive](https://drive.google.com/file/d/1shYp_o4Z0X0cZSKQDtFirct2luFUwKzZ/view?usp=sharing) |
| data_aishell | [OpenSLR](https://www.openslr.org/33/) | |
> After unzip aidatatang_200zh, you need to unzip all the files under `aidatatang_200zh\corpus\train`
#### 2.What is`<datasets_root>`?

pre.py (5 lines changed)

@@ -12,7 +12,8 @@ import argparse
recognized_datasets = [
"aidatatang_200zh",
"magicdata",
"aishell3"
"aishell3",
"data_aishell"
]
if __name__ == "__main__":
@@ -40,7 +41,7 @@ if __name__ == "__main__":
"Use this option when dataset does not include alignments\
(these are used to split long audio files into sub-utterances.)")
parser.add_argument("-d", "--dataset", type=str, default="aidatatang_200zh", help=\
"Name of the dataset to process, allowing values: magicdata, aidatatang_200zh, aishell3.")
"Name of the dataset to process, allowing values: magicdata, aidatatang_200zh, aishell3, data_aishell.")
parser.add_argument("-e", "--encoder_model_fpath", type=Path, default="encoder/saved_models/pretrained.pt", help=\
"Path your trained encoder model.")
parser.add_argument("-ne", "--n_processes_embed", type=int, default=1, help=\

requirements.txt

@@ -19,4 +19,5 @@ flask
flask_wtf
flask_cors
gevent==21.8.0
flask_restx
flask_restx
tensorboard

synthesizer/gst_hyperparameters.py

@@ -0,0 +1,13 @@
class GSTHyperparameters():
E = 512
# reference encoder
ref_enc_filters = [32, 32, 64, 64, 128, 128]
# style token layer
token_num = 10
# token_emb_size = 256
num_heads = 8
n_mels = 256 # Number of Mel banks to generate
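These values size the style-token layer defined in `synthesizer/models/global_style_token.py` below (that file imports this class); a quick sketch of the derived shapes, assuming the module is importable from the path used there:

```python
from synthesizer.gst_hyperparameters import GSTHyperparameters as hp

# STL keeps a [token_num, E // num_heads] token table: 10 tokens of 64 dims each,
# combined by 8 attention heads into a single E = 512-dim style embedding.
print(hp.token_num, hp.E // hp.num_heads, hp.E)  # 10 64 512
```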

synthesizer/hparams.py

@@ -62,9 +62,11 @@ hparams = HParams(
tts_clip_grad_norm = 1.0, # clips the gradient norm to prevent explosion - set to None if not needed
tts_eval_interval = 500, # Number of steps between model evaluation (sample generation)
# Set to -1 to generate after completing epoch, or 0 to disable
tts_eval_num_samples = 1, # Makes this number of samples
## For finetune usage, if set, only selected layers will be trained, available: encoder,encoder_proj,gst,decoder,postnet,post_proj
tts_finetune_layers = [],
### Data Preprocessing
max_mel_frames = 900,
rescale = True,
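A hedged sketch of turning on partial finetuning via the new flag, assuming the usual `synthesizer.hparams` module path; layer names must come from the whitelist in the comment above:

```python
from synthesizer.hparams import hparams

# Keep only the decoder and postnet trainable; synthesizer/train.py calls
# model.finetune_partial(hparams.tts_finetune_layers) whenever this list is non-empty.
hparams.tts_finetune_layers = ["decoder", "postnet"]
```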

synthesizer/inference.py

@@ -70,7 +70,7 @@ class Synthesizer:
def synthesize_spectrograms(self, texts: List[str],
embeddings: Union[np.ndarray, List[np.ndarray]],
return_alignments=False):
return_alignments=False, style_idx=0, min_stop_token=5):
"""
Synthesizes mel spectrograms from texts and speaker embeddings.
@@ -125,7 +125,7 @@ class Synthesizer:
speaker_embeddings = torch.tensor(speaker_embeds).float().to(self.device)
# Inference
_, mels, alignments = self._model.generate(chars, speaker_embeddings)
_, mels, alignments = self._model.generate(chars, speaker_embeddings, style_idx=style_idx, min_stop_token=min_stop_token)
mels = mels.detach().cpu().numpy()
for m in mels:
# Trim silence from end of each spectrogram
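A usage sketch for the extended signature; the checkpoint path and the 256-dim placeholder embedding are assumptions (a real embedding comes from the speaker encoder), while `style_idx` and `min_stop_token` follow the ranges exposed by the toolbox sliders further down (-1 to 9 and 3 to 9):

```python
import numpy as np
from pathlib import Path
from synthesizer.inference import Synthesizer

# Assumed checkpoint location; point this at your trained synthesizer model
synthesizer = Synthesizer(Path("synthesizer/saved_models/mandarin/mandarin.pt"))

texts = ["欢迎使用语音克隆工具箱"]
embeds = [np.random.rand(256).astype(np.float32)]  # placeholder speaker embedding

# style_idx=-1 keeps the plain speaker embedding; 0-9 picks one of the 10 GST tokens.
# A larger min_stop_token lets the decoder run longer before stopping (the "Accuracy" slider).
specs = synthesizer.synthesize_spectrograms(texts, embeds, style_idx=0, min_stop_token=5)
print(specs[0].shape)  # (n_mels, frames)
```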

synthesizer/models/global_style_token.py

@@ -0,0 +1,135 @@
import torch
import torch.nn as nn
import torch.nn.init as init
import torch.nn.functional as tFunctional
from synthesizer.gst_hyperparameters import GSTHyperparameters as hp
class GlobalStyleToken(nn.Module):
def __init__(self):
super().__init__()
self.encoder = ReferenceEncoder()
self.stl = STL()
def forward(self, inputs):
enc_out = self.encoder(inputs)
style_embed = self.stl(enc_out)
return style_embed
class ReferenceEncoder(nn.Module):
'''
inputs --- [N, Ty/r, n_mels*r] mels
outputs --- [N, ref_enc_gru_size]
'''
def __init__(self):
super().__init__()
K = len(hp.ref_enc_filters)
filters = [1] + hp.ref_enc_filters
convs = [nn.Conv2d(in_channels=filters[i],
out_channels=filters[i + 1],
kernel_size=(3, 3),
stride=(2, 2),
padding=(1, 1)) for i in range(K)]
self.convs = nn.ModuleList(convs)
self.bns = nn.ModuleList([nn.BatchNorm2d(num_features=hp.ref_enc_filters[i]) for i in range(K)])
out_channels = self.calculate_channels(hp.n_mels, 3, 2, 1, K)
self.gru = nn.GRU(input_size=hp.ref_enc_filters[-1] * out_channels,
hidden_size=hp.E // 2,
batch_first=True)
def forward(self, inputs):
N = inputs.size(0)
out = inputs.view(N, 1, -1, hp.n_mels) # [N, 1, Ty, n_mels]
for conv, bn in zip(self.convs, self.bns):
out = conv(out)
out = bn(out)
out = tFunctional.relu(out) # [N, 128, Ty//2^K, n_mels//2^K]
out = out.transpose(1, 2) # [N, Ty//2^K, 128, n_mels//2^K]
T = out.size(1)
N = out.size(0)
out = out.contiguous().view(N, T, -1) # [N, Ty//2^K, 128*n_mels//2^K]
self.gru.flatten_parameters()
memory, out = self.gru(out) # out --- [1, N, E//2]
return out.squeeze(0)
def calculate_channels(self, L, kernel_size, stride, pad, n_convs):
for i in range(n_convs):
L = (L - kernel_size + 2 * pad) // stride + 1
return L
class STL(nn.Module):
'''
inputs --- [N, E//2]
'''
def __init__(self):
super().__init__()
self.embed = nn.Parameter(torch.FloatTensor(hp.token_num, hp.E // hp.num_heads))
d_q = hp.E // 2
d_k = hp.E // hp.num_heads
# self.attention = MultiHeadAttention(hp.num_heads, d_model, d_q, d_v)
self.attention = MultiHeadAttention(query_dim=d_q, key_dim=d_k, num_units=hp.E, num_heads=hp.num_heads)
init.normal_(self.embed, mean=0, std=0.5)
def forward(self, inputs):
N = inputs.size(0)
query = inputs.unsqueeze(1) # [N, 1, E//2]
keys = tFunctional.tanh(self.embed).unsqueeze(0).expand(N, -1, -1) # [N, token_num, E // num_heads]
style_embed = self.attention(query, keys)
return style_embed
class MultiHeadAttention(nn.Module):
'''
input:
query --- [N, T_q, query_dim]
key --- [N, T_k, key_dim]
output:
out --- [N, T_q, num_units]
'''
def __init__(self, query_dim, key_dim, num_units, num_heads):
super().__init__()
self.num_units = num_units
self.num_heads = num_heads
self.key_dim = key_dim
self.W_query = nn.Linear(in_features=query_dim, out_features=num_units, bias=False)
self.W_key = nn.Linear(in_features=key_dim, out_features=num_units, bias=False)
self.W_value = nn.Linear(in_features=key_dim, out_features=num_units, bias=False)
def forward(self, query, key):
querys = self.W_query(query) # [N, T_q, num_units]
keys = self.W_key(key) # [N, T_k, num_units]
values = self.W_value(key)
split_size = self.num_units // self.num_heads
querys = torch.stack(torch.split(querys, split_size, dim=2), dim=0) # [h, N, T_q, num_units/h]
keys = torch.stack(torch.split(keys, split_size, dim=2), dim=0) # [h, N, T_k, num_units/h]
values = torch.stack(torch.split(values, split_size, dim=2), dim=0) # [h, N, T_k, num_units/h]
# score = softmax(QK^T / (d_k ** 0.5))
scores = torch.matmul(querys, keys.transpose(2, 3)) # [h, N, T_q, T_k]
scores = scores / (self.key_dim ** 0.5)
scores = tFunctional.softmax(scores, dim=3)
# out = score * V
out = torch.matmul(scores, values) # [h, N, T_q, num_units/h]
out = torch.cat(torch.split(out, 1, dim=0), dim=3).squeeze(0) # [N, T_q, num_units]
return out
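A quick shape check for the new module, written as a sketch: the Tacotron diff below passes the speaker embedding through `self.gst(...)`, so a `[batch, 256]` tensor stands in for that input here (256 matches `GSTHyperparameters.n_mels`):

```python
import torch
from synthesizer.models.global_style_token import GlobalStyleToken

gst = GlobalStyleToken()
dummy = torch.randn(2, 256)   # stand-in for a batch of two 256-dim speaker embeddings
style = gst(dummy)            # ReferenceEncoder GRU -> STL multi-head attention
print(style.shape)            # expected: torch.Size([2, 1, 512]), one E-dim style vector per item
```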

synthesizer/models/tacotron.py

@@ -3,8 +3,7 @@ import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from pathlib import Path
from typing import Union
from synthesizer.models.global_style_token import GlobalStyleToken
class HighwayNetwork(nn.Module):
@@ -338,6 +337,7 @@ class Tacotron(nn.Module):
self.encoder = Encoder(embed_dims, num_chars, encoder_dims,
encoder_K, num_highways, dropout)
self.encoder_proj = nn.Linear(encoder_dims + speaker_embedding_size, decoder_dims, bias=False)
self.gst = GlobalStyleToken()
self.decoder = Decoder(n_mels, encoder_dims, decoder_dims, lstm_dims,
dropout, speaker_embedding_size)
self.postnet = CBHG(postnet_K, n_mels, postnet_dims,
@@ -358,11 +358,11 @@ class Tacotron(nn.Module):
def r(self, value):
self.decoder.r = self.decoder.r.new_tensor(value, requires_grad=False)
def forward(self, x, m, speaker_embedding):
def forward(self, texts, mels, speaker_embedding):
device = next(self.parameters()).device # use same device as parameters
self.step += 1
batch_size, _, steps = m.size()
batch_size, _, steps = mels.size()
# Initialise all hidden states and pack into tuple
attn_hidden = torch.zeros(batch_size, self.decoder_dims, device=device)
@@ -383,7 +383,12 @@ class Tacotron(nn.Module):
# SV2TTS: Run the encoder with the speaker embedding
# The projection avoids unnecessary matmuls in the decoder loop
encoder_seq = self.encoder(x, speaker_embedding)
encoder_seq = self.encoder(texts, speaker_embedding)
# put after encoder
if self.gst is not None:
style_embed = self.gst(speaker_embedding)
style_embed = style_embed.expand_as(encoder_seq)
encoder_seq = encoder_seq + style_embed
encoder_seq_proj = self.encoder_proj(encoder_seq)
# Need a couple of lists for outputs
@@ -391,10 +396,10 @@ class Tacotron(nn.Module):
# Run the decoder loop
for t in range(0, steps, self.r):
prenet_in = m[:, :, t - 1] if t > 0 else go_frame
prenet_in = mels[:, :, t - 1] if t > 0 else go_frame
mel_frames, scores, hidden_states, cell_states, context_vec, stop_tokens = \
self.decoder(encoder_seq, encoder_seq_proj, prenet_in,
hidden_states, cell_states, context_vec, t, x)
hidden_states, cell_states, context_vec, t, texts)
mel_outputs.append(mel_frames)
attn_scores.append(scores)
stop_outputs.extend([stop_tokens] * self.r)
@@ -414,7 +419,7 @@ class Tacotron(nn.Module):
return mel_outputs, linear, attn_scores, stop_outputs
def generate(self, x, speaker_embedding=None, steps=2000):
def generate(self, x, speaker_embedding=None, steps=200, style_idx=0, min_stop_token=5):
self.eval()
device = next(self.parameters()).device # use same device as parameters
@@ -440,6 +445,18 @@ class Tacotron(nn.Module):
# SV2TTS: Run the encoder with the speaker embedding
# The projection avoids unnecessary matmuls in the decoder loop
encoder_seq = self.encoder(x, speaker_embedding)
# put after encoder
if self.gst is not None and style_idx >= 0 and style_idx < 10:
gst_embed = self.gst.stl.embed.cpu().data.numpy() #[0, number_token]
gst_embed = np.tile(gst_embed, (1, 8))
scale = np.zeros(512)
scale[:] = 0.3
speaker_embedding = (gst_embed[style_idx] * scale).astype(np.float32)
speaker_embedding = torch.from_numpy(np.tile(speaker_embedding, (x.shape[0], 1))).to(device)
style_embed = self.gst(speaker_embedding)
style_embed = style_embed.expand_as(encoder_seq)
encoder_seq = encoder_seq + style_embed
encoder_seq_proj = self.encoder_proj(encoder_seq)
# Need a couple of lists for outputs
@@ -455,7 +472,7 @@ class Tacotron(nn.Module):
attn_scores.append(scores)
stop_outputs.extend([stop_tokens] * self.r)
# Stop the loop when all stop tokens in batch exceed threshold
if (stop_tokens > 0.5).all() and t > 10: break
if (stop_tokens * 10 > min_stop_token).all() and t > 10: break
# Concat the mel outputs into sequence
mel_outputs = torch.cat(mel_outputs, dim=2)
@@ -479,6 +496,15 @@ class Tacotron(nn.Module):
for p in self.parameters():
if p.dim() > 1: nn.init.xavier_uniform_(p)
def finetune_partial(self, whitelist_layers):
    self.zero_grad()
    for name, child in self.named_children():
        if name in whitelist_layers:
            print("Trainable Layer: %s" % name)
            print("Trainable Parameters: %.3f" % sum([np.prod(p.size()) for p in child.parameters()]))
        else:
            # Freeze every layer that is not whitelisted for finetuning
            for param in child.parameters():
                param.requires_grad = False
def get_step(self):
return self.step.data.item()
@@ -494,7 +520,7 @@ class Tacotron(nn.Module):
# Use device of model params as location for loaded state
device = next(self.parameters()).device
checkpoint = torch.load(str(path), map_location=device)
self.load_state_dict(checkpoint["model_state"])
self.load_state_dict(checkpoint["model_state"], strict=False)
if "optimizer_state" in checkpoint and optimizer is not None:
optimizer.load_state_dict(checkpoint["optimizer_state"])
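A hedged sketch of the partial-finetuning hook added above. The model construction mirrors what `synthesizer/train.py` does; the exact keyword and hparam names are assumptions carried over from the standard SV2TTS code this repo builds on:

```python
import torch
import torch.optim as optim
from synthesizer.hparams import hparams
from synthesizer.models.tacotron import Tacotron
from synthesizer.utils.symbols import symbols

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = Tacotron(embed_dims=hparams.tts_embed_dims,
                 num_chars=len(symbols),
                 encoder_dims=hparams.tts_encoder_dims,
                 decoder_dims=hparams.tts_decoder_dims,
                 n_mels=hparams.num_mels,
                 fft_bins=hparams.num_mels,
                 postnet_dims=hparams.tts_postnet_dims,
                 encoder_K=hparams.tts_encoder_K,
                 lstm_dims=hparams.tts_lstm_dims,
                 postnet_K=hparams.tts_postnet_K,
                 num_highways=hparams.tts_num_highways,
                 dropout=hparams.tts_dropout,
                 stop_threshold=hparams.tts_stop_threshold,
                 speaker_embedding_size=hparams.speaker_embedding_size).to(device)

# Freeze every child module except the whitelisted ones; valid names are the
# Tacotron attributes: encoder, encoder_proj, gst, decoder, postnet, post_proj.
model.finetune_partial(["decoder", "postnet"])

# Hand the optimizer only the parameters that still require gradients
# (train.py keeps passing model.parameters(); frozen parameters simply get no grads).
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = optim.Adam(trainable, amsgrad=True)
```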

synthesizer/preprocess.py

@@ -7,7 +7,7 @@ from tqdm import tqdm
import numpy as np
from encoder import inference as encoder
from synthesizer.preprocess_speaker import preprocess_speaker_general
from synthesizer.preprocess_transcript import preprocess_transcript_aishell3
from synthesizer.preprocess_transcript import preprocess_transcript_aishell3, preprocess_transcript_magicdata
data_info = {
"aidatatang_200zh": {
@@ -18,13 +18,19 @@ data_info = {
"magicdata": {
"subfolders": ["train"],
"trans_filepath": "train/TRANS.txt",
"speak_func": preprocess_speaker_general
"speak_func": preprocess_speaker_general,
"transcript_func": preprocess_transcript_magicdata,
},
"aishell3":{
"subfolders": ["train/wav"],
"trans_filepath": "train/content.txt",
"speak_func": preprocess_speaker_general,
"transcript_func": preprocess_transcript_aishell3,
},
"data_aishell":{
"subfolders": ["wav/train"],
"trans_filepath": "transcript/aishell_transcript_v0.8.txt",
"speak_func": preprocess_speaker_general
}
}

synthesizer/preprocess_transcript.py

@@ -6,4 +6,13 @@ def preprocess_transcript_aishell3(dict_info, dict_transcript):
transList = []
for i in range(2, len(v), 2):
transList.append(v[i])
dict_info[v[0]] = " ".join(transList)
dict_info[v[0]] = " ".join(transList)
def preprocess_transcript_magicdata(dict_info, dict_transcript):
for v in dict_transcript:
if not v:
continue
v = v.strip().replace("\n","").replace("\t"," ").split(" ")
dict_info[v[0]] = " ".join(v[2:])
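A small sketch of the new MAGICDATA transcript parser on a made-up line; the exact TRANS.txt layout (utterance file name, speaker id, then the transcript) is an assumption:

```python
from synthesizer.preprocess_transcript import preprocess_transcript_magicdata

# Hypothetical TRANS.txt-style line: <utterance>\t<speaker>\t<text>
dict_info = {}
preprocess_transcript_magicdata(dict_info, ["38_5716_20170913094435.wav\tS0038\t今天 天气 怎么样"])
print(dict_info)  # {'38_5716_20170913094435.wav': '今天 天气 怎么样'}
```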

synthesizer/train.py

@@ -93,7 +93,7 @@ def train(run_id: str, syn_dir: str, models_dir: str, save_every: int,
speaker_embedding_size=hparams.speaker_embedding_size).to(device)
# Initialize the optimizer
optimizer = optim.Adam(model.parameters())
optimizer = optim.Adam(model.parameters(), amsgrad=True)
# Load the weights
if force_restart or not weights_fpath.exists():
@@ -146,7 +146,6 @@ def train(run_id: str, syn_dir: str, models_dir: str, save_every: int,
continue
model.r = r
# Begin the training
simple_table([(f"Steps with r={r}", str(training_steps // 1000) + "k Steps"),
("Batch Size", batch_size),
@@ -155,6 +154,8 @@ def train(run_id: str, syn_dir: str, models_dir: str, save_every: int,
for p in optimizer.param_groups:
p["lr"] = lr
if hparams.tts_finetune_layers is not None and len(hparams.tts_finetune_layers) > 0:
model.finetune_partial(hparams.tts_finetune_layers)
data_loader = DataLoader(dataset,
collate_fn=collate_synthesizer,

toolbox/__init__.py

@@ -71,6 +71,7 @@ class Toolbox:
# Initialize the events and the interface
self.ui = UI()
self.style_idx = 0
self.reset_ui(enc_models_dir, syn_models_dir, voc_models_dir, seed)
self.setup_events()
self.ui.start()
@@ -233,7 +234,8 @@ class Toolbox:
texts = processed_texts
embed = self.ui.selected_utterance.embed
embeds = [embed] * len(texts)
specs = self.synthesizer.synthesize_spectrograms(texts, embeds)
min_token = int(self.ui.token_slider.value())
specs = self.synthesizer.synthesize_spectrograms(texts, embeds, style_idx=int(self.ui.style_slider.value()), min_stop_token=min_token)
breaks = [spec.shape[1] for spec in specs]
spec = np.concatenate(specs, axis=1)

toolbox/assets/mb.png (new binary file, 5.6 KiB; binary content not shown)

toolbox/ui.py

@@ -2,6 +2,7 @@ import matplotlib.pyplot as plt
from matplotlib.backends.backend_qt5agg import FigureCanvasQTAgg as FigureCanvas
from matplotlib.figure import Figure
from PyQt5.QtCore import Qt, QStringListModel
from PyQt5 import QtGui
from PyQt5.QtWidgets import *
from encoder.inference import plot_embedding_as_heatmap
from toolbox.utterance import Utterance
@@ -420,7 +421,10 @@ class UI(QDialog):
## Initialize the application
self.app = QApplication(sys.argv)
super().__init__(None)
self.setWindowTitle("SV2TTS toolbox")
self.setWindowTitle("MockingBird GUI")
self.setWindowIcon(QtGui.QIcon('toolbox\\assets\\mb.png'))
self.setWindowFlag(Qt.WindowMinimizeButtonHint, True)
self.setWindowFlag(Qt.WindowMaximizeButtonHint, True)
## Main layouts
@@ -430,21 +434,24 @@ class UI(QDialog):
# Browser
browser_layout = QGridLayout()
root_layout.addLayout(browser_layout, 0, 0, 1, 2)
root_layout.addLayout(browser_layout, 0, 0, 1, 8)
# Generation
gen_layout = QVBoxLayout()
root_layout.addLayout(gen_layout, 0, 2, 1, 2)
# Projections
self.projections_layout = QVBoxLayout()
root_layout.addLayout(self.projections_layout, 1, 0, 1, 1)
root_layout.addLayout(gen_layout, 0, 8)
# Visualizations
vis_layout = QVBoxLayout()
root_layout.addLayout(vis_layout, 1, 1, 1, 3)
root_layout.addLayout(vis_layout, 1, 0, 2, 8)
# Output
output_layout = QGridLayout()
vis_layout.addLayout(output_layout, 0)
# Projections
self.projections_layout = QVBoxLayout()
root_layout.addLayout(self.projections_layout, 1, 8, 2, 2)
## Projections
# UMap
fig, self.umap_ax = plt.subplots(figsize=(3, 3), facecolor="#F0F0F0")
@@ -458,80 +465,88 @@ class UI(QDialog):
## Browser
# Dataset, speaker and utterance selection
i = 0
self.dataset_box = QComboBox()
browser_layout.addWidget(QLabel("<b>Dataset</b>"), i, 0)
browser_layout.addWidget(self.dataset_box, i + 1, 0)
self.speaker_box = QComboBox()
browser_layout.addWidget(QLabel("<b>Speaker</b>"), i, 1)
browser_layout.addWidget(self.speaker_box, i + 1, 1)
self.utterance_box = QComboBox()
browser_layout.addWidget(QLabel("<b>Utterance</b>"), i, 2)
browser_layout.addWidget(self.utterance_box, i + 1, 2)
self.browser_load_button = QPushButton("Load")
browser_layout.addWidget(self.browser_load_button, i + 1, 3)
i += 2
# Random buttons
source_groupbox = QGroupBox('Source(源音频)')
source_layout = QGridLayout()
source_groupbox.setLayout(source_layout)
browser_layout.addWidget(source_groupbox, i, 0, 1, 4)
self.dataset_box = QComboBox()
source_layout.addWidget(QLabel("Dataset(数据集):"), i, 0)
source_layout.addWidget(self.dataset_box, i, 1)
self.random_dataset_button = QPushButton("Random")
browser_layout.addWidget(self.random_dataset_button, i, 0)
source_layout.addWidget(self.random_dataset_button, i, 2)
i += 1
self.speaker_box = QComboBox()
source_layout.addWidget(QLabel("Speaker(说话者)"), i, 0)
source_layout.addWidget(self.speaker_box, i, 1)
self.random_speaker_button = QPushButton("Random")
browser_layout.addWidget(self.random_speaker_button, i, 1)
source_layout.addWidget(self.random_speaker_button, i, 2)
i += 1
self.utterance_box = QComboBox()
source_layout.addWidget(QLabel("Utterance(音频):"), i, 0)
source_layout.addWidget(self.utterance_box, i, 1)
self.random_utterance_button = QPushButton("Random")
browser_layout.addWidget(self.random_utterance_button, i, 2)
source_layout.addWidget(self.random_utterance_button, i, 2)
i += 1
source_layout.addWidget(QLabel("<b>Use(使用):</b>"), i, 0)
self.browser_load_button = QPushButton("Load Above(加载上面)")
source_layout.addWidget(self.browser_load_button, i, 1, 1, 2)
self.auto_next_checkbox = QCheckBox("Auto select next")
self.auto_next_checkbox.setChecked(True)
browser_layout.addWidget(self.auto_next_checkbox, i, 3)
i += 1
source_layout.addWidget(self.auto_next_checkbox, i+1, 1)
self.browser_browse_button = QPushButton("Browse(打开本地)")
source_layout.addWidget(self.browser_browse_button, i, 3)
self.record_button = QPushButton("Record(录音)")
source_layout.addWidget(self.record_button, i+1, 3)
i += 2
# Utterance box
browser_layout.addWidget(QLabel("<b>Use embedding from:</b>"), i, 0)
browser_layout.addWidget(QLabel("<b>Current(当前):</b>"), i, 0)
self.utterance_history = QComboBox()
browser_layout.addWidget(self.utterance_history, i, 1, 1, 3)
i += 1
# Random & next utterance buttons
self.browser_browse_button = QPushButton("Browse")
browser_layout.addWidget(self.browser_browse_button, i, 0)
self.record_button = QPushButton("Record")
browser_layout.addWidget(self.record_button, i, 1)
self.play_button = QPushButton("Play")
browser_layout.addWidget(self.utterance_history, i, 1)
self.play_button = QPushButton("Play(播放)")
browser_layout.addWidget(self.play_button, i, 2)
self.stop_button = QPushButton("Stop")
self.stop_button = QPushButton("Stop(暂停)")
browser_layout.addWidget(self.stop_button, i, 3)
i += 1
i += 1
model_groupbox = QGroupBox('Models(模型选择)')
model_layout = QHBoxLayout()
model_groupbox.setLayout(model_layout)
browser_layout.addWidget(model_groupbox, i, 0, 1, 4)
# Model and audio output selection
self.encoder_box = QComboBox()
browser_layout.addWidget(QLabel("<b>Encoder</b>"), i, 0)
browser_layout.addWidget(self.encoder_box, i + 1, 0)
model_layout.addWidget(QLabel("Encoder:"))
model_layout.addWidget(self.encoder_box)
self.synthesizer_box = QComboBox()
browser_layout.addWidget(QLabel("<b>Synthesizer</b>"), i, 1)
browser_layout.addWidget(self.synthesizer_box, i + 1, 1)
model_layout.addWidget(QLabel("Synthesizer:"))
model_layout.addWidget(self.synthesizer_box)
self.vocoder_box = QComboBox()
browser_layout.addWidget(QLabel("<b>Vocoder</b>"), i, 2)
browser_layout.addWidget(self.vocoder_box, i + 1, 2)
model_layout.addWidget(QLabel("Vocoder:"))
model_layout.addWidget(self.vocoder_box)
self.audio_out_devices_cb=QComboBox()
browser_layout.addWidget(QLabel("<b>Audio Output</b>"), i, 3)
browser_layout.addWidget(self.audio_out_devices_cb, i + 1, 3)
i += 2
#Replay & Save Audio
browser_layout.addWidget(QLabel("<b>Toolbox Output:</b>"), i, 0)
i = 0
output_layout.addWidget(QLabel("<b>Toolbox Output:</b>"), i, 0)
self.waves_cb = QComboBox()
self.waves_cb_model = QStringListModel()
self.waves_cb.setModel(self.waves_cb_model)
self.waves_cb.setToolTip("Select one of the last generated waves in this section for replaying or exporting")
browser_layout.addWidget(self.waves_cb, i, 1)
output_layout.addWidget(self.waves_cb, i, 1)
self.replay_wav_button = QPushButton("Replay")
self.replay_wav_button.setToolTip("Replay last generated vocoder")
browser_layout.addWidget(self.replay_wav_button, i, 2)
output_layout.addWidget(self.replay_wav_button, i, 2)
self.export_wav_button = QPushButton("Export")
self.export_wav_button.setToolTip("Save last generated vocoder audio in filesystem as a wav file")
browser_layout.addWidget(self.export_wav_button, i, 3)
output_layout.addWidget(self.export_wav_button, i, 3)
self.audio_out_devices_cb=QComboBox()
i += 1
output_layout.addWidget(QLabel("<b>Audio Output</b>"), i, 0)
output_layout.addWidget(self.audio_out_devices_cb, i, 1)
## Embed & spectrograms
vis_layout.addStretch()
@@ -552,7 +567,6 @@ class UI(QDialog):
for side in ["top", "right", "bottom", "left"]:
ax.spines[side].set_visible(False)
## Generation
self.text_prompt = QPlainTextEdit(default_text)
gen_layout.addWidget(self.text_prompt, stretch=1)
@@ -578,6 +592,32 @@ class UI(QDialog):
self.trim_silences_checkbox.setToolTip("When checked, trims excess silence in vocoder output."
" This feature requires `webrtcvad` to be installed.")
layout_seed.addWidget(self.trim_silences_checkbox, 0, 2, 1, 2)
self.style_slider = QSlider(Qt.Horizontal)
self.style_slider.setTickInterval(1)
self.style_slider.setFocusPolicy(Qt.NoFocus)
self.style_slider.setSingleStep(1)
self.style_slider.setRange(-1, 9)
self.style_value_label = QLabel("-1")
self.style_slider.setValue(-1)
layout_seed.addWidget(QLabel("Style:"), 1, 0)
self.style_slider.valueChanged.connect(lambda s: self.style_value_label.setNum(s))
layout_seed.addWidget(self.style_value_label, 1, 1)
layout_seed.addWidget(self.style_slider, 1, 3)
self.token_slider = QSlider(Qt.Horizontal)
self.token_slider.setTickInterval(1)
self.token_slider.setFocusPolicy(Qt.NoFocus)
self.token_slider.setSingleStep(1)
self.token_slider.setRange(3, 9)
self.token_value_label = QLabel("5")
self.token_slider.setValue(4)
layout_seed.addWidget(QLabel("Accuracy(精度):"), 2, 0)
self.token_slider.valueChanged.connect(lambda s: self.token_value_label.setNum(s))
layout_seed.addWidget(self.token_value_label, 2, 1)
layout_seed.addWidget(self.token_slider, 2, 3)
gen_layout.addLayout(layout_seed)
self.loading_bar = QProgressBar()
@@ -591,7 +631,7 @@ class UI(QDialog):
## Set the size of the window and of the elements
max_size = QDesktopWidget().availableGeometry(self).size() * 0.8
max_size = QDesktopWidget().availableGeometry(self).size() * 0.5
self.resize(max_size)
## Finalize the display

utils/modelutils.py

@@ -11,7 +11,6 @@ def check_model_paths(encoder_path: Path, synthesizer_path: Path, vocoder_path:
# If none of the paths exist, remind the user to download models if needed
print("********************************************************************************")
print("Error: Model files not found. Follow these instructions to get and install the models:")
print("https://github.com/CorentinJ/Real-Time-Voice-Cloning/wiki/Pretrained-models")
print("Error: Model files not found. Please download the models")
print("********************************************************************************\n")
quit(-1)