Mirror of https://github.com/babysor/Realtime-Voice-Clone-Chinese.git
Synced 2026-02-03 18:43:41 +08:00

Compare commits: v0.0.1 ... webtoolbox (2 commits)
| Author | SHA1 | Date |
| --- | --- | --- |
|  | 3713d64cc7 |  |
|  | 5e7cc82373 |  |
4  .gitignore (vendored)

@@ -17,7 +17,5 @@
*.sh
synthesizer/saved_models/*
vocoder/saved_models/*
encoder/saved_models/*
cp_hifigan/*
!vocoder/saved_models/pretrained/*
!encoder/saved_models/pretrained.pt
!vocoder/saved_models/pretrained/*
18  .vscode/launch.json (vendored)

@@ -17,7 +17,7 @@
"request": "launch",
"program": "vocoder_preprocess.py",
"console": "integratedTerminal",
"args": ["..\\audiodata"]
"args": ["..\\..\\chs1"]
},
{
"name": "Python: Vocoder Train",
@@ -25,23 +25,15 @@
"request": "launch",
"program": "vocoder_train.py",
"console": "integratedTerminal",
"args": ["dev", "..\\audiodata"]
"args": ["dev", "..\\..\\chs1"]
},
{
"name": "Python: Demo Box",
"name": "Python: demo box",
"type": "python",
"request": "launch",
"program": "demo_toolbox.py",
"console": "integratedTerminal",
"args": ["-d","..\\audiodata"]
},
{
"name": "Python: Synth Train",
"type": "python",
"request": "launch",
"program": "synthesizer_train.py",
"console": "integratedTerminal",
"args": ["my_run", "..\\"]
},
"args": ["-d", "..\\..\\chs"]
}
]
}
23  README-CN.md

@@ -5,10 +5,10 @@

### [English](README.md) | 中文

### [DEMO VIDEO](https://www.bilibili.com/video/BV17Q4y1B7mY/) | [Wiki教程](https://github.com/babysor/MockingBird/wiki/Quick-Start-(Newbie)) | [训练教程](https://vaj2fgg8yn.feishu.cn/docs/doccn7kAbr3SJz0KM0SIDJ0Xnhd)
### [DEMO VIDEO](https://www.bilibili.com/video/BV1sA411P7wM/)

## 特性
🌍 **中文** 支持普通话并使用多种中文数据集进行测试:aidatatang_200zh, magicdata, aishell3, biaobei, MozillaCommonVoice, data_aishell 等
🌍 **中文** 支持普通话并使用多种中文数据集进行测试:aidatatang_200zh, magicdata, aishell3, biaobei,MozillaCommonVoice 等

🤩 **PyTorch** 适用于 pytorch,已在 1.9.0 版本(最新于 2021 年 8 月)中测试,GPU Tesla T4 和 GTX 2060

@@ -18,7 +18,6 @@

🌍 **Webserver Ready** 可伺服你的训练结果,供远程调用

## 开始
### 1. 安装要求
> 按照原始存储库测试您是否已准备好所有环境。
**Python 3.7 或更高版本** 需要运行工具箱。

@@ -35,10 +34,8 @@
#### 2.1 使用数据集自己训练合成器模型(与2.2二选一)
* 下载 数据集并解压:确保您可以访问 *train* 文件夹中的所有音频文件(如.wav)
* 进行音频和梅尔频谱图预处理:
`python pre.py <datasets_root> -d {dataset} -n {number}`
可传入参数:
* -d `{dataset}` 指定数据集,支持 aidatatang_200zh, magicdata, aishell3, data_aishell, 不传默认为aidatatang_200zh
* -n `{number}` 指定并行数,CPU 11770k + 32GB实测10没有问题
`python pre.py <datasets_root>`
可以传入参数 --dataset `{dataset}` 支持 aidatatang_200zh, magicdata, aishell3
> 假如你下载的 `aidatatang_200zh`文件放在D盘,`train`文件路径为 `D:\data\aidatatang_200zh\corpus\train` , 你的`datasets_root`就是 `D:\data\`

* 训练合成器:

@@ -51,7 +48,7 @@

| 作者 | 下载链接 | 效果预览 | 信息 |
| --- | ----------- | ----- | ----- |
| 作者 | https://pan.baidu.com/s/11FrUYBmLrSs_cQ7s3JTlPQ [百度盘链接](https://pan.baidu.com/s/11FrUYBmLrSs_cQ7s3JTlPQ) 提取码:gdn5 | | 25k steps 用3个开源数据集混合训练
| 作者 | https://pan.baidu.com/s/1VHSKIbxXQejtxi2at9IrpA [百度盘链接](https://pan.baidu.com/s/1VHSKIbxXQejtxi2at9IrpA ) 提取码:i183 | | 200k steps 只用aidatatang_200zh
|@FawenYo | https://drive.google.com/file/d/1H-YGOUHpmqKxJ9FRc6vAjPuqQki24UbC/view?usp=sharing [百度盘链接](https://pan.baidu.com/s/1vSYXO4wsLyjnF3Unl-Xoxg) 提取码:1024 | [input](https://github.com/babysor/MockingBird/wiki/audio/self_test.mp3) [output](https://github.com/babysor/MockingBird/wiki/audio/export.wav) | 200k steps 台湾口音
|@miven| https://pan.baidu.com/s/1PI-hM3sn5wbeChRryX-RCQ 提取码:2021 | https://www.bilibili.com/video/BV1uh411B7AD/ | 150k steps 旧版需根据[issue](https://github.com/babysor/MockingBird/issues/37)修复

@@ -122,21 +119,15 @@

| URL | Designation | 标题 | 实现源码 |
| --- | ----------- | ----- | --------------------- |
| [1803.09017](https://arxiv.org/abs/1803.09017) | GlobalStyleToken (synthesizer)| Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis | 本代码库 |
| [2010.05646](https://arxiv.org/abs/2010.05646) | HiFi-GAN (vocoder)| Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis | 本代码库 |
|[**1806.04558**](https://arxiv.org/pdf/1806.04558.pdf) | SV2TTS | Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis | 本代码库 |
|[**1806.04558**](https://arxiv.org/pdf/1806.04558.pdf) | **SV2TTS** | **Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis** | This repo |
|[1802.08435](https://arxiv.org/pdf/1802.08435.pdf) | WaveRNN (vocoder) | Efficient Neural Audio Synthesis | [fatchord/WaveRNN](https://github.com/fatchord/WaveRNN) |
|[1703.10135](https://arxiv.org/pdf/1703.10135.pdf) | Tacotron (synthesizer) | Tacotron: Towards End-to-End Speech Synthesis | [fatchord/WaveRNN](https://github.com/fatchord/WaveRNN)
|[1710.10467](https://arxiv.org/pdf/1710.10467.pdf) | GE2E (encoder)| Generalized End-To-End Loss for Speaker Verification | 本代码库 |

## 常見問題(FQ&A)
#### 1.數據集哪裡下載?
| 数据集 | OpenSLR地址 | 其他源 (Google Drive, Baidu网盘等) |
| --- | ----------- | ---------------|
| aidatatang_200zh | [OpenSLR](http://www.openslr.org/62/) | [Google Drive](https://drive.google.com/file/d/110A11KZoVe7vy6kXlLb6zVPLb_J91I_t/view?usp=sharing) |
| magicdata | [OpenSLR](http://www.openslr.org/68/) | [Google Drive (Dev set)](https://drive.google.com/file/d/1g5bWRUSNH68ycC6eNvtwh07nX3QhOOlo/view?usp=sharing) |
| aishell3 | [OpenSLR](https://www.openslr.org/93/) | [Google Drive](https://drive.google.com/file/d/1shYp_o4Z0X0cZSKQDtFirct2luFUwKzZ/view?usp=sharing) |
| data_aishell | [OpenSLR](https://www.openslr.org/33/) | |
[aidatatang_200zh](http://www.openslr.org/62/)、[magicdata](http://www.openslr.org/68/)、[aishell3](http://www.openslr.org/93/)
> 解壓 aidatatang_200zh 後,還需將 `aidatatang_200zh\corpus\train`下的檔案全選解壓縮

#### 2.`<datasets_root>`是什麼意思?
16  README.md

@@ -6,7 +6,7 @@
> English | [中文](README-CN.md)

## Features
🌍 **Chinese** supported mandarin and tested with multiple datasets: aidatatang_200zh, magicdata, aishell3, data_aishell, and etc.
🌍 **Chinese** supported mandarin and tested with multiple datasets: aidatatang_200zh, magicdata, aishell3, and etc.

🤩 **PyTorch** worked for pytorch, tested in version of 1.9.0(latest in August 2021), with GPU Tesla T4 and GTX 2060

@@ -16,7 +16,7 @@

🌍 **Webserver Ready** to serve your result with remote calling

### [DEMO VIDEO](https://www.bilibili.com/video/BV17Q4y1B7mY/)
### [DEMO VIDEO](https://www.bilibili.com/video/BV1sA411P7wM/)

## Quick Start

@@ -36,7 +36,7 @@ You can either train your models or use existing ones:
* Download dataset and unzip: make sure you can access all .wav in folder
* Preprocess with the audios and the mel spectrograms:
`python pre.py <datasets_root>`
Allowing parameter `--dataset {dataset}` to support aidatatang_200zh, magicdata, aishell3, data_aishell, etc.If this parameter is not passed, the default dataset will be aidatatang_200zh.
Allowing parameter `--dataset {dataset}` to support aidatatang_200zh, magicdata, aishell3, etc.

* Train the synthesizer:
`python synthesizer_train.py mandarin <datasets_root>/SV2TTS/synthesizer`

@@ -49,7 +49,7 @@ Allowing parameter `--dataset {dataset}` to support aidatatang_200zh, magicdata,
| author | Download link | Preview Video | Info |
| --- | ----------- | ----- |----- |
| @myself | https://pan.baidu.com/s/1VHSKIbxXQejtxi2at9IrpA [Baidu](https://pan.baidu.com/s/1VHSKIbxXQejtxi2at9IrpA ) code:i183 | | 200k steps only trained by aidatatang_200zh
|@FawenYo | https://drive.google.com/file/d/1H-YGOUHpmqKxJ9FRc6vAjPuqQki24UbC/view?usp=sharing https://u.teknik.io/AYxWf.pt | [input](https://github.com/babysor/MockingBird/wiki/audio/self_test.mp3) [output](https://github.com/babysor/MockingBird/wiki/audio/export.wav) | 200k steps with local accent of Taiwan
|@FawenYo | https://drive.google.com/file/d/1H-YGOUHpmqKxJ9FRc6vAjPuqQki24UbC/view?usp=sharing [Baidu Pan](https://pan.baidu.com/s/1vSYXO4wsLyjnF3Unl-Xoxg) Code:1024 | [input](https://github.com/babysor/MockingBird/wiki/audio/self_test.mp3) [output](https://github.com/babysor/MockingBird/wiki/audio/export.wav) | 200k steps with local accent of Taiwan
|@miven| https://pan.baidu.com/s/1PI-hM3sn5wbeChRryX-RCQ code:2021 | https://www.bilibili.com/video/BV1uh411B7AD/

#### 2.3 Train vocoder (Optional)

@@ -77,7 +77,6 @@ You can then try the toolbox:

| URL | Designation | Title | Implementation source |
| --- | ----------- | ----- | --------------------- |
| [1803.09017](https://arxiv.org/abs/1803.09017) | GlobalStyleToken (synthesizer)| Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis | This repo |
| [2010.05646](https://arxiv.org/abs/2010.05646) | HiFi-GAN (vocoder)| Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis | This repo |
|[**1806.04558**](https://arxiv.org/pdf/1806.04558.pdf) | **SV2TTS** | **Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis** | This repo |
|[1802.08435](https://arxiv.org/pdf/1802.08435.pdf) | WaveRNN (vocoder) | Efficient Neural Audio Synthesis | [fatchord/WaveRNN](https://github.com/fatchord/WaveRNN) |

@@ -86,12 +85,7 @@ You can then try the toolbox:

## F Q&A
#### 1.Where can I download the dataset?
| Dataset | Original Source | Alternative Sources |
| --- | ----------- | ---------------|
| aidatatang_200zh | [OpenSLR](http://www.openslr.org/62/) | [Google Drive](https://drive.google.com/file/d/110A11KZoVe7vy6kXlLb6zVPLb_J91I_t/view?usp=sharing) |
| magicdata | [OpenSLR](http://www.openslr.org/68/) | [Google Drive (Dev set)](https://drive.google.com/file/d/1g5bWRUSNH68ycC6eNvtwh07nX3QhOOlo/view?usp=sharing) |
| aishell3 | [OpenSLR](https://www.openslr.org/93/) | [Google Drive](https://drive.google.com/file/d/1shYp_o4Z0X0cZSKQDtFirct2luFUwKzZ/view?usp=sharing) |
| data_aishell | [OpenSLR](https://www.openslr.org/33/) | |
[aidatatang_200zh](http://www.openslr.org/62/)、[magicdata](http://www.openslr.org/68/)、[aishell3](http://www.openslr.org/93/)
> After unzip aidatatang_200zh, you need to unzip all the files under `aidatatang_200zh\corpus\train`

#### 2.What is`<datasets_root>`?
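The quick start above hinges on `<datasets_root>` pointing at the folder that contains the extracted datasets. As a hedged illustration (the path is the README's own example, not a requirement), a quick layout check before running `pre.py` might look like this:

```python
# Minimal sanity check for the dataset layout the quick start expects.
# "D:/data" is the README's example location; adjust to your own download.
from pathlib import Path

datasets_root = Path("D:/data")
train_dir = datasets_root / "aidatatang_200zh" / "corpus" / "train"
assert train_dir.is_dir(), (
    "datasets_root should be the folder that contains aidatatang_200zh, "
    "e.g. pass D:/data to pre.py, not the corpus/train folder itself")
```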
@@ -1,4 +1,4 @@
from encoder.preprocess import preprocess_librispeech, preprocess_voxceleb1, preprocess_voxceleb2, preprocess_aidatatang_200zh
from encoder.preprocess import preprocess_librispeech, preprocess_voxceleb1, preprocess_voxceleb2
from utils.argutils import print_args
from pathlib import Path
import argparse

@@ -10,7 +10,17 @@ if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Preprocesses audio files from datasets, encodes them as mel spectrograms and "
                    "writes them to the disk. This will allow you to train the encoder. The "
                    "datasets required are at least one of LibriSpeech, VoxCeleb1, VoxCeleb2, aidatatang_200zh. ",
                    "datasets required are at least one of VoxCeleb1, VoxCeleb2 and LibriSpeech. "
                    "Ideally, you should have all three. You should extract them as they are "
                    "after having downloaded them and put them in a same directory, e.g.:\n"
                    "-[datasets_root]\n"
                    " -LibriSpeech\n"
                    " -train-other-500\n"
                    " -VoxCeleb1\n"
                    " -wav\n"
                    " -vox1_meta.csv\n"
                    " -VoxCeleb2\n"
                    " -dev",
        formatter_class=MyFormatter
    )
    parser.add_argument("datasets_root", type=Path, help=\

@@ -19,7 +29,7 @@ if __name__ == "__main__":
        "Path to the output directory that will contain the mel spectrograms. If left out, "
        "defaults to <datasets_root>/SV2TTS/encoder/")
    parser.add_argument("-d", "--datasets", type=str,
                        default="librispeech_other,voxceleb1,aidatatang_200zh", help=\
                        default="librispeech_other,voxceleb1,voxceleb2", help=\
        "Comma-separated list of the name of the datasets you want to preprocess. Only the train "
        "set of these datasets will be used. Possible names: librispeech_other, voxceleb1, "
        "voxceleb2.")

@@ -53,7 +63,6 @@ if __name__ == "__main__":
        "librispeech_other": preprocess_librispeech,
        "voxceleb1": preprocess_voxceleb1,
        "voxceleb2": preprocess_voxceleb2,
        "aidatatang_200zh": preprocess_aidatatang_200zh,
    }
    args = vars(args)
    for dataset in args.pop("datasets"):

@@ -117,15 +117,6 @@ def _preprocess_speaker_dirs(speaker_dirs, dataset_name, datasets_root, out_dir,
    logger.finalize()
    print("Done preprocessing %s.\n" % dataset_name)

def preprocess_aidatatang_200zh(datasets_root: Path, out_dir: Path, skip_existing=False):
    dataset_name = "aidatatang_200zh"
    dataset_root, logger = _init_preprocess_dataset(dataset_name, datasets_root, out_dir)
    if not dataset_root:
        return
    # Preprocess all speakers
    speaker_dirs = list(dataset_root.joinpath("corpus", "train").glob("*"))
    _preprocess_speaker_dirs(speaker_dirs, dataset_name, datasets_root, out_dir, "wav",
                             skip_existing, logger)

def preprocess_librispeech(datasets_root: Path, out_dir: Path, skip_existing=False):
    for dataset_name in librispeech_datasets["train"]["other"]:
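For orientation, here is a small sketch of what the `preprocess_aidatatang_200zh` branch above expects on disk before a full run is launched. The root path is the README's example and purely illustrative.

```python
# Illustrative only: mirror the globbing done by preprocess_aidatatang_200zh
# above to count speaker folders before starting the full preprocessing run.
from pathlib import Path

datasets_root = Path("D:/data")                      # example root from the README
dataset_root = datasets_root / "aidatatang_200zh"
speaker_dirs = sorted(dataset_root.joinpath("corpus", "train").glob("*"))
print("%d speaker directories found" % len(speaker_dirs))
```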
Binary file not shown.
5  pre.py

@@ -12,8 +12,7 @@ import argparse
recognized_datasets = [
    "aidatatang_200zh",
    "magicdata",
    "aishell3",
    "data_aishell"
    "aishell3"
]

if __name__ == "__main__":

@@ -41,7 +40,7 @@ if __name__ == "__main__":
        "Use this option when dataset does not include alignments\
        (these are used to split long audio files into sub-utterances.)")
    parser.add_argument("-d", "--dataset", type=str, default="aidatatang_200zh", help=\
        "Name of the dataset to process, allowing values: magicdata, aidatatang_200zh, aishell3, data_aishell.")
        "Name of the dataset to process, allowing values: magicdata, aidatatang_200zh, aishell3.")
    parser.add_argument("-e", "--encoder_model_fpath", type=Path, default="encoder/saved_models/pretrained.pt", help=\
        "Path your trained encoder model.")
    parser.add_argument("-ne", "--n_processes_embed", type=int, default=1, help=\
@@ -19,5 +19,4 @@ flask
flask_wtf
flask_cors
gevent==21.8.0
flask_restx
tensorboard
flask_restx
@@ -1,13 +0,0 @@
|
||||
class GSTHyperparameters():
|
||||
E = 512
|
||||
|
||||
# reference encoder
|
||||
ref_enc_filters = [32, 32, 64, 64, 128, 128]
|
||||
|
||||
# style token layer
|
||||
token_num = 10
|
||||
# token_emb_size = 256
|
||||
num_heads = 8
|
||||
|
||||
n_mels = 256 # Number of Mel banks to generate
|
||||
|
||||
@@ -49,23 +49,18 @@ hparams = HParams(
    # frame that has all values < -3.4

    ### Tacotron Training
    tts_schedule = [(2, 1e-3, 10_000, 12),   # Progressive training schedule
                    (2, 5e-4, 15_000, 12),   # (r, lr, step, batch_size)
                    (2, 2e-4, 20_000, 12),   # (r, lr, step, batch_size)
                    (2, 1e-4, 30_000, 12),   #
                    (2, 5e-5, 40_000, 12),   #
                    (2, 1e-5, 60_000, 12),   #
                    (2, 5e-6, 160_000, 12),  # r = reduction factor (# of mel frames
                    (2, 3e-6, 320_000, 12),  # synthesized for each decoder iteration)
                    (2, 1e-6, 640_000, 12)], # lr = learning rate
    tts_schedule = [(2, 1e-3, 20_000, 24),   # Progressive training schedule
                    (2, 5e-4, 40_000, 24),   # (r, lr, step, batch_size)
                    (2, 2e-4, 80_000, 24),   #
                    (2, 1e-4, 160_000, 24),  # r = reduction factor (# of mel frames
                    (2, 3e-5, 320_000, 24),  # synthesized for each decoder iteration)
                    (2, 1e-5, 640_000, 24)], # lr = learning rate

    tts_clip_grad_norm = 1.0,  # clips the gradient norm to prevent explosion - set to None if not needed
    tts_eval_interval = 500,   # Number of steps between model evaluation (sample generation)
                               # Set to -1 to generate after completing epoch, or 0 to disable
    tts_eval_num_samples = 1,  # Makes this number of samples

    ## For finetune usage, if set, only selected layers will be trained, available: encoder,encoder_proj,gst,decoder,postnet,post_proj
    tts_finetune_layers = [],
    tts_eval_num_samples = 1,  # Makes this number of samples

    ### Data Preprocessing
    max_mel_frames = 900,
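To make the schedule semantics concrete, here is a small, self-contained sketch (not the repo's actual training loop) of how a `(r, lr, step, batch_size)` schedule like the one above is typically walked at a given training step:

```python
# Self-contained sketch: pick the schedule entry whose step bound has not yet
# been reached. Values copied from the tts_schedule above.
tts_schedule = [(2, 1e-3,  20_000, 24),
                (2, 5e-4,  40_000, 24),
                (2, 2e-4,  80_000, 24),
                (2, 1e-4, 160_000, 24),
                (2, 3e-5, 320_000, 24),
                (2, 1e-5, 640_000, 24)]

def current_session(step, schedule=tts_schedule):
    """Return (reduction_factor, lr, max_step, batch_size) for this step."""
    for r, lr, max_step, batch_size in schedule:
        if step < max_step:
            return r, lr, max_step, batch_size
    return schedule[-1]

print(current_session(50_000))   # -> (2, 0.0002, 80000, 24)
```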
@@ -70,7 +70,7 @@ class Synthesizer:

    def synthesize_spectrograms(self, texts: List[str],
                                embeddings: Union[np.ndarray, List[np.ndarray]],
                                return_alignments=False, style_idx=0, min_stop_token=5):
                                return_alignments=False):
        """
        Synthesizes mel spectrograms from texts and speaker embeddings.

@@ -125,7 +125,7 @@ class Synthesizer:
        speaker_embeddings = torch.tensor(speaker_embeds).float().to(self.device)

        # Inference
        _, mels, alignments = self._model.generate(chars, speaker_embeddings, style_idx=style_idx, min_stop_token=min_stop_token)
        _, mels, alignments = self._model.generate(chars, speaker_embeddings)
        mels = mels.detach().cpu().numpy()
        for m in mels:
            # Trim silence from end of each spectrogram
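A hedged usage sketch of the extended `synthesize_spectrograms()` signature above. The checkpoint path, the random embedding, and the input text are placeholders; a real run needs a trained synthesizer, an embedding produced by the encoder, and whatever text preprocessing the toolbox applies before calling this method.

```python
# Sketch only: paths and the random embedding are stand-ins, and the output is
# meaningless without a real checkpoint and a real encoder embedding.
from pathlib import Path
import numpy as np
from synthesizer.inference import Synthesizer

synth = Synthesizer(Path("synthesizer/saved_models/mandarin.pt"))   # placeholder checkpoint
embed = np.random.rand(256).astype(np.float32)                      # normally: encoder.embed_utterance(...)
embed /= np.linalg.norm(embed)
specs = synth.synthesize_spectrograms(
    ["欢迎使用语音克隆工具"], [embed],
    style_idx=0,        # which GST token to bias towards; -1 skips the style path
    min_stop_token=5)   # stop-token threshold exposed as "Accuracy" in the toolbox
print(specs[0].shape)   # (n_mels, frames)
```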
@@ -1,135 +0,0 @@
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
import torch.nn.init as init
|
||||
import torch.nn.functional as tFunctional
|
||||
from synthesizer.gst_hyperparameters import GSTHyperparameters as hp
|
||||
|
||||
|
||||
class GlobalStyleToken(nn.Module):
|
||||
|
||||
def __init__(self):
|
||||
|
||||
super().__init__()
|
||||
self.encoder = ReferenceEncoder()
|
||||
self.stl = STL()
|
||||
|
||||
def forward(self, inputs):
|
||||
enc_out = self.encoder(inputs)
|
||||
style_embed = self.stl(enc_out)
|
||||
|
||||
return style_embed
|
||||
|
||||
|
||||
class ReferenceEncoder(nn.Module):
|
||||
'''
|
||||
inputs --- [N, Ty/r, n_mels*r] mels
|
||||
outputs --- [N, ref_enc_gru_size]
|
||||
'''
|
||||
|
||||
def __init__(self):
|
||||
|
||||
super().__init__()
|
||||
K = len(hp.ref_enc_filters)
|
||||
filters = [1] + hp.ref_enc_filters
|
||||
convs = [nn.Conv2d(in_channels=filters[i],
|
||||
out_channels=filters[i + 1],
|
||||
kernel_size=(3, 3),
|
||||
stride=(2, 2),
|
||||
padding=(1, 1)) for i in range(K)]
|
||||
self.convs = nn.ModuleList(convs)
|
||||
self.bns = nn.ModuleList([nn.BatchNorm2d(num_features=hp.ref_enc_filters[i]) for i in range(K)])
|
||||
|
||||
out_channels = self.calculate_channels(hp.n_mels, 3, 2, 1, K)
|
||||
self.gru = nn.GRU(input_size=hp.ref_enc_filters[-1] * out_channels,
|
||||
hidden_size=hp.E // 2,
|
||||
batch_first=True)
|
||||
|
||||
def forward(self, inputs):
|
||||
N = inputs.size(0)
|
||||
out = inputs.view(N, 1, -1, hp.n_mels) # [N, 1, Ty, n_mels]
|
||||
for conv, bn in zip(self.convs, self.bns):
|
||||
out = conv(out)
|
||||
out = bn(out)
|
||||
out = tFunctional.relu(out) # [N, 128, Ty//2^K, n_mels//2^K]
|
||||
|
||||
out = out.transpose(1, 2) # [N, Ty//2^K, 128, n_mels//2^K]
|
||||
T = out.size(1)
|
||||
N = out.size(0)
|
||||
out = out.contiguous().view(N, T, -1) # [N, Ty//2^K, 128*n_mels//2^K]
|
||||
|
||||
self.gru.flatten_parameters()
|
||||
memory, out = self.gru(out) # out --- [1, N, E//2]
|
||||
|
||||
return out.squeeze(0)
|
||||
|
||||
def calculate_channels(self, L, kernel_size, stride, pad, n_convs):
|
||||
for i in range(n_convs):
|
||||
L = (L - kernel_size + 2 * pad) // stride + 1
|
||||
return L
|
||||
|
||||
|
||||
class STL(nn.Module):
|
||||
'''
|
||||
inputs --- [N, E//2]
|
||||
'''
|
||||
|
||||
def __init__(self):
|
||||
|
||||
super().__init__()
|
||||
self.embed = nn.Parameter(torch.FloatTensor(hp.token_num, hp.E // hp.num_heads))
|
||||
d_q = hp.E // 2
|
||||
d_k = hp.E // hp.num_heads
|
||||
# self.attention = MultiHeadAttention(hp.num_heads, d_model, d_q, d_v)
|
||||
self.attention = MultiHeadAttention(query_dim=d_q, key_dim=d_k, num_units=hp.E, num_heads=hp.num_heads)
|
||||
|
||||
init.normal_(self.embed, mean=0, std=0.5)
|
||||
|
||||
def forward(self, inputs):
|
||||
N = inputs.size(0)
|
||||
query = inputs.unsqueeze(1) # [N, 1, E//2]
|
||||
keys = tFunctional.tanh(self.embed).unsqueeze(0).expand(N, -1, -1) # [N, token_num, E // num_heads]
|
||||
style_embed = self.attention(query, keys)
|
||||
|
||||
return style_embed
|
||||
|
||||
|
||||
class MultiHeadAttention(nn.Module):
|
||||
'''
|
||||
input:
|
||||
query --- [N, T_q, query_dim]
|
||||
key --- [N, T_k, key_dim]
|
||||
output:
|
||||
out --- [N, T_q, num_units]
|
||||
'''
|
||||
|
||||
def __init__(self, query_dim, key_dim, num_units, num_heads):
|
||||
|
||||
super().__init__()
|
||||
self.num_units = num_units
|
||||
self.num_heads = num_heads
|
||||
self.key_dim = key_dim
|
||||
|
||||
self.W_query = nn.Linear(in_features=query_dim, out_features=num_units, bias=False)
|
||||
self.W_key = nn.Linear(in_features=key_dim, out_features=num_units, bias=False)
|
||||
self.W_value = nn.Linear(in_features=key_dim, out_features=num_units, bias=False)
|
||||
|
||||
def forward(self, query, key):
|
||||
querys = self.W_query(query) # [N, T_q, num_units]
|
||||
keys = self.W_key(key) # [N, T_k, num_units]
|
||||
values = self.W_value(key)
|
||||
|
||||
split_size = self.num_units // self.num_heads
|
||||
querys = torch.stack(torch.split(querys, split_size, dim=2), dim=0) # [h, N, T_q, num_units/h]
|
||||
keys = torch.stack(torch.split(keys, split_size, dim=2), dim=0) # [h, N, T_k, num_units/h]
|
||||
values = torch.stack(torch.split(values, split_size, dim=2), dim=0) # [h, N, T_k, num_units/h]
|
||||
|
||||
# score = softmax(QK^T / (d_k ** 0.5))
|
||||
scores = torch.matmul(querys, keys.transpose(2, 3)) # [h, N, T_q, T_k]
|
||||
scores = scores / (self.key_dim ** 0.5)
|
||||
scores = tFunctional.softmax(scores, dim=3)
|
||||
|
||||
# out = score * V
|
||||
out = torch.matmul(scores, values) # [h, N, T_q, num_units/h]
|
||||
out = torch.cat(torch.split(out, 1, dim=0), dim=3).squeeze(0) # [N, T_q, num_units]
|
||||
|
||||
return out
|
||||
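As a self-contained companion to the module above, the following sketch reproduces just the STL/MultiHeadAttention shape arithmetic with plain torch. Dimensions follow GSTHyperparameters (E=512, token_num=10, num_heads=8); it is illustrative, not a drop-in replacement.

```python
# Shape walk-through of the style-token attention above, using plain torch.
import torch
import torch.nn.functional as F

N, E, token_num, num_heads = 4, 512, 10, 8
d_k = E // num_heads                                        # 64, per-head width
query = torch.randn(N, 1, E // 2)                           # reference-encoder output [N, 1, 256]
tokens = torch.randn(token_num, d_k)                        # learned style tokens (STL.embed)
keys = torch.tanh(tokens).unsqueeze(0).expand(N, -1, -1)    # [N, 10, 64]

W_q = torch.nn.Linear(E // 2, E, bias=False)
W_k = torch.nn.Linear(d_k, E, bias=False)
W_v = torch.nn.Linear(d_k, E, bias=False)

q = torch.stack(torch.split(W_q(query), d_k, dim=2), dim=0)    # [8, N, 1, 64]
k = torch.stack(torch.split(W_k(keys), d_k, dim=2), dim=0)     # [8, N, 10, 64]
v = torch.stack(torch.split(W_v(keys), d_k, dim=2), dim=0)     # [8, N, 10, 64]

scores = F.softmax(q @ k.transpose(2, 3) / d_k ** 0.5, dim=3)  # [8, N, 1, 10]
style = torch.cat(torch.split(scores @ v, 1, dim=0), dim=3).squeeze(0)
print(style.shape)   # torch.Size([4, 1, 512]) -- one style embedding per utterance
```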
@@ -3,7 +3,8 @@ import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from synthesizer.models.global_style_token import GlobalStyleToken
from pathlib import Path
from typing import Union


class HighwayNetwork(nn.Module):

@@ -337,7 +338,6 @@ class Tacotron(nn.Module):
        self.encoder = Encoder(embed_dims, num_chars, encoder_dims,
                               encoder_K, num_highways, dropout)
        self.encoder_proj = nn.Linear(encoder_dims + speaker_embedding_size, decoder_dims, bias=False)
        self.gst = GlobalStyleToken()
        self.decoder = Decoder(n_mels, encoder_dims, decoder_dims, lstm_dims,
                               dropout, speaker_embedding_size)
        self.postnet = CBHG(postnet_K, n_mels, postnet_dims,

@@ -358,11 +358,11 @@ class Tacotron(nn.Module):
    def r(self, value):
        self.decoder.r = self.decoder.r.new_tensor(value, requires_grad=False)

    def forward(self, texts, mels, speaker_embedding):
    def forward(self, x, m, speaker_embedding):
        device = next(self.parameters()).device  # use same device as parameters

        self.step += 1
        batch_size, _, steps = mels.size()
        batch_size, _, steps = m.size()

        # Initialise all hidden states and pack into tuple
        attn_hidden = torch.zeros(batch_size, self.decoder_dims, device=device)

@@ -383,12 +383,7 @@ class Tacotron(nn.Module):

        # SV2TTS: Run the encoder with the speaker embedding
        # The projection avoids unnecessary matmuls in the decoder loop
        encoder_seq = self.encoder(texts, speaker_embedding)
        # put after encoder
        if self.gst is not None:
            style_embed = self.gst(speaker_embedding)
            style_embed = style_embed.expand_as(encoder_seq)
            encoder_seq = encoder_seq + style_embed
        encoder_seq = self.encoder(x, speaker_embedding)
        encoder_seq_proj = self.encoder_proj(encoder_seq)

        # Need a couple of lists for outputs

@@ -396,10 +391,10 @@ class Tacotron(nn.Module):

        # Run the decoder loop
        for t in range(0, steps, self.r):
            prenet_in = mels[:, :, t - 1] if t > 0 else go_frame
            prenet_in = m[:, :, t - 1] if t > 0 else go_frame
            mel_frames, scores, hidden_states, cell_states, context_vec, stop_tokens = \
                self.decoder(encoder_seq, encoder_seq_proj, prenet_in,
                             hidden_states, cell_states, context_vec, t, texts)
                             hidden_states, cell_states, context_vec, t, x)
            mel_outputs.append(mel_frames)
            attn_scores.append(scores)
            stop_outputs.extend([stop_tokens] * self.r)

@@ -419,7 +414,7 @@ class Tacotron(nn.Module):

        return mel_outputs, linear, attn_scores, stop_outputs

    def generate(self, x, speaker_embedding=None, steps=200, style_idx=0, min_stop_token=5):
    def generate(self, x, speaker_embedding=None, steps=2000):
        self.eval()
        device = next(self.parameters()).device  # use same device as parameters

@@ -445,18 +440,6 @@ class Tacotron(nn.Module):
        # SV2TTS: Run the encoder with the speaker embedding
        # The projection avoids unnecessary matmuls in the decoder loop
        encoder_seq = self.encoder(x, speaker_embedding)

        # put after encoder
        if self.gst is not None and style_idx >= 0 and style_idx < 10:
            gst_embed = self.gst.stl.embed.cpu().data.numpy()  # [0, number_token]
            gst_embed = np.tile(gst_embed, (1, 8))
            scale = np.zeros(512)
            scale[:] = 0.3
            speaker_embedding = (gst_embed[style_idx] * scale).astype(np.float32)
            speaker_embedding = torch.from_numpy(np.tile(speaker_embedding, (x.shape[0], 1))).to(device)
            style_embed = self.gst(speaker_embedding)
            style_embed = style_embed.expand_as(encoder_seq)
            encoder_seq = encoder_seq + style_embed
        encoder_seq_proj = self.encoder_proj(encoder_seq)

        # Need a couple of lists for outputs

@@ -472,7 +455,7 @@ class Tacotron(nn.Module):
            attn_scores.append(scores)
            stop_outputs.extend([stop_tokens] * self.r)
            # Stop the loop when all stop tokens in batch exceed threshold
            if (stop_tokens * 10 > min_stop_token).all() and t > 10: break
            if (stop_tokens > 0.5).all() and t > 10: break

        # Concat the mel outputs into sequence
        mel_outputs = torch.cat(mel_outputs, dim=2)

@@ -496,15 +479,6 @@ class Tacotron(nn.Module):
        for p in self.parameters():
            if p.dim() > 1: nn.init.xavier_uniform_(p)

    def finetune_partial(self, whitelist_layers):
        self.zero_grad()
        for name, child in self.named_children():
            if name in whitelist_layers:
                print("Trainable Layer: %s" % name)
                print("Trainable Parameters: %.3f" % sum([np.prod(p.size()) for p in child.parameters()]))
                for param in child.parameters():
                    param.requires_grad = False

    def get_step(self):
        return self.step.data.item()

@@ -520,7 +494,7 @@ class Tacotron(nn.Module):
        # Use device of model params as location for loaded state
        device = next(self.parameters()).device
        checkpoint = torch.load(str(path), map_location=device)
        self.load_state_dict(checkpoint["model_state"], strict=False)
        self.load_state_dict(checkpoint["model_state"])

        if "optimizer_state" in checkpoint and optimizer is not None:
            optimizer.load_state_dict(checkpoint["optimizer_state"])
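The style_idx branch of `generate()` shown above builds a pseudo speaker embedding out of one learned token. Here is a hedged, numpy-only rendering of just that arithmetic (random values stand in for `gst.stl.embed`):

```python
# Numpy-only rendering of the style_idx branch: tile one 64-wide token across
# the 8 heads to width 512, damp it by 0.3, and repeat it for the batch.
import numpy as np

token_num, head_dim, batch = 10, 64, 2
gst_embed = np.random.randn(token_num, head_dim).astype(np.float32)   # stand-in for gst.stl.embed
style_idx = 3

gst_embed = np.tile(gst_embed, (1, 8))                                 # [10, 512]
scale = np.full(512, 0.3, dtype=np.float32)
speaker_embedding = (gst_embed[style_idx] * scale).astype(np.float32)  # [512]
speaker_embedding = np.tile(speaker_embedding, (batch, 1))             # [batch, 512]
print(speaker_embedding.shape)   # (2, 512)
```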
@@ -7,7 +7,7 @@ from tqdm import tqdm
import numpy as np
from encoder import inference as encoder
from synthesizer.preprocess_speaker import preprocess_speaker_general
from synthesizer.preprocess_transcript import preprocess_transcript_aishell3, preprocess_transcript_magicdata
from synthesizer.preprocess_transcript import preprocess_transcript_aishell3

data_info = {
    "aidatatang_200zh": {
@@ -18,19 +18,13 @@ data_info = {
    "magicdata": {
        "subfolders": ["train"],
        "trans_filepath": "train/TRANS.txt",
        "speak_func": preprocess_speaker_general,
        "transcript_func": preprocess_transcript_magicdata,
        "speak_func": preprocess_speaker_general
    },
    "aishell3":{
        "subfolders": ["train/wav"],
        "trans_filepath": "train/content.txt",
        "speak_func": preprocess_speaker_general,
        "transcript_func": preprocess_transcript_aishell3,
    },
    "data_aishell":{
        "subfolders": ["wav/train"],
        "trans_filepath": "transcript/aishell_transcript_v0.8.txt",
        "speak_func": preprocess_speaker_general
    }
}
@@ -6,13 +6,4 @@ def preprocess_transcript_aishell3(dict_info, dict_transcript):
        transList = []
        for i in range(2, len(v), 2):
            transList.append(v[i])
        dict_info[v[0]] = " ".join(transList)


def preprocess_transcript_magicdata(dict_info, dict_transcript):
    for v in dict_transcript:
        if not v:
            continue
        v = v.strip().replace("\n","").replace("\t"," ").split(" ")
        dict_info[v[0]] = " ".join(v[2:])

        dict_info[v[0]] = " ".join(transList)
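A worked example of the magicdata parser shown above. The function body is copied from the diff; the sample TRANS.txt line is illustrative of the tab-separated utterance / speaker / transcript layout it expects.

```python
# The parser keeps everything from the third field onward as the transcript.
def preprocess_transcript_magicdata(dict_info, dict_transcript):
    for v in dict_transcript:
        if not v:
            continue
        v = v.strip().replace("\n", "").replace("\t", " ").split(" ")
        dict_info[v[0]] = " ".join(v[2:])

info = {}
preprocess_transcript_magicdata(info, ["38_5716_20170913094358.wav\t38_5716\t今天 天气 不错\n"])
print(info)   # {'38_5716_20170913094358.wav': '今天 天气 不错'}
```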
@@ -93,7 +93,7 @@ def train(run_id: str, syn_dir: str, models_dir: str, save_every: int,
                     speaker_embedding_size=hparams.speaker_embedding_size).to(device)

    # Initialize the optimizer
    optimizer = optim.Adam(model.parameters(), amsgrad=True)
    optimizer = optim.Adam(model.parameters())

    # Load the weights
    if force_restart or not weights_fpath.exists():

@@ -146,6 +146,7 @@ def train(run_id: str, syn_dir: str, models_dir: str, save_every: int,
            continue

        model.r = r

        # Begin the training
        simple_table([(f"Steps with r={r}", str(training_steps // 1000) + "k Steps"),
                      ("Batch Size", batch_size),

@@ -154,8 +155,6 @@ def train(run_id: str, syn_dir: str, models_dir: str, save_every: int,

        for p in optimizer.param_groups:
            p["lr"] = lr
        if hparams.tts_finetune_layers is not None and len(hparams.tts_finetune_layers) > 0:
            model.finetune_partial(hparams.tts_finetune_layers)

        data_loader = DataLoader(dataset,
                                 collate_fn=collate_synthesizer,
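The gate above only calls `Tacotron.finetune_partial` when `hparams.tts_finetune_layers` is non-empty. Below is a self-contained toy sketch of whitelist-style freezing, following the intent stated in the hparams comment (only the listed children stay trainable); ToyTacotron is illustrative, not the repo's model, and the real `finetune_partial` lives in synthesizer/models/tacotron.py.

```python
import torch.nn as nn

class ToyTacotron(nn.Module):
    """Toy stand-in whose children are named like the real Tacotron's."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(8, 8)
        self.gst = nn.Linear(8, 8)
        self.decoder = nn.Linear(8, 8)

model = ToyTacotron()
tts_finetune_layers = ["gst", "decoder"]            # example whitelist

for name, child in model.named_children():
    keep_trainable = name in tts_finetune_layers
    for p in child.parameters():
        p.requires_grad = keep_trainable

print([n for n, p in model.named_parameters() if p.requires_grad])
# ['gst.weight', 'gst.bias', 'decoder.weight', 'decoder.bias']
```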
@@ -71,7 +71,6 @@ class Toolbox:

        # Initialize the events and the interface
        self.ui = UI()
        self.style_idx = 0
        self.reset_ui(enc_models_dir, syn_models_dir, voc_models_dir, seed)
        self.setup_events()
        self.ui.start()

@@ -234,8 +233,7 @@ class Toolbox:
        texts = processed_texts
        embed = self.ui.selected_utterance.embed
        embeds = [embed] * len(texts)
        min_token = int(self.ui.token_slider.value())
        specs = self.synthesizer.synthesize_spectrograms(texts, embeds, style_idx=int(self.ui.style_slider.value()), min_stop_token=min_token)
        specs = self.synthesizer.synthesize_spectrograms(texts, embeds)
        breaks = [spec.shape[1] for spec in specs]
        spec = np.concatenate(specs, axis=1)
Binary file not shown.
Before: Size: 5.6 KiB
164  toolbox/ui.py

@@ -2,7 +2,6 @@ import matplotlib.pyplot as plt
from matplotlib.backends.backend_qt5agg import FigureCanvasQTAgg as FigureCanvas
from matplotlib.figure import Figure
from PyQt5.QtCore import Qt, QStringListModel
from PyQt5 import QtGui
from PyQt5.QtWidgets import *
from encoder.inference import plot_embedding_as_heatmap
from toolbox.utterance import Utterance

@@ -421,10 +420,7 @@ class UI(QDialog):
        ## Initialize the application
        self.app = QApplication(sys.argv)
        super().__init__(None)
        self.setWindowTitle("MockingBird GUI")
        self.setWindowIcon(QtGui.QIcon('toolbox\\assets\\mb.png'))
        self.setWindowFlag(Qt.WindowMinimizeButtonHint, True)
        self.setWindowFlag(Qt.WindowMaximizeButtonHint, True)
        self.setWindowTitle("SV2TTS toolbox")

        ## Main layouts

@@ -434,24 +430,21 @@ class UI(QDialog):

        # Browser
        browser_layout = QGridLayout()
        root_layout.addLayout(browser_layout, 0, 0, 1, 8)
        root_layout.addLayout(browser_layout, 0, 0, 1, 2)

        # Generation
        gen_layout = QVBoxLayout()
        root_layout.addLayout(gen_layout, 0, 8)

        # Visualizations
        vis_layout = QVBoxLayout()
        root_layout.addLayout(vis_layout, 1, 0, 2, 8)

        # Output
        output_layout = QGridLayout()
        vis_layout.addLayout(output_layout, 0)

        root_layout.addLayout(gen_layout, 0, 2, 1, 2)

        # Projections
        self.projections_layout = QVBoxLayout()
        root_layout.addLayout(self.projections_layout, 1, 8, 2, 2)
        root_layout.addLayout(self.projections_layout, 1, 0, 1, 1)

        # Visualizations
        vis_layout = QVBoxLayout()
        root_layout.addLayout(vis_layout, 1, 1, 1, 3)

        ## Projections
        # UMap
        fig, self.umap_ax = plt.subplots(figsize=(3, 3), facecolor="#F0F0F0")

@@ -465,88 +458,80 @@ class UI(QDialog):
        ## Browser
        # Dataset, speaker and utterance selection
        i = 0

        source_groupbox = QGroupBox('Source(源音频)')
        source_layout = QGridLayout()
        source_groupbox.setLayout(source_layout)
        browser_layout.addWidget(source_groupbox, i, 0, 1, 4)

        self.dataset_box = QComboBox()
        source_layout.addWidget(QLabel("Dataset(数据集):"), i, 0)
        source_layout.addWidget(self.dataset_box, i, 1)
        self.random_dataset_button = QPushButton("Random")
        source_layout.addWidget(self.random_dataset_button, i, 2)
        i += 1
        browser_layout.addWidget(QLabel("<b>Dataset</b>"), i, 0)
        browser_layout.addWidget(self.dataset_box, i + 1, 0)
        self.speaker_box = QComboBox()
        source_layout.addWidget(QLabel("Speaker(说话者)"), i, 0)
        source_layout.addWidget(self.speaker_box, i, 1)
        self.random_speaker_button = QPushButton("Random")
        source_layout.addWidget(self.random_speaker_button, i, 2)
        i += 1
        browser_layout.addWidget(QLabel("<b>Speaker</b>"), i, 1)
        browser_layout.addWidget(self.speaker_box, i + 1, 1)
        self.utterance_box = QComboBox()
        source_layout.addWidget(QLabel("Utterance(音频):"), i, 0)
        source_layout.addWidget(self.utterance_box, i, 1)
        browser_layout.addWidget(QLabel("<b>Utterance</b>"), i, 2)
        browser_layout.addWidget(self.utterance_box, i + 1, 2)
        self.browser_load_button = QPushButton("Load")
        browser_layout.addWidget(self.browser_load_button, i + 1, 3)
        i += 2

        # Random buttons
        self.random_dataset_button = QPushButton("Random")
        browser_layout.addWidget(self.random_dataset_button, i, 0)
        self.random_speaker_button = QPushButton("Random")
        browser_layout.addWidget(self.random_speaker_button, i, 1)
        self.random_utterance_button = QPushButton("Random")
        source_layout.addWidget(self.random_utterance_button, i, 2)

        i += 1
        source_layout.addWidget(QLabel("<b>Use(使用):</b>"), i, 0)
        self.browser_load_button = QPushButton("Load Above(加载上面)")
        source_layout.addWidget(self.browser_load_button, i, 1, 1, 2)
        browser_layout.addWidget(self.random_utterance_button, i, 2)
        self.auto_next_checkbox = QCheckBox("Auto select next")
        self.auto_next_checkbox.setChecked(True)
        source_layout.addWidget(self.auto_next_checkbox, i+1, 1)
        self.browser_browse_button = QPushButton("Browse(打开本地)")
        source_layout.addWidget(self.browser_browse_button, i, 3)
        self.record_button = QPushButton("Record(录音)")
        source_layout.addWidget(self.record_button, i+1, 3)

        i += 2
        # Utterance box
        browser_layout.addWidget(QLabel("<b>Current(当前):</b>"), i, 0)
        self.utterance_history = QComboBox()
        browser_layout.addWidget(self.utterance_history, i, 1)
        self.play_button = QPushButton("Play(播放)")
        browser_layout.addWidget(self.play_button, i, 2)
        self.stop_button = QPushButton("Stop(暂停)")
        browser_layout.addWidget(self.stop_button, i, 3)

        browser_layout.addWidget(self.auto_next_checkbox, i, 3)
        i += 1
        model_groupbox = QGroupBox('Models(模型选择)')
        model_layout = QHBoxLayout()
        model_groupbox.setLayout(model_layout)
        browser_layout.addWidget(model_groupbox, i, 0, 1, 4)

        # Utterance box
        browser_layout.addWidget(QLabel("<b>Use embedding from:</b>"), i, 0)
        self.utterance_history = QComboBox()
        browser_layout.addWidget(self.utterance_history, i, 1, 1, 3)
        i += 1

        # Random & next utterance buttons
        self.browser_browse_button = QPushButton("Browse")
        browser_layout.addWidget(self.browser_browse_button, i, 0)
        self.record_button = QPushButton("Record")
        browser_layout.addWidget(self.record_button, i, 1)
        self.play_button = QPushButton("Play")
        browser_layout.addWidget(self.play_button, i, 2)
        self.stop_button = QPushButton("Stop")
        browser_layout.addWidget(self.stop_button, i, 3)
        i += 1

        # Model and audio output selection
        self.encoder_box = QComboBox()
        model_layout.addWidget(QLabel("Encoder:"))
        model_layout.addWidget(self.encoder_box)
        browser_layout.addWidget(QLabel("<b>Encoder</b>"), i, 0)
        browser_layout.addWidget(self.encoder_box, i + 1, 0)
        self.synthesizer_box = QComboBox()
        model_layout.addWidget(QLabel("Synthesizer:"))
        model_layout.addWidget(self.synthesizer_box)
        browser_layout.addWidget(QLabel("<b>Synthesizer</b>"), i, 1)
        browser_layout.addWidget(self.synthesizer_box, i + 1, 1)
        self.vocoder_box = QComboBox()
        model_layout.addWidget(QLabel("Vocoder:"))
        model_layout.addWidget(self.vocoder_box)
        browser_layout.addWidget(QLabel("<b>Vocoder</b>"), i, 2)
        browser_layout.addWidget(self.vocoder_box, i + 1, 2)

        self.audio_out_devices_cb=QComboBox()
        browser_layout.addWidget(QLabel("<b>Audio Output</b>"), i, 3)
        browser_layout.addWidget(self.audio_out_devices_cb, i + 1, 3)
        i += 2

        #Replay & Save Audio
        i = 0
        output_layout.addWidget(QLabel("<b>Toolbox Output:</b>"), i, 0)
        browser_layout.addWidget(QLabel("<b>Toolbox Output:</b>"), i, 0)
        self.waves_cb = QComboBox()
        self.waves_cb_model = QStringListModel()
        self.waves_cb.setModel(self.waves_cb_model)
        self.waves_cb.setToolTip("Select one of the last generated waves in this section for replaying or exporting")
        output_layout.addWidget(self.waves_cb, i, 1)
        browser_layout.addWidget(self.waves_cb, i, 1)
        self.replay_wav_button = QPushButton("Replay")
        self.replay_wav_button.setToolTip("Replay last generated vocoder")
        output_layout.addWidget(self.replay_wav_button, i, 2)
        browser_layout.addWidget(self.replay_wav_button, i, 2)
        self.export_wav_button = QPushButton("Export")
        self.export_wav_button.setToolTip("Save last generated vocoder audio in filesystem as a wav file")
        output_layout.addWidget(self.export_wav_button, i, 3)
        self.audio_out_devices_cb=QComboBox()
        browser_layout.addWidget(self.export_wav_button, i, 3)
        i += 1
        output_layout.addWidget(QLabel("<b>Audio Output</b>"), i, 0)
        output_layout.addWidget(self.audio_out_devices_cb, i, 1)

        ## Embed & spectrograms
        vis_layout.addStretch()

@@ -567,6 +552,7 @@ class UI(QDialog):
        for side in ["top", "right", "bottom", "left"]:
            ax.spines[side].set_visible(False)

        ## Generation
        self.text_prompt = QPlainTextEdit(default_text)
        gen_layout.addWidget(self.text_prompt, stretch=1)

@@ -592,32 +578,6 @@ class UI(QDialog):
        self.trim_silences_checkbox.setToolTip("When checked, trims excess silence in vocoder output."
            " This feature requires `webrtcvad` to be installed.")
        layout_seed.addWidget(self.trim_silences_checkbox, 0, 2, 1, 2)
        self.style_slider = QSlider(Qt.Horizontal)
        self.style_slider.setTickInterval(1)
        self.style_slider.setFocusPolicy(Qt.NoFocus)
        self.style_slider.setSingleStep(1)
        self.style_slider.setRange(-1, 9)
        self.style_value_label = QLabel("-1")
        self.style_slider.setValue(-1)
        layout_seed.addWidget(QLabel("Style:"), 1, 0)

        self.style_slider.valueChanged.connect(lambda s: self.style_value_label.setNum(s))
        layout_seed.addWidget(self.style_value_label, 1, 1)
        layout_seed.addWidget(self.style_slider, 1, 3)

        self.token_slider = QSlider(Qt.Horizontal)
        self.token_slider.setTickInterval(1)
        self.token_slider.setFocusPolicy(Qt.NoFocus)
        self.token_slider.setSingleStep(1)
        self.token_slider.setRange(3, 9)
        self.token_value_label = QLabel("5")
        self.token_slider.setValue(4)
        layout_seed.addWidget(QLabel("Accuracy(精度):"), 2, 0)

        self.token_slider.valueChanged.connect(lambda s: self.token_value_label.setNum(s))
        layout_seed.addWidget(self.token_value_label, 2, 1)
        layout_seed.addWidget(self.token_slider, 2, 3)

        gen_layout.addLayout(layout_seed)

        self.loading_bar = QProgressBar()

@@ -631,7 +591,7 @@ class UI(QDialog):

        ## Set the size of the window and of the elements
        max_size = QDesktopWidget().availableGeometry(self).size() * 0.5
        max_size = QDesktopWidget().availableGeometry(self).size() * 0.8
        self.resize(max_size)

        ## Finalize the display
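For reference, a minimal, self-contained PyQt5 sketch of the slider-plus-label pattern the Style and Accuracy controls above use (range and default copied from the diff; everything else in the toolbox window is omitted):

```python
import sys
from PyQt5.QtCore import Qt
from PyQt5.QtWidgets import QApplication, QWidget, QGridLayout, QLabel, QSlider

app = QApplication(sys.argv)
window = QWidget()
layout = QGridLayout(window)

style_slider = QSlider(Qt.Horizontal)
style_slider.setRange(-1, 9)            # -1 disables GST style selection
style_slider.setSingleStep(1)
style_slider.setValue(-1)
style_label = QLabel("-1")
style_slider.valueChanged.connect(style_label.setNum)   # keep the label in sync

layout.addWidget(QLabel("Style:"), 0, 0)
layout.addWidget(style_label, 0, 1)
layout.addWidget(style_slider, 0, 2)

window.show()
sys.exit(app.exec_())
```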
@@ -11,6 +11,7 @@ def check_model_paths(encoder_path: Path, synthesizer_path: Path, vocoder_path:

    # If none of the paths exist, remind the user to download models if needed
    print("********************************************************************************")
    print("Error: Model files not found. Please download the models")
    print("Error: Model files not found. Follow these instructions to get and install the models:")
    print("https://github.com/CorentinJ/Real-Time-Voice-Cloning/wiki/Pretrained-models")
    print("********************************************************************************\n")
    quit(-1)
@@ -9,7 +9,6 @@ from vocoder.wavernn import inference as rnn_vocoder
import numpy as np
import re
from scipy.io.wavfile import write
import librosa
import io
import base64
from flask_cors import CORS

@@ -31,7 +30,6 @@ def webApp():
    synthesizers = list(Path(syn_models_dirt).glob("**/*.pt"))
    synthesizers_cache = {}
    encoder.load_model(Path("encoder/saved_models/pretrained.pt"))
    # rnn_vocoder.load_model(Path("vocoder/saved_models/pretrained/pretrained.pt"))
    gan_vocoder.load_model(Path("vocoder/saved_models/pretrained/g_hifigan.pt"))

    def pcm2float(sig, dtype='float32'):

@@ -68,6 +66,7 @@ def webApp():
    @app.route("/api/synthesize", methods=["POST"])
    def synthesize():
        # TODO Implementation with json to support more platform

        # Load synthesizer
        if "synt_path" in request.form:
            synt_path = request.form["synt_path"]

@@ -81,16 +80,10 @@ def webApp():
            current_synt = synthesizers_cache[synt_path]
        print("using synthesizer model: " + str(synt_path))
        # Load input wav
        if "upfile_b64" in request.form:
            wav_base64 = request.form["upfile_b64"]
            wav = base64.b64decode(bytes(wav_base64, 'utf-8'))
            wav = pcm2float(np.frombuffer(wav, dtype=np.int16), dtype=np.float32)
            sample_rate = Synthesizer.sample_rate
        else:
            wav, sample_rate, = librosa.load(request.files['file'])
            write("temp.wav", sample_rate, wav) #Make sure we get the correct wav

        encoder_wav = encoder.preprocess_wav(wav, sample_rate)
        wav_base64 = request.form["upfile_b64"]
        wav = base64.b64decode(bytes(wav_base64, 'utf-8'))
        wav = pcm2float(np.frombuffer(wav, dtype=np.int16), dtype=np.float32)
        encoder_wav = encoder.preprocess_wav(wav, 16000)
        embed, _, _ = encoder.embed_utterance(encoder_wav, return_partials=True)

        # Load input text

@@ -107,7 +100,6 @@ def webApp():
        embeds = [embed] * len(texts)
        specs = current_synt.synthesize_spectrograms(texts, embeds)
        spec = np.concatenate(specs, axis=1)
        # wav = rnn_vocoder.infer_waveform(spec)
        wav = gan_vocoder.infer_waveform(spec)

        # Return cooked wav
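A hedged client-side sketch for the /api/synthesize route above: it posts 16 kHz 16-bit PCM as base64 plus the text, the same payload shape the page's recorder builds. The host and port follow the config's PORT = 8080; the wav path is a placeholder, the response handling assumes the route streams wav bytes back, and flask_wtf CSRF protection may reject requests made outside the page unless a token is supplied.

```python
# Sketch only: assumes the server runs locally on port 8080 and that a valid
# X-CSRFToken is supplied (or CSRF is disabled while testing).
import base64
import numpy as np
import librosa
import requests

wav, _ = librosa.load("reference.wav", sr=16000)            # placeholder recording
pcm16 = (np.clip(wav, -1.0, 1.0) * 32767).astype(np.int16)  # float [-1, 1] -> int16 PCM

payload = {
    "text": "欢迎使用语音克隆工具",
    "upfile_b64": base64.b64encode(pcm16.tobytes()).decode("utf-8"),
    # "synt_path": "synthesizer/saved_models/xxx.pt",       # optional model override (placeholder path)
}
resp = requests.post("http://localhost:8080/api/synthesize", data=payload)
open("cloned.wav", "wb").write(resp.content)                # assumes the route returns wav bytes
```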
@@ -5,4 +5,3 @@ PORT = 8080
MAX_CONTENT_PATH =1024 * 1024 * 4 # mp3文件大小限定不能超过4M
SECRET_KEY = "mockingbird_key"
WTF_CSRF_SECRET_KEY = "mockingbird_key"
TEMPLATES_AUTO_RELOAD = True
@@ -38,37 +38,22 @@
</div>

<div style="margin-left: 5%;margin-top: 50px;width: 90%;">
<div style="font-size: larger;font-weight: bolder;">1. 请输入中文</div>
<div style="font-size: larger;font-weight: bolder;">请输入中文</div>
<textarea id="user_input_text"
style="border:1px solid #ccc; width: 100%; height: 100px; font-size: 15px; margin-top: 10px;"></textarea>
</div>
<div class="pd btns" style="margin-left: 5%;margin-top: 20px;width: 90%; ">

<div class="pd btns" style="margin-left: 5%;margin-top: 20px;width: 90%; text-align:right;">
<!-- <div>
<button onclick="recOpen()" style="margin-right:10px">打开录音,请求权限</button>
<button onclick="recClose()" style="margin-right:0">关闭录音,释放资源</button>
</div> -->
<div style="font-size: larger;font-weight: bolder;">2. 请直接录音,点击停止结束</div>
<button onclick="recStart()" >录制</button>
<button onclick="recStop()">停止</button>
<button onclick="recPlay()" >播放</button>
<button onclick="recUpload()" >上传</button>
</div>
<div class="pd btns" style="margin-left: 5%;margin-top: 20px;width: 90%; ">
<div style="font-size: larger;font-weight: bolder;">或上传音频</div>
<input type="file" id="fileInput" accept=".wav" />
<label for="fileInput">选择音频</label>
<div id="audio1"></div>
</div>
<div class="pd btns" style="margin-left: 5%;margin-top: 20px;width: 90%; ">
<div style="font-size: larger;font-weight: bolder;">3. 选择Synthesizer模型</div>
<span class="box">
<select id="select">
</select>
</span>
</div>
<div class="pd btns" style="margin-left: 5%;margin-top: 20px;width: 90%; text-align:right;">
<button id="upload" onclick="recUpload()">上传合成</button>
</div>

<!-- 波形绘制区域 -->
<!-- <div class="pd recpower">
<div style="height:40px;width:100%;background:#fff;position:relative;">

@@ -91,37 +76,6 @@

<script>

$("#fileInput").change(function(){
var file = $("#fileInput").get(0).files;
if (file.length > 0) {
var path = URL.createObjectURL(file[0]);
var audio = document.createElement('audio');
audio.src = path;
audio.controls = true;
$('#audio1').empty().append(audio);
}
});

fetch("/api/synthesizers", {
method: 'get',
headers: {
"X-CSRFToken": "{{ csrf_token() }}"
}
}).then(function (res) {
if (!res.ok) throw Error(res.statusText);
return res.json();
}).then(function (data) {
for (var synt of data) {
var option = document.createElement('option');
option.text = synt.name
option.value = synt.path
$("#select").append(option);
}
}).catch(function (err) {
console.log('Error: ' + err.message);
})

var rec, wave, recBlob;
/**调用open打开录音请求好录音权限**/
var recOpen = function () {//一般在显示出录音按钮或相关的录音界面时进行此方法调用,后面用户点击开始录音时就能畅通无阻了

@@ -240,15 +194,9 @@

/**上传**/
function recUpload() {
var blob
var loadedAudios = $("#fileInput").get(0).files
if (loadedAudios.length > 0) {
blob = loadedAudios[0];
} else {
blob = recBlob;
}
var blob = recBlob;
if (!blob) {
reclog("请先录音或选择音频,然后停止后再上传", 1);
reclog("请先录音,然后停止后再上传", 1);
return;
};

@@ -263,18 +211,15 @@
var csrftoken = "{{ csrf_token() }}";
var user_input_text = document.getElementById("user_input_text");
var input_text = user_input_text.value;
var postData = new FormData();
postData.append("text", input_text)
postData.append("file", blob)
var sel = document.getElementById("select");
var path = sel.options[sel.selectedIndex].value;
if (!!path) {
postData.append("synt_path", path);
}
var postData = "";
postData += "mime=" + encodeURIComponent(blob.type);//告诉后端,这个录音是什么格式的,可能前后端都固定的mp3可以不用写
postData += "&upfile_b64=" + encodeURIComponent((/.+;\s*base64\s*,\s*(.+)$/i.exec(reader.result) || [])[1]) //录音文件内容,后端进行base64解码成二进制
postData += "&text=" + encodeURIComponent(input_text);

fetch(api, {
method: 'post',
headers: {
"Content-type": "application/x-www-form-urlencoded; charset=UTF-8",
"X-CSRFToken": csrftoken
},
body: postData

@@ -393,6 +338,7 @@
padding: 12px;
border-radius: 6px;
background: #fff;
--border: 1px solid #327de8;
box-shadow: 2px 2px 3px #aaa;
}

@@ -402,7 +348,7 @@
cursor: pointer;
border: none;
border-radius: 3px;
background: #5698c3;
background: #327de8;
color: #fff;
padding: 0 15px;
margin: 3px 10px 3px 0;

@@ -413,13 +359,6 @@
vertical-align: middle;
}

.btns #upload {
background: #5698c3;
color: #fff;
width: 100px;
height: 42px;
}

.btns button:active {
background: #5da1f5
}

@@ -440,68 +379,6 @@
padding: 2px 8px;
border-radius: 99px;
}

#fileInput {
width: 0.1px;
height: 0.1px;
opacity: 0;
overflow: hidden;
position: absolute;
z-index: -1;
}
#fileInput + label {
padding: 0 15px;
border-radius: 4px;
color: white;
background-color: #5698c3;
display: inline-block;
width: 70px;
line-height: 36px;
height: 36px;
}
#fileInput + label {
cursor: pointer; /* "hand" cursor */
}
#fileInput:focus + label,
#fileInput + label:hover {
background-color: #5da1f5;
}

.box select {
background-color: #5698c3;
color: white;
padding: 8px;
width: 120px;
border: none;
border-radius: 4px;
font-size: 0.5em;
outline: none;
margin: 3px 10px 3px 0;
}

.box::before {
content: "\f13a";
position: absolute;
top: 0;
right: 0;
width: 20%;
height: 100%;
text-align: center;
font-size: 28px;
line-height: 45px;
color: rgba(255, 255, 255, 0.5);
background-color: rgba(255, 255, 255, 0.1);
pointer-events: none;
}

.box:hover::before {
color: rgba(255, 255, 255, 0.6);
background-color: rgba(255, 255, 255, 0.2);
}

.box select option {
padding: 30px;
}
</style>

</body>