Mirror of https://github.com/babysor/Realtime-Voice-Clone-Chinese.git (synced 2026-02-07 12:34:06 +08:00)
Compare commits

13 Commits

| SHA1 |
|---|
| a191587417 |
| d3ba597be9 |
| 6134c94b4d |
| c04a1097bf |
| 9b4f8cc6c9 |
| 96993a5c61 |
| 70cc3988d3 |
| c5998bfe71 |
| c997dbdf66 |
| 47cc597ad0 |
| 8c895ed2c6 |
| 2e57bf3f11 |
| 11a5e2a141 |
.vscode/launch.json (vendored) — 10 changed lines

```diff
@@ -60,6 +60,14 @@
         "args": ["-c", ".\\ppg2mel\\saved_models\\seq2seq_mol_ppg2mel_vctk_libri_oneshotvc_r4_normMel_v2.yaml",
             "-m", ".\\ppg2mel\\saved_models\\best_loss_step_304000.pth", "--wav_dir", ".\\wavs\\input", "--ref_wav_path", ".\\wavs\\pkq.mp3", "-o", ".\\wavs\\output\\"
         ]
-    }
+    },
+    {
+        "name": "GUI",
+        "type": "python",
+        "request": "launch",
+        "program": "mkgui\\base\\_cli.py",
+        "console": "integratedTerminal",
+        "args": []
+    },
 ]
 }
```
README-CN.md — 18 changed lines

```diff
@@ -77,7 +77,7 @@
 The vocoder has little impact on quality, and three are already bundled; if you want to train your own, see the commands below.
 * Preprocess the data:
 `python vocoder_preprocess.py <datasets_root> -m <synthesizer_model_path>`
-> Replace `<datasets_root>` with your dataset directory and `<synthesizer_model_path>` with your best synthesizer model directory, e.g. *sythensizer\saved_models\xxx*
+> Replace `<datasets_root>` with your dataset directory and `<synthesizer_model_path>` with your best synthesizer model directory, e.g. *sythensizer\saved_mode\xxx*
 
 
 * Train the WaveRNN vocoder:
@@ -87,10 +87,7 @@
 * Train the HiFi-GAN vocoder:
 `python vocoder_train.py <trainid> <datasets_root> hifigan`
 > Replace `<trainid>` with an identifier of your choice; training again under the same identifier resumes the existing model
-* Train the Fre-GAN vocoder:
-`python vocoder_train.py <trainid> <datasets_root> --config config.json fregan`
-> Replace `<trainid>` with an identifier of your choice; training again under the same identifier resumes the existing model
-* To switch GAN vocoder training to multi-GPU mode: edit the "num_gpus" parameter in the .json file under the GAN folder
 ### 3. Launch the program or toolbox
 You can try the following command:
 
@@ -108,12 +105,12 @@
 ### 4. Bonus: Voice Conversion (PPG based)
 Want to pick up Conan's voice changer and speak with Mouri Kogoro's voice? Based on PPG-VC, this project adds two extra modules (PPG extractor + PPG2Mel) to support voice conversion. (The documentation, especially the training part, is incomplete and being filled in.)
 #### 4.0 Prepare the environment
-* Make sure the environment above is set up, then run `pip install espnet` to install the remaining required packages.
+* Make sure the environment above is set up, then run `pip install -r requirements_vc.txt` to install the remaining required packages.
 * Download the following models. Link: https://pan.baidu.com/s/1bl_x_DHJSAUyN2fma-Q_Wg
 Extraction code: gh41
-  * The vocoder dedicated to the 24K sample rate (hifigan), into *vocoder\saved_models\xxx*
+  * The vocoder dedicated to the 24K sample rate (hifigan), into *vocoder\saved_mode\xxx*
-  * The pretrained PPG feature encoder (ppg_extractor), into *ppg_extractor\saved_models\xxx*
+  * The pretrained PPG feature encoder (ppg_extractor), into *ppg_extractor\saved_mode\xxx*
-  * The pretrained PPG2Mel, into *ppg2mel\saved_models\xxx*
+  * The pretrained PPG2Mel, into *ppg2mel\saved_mode\xxx*
 
 #### 4.1 Train your own PPG2Mel model with a dataset (optional)
 
@@ -131,7 +128,7 @@
 
 #### 4.2 Launch the toolbox in VC mode
 You can try the following command:
-`python demo_toolbox.py -vc -d <datasets_root>`
+`python demo_toolbox.py vc -d <datasets_root>`
 > Please point this at an available dataset directory; supported datasets will be loaded automatically for debugging, and it also serves as the storage directory for manually recorded audio.
 <img width="971" alt="微信图片_20220305005351" src="https://user-images.githubusercontent.com/7423248/156805733-2b093dbc-d989-4e68-8609-db11f365886a.png">
 
@@ -142,7 +139,6 @@
 | --- | ----------- | ----- | --------------------- |
 | [1803.09017](https://arxiv.org/abs/1803.09017) | GlobalStyleToken (synthesizer)| Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis | This repo |
 | [2010.05646](https://arxiv.org/abs/2010.05646) | HiFi-GAN (vocoder)| Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis | This repo |
-| [2106.02297](https://arxiv.org/abs/2106.02297) | Fre-GAN (vocoder)| Fre-GAN: Adversarial Frequency-consistent Audio Synthesis | This repo |
 |[**1806.04558**](https://arxiv.org/pdf/1806.04558.pdf) | SV2TTS | Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis | This repo |
 |[1802.08435](https://arxiv.org/pdf/1802.08435.pdf) | WaveRNN (vocoder) | Efficient Neural Audio Synthesis | [fatchord/WaveRNN](https://github.com/fatchord/WaveRNN) |
 |[1703.10135](https://arxiv.org/pdf/1703.10135.pdf) | Tacotron (synthesizer) | Tacotron: Towards End-to-End Speech Synthesis | [fatchord/WaveRNN](https://github.com/fatchord/WaveRNN)
```
README.md — 10 changed lines

```diff
@@ -37,7 +37,7 @@
 * Install [ffmpeg](https://ffmpeg.org/download.html#get-packages).
 * Run `pip install -r requirements.txt` to install the remaining necessary packages.
 * Install webrtcvad `pip install webrtcvad-wheels`(If you need)
-> Note that we are using the pretrained encoder/vocoder but synthesizer since the original model is incompatible with the Chinese symbols. It means the demo_cli is not working at this moment.
+> Note that we are using the pretrained encoder/vocoder but synthesizer, since the original model is incompatible with the Chinese sympols. It means the demo_cli is not working at this moment.
 ### 2. Prepare your models
 You can either train your models or use existing ones:
 
@@ -68,7 +68,7 @@ Allowing parameter `--dataset {dataset}` to support aidatatang_200zh, magicdata,
 | @author | https://pan.baidu.com/s/1iONvRxmkI-t1nHqxKytY3g [Baidu](https://pan.baidu.com/s/1iONvRxmkI-t1nHqxKytY3g) 4j5d | | 75k steps trained by multiple datasets
 | @author | https://pan.baidu.com/s/1fMh9IlgKJlL2PIiRTYDUvw [Baidu](https://pan.baidu.com/s/1fMh9IlgKJlL2PIiRTYDUvw) code:om7f | | 25k steps trained by multiple datasets, only works under version 0.0.1
 |@FawenYo | https://drive.google.com/file/d/1H-YGOUHpmqKxJ9FRc6vAjPuqQki24UbC/view?usp=sharing https://u.teknik.io/AYxWf.pt | [input](https://github.com/babysor/MockingBird/wiki/audio/self_test.mp3) [output](https://github.com/babysor/MockingBird/wiki/audio/export.wav) | 200k steps with local accent of Taiwan, only works under version 0.0.1
-|@miven| https://pan.baidu.com/s/1PI-hM3sn5wbeChRryX-RCQ code: 2021 https://www.aliyundrive.com/s/AwPsbo8mcSP code: z2m0 | https://www.bilibili.com/video/BV1uh411B7AD/ | only works under version 0.0.1
+|@miven| https://pan.baidu.com/s/1PI-hM3sn5wbeChRryX-RCQ code:2021 | https://www.bilibili.com/video/BV1uh411B7AD/ | only works under version 0.0.1
 
 #### 2.4 Train vocoder (Optional)
 > note: vocoder has little difference in effect, so you may not need to train a new one.
@@ -90,11 +90,6 @@ You can then try to run:`python web.py` and open it in browser, default as `http
 You can then try the toolbox:
 `python demo_toolbox.py -d <datasets_root>`
 
-#### 3.3 Using the command line
-You can then try the command:
-`python gen_voice.py <text_file.txt> your_wav_file.wav`
-you may need to install cn2an by "pip install cn2an" for better digital number result.
-
 ## Reference
 > This repository is forked from [Real-Time-Voice-Cloning](https://github.com/CorentinJ/Real-Time-Voice-Cloning) which only support English.
 
@@ -102,7 +97,6 @@ you may need to install cn2an by "pip install cn2an" for better digital number r
 | --- | ----------- | ----- | --------------------- |
 | [1803.09017](https://arxiv.org/abs/1803.09017) | GlobalStyleToken (synthesizer)| Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis | This repo |
 | [2010.05646](https://arxiv.org/abs/2010.05646) | HiFi-GAN (vocoder)| Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis | This repo |
-| [2106.02297](https://arxiv.org/abs/2106.02297) | Fre-GAN (vocoder)| Fre-GAN: Adversarial Frequency-consistent Audio Synthesis | This repo |
 |[**1806.04558**](https://arxiv.org/pdf/1806.04558.pdf) | **SV2TTS** | **Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis** | This repo |
 |[1802.08435](https://arxiv.org/pdf/1802.08435.pdf) | WaveRNN (vocoder) | Efficient Neural Audio Synthesis | [fatchord/WaveRNN](https://github.com/fatchord/WaveRNN) |
 |[1703.10135](https://arxiv.org/pdf/1703.10135.pdf) | Tacotron (synthesizer) | Tacotron: Towards End-to-End Speech Synthesis | [fatchord/WaveRNN](https://github.com/fatchord/WaveRNN)
```
(file name not preserved in the mirror)

```diff
@@ -56,8 +56,8 @@ def wav_to_mel_spectrogram(wav):
     Note: this not a log-mel spectrogram.
     """
     frames = librosa.feature.melspectrogram(
-        y=wav,
-        sr=sampling_rate,
+        wav,
+        sampling_rate,
         n_fft=int(sampling_rate * mel_window_length / 1000),
         hop_length=int(sampling_rate * mel_window_step / 1000),
         n_mels=mel_n_channels
```
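Two things are worth noting about this hunk. First, the keyword form (`y=`, `sr=`) is required on librosa 0.10+, where these arguments became keyword-only, so the side that drops the keywords only runs on older librosa releases. Second, the `n_fft`/`hop_length` expressions convert window sizes from milliseconds to samples. A standalone sketch of that conversion, with assumed hyperparameter values (the real `sampling_rate`, `mel_window_length`, etc. live in the module's params file):

```python
# Assumed values for illustration only; the real hyperparameters are
# defined elsewhere in the repository.
sampling_rate = 16000     # Hz
mel_window_length = 25    # ms
mel_window_step = 10      # ms

# Milliseconds -> samples: samples = rate * ms / 1000
n_fft = int(sampling_rate * mel_window_length / 1000)
hop_length = int(sampling_rate * mel_window_step / 1000)
print(n_fft, hop_length)  # 400 160
```

At 16 kHz, a 25 ms analysis window is 400 samples and a 10 ms hop is 160 samples, which is why these two expressions appear unchanged on both sides of the diff.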
gen_voice.py — 128 deletions (`@@ -1,128 +0,0 @@`); the file was removed entirely. Its content:

```python
from encoder.params_model import model_embedding_size as speaker_embedding_size
from utils.argutils import print_args
from utils.modelutils import check_model_paths
from synthesizer.inference import Synthesizer
from encoder import inference as encoder
from vocoder.wavernn import inference as rnn_vocoder
from vocoder.hifigan import inference as gan_vocoder
from pathlib import Path
import numpy as np
import soundfile as sf
import librosa
import argparse
import torch
import sys
import os
import re
import cn2an
import glob

from audioread.exceptions import NoBackendError
vocoder = gan_vocoder

def gen_one_wav(synthesizer, in_fpath, embed, texts, file_name, seq):
    embeds = [embed] * len(texts)
    # If you know what the attention layer alignments are, you can retrieve them here by
    # passing return_alignments=True
    specs = synthesizer.synthesize_spectrograms(texts, embeds, style_idx=-1, min_stop_token=4, steps=400)
    #spec = specs[0]
    breaks = [spec.shape[1] for spec in specs]
    spec = np.concatenate(specs, axis=1)

    # If seed is specified, reset torch seed and reload vocoder
    # Synthesizing the waveform is fairly straightforward. Remember that the longer the
    # spectrogram, the more time-efficient the vocoder.
    generated_wav, output_sample_rate = vocoder.infer_waveform(spec)

    # Add breaks
    b_ends = np.cumsum(np.array(breaks) * synthesizer.hparams.hop_size)
    b_starts = np.concatenate(([0], b_ends[:-1]))
    wavs = [generated_wav[start:end] for start, end, in zip(b_starts, b_ends)]
    breaks = [np.zeros(int(0.15 * synthesizer.sample_rate))] * len(breaks)
    generated_wav = np.concatenate([i for w, b in zip(wavs, breaks) for i in (w, b)])

    ## Post-generation
    # There's a bug with sounddevice that makes the audio cut one second earlier, so we
    # pad it.

    # Trim excess silences to compensate for gaps in spectrograms (issue #53)
    generated_wav = encoder.preprocess_wav(generated_wav)
    generated_wav = generated_wav / np.abs(generated_wav).max() * 0.97

    # Save it on the disk
    model = os.path.basename(in_fpath)
    filename = "%s_%d_%s.wav" % (file_name, seq, model)
    sf.write(filename, generated_wav, synthesizer.sample_rate)

    print("\nSaved output as %s\n\n" % filename)


def generate_wav(enc_model_fpath, syn_model_fpath, voc_model_fpath, in_fpath, input_txt, file_name):
    if torch.cuda.is_available():
        device_id = torch.cuda.current_device()
        gpu_properties = torch.cuda.get_device_properties(device_id)
        ## Print some environment information (for debugging purposes)
        print("Found %d GPUs available. Using GPU %d (%s) of compute capability %d.%d with "
              "%.1fGb total memory.\n" %
              (torch.cuda.device_count(),
               device_id,
               gpu_properties.name,
               gpu_properties.major,
               gpu_properties.minor,
               gpu_properties.total_memory / 1e9))
    else:
        print("Using CPU for inference.\n")

    print("Preparing the encoder, the synthesizer and the vocoder...")
    encoder.load_model(enc_model_fpath)
    synthesizer = Synthesizer(syn_model_fpath)
    vocoder.load_model(voc_model_fpath)

    encoder_wav = synthesizer.load_preprocess_wav(in_fpath)
    embed, partial_embeds, _ = encoder.embed_utterance(encoder_wav, return_partials=True)

    texts = input_txt.split("\n")
    seq = 0
    each_num = 1500

    punctuation = '!,。、,'  # punctuate and split/clean text
    processed_texts = []
    cur_num = 0
    for text in texts:
        for processed_text in re.sub(r'[{}]+'.format(punctuation), '\n', text).split('\n'):
            if processed_text:
                processed_texts.append(processed_text.strip())
                cur_num += len(processed_text.strip())
        if cur_num > each_num:
            seq = seq + 1
            gen_one_wav(synthesizer, in_fpath, embed, processed_texts, file_name, seq)
            processed_texts = []
            cur_num = 0

    if len(processed_texts) > 0:
        seq = seq + 1
        gen_one_wav(synthesizer, in_fpath, embed, processed_texts, file_name, seq)

if (len(sys.argv) >= 3):
    my_txt = ""
    print("reading from :", sys.argv[1])
    with open(sys.argv[1], "r") as f:
        for line in f.readlines():
            #line = line.strip('\n')
            my_txt += line
    txt_file_name = sys.argv[1]
    wav_file_name = sys.argv[2]

    output = cn2an.transform(my_txt, "an2cn")
    print(output)
    generate_wav(
        Path("encoder/saved_models/pretrained.pt"),
        Path("synthesizer/saved_models/mandarin.pt"),
        Path("vocoder/saved_models/pretrained/g_hifigan.pt"), wav_file_name, output, txt_file_name
    )

else:
    print("please input the file name")
    exit(1)
```
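The most reusable part of the deleted script is its chunking step: it splits the input on Chinese and ASCII punctuation, then batches roughly 1500 characters per synthesis call. A standalone sketch of just the splitting logic (the sample sentence is made up):

```python
import re

# Same separator set gen_voice.py used: fullwidth !, comma, 。, 、 and ASCII comma
punctuation = '!,。、,'
text = "你好,世界!今天天气、不错"

# Replace any run of separators with a newline, split, drop empty pieces
pieces = [p.strip()
          for p in re.sub(r'[{}]+'.format(punctuation), '\n', text).split('\n')
          if p]
print(pieces)  # ['你好', '世界', '今天天气', '不错']
```

Each piece then becomes one entry in the `texts` list handed to `synthesize_spectrograms`, so punctuation boundaries end up as the 0.15 s silence breaks inserted between concatenated waveform segments.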
mkgui/app.py — 20 changed lines

```diff
@@ -1,3 +1,4 @@
+from asyncio.windows_events import NULL
 from pydantic import BaseModel, Field
 import os
 from pathlib import Path
@@ -10,18 +11,16 @@ import numpy as np
 from mkgui.base.components.types import FileContent
 from vocoder.hifigan import inference as gan_vocoder
 from synthesizer.inference import Synthesizer
-from typing import Any, Tuple
+from typing import Any
 import matplotlib.pyplot as plt
 
 # Constants
-AUDIO_SAMPLES_DIR = f"samples{os.sep}"
-SYN_MODELS_DIRT = f"synthesizer{os.sep}saved_models"
-ENC_MODELS_DIRT = f"encoder{os.sep}saved_models"
-VOC_MODELS_DIRT = f"vocoder{os.sep}saved_models"
-TEMP_SOURCE_AUDIO = f"wavs{os.sep}temp_source.wav"
-TEMP_RESULT_AUDIO = f"wavs{os.sep}temp_result.wav"
-if not os.path.isdir("wavs"):
-    os.makedirs("wavs")
+AUDIO_SAMPLES_DIR = 'samples\\'
+SYN_MODELS_DIRT = "synthesizer\\saved_models"
+ENC_MODELS_DIRT = "encoder\\saved_models"
+VOC_MODELS_DIRT = "vocoder\\saved_models"
+TEMP_SOURCE_AUDIO = "wavs/temp_source.wav"
+TEMP_RESULT_AUDIO = "wavs/temp_result.wav"
 
 # Load local sample audio as options TODO: load dataset
 if os.path.isdir(AUDIO_SAMPLES_DIR):
@@ -46,7 +45,6 @@ else:
     raise Exception(f"Model folder {VOC_MODELS_DIRT} doesn't exist.")
 
 
-
 class Input(BaseModel):
     message: str = Field(
         ..., example="欢迎使用工具箱, 现已支持中文输入!", alias="文本内容"
@@ -75,7 +73,7 @@ class AudioEntity(BaseModel):
     mel: Any
 
 class Output(BaseModel):
-    __root__: Tuple[AudioEntity, AudioEntity]
+    __root__: tuple[AudioEntity, AudioEntity]
 
     def render_output_ui(self, streamlit_app, input) -> None:  # type: ignore
         """Custom output UI.
```
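The recurring `Tuple` / `tuple` swap in these hunks is a Python-version concern rather than a style one: builtin generics such as `tuple[AudioEntity, AudioEntity]` in annotations only work on Python 3.9+ (PEP 585), while `typing.Tuple` also runs on older interpreters. A minimal illustration (the function name is made up):

```python
from typing import Tuple

# typing.Tuple works on any supported Python version; writing the
# annotation as tuple[str, int] instead would raise a TypeError at class
# or function definition time on interpreters older than 3.9.
def pair() -> Tuple[str, int]:
    return ("ok", 1)

print(pair())  # ('ok', 1)
```

For a pydantic model like `Output`, the `__root__` annotation is evaluated when the class body executes, so the `tuple[...]` form fails immediately on pre-3.9 interpreters rather than at call time.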
(file name not preserved in the mirror)

```diff
@@ -1,3 +1,4 @@
+from asyncio.windows_events import NULL
 from synthesizer.inference import Synthesizer
 from pydantic import BaseModel, Field
 from encoder import inference as speacker_encoder
@@ -13,18 +14,18 @@ import re
 import numpy as np
 from mkgui.base.components.types import FileContent
 from vocoder.hifigan import inference as gan_vocoder
-from typing import Any, Tuple
+from typing import Any
 import matplotlib.pyplot as plt
 
 
 # Constants
-AUDIO_SAMPLES_DIR = f'sample{os.sep}'
-EXT_MODELS_DIRT = f'ppg_extractor{os.sep}saved_models'
-CONV_MODELS_DIRT = f'ppg2mel{os.sep}saved_models'
-VOC_MODELS_DIRT = f'vocoder{os.sep}saved_models'
-TEMP_SOURCE_AUDIO = f'wavs{os.sep}temp_source.wav'
-TEMP_TARGET_AUDIO = f'wavs{os.sep}temp_target.wav'
-TEMP_RESULT_AUDIO = f'wavs{os.sep}temp_result.wav'
+AUDIO_SAMPLES_DIR = 'samples\\'
+EXT_MODELS_DIRT = "ppg_extractor\\saved_models"
+CONV_MODELS_DIRT = "ppg2mel\\saved_models"
+VOC_MODELS_DIRT = "vocoder\\saved_models"
+TEMP_SOURCE_AUDIO = "wavs/temp_source.wav"
+TEMP_TARGET_AUDIO = "wavs/temp_target.wav"
+TEMP_RESULT_AUDIO = "wavs/temp_result.wav"
 
 # Load local sample audio as options TODO: load dataset
 if os.path.isdir(AUDIO_SAMPLES_DIR):
@@ -70,7 +71,7 @@ class Input(BaseModel):
         description="选择语音转换模型文件."
     )
     vocoder: vocoders = Field(
-        ..., alias="语音解码模型",
+        ..., alias="语音编码模型",
         description="选择语音解码模型文件(目前只支持HifiGan类型)."
     )
 
@@ -79,7 +80,7 @@ class AudioEntity(BaseModel):
     mel: Any
 
 class Output(BaseModel):
-    __root__: Tuple[AudioEntity, AudioEntity, AudioEntity]
+    __root__: tuple[AudioEntity, AudioEntity, AudioEntity]
 
     def render_output_ui(self, streamlit_app, input) -> None:  # type: ignore
         """Custom output UI.
@@ -134,7 +135,7 @@ def convert(input: Input) -> Output:
     # Import necessary dependency of Voice Conversion
     from utils.f0_utils import compute_f0, f02lf0, compute_mean_std, get_converted_lf0uv
     ref_lf0_mean, ref_lf0_std = compute_mean_std(f02lf0(compute_f0(ref_wav)))
-    speacker_encoder.load_model(Path("encoder{os.sep}saved_models{os.sep}pretrained_bak_5805000.pt"))
+    speacker_encoder.load_model(Path("encoder/saved_models/pretrained_bak_5805000.pt"))
     embed = speacker_encoder.embed_utterance(ref_wav)
     lf0_uv = get_converted_lf0uv(src_wav, ref_lf0_mean, ref_lf0_std, convert=True)
     min_len = min(ppg.shape[1], len(lf0_uv))
```
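The `Path("encoder{os.sep}...")` line in the last hunk above is a plain string, not an f-string, so `{os.sep}` is never substituted and the literal braces end up in the path; the plain forward-slash literal on the other side avoids the problem entirely (`pathlib` handles `/` on every platform). A quick demonstration of the difference:

```python
import os

broken = "encoder{os.sep}saved_models"   # plain string: the placeholder survives literally
fixed = f"encoder{os.sep}saved_models"   # f-string: the separator is substituted
print(broken)  # encoder{os.sep}saved_models
print(fixed)
```

Loading a model from the `broken` path would fail with a file-not-found error, since no directory literally named `encoder{os.sep}saved_models` exists.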
(file name not preserved in the mirror)

```diff
@@ -815,9 +815,6 @@ def getOpyrator(mode: str) -> Opyrator:
     if mode == None or mode.startswith('模型训练'):
         from mkgui.train import train
         return Opyrator(train)
-    if mode == None or mode.startswith('模型训练(VC)'):
-        from mkgui.train_vc import train_vc
-        return Opyrator(train_vc)
     from mkgui.app import synthesize
     return Opyrator(synthesize)
 
@@ -832,7 +829,7 @@ def render_streamlit_ui() -> None:
     with st.spinner("Loading MockingBird GUI. Please wait..."):
         session_state.mode = st.sidebar.selectbox(
             '模式选择',
-            ( "AI拟音", "VC拟音", "预处理", "模型训练", "模型训练(VC)")
+            ( "AI拟音", "VC拟音", "预处理", "模型训练")
         )
         if "mode" in session_state:
             mode = session_state.mode
```
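Removing the `模型训练(VC)` branch is effectively dead-code cleanup: any string starting with `模型训练(VC)` also starts with `模型训练`, so the first `startswith` check always won and the VC branch could never be reached. A minimal sketch of the dispatch pattern (the handler bodies are stand-ins, and `mode is None` is the idiomatic spelling of the original's `mode == None`):

```python
def train():
    return "train"

def synthesize():
    return "synthesize"

def get_handler(mode):
    # This check also matches '模型训练(VC)', which is why a separate
    # VC branch placed after it was unreachable.
    if mode is None or mode.startswith('模型训练'):
        return train
    return synthesize

print(get_handler('模型训练')())      # train
print(get_handler('模型训练(VC)')())  # train
print(get_handler('AI拟音')())        # synthesize
```

Dropping the option from the sidebar tuple keeps the UI consistent with the dispatch: there is no longer a selectable mode whose handler cannot be reached.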
(file name not preserved in the mirror)

```diff
@@ -2,12 +2,12 @@ from pydantic import BaseModel, Field
 import os
 from pathlib import Path
 from enum import Enum
-from typing import Any, Tuple
+from typing import Any
 
 
 # Constants
-EXT_MODELS_DIRT = f"ppg_extractor{os.sep}saved_models"
-ENC_MODELS_DIRT = f"encoder{os.sep}saved_models"
+EXT_MODELS_DIRT = "ppg_extractor\\saved_models"
+ENC_MODELS_DIRT = "encoder\\saved_models"
 
 
 if os.path.isdir(EXT_MODELS_DIRT):
@@ -70,7 +70,7 @@ class AudioEntity(BaseModel):
     mel: Any
 
 class Output(BaseModel):
-    __root__: Tuple[str, int]
+    __root__: tuple[str, int]
 
     def render_output_ui(self, streamlit_app, input) -> None:  # type: ignore
         """Custom output UI.
```
150
mkgui/train.py
150
mkgui/train.py
@@ -3,54 +3,65 @@ import os
|
|||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
from enum import Enum
|
from enum import Enum
|
||||||
from typing import Any
|
from typing import Any
|
||||||
from synthesizer.hparams import hparams
|
import numpy as np
|
||||||
from synthesizer.train import train as synt_train
|
from utils.load_yaml import HpsYaml
|
||||||
|
from utils.util import AttrDict
|
||||||
|
import torch
|
||||||
|
|
||||||
|
# TODO: seperator for *unix systems
|
||||||
# Constants
|
# Constants
|
||||||
SYN_MODELS_DIRT = f"synthesizer{os.sep}saved_models"
|
EXT_MODELS_DIRT = "ppg_extractor\\saved_models"
|
||||||
ENC_MODELS_DIRT = f"encoder{os.sep}saved_models"
|
CONV_MODELS_DIRT = "ppg2mel\\saved_models"
|
||||||
|
ENC_MODELS_DIRT = "encoder\\saved_models"
|
||||||
|
|
||||||
|
|
||||||
# EXT_MODELS_DIRT = f"ppg_extractor{os.sep}saved_models"
|
if os.path.isdir(EXT_MODELS_DIRT):
|
||||||
# CONV_MODELS_DIRT = f"ppg2mel{os.sep}saved_models"
|
extractors = Enum('extractors', list((file.name, file) for file in Path(EXT_MODELS_DIRT).glob("**/*.pt")))
|
||||||
# ENC_MODELS_DIRT = f"encoder{os.sep}saved_models"
|
print("Loaded extractor models: " + str(len(extractors)))
|
||||||
|
|
||||||
# Pre-Load models
|
|
||||||
if os.path.isdir(SYN_MODELS_DIRT):
|
|
||||||
synthesizers = Enum('synthesizers', list((file.name, file) for file in Path(SYN_MODELS_DIRT).glob("**/*.pt")))
|
|
||||||
print("Loaded synthesizer models: " + str(len(synthesizers)))
|
|
||||||
else:
|
else:
|
||||||
raise Exception(f"Model folder {SYN_MODELS_DIRT} doesn't exist.")
|
raise Exception(f"Model folder {EXT_MODELS_DIRT} doesn't exist.")
|
||||||
|
|
||||||
|
if os.path.isdir(CONV_MODELS_DIRT):
|
||||||
|
convertors = Enum('convertors', list((file.name, file) for file in Path(CONV_MODELS_DIRT).glob("**/*.pth")))
|
||||||
|
print("Loaded convertor models: " + str(len(convertors)))
|
||||||
|
else:
|
||||||
|
raise Exception(f"Model folder {CONV_MODELS_DIRT} doesn't exist.")
|
||||||
|
|
||||||
 if os.path.isdir(ENC_MODELS_DIRT):
     encoders = Enum('encoders', list((file.name, file) for file in Path(ENC_MODELS_DIRT).glob("**/*.pt")))
     print("Loaded encoders models: " + str(len(encoders)))
 else:
     raise Exception(f"Model folder {ENC_MODELS_DIRT} doesn't exist.")
 
 class Model(str, Enum):
-    DEFAULT = "default"
+    VC_PPG2MEL = "ppg2mel"
 
+class Dataset(str, Enum):
+    AIDATATANG_200ZH = "aidatatang_200zh"
+    AIDATATANG_200ZH_S = "aidatatang_200zh_s"
+
 class Input(BaseModel):
+    # def render_input_ui(st, input) -> Dict:
+    #     input["selected_dataset"] = st.selectbox(
+    #         '选择数据集',
+    #         ("aidatatang_200zh", "aidatatang_200zh_s")
+    #     )
+    #     return input
     model: Model = Field(
-        Model.DEFAULT, title="模型类型",
+        Model.VC_PPG2MEL, title="模型类型",
     )
     # datasets_root: str = Field(
     #     ..., alias="预处理数据根目录", description="输入目录(相对/绝对),不适用于ppg2mel模型",
     #     format=True,
     #     example="..\\trainning_data\\"
     # )
-    input_root: str = Field(
-        ..., alias="输入目录", description="预处理数据根目录",
+    output_root: str = Field(
+        ..., alias="输出目录(可选)", description="建议不填,保持默认",
         format=True,
-        example=f"..{os.sep}audiodata{os.sep}SV2TTS{os.sep}synthesizer"
+        example=""
     )
-    run_id: str = Field(
-        "", alias="新模型名/运行ID", description="使用新ID进行重新训练,否则选择下面的模型进行继续训练",
-    )
-    synthesizer: synthesizers = Field(
-        ..., alias="已有合成模型",
-        description="选择语音合成模型文件."
+    continue_mode: bool = Field(
+        True, alias="继续训练模式", description="选择“是”,则从下面选择的模型中继续训练",
     )
     gpu: bool = Field(
         True, alias="GPU训练", description="选择“是”,则使用GPU训练",
@@ -58,18 +69,32 @@ class Input(BaseModel):
     verbose: bool = Field(
         True, alias="打印详情", description="选择“是”,输出更多详情",
     )
+    # TODO: Move to hiden fields by default
+    convertor: convertors = Field(
+        ..., alias="转换模型",
+        description="选择语音转换模型文件."
+    )
+    extractor: extractors = Field(
+        ..., alias="特征提取模型",
+        description="选择PPG特征提取模型文件."
+    )
     encoder: encoders = Field(
         ..., alias="语音编码模型",
         description="选择语音编码模型文件."
     )
-    save_every: int = Field(
-        1000, alias="更新间隔", description="每隔n步则更新一次模型",
+    njobs: int = Field(
+        8, alias="进程数", description="适用于ppg2mel",
     )
-    backup_every: int = Field(
-        10000, alias="保存间隔", description="每隔n步则保存一次模型",
+    seed: int = Field(
+        default=0, alias="初始随机数", description="适用于ppg2mel",
     )
-    log_every: int = Field(
-        500, alias="打印间隔", description="每隔n步则打印一次训练统计",
+    model_name: str = Field(
+        ..., alias="新模型名", description="仅在重新训练时生效,选中继续训练时无效",
+        example="test"
+    )
+    model_config: str = Field(
+        ..., alias="新模型配置", description="仅在重新训练时生效,选中继续训练时无效",
+        example=".\\ppg2mel\\saved_models\\seq2seq_mol_ppg2mel_vctk_libri_oneshotvc_r4_normMel_v2"
     )
 
 class AudioEntity(BaseModel):
@@ -77,30 +102,55 @@ class AudioEntity(BaseModel):
     mel: Any
 
 class Output(BaseModel):
-    __root__: int
+    __root__: tuple[str, int]
 
-    def render_output_ui(self, streamlit_app) -> None:  # type: ignore
+    def render_output_ui(self, streamlit_app, input) -> None:  # type: ignore
         """Custom output UI.
         If this method is implmeneted, it will be used instead of the default Output UI renderer.
         """
-        streamlit_app.subheader(f"Training started with code: {self.__root__}")
+        sr, count = self.__root__
+        streamlit_app.subheader(f"Dataset {sr} done processed total of {count}")
 
 def train(input: Input) -> Output:
     """Train(训练)"""
 
-    print(">>> Start training ...")
-    force_restart = len(input.run_id) > 0
-    if not force_restart:
-        input.run_id = Path(input.synthesizer.value).name.split('.')[0]
-
-    synt_train(
-        input.run_id,
-        input.input_root,
-        f"synthesizer{os.sep}saved_models",
-        input.save_every,
-        input.backup_every,
-        input.log_every,
-        force_restart,
-        hparams
-    )
-    return Output(__root__=0)
+    print(">>> OneShot VC training ...")
+    params = AttrDict()
+    params.update({
+        "gpu": input.gpu,
+        "cpu": not input.gpu,
+        "njobs": input.njobs,
+        "seed": input.seed,
+        "verbose": input.verbose,
+        "load": input.convertor.value,
+        "warm_start": False,
+    })
+    if input.continue_mode:
+        # trace old model and config
+        p = Path(input.convertor.value)
+        params.name = p.parent.name
+        # search a config file
+        model_config_fpaths = list(p.parent.rglob("*.yaml"))
+        if len(model_config_fpaths) == 0:
+            raise "No model yaml config found for convertor"
+        config = HpsYaml(model_config_fpaths[0])
+        params.ckpdir = p.parent.parent
+        params.config = model_config_fpaths[0]
+        params.logdir = os.path.join(p.parent, "log")
+    else:
+        # Make the config dict dot visitable
+        config = HpsYaml(input.config)
+    np.random.seed(input.seed)
+    torch.manual_seed(input.seed)
+    if torch.cuda.is_available():
+        torch.cuda.manual_seed_all(input.seed)
+    mode = "train"
+    from ppg2mel.train.train_linglf02mel_seq2seq_oneshotvc import Solver
+    solver = Solver(config, params, mode)
+    solver.load_data()
+    solver.set_model()
+    solver.exec()
+    print(">>> Oneshot VC train finished!")
+
+    # TODO: pass useful return code
+    return Output(__root__=(input.dataset, 0))
@@ -1,155 +0,0 @@
-from pydantic import BaseModel, Field
-import os
-from pathlib import Path
-from enum import Enum
-from typing import Any, Tuple
-import numpy as np
-from utils.load_yaml import HpsYaml
-from utils.util import AttrDict
-import torch
-
-# Constants
-EXT_MODELS_DIRT = f"ppg_extractor{os.sep}saved_models"
-CONV_MODELS_DIRT = f"ppg2mel{os.sep}saved_models"
-ENC_MODELS_DIRT = f"encoder{os.sep}saved_models"
-
-
-if os.path.isdir(EXT_MODELS_DIRT):
-    extractors = Enum('extractors', list((file.name, file) for file in Path(EXT_MODELS_DIRT).glob("**/*.pt")))
-    print("Loaded extractor models: " + str(len(extractors)))
-else:
-    raise Exception(f"Model folder {EXT_MODELS_DIRT} doesn't exist.")
-
-if os.path.isdir(CONV_MODELS_DIRT):
-    convertors = Enum('convertors', list((file.name, file) for file in Path(CONV_MODELS_DIRT).glob("**/*.pth")))
-    print("Loaded convertor models: " + str(len(convertors)))
-else:
-    raise Exception(f"Model folder {CONV_MODELS_DIRT} doesn't exist.")
-
-if os.path.isdir(ENC_MODELS_DIRT):
-    encoders = Enum('encoders', list((file.name, file) for file in Path(ENC_MODELS_DIRT).glob("**/*.pt")))
-    print("Loaded encoders models: " + str(len(encoders)))
-else:
-    raise Exception(f"Model folder {ENC_MODELS_DIRT} doesn't exist.")
-
-class Model(str, Enum):
-    VC_PPG2MEL = "ppg2mel"
-
-class Dataset(str, Enum):
-    AIDATATANG_200ZH = "aidatatang_200zh"
-    AIDATATANG_200ZH_S = "aidatatang_200zh_s"
-
-class Input(BaseModel):
-    # def render_input_ui(st, input) -> Dict:
-    #     input["selected_dataset"] = st.selectbox(
-    #         '选择数据集',
-    #         ("aidatatang_200zh", "aidatatang_200zh_s")
-    #     )
-    #     return input
-    model: Model = Field(
-        Model.VC_PPG2MEL, title="模型类型",
-    )
-    # datasets_root: str = Field(
-    #     ..., alias="预处理数据根目录", description="输入目录(相对/绝对),不适用于ppg2mel模型",
-    #     format=True,
-    #     example="..\\trainning_data\\"
-    # )
-    output_root: str = Field(
-        ..., alias="输出目录(可选)", description="建议不填,保持默认",
-        format=True,
-        example=""
-    )
-    continue_mode: bool = Field(
-        True, alias="继续训练模式", description="选择“是”,则从下面选择的模型中继续训练",
-    )
-    gpu: bool = Field(
-        True, alias="GPU训练", description="选择“是”,则使用GPU训练",
-    )
-    verbose: bool = Field(
-        True, alias="打印详情", description="选择“是”,输出更多详情",
-    )
-    # TODO: Move to hiden fields by default
-    convertor: convertors = Field(
-        ..., alias="转换模型",
-        description="选择语音转换模型文件."
-    )
-    extractor: extractors = Field(
-        ..., alias="特征提取模型",
-        description="选择PPG特征提取模型文件."
-    )
-    encoder: encoders = Field(
-        ..., alias="语音编码模型",
-        description="选择语音编码模型文件."
-    )
-    njobs: int = Field(
-        8, alias="进程数", description="适用于ppg2mel",
-    )
-    seed: int = Field(
-        default=0, alias="初始随机数", description="适用于ppg2mel",
-    )
-    model_name: str = Field(
-        ..., alias="新模型名", description="仅在重新训练时生效,选中继续训练时无效",
-        example="test"
-    )
-    model_config: str = Field(
-        ..., alias="新模型配置", description="仅在重新训练时生效,选中继续训练时无效",
-        example=".\\ppg2mel\\saved_models\\seq2seq_mol_ppg2mel_vctk_libri_oneshotvc_r4_normMel_v2"
-    )
-
-class AudioEntity(BaseModel):
-    content: bytes
-    mel: Any
-
-class Output(BaseModel):
-    __root__: Tuple[str, int]
-
-    def render_output_ui(self, streamlit_app, input) -> None:  # type: ignore
-        """Custom output UI.
-        If this method is implmeneted, it will be used instead of the default Output UI renderer.
-        """
-        sr, count = self.__root__
-        streamlit_app.subheader(f"Dataset {sr} done processed total of {count}")
-
-def train_vc(input: Input) -> Output:
-    """Train VC(训练 VC)"""
-
-    print(">>> OneShot VC training ...")
-    params = AttrDict()
-    params.update({
-        "gpu": input.gpu,
-        "cpu": not input.gpu,
-        "njobs": input.njobs,
-        "seed": input.seed,
-        "verbose": input.verbose,
-        "load": input.convertor.value,
-        "warm_start": False,
-    })
-    if input.continue_mode:
-        # trace old model and config
-        p = Path(input.convertor.value)
-        params.name = p.parent.name
-        # search a config file
-        model_config_fpaths = list(p.parent.rglob("*.yaml"))
-        if len(model_config_fpaths) == 0:
-            raise "No model yaml config found for convertor"
-        config = HpsYaml(model_config_fpaths[0])
-        params.ckpdir = p.parent.parent
-        params.config = model_config_fpaths[0]
-        params.logdir = os.path.join(p.parent, "log")
-    else:
-        # Make the config dict dot visitable
-        config = HpsYaml(input.config)
-    np.random.seed(input.seed)
-    torch.manual_seed(input.seed)
-    if torch.cuda.is_available():
-        torch.cuda.manual_seed_all(input.seed)
-    mode = "train"
-    from ppg2mel.train.train_linglf02mel_seq2seq_oneshotvc import Solver
-    solver = Solver(config, params, mode)
-    solver.load_data()
-    solver.set_model()
-    solver.exec()
-    print(">>> Oneshot VC train finished!")
-
-    # TODO: pass useful return code
-    return Output(__root__=(input.dataset, 0))
@@ -24,5 +24,4 @@ tensorboard
 streamlit==1.8.0
 PyYAML==5.4.1
 torch_complex
 espnet
-PyWavelets
@@ -1,73 +0,0 @@
-import torch
-import torch.nn as nn
-import imp
-import numpy as np
-
-class Base(nn.Module):
-    def __init__(self, stop_threshold):
-        super().__init__()
-
-        self.init_model()
-        self.num_params()
-
-        self.register_buffer("step", torch.zeros(1, dtype=torch.long))
-        self.register_buffer("stop_threshold", torch.tensor(stop_threshold, dtype=torch.float32))
-
-    @property
-    def r(self):
-        return self.decoder.r.item()
-
-    @r.setter
-    def r(self, value):
-        self.decoder.r = self.decoder.r.new_tensor(value, requires_grad=False)
-
-    def init_model(self):
-        for p in self.parameters():
-            if p.dim() > 1: nn.init.xavier_uniform_(p)
-
-    def finetune_partial(self, whitelist_layers):
-        self.zero_grad()
-        for name, child in self.named_children():
-            if name in whitelist_layers:
-                print("Trainable Layer: %s" % name)
-                print("Trainable Parameters: %.3f" % sum([np.prod(p.size()) for p in child.parameters()]))
-                for param in child.parameters():
-                    param.requires_grad = False
-
-    def get_step(self):
-        return self.step.data.item()
-
-    def reset_step(self):
-        # assignment to parameters or buffers is overloaded, updates internal dict entry
-        self.step = self.step.data.new_tensor(1)
-
-    def log(self, path, msg):
-        with open(path, "a") as f:
-            print(msg, file=f)
-
-    def load(self, path, device, optimizer=None):
-        # Use device of model params as location for loaded state
-        checkpoint = torch.load(str(path), map_location=device)
-        self.load_state_dict(checkpoint["model_state"], strict=False)
-
-        if "optimizer_state" in checkpoint and optimizer is not None:
-            optimizer.load_state_dict(checkpoint["optimizer_state"])
-
-    def save(self, path, optimizer=None):
-        if optimizer is not None:
-            torch.save({
-                "model_state": self.state_dict(),
-                "optimizer_state": optimizer.state_dict(),
-            }, str(path))
-        else:
-            torch.save({
-                "model_state": self.state_dict(),
-            }, str(path))
-
-    def num_params(self, print_out=True):
-        parameters = filter(lambda p: p.requires_grad, self.parameters())
-        parameters = sum([np.prod(p.size()) for p in parameters]) / 1_000_000
-        if print_out:
-            print("Trainable Parameters: %.3fM" % parameters)
-        return parameters
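The deleted `Base.save`/`Base.load` above exchange a plain dict keyed by `"model_state"` and, optionally, `"optimizer_state"`. A torch-free sketch of that checkpoint layout (the state dicts here are placeholder dicts, not real module states; `make_checkpoint` is a hypothetical helper, not part of the repo):

```python
def make_checkpoint(model_state, optimizer_state=None):
    # Mirrors Base.save: the optimizer state is included only when provided,
    # so Base.load can safely check `"optimizer_state" in checkpoint`.
    ckpt = {"model_state": model_state}
    if optimizer_state is not None:
        ckpt["optimizer_state"] = optimizer_state
    return ckpt

ckpt = make_checkpoint({"w": [1.0]}, {"lr": 0.001})
print(sorted(ckpt))  # ['model_state', 'optimizer_state']
```

Loading with `strict=False`, as `Base.load` does, additionally tolerates checkpoints whose `model_state` is missing some of the module's keys.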
@@ -1 +0,0 @@
-#
@@ -1,85 +0,0 @@
-import torch
-import torch.nn as nn
-from .common.batch_norm_conv import BatchNormConv
-from .common.highway_network import HighwayNetwork
-
-class CBHG(nn.Module):
-    def __init__(self, K, in_channels, channels, proj_channels, num_highways):
-        super().__init__()
-
-        # List of all rnns to call `flatten_parameters()` on
-        self._to_flatten = []
-
-        self.bank_kernels = [i for i in range(1, K + 1)]
-        self.conv1d_bank = nn.ModuleList()
-        for k in self.bank_kernels:
-            conv = BatchNormConv(in_channels, channels, k)
-            self.conv1d_bank.append(conv)
-
-        self.maxpool = nn.MaxPool1d(kernel_size=2, stride=1, padding=1)
-
-        self.conv_project1 = BatchNormConv(len(self.bank_kernels) * channels, proj_channels[0], 3)
-        self.conv_project2 = BatchNormConv(proj_channels[0], proj_channels[1], 3, relu=False)
-
-        # Fix the highway input if necessary
-        if proj_channels[-1] != channels:
-            self.highway_mismatch = True
-            self.pre_highway = nn.Linear(proj_channels[-1], channels, bias=False)
-        else:
-            self.highway_mismatch = False
-
-        self.highways = nn.ModuleList()
-        for i in range(num_highways):
-            hn = HighwayNetwork(channels)
-            self.highways.append(hn)
-
-        self.rnn = nn.GRU(channels, channels // 2, batch_first=True, bidirectional=True)
-        self._to_flatten.append(self.rnn)
-
-        # Avoid fragmentation of RNN parameters and associated warning
-        self._flatten_parameters()
-
-    def forward(self, x):
-        # Although we `_flatten_parameters()` on init, when using DataParallel
-        # the model gets replicated, making it no longer guaranteed that the
-        # weights are contiguous in GPU memory. Hence, we must call it again
-        self.rnn.flatten_parameters()
-
-        # Save these for later
-        residual = x
-        seq_len = x.size(-1)
-        conv_bank = []
-
-        # Convolution Bank
-        for conv in self.conv1d_bank:
-            c = conv(x)  # Convolution
-            conv_bank.append(c[:, :, :seq_len])
-
-        # Stack along the channel axis
-        conv_bank = torch.cat(conv_bank, dim=1)
-
-        # dump the last padding to fit residual
-        x = self.maxpool(conv_bank)[:, :, :seq_len]
-
-        # Conv1d projections
-        x = self.conv_project1(x)
-        x = self.conv_project2(x)
-
-        # Residual Connect
-        x = x + residual
-
-        # Through the highways
-        x = x.transpose(1, 2)
-        if self.highway_mismatch is True:
-            x = self.pre_highway(x)
-        for h in self.highways: x = h(x)
-
-        # And then the RNN
-        x, _ = self.rnn(x)
-        return x
-
-    def _flatten_parameters(self):
-        """Calls `flatten_parameters` on all the rnns used by the WaveRNN. Used
-        to improve efficiency and avoid PyTorch yelling at us."""
-        [m.flatten_parameters() for m in self._to_flatten]
@@ -1,14 +0,0 @@
-import torch.nn as nn
-import torch.nn.functional as F
-
-class BatchNormConv(nn.Module):
-    def __init__(self, in_channels, out_channels, kernel, relu=True):
-        super().__init__()
-        self.conv = nn.Conv1d(in_channels, out_channels, kernel, stride=1, padding=kernel // 2, bias=False)
-        self.bnorm = nn.BatchNorm1d(out_channels)
-        self.relu = relu
-
-    def forward(self, x):
-        x = self.conv(x)
-        x = F.relu(x) if self.relu is True else x
-        return self.bnorm(x)
@@ -1,17 +0,0 @@
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-
-class HighwayNetwork(nn.Module):
-    def __init__(self, size):
-        super().__init__()
-        self.W1 = nn.Linear(size, size)
-        self.W2 = nn.Linear(size, size)
-        self.W1.bias.data.fill_(0.)
-
-    def forward(self, x):
-        x1 = self.W1(x)
-        x2 = self.W2(x)
-        g = torch.sigmoid(x2)
-        y = g * F.relu(x1) + (1. - g) * x
-        return y
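The removed `HighwayNetwork.forward` computes `y = g * relu(W1 x) + (1 - g) * x` with gate `g = sigmoid(W2 x)`. A minimal scalar sketch of that gating (the weights `w1`, `w2` and biases `b1`, `b2` are illustrative stand-ins, not the module's learned parameters):

```python
import math

def highway_scalar(x, w1=1.0, b1=0.0, w2=0.0, b2=0.0):
    # Scalar highway gate: blend the transformed input with the raw input.
    h = max(0.0, w1 * x + b1)                    # relu(W1*x + b1)
    g = 1.0 / (1.0 + math.exp(-(w2 * x + b2)))   # sigmoid(W2*x + b2)
    return g * h + (1.0 - g) * x

# With zero gate weights g = 0.5, so output averages transform and input.
print(highway_scalar(2.0))  # 0.5*2.0 + 0.5*2.0 = 2.0
```

Because of the `(1 - g) * x` carry term, the block can pass its input through unchanged, which is what makes deep stacks of these layers trainable.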
@@ -1,42 +0,0 @@
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-
-class LSA(nn.Module):
-    def __init__(self, attn_dim, kernel_size=31, filters=32):
-        super().__init__()
-        self.conv = nn.Conv1d(1, filters, padding=(kernel_size - 1) // 2, kernel_size=kernel_size, bias=True)
-        self.L = nn.Linear(filters, attn_dim, bias=False)
-        self.W = nn.Linear(attn_dim, attn_dim, bias=True)  # Include the attention bias in this term
-        self.v = nn.Linear(attn_dim, 1, bias=False)
-        self.cumulative = None
-        self.attention = None
-
-    def init_attention(self, encoder_seq_proj):
-        device = encoder_seq_proj.device  # use same device as parameters
-        b, t, c = encoder_seq_proj.size()
-        self.cumulative = torch.zeros(b, t, device=device)
-        self.attention = torch.zeros(b, t, device=device)
-
-    def forward(self, encoder_seq_proj, query, times, chars):
-
-        if times == 0: self.init_attention(encoder_seq_proj)
-
-        processed_query = self.W(query).unsqueeze(1)
-
-        location = self.cumulative.unsqueeze(1)
-        processed_loc = self.L(self.conv(location).transpose(1, 2))
-
-        u = self.v(torch.tanh(processed_query + encoder_seq_proj + processed_loc))
-        u = u.squeeze(-1)
-
-        # Mask zero padding chars
-        u = u * (chars != 0).float()
-
-        # Smooth Attention
-        # scores = torch.sigmoid(u) / torch.sigmoid(u).sum(dim=1, keepdim=True)
-        scores = F.softmax(u, dim=1)
-        self.attention = scores
-        self.cumulative = self.cumulative + self.attention
-
-        return scores.unsqueeze(-1).transpose(1, 2)
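`LSA.forward` above masks padding by multiplying the raw scores `u` by `(chars != 0)` before the softmax. A small list-based sketch of that masking step (pure Python, no tensors; the inputs are illustrative):

```python
import math

def masked_softmax(u, chars):
    # Multiplicative mask as in LSA.forward: padded logits become 0, not -inf,
    # so padding positions still receive weight exp(0) after the softmax.
    u = [ui * (1.0 if c != 0 else 0.0) for ui, c in zip(u, chars)]
    exps = [math.exp(ui) for ui in u]
    total = sum(exps)
    return [e / total for e in exps]

print(masked_softmax([1.0, 1.0, 5.0], [7, 9, 0]))
```

Note that the multiplicative mask only dampens padded positions rather than zeroing their attention outright; an additive `-inf` mask would remove them entirely.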
@@ -1,27 +0,0 @@
-import torch.nn as nn
-import torch.nn.functional as F
-
-class PreNet(nn.Module):
-    def __init__(self, in_dims, fc1_dims=256, fc2_dims=128, dropout=0.5):
-        super().__init__()
-        self.fc1 = nn.Linear(in_dims, fc1_dims)
-        self.fc2 = nn.Linear(fc1_dims, fc2_dims)
-        self.p = dropout
-
-    def forward(self, x):
-        """forward
-
-        Args:
-            x (3D tensor with size `[batch_size, num_chars, tts_embed_dims]`): input texts list
-
-        Returns:
-            3D tensor with size `[batch_size, num_chars, encoder_dims]`
-        """
-        x = self.fc1(x)
-        x = F.relu(x)
-        x = F.dropout(x, self.p, training=True)
-        x = self.fc2(x)
-        x = F.relu(x)
-        x = F.dropout(x, self.p, training=True)
-        return x
@@ -1,88 +1,277 @@
|
|||||||
|
import os
|
||||||
|
import numpy as np
|
||||||
import torch
|
import torch
|
||||||
import torch.nn as nn
|
import torch.nn as nn
|
||||||
from .sublayer.global_style_token import GlobalStyleToken
|
import torch.nn.functional as F
|
||||||
from .sublayer.pre_net import PreNet
|
from synthesizer.models.global_style_token import GlobalStyleToken
|
||||||
from .sublayer.cbhg import CBHG
|
|
||||||
from .sublayer.lsa import LSA
|
|
||||||
from .base import Base
|
|
||||||
from synthesizer.gst_hyperparameters import GSTHyperparameters as gst_hp
|
from synthesizer.gst_hyperparameters import GSTHyperparameters as gst_hp
|
||||||
from synthesizer.hparams import hparams
|
from synthesizer.hparams import hparams
|
||||||
|
|
||||||
class Encoder(nn.Module):
|
|
||||||
def __init__(self, num_chars, embed_dims=512, encoder_dims=256, K=5, num_highways=4, dropout=0.5):
|
|
||||||
""" Encoder for SV2TTS
|
|
||||||
|
|
||||||
Args:
|
class HighwayNetwork(nn.Module):
|
||||||
num_chars (int): length of symbols
|
def __init__(self, size):
|
||||||
embed_dims (int, optional): embedding dim for input texts. Defaults to 512.
|
|
||||||
encoder_dims (int, optional): output dim for encoder. Defaults to 256.
|
|
||||||
K (int, optional): _description_. Defaults to 5.
|
|
||||||
num_highways (int, optional): _description_. Defaults to 4.
|
|
||||||
dropout (float, optional): _description_. Defaults to 0.5.
|
|
||||||
"""
|
|
||||||
super().__init__()
|
super().__init__()
|
||||||
self.embedding = nn.Embedding(num_chars, embed_dims)
|
self.W1 = nn.Linear(size, size)
|
||||||
self.pre_net = PreNet(embed_dims, fc1_dims=encoder_dims, fc2_dims=encoder_dims,
|
self.W2 = nn.Linear(size, size)
|
||||||
dropout=dropout)
|
self.W1.bias.data.fill_(0.)
|
||||||
self.cbhg = CBHG(K=K, in_channels=encoder_dims, channels=encoder_dims,
|
|
||||||
proj_channels=[encoder_dims, encoder_dims],
|
|
||||||
num_highways=num_highways)
|
|
||||||
|
|
||||||
def forward(self, x):
|
def forward(self, x):
|
||||||
"""forward pass for encoder
|
x1 = self.W1(x)
|
||||||
|
x2 = self.W2(x)
|
||||||
|
g = torch.sigmoid(x2)
|
||||||
|
y = g * F.relu(x1) + (1. - g) * x
|
||||||
|
return y
|
||||||
|
|
||||||
Args:
|
|
||||||
x (2D tensor with size `[batch_size, text_num_chars]`): input texts list
|
|
||||||
|
|
||||||
Returns:
|
class Encoder(nn.Module):
|
||||||
3D tensor with size `[batch_size, text_num_chars, encoder_dims]`
|
def __init__(self, embed_dims, num_chars, encoder_dims, K, num_highways, dropout):
|
||||||
|
super().__init__()
|
||||||
"""
|
prenet_dims = (encoder_dims, encoder_dims)
|
||||||
x = self.embedding(x) # return: [batch_size, text_num_chars, tts_embed_dims]
|
cbhg_channels = encoder_dims
|
||||||
x = self.pre_net(x) # return: [batch_size, text_num_chars, encoder_dims]
|
self.embedding = nn.Embedding(num_chars, embed_dims)
|
||||||
x.transpose_(1, 2) # return: [batch_size, encoder_dims, text_num_chars]
|
self.pre_net = PreNet(embed_dims, fc1_dims=prenet_dims[0], fc2_dims=prenet_dims[1],
|
||||||
return self.cbhg(x) # return: [batch_size, text_num_chars, encoder_dims]
|
dropout=dropout)
|
||||||
|
self.cbhg = CBHG(K=K, in_channels=cbhg_channels, channels=cbhg_channels,
|
||||||
|
proj_channels=[cbhg_channels, cbhg_channels],
|
||||||
|
num_highways=num_highways)
|
||||||
|
|
||||||
|
def forward(self, x, speaker_embedding=None):
|
||||||
|
x = self.embedding(x)
|
||||||
|
x = self.pre_net(x)
|
||||||
|
x.transpose_(1, 2)
|
||||||
|
x = self.cbhg(x)
|
||||||
|
if speaker_embedding is not None:
|
||||||
|
x = self.add_speaker_embedding(x, speaker_embedding)
|
||||||
|
return x
|
||||||
|
|
||||||
|
def add_speaker_embedding(self, x, speaker_embedding):
|
||||||
|
# SV2TTS
|
||||||
|
# The input x is the encoder output and is a 3D tensor with size (batch_size, num_chars, tts_embed_dims)
|
||||||
|
# When training, speaker_embedding is also a 2D tensor with size (batch_size, speaker_embedding_size)
|
||||||
|
# (for inference, speaker_embedding is a 1D tensor with size (speaker_embedding_size))
|
||||||
|
# This concats the speaker embedding for each char in the encoder output
|
||||||
|
|
||||||
|
# Save the dimensions as human-readable names
|
||||||
|
batch_size = x.size()[0]
|
||||||
|
num_chars = x.size()[1]
|
||||||
|
|
||||||
|
if speaker_embedding.dim() == 1:
|
||||||
|
idx = 0
|
||||||
|
else:
|
||||||
|
idx = 1
|
||||||
|
|
||||||
|
# Start by making a copy of each speaker embedding to match the input text length
|
||||||
|
# The output of this has size (batch_size, num_chars * speaker_embedding_size)
|
||||||
|
speaker_embedding_size = speaker_embedding.size()[idx]
|
||||||
|
e = speaker_embedding.repeat_interleave(num_chars, dim=idx)
|
||||||
|
|
||||||
|
# Reshape it and transpose
|
||||||
|
e = e.reshape(batch_size, speaker_embedding_size, num_chars)
|
||||||
|
e = e.transpose(1, 2)
|
||||||
|
|
||||||
|
# Concatenate the tiled speaker embedding with the encoder output
|
||||||
|
x = torch.cat((x, e), 2)
|
||||||
|
return x
|
||||||
|
|
||||||
|
|
||||||
|
class BatchNormConv(nn.Module):
|
||||||
|
def __init__(self, in_channels, out_channels, kernel, relu=True):
|
||||||
|
super().__init__()
|
||||||
|
self.conv = nn.Conv1d(in_channels, out_channels, kernel, stride=1, padding=kernel // 2, bias=False)
|
||||||
|
self.bnorm = nn.BatchNorm1d(out_channels)
|
||||||
|
self.relu = relu
|
||||||
|
|
||||||
|
def forward(self, x):
|
||||||
|
x = self.conv(x)
|
||||||
|
x = F.relu(x) if self.relu is True else x
|
||||||
|
return self.bnorm(x)
|
||||||
|
|
||||||
|
|
||||||
|
class CBHG(nn.Module):
|
||||||
|
def __init__(self, K, in_channels, channels, proj_channels, num_highways):
|
||||||
|
super().__init__()
|
||||||
|
|
||||||
|
# List of all rnns to call `flatten_parameters()` on
|
||||||
|
self._to_flatten = []
|
||||||
|
|
||||||
|
self.bank_kernels = [i for i in range(1, K + 1)]
|
||||||
|
self.conv1d_bank = nn.ModuleList()
|
||||||
|
for k in self.bank_kernels:
|
||||||
|
conv = BatchNormConv(in_channels, channels, k)
|
||||||
|
self.conv1d_bank.append(conv)
|
||||||
|
|
||||||
|
self.maxpool = nn.MaxPool1d(kernel_size=2, stride=1, padding=1)
|
||||||
|
|
||||||
|
self.conv_project1 = BatchNormConv(len(self.bank_kernels) * channels, proj_channels[0], 3)
|
||||||
|
self.conv_project2 = BatchNormConv(proj_channels[0], proj_channels[1], 3, relu=False)
|
||||||
|
|
||||||
|
# Fix the highway input if necessary
|
||||||
|
if proj_channels[-1] != channels:
|
||||||
|
self.highway_mismatch = True
|
||||||
|
self.pre_highway = nn.Linear(proj_channels[-1], channels, bias=False)
|
||||||
|
else:
|
||||||
|
self.highway_mismatch = False
|
||||||
|
|
||||||
|
self.highways = nn.ModuleList()
|
||||||
|
for i in range(num_highways):
|
||||||
|
hn = HighwayNetwork(channels)
|
||||||
|
self.highways.append(hn)
|
||||||
|
|
||||||
|
self.rnn = nn.GRU(channels, channels // 2, batch_first=True, bidirectional=True)
|
||||||
|
self._to_flatten.append(self.rnn)
|
||||||
|
|
||||||
|
# Avoid fragmentation of RNN parameters and associated warning
|
||||||
|
self._flatten_parameters()
|
||||||
|
|
||||||
|
def forward(self, x):
|
||||||
|
# Although we `_flatten_parameters()` on init, when using DataParallel
|
||||||
|
# the model gets replicated, making it no longer guaranteed that the
|
||||||
|
# weights are contiguous in GPU memory. Hence, we must call it again
|
||||||
|
self.rnn.flatten_parameters()
|
||||||
|
|
||||||
|
# Save these for later
|
||||||
|
residual = x
|
||||||
|
seq_len = x.size(-1)
|
||||||
|
conv_bank = []
|
||||||
|
|
||||||
|
# Convolution Bank
|
||||||
|
for conv in self.conv1d_bank:
|
||||||
|
c = conv(x) # Convolution
|
||||||
|
conv_bank.append(c[:, :, :seq_len])
|
||||||
|
|
||||||
|
# Stack along the channel axis
|
||||||
|
conv_bank = torch.cat(conv_bank, dim=1)
|
||||||
|
|
||||||
|
# dump the last padding to fit residual
|
||||||
|
x = self.maxpool(conv_bank)[:, :, :seq_len]
|
||||||
|
|
||||||
|
# Conv1d projections
|
||||||
|
x = self.conv_project1(x)
|
||||||
|
x = self.conv_project2(x)
|
||||||
|
|
||||||
|
# Residual Connect
|
||||||
|
x = x + residual
|
||||||
|
|
||||||
|
# Through the highways
|
||||||
|
x = x.transpose(1, 2)
|
||||||
|
if self.highway_mismatch is True:
|
||||||
|
x = self.pre_highway(x)
|
||||||
|
for h in self.highways: x = h(x)
|
||||||
|
|
||||||
|
# And then the RNN
|
||||||
|
x, _ = self.rnn(x)
|
||||||
|
return x
|
||||||
|
|
||||||
|
def _flatten_parameters(self):
|
||||||
|
"""Calls `flatten_parameters` on all the rnns used by the WaveRNN. Used
|
||||||
|
to improve efficiency and avoid PyTorch yelling at us."""
|
||||||
|
[m.flatten_parameters() for m in self._to_flatten]
|
||||||
|
|
||||||
|
class PreNet(nn.Module):
|
||||||
|
def __init__(self, in_dims, fc1_dims=256, fc2_dims=128, dropout=0.5):
|
||||||
|
super().__init__()
|
||||||
|
self.fc1 = nn.Linear(in_dims, fc1_dims)
|
||||||
|
self.fc2 = nn.Linear(fc1_dims, fc2_dims)
|
||||||
|
self.p = dropout
|
||||||
|
|
||||||
|
def forward(self, x):
|
||||||
|
x = self.fc1(x)
|
||||||
|
x = F.relu(x)
|
||||||
|
x = F.dropout(x, self.p, training=True)
|
||||||
|
x = self.fc2(x)
|
||||||
|
x = F.relu(x)
|
||||||
|
x = F.dropout(x, self.p, training=True)
|
||||||
|
return x
|
||||||
|
|
||||||
|
|
||||||
|
class Attention(nn.Module):
|
||||||
|
def __init__(self, attn_dims):
|
||||||
|
super().__init__()
|
||||||
|
self.W = nn.Linear(attn_dims, attn_dims, bias=False)
|
||||||
|
self.v = nn.Linear(attn_dims, 1, bias=False)
|
||||||
|
|
||||||
|
def forward(self, encoder_seq_proj, query, t):
|
||||||
|
|
||||||
|
# print(encoder_seq_proj.shape)
|
||||||
|
# Transform the query vector
|
||||||
|
query_proj = self.W(query).unsqueeze(1)
|
||||||
|
|
||||||
|
# Compute the scores
|
||||||
|
u = self.v(torch.tanh(encoder_seq_proj + query_proj))
|
||||||
|
scores = F.softmax(u, dim=1)
|
||||||
|
|
||||||
|
return scores.transpose(1, 2)
|
||||||
|
|
||||||
|
|
||||||
|
class LSA(nn.Module):
|
||||||
|
def __init__(self, attn_dim, kernel_size=31, filters=32):
|
||||||
|
super().__init__()
|
||||||
|
self.conv = nn.Conv1d(1, filters, padding=(kernel_size - 1) // 2, kernel_size=kernel_size, bias=True)
|
||||||
|
self.L = nn.Linear(filters, attn_dim, bias=False)
|
||||||
|
self.W = nn.Linear(attn_dim, attn_dim, bias=True) # Include the attention bias in this term
|
||||||
|
self.v = nn.Linear(attn_dim, 1, bias=False)
|
||||||
|
self.cumulative = None
|
||||||
|
self.attention = None
|
||||||
|
|
||||||
|
def init_attention(self, encoder_seq_proj):
|
||||||
|
device = encoder_seq_proj.device # use same device as parameters
|
||||||
|
b, t, c = encoder_seq_proj.size()
|
||||||
|
self.cumulative = torch.zeros(b, t, device=device)
|
||||||
|
self.attention = torch.zeros(b, t, device=device)
|
||||||
|
|
||||||
|
def forward(self, encoder_seq_proj, query, t, chars):
|
||||||
|
|
||||||
|
if t == 0: self.init_attention(encoder_seq_proj)
|
||||||
|
|
||||||
|
processed_query = self.W(query).unsqueeze(1)
|
||||||
|
|
||||||
|
location = self.cumulative.unsqueeze(1)
|
||||||
|
processed_loc = self.L(self.conv(location).transpose(1, 2))
|
||||||
|
|
||||||
|
u = self.v(torch.tanh(processed_query + encoder_seq_proj + processed_loc))
|
||||||
|
u = u.squeeze(-1)
|
||||||
|
|
||||||
|
# Mask zero padding chars
|
||||||
|
u = u * (chars != 0).float()
|
||||||
|
|
||||||
|
# Smooth Attention
|
||||||
|
# scores = torch.sigmoid(u) / torch.sigmoid(u).sum(dim=1, keepdim=True)
|
||||||
|
scores = F.softmax(u, dim=1)
|
||||||
|
self.attention = scores
|
||||||
|
self.cumulative = self.cumulative + self.attention
|
||||||
|
|
||||||
|
return scores.unsqueeze(-1).transpose(1, 2)
|
||||||
|
|
||||||
|
|
||||||
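The masking step in `LSA.forward` zeroes the attention energies at padded character positions (char id 0) before the softmax. A pure-Python sketch of that mask-then-softmax step (no torch; the function name and inputs are illustrative, not from the repo):

```python
import math

def masked_softmax(u, chars):
    # Zero the energies at padded positions (char id 0), then normalise,
    # mirroring `u = u * (chars != 0).float()` followed by F.softmax in LSA.forward.
    u = [ui if ci != 0 else 0.0 for ui, ci in zip(u, chars)]
    exps = [math.exp(ui) for ui in u]
    s = sum(exps)
    return [e / s for e in exps]

scores = masked_softmax([2.0, 1.0, 3.0], chars=[5, 7, 0])
# scores sum to 1; the padded position gets energy 0 (not -inf) before the softmax
```

Note that multiplying by the mask before the softmax leaves exp(0) weight on padded slots rather than removing them entirely; the sketch reproduces that behaviour faithfully.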
class Decoder(nn.Module):
    # Class variable because its value doesn't change between classes
    # yet ought to be scoped by class because its a property of a Decoder
    max_r = 20

-   def __init__(self, n_mels, input_dims, decoder_dims, lstm_dims,
+   def __init__(self, n_mels, encoder_dims, decoder_dims, lstm_dims,
                 dropout, speaker_embedding_size):
        super().__init__()
        self.register_buffer("r", torch.tensor(1, dtype=torch.int))
        self.n_mels = n_mels
-       self.prenet = PreNet(n_mels, fc1_dims=decoder_dims * 2, fc2_dims=decoder_dims * 2,
+       prenet_dims = (decoder_dims * 2, decoder_dims * 2)
+       self.prenet = PreNet(n_mels, fc1_dims=prenet_dims[0], fc2_dims=prenet_dims[1],
                             dropout=dropout)
        self.attn_net = LSA(decoder_dims)
        if hparams.use_gst:
            speaker_embedding_size += gst_hp.E
-       self.attn_rnn = nn.GRUCell(input_dims + decoder_dims * 2, decoder_dims)
-       self.rnn_input = nn.Linear(input_dims + decoder_dims, lstm_dims)
+       self.attn_rnn = nn.GRUCell(encoder_dims + prenet_dims[1] + speaker_embedding_size, decoder_dims)
+       self.rnn_input = nn.Linear(encoder_dims + decoder_dims + speaker_embedding_size, lstm_dims)
        self.res_rnn1 = nn.LSTMCell(lstm_dims, lstm_dims)
        self.res_rnn2 = nn.LSTMCell(lstm_dims, lstm_dims)
        self.mel_proj = nn.Linear(lstm_dims, n_mels * self.max_r, bias=False)
-       self.stop_proj = nn.Linear(input_dims + lstm_dims, 1)
+       self.stop_proj = nn.Linear(encoder_dims + speaker_embedding_size + lstm_dims, 1)

    def zoneout(self, prev, current, device, p=0.1):
        mask = torch.zeros(prev.size(), device=device).bernoulli_(p)
        return prev * mask + current * (1 - mask)

    def forward(self, encoder_seq, encoder_seq_proj, prenet_in,
-               hidden_states, cell_states, context_vec, times, chars):
-       """_summary_
-
-       Args:
-           encoder_seq (3D tensor `[batch_size, text_num_chars, project_dim(default to 512)]`): _description_
-           encoder_seq_proj (3D tensor `[batch_size, text_num_chars, decoder_dims(default to 128)]`): _description_
-           prenet_in (2D tensor `[batch_size, n_mels]`): _description_
-           hidden_states (_type_): _description_
-           cell_states (_type_): _description_
-           context_vec (2D tensor `[batch_size, project_dim(default to 512)]`): _description_
-           times (int): the number of times runned
-           chars (2D tensor with size `[batch_size, text_num_chars]`): original texts list input
-       """
+               hidden_states, cell_states, context_vec, t, chars):
        # Need this for reshaping mels
        batch_size = encoder_seq.size(0)
        device = encoder_seq.device
@@ -91,25 +280,25 @@ class Decoder(nn.Module):
        rnn1_cell, rnn2_cell = cell_states

        # PreNet for the Attention RNN
-       prenet_out = self.prenet(prenet_in) # return: `[batch_size, decoder_dims * 2(256)]`
+       prenet_out = self.prenet(prenet_in)

        # Compute the Attention RNN hidden state
-       attn_rnn_in = torch.cat([context_vec, prenet_out], dim=-1) # `[batch_size, project_dim + decoder_dims * 2 (768)]`
-       attn_hidden = self.attn_rnn(attn_rnn_in.squeeze(1), attn_hidden) # `[batch_size, decoder_dims (128)]`
+       attn_rnn_in = torch.cat([context_vec, prenet_out], dim=-1)
+       attn_hidden = self.attn_rnn(attn_rnn_in.squeeze(1), attn_hidden)

        # Compute the attention scores
-       scores = self.attn_net(encoder_seq_proj, attn_hidden, times, chars)
+       scores = self.attn_net(encoder_seq_proj, attn_hidden, t, chars)

        # Dot product to create the context vector
        context_vec = scores @ encoder_seq
        context_vec = context_vec.squeeze(1)

        # Concat Attention RNN output w. Context Vector & project
-       x = torch.cat([context_vec, attn_hidden], dim=1) # `[batch_size, project_dim + decoder_dims (630)]`
-       x = self.rnn_input(x) # `[batch_size, lstm_dims(1024)]`
+       x = torch.cat([context_vec, attn_hidden], dim=1)
+       x = self.rnn_input(x)

-       # Compute first Residual RNN, training with fixed zoneout rate 0.1
-       rnn1_hidden_next, rnn1_cell = self.res_rnn1(x, (rnn1_hidden, rnn1_cell)) # `[batch_size, lstm_dims(1024)]`
+       # Compute first Residual RNN
+       rnn1_hidden_next, rnn1_cell = self.res_rnn1(x, (rnn1_hidden, rnn1_cell))
        if self.training:
            rnn1_hidden = self.zoneout(rnn1_hidden, rnn1_hidden_next, device=device)
        else:
@@ -117,7 +306,7 @@ class Decoder(nn.Module):
        x = x + rnn1_hidden

        # Compute second Residual RNN
-       rnn2_hidden_next, rnn2_cell = self.res_rnn2(x, (rnn2_hidden, rnn2_cell)) # `[batch_size, lstm_dims(1024)]`
+       rnn2_hidden_next, rnn2_cell = self.res_rnn2(x, (rnn2_hidden, rnn2_cell))
        if self.training:
            rnn2_hidden = self.zoneout(rnn2_hidden, rnn2_hidden_next, device=device)
        else:
@@ -125,8 +314,8 @@ class Decoder(nn.Module):
        x = x + rnn2_hidden

        # Project Mels
-       mels = self.mel_proj(x) # `[batch_size, 1600]`
-       mels = mels.view(batch_size, self.n_mels, self.max_r)[:, :, :self.r] # `[batch_size, n_mels, r]`
+       mels = self.mel_proj(x)
+       mels = mels.view(batch_size, self.n_mels, self.max_r)[:, :, :self.r]
        hidden_states = (attn_hidden, rnn1_hidden, rnn2_hidden)
        cell_states = (rnn1_cell, rnn2_cell)

@@ -137,30 +326,45 @@ class Decoder(nn.Module):

        return mels, scores, hidden_states, cell_states, context_vec, stop_tokens
-class Tacotron(Base):
+class Tacotron(nn.Module):
    def __init__(self, embed_dims, num_chars, encoder_dims, decoder_dims, n_mels,
                 fft_bins, postnet_dims, encoder_K, lstm_dims, postnet_K, num_highways,
                 dropout, stop_threshold, speaker_embedding_size):
-       super().__init__(stop_threshold)
+       super().__init__()
        self.n_mels = n_mels
        self.lstm_dims = lstm_dims
        self.encoder_dims = encoder_dims
        self.decoder_dims = decoder_dims
        self.speaker_embedding_size = speaker_embedding_size
-       self.encoder = Encoder(num_chars, embed_dims, encoder_dims,
+       self.encoder = Encoder(embed_dims, num_chars, encoder_dims,
                               encoder_K, num_highways, dropout)
-       self.project_dims = encoder_dims + speaker_embedding_size
+       project_dims = encoder_dims + speaker_embedding_size
        if hparams.use_gst:
-           self.project_dims += gst_hp.E
-       self.encoder_proj = nn.Linear(self.project_dims, decoder_dims, bias=False)
+           project_dims += gst_hp.E
+       self.encoder_proj = nn.Linear(project_dims, decoder_dims, bias=False)
        if hparams.use_gst:
            self.gst = GlobalStyleToken(speaker_embedding_size)
-       self.decoder = Decoder(n_mels, self.project_dims, decoder_dims, lstm_dims,
+       self.decoder = Decoder(n_mels, encoder_dims, decoder_dims, lstm_dims,
                               dropout, speaker_embedding_size)
        self.postnet = CBHG(postnet_K, n_mels, postnet_dims,
                            [postnet_dims, fft_bins], num_highways)
        self.post_proj = nn.Linear(postnet_dims, fft_bins, bias=False)

+       self.init_model()
+       self.num_params()
+
+       self.register_buffer("step", torch.zeros(1, dtype=torch.long))
+       self.register_buffer("stop_threshold", torch.tensor(stop_threshold, dtype=torch.float32))
+
+   @property
+   def r(self):
+       return self.decoder.r.item()
+
+   @r.setter
+   def r(self, value):
+       self.decoder.r = self.decoder.r.new_tensor(value, requires_grad=False)
    @staticmethod
    def _concat_speaker_embedding(outputs, speaker_embeddings):
        speaker_embeddings_ = speaker_embeddings.expand(
@@ -168,52 +372,11 @@ class Tacotron(Base):
        outputs = torch.cat([outputs, speaker_embeddings_], dim=-1)
        return outputs

-   @staticmethod
-   def _add_speaker_embedding(x, speaker_embedding):
-       """Add speaker embedding
-       This concats the speaker embedding for each char in the encoder output
-       Args:
-           x (3D tensor with size `[batch_size, text_num_chars, encoder_dims]`): the encoder output
-           speaker_embedding (2D tensor `[batch_size, speaker_embedding_size]`): the speaker embedding
-
-       Returns:
-           3D tensor with size `[batch_size, text_num_chars, encoder_dims+speaker_embedding_size]`
-       """
-       # Save the dimensions as human-readable names
-       batch_size = x.size()[0]
-       text_num_chars = x.size()[1]
-
-       # Start by making a copy of each speaker embedding to match the input text length
-       # The output of this has size (batch_size, text_num_chars * speaker_embedding_size)
-       speaker_embedding_size = speaker_embedding.size()[1]
-       e = speaker_embedding.repeat_interleave(text_num_chars, dim=1)
-
-       # Reshape it and transpose
-       e = e.reshape(batch_size, speaker_embedding_size, text_num_chars)
-       e = e.transpose(1, 2)
-
-       # Concatenate the tiled speaker embedding with the encoder output
-       x = torch.cat((x, e), 2)
-       return x
-
-   def forward(self, texts, mels, speaker_embedding, steps=2000, style_idx=0, min_stop_token=5):
-       """Forward pass for Tacotron
-
-       Args:
-           texts (`[batch_size, text_num_chars]`): input texts list
-           mels (`[batch_size, varied_mel_lengths, steps]`): mels for comparison (training only)
-           speaker_embedding (`[batch_size, speaker_embedding_size(default to 256)]`): referring embedding.
-           steps (int, optional): . Defaults to 2000.
-           style_idx (int, optional): GST style selected. Defaults to 0.
-           min_stop_token (int, optional): decoder min_stop_token. Defaults to 5.
-       """
+   def forward(self, texts, mels, speaker_embedding):
        device = texts.device  # use same device as parameters

-       if self.training:
-           self.step += 1
-           batch_size, _, steps = mels.size()
-       else:
-           batch_size, _ = texts.size()
+       self.step += 1
+       batch_size, _, steps = mels.size()

        # Initialise all hidden states and pack into tuple
        attn_hidden = torch.zeros(batch_size, self.decoder_dims, device=device)
@@ -229,50 +392,35 @@ class Tacotron(Base):
        # <GO> Frame for start of decoder loop
        go_frame = torch.zeros(batch_size, self.n_mels, device=device)

+       # Need an initial context vector
+       size = self.encoder_dims + self.speaker_embedding_size
+       if hparams.use_gst:
+           size += gst_hp.E
+       context_vec = torch.zeros(batch_size, size, device=device)
+
        # SV2TTS: Run the encoder with the speaker embedding
        # The projection avoids unnecessary matmuls in the decoder loop
-       encoder_seq = self.encoder(texts)
-       # put after encoder
-       encoder_seq = self._add_speaker_embedding(encoder_seq, speaker_embedding)
+       encoder_seq = self.encoder(texts, speaker_embedding)

        if hparams.use_gst and self.gst is not None:
-           if self.training:
-               style_embed = self.gst(speaker_embedding, speaker_embedding) # for training, speaker embedding can represent both style inputs and referenced
-               # style_embed = style_embed.expand_as(encoder_seq)
-               # encoder_seq = torch.cat((encoder_seq, style_embed), 2)
-               encoder_seq = self._concat_speaker_embedding(encoder_seq, style_embed)
-           elif style_idx >= 0 and style_idx < 10:
-               query = torch.zeros(1, 1, self.gst.stl.attention.num_units)
-               if device.type == 'cuda':
-                   query = query.cuda()
-               gst_embed = torch.tanh(self.gst.stl.embed)
-               key = gst_embed[style_idx].unsqueeze(0).expand(1, -1, -1)
-               style_embed = self.gst.stl.attention(query, key)
-           else:
-               speaker_embedding_style = torch.zeros(speaker_embedding.size()[0], 1, self.speaker_embedding_size).to(device)
-               style_embed = self.gst(speaker_embedding_style, speaker_embedding)
-           encoder_seq = self._concat_speaker_embedding(encoder_seq, style_embed) # return: [batch_size, text_num_chars, project_dims]
-
-       encoder_seq_proj = self.encoder_proj(encoder_seq) # return: [batch_size, text_num_chars, decoder_dims]
+           style_embed = self.gst(speaker_embedding, speaker_embedding) # for training, speaker embedding can represent both style inputs and referenced
+           # style_embed = style_embed.expand_as(encoder_seq)
+           # encoder_seq = torch.cat((encoder_seq, style_embed), 2)
+           encoder_seq = self._concat_speaker_embedding(encoder_seq, style_embed)
+       encoder_seq_proj = self.encoder_proj(encoder_seq)

        # Need a couple of lists for outputs
        mel_outputs, attn_scores, stop_outputs = [], [], []

-       # Need an initial context vector
-       context_vec = torch.zeros(batch_size, self.project_dims, device=device)
-
        # Run the decoder loop
        for t in range(0, steps, self.r):
-           if self.training:
-               prenet_in = mels[:, :, t - 1] if t > 0 else go_frame
-           else:
-               prenet_in = mel_outputs[-1][:, :, -1] if t > 0 else go_frame
+           prenet_in = mels[:, :, t - 1] if t > 0 else go_frame
            mel_frames, scores, hidden_states, cell_states, context_vec, stop_tokens = \
                self.decoder(encoder_seq, encoder_seq_proj, prenet_in,
                             hidden_states, cell_states, context_vec, t, texts)
            mel_outputs.append(mel_frames)
            attn_scores.append(scores)
            stop_outputs.extend([stop_tokens] * self.r)
-           if not self.training and (stop_tokens * 10 > min_stop_token).all() and t > 10: break

        # Concat the mel outputs into sequence
        mel_outputs = torch.cat(mel_outputs, dim=2)
@@ -287,12 +435,135 @@ class Tacotron(Base):
        # attn_scores = attn_scores.cpu().data.numpy()
        stop_outputs = torch.cat(stop_outputs, 1)

-       if self.training:
-           self.train()
-
        return mel_outputs, linear, attn_scores, stop_outputs

-   def generate(self, x, speaker_embedding, steps=2000, style_idx=0, min_stop_token=5):
+   def generate(self, x, speaker_embedding=None, steps=2000, style_idx=0, min_stop_token=5):
        self.eval()
-       mel_outputs, linear, attn_scores, _ = self.forward(x, None, speaker_embedding, steps, style_idx, min_stop_token)
+       device = x.device  # use same device as parameters
+
+       batch_size, _ = x.size()
+
+       # Need to initialise all hidden states and pack into tuple for tidyness
+       attn_hidden = torch.zeros(batch_size, self.decoder_dims, device=device)
+       rnn1_hidden = torch.zeros(batch_size, self.lstm_dims, device=device)
+       rnn2_hidden = torch.zeros(batch_size, self.lstm_dims, device=device)
+       hidden_states = (attn_hidden, rnn1_hidden, rnn2_hidden)
+
+       # Need to initialise all lstm cell states and pack into tuple for tidyness
+       rnn1_cell = torch.zeros(batch_size, self.lstm_dims, device=device)
+       rnn2_cell = torch.zeros(batch_size, self.lstm_dims, device=device)
+       cell_states = (rnn1_cell, rnn2_cell)
+
+       # Need a <GO> Frame for start of decoder loop
+       go_frame = torch.zeros(batch_size, self.n_mels, device=device)
+
+       # Need an initial context vector
+       size = self.encoder_dims + self.speaker_embedding_size
+       if hparams.use_gst:
+           size += gst_hp.E
+       context_vec = torch.zeros(batch_size, size, device=device)
+
+       # SV2TTS: Run the encoder with the speaker embedding
+       # The projection avoids unnecessary matmuls in the decoder loop
+       encoder_seq = self.encoder(x, speaker_embedding)
+
+       # put after encoder
+       if hparams.use_gst and self.gst is not None:
+           if style_idx >= 0 and style_idx < 10:
+               query = torch.zeros(1, 1, self.gst.stl.attention.num_units)
+               if device.type == 'cuda':
+                   query = query.cuda()
+               gst_embed = torch.tanh(self.gst.stl.embed)
+               key = gst_embed[style_idx].unsqueeze(0).expand(1, -1, -1)
+               style_embed = self.gst.stl.attention(query, key)
+           else:
+               speaker_embedding_style = torch.zeros(speaker_embedding.size()[0], 1, self.speaker_embedding_size).to(device)
+               style_embed = self.gst(speaker_embedding_style, speaker_embedding)
+           encoder_seq = self._concat_speaker_embedding(encoder_seq, style_embed)
+           # style_embed = style_embed.expand_as(encoder_seq)
+           # encoder_seq = torch.cat((encoder_seq, style_embed), 2)
+       encoder_seq_proj = self.encoder_proj(encoder_seq)
+
+       # Need a couple of lists for outputs
+       mel_outputs, attn_scores, stop_outputs = [], [], []
+
+       # Run the decoder loop
+       for t in range(0, steps, self.r):
+           prenet_in = mel_outputs[-1][:, :, -1] if t > 0 else go_frame
+           mel_frames, scores, hidden_states, cell_states, context_vec, stop_tokens = \
+               self.decoder(encoder_seq, encoder_seq_proj, prenet_in,
+                            hidden_states, cell_states, context_vec, t, x)
+           mel_outputs.append(mel_frames)
+           attn_scores.append(scores)
+           stop_outputs.extend([stop_tokens] * self.r)
+           # Stop the loop when all stop tokens in batch exceed threshold
+           if (stop_tokens * 10 > min_stop_token).all() and t > 10: break
+
+       # Concat the mel outputs into sequence
+       mel_outputs = torch.cat(mel_outputs, dim=2)
+
+       # Post-Process for Linear Spectrograms
+       postnet_out = self.postnet(mel_outputs)
+       linear = self.post_proj(postnet_out)
+
+       linear = linear.transpose(1, 2)
+
+       # For easy visualisation
+       attn_scores = torch.cat(attn_scores, 1)
+       stop_outputs = torch.cat(stop_outputs, 1)
+
+       self.train()

        return mel_outputs, linear, attn_scores
+
+   def init_model(self):
+       for p in self.parameters():
+           if p.dim() > 1: nn.init.xavier_uniform_(p)
+
+   def finetune_partial(self, whitelist_layers):
+       self.zero_grad()
+       for name, child in self.named_children():
+           if name in whitelist_layers:
+               print("Trainable Layer: %s" % name)
+               print("Trainable Parameters: %.3f" % sum([np.prod(p.size()) for p in child.parameters()]))
+               for param in child.parameters():
+                   param.requires_grad = False
+
+   def get_step(self):
+       return self.step.data.item()
+
+   def reset_step(self):
+       # assignment to parameters or buffers is overloaded, updates internal dict entry
+       self.step = self.step.data.new_tensor(1)
+
+   def log(self, path, msg):
+       with open(path, "a") as f:
+           print(msg, file=f)
+
+   def load(self, path, device, optimizer=None):
+       # Use device of model params as location for loaded state
+       checkpoint = torch.load(str(path), map_location=device)
+       self.load_state_dict(checkpoint["model_state"], strict=False)
+
+       if "optimizer_state" in checkpoint and optimizer is not None:
+           optimizer.load_state_dict(checkpoint["optimizer_state"])
+
+   def save(self, path, optimizer=None):
+       if optimizer is not None:
+           torch.save({
+               "model_state": self.state_dict(),
+               "optimizer_state": optimizer.state_dict(),
+           }, str(path))
+       else:
+           torch.save({
+               "model_state": self.state_dict(),
+           }, str(path))
+
+   def num_params(self, print_out=True):
+       parameters = filter(lambda p: p.requires_grad, self.parameters())
+       parameters = sum([np.prod(p.size()) for p in parameters]) / 1_000_000
+       if print_out:
+           print("Trainable Parameters: %.3fM" % parameters)
+       return parameters
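The `Decoder` above regularizes its residual LSTM updates with zoneout: during training, each hidden unit keeps its previous value with probability `p` instead of taking the newly computed one. A pure-Python sketch of that mixing step (no torch; the element-wise loop and names are illustrative, not from the repo):

```python
import random

def zoneout(prev, current, p=0.1):
    # With probability p an element "zones out" and keeps its previous value;
    # otherwise it takes the freshly computed value, mirroring Decoder.zoneout.
    return [pv if random.random() < p else cv for pv, cv in zip(prev, current)]

random.seed(0)
out = zoneout([0.0] * 8, [1.0] * 8, p=0.5)
# each element of `out` is either 0.0 (kept from prev) or 1.0 (taken from current)
```

At inference time the code above skips zoneout entirely (`else:` branch), so the hidden state is simply replaced.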
@@ -15,8 +15,9 @@ from datetime import datetime
import json
import numpy as np
from pathlib import Path
+import sys
import time
-import os

def np_now(x: torch.Tensor): return x.detach().cpu().numpy()

@@ -264,19 +265,7 @@ def train(run_id: str, syn_dir: str, models_dir: str, save_every: int,
                          loss=loss,
                          hparams=hparams,
                          sw=sw)
-           MAX_SAVED_COUNT = 20
-           if (step / hparams.tts_eval_interval) % MAX_SAVED_COUNT == 0:
-               # clean up and save last MAX_SAVED_COUNT;
-               plots = next(os.walk(plot_dir), (None, None, []))[2]
-               for plot in plots[-MAX_SAVED_COUNT:]:
-                   os.remove(plot_dir.joinpath(plot))
-               mel_files = next(os.walk(mel_output_dir), (None, None, []))[2]
-               for mel_file in mel_files[-MAX_SAVED_COUNT:]:
-                   os.remove(mel_output_dir.joinpath(mel_file))
-               wavs = next(os.walk(wav_dir), (None, None, []))[2]
-               for w in wavs[-MAX_SAVED_COUNT:]:
-                   os.remove(wav_dir.joinpath(w))

            # Break out of loop to update training schedule
            if step >= max_step:
                break
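The cleanup block removed above listed the plain files of a directory with `next(os.walk(d), (None, None, []))[2]`, a common idiom for getting filenames without recursing. A small self-contained sketch (the temporary directory and filename are illustrative):

```python
import os
import tempfile

d = tempfile.mkdtemp()
open(os.path.join(d, "step_100.png"), "w").close()

# os.walk yields (dirpath, dirnames, filenames) tuples; taking [2] of the first
# tuple gives just the filenames directly inside d. The (None, None, []) default
# passed to next() guards against a missing directory, which yields nothing.
files = next(os.walk(d), (None, None, []))[2]
missing = next(os.walk(os.path.join(d, "nope")), (None, None, []))[2]
```

The default argument is what made the original cleanup safe to run before any plots or wavs had been written.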
@@ -3,7 +3,6 @@ from encoder import inference as encoder
from synthesizer.inference import Synthesizer
from vocoder.wavernn import inference as rnn_vocoder
from vocoder.hifigan import inference as gan_vocoder
-from vocoder.fregan import inference as fgan_vocoder
from pathlib import Path
from time import perf_counter as timer
from toolbox.utterance import Utterance
@@ -443,7 +442,7 @@ class Toolbox:
            return
        # Sekect vocoder based on model name
        model_config_fpath = None
-       if model_fpath.name is not None and model_fpath.name.find("hifigan") > -1:
+       if model_fpath.name[0] == "g":
            vocoder = gan_vocoder
            self.ui.log("set hifigan as vocoder")
            # search a config file
@@ -452,15 +451,6 @@ class Toolbox:
                return
            if len(model_config_fpaths) > 0:
                model_config_fpath = model_config_fpaths[0]
-       elif model_fpath.name is not None and model_fpath.name.find("fregan") > -1:
-           vocoder = fgan_vocoder
-           self.ui.log("set fregan as vocoder")
-           # search a config file
-           model_config_fpaths = list(model_fpath.parent.rglob("*.json"))
-           if self.vc_mode and self.ui.current_extractor_fpath is None:
-               return
-           if len(model_config_fpaths) > 0:
-               model_config_fpath = model_config_fpaths[0]
        else:
            vocoder = rnn_vocoder
            self.ui.log("set wavernn as vocoder")
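The vocoder selection above switches from substring matching on the checkpoint filename to checking only its first character (HiFi-GAN generator checkpoints are conventionally named `g_...`). A minimal sketch of the two tests, with an illustrative path:

```python
from pathlib import Path

ckpt = Path("saved_models/g_hifigan.pt")  # illustrative checkpoint path

by_substring = ckpt.name.find("hifigan") > -1  # old check: "hifigan" anywhere in the name
by_prefix = ckpt.name[0] == "g"                # new check: name starts with "g"
```

The prefix check is cheaper but stricter: it silently falls through to WaveRNN for any checkpoint whose filename does not start with `g`.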
129	vocoder/fregan/.gitignore vendored
@@ -1,129 +0,0 @@
-# Byte-compiled / optimized / DLL files
-__pycache__/
-*.py[cod]
-*$py.class
-
-# C extensions
-*.so
-
-# Distribution / packaging
-.Python
-build/
-develop-eggs/
-dist/
-downloads/
-eggs/
-.eggs/
-lib/
-lib64/
-parts/
-sdist/
-var/
-wheels/
-pip-wheel-metadata/
-share/python-wheels/
-*.egg-info/
-.installed.cfg
-*.egg
-MANIFEST
-
-# PyInstaller
-# Usually these files are written by a python script from a template
-# before PyInstaller builds the exe, so as to inject date/other infos into it.
-*.manifest
-*.spec
-
-# Installer logs
-pip-log.txt
-pip-delete-this-directory.txt
-
-# Unit test / coverage reports
-htmlcov/
-.tox/
-.nox/
-.coverage
-.coverage.*
-.cache
-nosetests.xml
-coverage.xml
-*.cover
-*.py,cover
-.hypothesis/
-.pytest_cache/
-
-# Translations
-*.mo
-*.pot
-
-# Django stuff:
-*.log
-local_settings.py
-db.sqlite3
-db.sqlite3-journal
-
-# Flask stuff:
-instance/
-.webassets-cache
-
-# Scrapy stuff:
-.scrapy
-
-# Sphinx documentation
-docs/_build/
-
-# PyBuilder
-target/
-
-# Jupyter Notebook
-.ipynb_checkpoints
-
-# IPython
-profile_default/
-ipython_config.py
-
-# pyenv
-.python-version
-
-# pipenv
-# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
-# However, in case of collaboration, if having platform-specific dependencies or dependencies
-# having no cross-platform support, pipenv may install dependencies that don't work, or not
-# install all needed dependencies.
-#Pipfile.lock
-
-# PEP 582; used by e.g. github.com/David-OConnor/pyflow
-__pypackages__/
-
-# Celery stuff
-celerybeat-schedule
-celerybeat.pid
-
-# SageMath parsed files
-*.sage.py
-
-# Environments
-.env
-.venv
-env/
-venv/
-ENV/
-env.bak/
-venv.bak/
-
-# Spyder project settings
-.spyderproject
-.spyproject
-
-# Rope project settings
-.ropeproject
-
-# mkdocs documentation
-/site
-
-# mypy
-.mypy_cache/
-.dmypy.json
-dmypy.json
-
-# Pyre type checker
-.pyre/
@@ -1,21 +0,0 @@
MIT License

Copyright (c) 2021 Rishikesh (ऋषिकेश)

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
@@ -1,42 +0,0 @@
{
    "resblock": "1",
    "num_gpus": 0,
    "batch_size": 16,
    "learning_rate": 0.0002,
    "adam_b1": 0.8,
    "adam_b2": 0.99,
    "lr_decay": 0.999,
    "seed": 1234,
    "disc_start_step": 0,


    "upsample_rates": [5,5,2,2,2],
    "upsample_kernel_sizes": [10,10,4,4,4],
    "upsample_initial_channel": 512,
    "resblock_kernel_sizes": [3,7,11],
    "resblock_dilation_sizes": [[1, 3, 5, 7], [1,3,5,7], [1,3,5,7]],

    "segment_size": 6400,
    "num_mels": 80,
    "num_freq": 1025,
    "n_fft": 1024,
    "hop_size": 200,
    "win_size": 800,

    "sampling_rate": 16000,

    "fmin": 0,
    "fmax": 7600,
    "fmax_for_loss": null,

    "num_workers": 4,

    "dist_config": {
        "dist_backend": "nccl",
        "dist_url": "tcp://localhost:54321",
        "world_size": 1
    }
}
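One internal constraint in the config above is worth making explicit (an illustrative check, not part of the diff): in HiFi-GAN-style vocoders such as FreGAN, the product of `upsample_rates` must equal `hop_size`, so that one mel frame expands to exactly one hop of audio samples. A minimal sanity check over the values above:

```python
from functools import reduce
from operator import mul

upsample_rates = [5, 5, 2, 2, 2]
hop_size = 200
sampling_rate = 16000

# Total upsampling factor of the generator: one mel frame -> this many samples.
total_upsampling = reduce(mul, upsample_rates, 1)
assert total_upsampling == hop_size  # 5*5*2*2*2 == 200

# Frame rate of the mel spectrogram, in frames per second.
frame_rate = sampling_rate / hop_size
print(total_upsampling, frame_rate)  # 200 80.0
```

If the two values disagree, the generator's output length no longer lines up with the ground-truth waveform during training.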
@@ -1,303 +0,0 @@
import torch
import torch.nn.functional as F
import torch.nn as nn
from torch.nn import Conv1d, AvgPool1d, Conv2d
from torch.nn.utils import weight_norm, spectral_norm
from vocoder.fregan.utils import get_padding
from vocoder.fregan.stft_loss import stft
from vocoder.fregan.dwt import DWT_1D

LRELU_SLOPE = 0.1


class SpecDiscriminator(nn.Module):
    """docstring for Discriminator."""

    def __init__(self, fft_size=1024, shift_size=120, win_length=600, window="hann_window", use_spectral_norm=False):
        super(SpecDiscriminator, self).__init__()
        norm_f = weight_norm if use_spectral_norm == False else spectral_norm
        self.fft_size = fft_size
        self.shift_size = shift_size
        self.win_length = win_length
        self.window = getattr(torch, window)(win_length)
        self.discriminators = nn.ModuleList([
            norm_f(nn.Conv2d(1, 32, kernel_size=(3, 9), padding=(1, 4))),
            norm_f(nn.Conv2d(32, 32, kernel_size=(3, 9), stride=(1,2), padding=(1, 4))),
            norm_f(nn.Conv2d(32, 32, kernel_size=(3, 9), stride=(1,2), padding=(1, 4))),
            norm_f(nn.Conv2d(32, 32, kernel_size=(3, 9), stride=(1,2), padding=(1, 4))),
            norm_f(nn.Conv2d(32, 32, kernel_size=(3, 3), stride=(1,1), padding=(1, 1))),
        ])

        self.out = norm_f(nn.Conv2d(32, 1, 3, 1, 1))

    def forward(self, y):

        fmap = []
        with torch.no_grad():
            y = y.squeeze(1)
            y = stft(y, self.fft_size, self.shift_size, self.win_length, self.window.to(y.get_device()))
        y = y.unsqueeze(1)
        for i, d in enumerate(self.discriminators):
            y = d(y)
            y = F.leaky_relu(y, LRELU_SLOPE)
            fmap.append(y)

        y = self.out(y)
        fmap.append(y)

        return torch.flatten(y, 1, -1), fmap


class MultiResSpecDiscriminator(torch.nn.Module):

    def __init__(self,
                 fft_sizes=[1024, 2048, 512],
                 hop_sizes=[120, 240, 50],
                 win_lengths=[600, 1200, 240],
                 window="hann_window"):

        super(MultiResSpecDiscriminator, self).__init__()
        self.discriminators = nn.ModuleList([
            SpecDiscriminator(fft_sizes[0], hop_sizes[0], win_lengths[0], window),
            SpecDiscriminator(fft_sizes[1], hop_sizes[1], win_lengths[1], window),
            SpecDiscriminator(fft_sizes[2], hop_sizes[2], win_lengths[2], window)
        ])

    def forward(self, y, y_hat):
        y_d_rs = []
        y_d_gs = []
        fmap_rs = []
        fmap_gs = []
        for i, d in enumerate(self.discriminators):
            y_d_r, fmap_r = d(y)
            y_d_g, fmap_g = d(y_hat)
            y_d_rs.append(y_d_r)
            fmap_rs.append(fmap_r)
            y_d_gs.append(y_d_g)
            fmap_gs.append(fmap_g)

        return y_d_rs, y_d_gs, fmap_rs, fmap_gs


class DiscriminatorP(torch.nn.Module):
    def __init__(self, period, kernel_size=5, stride=3, use_spectral_norm=False):
        super(DiscriminatorP, self).__init__()
        self.period = period
        norm_f = weight_norm if use_spectral_norm == False else spectral_norm
        self.dwt1d = DWT_1D()
        self.dwt_conv1 = norm_f(Conv1d(2, 1, 1))
        self.dwt_proj1 = norm_f(Conv2d(1, 32, (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0)))
        self.dwt_conv2 = norm_f(Conv1d(4, 1, 1))
        self.dwt_proj2 = norm_f(Conv2d(1, 128, (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0)))
        self.dwt_conv3 = norm_f(Conv1d(8, 1, 1))
        self.dwt_proj3 = norm_f(Conv2d(1, 512, (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0)))
        self.convs = nn.ModuleList([
            norm_f(Conv2d(1, 32, (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0))),
            norm_f(Conv2d(32, 128, (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0))),
            norm_f(Conv2d(128, 512, (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0))),
            norm_f(Conv2d(512, 1024, (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0))),
            norm_f(Conv2d(1024, 1024, (kernel_size, 1), 1, padding=(2, 0))),
        ])
        self.conv_post = norm_f(Conv2d(1024, 1, (3, 1), 1, padding=(1, 0)))

    def forward(self, x):
        fmap = []

        # DWT 1
        x_d1_high1, x_d1_low1 = self.dwt1d(x)
        x_d1 = self.dwt_conv1(torch.cat([x_d1_high1, x_d1_low1], dim=1))
        # 1d to 2d
        b, c, t = x_d1.shape
        if t % self.period != 0:  # pad first
            n_pad = self.period - (t % self.period)
            x_d1 = F.pad(x_d1, (0, n_pad), "reflect")
            t = t + n_pad
        x_d1 = x_d1.view(b, c, t // self.period, self.period)

        x_d1 = self.dwt_proj1(x_d1)

        # DWT 2
        x_d2_high1, x_d2_low1 = self.dwt1d(x_d1_high1)
        x_d2_high2, x_d2_low2 = self.dwt1d(x_d1_low1)
        x_d2 = self.dwt_conv2(torch.cat([x_d2_high1, x_d2_low1, x_d2_high2, x_d2_low2], dim=1))
        # 1d to 2d
        b, c, t = x_d2.shape
        if t % self.period != 0:  # pad first
            n_pad = self.period - (t % self.period)
            x_d2 = F.pad(x_d2, (0, n_pad), "reflect")
            t = t + n_pad
        x_d2 = x_d2.view(b, c, t // self.period, self.period)

        x_d2 = self.dwt_proj2(x_d2)

        # DWT 3

        x_d3_high1, x_d3_low1 = self.dwt1d(x_d2_high1)
        x_d3_high2, x_d3_low2 = self.dwt1d(x_d2_low1)
        x_d3_high3, x_d3_low3 = self.dwt1d(x_d2_high2)
        x_d3_high4, x_d3_low4 = self.dwt1d(x_d2_low2)
        x_d3 = self.dwt_conv3(
            torch.cat([x_d3_high1, x_d3_low1, x_d3_high2, x_d3_low2, x_d3_high3, x_d3_low3, x_d3_high4, x_d3_low4],
                      dim=1))
        # 1d to 2d
        b, c, t = x_d3.shape
        if t % self.period != 0:  # pad first
            n_pad = self.period - (t % self.period)
            x_d3 = F.pad(x_d3, (0, n_pad), "reflect")
            t = t + n_pad
        x_d3 = x_d3.view(b, c, t // self.period, self.period)

        x_d3 = self.dwt_proj3(x_d3)

        # 1d to 2d
        b, c, t = x.shape
        if t % self.period != 0:  # pad first
            n_pad = self.period - (t % self.period)
            x = F.pad(x, (0, n_pad), "reflect")
            t = t + n_pad
        x = x.view(b, c, t // self.period, self.period)
        i = 0
        for l in self.convs:
            x = l(x)
            x = F.leaky_relu(x, LRELU_SLOPE)

            fmap.append(x)
            if i == 0:
                x = torch.cat([x, x_d1], dim=2)
            elif i == 1:
                x = torch.cat([x, x_d2], dim=2)
            elif i == 2:
                x = torch.cat([x, x_d3], dim=2)
            else:
                x = x
            i = i + 1
        x = self.conv_post(x)
        fmap.append(x)
        x = torch.flatten(x, 1, -1)

        return x, fmap


class ResWiseMultiPeriodDiscriminator(torch.nn.Module):
    def __init__(self):
        super(ResWiseMultiPeriodDiscriminator, self).__init__()
        self.discriminators = nn.ModuleList([
            DiscriminatorP(2),
            DiscriminatorP(3),
            DiscriminatorP(5),
            DiscriminatorP(7),
            DiscriminatorP(11),
        ])

    def forward(self, y, y_hat):
        y_d_rs = []
        y_d_gs = []
        fmap_rs = []
        fmap_gs = []
        for i, d in enumerate(self.discriminators):
            y_d_r, fmap_r = d(y)
            y_d_g, fmap_g = d(y_hat)
            y_d_rs.append(y_d_r)
            fmap_rs.append(fmap_r)
            y_d_gs.append(y_d_g)
            fmap_gs.append(fmap_g)

        return y_d_rs, y_d_gs, fmap_rs, fmap_gs


class DiscriminatorS(torch.nn.Module):
    def __init__(self, use_spectral_norm=False):
        super(DiscriminatorS, self).__init__()
        norm_f = weight_norm if use_spectral_norm == False else spectral_norm
        self.dwt1d = DWT_1D()
        self.dwt_conv1 = norm_f(Conv1d(2, 128, 15, 1, padding=7))
        self.dwt_conv2 = norm_f(Conv1d(4, 128, 41, 2, padding=20))
        self.convs = nn.ModuleList([
            norm_f(Conv1d(1, 128, 15, 1, padding=7)),
            norm_f(Conv1d(128, 128, 41, 2, groups=4, padding=20)),
            norm_f(Conv1d(128, 256, 41, 2, groups=16, padding=20)),
            norm_f(Conv1d(256, 512, 41, 4, groups=16, padding=20)),
            norm_f(Conv1d(512, 1024, 41, 4, groups=16, padding=20)),
            norm_f(Conv1d(1024, 1024, 41, 1, groups=16, padding=20)),
            norm_f(Conv1d(1024, 1024, 5, 1, padding=2)),
        ])
        self.conv_post = norm_f(Conv1d(1024, 1, 3, 1, padding=1))

    def forward(self, x):
        fmap = []

        # DWT 1
        x_d1_high1, x_d1_low1 = self.dwt1d(x)
        x_d1 = self.dwt_conv1(torch.cat([x_d1_high1, x_d1_low1], dim=1))

        # DWT 2
        x_d2_high1, x_d2_low1 = self.dwt1d(x_d1_high1)
        x_d2_high2, x_d2_low2 = self.dwt1d(x_d1_low1)
        x_d2 = self.dwt_conv2(torch.cat([x_d2_high1, x_d2_low1, x_d2_high2, x_d2_low2], dim=1))

        i = 0
        for l in self.convs:
            x = l(x)
            x = F.leaky_relu(x, LRELU_SLOPE)
            fmap.append(x)
            if i == 0:
                x = torch.cat([x, x_d1], dim=2)
            if i == 1:
                x = torch.cat([x, x_d2], dim=2)
            i = i + 1
        x = self.conv_post(x)
        fmap.append(x)
        x = torch.flatten(x, 1, -1)

        return x, fmap


class ResWiseMultiScaleDiscriminator(torch.nn.Module):
    def __init__(self, use_spectral_norm=False):
        super(ResWiseMultiScaleDiscriminator, self).__init__()
        norm_f = weight_norm if use_spectral_norm == False else spectral_norm
        self.dwt1d = DWT_1D()
        self.dwt_conv1 = norm_f(Conv1d(2, 1, 1))
        self.dwt_conv2 = norm_f(Conv1d(4, 1, 1))
        self.discriminators = nn.ModuleList([
            DiscriminatorS(use_spectral_norm=True),
            DiscriminatorS(),
            DiscriminatorS(),
        ])

    def forward(self, y, y_hat):
        y_d_rs = []
        y_d_gs = []
        fmap_rs = []
        fmap_gs = []
        # DWT 1
        y_hi, y_lo = self.dwt1d(y)
        y_1 = self.dwt_conv1(torch.cat([y_hi, y_lo], dim=1))
        x_d1_high1, x_d1_low1 = self.dwt1d(y_hat)
        y_hat_1 = self.dwt_conv1(torch.cat([x_d1_high1, x_d1_low1], dim=1))

        # DWT 2
        x_d2_high1, x_d2_low1 = self.dwt1d(y_hi)
        x_d2_high2, x_d2_low2 = self.dwt1d(y_lo)
        y_2 = self.dwt_conv2(torch.cat([x_d2_high1, x_d2_low1, x_d2_high2, x_d2_low2], dim=1))

        x_d2_high1, x_d2_low1 = self.dwt1d(x_d1_high1)
        x_d2_high2, x_d2_low2 = self.dwt1d(x_d1_low1)
        y_hat_2 = self.dwt_conv2(torch.cat([x_d2_high1, x_d2_low1, x_d2_high2, x_d2_low2], dim=1))

        for i, d in enumerate(self.discriminators):

            if i == 1:
                y = y_1
                y_hat = y_hat_1
            if i == 2:
                y = y_2
                y_hat = y_hat_2

            y_d_r, fmap_r = d(y)
            y_d_g, fmap_g = d(y_hat)
            y_d_rs.append(y_d_r)
            fmap_rs.append(fmap_r)
            y_d_gs.append(y_d_g)
            fmap_gs.append(fmap_g)

        return y_d_rs, y_d_gs, fmap_rs, fmap_gs
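The "1d to 2d" blocks repeated in `DiscriminatorP` above fold a waveform into a 2-D grid so that every row holds `period` consecutive samples, letting 2-D convolutions compare samples spaced one period apart. The reshape logic in plain Python (an illustrative sketch, not part of the diff; it uses simple edge-repetition padding instead of torch's `"reflect"` mode):

```python
def fold_by_period(samples, period):
    """Pad to a multiple of `period`, then fold into rows of length `period`.

    Mirrors the repeated pad-and-view block in DiscriminatorP: after folding,
    column j contains the samples at positions j, j+period, j+2*period, ...
    """
    if len(samples) % period != 0:
        n_pad = period - (len(samples) % period)
        samples = samples + [samples[-1]] * n_pad  # simplified padding
    return [samples[i:i + period] for i in range(0, len(samples), period)]

rows = fold_by_period([0, 1, 2, 3, 4, 5, 6], period=3)
print(rows)  # [[0, 1, 2], [3, 4, 5], [6, 6, 6]]
```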
@@ -1,76 +0,0 @@
# Copyright (c) 2019, Adobe Inc. All rights reserved.
#
# This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike
# 4.0 International Public License. To view a copy of this license, visit
# https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode.

# DWT code borrow from https://github.com/LiQiufu/WaveSNet/blob/12cb9d24208c3d26917bf953618c30f0c6b0f03d/DWT_IDWT/DWT_IDWT_layer.py


import pywt
import torch
import torch.nn as nn
import torch.nn.functional as F

__all__ = ['DWT_1D']
Pad_Mode = ['constant', 'reflect', 'replicate', 'circular']


class DWT_1D(nn.Module):
    def __init__(self, pad_type='reflect', wavename='haar',
                 stride=2, in_channels=1, out_channels=None, groups=None,
                 kernel_size=None, trainable=False):

        super(DWT_1D, self).__init__()
        self.trainable = trainable
        self.kernel_size = kernel_size
        if not self.trainable:
            assert self.kernel_size == None
        self.in_channels = in_channels
        self.out_channels = self.in_channels if out_channels == None else out_channels
        self.groups = self.in_channels if groups == None else groups
        assert isinstance(self.groups, int) and self.in_channels % self.groups == 0
        self.stride = stride
        assert self.stride == 2
        self.wavename = wavename
        self.pad_type = pad_type
        assert self.pad_type in Pad_Mode
        self.get_filters()
        self.initialization()

    def get_filters(self):
        wavelet = pywt.Wavelet(self.wavename)
        band_low = torch.tensor(wavelet.rec_lo)
        band_high = torch.tensor(wavelet.rec_hi)
        length_band = band_low.size()[0]
        self.kernel_size = length_band if self.kernel_size == None else self.kernel_size
        assert self.kernel_size >= length_band
        a = (self.kernel_size - length_band) // 2
        b = - (self.kernel_size - length_band - a)
        b = None if b == 0 else b
        self.filt_low = torch.zeros(self.kernel_size)
        self.filt_high = torch.zeros(self.kernel_size)
        self.filt_low[a:b] = band_low
        self.filt_high[a:b] = band_high

    def initialization(self):
        self.filter_low = self.filt_low[None, None, :].repeat((self.out_channels, self.in_channels // self.groups, 1))
        self.filter_high = self.filt_high[None, None, :].repeat((self.out_channels, self.in_channels // self.groups, 1))
        if torch.cuda.is_available():
            self.filter_low = self.filter_low.cuda()
            self.filter_high = self.filter_high.cuda()
        if self.trainable:
            self.filter_low = nn.Parameter(self.filter_low)
            self.filter_high = nn.Parameter(self.filter_high)
        if self.kernel_size % 2 == 0:
            self.pad_sizes = [self.kernel_size // 2 - 1, self.kernel_size // 2 - 1]
        else:
            self.pad_sizes = [self.kernel_size // 2, self.kernel_size // 2]

    def forward(self, input):
        assert isinstance(input, torch.Tensor)
        assert len(input.size()) == 3
        assert input.size()[1] == self.in_channels
        input = F.pad(input, pad=self.pad_sizes, mode=self.pad_type)
        return F.conv1d(input, self.filter_low.to(input.device), stride=self.stride, groups=self.groups), \
               F.conv1d(input, self.filter_high.to(input.device), stride=self.stride, groups=self.groups)
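For orientation (an illustrative sketch, not part of the diff): `DWT_1D` with the default `haar` wavelet splits a signal into a low-pass (local average) and a high-pass (local difference) band at half the original length, which is what lets FreGAN's discriminators look at progressively downsampled sub-bands instead of only the raw waveform. The same decomposition in plain Python, without `torch`/`pywt`:

```python
import math

def haar_dwt_1d(signal):
    """One level of the Haar DWT: returns (low, high) bands at half length.

    low[k]  = (x[2k] + x[2k+1]) / sqrt(2)   -- local average (approximation)
    high[k] = (x[2k] - x[2k+1]) / sqrt(2)   -- local difference (detail)
    """
    assert len(signal) % 2 == 0, "even length expected (the module pads instead)"
    s = math.sqrt(2.0)
    low = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return low, high

low, high = haar_dwt_1d([1.0, 1.0, 2.0, 4.0])
print(low)   # ~ [1.4142, 4.2426] -> (1+1)/sqrt2, (2+4)/sqrt2
print(high)  # ~ [0.0, -1.4142]   -> (1-1)/sqrt2, (2-4)/sqrt2
```

The 1/sqrt(2) scaling makes the transform orthonormal, so signal energy is preserved across the two bands.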
@@ -1,210 +0,0 @@
import torch
import torch.nn.functional as F
import torch.nn as nn
from torch.nn import Conv1d, ConvTranspose1d, AvgPool1d, Conv2d
from torch.nn.utils import weight_norm, remove_weight_norm, spectral_norm
from vocoder.fregan.utils import init_weights, get_padding

LRELU_SLOPE = 0.1


class ResBlock1(torch.nn.Module):
    def __init__(self, h, channels, kernel_size=3, dilation=(1, 3, 5, 7)):
        super(ResBlock1, self).__init__()
        self.h = h
        self.convs1 = nn.ModuleList([
            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[0],
                               padding=get_padding(kernel_size, dilation[0]))),
            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[1],
                               padding=get_padding(kernel_size, dilation[1]))),
            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[2],
                               padding=get_padding(kernel_size, dilation[2]))),
            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[3],
                               padding=get_padding(kernel_size, dilation[3])))
        ])
        self.convs1.apply(init_weights)

        self.convs2 = nn.ModuleList([
            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
                               padding=get_padding(kernel_size, 1))),
            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
                               padding=get_padding(kernel_size, 1))),
            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
                               padding=get_padding(kernel_size, 1))),
            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
                               padding=get_padding(kernel_size, 1)))
        ])
        self.convs2.apply(init_weights)

    def forward(self, x):
        for c1, c2 in zip(self.convs1, self.convs2):
            xt = F.leaky_relu(x, LRELU_SLOPE)
            xt = c1(xt)
            xt = F.leaky_relu(xt, LRELU_SLOPE)
            xt = c2(xt)
            x = xt + x
        return x

    def remove_weight_norm(self):
        for l in self.convs1:
            remove_weight_norm(l)
        for l in self.convs2:
            remove_weight_norm(l)


class ResBlock2(torch.nn.Module):
    def __init__(self, h, channels, kernel_size=3, dilation=(1, 3)):
        super(ResBlock2, self).__init__()
        self.h = h
        self.convs = nn.ModuleList([
            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[0],
                               padding=get_padding(kernel_size, dilation[0]))),
            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[1],
                               padding=get_padding(kernel_size, dilation[1])))
        ])
        self.convs.apply(init_weights)

    def forward(self, x):
        for c in self.convs:
            xt = F.leaky_relu(x, LRELU_SLOPE)
            xt = c(xt)
            x = xt + x
        return x

    def remove_weight_norm(self):
        for l in self.convs:
            remove_weight_norm(l)


class FreGAN(torch.nn.Module):
    def __init__(self, h, top_k=4):
        super(FreGAN, self).__init__()
        self.h = h

        self.num_kernels = len(h.resblock_kernel_sizes)
        self.num_upsamples = len(h.upsample_rates)
        self.upsample_rates = h.upsample_rates
        self.up_kernels = h.upsample_kernel_sizes
        self.cond_level = self.num_upsamples - top_k
        self.conv_pre = weight_norm(Conv1d(80, h.upsample_initial_channel, 7, 1, padding=3))
        resblock = ResBlock1 if h.resblock == '1' else ResBlock2

        self.ups = nn.ModuleList()
        self.cond_up = nn.ModuleList()
        self.res_output = nn.ModuleList()
        upsample_ = 1
        kr = 80

        for i, (u, k) in enumerate(zip(self.upsample_rates, self.up_kernels)):
            # self.ups.append(weight_norm(
            #     ConvTranspose1d(h.upsample_initial_channel // (2 ** i), h.upsample_initial_channel // (2 ** (i + 1)),
            #                     k, u, padding=(k - u) // 2)))
            self.ups.append(weight_norm(ConvTranspose1d(h.upsample_initial_channel//(2**i),
                                                        h.upsample_initial_channel//(2**(i+1)),
                                                        k, u, padding=(u//2 + u%2), output_padding=u%2)))

            if i > (self.num_upsamples - top_k):
                self.res_output.append(
                    nn.Sequential(
                        nn.Upsample(scale_factor=u, mode='nearest'),
                        weight_norm(nn.Conv1d(h.upsample_initial_channel // (2 ** i),
                                              h.upsample_initial_channel // (2 ** (i + 1)), 1))
                    )
                )
            if i >= (self.num_upsamples - top_k):
                self.cond_up.append(
                    weight_norm(
                        ConvTranspose1d(kr, h.upsample_initial_channel // (2 ** i),
                                        self.up_kernels[i - 1], self.upsample_rates[i - 1],
                                        padding=(self.upsample_rates[i-1]//2+self.upsample_rates[i-1]%2), output_padding=self.upsample_rates[i-1]%2))
                )
                kr = h.upsample_initial_channel // (2 ** i)

            upsample_ *= u

        self.resblocks = nn.ModuleList()
        for i in range(len(self.ups)):
            ch = h.upsample_initial_channel // (2 ** (i + 1))
            for j, (k, d) in enumerate(zip(h.resblock_kernel_sizes, h.resblock_dilation_sizes)):
                self.resblocks.append(resblock(h, ch, k, d))

        self.conv_post = weight_norm(Conv1d(ch, 1, 7, 1, padding=3))
        self.ups.apply(init_weights)
        self.conv_post.apply(init_weights)
        self.cond_up.apply(init_weights)
        self.res_output.apply(init_weights)

    def forward(self, x):
        mel = x
        x = self.conv_pre(x)
        output = None
        for i in range(self.num_upsamples):
            if i >= self.cond_level:
                mel = self.cond_up[i - self.cond_level](mel)
                x += mel
            if i > self.cond_level:
                if output is None:
                    output = self.res_output[i - self.cond_level - 1](x)
                else:
                    output = self.res_output[i - self.cond_level - 1](output)
            x = F.leaky_relu(x, LRELU_SLOPE)
            x = self.ups[i](x)
            xs = None
            for j in range(self.num_kernels):
                if xs is None:
                    xs = self.resblocks[i * self.num_kernels + j](x)
                else:
                    xs += self.resblocks[i * self.num_kernels + j](x)
            x = xs / self.num_kernels
            if output is not None:
                output = output + x

        x = F.leaky_relu(output)
        x = self.conv_post(x)
        x = torch.tanh(x)

        return x

    def remove_weight_norm(self):
        print('Removing weight norm...')
        for l in self.ups:
            remove_weight_norm(l)
        for l in self.resblocks:
            l.remove_weight_norm()
        for l in self.cond_up:
            remove_weight_norm(l)
        for l in self.res_output:
            remove_weight_norm(l[1])
        remove_weight_norm(self.conv_pre)
        remove_weight_norm(self.conv_post)


'''
    to run this, fix
    from . import ResStack
    into
    from res_stack import ResStack
'''
if __name__ == '__main__':
    '''
    torch.Size([3, 80, 10])
    torch.Size([3, 1, 2000])
    4527362
    '''
    with open('config.json') as f:
        data = f.read()
    from utils import AttrDict
    import json
    json_config = json.loads(data)
    h = AttrDict(json_config)
    model = FreGAN(h)

    c = torch.randn(3, 80, 10)  # (B, channels, T).
    print(c.shape)

    y = model(c)  # (B, 1, T ** prod(upsample_scales)
    print(y.shape)
    assert y.shape == torch.Size([3, 1, 2560])  # For normal melgan torch.Size([3, 1, 2560])

    pytorch_total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(pytorch_total_params)
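A consistency note on the `__main__` block above (an illustrative check, not part of the diff): with `upsample_rates = [5, 5, 2, 2, 2]` from `config.json`, 10 input mel frames upsample to 10 × 200 = 2000 samples, matching the `torch.Size([3, 1, 2000])` in the docstring; the `2560` in the assert corresponds to a MelGAN-style 256× configuration and would not hold here. The length arithmetic in plain Python (the `[8, 8, 2, 2]` rates are an assumed 256× example, not taken from this repo):

```python
from functools import reduce
from operator import mul

def output_length(num_frames, upsample_rates):
    """Waveform samples produced by a frame-rate generator: T * prod(rates)."""
    return num_frames * reduce(mul, upsample_rates, 1)

print(output_length(10, [5, 5, 2, 2, 2]))  # 2000 (FreGAN config above)
print(output_length(10, [8, 8, 2, 2]))     # 2560 (a 256x MelGAN-style setup)
```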
@@ -1,74 +0,0 @@
from __future__ import absolute_import, division, print_function, unicode_literals

import os
import json
import torch
from utils.util import AttrDict
from vocoder.fregan.generator import FreGAN

generator = None  # type: FreGAN
output_sample_rate = None
_device = None


def load_checkpoint(filepath, device):
    assert os.path.isfile(filepath)
    print("Loading '{}'".format(filepath))
    checkpoint_dict = torch.load(filepath, map_location=device)
    print("Complete.")
    return checkpoint_dict


def load_model(weights_fpath, config_fpath=None, verbose=True):
    global generator, _device, output_sample_rate

    if verbose:
        print("Building fregan")

    if config_fpath == None:
        model_config_fpaths = list(weights_fpath.parent.rglob("*.json"))
        if len(model_config_fpaths) > 0:
            config_fpath = model_config_fpaths[0]
        else:
            config_fpath = "./vocoder/fregan/config.json"
    with open(config_fpath) as f:
        data = f.read()
    json_config = json.loads(data)
    h = AttrDict(json_config)
    output_sample_rate = h.sampling_rate
    torch.manual_seed(h.seed)

    if torch.cuda.is_available():
        # _model = _model.cuda()
        _device = torch.device('cuda')
    else:
        _device = torch.device('cpu')

    generator = FreGAN(h).to(_device)
    state_dict_g = load_checkpoint(
        weights_fpath, _device
    )
    generator.load_state_dict(state_dict_g['generator'])
    generator.eval()
    generator.remove_weight_norm()


def is_loaded():
    return generator is not None


def infer_waveform(mel, progress_callback=None):

    if generator is None:
        raise Exception("Please load fre-gan in memory before using it")

    mel = torch.FloatTensor(mel).to(_device)
    mel = mel.unsqueeze(0)

    with torch.no_grad():
        y_g_hat = generator(mel)
        audio = y_g_hat.squeeze()
        audio = audio.cpu().numpy()

    return audio, output_sample_rate

@@ -1,35 +0,0 @@
import torch


def feature_loss(fmap_r, fmap_g):
    loss = 0
    for dr, dg in zip(fmap_r, fmap_g):
        for rl, gl in zip(dr, dg):
            loss += torch.mean(torch.abs(rl - gl))

    return loss*2


def discriminator_loss(disc_real_outputs, disc_generated_outputs):
    loss = 0
    r_losses = []
    g_losses = []
    for dr, dg in zip(disc_real_outputs, disc_generated_outputs):
        r_loss = torch.mean((1-dr)**2)
        g_loss = torch.mean(dg**2)
        loss += (r_loss + g_loss)
        r_losses.append(r_loss.item())
        g_losses.append(g_loss.item())

    return loss, r_losses, g_losses


def generator_loss(disc_outputs):
    loss = 0
    gen_losses = []
    for dg in disc_outputs:
        l = torch.mean((1-dg)**2)
        gen_losses.append(l)
        loss += l

    return loss, gen_losses
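For readers unfamiliar with the terms above (an illustrative sketch, not part of the diff): `discriminator_loss` and `generator_loss` are least-squares GAN (LSGAN) objectives — the discriminator pushes real outputs toward 1 and generated outputs toward 0, while the generator pushes generated outputs toward 1. The same arithmetic on plain floats, one discriminator at a time:

```python
def lsgan_disc_loss(real_scores, fake_scores):
    """Least-squares discriminator loss: mean (1 - D(x))^2 + mean D(G(z))^2."""
    r = sum((1.0 - s) ** 2 for s in real_scores) / len(real_scores)
    g = sum(s ** 2 for s in fake_scores) / len(fake_scores)
    return r + g

def lsgan_gen_loss(fake_scores):
    """Least-squares generator loss: mean (1 - D(G(z)))^2."""
    return sum((1.0 - s) ** 2 for s in fake_scores) / len(fake_scores)

# A perfect discriminator (real -> 1, fake -> 0) has zero loss...
print(lsgan_disc_loss([1.0, 1.0], [0.0, 0.0]))  # 0.0
# ...and at that point the generator pays the maximum penalty.
print(lsgan_gen_loss([0.0, 0.0]))               # 1.0
```

`feature_loss` then adds an L1 match between real and generated discriminator feature maps, which stabilizes training beyond the adversarial term alone.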
@@ -1,176 +0,0 @@
import math
import os
import random
import torch
import torch.utils.data
import numpy as np
from librosa.util import normalize
from scipy.io.wavfile import read
from librosa.filters import mel as librosa_mel_fn

MAX_WAV_VALUE = 32768.0


def load_wav(full_path):
    sampling_rate, data = read(full_path)
    return data, sampling_rate


def dynamic_range_compression(x, C=1, clip_val=1e-5):
    return np.log(np.clip(x, a_min=clip_val, a_max=None) * C)


def dynamic_range_decompression(x, C=1):
    return np.exp(x) / C


def dynamic_range_compression_torch(x, C=1, clip_val=1e-5):
    return torch.log(torch.clamp(x, min=clip_val) * C)


def dynamic_range_decompression_torch(x, C=1):
    return torch.exp(x) / C


def spectral_normalize_torch(magnitudes):
    output = dynamic_range_compression_torch(magnitudes)
    return output


def spectral_de_normalize_torch(magnitudes):
    output = dynamic_range_decompression_torch(magnitudes)
    return output


mel_basis = {}
hann_window = {}


def mel_spectrogram(y, n_fft, num_mels, sampling_rate, hop_size, win_size, fmin, fmax, center=False):
    if torch.min(y) < -1.:
        print('min value is ', torch.min(y))
    if torch.max(y) > 1.:
        print('max value is ', torch.max(y))

    global mel_basis, hann_window
    # Cache the mel filterbank and window per (fmax, device). The original
    # checked `fmax not in mel_basis`, which never matches the composite
    # string keys and recomputed the filterbank on every call.
    if str(fmax) + '_' + str(y.device) not in mel_basis:
        mel = librosa_mel_fn(sampling_rate, n_fft, num_mels, fmin, fmax)
        mel_basis[str(fmax)+'_'+str(y.device)] = torch.from_numpy(mel).float().to(y.device)
        hann_window[str(y.device)] = torch.hann_window(win_size).to(y.device)

    y = torch.nn.functional.pad(y.unsqueeze(1), (int((n_fft-hop_size)/2), int((n_fft-hop_size)/2)), mode='reflect')
    y = y.squeeze(1)

    spec = torch.stft(y, n_fft, hop_length=hop_size, win_length=win_size, window=hann_window[str(y.device)],
                      center=center, pad_mode='reflect', normalized=False, onesided=True)

    spec = torch.sqrt(spec.pow(2).sum(-1)+(1e-9))

    spec = torch.matmul(mel_basis[str(fmax)+'_'+str(y.device)], spec)
    spec = spectral_normalize_torch(spec)

    return spec


def get_dataset_filelist(a):
    #with open(a.input_training_file, 'r', encoding='utf-8') as fi:
    #    training_files = [os.path.join(a.input_wavs_dir, x.split('|')[0] + '.wav')
    #                      for x in fi.read().split('\n') if len(x) > 0]

    #with open(a.input_validation_file, 'r', encoding='utf-8') as fi:
    #    validation_files = [os.path.join(a.input_wavs_dir, x.split('|')[0] + '.wav')
    #                        for x in fi.read().split('\n') if len(x) > 0]
    files = os.listdir(a.input_wavs_dir)
    random.shuffle(files)
    files = [os.path.join(a.input_wavs_dir, f) for f in files]
    training_files = files[: -int(len(files) * 0.05)]
    validation_files = files[-int(len(files) * 0.05):]
    return training_files, validation_files


class MelDataset(torch.utils.data.Dataset):
    def __init__(self, training_files, segment_size, n_fft, num_mels,
                 hop_size, win_size, sampling_rate, fmin, fmax, split=True, shuffle=True, n_cache_reuse=1,
                 device=None, fmax_loss=None, fine_tuning=False, base_mels_path=None):
        self.audio_files = training_files
        random.seed(1234)
        if shuffle:
            random.shuffle(self.audio_files)
        self.segment_size = segment_size
        self.sampling_rate = sampling_rate
        self.split = split
        self.n_fft = n_fft
        self.num_mels = num_mels
        self.hop_size = hop_size
        self.win_size = win_size
        self.fmin = fmin
        self.fmax = fmax
        self.fmax_loss = fmax_loss
        self.cached_wav = None
        self.n_cache_reuse = n_cache_reuse
        self._cache_ref_count = 0
        self.device = device
        self.fine_tuning = fine_tuning
        self.base_mels_path = base_mels_path

    def __getitem__(self, index):
        filename = self.audio_files[index]
        if self._cache_ref_count == 0:
            #audio, sampling_rate = load_wav(filename)
            #audio = audio / MAX_WAV_VALUE
            audio = np.load(filename)
            if not self.fine_tuning:
                audio = normalize(audio) * 0.95
            self.cached_wav = audio
            #if sampling_rate != self.sampling_rate:
            #    raise ValueError("{} SR doesn't match target {} SR".format(
            #        sampling_rate, self.sampling_rate))
            self._cache_ref_count = self.n_cache_reuse
        else:
            audio = self.cached_wav
            self._cache_ref_count -= 1

        audio = torch.FloatTensor(audio)
        audio = audio.unsqueeze(0)

        if not self.fine_tuning:
            if self.split:
                if audio.size(1) >= self.segment_size:
                    max_audio_start = audio.size(1) - self.segment_size
                    audio_start = random.randint(0, max_audio_start)
                    audio = audio[:, audio_start:audio_start+self.segment_size]
                else:
                    audio = torch.nn.functional.pad(audio, (0, self.segment_size - audio.size(1)), 'constant')

            mel = mel_spectrogram(audio, self.n_fft, self.num_mels,
                                  self.sampling_rate, self.hop_size, self.win_size, self.fmin, self.fmax,
                                  center=False)
        else:
            mel_path = os.path.join(self.base_mels_path, "mel" + "-" + filename.split("/")[-1].split("-")[-1])
            mel = np.load(mel_path).T
            #mel = np.load(
            #    os.path.join(self.base_mels_path, os.path.splitext(os.path.split(filename)[-1])[0] + '.npy'))
            mel = torch.from_numpy(mel)

            if len(mel.shape) < 3:
                mel = mel.unsqueeze(0)

            if self.split:
                frames_per_seg = math.ceil(self.segment_size / self.hop_size)

                if audio.size(1) >= self.segment_size:
                    mel_start = random.randint(0, mel.size(2) - frames_per_seg - 1)
                    mel = mel[:, :, mel_start:mel_start + frames_per_seg]
                    audio = audio[:, mel_start * self.hop_size:(mel_start + frames_per_seg) * self.hop_size]
                else:
                    mel = torch.nn.functional.pad(mel, (0, frames_per_seg - mel.size(2)), 'constant')
                    audio = torch.nn.functional.pad(audio, (0, self.segment_size - audio.size(1)), 'constant')

        mel_loss = mel_spectrogram(audio, self.n_fft, self.num_mels,
                                   self.sampling_rate, self.hop_size, self.win_size, self.fmin, self.fmax_loss,
                                   center=False)

        return (mel.squeeze(), audio.squeeze(0), filename, mel_loss.squeeze())

    def __len__(self):
        return len(self.audio_files)
@@ -1,201 +0,0 @@
import torch
import torch.nn.functional as F

class KernelPredictor(torch.nn.Module):
    ''' Kernel predictor for the location-variable convolutions
    '''

    def __init__(self,
                 cond_channels,
                 conv_in_channels,
                 conv_out_channels,
                 conv_layers,
                 conv_kernel_size=3,
                 kpnet_hidden_channels=64,
                 kpnet_conv_size=3,
                 kpnet_dropout=0.0,
                 kpnet_nonlinear_activation="LeakyReLU",
                 kpnet_nonlinear_activation_params={"negative_slope": 0.1}
                 ):
        '''
        Args:
            cond_channels (int): number of channels for the conditioning sequence,
            conv_in_channels (int): number of channels for the input sequence,
            conv_out_channels (int): number of channels for the output sequence,
            conv_layers (int):
            kpnet_
        '''
        super().__init__()

        self.conv_in_channels = conv_in_channels
        self.conv_out_channels = conv_out_channels
        self.conv_kernel_size = conv_kernel_size
        self.conv_layers = conv_layers

        l_w = conv_in_channels * conv_out_channels * conv_kernel_size * conv_layers
        l_b = conv_out_channels * conv_layers

        padding = (kpnet_conv_size - 1) // 2
        self.input_conv = torch.nn.Sequential(
            torch.nn.Conv1d(cond_channels, kpnet_hidden_channels, 5, padding=(5 - 1) // 2, bias=True),
            getattr(torch.nn, kpnet_nonlinear_activation)(**kpnet_nonlinear_activation_params),
        )

        self.residual_conv = torch.nn.Sequential(
            torch.nn.Dropout(kpnet_dropout),
            torch.nn.Conv1d(kpnet_hidden_channels, kpnet_hidden_channels, kpnet_conv_size, padding=padding, bias=True),
            getattr(torch.nn, kpnet_nonlinear_activation)(**kpnet_nonlinear_activation_params),
            torch.nn.Conv1d(kpnet_hidden_channels, kpnet_hidden_channels, kpnet_conv_size, padding=padding, bias=True),
            getattr(torch.nn, kpnet_nonlinear_activation)(**kpnet_nonlinear_activation_params),
            torch.nn.Dropout(kpnet_dropout),
            torch.nn.Conv1d(kpnet_hidden_channels, kpnet_hidden_channels, kpnet_conv_size, padding=padding, bias=True),
            getattr(torch.nn, kpnet_nonlinear_activation)(**kpnet_nonlinear_activation_params),
            torch.nn.Conv1d(kpnet_hidden_channels, kpnet_hidden_channels, kpnet_conv_size, padding=padding, bias=True),
            getattr(torch.nn, kpnet_nonlinear_activation)(**kpnet_nonlinear_activation_params),
            torch.nn.Dropout(kpnet_dropout),
            torch.nn.Conv1d(kpnet_hidden_channels, kpnet_hidden_channels, kpnet_conv_size, padding=padding, bias=True),
            getattr(torch.nn, kpnet_nonlinear_activation)(**kpnet_nonlinear_activation_params),
            torch.nn.Conv1d(kpnet_hidden_channels, kpnet_hidden_channels, kpnet_conv_size, padding=padding, bias=True),
            getattr(torch.nn, kpnet_nonlinear_activation)(**kpnet_nonlinear_activation_params),
        )

        self.kernel_conv = torch.nn.Conv1d(kpnet_hidden_channels, l_w, kpnet_conv_size,
                                           padding=padding, bias=True)
        self.bias_conv = torch.nn.Conv1d(kpnet_hidden_channels, l_b, kpnet_conv_size, padding=padding,
                                         bias=True)

    def forward(self, c):
        '''
        Args:
            c (Tensor): the conditioning sequence (batch, cond_channels, cond_length)
        Returns:
        '''
        batch, cond_channels, cond_length = c.shape

        c = self.input_conv(c)
        c = c + self.residual_conv(c)
        k = self.kernel_conv(c)
        b = self.bias_conv(c)

        kernels = k.contiguous().view(batch,
                                      self.conv_layers,
                                      self.conv_in_channels,
                                      self.conv_out_channels,
                                      self.conv_kernel_size,
                                      cond_length)
        bias = b.contiguous().view(batch,
                                   self.conv_layers,
                                   self.conv_out_channels,
                                   cond_length)
        return kernels, bias


class LVCBlock(torch.nn.Module):
    ''' the location-variable convolutions
    '''

    def __init__(self,
                 in_channels,
                 cond_channels,
                 upsample_ratio,
                 conv_layers=4,
                 conv_kernel_size=3,
                 cond_hop_length=256,
                 kpnet_hidden_channels=64,
                 kpnet_conv_size=3,
                 kpnet_dropout=0.0
                 ):
        super().__init__()

        self.cond_hop_length = cond_hop_length
        self.conv_layers = conv_layers
        self.conv_kernel_size = conv_kernel_size
        self.convs = torch.nn.ModuleList()

        self.upsample = torch.nn.ConvTranspose1d(in_channels, in_channels,
                                                 kernel_size=upsample_ratio*2, stride=upsample_ratio,
                                                 padding=upsample_ratio // 2 + upsample_ratio % 2,
                                                 output_padding=upsample_ratio % 2)

        self.kernel_predictor = KernelPredictor(
            cond_channels=cond_channels,
            conv_in_channels=in_channels,
            conv_out_channels=2 * in_channels,
            conv_layers=conv_layers,
            conv_kernel_size=conv_kernel_size,
            kpnet_hidden_channels=kpnet_hidden_channels,
            kpnet_conv_size=kpnet_conv_size,
            kpnet_dropout=kpnet_dropout
        )

        for i in range(conv_layers):
            padding = (3 ** i) * int((conv_kernel_size - 1) / 2)
            conv = torch.nn.Conv1d(in_channels, in_channels, kernel_size=conv_kernel_size, padding=padding, dilation=3 ** i)

            self.convs.append(conv)

    def forward(self, x, c):
        ''' forward propagation of the location-variable convolutions.
        Args:
            x (Tensor): the input sequence (batch, in_channels, in_length)
            c (Tensor): the conditioning sequence (batch, cond_channels, cond_length)

        Returns:
            Tensor: the output sequence (batch, in_channels, in_length)
        '''
        batch, in_channels, in_length = x.shape

        kernels, bias = self.kernel_predictor(c)

        x = F.leaky_relu(x, 0.2)
        x = self.upsample(x)

        for i in range(self.conv_layers):
            y = F.leaky_relu(x, 0.2)
            y = self.convs[i](y)
            y = F.leaky_relu(y, 0.2)

            k = kernels[:, i, :, :, :, :]
            b = bias[:, i, :, :]
            y = self.location_variable_convolution(y, k, b, 1, self.cond_hop_length)
            x = x + torch.sigmoid(y[:, :in_channels, :]) * torch.tanh(y[:, in_channels:, :])
        return x

    def location_variable_convolution(self, x, kernel, bias, dilation, hop_size):
        ''' perform location-variable convolution operation on the input sequence (x) using the local convolution kernel.
        Time: 414 μs ± 309 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each), tested on an NVIDIA V100.
        Args:
            x (Tensor): the input sequence (batch, in_channels, in_length).
            kernel (Tensor): the local convolution kernel (batch, in_channels, out_channels, kernel_size, kernel_length)
            bias (Tensor): the bias for the local convolution (batch, out_channels, kernel_length)
            dilation (int): the dilation of convolution.
            hop_size (int): the hop_size of the conditioning sequence.
        Returns:
            (Tensor): the output sequence after performing local convolution. (batch, out_channels, in_length).
        '''
        batch, in_channels, in_length = x.shape
        batch, in_channels, out_channels, kernel_size, kernel_length = kernel.shape

        assert in_length == (kernel_length * hop_size), "length of (x, kernel) is not matched"

        padding = dilation * int((kernel_size - 1) / 2)
        x = F.pad(x, (padding, padding), 'constant', 0)  # (batch, in_channels, in_length + 2*padding)
        x = x.unfold(2, hop_size + 2 * padding, hop_size)  # (batch, in_channels, kernel_length, hop_size + 2*padding)

        if hop_size < dilation:
            x = F.pad(x, (0, dilation), 'constant', 0)
        x = x.unfold(3, dilation,
                     dilation)  # (batch, in_channels, kernel_length, (hop_size + 2*padding)/dilation, dilation)
        x = x[:, :, :, :, :hop_size]
        x = x.transpose(3, 4)  # (batch, in_channels, kernel_length, dilation, (hop_size + 2*padding)/dilation)
        x = x.unfold(4, kernel_size, 1)  # (batch, in_channels, kernel_length, dilation, _, kernel_size)

        o = torch.einsum('bildsk,biokl->bolsd', x, kernel)
        o = o + bias.unsqueeze(-1).unsqueeze(-1)
        o = o.contiguous().view(batch, out_channels, -1)
        return o
@@ -1,136 +0,0 @@
# -*- coding: utf-8 -*-

# Copyright 2019 Tomoki Hayashi
# MIT License (https://opensource.org/licenses/MIT)

"""STFT-based Loss modules."""

import torch
import torch.nn.functional as F


def stft(x, fft_size, hop_size, win_length, window):
    """Perform STFT and convert to magnitude spectrogram.
    Args:
        x (Tensor): Input signal tensor (B, T).
        fft_size (int): FFT size.
        hop_size (int): Hop size.
        win_length (int): Window length.
        window (str): Window function type.
    Returns:
        Tensor: Magnitude spectrogram (B, #frames, fft_size // 2 + 1).
    """
    x_stft = torch.stft(x, fft_size, hop_size, win_length, window)
    real = x_stft[..., 0]
    imag = x_stft[..., 1]

    # NOTE(kan-bayashi): clamp is needed to avoid nan or inf
    return torch.sqrt(torch.clamp(real ** 2 + imag ** 2, min=1e-7)).transpose(2, 1)


class SpectralConvergengeLoss(torch.nn.Module):
    """Spectral convergence loss module."""

    def __init__(self):
        """Initialize spectral convergence loss module."""
        super(SpectralConvergengeLoss, self).__init__()

    def forward(self, x_mag, y_mag):
        """Calculate forward propagation.
        Args:
            x_mag (Tensor): Magnitude spectrogram of predicted signal (B, #frames, #freq_bins).
            y_mag (Tensor): Magnitude spectrogram of groundtruth signal (B, #frames, #freq_bins).
        Returns:
            Tensor: Spectral convergence loss value.
        """
        return torch.norm(y_mag - x_mag, p="fro") / torch.norm(y_mag, p="fro")


class LogSTFTMagnitudeLoss(torch.nn.Module):
    """Log STFT magnitude loss module."""

    def __init__(self):
        """Initialize log STFT magnitude loss module."""
        super(LogSTFTMagnitudeLoss, self).__init__()

    def forward(self, x_mag, y_mag):
        """Calculate forward propagation.
        Args:
            x_mag (Tensor): Magnitude spectrogram of predicted signal (B, #frames, #freq_bins).
            y_mag (Tensor): Magnitude spectrogram of groundtruth signal (B, #frames, #freq_bins).
        Returns:
            Tensor: Log STFT magnitude loss value.
        """
        return F.l1_loss(torch.log(y_mag), torch.log(x_mag))


class STFTLoss(torch.nn.Module):
    """STFT loss module."""

    def __init__(self, fft_size=1024, shift_size=120, win_length=600, window="hann_window"):
        """Initialize STFT loss module."""
        super(STFTLoss, self).__init__()
        self.fft_size = fft_size
        self.shift_size = shift_size
        self.win_length = win_length
        self.window = getattr(torch, window)(win_length)
        self.spectral_convergenge_loss = SpectralConvergengeLoss()
        self.log_stft_magnitude_loss = LogSTFTMagnitudeLoss()

    def forward(self, x, y):
        """Calculate forward propagation.
        Args:
            x (Tensor): Predicted signal (B, T).
            y (Tensor): Groundtruth signal (B, T).
        Returns:
            Tensor: Spectral convergence loss value.
            Tensor: Log STFT magnitude loss value.
        """
        x_mag = stft(x, self.fft_size, self.shift_size, self.win_length, self.window.to(x.get_device()))
        y_mag = stft(y, self.fft_size, self.shift_size, self.win_length, self.window.to(x.get_device()))
        sc_loss = self.spectral_convergenge_loss(x_mag, y_mag)
        mag_loss = self.log_stft_magnitude_loss(x_mag, y_mag)

        return sc_loss, mag_loss


class MultiResolutionSTFTLoss(torch.nn.Module):
    """Multi resolution STFT loss module."""

    def __init__(self,
                 fft_sizes=[1024, 2048, 512],
                 hop_sizes=[120, 240, 50],
                 win_lengths=[600, 1200, 240],
                 window="hann_window"):
        """Initialize Multi resolution STFT loss module.
        Args:
            fft_sizes (list): List of FFT sizes.
            hop_sizes (list): List of hop sizes.
            win_lengths (list): List of window lengths.
            window (str): Window function type.
        """
        super(MultiResolutionSTFTLoss, self).__init__()
        assert len(fft_sizes) == len(hop_sizes) == len(win_lengths)
        self.stft_losses = torch.nn.ModuleList()
        for fs, ss, wl in zip(fft_sizes, hop_sizes, win_lengths):
            self.stft_losses += [STFTLoss(fs, ss, wl, window)]

    def forward(self, x, y):
        """Calculate forward propagation.
        Args:
            x (Tensor): Predicted signal (B, T).
            y (Tensor): Groundtruth signal (B, T).
        Returns:
            Tensor: Multi resolution spectral convergence loss value.
            Tensor: Multi resolution log STFT magnitude loss value.
        """
        sc_loss = 0.0
        mag_loss = 0.0
        for f in self.stft_losses:
            sc_l, mag_l = f(x, y)
            sc_loss += sc_l
            mag_loss += mag_l
        sc_loss /= len(self.stft_losses)
        mag_loss /= len(self.stft_losses)

        return sc_loss, mag_loss
@@ -1,246 +0,0 @@
|
|||||||
import warnings
|
|
||||||
|
|
||||||
warnings.simplefilter(action='ignore', category=FutureWarning)
|
|
||||||
import itertools
|
|
||||||
import os
|
|
||||||
import time
|
|
||||||
import torch
|
|
||||||
import torch.nn.functional as F
|
|
||||||
from torch.utils.tensorboard import SummaryWriter
|
|
||||||
from torch.utils.data import DistributedSampler, DataLoader
|
|
||||||
from torch.distributed import init_process_group
|
|
||||||
from torch.nn.parallel import DistributedDataParallel
|
|
||||||
from vocoder.fregan.meldataset import MelDataset, mel_spectrogram, get_dataset_filelist
|
|
||||||
from vocoder.fregan.generator import FreGAN
|
|
||||||
from vocoder.fregan.discriminator import ResWiseMultiPeriodDiscriminator, ResWiseMultiScaleDiscriminator
|
|
||||||
from vocoder.fregan.loss import feature_loss, generator_loss, discriminator_loss
|
|
||||||
from vocoder.fregan.utils import plot_spectrogram, scan_checkpoint, load_checkpoint, save_checkpoint
|
|
||||||
|
|
||||||
|
|
||||||
torch.backends.cudnn.benchmark = True
|
|
||||||
|
|
||||||
|
|
||||||
def train(rank, a, h):
|
|
||||||
|
|
||||||
a.checkpoint_path = a.models_dir.joinpath(a.run_id+'_fregan')
|
|
||||||
a.checkpoint_path.mkdir(exist_ok=True)
|
|
||||||
a.training_epochs = 3100
|
|
||||||
a.stdout_interval = 5
|
|
||||||
a.checkpoint_interval = a.backup_every
|
|
||||||
a.summary_interval = 5000
|
|
||||||
a.validation_interval = 1000
|
|
||||||
a.fine_tuning = True
|
|
||||||
|
|
||||||
a.input_wavs_dir = a.syn_dir.joinpath("audio")
|
|
||||||
a.input_mels_dir = a.syn_dir.joinpath("mels")
|
|
||||||
|
|
||||||
if h.num_gpus > 1:
|
|
||||||
init_process_group(backend=h.dist_config['dist_backend'], init_method=h.dist_config['dist_url'],
|
|
||||||
world_size=h.dist_config['world_size'] * h.num_gpus, rank=rank)
|
|
||||||
|
|
||||||
torch.cuda.manual_seed(h.seed)
|
|
||||||
device = torch.device('cuda:{:d}'.format(rank))
|
|
||||||
|
|
||||||
generator = FreGAN(h).to(device)
|
|
||||||
mpd = ResWiseMultiPeriodDiscriminator().to(device)
|
|
||||||
msd = ResWiseMultiScaleDiscriminator().to(device)
|
|
||||||
|
|
||||||
if rank == 0:
|
|
||||||
print(generator)
|
|
||||||
os.makedirs(a.checkpoint_path, exist_ok=True)
|
|
||||||
print("checkpoints directory : ", a.checkpoint_path)
|
|
||||||
|
|
||||||
if os.path.isdir(a.checkpoint_path):
|
|
||||||
cp_g = scan_checkpoint(a.checkpoint_path, 'g_fregan_')
|
|
||||||
cp_do = scan_checkpoint(a.checkpoint_path, 'do_fregan_')
|
|
||||||
|
|
||||||
steps = 0
|
|
||||||
if cp_g is None or cp_do is None:
|
|
||||||
state_dict_do = None
|
|
||||||
last_epoch = -1
|
|
||||||
else:
|
|
||||||
state_dict_g = load_checkpoint(cp_g, device)
|
|
||||||
state_dict_do = load_checkpoint(cp_do, device)
|
|
||||||
generator.load_state_dict(state_dict_g['generator'])
|
|
||||||
mpd.load_state_dict(state_dict_do['mpd'])
|
|
||||||
msd.load_state_dict(state_dict_do['msd'])
|
|
||||||
steps = state_dict_do['steps'] + 1
|
|
||||||
last_epoch = state_dict_do['epoch']
|
|
||||||
|
|
||||||
if h.num_gpus > 1:
|
|
||||||
generator = DistributedDataParallel(generator, device_ids=[rank]).to(device)
|
|
||||||
mpd = DistributedDataParallel(mpd, device_ids=[rank]).to(device)
|
|
||||||
msd = DistributedDataParallel(msd, device_ids=[rank]).to(device)
|
|
||||||
|
|
||||||
optim_g = torch.optim.AdamW(generator.parameters(), h.learning_rate, betas=[h.adam_b1, h.adam_b2])
|
|
||||||
optim_d = torch.optim.AdamW(itertools.chain(msd.parameters(), mpd.parameters()),
|
|
||||||
h.learning_rate, betas=[h.adam_b1, h.adam_b2])
|
|
||||||
|
|
||||||
if state_dict_do is not None:
|
|
||||||
optim_g.load_state_dict(state_dict_do['optim_g'])
|
|
||||||
optim_d.load_state_dict(state_dict_do['optim_d'])
|
|
||||||
|
|
||||||
scheduler_g = torch.optim.lr_scheduler.ExponentialLR(optim_g, gamma=h.lr_decay, last_epoch=last_epoch)
|
|
||||||
scheduler_d = torch.optim.lr_scheduler.ExponentialLR(optim_d, gamma=h.lr_decay, last_epoch=last_epoch)
|
|
||||||
|
|
||||||
training_filelist, validation_filelist = get_dataset_filelist(a)
|
|
||||||
|
|
||||||
trainset = MelDataset(training_filelist, h.segment_size, h.n_fft, h.num_mels,
|
|
||||||
h.hop_size, h.win_size, h.sampling_rate, h.fmin, h.fmax, n_cache_reuse=0,
|
|
||||||
shuffle=False if h.num_gpus > 1 else True, fmax_loss=h.fmax_for_loss, device=device,
|
|
||||||
fine_tuning=a.fine_tuning, base_mels_path=a.input_mels_dir)
|
|
||||||
|
|
||||||
train_sampler = DistributedSampler(trainset) if h.num_gpus > 1 else None
|
|
||||||
|
|
||||||
train_loader = DataLoader(trainset, num_workers=h.num_workers, shuffle=False,
|
|
||||||
sampler=train_sampler,
|
|
||||||
batch_size=h.batch_size,
|
|
||||||
pin_memory=True,
|
|
||||||
drop_last=True)
|
|
||||||
|
|
||||||
if rank == 0:
|
|
||||||
validset = MelDataset(validation_filelist, h.segment_size, h.n_fft, h.num_mels,
|
|
||||||
h.hop_size, h.win_size, h.sampling_rate, h.fmin, h.fmax, False, False, n_cache_reuse=0,
|
|
||||||
fmax_loss=h.fmax_for_loss, device=device, fine_tuning=a.fine_tuning,
|
|
||||||
base_mels_path=a.input_mels_dir)
|
|
||||||
validation_loader = DataLoader(validset, num_workers=1, shuffle=False,
|
|
||||||
sampler=None,
|
|
||||||
batch_size=1,
|
|
||||||
pin_memory=True,
|
|
||||||
drop_last=True)
|
|
||||||
|
|
||||||
sw = SummaryWriter(os.path.join(a.checkpoint_path, 'logs'))
|
|
||||||
|
|
||||||
generator.train()
|
|
||||||
mpd.train()
|
|
||||||
msd.train()
|
|
||||||
for epoch in range(max(0, last_epoch), a.training_epochs):
|
|
||||||
if rank == 0:
|
|
||||||
start = time.time()
|
|
||||||
print("Epoch: {}".format(epoch + 1))
|
|
||||||
|
|
||||||
if h.num_gpus > 1:
|
|
||||||
train_sampler.set_epoch(epoch)
|
|
||||||
|
|
||||||
for i, batch in enumerate(train_loader):
|
|
||||||
if rank == 0:
|
|
||||||
start_b = time.time()
|
|
||||||
x, y, _, y_mel = batch
|
|
||||||
x = torch.autograd.Variable(x.to(device, non_blocking=True))
|
|
||||||
y = torch.autograd.Variable(y.to(device, non_blocking=True))
|
|
||||||
y_mel = torch.autograd.Variable(y_mel.to(device, non_blocking=True))
|
|
||||||
y = y.unsqueeze(1)
|
|
||||||
y_g_hat = generator(x)
|
|
||||||
y_g_hat_mel = mel_spectrogram(y_g_hat.squeeze(1), h.n_fft, h.num_mels, h.sampling_rate, h.hop_size,
|
|
||||||
h.win_size,
|
|
||||||
h.fmin, h.fmax_for_loss)
|
|
||||||
|
|
||||||
if steps > h.disc_start_step:
|
|
||||||
optim_d.zero_grad()
|
|
||||||
|
|
||||||
# MPD
|
|
||||||
y_df_hat_r, y_df_hat_g, _, _ = mpd(y, y_g_hat.detach())
|
|
||||||
loss_disc_f, losses_disc_f_r, losses_disc_f_g = discriminator_loss(y_df_hat_r, y_df_hat_g)
|
|
||||||
|
|
||||||
# MSD
|
|
||||||
y_ds_hat_r, y_ds_hat_g, _, _ = msd(y, y_g_hat.detach())
|
|
||||||
loss_disc_s, losses_disc_s_r, losses_disc_s_g = discriminator_loss(y_ds_hat_r, y_ds_hat_g)
|
|
||||||
|
|
||||||
loss_disc_all = loss_disc_s + loss_disc_f
|
|
||||||
|
|
||||||
loss_disc_all.backward()
|
|
||||||
optim_d.step()
|
|
||||||
|
|
||||||
# Generator
|
|
||||||
optim_g.zero_grad()
|
|
||||||
|
|
||||||
|
|
||||||
# L1 Mel-Spectrogram Loss
|
|
||||||
loss_mel = F.l1_loss(y_mel, y_g_hat_mel) * 45
|
|
||||||
|
|
||||||
# sc_loss, mag_loss = stft_loss(y_g_hat[:, :, :y.size(2)].squeeze(1), y.squeeze(1))
|
|
||||||
# loss_mel = h.lambda_aux * (sc_loss + mag_loss) # STFT Loss
|
|
||||||
|
|
||||||
if steps > h.disc_start_step:
|
|
||||||
y_df_hat_r, y_df_hat_g, fmap_f_r, fmap_f_g = mpd(y, y_g_hat)
|
|
||||||
y_ds_hat_r, y_ds_hat_g, fmap_s_r, fmap_s_g = msd(y, y_g_hat)
|
|
||||||
loss_fm_f = feature_loss(fmap_f_r, fmap_f_g)
|
|
||||||
loss_fm_s = feature_loss(fmap_s_r, fmap_s_g)
|
|
||||||
loss_gen_f, losses_gen_f = generator_loss(y_df_hat_g)
|
|
||||||
loss_gen_s, losses_gen_s = generator_loss(y_ds_hat_g)
|
|
||||||
loss_gen_all = loss_gen_s + loss_gen_f + (2 * (loss_fm_s + loss_fm_f)) + loss_mel
|
|
||||||
else:
|
|
||||||
loss_gen_all = loss_mel
|
|
||||||
|
|
||||||
loss_gen_all.backward()
|
|
||||||
optim_g.step()
|
|
||||||
|
|
||||||
if rank == 0:
|
|
||||||
# STDOUT logging
|
|
||||||
if steps % a.stdout_interval == 0:
|
|
||||||
with torch.no_grad():
|
|
||||||
mel_error = F.l1_loss(y_mel, y_g_hat_mel).item()
|
|
||||||
|
|
||||||
print('Steps : {:d}, Gen Loss Total : {:4.3f}, Mel-Spec. Error : {:4.3f}, s/b : {:4.3f}'.
|
|
||||||
format(steps, loss_gen_all, mel_error, time.time() - start_b))
|
|
||||||
|
|
||||||
# checkpointing
|
|
||||||
if steps % a.checkpoint_interval == 0 and steps != 0:
|
|
||||||
checkpoint_path = "{}/g_fregan_{:08d}.pt".format(a.checkpoint_path, steps)
|
|
||||||
save_checkpoint(checkpoint_path,
|
|
||||||
{'generator': (generator.module if h.num_gpus > 1 else generator).state_dict()})
|
|
||||||
checkpoint_path = "{}/do_fregan_{:08d}.pt".format(a.checkpoint_path, steps)
|
|
||||||
                        save_checkpoint(checkpoint_path,
                                        {'mpd': (mpd.module if h.num_gpus > 1
                                                 else mpd).state_dict(),
                                         'msd': (msd.module if h.num_gpus > 1
                                                 else msd).state_dict(),
                                         'optim_g': optim_g.state_dict(), 'optim_d': optim_d.state_dict(), 'steps': steps,
                                         'epoch': epoch})

                # Tensorboard summary logging
                if steps % a.summary_interval == 0:
                    sw.add_scalar("training/gen_loss_total", loss_gen_all, steps)
                    sw.add_scalar("training/mel_spec_error", mel_error, steps)

                # Validation
                if steps % a.validation_interval == 0:  # and steps != 0:
                    generator.eval()
                    torch.cuda.empty_cache()
                    val_err_tot = 0
                    with torch.no_grad():
                        for j, batch in enumerate(validation_loader):
                            x, y, _, y_mel = batch
                            y_g_hat = generator(x.to(device))
                            y_mel = torch.autograd.Variable(y_mel.to(device, non_blocking=True))
                            y_g_hat_mel = mel_spectrogram(y_g_hat.squeeze(1), h.n_fft, h.num_mels, h.sampling_rate,
                                                          h.hop_size, h.win_size,
                                                          h.fmin, h.fmax_for_loss)
                            #val_err_tot += F.l1_loss(y_mel, y_g_hat_mel).item()

                            if j <= 4:
                                if steps == 0:
                                    sw.add_audio('gt/y_{}'.format(j), y[0], steps, h.sampling_rate)
                                    sw.add_figure('gt/y_spec_{}'.format(j), plot_spectrogram(x[0]), steps)

                                sw.add_audio('generated/y_hat_{}'.format(j), y_g_hat[0], steps, h.sampling_rate)
                                y_hat_spec = mel_spectrogram(y_g_hat.squeeze(1), h.n_fft, h.num_mels,
                                                             h.sampling_rate, h.hop_size, h.win_size,
                                                             h.fmin, h.fmax)
                                sw.add_figure('generated/y_hat_spec_{}'.format(j),
                                              plot_spectrogram(y_hat_spec.squeeze(0).cpu().numpy()), steps)

                        val_err = val_err_tot / (j + 1)
                        sw.add_scalar("validation/mel_spec_error", val_err, steps)

                    generator.train()

            steps += 1

        scheduler_g.step()
        scheduler_d.step()

        if rank == 0:
            print('Time taken for epoch {} is {} sec\n'.format(epoch + 1, int(time.time() - start)))
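Note that in the validation loop above the `val_err_tot` accumulation line is commented out, so `val_err = val_err_tot / (j + 1)` always reports 0. The intended per-batch averaging can be sketched in plain Python, with hypothetical scalar `batch_errors` standing in for `F.l1_loss(...).item()` values:

```python
def mean_validation_error(batch_errors):
    """Average per-batch L1 mel errors, mirroring val_err_tot / (j + 1)."""
    val_err_tot = 0.0
    j = -1
    for j, err in enumerate(batch_errors):
        val_err_tot += err  # the accumulation step that is commented out in the listing
    return val_err_tot / (j + 1) if j >= 0 else 0.0

# three validation batches average out to roughly 0.4
assert abs(mean_validation_error([0.2, 0.4, 0.6]) - 0.4) < 1e-9
```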
@@ -1,65 +0,0 @@
-import glob
-import os
-import matplotlib
-import torch
-from torch.nn.utils import weight_norm
-matplotlib.use("Agg")
-import matplotlib.pylab as plt
-import shutil
-
-
-def build_env(config, config_name, path):
-    t_path = os.path.join(path, config_name)
-    if config != t_path:
-        os.makedirs(path, exist_ok=True)
-        shutil.copyfile(config, os.path.join(path, config_name))
-
-
-def plot_spectrogram(spectrogram):
-    fig, ax = plt.subplots(figsize=(10, 2))
-    im = ax.imshow(spectrogram, aspect="auto", origin="lower",
-                   interpolation='none')
-    plt.colorbar(im, ax=ax)
-
-    fig.canvas.draw()
-    plt.close()
-
-    return fig
-
-
-def init_weights(m, mean=0.0, std=0.01):
-    classname = m.__class__.__name__
-    if classname.find("Conv") != -1:
-        m.weight.data.normal_(mean, std)
-
-
-def apply_weight_norm(m):
-    classname = m.__class__.__name__
-    if classname.find("Conv") != -1:
-        weight_norm(m)
-
-
-def get_padding(kernel_size, dilation=1):
-    return int((kernel_size*dilation - dilation)/2)
-
-
-def load_checkpoint(filepath, device):
-    assert os.path.isfile(filepath)
-    print("Loading '{}'".format(filepath))
-    checkpoint_dict = torch.load(filepath, map_location=device)
-    print("Complete.")
-    return checkpoint_dict
-
-
-def save_checkpoint(filepath, obj):
-    print("Saving checkpoint to {}".format(filepath))
-    torch.save(obj, filepath)
-    print("Complete.")
-
-
-def scan_checkpoint(cp_dir, prefix):
-    pattern = os.path.join(cp_dir, prefix + '????????.pt')
-    cp_list = glob.glob(pattern)
-    if len(cp_list) == 0:
-        return None
-    return sorted(cp_list)[-1]
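The removed `get_padding` helper computes the "same" padding for a dilated 1-D convolution: the effective kernel span is `(kernel_size - 1) * dilation + 1`, and half the overhang on each side keeps the output length equal to the input length at stride 1. A quick standalone check of the formula:

```python
def get_padding(kernel_size, dilation=1):
    # (kernel_size * dilation - dilation) / 2 == ((kernel_size - 1) * dilation) / 2
    return int((kernel_size * dilation - dilation) / 2)

assert get_padding(3) == 1            # plain 3-tap conv needs 1 on each side
assert get_padding(5, dilation=2) == 4
assert get_padding(7, dilation=3) == 9
```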
@@ -7,7 +7,6 @@
     "adam_b2": 0.99,
     "lr_decay": 0.999,
     "seed": 1234,
-    "disc_start_step":0,
 
     "upsample_rates": [5,5,4,2],
     "upsample_kernel_sizes": [10,10,8,4],
@@ -28,11 +27,5 @@
     "fmax": 7600,
     "fmax_for_loss": null,
 
-    "num_workers": 4,
-
-    "dist_config": {
-        "dist_backend": "nccl",
-        "dist_url": "tcp://localhost:54321",
-        "world_size": 1
-    }
+    "num_workers": 4
 }
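These JSON configs are loaded into an `AttrDict` (a dict exposing its keys as attributes, as `utils.util.AttrDict` is used for the hparams object `h` in the training scripts). A minimal sketch of that pattern, using an abbreviated stand-in for config.json with fields from the hunk above:

```python
import json

class AttrDict(dict):
    """dict whose keys are also attributes, so configs read as h.lr_decay etc."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.__dict__ = self

cfg = json.loads('{"adam_b2": 0.99, "lr_decay": 0.999, "seed": 1234, "num_workers": 4}')
h = AttrDict(cfg)
assert h.lr_decay == 0.999 and h["lr_decay"] == 0.999
assert h.num_workers == 4
```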
@@ -51,8 +51,8 @@ def train(rank, a, h):
     print("checkpoints directory : ", a.checkpoint_path)
 
     if os.path.isdir(a.checkpoint_path):
-        cp_g = scan_checkpoint(a.checkpoint_path, 'g_hifigan_')
-        cp_do = scan_checkpoint(a.checkpoint_path, 'do_hifigan_')
+        cp_g = scan_checkpoint(a.checkpoint_path, 'g_')
+        cp_do = scan_checkpoint(a.checkpoint_path, 'do_')
 
     steps = 0
     if cp_g is None or cp_do is None:
@@ -137,21 +137,21 @@ def train(rank, a, h):
             y_g_hat = generator(x)
             y_g_hat_mel = mel_spectrogram(y_g_hat.squeeze(1), h.n_fft, h.num_mels, h.sampling_rate, h.hop_size, h.win_size,
                                           h.fmin, h.fmax_for_loss)
-            if steps > h.disc_start_step:
-                optim_d.zero_grad()
-
-                # MPD
-                y_df_hat_r, y_df_hat_g, _, _ = mpd(y, y_g_hat.detach())
-                loss_disc_f, losses_disc_f_r, losses_disc_f_g = discriminator_loss(y_df_hat_r, y_df_hat_g)
-
-                # MSD
-                y_ds_hat_r, y_ds_hat_g, _, _ = msd(y, y_g_hat.detach())
-                loss_disc_s, losses_disc_s_r, losses_disc_s_g = discriminator_loss(y_ds_hat_r, y_ds_hat_g)
-
-                loss_disc_all = loss_disc_s + loss_disc_f
-
-                loss_disc_all.backward()
-                optim_d.step()
+            optim_d.zero_grad()
+
+            # MPD
+            y_df_hat_r, y_df_hat_g, _, _ = mpd(y, y_g_hat.detach())
+            loss_disc_f, losses_disc_f_r, losses_disc_f_g = discriminator_loss(y_df_hat_r, y_df_hat_g)
+
+            # MSD
+            y_ds_hat_r, y_ds_hat_g, _, _ = msd(y, y_g_hat.detach())
+            loss_disc_s, losses_disc_s_r, losses_disc_s_g = discriminator_loss(y_ds_hat_r, y_ds_hat_g)
+
+            loss_disc_all = loss_disc_s + loss_disc_f
+
+            loss_disc_all.backward()
+            optim_d.step()
 
             # Generator
             optim_g.zero_grad()
@@ -159,16 +159,13 @@ def train(rank, a, h):
             # L1 Mel-Spectrogram Loss
             loss_mel = F.l1_loss(y_mel, y_g_hat_mel) * 45
 
-            if steps > h.disc_start_step:
-                y_df_hat_r, y_df_hat_g, fmap_f_r, fmap_f_g = mpd(y, y_g_hat)
-                y_ds_hat_r, y_ds_hat_g, fmap_s_r, fmap_s_g = msd(y, y_g_hat)
-                loss_fm_f = feature_loss(fmap_f_r, fmap_f_g)
-                loss_fm_s = feature_loss(fmap_s_r, fmap_s_g)
-                loss_gen_f, losses_gen_f = generator_loss(y_df_hat_g)
-                loss_gen_s, losses_gen_s = generator_loss(y_ds_hat_g)
-                loss_gen_all = loss_gen_s + loss_gen_f + loss_fm_s + loss_fm_f + loss_mel
-            else:
-                loss_gen_all = loss_mel
+            y_df_hat_r, y_df_hat_g, fmap_f_r, fmap_f_g = mpd(y, y_g_hat)
+            y_ds_hat_r, y_ds_hat_g, fmap_s_r, fmap_s_g = msd(y, y_g_hat)
+            loss_fm_f = feature_loss(fmap_f_r, fmap_f_g)
+            loss_fm_s = feature_loss(fmap_s_r, fmap_s_g)
+            loss_gen_f, losses_gen_f = generator_loss(y_df_hat_g)
+            loss_gen_s, losses_gen_s = generator_loss(y_ds_hat_g)
+            loss_gen_all = loss_gen_s + loss_gen_f + loss_fm_s + loss_fm_f + loss_mel
 
             loss_gen_all.backward()
             optim_g.step()
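The `discriminator_loss` and `generator_loss` calls above are HiFi-GAN's least-squares GAN objectives: the discriminator pushes its score on real audio toward 1 and on generated audio toward 0, while the generator pushes the discriminator's score on its output toward 1. A scalar sketch of those formulas (plain Python floats standing in for the per-sub-discriminator tensors):

```python
def discriminator_loss(real_outputs, gen_outputs):
    # LSGAN: (1 - D(real))^2 + D(fake)^2, summed over sub-discriminators
    loss = 0.0
    for dr, dg in zip(real_outputs, gen_outputs):
        loss += (1 - dr) ** 2 + dg ** 2
    return loss

def generator_loss(gen_outputs):
    # LSGAN generator term: (1 - D(fake))^2, summed over sub-discriminators
    return sum((1 - dg) ** 2 for dg in gen_outputs)

assert discriminator_loss([1.0, 1.0], [0.0, 0.0]) == 0.0  # perfect discriminator
assert generator_loss([1.0]) == 0.0                       # generator fully fools D
assert generator_loss([0.0]) == 1.0                       # generator fully caught
```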
@@ -184,10 +181,10 @@ def train(rank, a, h):
 
                 # checkpointing
                 if steps % a.checkpoint_interval == 0 and steps != 0:
-                    checkpoint_path = "{}/g_hifigan_{:08d}.pt".format(a.checkpoint_path, steps)
+                    checkpoint_path = "{}/g_{:08d}.pt".format(a.checkpoint_path, steps)
                     save_checkpoint(checkpoint_path,
                                     {'generator': (generator.module if h.num_gpus > 1 else generator).state_dict()})
-                    checkpoint_path = "{}/do_hifigan_{:08d}.pt".format(a.checkpoint_path, steps)
+                    checkpoint_path = "{}/do_{:08d}.pt".format(a.checkpoint_path, steps)
                     save_checkpoint(checkpoint_path,
                                     {'mpd': (mpd.module if h.num_gpus > 1 else mpd).state_dict(),
                                      'msd': (msd.module if h.num_gpus > 1 else msd).state_dict(),
@@ -50,7 +50,7 @@ def save_checkpoint(filepath, obj):
 
 
 def scan_checkpoint(cp_dir, prefix):
-    pattern = os.path.join(cp_dir, prefix + '????????.pt')
+    pattern = os.path.join(cp_dir, prefix + 'hifigan.pt')
     cp_list = glob.glob(pattern)
     if len(cp_list) == 0:
         return None
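`scan_checkpoint` relies on the glob pattern `prefix + '????????.pt'` matching the zero-padded 8-digit step suffix produced by `"{}/g_{:08d}.pt".format(...)`, while the fixed-name variant in the hunk above matches a single pretrained file instead. The pattern semantics can be sketched with `fnmatch` (the filenames here are illustrative):

```python
from fnmatch import fnmatch

names = ["g_00001000.pt", "g_00025000.pt", "g_hifigan.pt"]

# step-numbered checkpoints: exactly eight wildcard characters before ".pt"
step_ckpts = sorted(n for n in names if fnmatch(n, "g_" + "????????.pt"))
assert step_ckpts == ["g_00001000.pt", "g_00025000.pt"]
assert step_ckpts[-1] == "g_00025000.pt"  # sorted()[-1] picks the latest step

# fixed-name variant matches only the single pretrained file
assert fnmatch("g_hifigan.pt", "g_" + "hifigan.pt")
```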
@@ -1,13 +1,11 @@
 from utils.argutils import print_args
 from vocoder.wavernn.train import train
 from vocoder.hifigan.train import train as train_hifigan
-from vocoder.fregan.train import train as train_fregan
 from utils.util import AttrDict
 from pathlib import Path
 import argparse
 import json
-import torch
-import torch.multiprocessing as mp
 
 if __name__ == "__main__":
     parser = argparse.ArgumentParser(
@@ -63,30 +61,11 @@ if __name__ == "__main__":
     # Process the arguments
     if args.vocoder_type == "wavernn":
         # Run the training wavernn
-        delattr(args, 'vocoder_type')
-        delattr(args, 'config')
         train(**vars(args))
     elif args.vocoder_type == "hifigan":
         with open(args.config) as f:
             json_config = json.load(f)
         h = AttrDict(json_config)
-        if h.num_gpus > 1:
-            h.num_gpus = torch.cuda.device_count()
-            h.batch_size = int(h.batch_size / h.num_gpus)
-            print('Batch size per GPU :', h.batch_size)
-            mp.spawn(train_hifigan, nprocs=h.num_gpus, args=(args, h,))
-        else:
-            train_hifigan(0, args, h)
-    elif args.vocoder_type == "fregan":
-        with open('vocoder/fregan/config.json') as f:
-            json_config = json.load(f)
-        h = AttrDict(json_config)
-        if h.num_gpus > 1:
-            h.num_gpus = torch.cuda.device_count()
-            h.batch_size = int(h.batch_size / h.num_gpus)
-            print('Batch size per GPU :', h.batch_size)
-            mp.spawn(train_fregan, nprocs=h.num_gpus, args=(args, h,))
-        else:
-            train_fregan(0, args, h)
+        train_hifigan(0, args, h)
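The `--vocoder_type` dispatch in vocoder_train.py can be sketched with a small argparse parser; the returned names below are stand-ins for the trainer functions imported at the top of that script:

```python
import argparse

def dispatch(argv):
    """Map --vocoder_type to the trainer each branch of vocoder_train.py invokes."""
    parser = argparse.ArgumentParser(description="select a vocoder trainer")
    parser.add_argument("--vocoder_type", choices=["wavernn", "hifigan", "fregan"],
                        default="wavernn")
    args = parser.parse_args(argv)
    return {"wavernn": "train",
            "hifigan": "train_hifigan",
            "fregan": "train_fregan"}[args.vocoder_type]

assert dispatch([]) == "train"  # wavernn is the default branch
assert dispatch(["--vocoder_type", "hifigan"]) == "train_hifigan"
```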
web.py
@@ -5,7 +5,7 @@ import typer
 cli = typer.Typer()
 
 @cli.command()
-def launch(port: int = typer.Option(8080, "--port", "-p")) -> None:
+def launch_ui(port: int = typer.Option(8080, "--port", "-p")) -> None:
     """Start a graphical UI server for the opyrator.
 
     The UI is auto-generated from the input- and output-schema of the given function.