Add readme

Add UI usage of PPG-vc
Fix sample issues
2026-02-04 11:04:43 +08:00 · 2022-03-05 00:51:55 +08:00 · 2022-03-03 23:34:47 +08:00 · 2022-03-02 23:15:37 +08:00 · 2022-02-27 13:25:58 +08:00 · 2022-02-26 17:26:27 +08:00
89 changed files with 6330 additions and 15543 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -15,9 +15,8 @@
 *.toc
 *.wav
 *.sh
-synthesizer/saved_models/*
-vocoder/saved_models/*
-encoder/saved_models/*
-cp_hifigan/*
-!vocoder/saved_models/pretrained/*
-!encoder/saved_models/pretrained.pt
+*/saved_models
+!vocoder/saved_models/pretrained/**
+!encoder/saved_models/pretrained.pt
+wavs
+log
--- a/.vscode/launch.json
+++ b/.vscode/launch.json
@@ -35,6 +35,14 @@
      "console": "integratedTerminal",
      "args": ["-d","..\\audiodata"]
    },
+    {
+      "name": "Python: Demo Box VC",
+      "type": "python",
+      "request": "launch",
+      "program": "demo_toolbox.py",
+      "console": "integratedTerminal",
+      "args": ["-d","..\\audiodata","-vc"]
+    },
    {
      "name": "Python: Synth Train",
      "type": "python",
@@ -43,5 +51,15 @@
      "console": "integratedTerminal",
      "args": ["my_run", "..\\"]
    },
+    {
+      "name": "Python: PPG Convert",
+      "type": "python",
+      "request": "launch",
+      "program": "run.py",
+      "console": "integratedTerminal",
+      "args": ["-c", ".\\ppg2mel\\saved_models\\seq2seq_mol_ppg2mel_vctk_libri_oneshotvc_r4_normMel_v2.yaml",
+        "-m", ".\\ppg2mel\\saved_models\\best_loss_step_304000.pth", "--wav_dir", ".\\wavs\\input", "--ref_wav_path", ".\\wavs\\pkq.mp3", "-o", ".\\wavs\\output\\"
+      ]
+    },
  ]
 }
--- a/README-CN.md
+++ b/README-CN.md
@@ -79,10 +79,6 @@
 `python vocoder_train.py <trainid> <datasets_root> hifigan`
 > Replace `<trainid>` with an identifier of your choice; training again with the same identifier resumes from the existing model.

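-* Train the Fre-GAN vocoder:
-`python vocoder_train.py <trainid> <datasets_root> --config config.json fregan`
-> Replace `<trainid>` with an identifier of your choice; training again with the same identifier resumes from the existing model.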
-
 ### 3. Launch the program or toolbox
 You can try the following commands:

@@ -102,33 +98,33 @@

 <img width="1042" alt="d48ea37adf3660e657cfb047c10edbc" src="https://user-images.githubusercontent.com/7423248/134275227-c1ddf154-f118-4b77-8949-8c4c7daf25f0.png">

-## File structure (for developers)
-```
-├─archived_untest_files deprecated files
-├─encoder encoder model
-│  ├─data_objects
-│  └─saved_models pretrained models
-├─samples sample audio
-├─synthesizer  synthesizer model
-│  ├─models
-│  ├─saved_models pretrained models
-│  └─utils utility library
-├─toolbox graphical toolbox
-├─utils utility library
-├─vocoder  vocoder model (currently includes hifi-gan and wavrnn)
-│  ├─hifigan
-│  ├─saved_models pretrained models
-│  └─wavernn
-└─web
-    ├─api
-    │  └─web API endpoints
-    ├─config
-    │  └─ web config files
-    ├─static front-end static scripts
-    │  └─js 
-    ├─templates front-end templates
-    └─__init__.py web entry point
-```
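+### 4. Extra: Voice Conversion (PPG based)
+Ever imagined Conan holding his voice changer and speaking in Kogoro Mouri's voice? Built on PPG-VC, this project now adds two extra modules (PPG extractor + PPG2Mel) to support voice conversion. (The documentation is incomplete, especially the training part; we are working on it.)
+#### 4.0 Prepare the environment
+* Make sure the environment described above is already installed, then run `pip install -r requirements.txt` to install the remaining required packages.
+* Download the following models:
+  * A vocoder (hifigan) dedicated to the 24 kHz sampling rate, to *vocoder\saved_models\xxx*
+  * The pretrained PPG feature encoder (ppg_extractor), to *ppg_extractor\saved_models\xxx*
+  * The pretrained PPG2Mel, to *ppg2mel\saved_models\xxx*

+#### 4.1 Train your own PPG2Mel model on a dataset (optional)

+* Download the aidatatang_200zh dataset and extract it: make sure you can access all audio files (e.g. .wav) in the *train* folder
+* Preprocess the audio and mel spectrograms:
+`python pre4ppg.py <datasets_root> -d {dataset} -n {number}`
+Available arguments:
+* `-d {dataset}` selects the dataset; aidatatang_200zh is supported and is the default when omitted
+* `-n {number}` sets the number of parallel processes; with 8 on an 11770k CPU, this takes 12 to 18 hours! To be optimized
+> If your downloaded `aidatatang_200zh` files are on drive D and the `train` folder is at `D:\data\aidatatang_200zh\corpus\train`, then your `datasets_root` is `D:\data\`

+* Train the synthesizer. Download `ppg2mel.yaml` in the previous step and edit the paths in it to point to the pretrained folder:
+`python ppg2mel_train.py --config .\ppg2mel\saved_models\ppg2mel.yaml --oneshotvc `
+* To resume a previous training run, pass `--load .\ppg2mel\saved_models\<old_pt_file>` to specify a pretrained model file.

+#### 4.2 Launch the toolbox in VC mode
+You can try the following command:
+`python demo_toolbox.py -vc -d <datasets_root>`
+> Specify a valid dataset path. If a supported dataset is found, it is loaded automatically for debugging; the path also serves as the storage directory for manually recorded audio.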

 ## References and papers
 > This repository was originally forked from [Real-Time-Voice-Cloning](https://github.com/CorentinJ/Real-Time-Voice-Cloning), which supports English only. Thanks to its author.
--- a/analysis.py
+++ b/analysis.py
@@ -1,43 +0,0 @@
-from scipy.io import wavfile # scipy library to read wav files
-import numpy as np
-
-AudioName = "target.wav" # Audio File
-fs, Audiodata = wavfile.read(AudioName)
-
-# Plot the audio signal in time
-import matplotlib.pyplot as plt
-plt.plot(Audiodata)
-plt.title('Audio signal in time',size=16)
-
-# spectrum
-from scipy.fftpack import fft # fourier transform
-n = len(Audiodata)
-AudioFreq = fft(Audiodata)
-AudioFreq = AudioFreq[0:int(np.ceil((n+1)/2.0))] #Half of the spectrum
-MagFreq = np.abs(AudioFreq) # Magnitude
-MagFreq = MagFreq / float(n)
-# power spectrum
-MagFreq = MagFreq**2
-if n % 2 > 0: # ffte odd
-    MagFreq[1:len(MagFreq)] = MagFreq[1:len(MagFreq)] * 2
-else:# fft even
-    MagFreq[1:len(MagFreq) -1] = MagFreq[1:len(MagFreq) - 1] * 2
-
-plt.figure()
-freqAxis = np.arange(0,int(np.ceil((n+1)/2.0)), 1.0) * (fs / n);
-plt.plot(freqAxis/1000.0, 10*np.log10(MagFreq)) #Power spectrum
-plt.xlabel('Frequency (kHz)'); plt.ylabel('Power spectrum (dB)');
-
-
-#Spectrogram
-from scipy import signal
-N = 512 #Number of point in the fft
-f, t, Sxx = signal.spectrogram(Audiodata, fs,window = signal.blackman(N),nfft=N)
-plt.figure()
-plt.pcolormesh(t, f,10*np.log10(Sxx)) # dB spectrogram
-#plt.pcolormesh(t, f,Sxx) # Lineal spectrogram
-plt.ylabel('Frequency [Hz]')
-plt.xlabel('Time [seg]')
-plt.title('Spectrogram with scipy.signal',size=16);
-
-plt.show()
--- a/demo_toolbox.py
+++ b/demo_toolbox.py
@@ -15,12 +15,18 @@ if __name__ == '__main__':
    parser.add_argument("-d", "--datasets_root", type=Path, help= \
        "Path to the directory containing your datasets. See toolbox/__init__.py for a list of "
        "supported datasets.", default=None)
+    parser.add_argument("-vc", "--vc_mode", action="store_true", 
+                        help="Voice Conversion Mode(PPG based)")
    parser.add_argument("-e", "--enc_models_dir", type=Path, default="encoder/saved_models", 
                        help="Directory containing saved encoder models")
    parser.add_argument("-s", "--syn_models_dir", type=Path, default="synthesizer/saved_models", 
                        help="Directory containing saved synthesizer models")
    parser.add_argument("-v", "--voc_models_dir", type=Path, default="vocoder/saved_models", 
                        help="Directory containing saved vocoder models")
+    parser.add_argument("-ex", "--extractor_models_dir", type=Path, default="ppg_extractor/saved_models", 
+                        help="Directory containing saved extractor models")
+    parser.add_argument("-cv", "--convertor_models_dir", type=Path, default="ppg2mel/saved_models", 
+                        help="Directory containing saved convertor models")
    parser.add_argument("--cpu", action="store_true", help=\
        "If True, processing is done on CPU, even when a GPU is available.")
    parser.add_argument("--seed", type=int, default=None, help=\
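Review note on the new flags: `-vc` shares its first character with the existing `-v` (`--voc_models_dir`) option, so it is worth checking that it parses as its own switch. The sketch below is a reduced stand-in for the real parser, not the toolbox code itself: `build_parser` is a hypothetical helper, only a subset of the arguments is declared, and the sample argument list mirrors the new "Demo Box VC" launch entry.

```python
import argparse
from pathlib import Path


def build_parser():
    # Hypothetical helper: a subset of the demo_toolbox.py arguments, enough
    # to exercise the interaction between "-v" and the new "-vc" flag.
    parser = argparse.ArgumentParser()
    parser.add_argument("-d", "--datasets_root", type=Path, default=None)
    parser.add_argument("-vc", "--vc_mode", action="store_true",
                        help="Voice Conversion Mode (PPG based)")
    parser.add_argument("-v", "--voc_models_dir", type=Path,
                        default="vocoder/saved_models")
    parser.add_argument("-ex", "--extractor_models_dir", type=Path,
                        default="ppg_extractor/saved_models")
    parser.add_argument("-cv", "--convertor_models_dir", type=Path,
                        default="ppg2mel/saved_models")
    return parser


# Same arguments as the "Python: Demo Box VC" launch configuration.
args = build_parser().parse_args(["-d", "../audiodata", "-vc"])
print(args.vc_mode, args.voc_models_dir, args.extractor_models_dir)
```

argparse looks for an exact option-string match before it tries to read `-vc` as `-v` with the value `c`, so the flag is recognised and the vocoder directory keeps its default.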
--- a/encoder/inference.py
+++ b/encoder/inference.py
@@ -34,8 +34,16 @@ def load_model(weights_fpath: Path, device=None):
    _model.load_state_dict(checkpoint["model_state"])
    _model.eval()
    print("Loaded encoder \"%s\" trained to step %d" % (weights_fpath.name, checkpoint["step"]))
+    return _model
    
-    
+def set_model(model, device=None):
+    global _model, _device
+    _model = model
+    if device is None:
+        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+    _device = device
+    _model.to(device)
+
 def is_loaded():
    return _model is not None

@@ -57,7 +65,7 @@ def embed_frames_batch(frames_batch):


 def compute_partial_slices(n_samples, partial_utterance_n_frames=partials_n_frames,
-                           min_pad_coverage=0.75, overlap=0.5):
+                           min_pad_coverage=0.75, overlap=0.5, rate=None):
    """
    Computes where to split an utterance waveform and its corresponding mel spectrogram to obtain 
    partial utterances of <partial_utterance_n_frames> each. Both the waveform and the mel 
@@ -85,9 +93,18 @@ def compute_partial_slices(n_samples, partial_utterance_n_frames=partials_n_fram
    assert 0 <= overlap < 1
    assert 0 < min_pad_coverage <= 1
    
-    samples_per_frame = int((sampling_rate * mel_window_step / 1000))
-    n_frames = int(np.ceil((n_samples + 1) / samples_per_frame))
-    frame_step = max(int(np.round(partial_utterance_n_frames * (1 - overlap))), 1)
+    if rate is not None:
+        samples_per_frame = int((sampling_rate * mel_window_step / 1000))
+        n_frames = int(np.ceil((n_samples + 1) / samples_per_frame))
+        frame_step = int(np.round((sampling_rate / rate) / samples_per_frame))
+    else: 
+        samples_per_frame = int((sampling_rate * mel_window_step / 1000))
+        n_frames = int(np.ceil((n_samples + 1) / samples_per_frame))
+        frame_step = max(int(np.round(partial_utterance_n_frames * (1 - overlap))), 1)
+
+    assert 0 < frame_step, "The rate is too high"
+    assert frame_step <= partials_n_frames, "The rate is too low, it should be %f at least" % \
+        (sampling_rate / (samples_per_frame * partials_n_frames))

    # Compute the slices
    wav_slices, mel_slices = [], []
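Review note on the new `rate` parameter: it is the number of partial utterances per second, and it is converted into a hop of `frame_step` mel frames. A minimal sketch of that conversion follows. The constants `sampling_rate=16000`, `mel_window_step=10` and `partials_n_frames=160` are assumptions taken from the upstream encoder defaults, not from this diff, and `frame_step_for_rate` is a hypothetical helper name.

```python
def frame_step_for_rate(rate, sampling_rate=16000, mel_window_step=10,
                        partials_n_frames=160):
    """Convert partials-per-second into a hop measured in mel frames.

    The default constants are assumed values, not read from the project.
    """
    samples_per_frame = int(sampling_rate * mel_window_step / 1000)
    frame_step = int(round((sampling_rate / rate) / samples_per_frame))
    if frame_step <= 0:
        raise ValueError("The rate is too high")
    if frame_step > partials_n_frames:
        min_rate = sampling_rate / (samples_per_frame * partials_n_frames)
        raise ValueError("The rate is too low, it should be %f at least" % min_rate)
    return frame_step


# rate=1.3 is the value passed by ppg2mel/preprocess.py in this change.
print(frame_step_for_rate(1.3))
```

Under these assumed constants the lowest accepted rate is 0.625 partials per second, at which point consecutive partials no longer overlap.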
--- a/fmcc_result.png
+++ b/fmcc_result.png
--- a/fmcc_source.png
+++ b/fmcc_source.png
--- a/ppg2mel/init.py
+++ b/ppg2mel/init.py
@@ -0,0 +1,206 @@
+#!/usr/bin/env python3
+
+# Copyright 2020 Songxiang Liu
+# Apache 2.0
+
+from typing import List
+
+import torch
+import torch.nn.functional as F
+
+import numpy as np
+
+from .utils.abs_model import AbsMelDecoder
+from .rnn_decoder_mol import Decoder
+from .utils.cnn_postnet import Postnet
+from .utils.vc_utils import get_mask_from_lengths
+
+from utils.load_yaml import HpsYaml
+
+class MelDecoderMOLv2(AbsMelDecoder):
+    """Use an encoder to preprocess ppg."""
+    def __init__(
+        self,
+        num_speakers: int,
+        spk_embed_dim: int,
+        bottle_neck_feature_dim: int,
+        encoder_dim: int = 256,
+        encoder_downsample_rates: List = [2, 2],
+        attention_rnn_dim: int = 512,
+        decoder_rnn_dim: int = 512,
+        num_decoder_rnn_layer: int = 1,
+        concat_context_to_last: bool = True,
+        prenet_dims: List = [256, 128],
+        num_mixtures: int = 5,
+        frames_per_step: int = 2,
+        mask_padding: bool = True,
+    ):
+        super().__init__()
+        
+        self.mask_padding = mask_padding
+        self.bottle_neck_feature_dim = bottle_neck_feature_dim
+        self.num_mels = 80
+        self.encoder_down_factor=np.cumprod(encoder_downsample_rates)[-1]
+        self.frames_per_step = frames_per_step
+        self.use_spk_dvec = True
+
+        input_dim = bottle_neck_feature_dim
+        
+        # Downsampling convolution
+        self.bnf_prenet = torch.nn.Sequential(
+            torch.nn.Conv1d(input_dim, encoder_dim, kernel_size=1, bias=False),
+            torch.nn.LeakyReLU(0.1),
+
+            torch.nn.InstanceNorm1d(encoder_dim, affine=False),
+            torch.nn.Conv1d(
+                encoder_dim, encoder_dim, 
+                kernel_size=2*encoder_downsample_rates[0], 
+                stride=encoder_downsample_rates[0], 
+                padding=encoder_downsample_rates[0]//2,
+            ),
+            torch.nn.LeakyReLU(0.1),
+            
+            torch.nn.InstanceNorm1d(encoder_dim, affine=False),
+            torch.nn.Conv1d(
+                encoder_dim, encoder_dim, 
+                kernel_size=2*encoder_downsample_rates[1], 
+                stride=encoder_downsample_rates[1], 
+                padding=encoder_downsample_rates[1]//2,
+            ),
+            torch.nn.LeakyReLU(0.1),
+
+            torch.nn.InstanceNorm1d(encoder_dim, affine=False),
+        )
+        decoder_enc_dim = encoder_dim
+        self.pitch_convs = torch.nn.Sequential(
+            torch.nn.Conv1d(2, encoder_dim, kernel_size=1, bias=False),
+            torch.nn.LeakyReLU(0.1),
+
+            torch.nn.InstanceNorm1d(encoder_dim, affine=False),
+            torch.nn.Conv1d(
+                encoder_dim, encoder_dim, 
+                kernel_size=2*encoder_downsample_rates[0], 
+                stride=encoder_downsample_rates[0], 
+                padding=encoder_downsample_rates[0]//2,
+            ),
+            torch.nn.LeakyReLU(0.1),
+            
+            torch.nn.InstanceNorm1d(encoder_dim, affine=False),
+            torch.nn.Conv1d(
+                encoder_dim, encoder_dim, 
+                kernel_size=2*encoder_downsample_rates[1], 
+                stride=encoder_downsample_rates[1], 
+                padding=encoder_downsample_rates[1]//2,
+            ),
+            torch.nn.LeakyReLU(0.1),
+
+            torch.nn.InstanceNorm1d(encoder_dim, affine=False),
+        )
+        
+        self.reduce_proj = torch.nn.Linear(encoder_dim + spk_embed_dim, encoder_dim)
+
+        # Decoder
+        self.decoder = Decoder(
+            enc_dim=decoder_enc_dim,
+            num_mels=self.num_mels,
+            frames_per_step=frames_per_step,
+            attention_rnn_dim=attention_rnn_dim,
+            decoder_rnn_dim=decoder_rnn_dim,
+            num_decoder_rnn_layer=num_decoder_rnn_layer,
+            prenet_dims=prenet_dims,
+            num_mixtures=num_mixtures,
+            use_stop_tokens=True,
+            concat_context_to_last=concat_context_to_last,
+            encoder_down_factor=self.encoder_down_factor,
+        )
+
+        # Mel-Spec Postnet: some residual CNN layers
+        self.postnet = Postnet()
+    
+    def parse_output(self, outputs, output_lengths=None):
+        if self.mask_padding and output_lengths is not None:
+            mask = ~get_mask_from_lengths(output_lengths, outputs[0].size(1))
+            mask = mask.unsqueeze(2).expand(mask.size(0), mask.size(1), self.num_mels)
+            outputs[0].data.masked_fill_(mask, 0.0)
+            outputs[1].data.masked_fill_(mask, 0.0)
+        return outputs
+
+    def forward(
+        self,
+        bottle_neck_features: torch.Tensor,
+        feature_lengths: torch.Tensor,
+        speech: torch.Tensor,
+        speech_lengths: torch.Tensor,
+        logf0_uv: torch.Tensor = None,
+        spembs: torch.Tensor = None,
+        output_att_ws: bool = False,
+    ):
+        decoder_inputs = self.bnf_prenet(
+            bottle_neck_features.transpose(1, 2)
+        ).transpose(1, 2)
+        logf0_uv = self.pitch_convs(logf0_uv.transpose(1, 2)).transpose(1, 2)
+        decoder_inputs = decoder_inputs + logf0_uv
+            
+        assert spembs is not None
+        spk_embeds = F.normalize(
+            spembs).unsqueeze(1).expand(-1, decoder_inputs.size(1), -1)
+        decoder_inputs = torch.cat([decoder_inputs, spk_embeds], dim=-1)
+        decoder_inputs = self.reduce_proj(decoder_inputs)
+        
+        # (B, num_mels, T_dec)
+        T_dec = torch.div(feature_lengths, int(self.encoder_down_factor), rounding_mode='floor')
+        mel_outputs, predicted_stop, alignments = self.decoder(
+            decoder_inputs, speech, T_dec)
+        ## Post-processing
+        mel_outputs_postnet = self.postnet(mel_outputs.transpose(1, 2)).transpose(1, 2)
+        mel_outputs_postnet = mel_outputs + mel_outputs_postnet
+        if output_att_ws: 
+            return self.parse_output(
+                [mel_outputs, mel_outputs_postnet, predicted_stop, alignments], speech_lengths)
+        else:
+            return self.parse_output(
+                [mel_outputs, mel_outputs_postnet, predicted_stop], speech_lengths)
+
+        # return mel_outputs, mel_outputs_postnet
+
+    def inference(
+        self,
+        bottle_neck_features: torch.Tensor,
+        logf0_uv: torch.Tensor = None,
+        spembs: torch.Tensor = None,
+    ):
+        decoder_inputs = self.bnf_prenet(bottle_neck_features.transpose(1, 2)).transpose(1, 2)
+        logf0_uv = self.pitch_convs(logf0_uv.transpose(1, 2)).transpose(1, 2)
+        decoder_inputs = decoder_inputs + logf0_uv
+
+        assert spembs is not None
+        spk_embeds = F.normalize(
+            spembs).unsqueeze(1).expand(-1, decoder_inputs.size(1), -1)
+        bottle_neck_features = torch.cat([decoder_inputs, spk_embeds], dim=-1)
+        bottle_neck_features = self.reduce_proj(bottle_neck_features)
+
+        ## Decoder
+        if bottle_neck_features.size(0) > 1:
+            mel_outputs, alignments = self.decoder.inference_batched(bottle_neck_features)
+        else:
+            mel_outputs, alignments = self.decoder.inference(bottle_neck_features,)
+        ## Post-processing
+        mel_outputs_postnet = self.postnet(mel_outputs.transpose(1, 2)).transpose(1, 2)
+        mel_outputs_postnet = mel_outputs + mel_outputs_postnet
+        # outputs = mel_outputs_postnet[0]
+        
+        return mel_outputs[0], mel_outputs_postnet[0], alignments[0]
+
+def load_model(train_config, model_file, device=None):
+    
+    if device is None:
+        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+
+    model_config = HpsYaml(train_config)
+    ppg2mel_model = MelDecoderMOLv2(
+        **model_config["model"]
+    ).to(device)
+    ckpt = torch.load(model_file, map_location=device)
+    ppg2mel_model.load_state_dict(ckpt["model"])
+    ppg2mel_model.eval()
+    return ppg2mel_model
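Review note on `T_dec`: `forward` floor-divides `feature_lengths` by `encoder_down_factor`, which only matches the real encoder output if the two strided convolutions in `bnf_prenet` shorten the sequence by exactly that factor. The sketch below checks this with the standard Conv1d output-length formula (dilation 1) for the default `encoder_downsample_rates=[2, 2]`; `conv1d_out_len` and `prenet_out_len` are hypothetical helpers and no PyTorch is needed.

```python
def conv1d_out_len(length, kernel_size, stride, padding):
    # Standard Conv1d output-length formula, assuming dilation 1.
    return (length + 2 * padding - kernel_size) // stride + 1


def prenet_out_len(length, downsample_rates=(2, 2)):
    # Each strided layer uses kernel 2*r, stride r, padding r//2, as in
    # bnf_prenet; the kernel_size=1 convs and norms keep the length unchanged.
    for r in downsample_rates:
        length = conv1d_out_len(length, kernel_size=2 * r, stride=r,
                                padding=r // 2)
    return length


for t in (8, 101, 400):
    print(t, prenet_out_len(t), t // 4)
```

For rate 2 each layer maps a length `L` to `L // 2`, so two layers give `L // 4`, which agrees with the floor division used for `T_dec`. Other rate choices are not covered by this check.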
--- a/ppg2mel/preprocess.py
+++ b/ppg2mel/preprocess.py
@@ -0,0 +1,112 @@
+
+import os
+import torch
+import numpy as np
+from tqdm import tqdm
+from pathlib import Path
+import soundfile
+import resampy
+
+from ppg_extractor import load_model
+import encoder.inference as Encoder
+from encoder.audio import preprocess_wav
+from encoder import audio
+from utils.f0_utils import compute_f0
+
+from torch.multiprocessing import Pool, cpu_count
+from functools import partial
+
+SAMPLE_RATE=16000
+
+def _compute_bnf(
+    wav: np.ndarray,
+    output_fpath: str,
+    device: torch.device,
+    ppg_model_local: torch.nn.Module,
+):
+    """
+    Compute CTC-Attention Seq2seq ASR encoder bottle-neck features (BNF).
+    """
+    ppg_model_local.to(device)
+    wav_tensor = torch.from_numpy(wav).float().to(device).unsqueeze(0)
+    wav_length = torch.LongTensor([wav.shape[0]]).to(device)
+    with torch.no_grad():
+        bnf = ppg_model_local(wav_tensor, wav_length) 
+    bnf_npy = bnf.squeeze(0).cpu().numpy()
+    np.save(output_fpath, bnf_npy, allow_pickle=False)
+    return bnf_npy, len(bnf_npy)
+
+def _compute_f0_from_wav(wav, output_fpath):
+    """Compute merged f0 values."""
+    f0 = compute_f0(wav, SAMPLE_RATE)
+    np.save(output_fpath, f0, allow_pickle=False)
+    return f0, len(f0)
+
+def _compute_spkEmbed(wav, output_fpath, encoder_model_local, device):
+    Encoder.set_model(encoder_model_local)
+    # Compute where to split the utterance into partials and pad if necessary
+    wave_slices, mel_slices = Encoder.compute_partial_slices(len(wav), rate=1.3, min_pad_coverage=0.75)
+    max_wave_length = wave_slices[-1].stop
+    if max_wave_length >= len(wav):
+        wav = np.pad(wav, (0, max_wave_length - len(wav)), "constant")
+    
+    # Split the utterance into partials
+    frames = audio.wav_to_mel_spectrogram(wav)
+    frames_batch = np.array([frames[s] for s in mel_slices])
+    partial_embeds = Encoder.embed_frames_batch(frames_batch)
+    
+    # Compute the utterance embedding from the partial embeddings
+    raw_embed = np.mean(partial_embeds, axis=0)
+    embed = raw_embed / np.linalg.norm(raw_embed, 2)
+
+    np.save(output_fpath, embed, allow_pickle=False)
+    return embed, len(embed)
+
+def preprocess_one(wav_path, out_dir, device, ppg_model_local, encoder_model_local):
+    # wav = preprocess_wav(wav_path)
+    # try:
+    wav, sr = soundfile.read(wav_path)
+    if len(wav) < sr:
+        return None, sr, len(wav)
+    if sr != SAMPLE_RATE:
+        wav = resampy.resample(wav, sr, SAMPLE_RATE)
+        sr = SAMPLE_RATE
+    utt_id = os.path.splitext(os.path.basename(wav_path))[0]  # drop only the file extension
+
+    _, length_bnf = _compute_bnf(output_fpath=f"{out_dir}/bnf/{utt_id}.ling_feat.npy", wav=wav, device=device, ppg_model_local=ppg_model_local)
+    _, length_f0 = _compute_f0_from_wav(output_fpath=f"{out_dir}/f0/{utt_id}.f0.npy", wav=wav)
+    _, length_embed = _compute_spkEmbed(output_fpath=f"{out_dir}/embed/{utt_id}.npy",  device=device, encoder_model_local=encoder_model_local, wav=wav)
+
+def preprocess_dataset(datasets_root, dataset, out_dir, n_processes, ppg_encoder_model_fpath, speaker_encoder_model):
+    # Glob wav files
+    wav_file_list = sorted(Path(f"{datasets_root}/{dataset}").glob("**/*.wav"))
+    print(f"Globbed {len(wav_file_list)} wav files.")
+
+    out_dir.joinpath("bnf").mkdir(exist_ok=True, parents=True)
+    out_dir.joinpath("f0").mkdir(exist_ok=True, parents=True)
+    out_dir.joinpath("embed").mkdir(exist_ok=True, parents=True)
+    ppg_model_local = load_model(ppg_encoder_model_fpath, "cpu")
+    encoder_model_local = Encoder.load_model(speaker_encoder_model, "cpu")
+    if n_processes is None:
+        n_processes = cpu_count()
+    
+    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+    func = partial(preprocess_one, out_dir=out_dir, ppg_model_local=ppg_model_local, encoder_model_local=encoder_model_local, device=device)
+    job = Pool(n_processes).imap(func, wav_file_list)
+    list(tqdm(job, "Preprocessing", len(wav_file_list), unit="wav"))
+
+    # finish processing and mark
+    t_fid_file = out_dir.joinpath("train_fidlist.txt").open("w", encoding="utf-8")
+    d_fid_file = out_dir.joinpath("dev_fidlist.txt").open("w", encoding="utf-8")
+    e_fid_file = out_dir.joinpath("eval_fidlist.txt").open("w", encoding="utf-8")
+    for file in sorted(out_dir.joinpath("f0").glob("*.npy")):
+        id = os.path.basename(file).split(".f0.npy")[0]
+        if id.endswith("01"):
+            d_fid_file.write(id + "\n")
+        elif id.endswith("09"):
+            e_fid_file.write(id + "\n")
+        else:
+            t_fid_file.write(id + "\n")
+    t_fid_file.close()
+    d_fid_file.close()
+    e_fid_file.close()
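Review note on the file-id lists: the split at the end of `preprocess_dataset` depends only on the last two characters of each utterance id. A minimal sketch of that rule follows; `split_for` is a hypothetical helper and the ids are made-up examples, not real dataset entries.

```python
def split_for(utt_id):
    # Same suffix rule as the loop that writes the three fidlist files.
    if utt_id.endswith("01"):
        return "dev"
    if utt_id.endswith("09"):
        return "eval"
    return "train"


# Made-up ids for illustration only.
for utt_id in ("spk0001_0101", "spk0001_0109", "spk0001_0110"):
    print(utt_id, split_for(utt_id))
```

How large the dev and eval lists end up depends on how the dataset numbers its utterances, so the split ratio is not fixed by this code.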
--- a/ppg2mel/rnn_decoder_mol.py
+++ b/ppg2mel/rnn_decoder_mol.py
@@ -0,0 +1,374 @@
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+import numpy as np
+from .utils.mol_attention import MOLAttention
+from .utils.basic_layers import Linear
+from .utils.vc_utils import get_mask_from_lengths
+
+
+class DecoderPrenet(nn.Module):
+    def __init__(self, in_dim, sizes):
+        super().__init__()
+        in_sizes = [in_dim] + sizes[:-1]
+        self.layers = nn.ModuleList(
+            [Linear(in_size, out_size, bias=False)
+             for (in_size, out_size) in zip(in_sizes, sizes)])
+
+    def forward(self, x):
+        for linear in self.layers:
+            x = F.dropout(F.relu(linear(x)), p=0.5, training=True)
+        return x
+
+
+class Decoder(nn.Module):
+    """Mixture of Logistic (MoL) attention-based RNN Decoder."""
+    def __init__(
+        self,
+        enc_dim,
+        num_mels,
+        frames_per_step,
+        attention_rnn_dim,
+        decoder_rnn_dim,
+        prenet_dims,
+        num_mixtures,
+        encoder_down_factor=1,
+        num_decoder_rnn_layer=1,
+        use_stop_tokens=False,
+        concat_context_to_last=False,
+    ):
+        super().__init__()
+        self.enc_dim = enc_dim
+        self.encoder_down_factor = encoder_down_factor
+        self.num_mels = num_mels
+        self.frames_per_step = frames_per_step
+        self.attention_rnn_dim = attention_rnn_dim
+        self.decoder_rnn_dim = decoder_rnn_dim
+        self.prenet_dims = prenet_dims
+        self.use_stop_tokens = use_stop_tokens
+        self.num_decoder_rnn_layer = num_decoder_rnn_layer
+        self.concat_context_to_last = concat_context_to_last
+
+        # Mel prenet
+        self.prenet = DecoderPrenet(num_mels, prenet_dims)
+        self.prenet_pitch = DecoderPrenet(num_mels, prenet_dims)
+
+        # Attention RNN
+        self.attention_rnn = nn.LSTMCell(
+            prenet_dims[-1] + enc_dim,
+            attention_rnn_dim
+        )
+        
+        # Attention
+        self.attention_layer = MOLAttention(
+            attention_rnn_dim,
+            r=frames_per_step/encoder_down_factor,
+            M=num_mixtures,
+        )
+
+        # Decoder RNN
+        self.decoder_rnn_layers = nn.ModuleList()
+        for i in range(num_decoder_rnn_layer):
+            if i == 0:
+                self.decoder_rnn_layers.append(
+                    nn.LSTMCell(
+                        enc_dim + attention_rnn_dim,
+                        decoder_rnn_dim))
+            else:
+                self.decoder_rnn_layers.append(
+                    nn.LSTMCell(
+                        decoder_rnn_dim,
+                        decoder_rnn_dim))
+        # self.decoder_rnn = nn.LSTMCell(
+            # 2 * enc_dim + attention_rnn_dim,
+            # decoder_rnn_dim
+        # )
+        if concat_context_to_last:
+            self.linear_projection = Linear(
+                enc_dim + decoder_rnn_dim,
+                num_mels * frames_per_step
+            )
+        else:
+            self.linear_projection = Linear(
+                decoder_rnn_dim,
+                num_mels * frames_per_step
+            )
+
+
+        # Stop-token layer
+        if self.use_stop_tokens:
+            if concat_context_to_last:
+                self.stop_layer = Linear(
+                    enc_dim + decoder_rnn_dim, 1, bias=True, w_init_gain="sigmoid"
+                )
+            else:
+                self.stop_layer = Linear(
+                    decoder_rnn_dim, 1, bias=True, w_init_gain="sigmoid"
+                )
+                
+
+    def get_go_frame(self, memory):
+        B = memory.size(0)
+        go_frame = torch.zeros((B, self.num_mels), dtype=torch.float,
+                               device=memory.device)
+        return go_frame
+
+    def initialize_decoder_states(self, memory, mask):
+        device = next(self.parameters()).device
+        B = memory.size(0)
+        
+        # attention rnn states
+        self.attention_hidden = torch.zeros(
+            (B, self.attention_rnn_dim), device=device)
+        self.attention_cell = torch.zeros(
+            (B, self.attention_rnn_dim), device=device)
+
+        # decoder rnn states
+        self.decoder_hiddens = []
+        self.decoder_cells = []
+        for i in range(self.num_decoder_rnn_layer):
+            self.decoder_hiddens.append(
+                torch.zeros((B, self.decoder_rnn_dim),
+                            device=device)
+            )
+            self.decoder_cells.append(
+                torch.zeros((B, self.decoder_rnn_dim),
+                            device=device)
+            )
+        # self.decoder_hidden = torch.zeros(
+            # (B, self.decoder_rnn_dim), device=device)
+        # self.decoder_cell = torch.zeros(
+            # (B, self.decoder_rnn_dim), device=device)
+        
+        self.attention_context =  torch.zeros(
+            (B, self.enc_dim), device=device)
+
+        self.memory = memory
+        # self.processed_memory = self.attention_layer.memory_layer(memory)
+        self.mask = mask
+
+    def parse_decoder_inputs(self, decoder_inputs):
+        """Prepare decoder inputs, i.e. gt mel
+        Args:
+            decoder_inputs:(B, T_out, n_mel_channels) inputs used for teacher-forced training.
+        """
+        decoder_inputs = decoder_inputs.reshape(
+            decoder_inputs.size(0),
+            int(decoder_inputs.size(1)/self.frames_per_step), -1)
+        # (B, T_out//r, r*num_mels) -> (T_out//r, B, r*num_mels)
+        decoder_inputs = decoder_inputs.transpose(0, 1)
+        # (T_out//r, B, num_mels)
+        decoder_inputs = decoder_inputs[:,:,-self.num_mels:]
+        return decoder_inputs
+        
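+    def parse_decoder_outputs(self, mel_outputs, alignments, stop_outputs):
+        """Stack per-step decoder outputs into batch-first tensors.
+        Args:
+            mel_outputs: list of per-step mel tensors, each (B, num_mels*r).
+            alignments: list of per-step attention weights, each (B, T_enc).
+        """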
+        # (T_out//r, B, T_enc) -> (B, T_out//r, T_enc)
+        alignments = torch.stack(alignments).transpose(0, 1)
+        # (T_out//r, B) -> (B, T_out//r)
+        if stop_outputs is not None:
+            if alignments.size(0) == 1:
+                stop_outputs = torch.stack(stop_outputs).unsqueeze(0)
+            else:
+                stop_outputs = torch.stack(stop_outputs).transpose(0, 1)
+            stop_outputs = stop_outputs.contiguous()
+        # (T_out//r, B, num_mels*r) -> (B, T_out//r, num_mels*r)
+        mel_outputs = torch.stack(mel_outputs).transpose(0, 1).contiguous()
+        # decouple frames per step
+        # (B, T_out, num_mels)
+        mel_outputs = mel_outputs.view(
+            mel_outputs.size(0), -1, self.num_mels)
+        return mel_outputs, alignments, stop_outputs     
+    
+    def attend(self, decoder_input):
+        cell_input = torch.cat((decoder_input, self.attention_context), -1)
+        self.attention_hidden, self.attention_cell = self.attention_rnn(
+            cell_input, (self.attention_hidden, self.attention_cell))
+        self.attention_context, attention_weights = self.attention_layer(
+            self.attention_hidden, self.memory, None, self.mask)
+        
+        decoder_rnn_input = torch.cat(
+            (self.attention_hidden, self.attention_context), -1)
+
+        return decoder_rnn_input, self.attention_context, attention_weights
+
+    def decode(self, decoder_input):
+        for i in range(self.num_decoder_rnn_layer):
+            if i == 0:
+                self.decoder_hiddens[i], self.decoder_cells[i] = self.decoder_rnn_layers[i](
+                    decoder_input, (self.decoder_hiddens[i], self.decoder_cells[i]))
+            else:
+                self.decoder_hiddens[i], self.decoder_cells[i] = self.decoder_rnn_layers[i](
+                    self.decoder_hiddens[i-1], (self.decoder_hiddens[i], self.decoder_cells[i]))
+        return self.decoder_hiddens[-1]
+    
+    def forward(self, memory, mel_inputs, memory_lengths):
+        """ Decoder forward pass for training
+        Args:
+            memory: (B, T_enc, enc_dim) Encoder outputs
+            mel_inputs: (B, T, num_mels) Ground-truth mels used as decoder inputs for teacher forcing.
+            memory_lengths: (B, ) Encoder output lengths for attention masking.
+        Returns:
+            mel_outputs: (B, T, num_mels) mel outputs from the decoder
+            alignments: (B, T//r, T_enc) attention weights.
+        """
+        # [1, B, num_mels]
+        go_frame = self.get_go_frame(memory).unsqueeze(0)
+        # [T//r, B, num_mels]
+        mel_inputs = self.parse_decoder_inputs(mel_inputs)
+        # [T//r + 1, B, num_mels]
+        mel_inputs = torch.cat((go_frame, mel_inputs), dim=0)
+        # [T//r + 1, B, prenet_dim]
+        decoder_inputs = self.prenet(mel_inputs) 
+        # decoder_inputs_pitch = self.prenet_pitch(decoder_inputs__)
+
+        self.initialize_decoder_states(
+            memory, mask=~get_mask_from_lengths(memory_lengths),
+        )
+        
+        self.attention_layer.init_states(memory)
+        # self.attention_layer_pitch.init_states(memory_pitch)
+
+        mel_outputs, alignments = [], []
+        if self.use_stop_tokens:
+            stop_outputs = []
+        else:
+            stop_outputs = None
+        while len(mel_outputs) < decoder_inputs.size(0) - 1:
+            decoder_input = decoder_inputs[len(mel_outputs)]
+            # decoder_input_pitch = decoder_inputs_pitch[len(mel_outputs)]
+
+            decoder_rnn_input, context, attention_weights = self.attend(decoder_input)
+
+            decoder_rnn_output = self.decode(decoder_rnn_input)
+            if self.concat_context_to_last:    
+                decoder_rnn_output = torch.cat(
+                    (decoder_rnn_output, context), dim=1)
+                   
+            mel_output = self.linear_projection(decoder_rnn_output)
+            if self.use_stop_tokens:
+                stop_output = self.stop_layer(decoder_rnn_output)
+                stop_outputs += [stop_output.squeeze()]
+            mel_outputs += [mel_output.squeeze(1)] #? perhaps don't need squeeze
+            alignments += [attention_weights]
+            # alignments_pitch += [attention_weights_pitch]   
+
+        mel_outputs, alignments, stop_outputs = self.parse_decoder_outputs(
+            mel_outputs, alignments, stop_outputs)
+        if stop_outputs is None:
+            return mel_outputs, alignments
+        else:
+            return mel_outputs, stop_outputs, alignments
+
+    def inference(self, memory, stop_threshold=0.5):
+        """ Decoder inference
+        Args:
+            memory: (1, T_enc, D_enc) Encoder outputs
+        Returns:
+            mel_outputs: mel outputs from the decoder
+            alignments: sequence of attention weights from the decoder
+        """
+        # [1, num_mels]
+        decoder_input = self.get_go_frame(memory)
+
+        self.initialize_decoder_states(memory, mask=None)
+
+        self.attention_layer.init_states(memory)
+        
+        mel_outputs, alignments = [], []
+        # NOTE(sx): heuristic 
+        max_decoder_step = memory.size(1)*self.encoder_down_factor//self.frames_per_step 
+        min_decoder_step = memory.size(1)*self.encoder_down_factor // self.frames_per_step - 5
+        while True:
+            decoder_input = self.prenet(decoder_input)
+
+            decoder_input_final, context, alignment = self.attend(decoder_input)
+
+            #mel_output, stop_output, alignment = self.decode(decoder_input)
+            decoder_rnn_output = self.decode(decoder_input_final)
+            if self.concat_context_to_last:    
+                decoder_rnn_output = torch.cat(
+                    (decoder_rnn_output, context), dim=1)
+            
+            mel_output = self.linear_projection(decoder_rnn_output)
+            stop_output = self.stop_layer(decoder_rnn_output)
+            
+            mel_outputs += [mel_output.squeeze(1)]
+            alignments += [alignment]
+            
+            if torch.sigmoid(stop_output.data) > stop_threshold and len(mel_outputs) >= min_decoder_step:
+                break
+            if len(mel_outputs) >= max_decoder_step:
+                # print("Warning! Decoding steps reaches max decoder steps.")
+                break
+
+            decoder_input = mel_output[:,-self.num_mels:]
+
+
+        mel_outputs, alignments, _  = self.parse_decoder_outputs(
+            mel_outputs, alignments, None)
+
+        return mel_outputs, alignments
+
+    def inference_batched(self, memory, stop_threshold=0.5):
+        """ Batched decoder inference
+        Args:
+            memory: (B, T_enc, D_enc) Encoder outputs
+        Returns:
+            mel_outputs: mel outputs from the decoder
+            alignments: sequence of attention weights from the decoder
+        """
+        # [B, num_mels]
+        decoder_input = self.get_go_frame(memory)
+
+        self.initialize_decoder_states(memory, mask=None)
+
+        self.attention_layer.init_states(memory)
+        
+        mel_outputs, alignments = [], []
+        stop_outputs = []
+        # NOTE(sx): heuristic 
+        max_decoder_step = memory.size(1)*self.encoder_down_factor//self.frames_per_step 
+        min_decoder_step = memory.size(1)*self.encoder_down_factor // self.frames_per_step - 5
+        while True:
+            decoder_input = self.prenet(decoder_input)
+
+            decoder_input_final, context, alignment = self.attend(decoder_input)
+
+            #mel_output, stop_output, alignment = self.decode(decoder_input)
+            decoder_rnn_output = self.decode(decoder_input_final)
+            if self.concat_context_to_last:    
+                decoder_rnn_output = torch.cat(
+                    (decoder_rnn_output, context), dim=1)
+            
+            mel_output = self.linear_projection(decoder_rnn_output)
+            # (B, 1)
+            stop_output = self.stop_layer(decoder_rnn_output)
+            stop_outputs += [stop_output.squeeze()]
+            # stop_outputs.append(stop_output) 
+
+            mel_outputs += [mel_output.squeeze(1)]
+            alignments += [alignment]
+            # print(stop_output.shape)
+            if torch.all(torch.sigmoid(stop_output.squeeze().data) > stop_threshold) \
+                    and len(mel_outputs) >= min_decoder_step:
+                break
+            if len(mel_outputs) >= max_decoder_step:
+                # print("Warning! Decoding steps reaches max decoder steps.")
+                break
+
+            decoder_input = mel_output[:,-self.num_mels:]
+
+
+        mel_outputs, alignments, stop_outputs = self.parse_decoder_outputs(
+            mel_outputs, alignments, stop_outputs)
+        mel_outputs_stacked = []
+        for mel, stop_logit in zip(mel_outputs, stop_outputs):
+            idx = np.argwhere(torch.sigmoid(stop_logit.cpu()) > stop_threshold)[0][0].item()
+            mel_outputs_stacked.append(mel[:idx,:])
+        mel_outputs = torch.cat(mel_outputs_stacked, dim=0).unsqueeze(0)
+        return mel_outputs, alignments
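
Review note: both inference loops bound the number of decoder steps with the same heuristic. The sketch below restates that arithmetic as a standalone helper so the bounds are easy to check by hand. The function name `decoder_step_bounds` and the `slack` parameter are hypothetical and are not part of this patch; the value 5 is taken from the code above.

```python
def decoder_step_bounds(t_enc, encoder_down_factor, frames_per_step, slack=5):
    """Return (min_step, max_step) as computed in inference() above.

    t_enc is memory.size(1), the number of encoder frames.
    `slack` is a hypothetical name for the hard-coded 5 in the patch.
    """
    # Upper bound: encoder frames, scaled back up by the encoder
    # downsampling factor, divided by frames emitted per decoder step.
    max_step = t_enc * encoder_down_factor // frames_per_step
    # Lower bound: stop tokens are ignored until this many steps have run.
    min_step = max_step - slack
    return min_step, max_step


if __name__ == "__main__":
    print(decoder_step_bounds(100, 2, 4))
```

For very short inputs `min_step` can go negative, in which case the stop token alone decides when decoding ends.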
--- a/ppg2mel/train.py
+++ b/ppg2mel/train.py
@@ -0,0 +1,67 @@
+import sys
+import torch
+import argparse
+import numpy as np
+from utils.load_yaml import HpsYaml
+from ppg2mel.train.train_linglf02mel_seq2seq_oneshotvc import Solver
+
+# For reproducibility; commenting these out may speed up training
+torch.backends.cudnn.deterministic = True
+torch.backends.cudnn.benchmark = False
+
+def main():
+    # Arguments
+    parser = argparse.ArgumentParser(description=
+            'Training PPG2Mel VC model.')
+    parser.add_argument('--config', type=str, 
+                        help='Path to experiment config, e.g., config/vc.yaml')
+    parser.add_argument('--name', default=None, type=str, help='Name for logging.')
+    parser.add_argument('--logdir', default='log/', type=str,
+                        help='Logging path.', required=False)
+    parser.add_argument('--ckpdir', default='ckpt/', type=str,
+                        help='Checkpoint path.', required=False)
+    parser.add_argument('--outdir', default='result/', type=str,
+                        help='Decode output path.', required=False)
+    parser.add_argument('--load', default=None, type=str,
+                        help='Load pre-trained model (for training only)', required=False)
+    parser.add_argument('--warm_start', action='store_true',
+                        help='Load model weights only, ignore specified layers.')
+    parser.add_argument('--seed', default=0, type=int,
+                        help='Random seed for reproducible results.', required=False)
+    parser.add_argument('--njobs', default=8, type=int,
+                        help='Number of threads for dataloader/decoding.', required=False)
+    parser.add_argument('--cpu', action='store_true', help='Disable GPU training.')
+    parser.add_argument('--no-pin', action='store_true',
+                        help='Disable pin-memory for dataloader')
+    parser.add_argument('--test', action='store_true', help='Test the model.')
+    parser.add_argument('--no-msg', action='store_true', help='Hide all messages.')
+    parser.add_argument('--finetune', action='store_true', help='Finetune model')
+    parser.add_argument('--oneshotvc', action='store_true', help='Oneshot VC model')
+    parser.add_argument('--bilstm', action='store_true', help='BiLSTM VC model')
+    parser.add_argument('--lsa', action='store_true', help='Use location-sensitive attention (LSA)')
+
+    ###
+
+    paras = parser.parse_args()
+    setattr(paras, 'gpu', not paras.cpu)
+    setattr(paras, 'pin_memory', not paras.no_pin)
+    setattr(paras, 'verbose', not paras.no_msg)
+    # Make the config dict accessible via dot notation
+    config = HpsYaml(paras.config)
+
+    np.random.seed(paras.seed)
+    torch.manual_seed(paras.seed)
+    if torch.cuda.is_available():
+        torch.cuda.manual_seed_all(paras.seed)
+
+    print(">>> OneShot VC training ...")
+    mode = "train"
+    solver = Solver(config, paras, mode)
+    solver.load_data()
+    solver.set_model()
+    solver.exec()
+    print(">>> OneShot VC training finished!")
+    sys.exit(0)
+
+if __name__ == "__main__":
+    main()   
--- a/ppg2mel/train/__init__.py
+++ b/ppg2mel/train/__init__.py
@@ -0,0 +1 @@
+#
--- a/ppg2mel/train/loss.py
+++ b/ppg2mel/train/loss.py
@@ -0,0 +1,50 @@
+from typing import Dict
+from typing import Tuple
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+from ..utils.nets_utils import make_pad_mask
+
+
+class MaskedMSELoss(nn.Module):
+    def __init__(self, frames_per_step):
+        super().__init__()
+        self.frames_per_step = frames_per_step
+        self.mel_loss_criterion = nn.MSELoss(reduction='none')
+        # self.loss = nn.MSELoss()
+        self.stop_loss_criterion = nn.BCEWithLogitsLoss(reduction='none')   
+
+    def get_mask(self, lengths, max_len=None):
+        # lengths: [B,]
+        if max_len is None:
+            max_len = torch.max(lengths)
+        batch_size = lengths.size(0)
+        seq_range = torch.arange(0, max_len).long()
+        seq_range_expand = seq_range.unsqueeze(0).expand(batch_size, max_len).to(lengths.device)
+        seq_length_expand = lengths.unsqueeze(1).expand_as(seq_range_expand)
+        return (seq_range_expand < seq_length_expand).float()
+
+    def forward(self, mel_pred, mel_pred_postnet, mel_trg, lengths, 
+                stop_target, stop_pred):
+        ## process stop_target
+        B = stop_target.size(0)
+        stop_target = stop_target.reshape(B, -1, self.frames_per_step)[:, :, 0]
+        stop_lengths = torch.ceil(lengths.float() / self.frames_per_step).long()
+        stop_mask = self.get_mask(stop_lengths, int(mel_trg.size(1)/self.frames_per_step))
+
+        mel_trg.requires_grad = False
+        # (B, T, 1)
+        mel_mask = self.get_mask(lengths, mel_trg.size(1)).unsqueeze(-1)
+        # (B, T, D)
+        mel_mask = mel_mask.expand_as(mel_trg)
+        mel_loss_pre = (self.mel_loss_criterion(mel_pred, mel_trg) * mel_mask).sum() / mel_mask.sum()
+        mel_loss_post = (self.mel_loss_criterion(mel_pred_postnet, mel_trg) * mel_mask).sum() / mel_mask.sum()
+        
+        mel_loss = mel_loss_pre + mel_loss_post
+
+        # stop token loss
+        stop_loss = torch.sum(self.stop_loss_criterion(stop_pred, stop_target) * stop_mask) / stop_mask.sum()
+        
+        return mel_loss, stop_loss
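
Review note: the masking in `MaskedMSELoss` can be checked in isolation. The following is a minimal NumPy sketch (not the PyTorch code from this patch) of the same idea: build a 0/1 mask from the sequence lengths, zero out the squared error on padded frames, and normalise by the number of valid elements. The names `length_mask` and `masked_mse` are hypothetical.

```python
import numpy as np


def length_mask(lengths, max_len=None):
    """Return a (B, max_len) float mask: 1.0 for valid frames, 0.0 for padding."""
    lengths = np.asarray(lengths)
    if max_len is None:
        max_len = int(lengths.max())
    # Compare each time index against each sequence length via broadcasting.
    return (np.arange(max_len)[None, :] < lengths[:, None]).astype(np.float32)


def masked_mse(pred, target, lengths):
    """Mean squared error over valid (unpadded) elements only."""
    mask = length_mask(lengths, target.shape[1])[:, :, None]  # (B, T, 1)
    mask = np.broadcast_to(mask, target.shape)                # (B, T, D)
    return float((((pred - target) ** 2) * mask).sum() / mask.sum())


if __name__ == "__main__":
    target = np.ones((2, 3, 2), dtype=np.float32)
    pred = np.zeros((2, 3, 2), dtype=np.float32)
    pred[0, 2, :] = 100.0  # padded frame: must not affect the loss
    print(length_mask([2, 3]))
    print(masked_mse(pred, target, [2, 3]))
```

Dividing by `mask.sum()` rather than by the full tensor size keeps the loss scale independent of how much padding a batch contains.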
--- a/ppg2mel/train/optim.py
+++ b/ppg2mel/train/optim.py
@@ -0,0 +1,45 @@
+import torch
+import numpy as np
+
+
+class Optimizer():
+    def __init__(self, parameters, optimizer, lr, eps, lr_scheduler, 
+                **kwargs):
+
+        # Setup torch optimizer
+        self.opt_type = optimizer
+        self.init_lr = lr
+        self.sch_type = lr_scheduler
+        opt = getattr(torch.optim, optimizer)
+        if lr_scheduler == 'warmup':
+            warmup_step = 4000.0
+            init_lr = lr
+            self.lr_scheduler = lambda step: init_lr * warmup_step ** 0.5 * \
+                np.minimum((step+1)*warmup_step**-1.5, (step+1)**-0.5)
+            self.opt = opt(parameters, lr=1.0)
+        else:
+            self.lr_scheduler = None
+            self.opt = opt(parameters, lr=lr, eps=eps)  # ToDo: 1e-8 better?
+
+    def get_opt_state_dict(self):
+        return self.opt.state_dict()
+
+    def load_opt_state_dict(self, state_dict):
+        self.opt.load_state_dict(state_dict)
+
+    def pre_step(self, step):
+        if self.lr_scheduler is not None:
+            cur_lr = self.lr_scheduler(step)
+            for param_group in self.opt.param_groups:
+                param_group['lr'] = cur_lr
+        else:
+            cur_lr = self.init_lr
+        self.opt.zero_grad()
+        return cur_lr 
+ 
+    def step(self):
+        self.opt.step()
+
+    def create_msg(self):
+        return ['Optim.Info.| Algo. = {}\t| Lr = {}\t (schedule = {})'
+                .format(self.opt_type, self.init_lr, self.sch_type)]
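
Review note: the `'warmup'` branch sets the optimizer's base `lr` to 1.0 and overwrites each param group's `lr` in `pre_step`. The sketch below reproduces only the schedule formula as a plain function so its shape can be inspected: linear growth up to `warmup_step`, then inverse square root decay, peaking at `init_lr`. The function name `warmup_lr` is hypothetical; the formula and the 4000-step constant are copied from the lambda above.

```python
import numpy as np


def warmup_lr(step, init_lr, warmup_step=4000.0):
    """Learning rate at a given 0-based step, mirroring the lambda in Optimizer."""
    return init_lr * warmup_step ** 0.5 * np.minimum(
        (step + 1) * warmup_step ** -1.5,  # linear warmup term
        (step + 1) ** -0.5,                # inverse-sqrt decay term
    )


if __name__ == "__main__":
    for s in (0, 999, 3999, 15999):
        print(s, warmup_lr(s, init_lr=1e-3))
```

The two terms are equal when `step + 1 == warmup_step`, which is where the rate reaches `init_lr`; quadrupling the step count after that point halves the rate.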
--- a/ppg2mel/train/option.py
+++ b/ppg2mel/train/option.py
@@ -0,0 +1,10 @@
+# Default parameters which will be imported by solver
+default_hparas = {
+    'GRAD_CLIP': 5.0,          # Grad. clip threshold
+    'PROGRESS_STEP': 100,      # Std. output refresh freq.
+    # Decode steps for objective validation (step = ratio*input_txt_len)
+    'DEV_STEP_RATIO': 1.2,
+    # Number of examples (alignment/text) to show in tensorboard
+    'DEV_N_EXAMPLE': 4,
+    'TB_FLUSH_FREQ': 180       # Update frequency of tensorboard (secs)
+}
--- a/ppg2mel/train/solver.py
+++ b/ppg2mel/train/solver.py
@@ -0,0 +1,216 @@
+import os
+import sys
+import abc
+import math
+import yaml
+import torch
+from torch.utils.tensorboard import SummaryWriter
+
+from .option import default_hparas
+from utils.util import human_format, Timer
+from utils.load_yaml import HpsYaml
+
+
+class BaseSolver():
+    ''' 
+    Prototype Solver for all kinds of tasks
+    Arguments
+        config - yaml-styled config
+        paras  - argparse outcome
+        mode   - "train"/"test"
+    '''
+
+    def __init__(self, config, paras, mode="train"):
+        # General Settings
+        self.config = config  # load from yaml file
+        self.paras = paras    # command line args  
+        self.mode = mode      # 'train' or 'test'
+        for k, v in default_hparas.items():
+            setattr(self, k, v)
+        self.device = torch.device('cuda') if self.paras.gpu and torch.cuda.is_available() \
+                    else torch.device('cpu')
+
+        # Name experiment
+        self.exp_name = paras.name
+        if self.exp_name is None:
+            if 'exp_name' in self.config:
+                self.exp_name = self.config.exp_name
+            else:
+                # By default, exp is named after config file
+                self.exp_name = os.path.basename(paras.config).replace('.yaml', '')
+            if mode == 'train':
+                self.exp_name += '_seed{}'.format(paras.seed)
+                    
+
+        if mode == 'train':
+            # Filepath setup
+            os.makedirs(paras.ckpdir, exist_ok=True)
+            self.ckpdir = os.path.join(paras.ckpdir, self.exp_name)
+            os.makedirs(self.ckpdir, exist_ok=True)
+
+            # Logger settings
+            self.logdir = os.path.join(paras.logdir, self.exp_name)
+            self.log = SummaryWriter(
+                self.logdir, flush_secs=self.TB_FLUSH_FREQ)
+            self.timer = Timer()
+
+            # Hyper-parameters
+            self.step = 0
+            self.valid_step = config.hparas.valid_step
+            self.max_step = config.hparas.max_step
+
+            self.verbose('Exp. name : {}'.format(self.exp_name))
+            self.verbose('Loading data... large corpus may take a while.')
+
+        # elif mode == 'test':
+            # # Output path
+            # os.makedirs(paras.outdir, exist_ok=True)
+            # self.ckpdir = os.path.join(paras.outdir, self.exp_name)
+
+            # Load training config to get acoustic feat and build model
+            # self.src_config = HpsYaml(config.src.config) 
+            # self.paras.load = config.src.ckpt
+
+            # self.verbose('Evaluating result of tr. config @ {}'.format(
+                # config.src.config))
+
+    def backward(self, loss):
+        '''
+        Standard backward step with self.timer and debugger
+        Arguments
+            loss - the loss to perform loss.backward()
+        '''
+        self.timer.set()
+        loss.backward()
+        grad_norm = torch.nn.utils.clip_grad_norm_(
+            self.model.parameters(), self.GRAD_CLIP)
+        if math.isnan(grad_norm):
+            self.verbose('Error : grad norm is NaN @ step '+str(self.step))
+        else:
+            self.optimizer.step()
+        self.timer.cnt('bw')
+        return grad_norm
+
+    def load_ckpt(self):
+        ''' Load ckpt if --load option is specified '''
+        if self.paras.load is not None:
+            if self.paras.warm_start:
+                self.verbose(f"Warm starting model from checkpoint {self.paras.load}.")
+                ckpt = torch.load(
+                    self.paras.load, map_location=self.device if self.mode == 'train'
+                                                        else 'cpu')
+                model_dict = ckpt['model']
+                if len(self.config.model.ignore_layers) > 0:
+                    model_dict = {k:v for k, v in model_dict.items()
+                                  if k not in self.config.model.ignore_layers}
+                    dummy_dict = self.model.state_dict()
+                    dummy_dict.update(model_dict)
+                    model_dict = dummy_dict
+                self.model.load_state_dict(model_dict)
+            else:
+                # Load weights
+                ckpt = torch.load(
+                    self.paras.load, map_location=self.device if self.mode == 'train'
+                                                else 'cpu')
+                self.model.load_state_dict(ckpt['model'])
+
+                # Load task-dependent items
+                if self.mode == 'train':
+                    self.step = ckpt['global_step']
+                    self.optimizer.load_opt_state_dict(ckpt['optimizer'])
+                    self.verbose('Load ckpt from {}, restarting at step {}'.format(
+                        self.paras.load, self.step))
+                else:
+                    for k, v in ckpt.items():
+                        if type(v) is float:
+                            metric, score = k, v
+                    self.model.eval()
+                    self.verbose('Evaluation target = {} (recorded {} = {:.2f} %)'.format(
+                        self.paras.load, metric, score))
+
+    def verbose(self, msg):
+        ''' Verbose function for printing information to stdout '''
+        if self.paras.verbose:
+            if type(msg) == list:
+                for m in msg:
+                    print('[INFO]', m.ljust(100))
+            else:
+                print('[INFO]', msg.ljust(100))
+
+    def progress(self, msg):
+        ''' Verbose function for updating progress on stdout (do not include newline) '''
+        if self.paras.verbose:
+            sys.stdout.write("\033[K")  # Clear line
+            print('[{}] {}'.format(human_format(self.step), msg), end='\r')
+
+    def write_log(self, log_name, log_dict):
+        '''
+        Write log to TensorBoard
+            log_name  - <str> Name of tensorboard variable 
+            log_dict  - <dict>/<array> Value of variable (e.g. dict of losses), skipped if None
+        '''
+        if type(log_dict) is dict:
+            log_dict = {key: val for key, val in log_dict.items() if (
+                val is not None and not math.isnan(val))}
+        if log_dict is None:
+            pass
+        elif len(log_dict) > 0:
+            if 'align' in log_name or 'spec' in log_name:
+                img, form = log_dict
+                self.log.add_image(
+                    log_name, img, global_step=self.step, dataformats=form)
+            elif 'text' in log_name or 'hyp' in log_name:
+                self.log.add_text(log_name, log_dict, self.step)
+            else:
+                self.log.add_scalars(log_name, log_dict, self.step)
+
+    def save_checkpoint(self, f_name, metric, score, show_msg=True):
+        '''
+        Ckpt saver
+            f_name - <str> the name of ckpt file (w/o prefix) to store, overwritten if it exists
+            metric, score - <str>, <float> name and value of the metric used to evaluate the model
+        '''
+        ckpt_path = os.path.join(self.ckpdir, f_name)
+        full_dict = {
+            "model": self.model.state_dict(),
+            "optimizer": self.optimizer.get_opt_state_dict(),
+            "global_step": self.step,
+            metric: score
+        }
+
+        torch.save(full_dict, ckpt_path)
+        if show_msg:
+            self.verbose("Saved checkpoint (step = {}, {} = {:.2f}) and status @ {}".
+                         format(human_format(self.step), metric, score, ckpt_path))
+
+
+    # ----------------------------------- Abstract Methods ----------------------------------------- #
+    @abc.abstractmethod
+    def load_data(self):
+        '''
+        Called by main to load all data
+        After this call, data related attributes should be set up (e.g. self.tr_set, self.dev_set)
+        No return value
+        '''
+        raise NotImplementedError
+
+    @abc.abstractmethod
+    def set_model(self):
+        '''
+        Called by main to set models
+        After this call, model related attributes should be set up (e.g. self.l2_loss)
+        The following MUST be set up
+            - self.model (torch.nn.Module)
+            - self.optimizer (Optimizer from .optim),
+                init. w/ self.optimizer = Optimizer(self.model.parameters(),**self.config['hparas'])
+        Loading a pre-trained model should also be performed here
+        No return value
+        '''
+        raise NotImplementedError
+
+    @abc.abstractmethod
+    def exec(self):
+        '''
+        Called by main to execute training/inference
+        '''
+        raise NotImplementedError
--- a/ppg2mel/train/train_linglf02mel_seq2seq_oneshotvc.py
+++ b/ppg2mel/train/train_linglf02mel_seq2seq_oneshotvc.py
@@ -0,0 +1,288 @@
+import os, sys
+# sys.path.append('/home/shaunxliu/projects/nnsp')
+import matplotlib
+matplotlib.use("Agg")
+import matplotlib.pyplot as plt
+from matplotlib.ticker import MaxNLocator
+import torch
+from torch.utils.data import DataLoader
+import numpy as np
+from .solver import BaseSolver
+from utils.data_load import OneshotVcDataset, MultiSpkVcCollate
+# from src.rnn_ppg2mel import BiRnnPpg2MelModel
+# from src.mel_decoder_mol_encAddlf0 import MelDecoderMOL
+from .loss import MaskedMSELoss
+from .optim import Optimizer
+from utils.util import human_format
+from ppg2mel import MelDecoderMOLv2
+
+
+class Solver(BaseSolver):
+    """Customized Solver."""
+    def __init__(self, config, paras, mode):
+        super().__init__(config, paras, mode)
+        self.num_att_plots = 5
+        self.att_ws_dir = f"{self.logdir}/att_ws"
+        os.makedirs(self.att_ws_dir, exist_ok=True)
+        self.best_loss = np.inf
+
+    def fetch_data(self, data):
+        """Move data to device"""
+        data = [i.to(self.device) for i in data]
+        return data
+
+    def load_data(self):
+        """ Load data for training/validation/plotting."""
+        train_dataset = OneshotVcDataset(
+            meta_file=self.config.data.train_fid_list,
+            vctk_ppg_dir=self.config.data.vctk_ppg_dir,
+            libri_ppg_dir=self.config.data.libri_ppg_dir,
+            vctk_f0_dir=self.config.data.vctk_f0_dir,
+            libri_f0_dir=self.config.data.libri_f0_dir,
+            vctk_wav_dir=self.config.data.vctk_wav_dir,
+            libri_wav_dir=self.config.data.libri_wav_dir,
+            vctk_spk_dvec_dir=self.config.data.vctk_spk_dvec_dir,
+            libri_spk_dvec_dir=self.config.data.libri_spk_dvec_dir,
+            ppg_file_ext=self.config.data.ppg_file_ext,
+            min_max_norm_mel=self.config.data.min_max_norm_mel,
+            mel_min=self.config.data.mel_min,
+            mel_max=self.config.data.mel_max,
+        )
+        dev_dataset = OneshotVcDataset(
+            meta_file=self.config.data.dev_fid_list,
+            vctk_ppg_dir=self.config.data.vctk_ppg_dir,
+            libri_ppg_dir=self.config.data.libri_ppg_dir,
+            vctk_f0_dir=self.config.data.vctk_f0_dir,
+            libri_f0_dir=self.config.data.libri_f0_dir,
+            vctk_wav_dir=self.config.data.vctk_wav_dir,
+            libri_wav_dir=self.config.data.libri_wav_dir,
+            vctk_spk_dvec_dir=self.config.data.vctk_spk_dvec_dir,
+            libri_spk_dvec_dir=self.config.data.libri_spk_dvec_dir,
+            ppg_file_ext=self.config.data.ppg_file_ext,
+            min_max_norm_mel=self.config.data.min_max_norm_mel,
+            mel_min=self.config.data.mel_min,
+            mel_max=self.config.data.mel_max,
+        )
+        self.train_dataloader = DataLoader(
+            train_dataset,
+            num_workers=self.paras.njobs,
+            shuffle=True,
+            batch_size=self.config.hparas.batch_size,
+            pin_memory=False,
+            drop_last=True,
+            collate_fn=MultiSpkVcCollate(self.config.model.frames_per_step,
+                                        use_spk_dvec=True),
+        )
+        self.dev_dataloader = DataLoader(
+            dev_dataset,
+            num_workers=self.paras.njobs,
+            shuffle=False,
+            batch_size=self.config.hparas.batch_size,
+            pin_memory=False,
+            drop_last=False,
+            collate_fn=MultiSpkVcCollate(self.config.model.frames_per_step,
+                                         use_spk_dvec=True),
+        )
+        self.plot_dataloader = DataLoader(
+            dev_dataset,
+            num_workers=self.paras.njobs,
+            shuffle=False,
+            batch_size=1,
+            pin_memory=False,
+            drop_last=False,
+            collate_fn=MultiSpkVcCollate(self.config.model.frames_per_step,
+                                         use_spk_dvec=True,
+                                         give_uttids=True),
+        )
+        msg = "Have prepared training set and dev set."
+        self.verbose(msg)
+    
+    def load_pretrained_params(self):
+        print("Load pretrained model from: ", self.config.data.pretrain_model_file)
+        ignore_layer_prefixes = ["speaker_embedding_table"]
+        pretrain_model_file = self.config.data.pretrain_model_file
+        pretrain_ckpt = torch.load(
+            pretrain_model_file, map_location=self.device
+        )["model"]
+        model_dict = self.model.state_dict()
+        print(self.model)
+        
+        # 1. filter out unnecessary keys
+        for prefix in ignore_layer_prefixes:
+            pretrain_ckpt = {k : v 
+                             for k, v in pretrain_ckpt.items() if not k.startswith(prefix) 
+                            }
+        # 2. overwrite entries in the existing state dict
+        model_dict.update(pretrain_ckpt)
+
+        # 3. load the new state dict
+        self.model.load_state_dict(model_dict)
+
+    def set_model(self):
+        """Setup model and optimizer"""
+        # Model
+        print("[INFO] Model name: ", self.config["model_name"])
+        self.model = MelDecoderMOLv2(
+            **self.config["model"]
+        ).to(self.device)
+        # self.load_pretrained_params()
+
+        # model_params = [{'params': self.model.spk_embedding.weight}]
+        model_params = [{'params': self.model.parameters()}]
+        
+        # Loss criterion
+        self.loss_criterion = MaskedMSELoss(self.config.model.frames_per_step)
+
+        # Optimizer
+        self.optimizer = Optimizer(model_params, **self.config["hparas"])
+        self.verbose(self.optimizer.create_msg())
+
+        # Automatically load pre-trained model if self.paras.load is given
+        self.load_ckpt()
+
+    def exec(self):
+        self.verbose("Total training steps {}.".format(
+            human_format(self.max_step)))
+
+        mel_loss = None
+        n_epochs = 0
+        # Set as current time
+        self.timer.set()
+        
+        while self.step < self.max_step:
+            for data in self.train_dataloader:
+                # Pre-step: update lr_rate and do zero_grad
+                lr_rate = self.optimizer.pre_step(self.step)
+                total_loss = 0
+                # data to device
+                ppgs, lf0_uvs, mels, in_lengths, \
+                    out_lengths, spk_ids, stop_tokens = self.fetch_data(data)
+                self.timer.cnt("rd")
+                mel_outputs, mel_outputs_postnet, predicted_stop = self.model(
+                    ppgs,
+                    in_lengths,
+                    mels,
+                    out_lengths,
+                    lf0_uvs,
+                    spk_ids
+                ) 
+                mel_loss, stop_loss = self.loss_criterion(
+                    mel_outputs,
+                    mel_outputs_postnet,
+                    mels,
+                    out_lengths,
+                    stop_tokens,
+                    predicted_stop
+                )
+                loss = mel_loss + stop_loss
+
+                self.timer.cnt("fw")
+
+                # Back-prop
+                grad_norm = self.backward(loss)
+                self.step += 1
+
+                # Logger
+                if (self.step == 1) or (self.step % self.PROGRESS_STEP == 0):
+                    self.progress("Tr|loss:{:.4f},mel-loss:{:.4f},stop-loss:{:.4f}|Grad.Norm-{:.2f}|{}"
+                                  .format(loss.cpu().item(), mel_loss.cpu().item(),
+                                    stop_loss.cpu().item(), grad_norm, self.timer.show()))
+                    self.write_log('loss', {'tr/loss': loss,
+                                            'tr/mel-loss': mel_loss,
+                                            'tr/stop-loss': stop_loss})
+
+                # Validation
+                if (self.step == 1) or (self.step % self.valid_step == 0):
+                    self.validate()
+
+                # End of step
+                # https://github.com/pytorch/pytorch/issues/13246#issuecomment-529185354
+                torch.cuda.empty_cache()
+                self.timer.set()
+                if self.step >= self.max_step:
+                    break
+            n_epochs += 1
+        self.log.close()
+
+    def validate(self):
+        self.model.eval()
+        dev_loss, dev_mel_loss, dev_stop_loss = 0.0, 0.0, 0.0
+
+        for i, data in enumerate(self.dev_dataloader):
+            self.progress('Valid step - {}/{}'.format(i+1, len(self.dev_dataloader)))
+            # Fetch data
+            ppgs, lf0_uvs, mels, in_lengths, \
+                out_lengths, spk_ids, stop_tokens = self.fetch_data(data)
+            with torch.no_grad():
+                mel_outputs, mel_outputs_postnet, predicted_stop = self.model(
+                    ppgs,
+                    in_lengths,
+                    mels,
+                    out_lengths,
+                    lf0_uvs,
+                    spk_ids
+                ) 
+                mel_loss, stop_loss = self.loss_criterion(
+                    mel_outputs,
+                    mel_outputs_postnet,
+                    mels,
+                    out_lengths,
+                    stop_tokens,
+                    predicted_stop
+                )
+                loss = mel_loss + stop_loss
+
+                dev_loss += loss.cpu().item()
+                dev_mel_loss += mel_loss.cpu().item()
+                dev_stop_loss += stop_loss.cpu().item()
+
+        dev_loss = dev_loss / (i + 1)
+        dev_mel_loss = dev_mel_loss / (i + 1)
+        dev_stop_loss = dev_stop_loss / (i + 1)
+        self.save_checkpoint(f'step_{self.step}.pth', 'loss', dev_loss, show_msg=False)
+        if dev_loss < self.best_loss:
+            self.best_loss = dev_loss
+            self.save_checkpoint(f'best_loss_step_{self.step}.pth', 'loss', dev_loss)
+        self.write_log('loss', {'dv/loss': dev_loss,
+                                'dv/mel-loss': dev_mel_loss,
+                                'dv/stop-loss': dev_stop_loss})
+
+        # plot attention
+        for i, data in enumerate(self.plot_dataloader):
+            if i == self.num_att_plots:
+                break
+            # Fetch data
+            ppgs, lf0_uvs, mels, in_lengths, \
+                out_lengths, spk_ids, stop_tokens = self.fetch_data(data[:-1])
+            fid = data[-1][0]
+            with torch.no_grad():
+                _, _, _, att_ws = self.model(
+                    ppgs,
+                    in_lengths,
+                    mels,
+                    out_lengths,
+                    lf0_uvs,
+                    spk_ids,
+                    output_att_ws=True
+                )
+                att_ws = att_ws.squeeze(0).cpu().numpy()
+                att_ws = att_ws[None]
+                w, h = plt.figaspect(1.0 / len(att_ws))
+                fig = plt.Figure(figsize=(w * 1.3, h * 1.3))
+                axes = fig.subplots(1, len(att_ws))
+                if len(att_ws) == 1:
+                    axes = [axes]
+
+                for ax, aw in zip(axes, att_ws):
+                    ax.imshow(aw.astype(np.float32), aspect="auto")
+                    ax.set_title(f"{fid}")
+                    ax.set_xlabel("Input")
+                    ax.set_ylabel("Output")
+                    ax.xaxis.set_major_locator(MaxNLocator(integer=True))
+                    ax.yaxis.set_major_locator(MaxNLocator(integer=True))
+                fig_name = f"{self.att_ws_dir}/{fid}_step{self.step}.png"
+                fig.savefig(fig_name)
+                
+        # Resume training
+        self.model.train()
+
--- a/ppg2mel/utils/abs_model.py
+++ b/ppg2mel/utils/abs_model.py
@@ -0,0 +1,23 @@
+from abc import ABC
+from abc import abstractmethod
+
+import torch
+
+class AbsMelDecoder(torch.nn.Module, ABC):
+    """Abstract base class for PPG-based voice conversion mel decoders.
+    This "model" is one of the mediator objects for the "Task" class.
+
+    """
+
+    @abstractmethod
+    def forward(
+        self, 
+        bottle_neck_features: torch.Tensor,
+        feature_lengths: torch.Tensor,
+        speech: torch.Tensor,
+        speech_lengths: torch.Tensor,
+        logf0_uv: torch.Tensor = None,
+        spembs: torch.Tensor = None,
+        styleembs: torch.Tensor = None,
+    ) -> torch.Tensor:
+        raise NotImplementedError
--- a/ppg2mel/utils/basic_layers.py
+++ b/ppg2mel/utils/basic_layers.py
@@ -0,0 +1,79 @@
+import torch
+from torch import nn
+from torch.nn import functional as F
+from torch.autograd import Function
+
+def tile(x, count, dim=0):
+    """
+    Tiles x on dimension dim count times.
+    """
+    perm = list(range(len(x.size())))
+    if dim != 0:
+        perm[0], perm[dim] = perm[dim], perm[0]
+        x = x.permute(perm).contiguous()
+    out_size = list(x.size())
+    out_size[0] *= count
+    batch = x.size(0)
+    x = x.view(batch, -1) \
+         .transpose(0, 1) \
+         .repeat(count, 1) \
+         .transpose(0, 1) \
+         .contiguous() \
+         .view(*out_size)
+    if dim != 0:
+        x = x.permute(perm).contiguous()
+    return x
+
+class Linear(torch.nn.Module):
+    def __init__(self, in_dim, out_dim, bias=True, w_init_gain='linear'):
+        super(Linear, self).__init__()
+        self.linear_layer = torch.nn.Linear(in_dim, out_dim, bias=bias)
+
+        torch.nn.init.xavier_uniform_(
+            self.linear_layer.weight,
+            gain=torch.nn.init.calculate_gain(w_init_gain))
+
+    def forward(self, x):
+        return self.linear_layer(x)
+
+class Conv1d(torch.nn.Module):
+    def __init__(self, in_channels, out_channels, kernel_size=1, stride=1,
+                 padding=None, dilation=1, bias=True, w_init_gain='linear', param=None):
+        super(Conv1d, self).__init__()
+        if padding is None:
+            assert(kernel_size % 2 == 1)
+            padding = int(dilation * (kernel_size - 1)/2)
+        
+        self.conv = torch.nn.Conv1d(in_channels, out_channels,
+                                    kernel_size=kernel_size, stride=stride,
+                                    padding=padding, dilation=dilation,
+                                    bias=bias)
+        torch.nn.init.xavier_uniform_(
+            self.conv.weight, gain=torch.nn.init.calculate_gain(w_init_gain, param=param))
+
+    def forward(self, x):
+        # x: BxDxT
+        return self.conv(x)
+
+
+
+def tile(x, count, dim=0):
+    """
+    Tiles x on dimension dim count times.
+    """
+    perm = list(range(len(x.size())))
+    if dim != 0:
+        perm[0], perm[dim] = perm[dim], perm[0]
+        x = x.permute(perm).contiguous()
+    out_size = list(x.size())
+    out_size[0] *= count
+    batch = x.size(0)
+    x = x.view(batch, -1) \
+         .transpose(0, 1) \
+         .repeat(count, 1) \
+         .transpose(0, 1) \
+         .contiguous() \
+         .view(*out_size)
+    if dim != 0:
+        x = x.permute(perm).contiguous()
+    return x
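Review note: `tile` repeats each entry along `dim` `count` times and keeps the repeats adjacent (entry 0, entry 0, ..., entry 1, entry 1, ...), rather than concatenating whole copies of the tensor. A minimal pure-Python sketch of that ordering, assuming nested lists stand in for a tensor; `tile_rows` is a hypothetical helper and not part of this diff:

```python
def tile_rows(rows, count):
    """Repeat each row `count` times, keeping the repeats adjacent."""
    return [list(row) for row in rows for _ in range(count)]


batch = [[1, 2], [3, 4]]
tiled = tile_rows(batch, 2)
print(tiled)
```

This is the ordering beam-search style decoders usually need, where every hypothesis of one utterance sits next to the others.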
--- a/ppg2mel/utils/cnn_postnet.py
+++ b/ppg2mel/utils/cnn_postnet.py
@@ -0,0 +1,52 @@
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from .basic_layers import Linear, Conv1d
+
+
+class Postnet(nn.Module):
+    """Postnet
+        - Five 1-D convolutions with 512 channels and kernel size 5
+    """
+    def __init__(self, num_mels=80,
+                 num_layers=5,
+                 hidden_dim=512,
+                 kernel_size=5):
+        super(Postnet, self).__init__()
+        self.convolutions = nn.ModuleList()
+
+        self.convolutions.append(
+            nn.Sequential(
+                Conv1d(
+                    num_mels, hidden_dim,
+                    kernel_size=kernel_size, stride=1,
+                    padding=int((kernel_size - 1) / 2),
+                    dilation=1, w_init_gain='tanh'),
+                nn.BatchNorm1d(hidden_dim)))
+
+        for i in range(1, num_layers - 1):
+            self.convolutions.append(
+                nn.Sequential(
+                    Conv1d(
+                        hidden_dim,
+                        hidden_dim,
+                        kernel_size=kernel_size, stride=1,
+                        padding=int((kernel_size - 1) / 2),
+                        dilation=1, w_init_gain='tanh'),
+                    nn.BatchNorm1d(hidden_dim)))
+
+        self.convolutions.append(
+            nn.Sequential(
+                Conv1d(
+                    hidden_dim, num_mels,
+                    kernel_size=kernel_size, stride=1,
+                    padding=int((kernel_size - 1) / 2),
+                    dilation=1, w_init_gain='linear'),
+                nn.BatchNorm1d(num_mels)))
+
+    def forward(self, x):
+        # x: (B, num_mels, T_dec)
+        for i in range(len(self.convolutions) - 1):
+            x = F.dropout(torch.tanh(self.convolutions[i](x)), 0.5, self.training)
+        x = F.dropout(self.convolutions[-1](x), 0.5, self.training)
+        return x
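Review note: both `Conv1d` and `Postnet` rely on odd kernel sizes with padding `(kernel_size - 1) / 2` (scaled by the dilation in `Conv1d`), so the time dimension is unchanged at stride 1. The sketch below checks this with the standard 1-D convolution length formula; `same_padding` and `conv1d_out_len` are hypothetical helper names used only for illustration:

```python
def same_padding(kernel_size, dilation=1):
    """Padding that preserves length for an odd kernel at stride 1."""
    assert kernel_size % 2 == 1, "only odd kernels give symmetric padding"
    return dilation * (kernel_size - 1) // 2


def conv1d_out_len(length, kernel_size, padding, stride=1, dilation=1):
    """Output length of a 1-D convolution (standard formula)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


frames = 100
out = conv1d_out_len(frames, 5, same_padding(5))
out_dilated = conv1d_out_len(37, 5, same_padding(5, dilation=2), dilation=2)
print(out, out_dilated)
```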
--- a/ppg2mel/utils/mol_attention.py
+++ b/ppg2mel/utils/mol_attention.py
@@ -0,0 +1,123 @@
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+
+class MOLAttention(nn.Module):
+    """ Discretized Mixture of Logistic (MOL) attention.
+    C.f. Section 5 of "MelNet: A Generative Model for Audio in the Frequency Domain" and 
+        GMMv2b model in "Location-relative attention mechanisms for robust long-form speech synthesis".
+    """
+    def __init__(
+        self,
+        query_dim,
+        r=1,
+        M=5,
+    ):
+        """
+        Args:
+            query_dim: attention_rnn_dim.
+            M: number of mixtures.
+        """
+        super().__init__()
+        if r < 1:
+            self.r = float(r)
+        else:
+            self.r = int(r)
+        self.M = M
+        self.score_mask_value = 0.0 # -float("inf")
+        self.eps = 1e-5
+        # Position array for encoder time steps
+        self.J = None
+        # Query layer: [w, sigma, Delta]
+        self.query_layer = torch.nn.Sequential(
+            nn.Linear(query_dim, 256, bias=True),
+            nn.ReLU(),
+            nn.Linear(256, 3*M, bias=True)
+        )
+        self.mu_prev = None
+        self.initialize_bias()
+
+    def initialize_bias(self):
+        """Initialize sigma and Delta."""
+        # sigma
+        torch.nn.init.constant_(self.query_layer[2].bias[self.M:2*self.M], 1.0)
+        # Delta: softplus(1.8545) = 2.0; softplus(3.9815) = 4.0; softplus(0.5413) = 1.0
+        # softplus(-0.432) = 0.5003
+        if self.r == 2:
+            torch.nn.init.constant_(self.query_layer[2].bias[2*self.M:3*self.M], 1.8545)
+        elif self.r == 4:
+            torch.nn.init.constant_(self.query_layer[2].bias[2*self.M:3*self.M], 3.9815)
+        elif self.r == 1:
+            torch.nn.init.constant_(self.query_layer[2].bias[2*self.M:3*self.M], 0.5413)
+        else:
+            torch.nn.init.constant_(self.query_layer[2].bias[2*self.M:3*self.M], -0.432)
+
+    
+    def init_states(self, memory):
+        """Initialize mu_prev and J.
+            This function should be called by the decoder before decoding one batch.
+        Args:
+            memory: (B, T, D_enc) encoder output.
+        """
+        B, T_enc, _ = memory.size()
+        device = memory.device
+        self.J = torch.arange(0, T_enc + 2.0).to(device) + 0.5  # NOTE: for discretize usage
+        # self.J = memory.new_tensor(np.arange(T_enc), dtype=torch.float)
+        self.mu_prev = torch.zeros(B, self.M).to(device)
+
+    def forward(self, att_rnn_h, memory, memory_pitch=None, mask=None):
+        """
+        att_rnn_h: attention rnn hidden state.
+        memory: encoder outputs (B, T_enc, D).
+        mask: binary mask for padded data (B, T_enc).
+        """
+        # [B, 3M]
+        mixture_params = self.query_layer(att_rnn_h)
+        
+        # [B, M]
+        w_hat = mixture_params[:, :self.M]
+        sigma_hat = mixture_params[:, self.M:2*self.M]
+        Delta_hat = mixture_params[:, 2*self.M:3*self.M]
+        
+        # print("w_hat: ", w_hat)
+        # print("sigma_hat: ", sigma_hat)
+        # print("Delta_hat: ", Delta_hat)
+
+        # Dropout to de-correlate attention heads
+        w_hat = F.dropout(w_hat, p=0.5, training=self.training) # NOTE(sx): needed?
+        
+        # Mixture parameters
+        w = torch.softmax(w_hat, dim=-1) + self.eps
+        sigma = F.softplus(sigma_hat) + self.eps
+        Delta = F.softplus(Delta_hat)
+        mu_cur = self.mu_prev + Delta
+        # print("w:", w)
+        j = self.J[:memory.size(1) + 1]
+
+        # Attention weights
+        # Sigmoid-based surrogate of the logistic CDF (the exact CDF would be torch.sigmoid((j - mu_cur) / sigma))
+        phi_t = w.unsqueeze(-1) * (1 / (1 + torch.sigmoid(
+            (mu_cur.unsqueeze(-1) - j) / sigma.unsqueeze(-1))))
+        # print("phi_t:", phi_t)
+        
+        # Discretize attention weights
+        # (B, T_enc + 1)
+        alpha_t = torch.sum(phi_t, dim=1)
+        alpha_t = alpha_t[:, 1:] - alpha_t[:, :-1]
+        alpha_t[alpha_t == 0] = self.eps
+        # print("alpha_t: ", alpha_t.size())
+        # Apply masking
+        if mask is not None:
+            alpha_t.data.masked_fill_(mask, self.score_mask_value)
+
+        context = torch.bmm(alpha_t.unsqueeze(1), memory).squeeze(1)
+        if memory_pitch is not None:
+            context_pitch = torch.bmm(alpha_t.unsqueeze(1), memory_pitch).squeeze(1)
+
+        self.mu_prev = mu_cur
+        
+        if memory_pitch is not None:
+            return context, context_pitch, alpha_t
+        return context, alpha_t
+
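Review note: the discretization in `MOLAttention.forward` takes differences of a cumulative function at the bin edges stored in `self.J` (0.5, 1.5, ...). The pure-Python sketch below shows the same idea with the exact logistic CDF, which differs from the expression used in the diff; the function names and the sample mixture values are assumptions made for illustration only:

```python
import math


def logistic_cdf(x, mu, sigma):
    """CDF of a logistic distribution with location mu and scale sigma."""
    return 1.0 / (1.0 + math.exp(-(x - mu) / sigma))


def discretized_mol_weights(weights, mus, sigmas, num_steps):
    """Attention weight per encoder step k = 0..num_steps-1.

    Step k receives the mixture's probability mass between the edges
    k + 0.5 and k + 1.5, mirroring the edge positions held in self.J.
    """
    alphas = []
    for k in range(num_steps):
        lo, hi = k + 0.5, k + 1.5
        mass = sum(
            w * (logistic_cdf(hi, mu, s) - logistic_cdf(lo, mu, s))
            for w, mu, s in zip(weights, mus, sigmas)
        )
        alphas.append(mass)
    return alphas


alphas = discretized_mol_weights([0.7, 0.3], [2.0, 3.0], [0.5, 1.0], num_steps=8)
peak = max(range(len(alphas)), key=alphas.__getitem__)
print(peak, round(sum(alphas), 4))
```

Because each weight is a difference of a monotone CDF, the weights are non-negative and sum to at most the total mixture weight. That is why this family of attention moves forward monotonically as the means grow by `Delta` at each decoder step.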
--- a/ppg2mel/utils/nets_utils.py
+++ b/ppg2mel/utils/nets_utils.py
@@ -0,0 +1,451 @@
+# -*- coding: utf-8 -*-
+
+"""Network related utility tools."""
+
+import logging
+from typing import Dict
+
+import numpy as np
+import torch
+
+
+def to_device(m, x):
+    """Send tensor into the device of the module.
+
+    Args:
+        m (torch.nn.Module): Torch module.
+        x (Tensor): Torch tensor.
+
+    Returns:
+        Tensor: Torch tensor located in the same place as torch module.
+
+    """
+    assert isinstance(m, torch.nn.Module)
+    device = next(m.parameters()).device
+    return x.to(device)
+
+
+def pad_list(xs, pad_value):
+    """Perform padding for the list of tensors.
+
+    Args:
+        xs (List): List of Tensors [(T_1, `*`), (T_2, `*`), ..., (T_B, `*`)].
+        pad_value (float): Value for padding.
+
+    Returns:
+        Tensor: Padded tensor (B, Tmax, `*`).
+
+    Examples:
+        >>> x = [torch.ones(4), torch.ones(2), torch.ones(1)]
+        >>> x
+        [tensor([1., 1., 1., 1.]), tensor([1., 1.]), tensor([1.])]
+        >>> pad_list(x, 0)
+        tensor([[1., 1., 1., 1.],
+                [1., 1., 0., 0.],
+                [1., 0., 0., 0.]])
+
+    """
+    n_batch = len(xs)
+    max_len = max(x.size(0) for x in xs)
+    pad = xs[0].new(n_batch, max_len, *xs[0].size()[1:]).fill_(pad_value)
+
+    for i in range(n_batch):
+        pad[i, :xs[i].size(0)] = xs[i]
+
+    return pad
+
+
+def make_pad_mask(lengths, xs=None, length_dim=-1):
+    """Make mask tensor containing indices of padded part.
+
+    Args:
+        lengths (LongTensor or List): Batch of lengths (B,).
+        xs (Tensor, optional): The reference tensor. If set, masks will be the same shape as this tensor.
+        length_dim (int, optional): Dimension indicator of the above tensor. See the example.
+
+    Returns:
+        Tensor: Mask tensor containing indices of padded part.
+                dtype=torch.uint8 in PyTorch 1.2-
+                dtype=torch.bool in PyTorch 1.2+ (including 1.2)
+
+    Examples:
+        With only lengths.
+
+        >>> lengths = [5, 3, 2]
+        >>> make_pad_mask(lengths)
+        masks = [[0, 0, 0, 0 ,0],
+                 [0, 0, 0, 1, 1],
+                 [0, 0, 1, 1, 1]]
+
+        With the reference tensor.
+
+        >>> xs = torch.zeros((3, 2, 4))
+        >>> make_pad_mask(lengths, xs)
+        tensor([[[0, 0, 0, 0],
+                 [0, 0, 0, 0]],
+                [[0, 0, 0, 1],
+                 [0, 0, 0, 1]],
+                [[0, 0, 1, 1],
+                 [0, 0, 1, 1]]], dtype=torch.uint8)
+        >>> xs = torch.zeros((3, 2, 6))
+        >>> make_pad_mask(lengths, xs)
+        tensor([[[0, 0, 0, 0, 0, 1],
+                 [0, 0, 0, 0, 0, 1]],
+                [[0, 0, 0, 1, 1, 1],
+                 [0, 0, 0, 1, 1, 1]],
+                [[0, 0, 1, 1, 1, 1],
+                 [0, 0, 1, 1, 1, 1]]], dtype=torch.uint8)
+
+        With the reference tensor and dimension indicator.
+
+        >>> xs = torch.zeros((3, 6, 6))
+        >>> make_pad_mask(lengths, xs, 1)
+        tensor([[[0, 0, 0, 0, 0, 0],
+                 [0, 0, 0, 0, 0, 0],
+                 [0, 0, 0, 0, 0, 0],
+                 [0, 0, 0, 0, 0, 0],
+                 [0, 0, 0, 0, 0, 0],
+                 [1, 1, 1, 1, 1, 1]],
+                [[0, 0, 0, 0, 0, 0],
+                 [0, 0, 0, 0, 0, 0],
+                 [0, 0, 0, 0, 0, 0],
+                 [1, 1, 1, 1, 1, 1],
+                 [1, 1, 1, 1, 1, 1],
+                 [1, 1, 1, 1, 1, 1]],
+                [[0, 0, 0, 0, 0, 0],
+                 [0, 0, 0, 0, 0, 0],
+                 [1, 1, 1, 1, 1, 1],
+                 [1, 1, 1, 1, 1, 1],
+                 [1, 1, 1, 1, 1, 1],
+                 [1, 1, 1, 1, 1, 1]]], dtype=torch.uint8)
+        >>> make_pad_mask(lengths, xs, 2)
+        tensor([[[0, 0, 0, 0, 0, 1],
+                 [0, 0, 0, 0, 0, 1],
+                 [0, 0, 0, 0, 0, 1],
+                 [0, 0, 0, 0, 0, 1],
+                 [0, 0, 0, 0, 0, 1],
+                 [0, 0, 0, 0, 0, 1]],
+                [[0, 0, 0, 1, 1, 1],
+                 [0, 0, 0, 1, 1, 1],
+                 [0, 0, 0, 1, 1, 1],
+                 [0, 0, 0, 1, 1, 1],
+                 [0, 0, 0, 1, 1, 1],
+                 [0, 0, 0, 1, 1, 1]],
+                [[0, 0, 1, 1, 1, 1],
+                 [0, 0, 1, 1, 1, 1],
+                 [0, 0, 1, 1, 1, 1],
+                 [0, 0, 1, 1, 1, 1],
+                 [0, 0, 1, 1, 1, 1],
+                 [0, 0, 1, 1, 1, 1]]], dtype=torch.uint8)
+
+    """
+    if length_dim == 0:
+        raise ValueError('length_dim cannot be 0: {}'.format(length_dim))
+
+    if not isinstance(lengths, list):
+        lengths = lengths.tolist()
+    bs = int(len(lengths))
+    if xs is None:
+        maxlen = int(max(lengths))
+    else:
+        maxlen = xs.size(length_dim)
+
+    seq_range = torch.arange(0, maxlen, dtype=torch.int64)
+    seq_range_expand = seq_range.unsqueeze(0).expand(bs, maxlen)
+    seq_length_expand = seq_range_expand.new(lengths).unsqueeze(-1)
+    mask = seq_range_expand >= seq_length_expand
+
+    if xs is not None:
+        assert xs.size(0) == bs, (xs.size(0), bs)
+
+        if length_dim < 0:
+            length_dim = xs.dim() + length_dim
+        # ind = (:, None, ..., None, :, , None, ..., None)
+        ind = tuple(slice(None) if i in (0, length_dim) else None
+                    for i in range(xs.dim()))
+        mask = mask[ind].expand_as(xs).to(xs.device)
+    return mask
+
+
+def make_non_pad_mask(lengths, xs=None, length_dim=-1):
+    """Make mask tensor containing indices of non-padded part.
+
+    Args:
+        lengths (LongTensor or List): Batch of lengths (B,).
+        xs (Tensor, optional): The reference tensor. If set, masks will be the same shape as this tensor.
+        length_dim (int, optional): Dimension indicator of the above tensor. See the example.
+
+    Returns:
+        ByteTensor: mask tensor containing indices of non-padded part.
+                    dtype=torch.uint8 in PyTorch 1.2-
+                    dtype=torch.bool in PyTorch 1.2+ (including 1.2)
+
+    Examples:
+        With only lengths.
+
+        >>> lengths = [5, 3, 2]
+        >>> make_non_pad_mask(lengths)
+        masks = [[1, 1, 1, 1 ,1],
+                 [1, 1, 1, 0, 0],
+                 [1, 1, 0, 0, 0]]
+
+        With the reference tensor.
+
+        >>> xs = torch.zeros((3, 2, 4))
+        >>> make_non_pad_mask(lengths, xs)
+        tensor([[[1, 1, 1, 1],
+                 [1, 1, 1, 1]],
+                [[1, 1, 1, 0],
+                 [1, 1, 1, 0]],
+                [[1, 1, 0, 0],
+                 [1, 1, 0, 0]]], dtype=torch.uint8)
+        >>> xs = torch.zeros((3, 2, 6))
+        >>> make_non_pad_mask(lengths, xs)
+        tensor([[[1, 1, 1, 1, 1, 0],
+                 [1, 1, 1, 1, 1, 0]],
+                [[1, 1, 1, 0, 0, 0],
+                 [1, 1, 1, 0, 0, 0]],
+                [[1, 1, 0, 0, 0, 0],
+                 [1, 1, 0, 0, 0, 0]]], dtype=torch.uint8)
+
+        With the reference tensor and dimension indicator.
+
+        >>> xs = torch.zeros((3, 6, 6))
+        >>> make_non_pad_mask(lengths, xs, 1)
+        tensor([[[1, 1, 1, 1, 1, 1],
+                 [1, 1, 1, 1, 1, 1],
+                 [1, 1, 1, 1, 1, 1],
+                 [1, 1, 1, 1, 1, 1],
+                 [1, 1, 1, 1, 1, 1],
+                 [0, 0, 0, 0, 0, 0]],
+                [[1, 1, 1, 1, 1, 1],
+                 [1, 1, 1, 1, 1, 1],
+                 [1, 1, 1, 1, 1, 1],
+                 [0, 0, 0, 0, 0, 0],
+                 [0, 0, 0, 0, 0, 0],
+                 [0, 0, 0, 0, 0, 0]],
+                [[1, 1, 1, 1, 1, 1],
+                 [1, 1, 1, 1, 1, 1],
+                 [0, 0, 0, 0, 0, 0],
+                 [0, 0, 0, 0, 0, 0],
+                 [0, 0, 0, 0, 0, 0],
+                 [0, 0, 0, 0, 0, 0]]], dtype=torch.uint8)
+        >>> make_non_pad_mask(lengths, xs, 2)
+        tensor([[[1, 1, 1, 1, 1, 0],
+                 [1, 1, 1, 1, 1, 0],
+                 [1, 1, 1, 1, 1, 0],
+                 [1, 1, 1, 1, 1, 0],
+                 [1, 1, 1, 1, 1, 0],
+                 [1, 1, 1, 1, 1, 0]],
+                [[1, 1, 1, 0, 0, 0],
+                 [1, 1, 1, 0, 0, 0],
+                 [1, 1, 1, 0, 0, 0],
+                 [1, 1, 1, 0, 0, 0],
+                 [1, 1, 1, 0, 0, 0],
+                 [1, 1, 1, 0, 0, 0]],
+                [[1, 1, 0, 0, 0, 0],
+                 [1, 1, 0, 0, 0, 0],
+                 [1, 1, 0, 0, 0, 0],
+                 [1, 1, 0, 0, 0, 0],
+                 [1, 1, 0, 0, 0, 0],
+                 [1, 1, 0, 0, 0, 0]]], dtype=torch.uint8)
+
+    """
+    return ~make_pad_mask(lengths, xs, length_dim)
+
+
+def mask_by_length(xs, lengths, fill=0):
+    """Mask tensor according to length.
+
+    Args:
+        xs (Tensor): Batch of input tensor (B, `*`).
+        lengths (LongTensor or List): Batch of lengths (B,).
+        fill (int or float): Value to fill masked part.
+
+    Returns:
+        Tensor: Batch of masked input tensor (B, `*`).
+
+    Examples:
+        >>> x = torch.arange(5).repeat(3, 1) + 1
+        >>> x
+        tensor([[1, 2, 3, 4, 5],
+                [1, 2, 3, 4, 5],
+                [1, 2, 3, 4, 5]])
+        >>> lengths = [5, 3, 2]
+        >>> mask_by_length(x, lengths)
+        tensor([[1, 2, 3, 4, 5],
+                [1, 2, 3, 0, 0],
+                [1, 2, 0, 0, 0]])
+
+    """
+    assert xs.size(0) == len(lengths)
+    ret = xs.data.new(*xs.size()).fill_(fill)
+    for i, l in enumerate(lengths):
+        ret[i, :l] = xs[i, :l]
+    return ret
+
+
+def th_accuracy(pad_outputs, pad_targets, ignore_label):
+    """Calculate accuracy.
+
+    Args:
+        pad_outputs (Tensor): Prediction tensors (B * Lmax, D).
+        pad_targets (LongTensor): Target label tensors (B, Lmax).
+        ignore_label (int): Ignore label id.
+
+    Returns:
+        float: Accuracy value (0.0 - 1.0).
+
+    """
+    pad_pred = pad_outputs.view(
+        pad_targets.size(0),
+        pad_targets.size(1),
+        pad_outputs.size(1)).argmax(2)
+    mask = pad_targets != ignore_label
+    numerator = torch.sum(pad_pred.masked_select(mask) == pad_targets.masked_select(mask))
+    denominator = torch.sum(mask)
+    return float(numerator) / float(denominator)
+
+
+def to_torch_tensor(x):
+    """Change to torch.Tensor or ComplexTensor from numpy.ndarray.
+
+    Args:
+        x: Inputs. It should be one of numpy.ndarray, Tensor, ComplexTensor, and dict.
+
+    Returns:
+        Tensor or ComplexTensor: Type converted inputs.
+
+    Examples:
+        >>> xs = np.ones(3, dtype=np.float32)
+        >>> xs = to_torch_tensor(xs)
+        tensor([1., 1., 1.])
+        >>> xs = torch.ones(3, 4, 5)
+        >>> assert to_torch_tensor(xs) is xs
+        >>> xs = {'real': xs, 'imag': xs}
+        >>> to_torch_tensor(xs)
+        ComplexTensor(
+        Real:
+        tensor([1., 1., 1.])
+        Imag:
+        tensor([1., 1., 1.])
+        )
+
+    """
+    # If numpy, change to torch tensor
+    if isinstance(x, np.ndarray):
+        if x.dtype.kind == 'c':
+            # Dynamically importing because torch_complex requires python3
+            from torch_complex.tensor import ComplexTensor
+            return ComplexTensor(x)
+        else:
+            return torch.from_numpy(x)
+
+    # If {'real': ..., 'imag': ...}, convert to ComplexTensor
+    elif isinstance(x, dict):
+        # Dynamically importing because torch_complex requires python3
+        from torch_complex.tensor import ComplexTensor
+
+        if 'real' not in x or 'imag' not in x:
+            raise ValueError("x must have 'real' and 'imag' keys: {}".format(list(x)))
+        # Relative importing because of using python3 syntax
+        return ComplexTensor(x['real'], x['imag'])
+
+    # If torch.Tensor, as it is
+    elif isinstance(x, torch.Tensor):
+        return x
+
+    else:
+        error = ("x must be numpy.ndarray, torch.Tensor or a dict like "
+                 "{{'real': torch.Tensor, 'imag': torch.Tensor}}, "
+                 "but got {}".format(type(x)))
+        try:
+            from torch_complex.tensor import ComplexTensor
+        except Exception:
+            # If PY2
+            raise ValueError(error)
+        else:
+            # If PY3
+            if isinstance(x, ComplexTensor):
+                return x
+            else:
+                raise ValueError(error)
+
+
+def get_subsample(train_args, mode, arch):
+    """Parse the subsampling factors from the training args for the specified `mode` and `arch`.
+
+    Args:
+        train_args: argument Namespace containing options.
+        mode: one of ('asr', 'mt', 'st')
+        arch: one of ('rnn', 'rnn-t', 'rnn_mix', 'rnn_mulenc', 'transformer')
+
+    Returns:
+        np.ndarray / List[np.ndarray]: subsampling factors.
+    """
+    if arch == 'transformer':
+        return np.array([1])
+
+    elif mode == 'mt' and arch == 'rnn':
+        # +1 means input (+1) and layers outputs (train_args.elayers)
+        subsample = np.ones(train_args.elayers + 1, dtype=int)
+        logging.warning('Subsampling is not performed for machine translation.')
+        logging.info('subsample: ' + ' '.join([str(x) for x in subsample]))
+        return subsample
+
+    elif (mode == 'asr' and arch in ('rnn', 'rnn-t')) or \
+         (mode == 'mt' and arch == 'rnn') or \
+         (mode == 'st' and arch == 'rnn'):
+        subsample = np.ones(train_args.elayers + 1, dtype=int)
+        if train_args.etype.endswith("p") and not train_args.etype.startswith("vgg"):
+            ss = train_args.subsample.split("_")
+            for j in range(min(train_args.elayers + 1, len(ss))):
+                subsample[j] = int(ss[j])
+        else:
+            logging.warning(
+                'Subsampling is not performed for vgg*. It is performed in max pooling layers at CNN.')
+        logging.info('subsample: ' + ' '.join([str(x) for x in subsample]))
+        return subsample
+
+    elif mode == 'asr' and arch == 'rnn_mix':
+        subsample = np.ones(train_args.elayers_sd + train_args.elayers + 1, dtype=int)
+        if train_args.etype.endswith("p") and not train_args.etype.startswith("vgg"):
+            ss = train_args.subsample.split("_")
+            for j in range(min(train_args.elayers_sd + train_args.elayers + 1, len(ss))):
+                subsample[j] = int(ss[j])
+        else:
+            logging.warning(
+                'Subsampling is not performed for vgg*. It is performed in max pooling layers at CNN.')
+        logging.info('subsample: ' + ' '.join([str(x) for x in subsample]))
+        return subsample
+
+    elif mode == 'asr' and arch == 'rnn_mulenc':
+        subsample_list = []
+        for idx in range(train_args.num_encs):
+            subsample = np.ones(train_args.elayers[idx] + 1, dtype=int)
+            if train_args.etype[idx].endswith("p") and not train_args.etype[idx].startswith("vgg"):
+                ss = train_args.subsample[idx].split("_")
+                for j in range(min(train_args.elayers[idx] + 1, len(ss))):
+                    subsample[j] = int(ss[j])
+            else:
+                logging.warning(
+                    'Encoder %d: Subsampling is not performed for vgg*. '
+                    'It is performed in max pooling layers at CNN.', idx + 1)
+            logging.info('subsample: ' + ' '.join([str(x) for x in subsample]))
+            subsample_list.append(subsample)
+        return subsample_list
+
+    else:
+        raise ValueError('Invalid options: mode={}, arch={}'.format(mode, arch))
+
+
+def rename_state_dict(old_prefix: str, new_prefix: str, state_dict: Dict[str, torch.Tensor]):
+    """Replace keys of old prefix with new prefix in state dict."""
+    # need this list not to break the dict iterator
+    old_keys = [k for k in state_dict if k.startswith(old_prefix)]
+    if len(old_keys) > 0:
+        logging.warning(f'Rename: {old_prefix} -> {new_prefix}')
+    for k in old_keys:
+        v = state_dict.pop(k)
+        new_k = new_prefix + k[len(old_prefix):]
+        state_dict[new_k] = v
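Review note: when renaming state-dict keys by prefix, `str.replace` without a count rewrites every occurrence of the prefix text, including ones in the middle of a key. A minimal sketch with a plain dict, assuming integer values stand in for tensors; `rename_prefix` and the sample keys are hypothetical:

```python
from typing import Any, Dict


def rename_prefix(old_prefix: str, new_prefix: str, state_dict: Dict[str, Any]) -> None:
    """Rename keys in place, rewriting only a leading `old_prefix`."""
    # Collect keys first so the dict is not mutated while iterating over it.
    old_keys = [k for k in state_dict if k.startswith(old_prefix)]
    for k in old_keys:
        state_dict[new_prefix + k[len(old_prefix):]] = state_dict.pop(k)


weights = {"enc.layer.enc.weight": 1, "dec.bias": 2}
naive_key = "enc.layer.enc.weight".replace("enc.", "encoder.")
rename_prefix("enc.", "encoder.", weights)
print(naive_key, sorted(weights))
```

Here the naive replacement yields `encoder.layer.encoder.weight`, while the prefix-only version leaves the inner `enc.` untouched.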
--- a/ppg2mel/utils/vc_utils.py
+++ b/ppg2mel/utils/vc_utils.py
@@ -0,0 +1,22 @@
+import torch
+
+
+def gcd(a, b):
+    """Greatest common divisor of two positive integers."""
+    a, b = (a, b) if a >= b else (b, a)
+    if a % b == 0:
+        return b
+    else:
+        return gcd(b, a % b)
+
+def lcm(a, b):
+    """Least common multiple"""
+    return a * b // gcd(a, b)
+
+def get_mask_from_lengths(lengths, max_len=None):
+    if max_len is None:
+        max_len = torch.max(lengths).item()
+    ids = torch.arange(0, max_len, device=lengths.device)
+    mask = (ids < lengths.unsqueeze(1)).bool()
+    return mask
+
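Review note: this diff carries two mask conventions. `get_mask_from_lengths` marks valid frames as `True`, while `make_pad_mask` in `nets_utils.py` marks padded frames as `True`. The pure-Python sketch below shows both on nested lists; `valid_mask` and `pad_mask` are hypothetical names, not functions from the diff:

```python
def valid_mask(lengths, max_len=None):
    """True where a frame index is inside the sequence (t < length)."""
    if max_len is None:
        max_len = max(lengths)
    return [[t < n for t in range(max_len)] for n in lengths]


def pad_mask(lengths, max_len=None):
    """True where a frame index is padding (the inverse of valid_mask)."""
    return [[not v for v in row] for row in valid_mask(lengths, max_len)]


lengths = [3, 1]
print(valid_mask(lengths))
print(pad_mask(lengths))
```

Mixing the two conventions up silently inverts which frames are masked, so it helps to check which one a given call site expects.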
--- a/ppg2mel_train.py
+++ b/ppg2mel_train.py
@@ -0,0 +1,67 @@
+import sys
+import torch
+import argparse
+import numpy as np
+from utils.load_yaml import HpsYaml
+from ppg2mel.train.train_linglf02mel_seq2seq_oneshotvc import Solver
+
+# For reproducibility; commenting these out may speed up training
+torch.backends.cudnn.deterministic = True
+torch.backends.cudnn.benchmark = False
+
+def main():
+    # Arguments
+    parser = argparse.ArgumentParser(description=
+            'Training PPG2Mel VC model.')
+    parser.add_argument('--config', type=str, required=True,
+                        help='Path to experiment config, e.g., config/vc.yaml')
+    parser.add_argument('--name', default=None, type=str, help='Name for logging.')
+    parser.add_argument('--logdir', default='log/', type=str,
+                        help='Logging path.', required=False)
+    parser.add_argument('--ckpdir', default='ppg2mel/saved_models/', type=str,
+                        help='Checkpoint path.', required=False)
+    parser.add_argument('--outdir', default='result/', type=str,
+                        help='Decode output path.', required=False)
+    parser.add_argument('--load', default=None, type=str,
+                        help='Load pre-trained model (for training only)', required=False)
+    parser.add_argument('--warm_start', action='store_true',
+                        help='Load model weights only, ignore specified layers.')
+    parser.add_argument('--seed', default=0, type=int,
+                        help='Random seed for reproducible results.', required=False)
+    parser.add_argument('--njobs', default=8, type=int,
+                        help='Number of threads for dataloader/decoding.', required=False)
+    parser.add_argument('--cpu', action='store_true', help='Disable GPU training.')
+    parser.add_argument('--no-pin', action='store_true',
+                        help='Disable pin-memory for dataloader')
+    parser.add_argument('--test', action='store_true', help='Test the model.')
+    parser.add_argument('--no-msg', action='store_true', help='Hide all messages.')
+    parser.add_argument('--finetune', action='store_true', help='Finetune model')
+    parser.add_argument('--oneshotvc', action='store_true', help='Oneshot VC model')
+    parser.add_argument('--bilstm', action='store_true', help='BiLSTM VC model')
+    parser.add_argument('--lsa', action='store_true', help='Use location-sensitive attention (LSA)')
+
+    ###
+
+    paras = parser.parse_args()
+    setattr(paras, 'gpu', not paras.cpu)
+    setattr(paras, 'pin_memory', not paras.no_pin)
+    setattr(paras, 'verbose', not paras.no_msg)
+    # Make the config dict dot visitable
+    config = HpsYaml(paras.config)
+
+    np.random.seed(paras.seed)
+    torch.manual_seed(paras.seed)
+    if torch.cuda.is_available():
+        torch.cuda.manual_seed_all(paras.seed)
+
+    print(">>> OneShot VC training ...")
+    mode = "train"
+    solver = Solver(config, paras, mode)
+    solver.load_data()
+    solver.set_model()
+    solver.exec()
+    print(">>> OneShot VC training finished!")
+    sys.exit(0)
+
+if __name__ == "__main__":
+    main()   
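Review note: the script derives positive flags (`gpu`, `pin_memory`, `verbose`) from negative command-line switches, and argparse converts dashes in option names to underscores in attribute names. A reduced sketch of that pattern, assuming only three of the switches from the diff; `build_parser` and `derive_flags` are hypothetical helpers:

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Flag-derivation sketch")
    parser.add_argument("--cpu", action="store_true", help="Disable GPU training.")
    parser.add_argument("--no-pin", action="store_true", help="Disable pin-memory.")
    parser.add_argument("--no-msg", action="store_true", help="Hide all messages.")
    return parser


def derive_flags(argv):
    paras = build_parser().parse_args(argv)
    # argparse exposes "--no-pin" as the attribute "no_pin".
    paras.gpu = not paras.cpu
    paras.pin_memory = not paras.no_pin
    paras.verbose = not paras.no_msg
    return paras


paras = derive_flags(["--cpu", "--no-msg"])
print(paras.gpu, paras.pin_memory, paras.verbose)
```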
--- a/ppg_extractor/init.py
+++ b/ppg_extractor/init.py
@@ -0,0 +1,102 @@
+import argparse
+import torch
+from pathlib import Path
+import yaml
+
+from .frontend import DefaultFrontend
+from .utterance_mvn import UtteranceMVN
+from .encoder.conformer_encoder import ConformerEncoder
+
+_model = None # type: PPGModel
+_device = None
+
+class PPGModel(torch.nn.Module):
+    def __init__(
+        self,
+        frontend,
+        normalizer,
+        encoder,
+    ):
+        super().__init__()
+        self.frontend = frontend
+        self.normalize = normalizer
+        self.encoder = encoder
+
+    def forward(self, speech, speech_lengths):
+        """
+
+        Args:
+            speech (tensor): (B, L)
+            speech_lengths (tensor): (B, )
+
+        Returns:
+            bottle_neck_feats (tensor): (B, L//hop_size, 144)
+
+        """
+        feats, feats_lengths = self._extract_feats(speech, speech_lengths)
+        feats, feats_lengths = self.normalize(feats, feats_lengths)
+        encoder_out, encoder_out_lens, _ = self.encoder(feats, feats_lengths)
+        return encoder_out
+
+    def _extract_feats(
+        self, speech: torch.Tensor, speech_lengths: torch.Tensor
+    ):
+        assert speech_lengths.dim() == 1, speech_lengths.shape
+
+        # for data-parallel
+        speech = speech[:, : speech_lengths.max()]
+
+        if self.frontend is not None:
+            # Frontend
+            #  e.g. STFT and Feature extract
+            #       data_loader may send time-domain signal in this case
+            # speech (Batch, NSamples) -> feats: (Batch, NFrames, Dim)
+            feats, feats_lengths = self.frontend(speech, speech_lengths)
+        else:
+            # No frontend and no feature extract
+            feats, feats_lengths = speech, speech_lengths
+        return feats, feats_lengths
+        
+    def extract_from_wav(self, src_wav):
+        src_wav_tensor = torch.from_numpy(src_wav).unsqueeze(0).float().to(_device)
+        src_wav_lengths = torch.LongTensor([len(src_wav)]).to(_device)
+        return self(src_wav_tensor, src_wav_lengths)
+
+
+def build_model(args):
+    normalizer = UtteranceMVN(**args.normalize_conf)
+    frontend = DefaultFrontend(**args.frontend_conf)
+    encoder = ConformerEncoder(input_size=80, **args.encoder_conf)
+    model = PPGModel(frontend, normalizer, encoder)
+    
+    return model
+
+
+def load_model(model_file, device=None):
+    global _model, _device
+    
+    if device is None:
+        _device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+    else:
+        _device = device
+    # Search for a YAML config file next to (or below) the checkpoint; accepts str or Path
+    model_config_fpaths = list(Path(model_file).parent.rglob("*.yaml"))
+    config_file = model_config_fpaths[0]
+    with config_file.open("r", encoding="utf-8") as f:
+        args = yaml.safe_load(f)
+
+    args = argparse.Namespace(**args)
+
+    model = build_model(args)
+    model_state_dict = model.state_dict()
+
+    ckpt_state_dict = torch.load(model_file, map_location=_device)
+    ckpt_state_dict = {k:v for k,v in ckpt_state_dict.items() if 'encoder' in k}
+
+    model_state_dict.update(ckpt_state_dict)
+    model.load_state_dict(model_state_dict)
+
+    _model = model.eval().to(_device)
+    return _model
+
+
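Note on `load_model` above: only checkpoint entries whose key contains `'encoder'` are copied into the freshly built model, and every other parameter keeps its initial value. The sketch below shows that merge with plain dictionaries. The helper name `merge_encoder_weights` and the example keys are hypothetical, and floats stand in for tensors so the snippet runs without PyTorch.

```python
def merge_encoder_weights(model_state, ckpt_state, keyword="encoder"):
    """Return a copy of model_state with matching checkpoint entries overlaid."""
    # Keep only checkpoint entries whose key mentions the keyword.
    selected = {k: v for k, v in ckpt_state.items() if keyword in k}
    merged = dict(model_state)
    merged.update(selected)
    return merged


# Hypothetical parameter names; floats stand in for weight tensors.
model_state = {"encoder.embed.weight": 0.0, "frontend.window": 1.0}
ckpt_state = {"encoder.embed.weight": 0.5, "decoder.out.weight": 0.9}

merged = merge_encoder_weights(model_state, ckpt_state)
print(merged)
```

One consequence of this design is that a checkpoint key containing `'encoder'` but absent from the model would be added to the merged dictionary. With a real `torch.nn.Module`, `load_state_dict` in its default strict mode would then report it as an unexpected key.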
--- a/ppg_extractor/e2e_asr_common.py
+++ b/ppg_extractor/e2e_asr_common.py
@@ -0,0 +1,398 @@
+#!/usr/bin/env python3
+
+# Copyright 2017 Johns Hopkins University (Shinji Watanabe)
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+"""Common functions for ASR."""
+
+import argparse
+import editdistance
+import json
+import logging
+import numpy as np
+import six
+import sys
+
+from itertools import groupby
+
+
+def end_detect(ended_hyps, i, M=3, D_end=np.log(1 * np.exp(-10))):
+    """End detection.
+
+    Described in Eq. (50) of S. Watanabe et al.,
+    "Hybrid CTC/Attention Architecture for End-to-End Speech Recognition"
+
+    :param ended_hyps:
+    :param i:
+    :param M:
+    :param D_end:
+    :return:
+    """
+    if len(ended_hyps) == 0:
+        return False
+    count = 0
+    best_hyp = sorted(ended_hyps, key=lambda x: x['score'], reverse=True)[0]
+    for m in six.moves.range(M):
+        # get the ended hypotheses whose length is i - m
+        hyp_length = i - m
+        hyps_same_length = [x for x in ended_hyps if len(x['yseq']) == hyp_length]
+        if len(hyps_same_length) > 0:
+            best_hyp_same_length = sorted(hyps_same_length, key=lambda x: x['score'], reverse=True)[0]
+            if best_hyp_same_length['score'] - best_hyp['score'] < D_end:
+                count += 1
+
+    if count == M:
+        return True
+    else:
+        return False
+
+
+# TODO(takaaki-hori): add different smoothing methods
+def label_smoothing_dist(odim, lsm_type, transcript=None, blank=0):
+    """Obtain label distribution for loss smoothing.
+
+    :param odim:
+    :param lsm_type:
+    :param blank:
+    :param transcript:
+    :return:
+    """
+    if transcript is not None:
+        with open(transcript, 'rb') as f:
+            trans_json = json.load(f)['utts']
+
+    if lsm_type == 'unigram':
+        assert transcript is not None, 'transcript is required for %s label smoothing' % lsm_type
+        labelcount = np.zeros(odim)
+        for k, v in trans_json.items():
+            ids = np.array([int(n) for n in v['output'][0]['tokenid'].split()])
+            # to avoid an error when there is no text in an utterance
+            if len(ids) > 0:
+                labelcount[ids] += 1
+        labelcount[odim - 1] = len(trans_json)  # count <eos>: one per utterance
+        labelcount[labelcount == 0] = 1  # flooring
+        labelcount[blank] = 0  # remove counts for blank
+        labeldist = labelcount.astype(np.float32) / np.sum(labelcount)
+    else:
+        logging.error(
+            "Error: unexpected label smoothing type: %s" % lsm_type)
+        sys.exit()
+
+    return labeldist
+
+
+def get_vgg2l_odim(idim, in_channel=3, out_channel=128, downsample=True):
+    """Return the output size of the VGG frontend.
+
+    :param in_channel: input channel size
+    :param out_channel: output channel size
+    :return: output size
+    :rtype int
+    """
+    idim = idim / in_channel
+    if downsample:
+        idim = np.ceil(np.array(idim, dtype=np.float32) / 2)  # 1st max pooling
+        idim = np.ceil(np.array(idim, dtype=np.float32) / 2)  # 2nd max pooling
+    return int(idim) * out_channel  # number of channels
+
+
+class ErrorCalculator(object):
+    """Calculate CER and WER for E2E_ASR and CTC models during training.
+
+    :param y_hats: numpy array with predicted text
+    :param y_pads: numpy array with true (target) text
+    :param char_list:
+    :param sym_space:
+    :param sym_blank:
+    :return:
+    """
+
+    def __init__(self, char_list, sym_space, sym_blank, report_cer=False, report_wer=False,
+                 trans_type="char"):
+        """Construct an ErrorCalculator object."""
+        super(ErrorCalculator, self).__init__()
+
+        self.report_cer = report_cer
+        self.report_wer = report_wer
+        self.trans_type = trans_type
+        self.char_list = char_list
+        self.space = sym_space
+        self.blank = sym_blank
+        self.idx_blank = self.char_list.index(self.blank)
+        if self.space in self.char_list:
+            self.idx_space = self.char_list.index(self.space)
+        else:
+            self.idx_space = None
+
+    def __call__(self, ys_hat, ys_pad, is_ctc=False):
+        """Calculate sentence-level WER/CER score.
+
+        :param torch.Tensor ys_hat: prediction (batch, seqlen)
+        :param torch.Tensor ys_pad: reference (batch, seqlen)
+        :param bool is_ctc: calculate CER score for CTC
+        :return: sentence-level CER score
+        :rtype float
+        :return: sentence-level WER score
+        :rtype float
+        """
+        cer, wer = None, None
+        if is_ctc:
+            return self.calculate_cer_ctc(ys_hat, ys_pad)
+        elif not self.report_cer and not self.report_wer:
+            return cer, wer
+
+        seqs_hat, seqs_true = self.convert_to_char(ys_hat, ys_pad)
+        if self.report_cer:
+            cer = self.calculate_cer(seqs_hat, seqs_true)
+
+        if self.report_wer:
+            wer = self.calculate_wer(seqs_hat, seqs_true)
+        return cer, wer
+
+    def calculate_cer_ctc(self, ys_hat, ys_pad):
+        """Calculate sentence-level CER score for CTC.
+
+        :param torch.Tensor ys_hat: prediction (batch, seqlen)
+        :param torch.Tensor ys_pad: reference (batch, seqlen)
+        :return: average sentence-level CER score
+        :rtype float
+        """
+        cers, char_ref_lens = [], []
+        for i, y in enumerate(ys_hat):
+            y_hat = [x[0] for x in groupby(y)]
+            y_true = ys_pad[i]
+            seq_hat, seq_true = [], []
+            for idx in y_hat:
+                idx = int(idx)
+                if idx != -1 and idx != self.idx_blank and idx != self.idx_space:
+                    seq_hat.append(self.char_list[int(idx)])
+
+            for idx in y_true:
+                idx = int(idx)
+                if idx != -1 and idx != self.idx_blank and idx != self.idx_space:
+                    seq_true.append(self.char_list[int(idx)])
+            if self.trans_type == "char":
+                hyp_chars = "".join(seq_hat)
+                ref_chars = "".join(seq_true)
+            else:
+                hyp_chars = " ".join(seq_hat)
+                ref_chars = " ".join(seq_true)
+
+            if len(ref_chars) > 0:
+                cers.append(editdistance.eval(hyp_chars, ref_chars))
+                char_ref_lens.append(len(ref_chars))
+
+        cer_ctc = float(sum(cers)) / sum(char_ref_lens) if cers else None
+        return cer_ctc
+
+    def convert_to_char(self, ys_hat, ys_pad):
+        """Convert index to character.
+
+        :param torch.Tensor ys_hat: prediction (batch, seqlen)
+        :param torch.Tensor ys_pad: reference (batch, seqlen)
+        :return: token list of prediction
+        :rtype list
+        :return: token list of reference
+        :rtype list
+        """
+        seqs_hat, seqs_true = [], []
+        for i, y_hat in enumerate(ys_hat):
+            y_true = ys_pad[i]
+            eos_true = np.where(y_true == -1)[0]
+            eos_true = eos_true[0] if len(eos_true) > 0 else len(y_true)
+            # y_hat is not padded with -1, so the eos position found in y_true
+            # is used to truncate y_hat as well. Without this, trailing tokens
+            # would inflate the WER relative to the one obtained from decoding.
+            seq_hat = [self.char_list[int(idx)] for idx in y_hat[:eos_true]]
+            seq_true = [self.char_list[int(idx)] for idx in y_true if int(idx) != -1]
+            # seq_hat_text = "".join(seq_hat).replace(self.space, ' ')
+            seq_hat_text = " ".join(seq_hat).replace(self.space, ' ')
+            seq_hat_text = seq_hat_text.replace(self.blank, '')
+            # seq_true_text = "".join(seq_true).replace(self.space, ' ')
+            seq_true_text = " ".join(seq_true).replace(self.space, ' ')
+            seqs_hat.append(seq_hat_text)
+            seqs_true.append(seq_true_text)
+        return seqs_hat, seqs_true
+
+    def calculate_cer(self, seqs_hat, seqs_true):
+        """Calculate sentence-level CER score.
+
+        :param list seqs_hat: prediction
+        :param list seqs_true: reference
+        :return: average sentence-level CER score
+        :rtype float
+        """
+        char_eds, char_ref_lens = [], []
+        for i, seq_hat_text in enumerate(seqs_hat):
+            seq_true_text = seqs_true[i]
+            hyp_chars = seq_hat_text.replace(' ', '')
+            ref_chars = seq_true_text.replace(' ', '')
+            char_eds.append(editdistance.eval(hyp_chars, ref_chars))
+            char_ref_lens.append(len(ref_chars))
+        return float(sum(char_eds)) / sum(char_ref_lens)
+
+    def calculate_wer(self, seqs_hat, seqs_true):
+        """Calculate sentence-level WER score.
+
+        :param list seqs_hat: prediction
+        :param list seqs_true: reference
+        :return: average sentence-level WER score
+        :rtype float
+        """
+        word_eds, word_ref_lens = [], []
+        for i, seq_hat_text in enumerate(seqs_hat):
+            seq_true_text = seqs_true[i]
+            hyp_words = seq_hat_text.split()
+            ref_words = seq_true_text.split()
+            word_eds.append(editdistance.eval(hyp_words, ref_words))
+            word_ref_lens.append(len(ref_words))
+        return float(sum(word_eds)) / sum(word_ref_lens)
+
+
+class ErrorCalculatorTrans(object):
+    """Calculate CER and WER for transducer models.
+
+    Args:
+        decoder (nn.Module): decoder module
+        args (Namespace): argument Namespace containing options
+        report_cer (boolean): compute CER option
+        report_wer (boolean): compute WER option
+
+    """
+
+    def __init__(self, decoder, args, report_cer=False, report_wer=False):
+        """Construct an ErrorCalculator object for transducer model."""
+        super(ErrorCalculatorTrans, self).__init__()
+
+        self.dec = decoder
+
+        recog_args = {'beam_size': args.beam_size,
+                      'nbest': args.nbest,
+                      'space': args.sym_space,
+                      'score_norm_transducer': args.score_norm_transducer}
+
+        self.recog_args = argparse.Namespace(**recog_args)
+
+        self.char_list = args.char_list
+        self.space = args.sym_space
+        self.blank = args.sym_blank
+
+        self.report_cer = args.report_cer
+        self.report_wer = args.report_wer
+
+    def __call__(self, hs_pad, ys_pad):
+        """Calculate sentence-level WER/CER score for transducer models.
+
+        Args:
+            hs_pad (torch.Tensor): batch of padded input sequence (batch, T, D)
+            ys_pad (torch.Tensor): reference (batch, seqlen)
+
+        Returns:
+            (float): sentence-level CER score
+            (float): sentence-level WER score
+
+        """
+        cer, wer = None, None
+
+        if not self.report_cer and not self.report_wer:
+            return cer, wer
+
+        batchsize = int(hs_pad.size(0))
+        batch_nbest = []
+
+        for b in six.moves.range(batchsize):
+            if self.recog_args.beam_size == 1:
+                nbest_hyps = self.dec.recognize(hs_pad[b], self.recog_args)
+            else:
+                nbest_hyps = self.dec.recognize_beam(hs_pad[b], self.recog_args)
+            batch_nbest.append(nbest_hyps)
+
+        ys_hat = [nbest_hyp[0]['yseq'][1:] for nbest_hyp in batch_nbest]
+
+        seqs_hat, seqs_true = self.convert_to_char(ys_hat, ys_pad.cpu())
+
+        if self.report_cer:
+            cer = self.calculate_cer(seqs_hat, seqs_true)
+
+        if self.report_wer:
+            wer = self.calculate_wer(seqs_hat, seqs_true)
+
+        return cer, wer
+
+    def convert_to_char(self, ys_hat, ys_pad):
+        """Convert index to character.
+
+        Args:
+            ys_hat (torch.Tensor): prediction (batch, seqlen)
+            ys_pad (torch.Tensor): reference (batch, seqlen)
+
+        Returns:
+            (list): token list of prediction
+            (list): token list of reference
+
+        """
+        seqs_hat, seqs_true = [], []
+
+        for i, y_hat in enumerate(ys_hat):
+            y_true = ys_pad[i]
+
+            eos_true = np.where(y_true == -1)[0]
+            eos_true = eos_true[0] if len(eos_true) > 0 else len(y_true)
+
+            seq_hat = [self.char_list[int(idx)] for idx in y_hat[:eos_true]]
+            seq_true = [self.char_list[int(idx)] for idx in y_true if int(idx) != -1]
+
+            seq_hat_text = "".join(seq_hat).replace(self.space, ' ')
+            seq_hat_text = seq_hat_text.replace(self.blank, '')
+            seq_true_text = "".join(seq_true).replace(self.space, ' ')
+
+            seqs_hat.append(seq_hat_text)
+            seqs_true.append(seq_true_text)
+
+        return seqs_hat, seqs_true
+
+    def calculate_cer(self, seqs_hat, seqs_true):
+        """Calculate sentence-level CER score for transducer model.
+
+        Args:
+            seqs_hat (list): predicted text, one string per utterance
+            seqs_true (list): reference text, one string per utterance
+
+        Returns:
+            (float): average sentence-level CER score
+
+        """
+        char_eds, char_ref_lens = [], []
+
+        for i, seq_hat_text in enumerate(seqs_hat):
+            seq_true_text = seqs_true[i]
+            hyp_chars = seq_hat_text.replace(' ', '')
+            ref_chars = seq_true_text.replace(' ', '')
+
+            char_eds.append(editdistance.eval(hyp_chars, ref_chars))
+            char_ref_lens.append(len(ref_chars))
+
+        return float(sum(char_eds)) / sum(char_ref_lens)
+
+    def calculate_wer(self, seqs_hat, seqs_true):
+        """Calculate sentence-level WER score for transducer model.
+
+        Args:
+            seqs_hat (list): predicted text, one string per utterance
+            seqs_true (list): reference text, one string per utterance
+
+        Returns:
+            (float): average sentence-level WER score
+
+        """
+        word_eds, word_ref_lens = [], []
+
+        for i, seq_hat_text in enumerate(seqs_hat):
+            seq_true_text = seqs_true[i]
+            hyp_words = seq_hat_text.split()
+            ref_words = seq_true_text.split()
+
+            word_eds.append(editdistance.eval(hyp_words, ref_words))
+            word_ref_lens.append(len(ref_words))
+
+        return float(sum(word_eds)) / sum(word_ref_lens)
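Note on the error calculators above: although the docstrings say "sentence-level", both `calculate_cer` and `calculate_wer` sum the edit distances over the batch and divide by the summed reference lengths. The sketch below reproduces that computation with a pure-Python Levenshtein distance in place of the `editdistance` package, so it runs with the standard library only. The names `edit_distance` and `corpus_error_rate`, and the `None` result for an empty reference, are assumptions for illustration; the code in the diff divides unconditionally.

```python
def edit_distance(hyp, ref):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(ref) + 1))  # distances from an empty hypothesis
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # drop h from the hypothesis
                            curr[j - 1] + 1,      # add r to the hypothesis
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def corpus_error_rate(hyps, refs, unit="word"):
    """Sum of edit distances divided by the sum of reference lengths."""
    total_edits, total_len = 0, 0
    for hyp, ref in zip(hyps, refs):
        if unit == "word":
            h, r = hyp.split(), ref.split()
        else:  # character level: spaces are ignored, as in calculate_cer
            h, r = hyp.replace(" ", ""), ref.replace(" ", "")
        total_edits += edit_distance(h, r)
        total_len += len(r)
    return total_edits / total_len if total_len else None


hyps = ["the cat sat", "hello word"]
refs = ["the cat sat down", "hello world"]
wer = corpus_error_rate(hyps, refs, unit="word")
cer = corpus_error_rate(hyps, refs, unit="char")
print(wer, cer)
```

Pooling the counts this way weights long utterances more heavily than averaging per-utterance rates would, which is worth keeping in mind when comparing against scores from other tools.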
--- a/ppg_extractor/encoder/__init__.py
+++ b/ppg_extractor/encoder/__init__.py
--- a/ppg_extractor/encoder/attention.py
+++ b/ppg_extractor/encoder/attention.py
@@ -0,0 +1,183 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+
+# Copyright 2019 Shigeki Karita
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+"""Multi-Head Attention layer definition."""
+
+import math
+
+import numpy
+import torch
+from torch import nn
+
+
+class MultiHeadedAttention(nn.Module):
+    """Multi-Head Attention layer.
+
+    :param int n_head: the number of heads
+    :param int n_feat: the number of features
+    :param float dropout_rate: dropout rate
+
+    """
+
+    def __init__(self, n_head, n_feat, dropout_rate):
+        """Construct a MultiHeadedAttention object."""
+        super(MultiHeadedAttention, self).__init__()
+        assert n_feat % n_head == 0
+        # We assume d_v always equals d_k
+        self.d_k = n_feat // n_head
+        self.h = n_head
+        self.linear_q = nn.Linear(n_feat, n_feat)
+        self.linear_k = nn.Linear(n_feat, n_feat)
+        self.linear_v = nn.Linear(n_feat, n_feat)
+        self.linear_out = nn.Linear(n_feat, n_feat)
+        self.attn = None
+        self.dropout = nn.Dropout(p=dropout_rate)
+
+    def forward_qkv(self, query, key, value):
+        """Transform query, key and value.
+
+        :param torch.Tensor query: (batch, time1, size)
+        :param torch.Tensor key: (batch, time2, size)
+        :param torch.Tensor value: (batch, time2, size)
+        :return torch.Tensor transformed query, key and value
+
+        """
+        n_batch = query.size(0)
+        q = self.linear_q(query).view(n_batch, -1, self.h, self.d_k)
+        k = self.linear_k(key).view(n_batch, -1, self.h, self.d_k)
+        v = self.linear_v(value).view(n_batch, -1, self.h, self.d_k)
+        q = q.transpose(1, 2)  # (batch, head, time1, d_k)
+        k = k.transpose(1, 2)  # (batch, head, time2, d_k)
+        v = v.transpose(1, 2)  # (batch, head, time2, d_k)
+
+        return q, k, v
+
+    def forward_attention(self, value, scores, mask):
+        """Compute attention context vector.
+
+        :param torch.Tensor value: (batch, head, time2, size)
+        :param torch.Tensor scores: (batch, head, time1, time2)
+        :param torch.Tensor mask: (batch, 1, time2) or (batch, time1, time2)
+        :return torch.Tensor transformed `value` (batch, time1, d_model)
+            weighted by the attention score (batch, time1, time2)
+
+        """
+        n_batch = value.size(0)
+        if mask is not None:
+            mask = mask.unsqueeze(1).eq(0)  # (batch, 1, *, time2)
+            min_value = float(
+                numpy.finfo(torch.tensor(0, dtype=scores.dtype).numpy().dtype).min
+            )
+            scores = scores.masked_fill(mask, min_value)
+            self.attn = torch.softmax(scores, dim=-1).masked_fill(
+                mask, 0.0
+            )  # (batch, head, time1, time2)
+        else:
+            self.attn = torch.softmax(scores, dim=-1)  # (batch, head, time1, time2)
+
+        p_attn = self.dropout(self.attn)
+        x = torch.matmul(p_attn, value)  # (batch, head, time1, d_k)
+        x = (
+            x.transpose(1, 2).contiguous().view(n_batch, -1, self.h * self.d_k)
+        )  # (batch, time1, d_model)
+
+        return self.linear_out(x)  # (batch, time1, d_model)
+
+    def forward(self, query, key, value, mask):
+        """Compute 'Scaled Dot Product Attention'.
+
+        :param torch.Tensor query: (batch, time1, size)
+        :param torch.Tensor key: (batch, time2, size)
+        :param torch.Tensor value: (batch, time2, size)
+        :param torch.Tensor mask: (batch, 1, time2) or (batch, time1, time2)
+        :param torch.nn.Dropout dropout:
+        :return torch.Tensor: attention output (batch, time1, d_model)
+        """
+        q, k, v = self.forward_qkv(query, key, value)
+        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
+        return self.forward_attention(v, scores, mask)
+
+
+class RelPositionMultiHeadedAttention(MultiHeadedAttention):
+    """Multi-Head Attention layer with relative position encoding.
+
+    Paper: https://arxiv.org/abs/1901.02860
+
+    :param int n_head: the number of heads
+    :param int n_feat: the number of features
+    :param float dropout_rate: dropout rate
+
+    """
+
+    def __init__(self, n_head, n_feat, dropout_rate):
+        """Construct a RelPositionMultiHeadedAttention object."""
+        super().__init__(n_head, n_feat, dropout_rate)
+        # linear transformation for positional encoding
+        self.linear_pos = nn.Linear(n_feat, n_feat, bias=False)
+        # these two learnable biases are used in matrix c and matrix d
+        # as described in https://arxiv.org/abs/1901.02860 Section 3.3
+        self.pos_bias_u = nn.Parameter(torch.Tensor(self.h, self.d_k))
+        self.pos_bias_v = nn.Parameter(torch.Tensor(self.h, self.d_k))
+        torch.nn.init.xavier_uniform_(self.pos_bias_u)
+        torch.nn.init.xavier_uniform_(self.pos_bias_v)
+
+    def rel_shift(self, x, zero_triu=False):
+        """Compute relative positional encoding.
+
+        :param torch.Tensor x: (batch, head, time1, time2)
+        :param bool zero_triu: if True, keep only the lower triangular part of the matrix
+        """
+        zero_pad = torch.zeros((*x.size()[:3], 1), device=x.device, dtype=x.dtype)
+        x_padded = torch.cat([zero_pad, x], dim=-1)
+
+        x_padded = x_padded.view(*x.size()[:2], x.size(3) + 1, x.size(2))
+        x = x_padded[:, :, 1:].view_as(x)
+
+        if zero_triu:
+            ones = torch.ones((x.size(2), x.size(3)), device=x.device, dtype=x.dtype)
+            x = x * torch.tril(ones, x.size(3) - x.size(2))[None, None, :, :]
+
+        return x
+
+    def forward(self, query, key, value, pos_emb, mask):
+        """Compute 'Scaled Dot Product Attention' with rel. positional encoding.
+
+        :param torch.Tensor query: (batch, time1, size)
+        :param torch.Tensor key: (batch, time2, size)
+        :param torch.Tensor value: (batch, time2, size)
+        :param torch.Tensor pos_emb: (batch, time1, size)
+        :param torch.Tensor mask: (batch, time1, time2)
+        :param torch.nn.Dropout dropout:
+        :return torch.Tensor: attention output  (batch, time1, d_model)
+        """
+        q, k, v = self.forward_qkv(query, key, value)
+        q = q.transpose(1, 2)  # (batch, time1, head, d_k)
+
+        n_batch_pos = pos_emb.size(0)
+        p = self.linear_pos(pos_emb).view(n_batch_pos, -1, self.h, self.d_k)
+        p = p.transpose(1, 2)  # (batch, head, time1, d_k)
+
+        # (batch, head, time1, d_k)
+        q_with_bias_u = (q + self.pos_bias_u).transpose(1, 2)
+        # (batch, head, time1, d_k)
+        q_with_bias_v = (q + self.pos_bias_v).transpose(1, 2)
+
+        # compute attention score
+        # first compute matrix a and matrix c
+        # as described in https://arxiv.org/abs/1901.02860 Section 3.3
+        # (batch, head, time1, time2)
+        matrix_ac = torch.matmul(q_with_bias_u, k.transpose(-2, -1))
+
+        # compute matrix b and matrix d
+        # (batch, head, time1, time2)
+        matrix_bd = torch.matmul(q_with_bias_v, p.transpose(-2, -1))
+        matrix_bd = self.rel_shift(matrix_bd)
+
+        scores = (matrix_ac + matrix_bd) / math.sqrt(
+            self.d_k
+        )  # (batch, head, time1, time2)
+
+        return self.forward_attention(v, scores, mask)
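Note on `forward_attention` above: padded key positions are pushed to a very low score before the softmax, so they receive no weight and cannot influence the output. The sketch below shows the same idea for a single head in pure Python. It is a simplification under stated assumptions: it uses `-inf` rather than the dtype minimum used in the diff, omits dropout and the linear projections, and requires at least one unmasked key. The function name `masked_attention` is hypothetical.

```python
import math


def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention.

    q: list of T1 query vectors, k and v: lists of T2 key/value vectors,
    mask: list of T2 flags (1 = real frame, 0 = padding).
    """
    d_k = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k) for kj in k]
        # Padding positions get -inf, so exp() maps them to exactly 0.
        scores = [s if m else float("-inf") for s, m in zip(scores, mask)]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
mask = [1, 1, 0]  # the third frame is padding

context = masked_attention(q, k, v, mask)
print(context)
```

Changing the value vector of the masked frame leaves `context` unchanged, which is a quick way to check that a mask is wired up correctly.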
--- a/ppg_extractor/encoder/conformer_encoder.py
+++ b/ppg_extractor/encoder/conformer_encoder.py
@@ -0,0 +1,262 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+
+# Copyright 2019 Shigeki Karita
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+"""Encoder definition."""
+
+import logging
+import torch
+from typing import Callable
+from typing import Collection
+from typing import Dict
+from typing import List
+from typing import Optional
+from typing import Tuple
+
+from .convolution import ConvolutionModule
+from .encoder_layer import EncoderLayer
+from ..nets_utils import get_activation, make_pad_mask
+from .vgg import VGG2L
+from .attention import MultiHeadedAttention, RelPositionMultiHeadedAttention
+from .embedding import PositionalEncoding, ScaledPositionalEncoding, RelPositionalEncoding
+from .layer_norm import LayerNorm
+from .multi_layer_conv import Conv1dLinear, MultiLayeredConv1d
+from .positionwise_feed_forward import PositionwiseFeedForward
+from .repeat import repeat
+from .subsampling import Conv2dNoSubsampling, Conv2dSubsampling
+
+
+class ConformerEncoder(torch.nn.Module):
+    """Conformer encoder module.
+
+    :param int input_size: input dim
+    :param int attention_dim: dimension of attention
+    :param int attention_heads: the number of heads of multi head attention
+    :param int linear_units: the number of units of position-wise feed forward
+    :param int num_blocks: the number of encoder blocks
+    :param float dropout_rate: dropout rate
+    :param float attention_dropout_rate: dropout rate in attention
+    :param float positional_dropout_rate: dropout rate after adding positional encoding
+    :param str or torch.nn.Module input_layer: input layer type
+    :param bool normalize_before: whether to use layer_norm before the first block
+    :param bool concat_after: whether to concat attention layer's input and output
+        if True, additional linear will be applied.
+        i.e. x -> x + linear(concat(x, att(x)))
+        if False, no additional linear will be applied. i.e. x -> x + att(x)
+    :param str positionwise_layer_type: linear, conv1d, or conv1d-linear
+    :param int positionwise_conv_kernel_size: kernel size of positionwise conv1d layer
+    :param str pos_enc_layer_type: encoder positional encoding layer type
+    :param str selfattention_layer_type: encoder attention layer type
+    :param str activation_type: encoder activation function type
+    :param bool macaron_style: whether to use macaron style for positionwise layer
+    :param bool use_cnn_module: whether to use convolution module
+    :param int cnn_module_kernel: kernel size of convolution module
+    :param int padding_idx: padding_idx for input_layer=embed
+    """
+
+    def __init__(
+        self,
+        input_size,
+        attention_dim=256,
+        attention_heads=4,
+        linear_units=2048,
+        num_blocks=6,
+        dropout_rate=0.1,
+        positional_dropout_rate=0.1,
+        attention_dropout_rate=0.0,
+        input_layer="conv2d",
+        normalize_before=True,
+        concat_after=False,
+        positionwise_layer_type="linear",
+        positionwise_conv_kernel_size=1,
+        macaron_style=False,
+        pos_enc_layer_type="abs_pos",
+        selfattention_layer_type="selfattn",
+        activation_type="swish",
+        use_cnn_module=False,
+        cnn_module_kernel=31,
+        padding_idx=-1,
+        no_subsample=False,
+        subsample_by_2=False,
+    ):
+        """Construct an Encoder object."""
+        super().__init__()
+        
+        self._output_size = attention_dim
+        idim = input_size
+
+        activation = get_activation(activation_type)
+        if pos_enc_layer_type == "abs_pos":
+            pos_enc_class = PositionalEncoding
+        elif pos_enc_layer_type == "scaled_abs_pos":
+            pos_enc_class = ScaledPositionalEncoding
+        elif pos_enc_layer_type == "rel_pos":
+            assert selfattention_layer_type == "rel_selfattn"
+            pos_enc_class = RelPositionalEncoding
+        else:
+            raise ValueError("unknown pos_enc_layer: " + pos_enc_layer_type)
+
+        if input_layer == "linear":
+            self.embed = torch.nn.Sequential(
+                torch.nn.Linear(idim, attention_dim),
+                torch.nn.LayerNorm(attention_dim),
+                torch.nn.Dropout(dropout_rate),
+                pos_enc_class(attention_dim, positional_dropout_rate),
+            )
+        elif input_layer == "conv2d":
+            logging.info("Encoder input layer type: conv2d")
+            if no_subsample:
+                self.embed = Conv2dNoSubsampling(
+                    idim,
+                    attention_dim,
+                    dropout_rate,
+                    pos_enc_class(attention_dim, positional_dropout_rate),
+                )
+            else:
+                self.embed = Conv2dSubsampling(
+                    idim,
+                    attention_dim,
+                    dropout_rate,
+                    pos_enc_class(attention_dim, positional_dropout_rate),
+                    subsample_by_2,  # NOTE(Sx): added by songxiang
+                )
+        elif input_layer == "vgg2l":
+            self.embed = VGG2L(idim, attention_dim)
+        elif input_layer == "embed":
+            self.embed = torch.nn.Sequential(
+                torch.nn.Embedding(idim, attention_dim, padding_idx=padding_idx),
+                pos_enc_class(attention_dim, positional_dropout_rate),
+            )
+        elif isinstance(input_layer, torch.nn.Module):
+            self.embed = torch.nn.Sequential(
+                input_layer,
+                pos_enc_class(attention_dim, positional_dropout_rate),
+            )
+        elif input_layer is None:
+            self.embed = torch.nn.Sequential(
+                pos_enc_class(attention_dim, positional_dropout_rate)
+            )
+        else:
+            raise ValueError("unknown input_layer: " + str(input_layer))
+        self.normalize_before = normalize_before
+        if positionwise_layer_type == "linear":
+            positionwise_layer = PositionwiseFeedForward
+            positionwise_layer_args = (
+                attention_dim,
+                linear_units,
+                dropout_rate,
+                activation,
+            )
+        elif positionwise_layer_type == "conv1d":
+            positionwise_layer = MultiLayeredConv1d
+            positionwise_layer_args = (
+                attention_dim,
+                linear_units,
+                positionwise_conv_kernel_size,
+                dropout_rate,
+            )
+        elif positionwise_layer_type == "conv1d-linear":
+            positionwise_layer = Conv1dLinear
+            positionwise_layer_args = (
+                attention_dim,
+                linear_units,
+                positionwise_conv_kernel_size,
+                dropout_rate,
+            )
+        else:
+            raise NotImplementedError("Support only linear, conv1d, or conv1d-linear.")
+
+        if selfattention_layer_type == "selfattn":
+            logging.info("encoder self-attention layer type = self-attention")
+            encoder_selfattn_layer = MultiHeadedAttention
+            encoder_selfattn_layer_args = (
+                attention_heads,
+                attention_dim,
+                attention_dropout_rate,
+            )
+        elif selfattention_layer_type == "rel_selfattn":
+            assert pos_enc_layer_type == "rel_pos"
+            encoder_selfattn_layer = RelPositionMultiHeadedAttention
+            encoder_selfattn_layer_args = (
+                attention_heads,
+                attention_dim,
+                attention_dropout_rate,
+            )
+        else:
+            raise ValueError("unknown encoder_attn_layer: " + selfattention_layer_type)
+
+        convolution_layer = ConvolutionModule
+        convolution_layer_args = (attention_dim, cnn_module_kernel, activation)
+
+        self.encoders = repeat(
+            num_blocks,
+            lambda lnum: EncoderLayer(
+                attention_dim,
+                encoder_selfattn_layer(*encoder_selfattn_layer_args),
+                positionwise_layer(*positionwise_layer_args),
+                positionwise_layer(*positionwise_layer_args) if macaron_style else None,
+                convolution_layer(*convolution_layer_args) if use_cnn_module else None,
+                dropout_rate,
+                normalize_before,
+                concat_after,
+            ),
+        )
+        if self.normalize_before:
+            self.after_norm = LayerNorm(attention_dim)
+
+    def output_size(self) -> int:
+        return self._output_size
+
+    def forward(
+        self,
+        xs_pad: torch.Tensor,
+        ilens: torch.Tensor,
+        prev_states: Optional[torch.Tensor] = None,
+    ) -> Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]]:
+        """
+        Args:
+            xs_pad: input tensor (B, L, D)
+            ilens: input lengths (B)
+            prev_states: Not to be used now.
+        Returns:
+            Encoded tensor, output lengths (B), and None (placeholder for states)
+        """
+        masks = (~make_pad_mask(ilens)[:, None, :]).to(xs_pad.device)
+
+        if isinstance(self.embed, (Conv2dSubsampling, Conv2dNoSubsampling, VGG2L)):
+            # print(xs_pad.shape)
+            xs_pad, masks = self.embed(xs_pad, masks)
+            # print(xs_pad[0].size())
+        else:
+            xs_pad = self.embed(xs_pad)
+        xs_pad, masks = self.encoders(xs_pad, masks)
+        if isinstance(xs_pad, tuple):
+            xs_pad = xs_pad[0]
+
+        if self.normalize_before:
+            xs_pad = self.after_norm(xs_pad)
+        olens = masks.squeeze(1).sum(1)
+        return xs_pad, olens, None
+    
+    # def forward(self, xs, masks):
+        # """Encode input sequence.
+
+        # :param torch.Tensor xs: input tensor
+        # :param torch.Tensor masks: input mask
+        # :return: position embedded tensor and mask
+        # :rtype Tuple[torch.Tensor, torch.Tensor]:
+        # """
+        # if isinstance(self.embed, (Conv2dSubsampling, VGG2L)):
+            # xs, masks = self.embed(xs, masks)
+        # else:
+            # xs = self.embed(xs)
+
+        # xs, masks = self.encoders(xs, masks)
+        # if isinstance(xs, tuple):
+            # xs = xs[0]
+
+        # if self.normalize_before:
+            # xs = self.after_norm(xs)
+        # return xs, masks
--- a/ppg_extractor/encoder/convolution.py
+++ b/ppg_extractor/encoder/convolution.py
@@ -0,0 +1,74 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+
+# Copyright 2020 Johns Hopkins University (Shinji Watanabe)
+#                Northwestern Polytechnical University (Pengcheng Guo)
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+"""ConvolutionModule definition."""
+
+from torch import nn
+
+
+class ConvolutionModule(nn.Module):
+    """ConvolutionModule in Conformer model.
+
+    :param int channels: channels of cnn
+    :param int kernel_size: kernel size of cnn
+
+    """
+
+    def __init__(self, channels, kernel_size, activation=nn.ReLU(), bias=True):
+        """Construct a ConvolutionModule object."""
+        super(ConvolutionModule, self).__init__()
+        # kernel_size should be an odd number for 'SAME' padding
+        assert (kernel_size - 1) % 2 == 0
+
+        self.pointwise_conv1 = nn.Conv1d(
+            channels,
+            2 * channels,
+            kernel_size=1,
+            stride=1,
+            padding=0,
+            bias=bias,
+        )
+        self.depthwise_conv = nn.Conv1d(
+            channels,
+            channels,
+            kernel_size,
+            stride=1,
+            padding=(kernel_size - 1) // 2,
+            groups=channels,
+            bias=bias,
+        )
+        self.norm = nn.BatchNorm1d(channels)
+        self.pointwise_conv2 = nn.Conv1d(
+            channels,
+            channels,
+            kernel_size=1,
+            stride=1,
+            padding=0,
+            bias=bias,
+        )
+        self.activation = activation
+
+    def forward(self, x):
+        """Compute convolution module.
+
+        :param torch.Tensor x: (batch, time, size)
+        :return torch.Tensor: convoluted `value` (batch, time, d_model)
+        """
+        # exchange the temporal dimension and the feature dimension
+        x = x.transpose(1, 2)
+
+        # GLU mechanism
+        x = self.pointwise_conv1(x)  # (batch, 2*channel, time)
+        x = nn.functional.glu(x, dim=1)  # (batch, channel, time)
+
+        # 1D Depthwise Conv
+        x = self.depthwise_conv(x)
+        x = self.activation(self.norm(x))
+
+        x = self.pointwise_conv2(x)
+
+        return x.transpose(1, 2)
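
Review note on the odd-kernel assertion in `ConvolutionModule`: the sketch below is a plain-Python illustration, not part of the patch. It uses the output-length formula documented for `torch.nn.Conv1d` to show why `padding=(kernel_size - 1) // 2` keeps the time axis unchanged only when the kernel size is odd. The helper names `conv1d_out_len` and `same_padding` are hypothetical and introduced here for illustration only.

```python
def conv1d_out_len(length, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1-D convolution (formula from the torch.nn.Conv1d docs)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def same_padding(kernel_size):
    """Padding rule used by the depthwise conv; exact only for odd kernel sizes."""
    assert (kernel_size - 1) % 2 == 0, "kernel_size must be odd"
    return (kernel_size - 1) // 2


# With an odd kernel and stride 1, the number of frames is preserved.
for k in (3, 15, 31):
    assert conv1d_out_len(100, k, padding=same_padding(k)) == 100

# An even kernel with the same padding rule would drop one frame.
print(conv1d_out_len(100, 4, padding=(4 - 1) // 2))  # prints 99
```

Preserving the length matters because the module output is added back to its input as a residual in `EncoderLayer`.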
--- a/ppg_extractor/encoder/embedding.py
+++ b/ppg_extractor/encoder/embedding.py
@@ -0,0 +1,166 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+
+# Copyright 2019 Shigeki Karita
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+"""Positional Encoding Module."""
+
+import math
+
+import torch
+
+
+def _pre_hook(
+    state_dict,
+    prefix,
+    local_metadata,
+    strict,
+    missing_keys,
+    unexpected_keys,
+    error_msgs,
+):
+    """Perform pre-hook in load_state_dict for backward compatibility.
+
+    Note:
+        We saved self.pe until v.0.5.2 but we have omitted it later.
+        Therefore, we remove the item "pe" from `state_dict` for backward compatibility.
+
+    """
+    k = prefix + "pe"
+    if k in state_dict:
+        state_dict.pop(k)
+
+
+class PositionalEncoding(torch.nn.Module):
+    """Positional encoding.
+
+    :param int d_model: embedding dim
+    :param float dropout_rate: dropout rate
+    :param int max_len: maximum input length
+    :param reverse: whether to reverse the input position
+
+    """
+
+    def __init__(self, d_model, dropout_rate, max_len=5000, reverse=False):
+        """Construct a PositionalEncoding object."""
+        super(PositionalEncoding, self).__init__()
+        self.d_model = d_model
+        self.reverse = reverse
+        self.xscale = math.sqrt(self.d_model)
+        self.dropout = torch.nn.Dropout(p=dropout_rate)
+        self.pe = None
+        self.extend_pe(torch.tensor(0.0).expand(1, max_len))
+        self._register_load_state_dict_pre_hook(_pre_hook)
+
+    def extend_pe(self, x):
+        """Reset the positional encodings."""
+        if self.pe is not None:
+            if self.pe.size(1) >= x.size(1):
+                if self.pe.dtype != x.dtype or self.pe.device != x.device:
+                    self.pe = self.pe.to(dtype=x.dtype, device=x.device)
+                return
+        pe = torch.zeros(x.size(1), self.d_model)
+        if self.reverse:
+            position = torch.arange(
+                x.size(1) - 1, -1, -1.0, dtype=torch.float32
+            ).unsqueeze(1)
+        else:
+            position = torch.arange(0, x.size(1), dtype=torch.float32).unsqueeze(1)
+        div_term = torch.exp(
+            torch.arange(0, self.d_model, 2, dtype=torch.float32)
+            * -(math.log(10000.0) / self.d_model)
+        )
+        pe[:, 0::2] = torch.sin(position * div_term)
+        pe[:, 1::2] = torch.cos(position * div_term)
+        pe = pe.unsqueeze(0)
+        self.pe = pe.to(device=x.device, dtype=x.dtype)
+
+    def forward(self, x: torch.Tensor):
+        """Add positional encoding.
+
+        Args:
+            x (torch.Tensor): Input. Its shape is (batch, time, ...)
+
+        Returns:
+            torch.Tensor: Encoded tensor. Its shape is (batch, time, ...)
+
+        """
+        self.extend_pe(x)
+        x = x * self.xscale + self.pe[:, : x.size(1)]
+        return self.dropout(x)
+
+
+class ScaledPositionalEncoding(PositionalEncoding):
+    """Scaled positional encoding module.
+
+    See also: Sec. 3.2  https://arxiv.org/pdf/1809.08895.pdf
+
+    """
+
+    def __init__(self, d_model, dropout_rate, max_len=5000):
+        """Initialize class.
+
+        :param int d_model: embedding dim
+        :param float dropout_rate: dropout rate
+        :param int max_len: maximum input length
+
+        """
+        super().__init__(d_model=d_model, dropout_rate=dropout_rate, max_len=max_len)
+        self.alpha = torch.nn.Parameter(torch.tensor(1.0))
+
+    def reset_parameters(self):
+        """Reset parameters."""
+        self.alpha.data = torch.tensor(1.0)
+
+    def forward(self, x):
+        """Add positional encoding.
+
+        Args:
+            x (torch.Tensor): Input. Its shape is (batch, time, ...)
+
+        Returns:
+            torch.Tensor: Encoded tensor. Its shape is (batch, time, ...)
+
+        """
+        self.extend_pe(x)
+        x = x + self.alpha * self.pe[:, : x.size(1)]
+        return self.dropout(x)
+
+
+class RelPositionalEncoding(PositionalEncoding):
+    """Relative positional encoding module.
+
+    See : Appendix B in https://arxiv.org/abs/1901.02860
+
+    :param int d_model: embedding dim
+    :param float dropout_rate: dropout rate
+    :param int max_len: maximum input length
+
+    """
+
+    def __init__(self, d_model, dropout_rate, max_len=5000):
+        """Initialize class.
+
+        :param int d_model: embedding dim
+        :param float dropout_rate: dropout rate
+        :param int max_len: maximum input length
+
+        """
+        super().__init__(d_model, dropout_rate, max_len, reverse=True)
+
+    def forward(self, x):
+        """Compute positional encoding.
+
+        Args:
+            x (torch.Tensor): Input. Its shape is (batch, time, ...)
+
+        Returns:
+            torch.Tensor: x. Its shape is (batch, time, ...)
+            torch.Tensor: pos_emb. Its shape is (1, time, ...)
+
+        """
+        self.extend_pe(x)
+        x = x * self.xscale
+        pos_emb = self.pe[:, : x.size(1)]
+        return self.dropout(x), self.dropout(pos_emb)
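
Review note on `extend_pe`: the table it builds can be reproduced without torch, which may help when checking the `reverse=True` path used by `RelPositionalEncoding`. The following is a minimal sketch under the assumption that `d_model` is even; `sinusoidal_pe` is a hypothetical helper name and is not defined anywhere in the patch.

```python
import math


def sinusoidal_pe(num_positions, d_model, reverse=False):
    """Return a num_positions x d_model table as nested lists (d_model assumed even)."""
    if reverse:
        positions = range(num_positions - 1, -1, -1)
    else:
        positions = range(num_positions)
    # Same frequencies as div_term in extend_pe: exp(-2i * ln(10000) / d_model)
    div_term = [
        math.exp(-i * math.log(10000.0) / d_model) for i in range(0, d_model, 2)
    ]
    table = []
    for pos in positions:
        row = [0.0] * d_model
        row[0::2] = [math.sin(pos * w) for w in div_term]  # even columns
        row[1::2] = [math.cos(pos * w) for w in div_term]  # odd columns
        table.append(row)
    return table


forward_table = sinusoidal_pe(4, 4)
reverse_table = sinusoidal_pe(4, 4, reverse=True)
print(forward_table[0])  # position 0: sines are 0, cosines are 1
```

With `reverse=True` the rows are the same values in the opposite order, so the last row corresponds to position 0.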
--- a/ppg_extractor/encoder/encoder.py
+++ b/ppg_extractor/encoder/encoder.py
@@ -0,0 +1,217 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+
+# Copyright 2019 Shigeki Karita
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+"""Encoder definition."""
+
+import logging
+import torch
+
+from espnet.nets.pytorch_backend.conformer.convolution import ConvolutionModule
+from espnet.nets.pytorch_backend.conformer.encoder_layer import EncoderLayer
+from espnet.nets.pytorch_backend.nets_utils import get_activation
+from espnet.nets.pytorch_backend.transducer.vgg import VGG2L
+from espnet.nets.pytorch_backend.transformer.attention import (
+    MultiHeadedAttention,  # noqa: H301
+    RelPositionMultiHeadedAttention,  # noqa: H301
+)
+from espnet.nets.pytorch_backend.transformer.embedding import (
+    PositionalEncoding,  # noqa: H301
+    ScaledPositionalEncoding,  # noqa: H301
+    RelPositionalEncoding,  # noqa: H301
+)
+from espnet.nets.pytorch_backend.transformer.layer_norm import LayerNorm
+from espnet.nets.pytorch_backend.transformer.multi_layer_conv import Conv1dLinear
+from espnet.nets.pytorch_backend.transformer.multi_layer_conv import MultiLayeredConv1d
+from espnet.nets.pytorch_backend.transformer.positionwise_feed_forward import (
+    PositionwiseFeedForward,  # noqa: H301
+)
+from espnet.nets.pytorch_backend.transformer.repeat import repeat
+from espnet.nets.pytorch_backend.transformer.subsampling import Conv2dSubsampling
+
+
+class Encoder(torch.nn.Module):
+    """Conformer encoder module.
+
+    :param int idim: input dim
+    :param int attention_dim: dimension of attention
+    :param int attention_heads: the number of heads of multi head attention
+    :param int linear_units: the number of units of position-wise feed forward
+    :param int num_blocks: the number of encoder blocks
+    :param float dropout_rate: dropout rate
+    :param float attention_dropout_rate: dropout rate in attention
+    :param float positional_dropout_rate: dropout rate after adding positional encoding
+    :param str or torch.nn.Module input_layer: input layer type
+    :param bool normalize_before: whether to use layer_norm before the first block
+    :param bool concat_after: whether to concat attention layer's input and output
+        if True, additional linear will be applied.
+        i.e. x -> x + linear(concat(x, att(x)))
+        if False, no additional linear will be applied. i.e. x -> x + att(x)
+    :param str positionwise_layer_type: linear, conv1d, or conv1d-linear
+    :param int positionwise_conv_kernel_size: kernel size of positionwise conv1d layer
+    :param str pos_enc_layer_type: encoder positional encoding layer type
+    :param str selfattention_layer_type: encoder attention layer type
+    :param str activation_type: encoder activation function type
+    :param bool macaron_style: whether to use macaron style for positionwise layer
+    :param bool use_cnn_module: whether to use convolution module
+    :param int cnn_module_kernel: kernel size of convolution module
+    :param int padding_idx: padding_idx for input_layer=embed
+    """
+
+    def __init__(
+        self,
+        idim,
+        attention_dim=256,
+        attention_heads=4,
+        linear_units=2048,
+        num_blocks=6,
+        dropout_rate=0.1,
+        positional_dropout_rate=0.1,
+        attention_dropout_rate=0.0,
+        input_layer="conv2d",
+        normalize_before=True,
+        concat_after=False,
+        positionwise_layer_type="linear",
+        positionwise_conv_kernel_size=1,
+        macaron_style=False,
+        pos_enc_layer_type="abs_pos",
+        selfattention_layer_type="selfattn",
+        activation_type="swish",
+        use_cnn_module=False,
+        cnn_module_kernel=31,
+        padding_idx=-1,
+    ):
+        """Construct an Encoder object."""
+        super(Encoder, self).__init__()
+
+        activation = get_activation(activation_type)
+        if pos_enc_layer_type == "abs_pos":
+            pos_enc_class = PositionalEncoding
+        elif pos_enc_layer_type == "scaled_abs_pos":
+            pos_enc_class = ScaledPositionalEncoding
+        elif pos_enc_layer_type == "rel_pos":
+            assert selfattention_layer_type == "rel_selfattn"
+            pos_enc_class = RelPositionalEncoding
+        else:
+            raise ValueError("unknown pos_enc_layer: " + pos_enc_layer_type)
+
+        if input_layer == "linear":
+            self.embed = torch.nn.Sequential(
+                torch.nn.Linear(idim, attention_dim),
+                torch.nn.LayerNorm(attention_dim),
+                torch.nn.Dropout(dropout_rate),
+                pos_enc_class(attention_dim, positional_dropout_rate),
+            )
+        elif input_layer == "conv2d":
+            self.embed = Conv2dSubsampling(
+                idim,
+                attention_dim,
+                dropout_rate,
+                pos_enc_class(attention_dim, positional_dropout_rate),
+            )
+        elif input_layer == "vgg2l":
+            self.embed = VGG2L(idim, attention_dim)
+        elif input_layer == "embed":
+            self.embed = torch.nn.Sequential(
+                torch.nn.Embedding(idim, attention_dim, padding_idx=padding_idx),
+                pos_enc_class(attention_dim, positional_dropout_rate),
+            )
+        elif isinstance(input_layer, torch.nn.Module):
+            self.embed = torch.nn.Sequential(
+                input_layer,
+                pos_enc_class(attention_dim, positional_dropout_rate),
+            )
+        elif input_layer is None:
+            self.embed = torch.nn.Sequential(
+                pos_enc_class(attention_dim, positional_dropout_rate)
+            )
+        else:
+            raise ValueError("unknown input_layer: " + input_layer)
+        self.normalize_before = normalize_before
+        if positionwise_layer_type == "linear":
+            positionwise_layer = PositionwiseFeedForward
+            positionwise_layer_args = (
+                attention_dim,
+                linear_units,
+                dropout_rate,
+                activation,
+            )
+        elif positionwise_layer_type == "conv1d":
+            positionwise_layer = MultiLayeredConv1d
+            positionwise_layer_args = (
+                attention_dim,
+                linear_units,
+                positionwise_conv_kernel_size,
+                dropout_rate,
+            )
+        elif positionwise_layer_type == "conv1d-linear":
+            positionwise_layer = Conv1dLinear
+            positionwise_layer_args = (
+                attention_dim,
+                linear_units,
+                positionwise_conv_kernel_size,
+                dropout_rate,
+            )
+        else:
+            raise NotImplementedError("Support only linear, conv1d, or conv1d-linear.")
+
+        if selfattention_layer_type == "selfattn":
+            logging.info("encoder self-attention layer type = self-attention")
+            encoder_selfattn_layer = MultiHeadedAttention
+            encoder_selfattn_layer_args = (
+                attention_heads,
+                attention_dim,
+                attention_dropout_rate,
+            )
+        elif selfattention_layer_type == "rel_selfattn":
+            assert pos_enc_layer_type == "rel_pos"
+            encoder_selfattn_layer = RelPositionMultiHeadedAttention
+            encoder_selfattn_layer_args = (
+                attention_heads,
+                attention_dim,
+                attention_dropout_rate,
+            )
+        else:
+            raise ValueError("unknown encoder_attn_layer: " + selfattention_layer_type)
+
+        convolution_layer = ConvolutionModule
+        convolution_layer_args = (attention_dim, cnn_module_kernel, activation)
+
+        self.encoders = repeat(
+            num_blocks,
+            lambda lnum: EncoderLayer(
+                attention_dim,
+                encoder_selfattn_layer(*encoder_selfattn_layer_args),
+                positionwise_layer(*positionwise_layer_args),
+                positionwise_layer(*positionwise_layer_args) if macaron_style else None,
+                convolution_layer(*convolution_layer_args) if use_cnn_module else None,
+                dropout_rate,
+                normalize_before,
+                concat_after,
+            ),
+        )
+        if self.normalize_before:
+            self.after_norm = LayerNorm(attention_dim)
+
+    def forward(self, xs, masks):
+        """Encode input sequence.
+
+        :param torch.Tensor xs: input tensor
+        :param torch.Tensor masks: input mask
+        :return: position embedded tensor and mask
+        :rtype Tuple[torch.Tensor, torch.Tensor]:
+        """
+        if isinstance(self.embed, (Conv2dSubsampling, VGG2L)):
+            xs, masks = self.embed(xs, masks)
+        else:
+            xs = self.embed(xs)
+
+        xs, masks = self.encoders(xs, masks)
+        if isinstance(xs, tuple):
+            xs = xs[0]
+
+        if self.normalize_before:
+            xs = self.after_norm(xs)
+        return xs, masks
--- a/ppg_extractor/encoder/encoder_layer.py
+++ b/ppg_extractor/encoder/encoder_layer.py
@@ -0,0 +1,152 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+
+# Copyright 2020 Johns Hopkins University (Shinji Watanabe)
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+"""Encoder self-attention layer definition."""
+
+import torch
+
+from torch import nn
+
+from .layer_norm import LayerNorm
+
+
+class EncoderLayer(nn.Module):
+    """Encoder layer module.
+
+    :param int size: input dim
+    :param espnet.nets.pytorch_backend.transformer.attention.
+        MultiHeadedAttention self_attn: self attention module
+        RelPositionMultiHeadedAttention self_attn: self attention module
+    :param espnet.nets.pytorch_backend.transformer.positionwise_feed_forward.
+        PositionwiseFeedForward feed_forward:
+        feed forward module
+    :param espnet.nets.pytorch_backend.transformer.positionwise_feed_forward.
+        PositionwiseFeedForward feed_forward_macaron:
+        additional feed forward module for macaron style,
+        or None to disable it
+    :param espnet.nets.pytorch_backend.conformer.convolution.
+        ConvolutionModule conv_module:
+        convolution module, or None to disable it
+    :param float dropout_rate: dropout rate
+    :param bool normalize_before: whether to use layer_norm before the first block
+    :param bool concat_after: whether to concat attention layer's input and output
+        if True, additional linear will be applied.
+        i.e. x -> x + linear(concat(x, att(x)))
+        if False, no additional linear will be applied. i.e. x -> x + att(x)
+
+    """
+
+    def __init__(
+        self,
+        size,
+        self_attn,
+        feed_forward,
+        feed_forward_macaron,
+        conv_module,
+        dropout_rate,
+        normalize_before=True,
+        concat_after=False,
+    ):
+        """Construct an EncoderLayer object."""
+        super(EncoderLayer, self).__init__()
+        self.self_attn = self_attn
+        self.feed_forward = feed_forward
+        self.feed_forward_macaron = feed_forward_macaron
+        self.conv_module = conv_module
+        self.norm_ff = LayerNorm(size)  # for the FFN module
+        self.norm_mha = LayerNorm(size)  # for the MHA module
+        if feed_forward_macaron is not None:
+            self.norm_ff_macaron = LayerNorm(size)
+            self.ff_scale = 0.5
+        else:
+            self.ff_scale = 1.0
+        if self.conv_module is not None:
+            self.norm_conv = LayerNorm(size)  # for the CNN module
+            self.norm_final = LayerNorm(size)  # for the final output of the block
+        self.dropout = nn.Dropout(dropout_rate)
+        self.size = size
+        self.normalize_before = normalize_before
+        self.concat_after = concat_after
+        if self.concat_after:
+            self.concat_linear = nn.Linear(size + size, size)
+
+    def forward(self, x_input, mask, cache=None):
+        """Compute encoded features.
+
+        :param torch.Tensor x_input: encoded source features, w/o pos_emb
+        tuple((batch, max_time_in, size), (1, max_time_in, size))
+        or (batch, max_time_in, size)
+        :param torch.Tensor mask: mask for x (batch, max_time_in)
+        :param torch.Tensor cache: cache for x (batch, max_time_in - 1, size)
+        :rtype: Tuple[torch.Tensor, torch.Tensor]
+        """
+        if isinstance(x_input, tuple):
+            x, pos_emb = x_input[0], x_input[1]
+        else:
+            x, pos_emb = x_input, None
+
+        # whether to use macaron style
+        if self.feed_forward_macaron is not None:
+            residual = x
+            if self.normalize_before:
+                x = self.norm_ff_macaron(x)
+            x = residual + self.ff_scale * self.dropout(self.feed_forward_macaron(x))
+            if not self.normalize_before:
+                x = self.norm_ff_macaron(x)
+
+        # multi-headed self-attention module
+        residual = x
+        if self.normalize_before:
+            x = self.norm_mha(x)
+
+        if cache is None:
+            x_q = x
+        else:
+            assert cache.shape == (x.shape[0], x.shape[1] - 1, self.size)
+            x_q = x[:, -1:, :]
+            residual = residual[:, -1:, :]
+            mask = None if mask is None else mask[:, -1:, :]
+
+        if pos_emb is not None:
+            x_att = self.self_attn(x_q, x, x, pos_emb, mask)
+        else:
+            x_att = self.self_attn(x_q, x, x, mask)
+
+        if self.concat_after:
+            x_concat = torch.cat((x, x_att), dim=-1)
+            x = residual + self.concat_linear(x_concat)
+        else:
+            x = residual + self.dropout(x_att)
+        if not self.normalize_before:
+            x = self.norm_mha(x)
+
+        # convolution module
+        if self.conv_module is not None:
+            residual = x
+            if self.normalize_before:
+                x = self.norm_conv(x)
+            x = residual + self.dropout(self.conv_module(x))
+            if not self.normalize_before:
+                x = self.norm_conv(x)
+
+        # feed forward module
+        residual = x
+        if self.normalize_before:
+            x = self.norm_ff(x)
+        x = residual + self.ff_scale * self.dropout(self.feed_forward(x))
+        if not self.normalize_before:
+            x = self.norm_ff(x)
+
+        if self.conv_module is not None:
+            x = self.norm_final(x)
+
+        if cache is not None:
+            x = torch.cat([cache, x], dim=1)
+
+        if pos_emb is not None:
+            return (x, pos_emb), mask
+
+        return x, mask
--- a/ppg_extractor/encoder/layer_norm.py
+++ b/ppg_extractor/encoder/layer_norm.py
@@ -0,0 +1,33 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+
+# Copyright 2019 Shigeki Karita
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+"""Layer normalization module."""
+
+import torch
+
+
+class LayerNorm(torch.nn.LayerNorm):
+    """Layer normalization module.
+
+    :param int nout: output dim size
+    :param int dim: dimension to be normalized
+    """
+
+    def __init__(self, nout, dim=-1):
+        """Construct a LayerNorm object."""
+        super(LayerNorm, self).__init__(nout, eps=1e-12)
+        self.dim = dim
+
+    def forward(self, x):
+        """Apply layer normalization.
+
+        :param torch.Tensor x: input tensor
+        :return: layer normalized tensor
+        :rtype torch.Tensor
+        """
+        if self.dim == -1:
+            return super(LayerNorm, self).forward(x)
+        return super(LayerNorm, self).forward(x.transpose(1, -1)).transpose(1, -1)
--- a/ppg_extractor/encoder/multi_layer_conv.py
+++ b/ppg_extractor/encoder/multi_layer_conv.py
@@ -0,0 +1,105 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+
+# Copyright 2019 Tomoki Hayashi
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+"""Layer modules for FFT block in FastSpeech (Feed-forward Transformer)."""
+
+import torch
+
+
+class MultiLayeredConv1d(torch.nn.Module):
+    """Multi-layered conv1d for Transformer block.
+
+    This is a module of multi-layered conv1d designed
+    to replace positionwise feed-forward network
+    in Transformer block, which is introduced in
+    `FastSpeech: Fast, Robust and Controllable Text to Speech`_.
+
+    .. _`FastSpeech: Fast, Robust and Controllable Text to Speech`:
+        https://arxiv.org/pdf/1905.09263.pdf
+
+    """
+
+    def __init__(self, in_chans, hidden_chans, kernel_size, dropout_rate):
+        """Initialize MultiLayeredConv1d module.
+
+        Args:
+            in_chans (int): Number of input channels.
+            hidden_chans (int): Number of hidden channels.
+            kernel_size (int): Kernel size of conv1d.
+            dropout_rate (float): Dropout rate.
+
+        """
+        super(MultiLayeredConv1d, self).__init__()
+        self.w_1 = torch.nn.Conv1d(
+            in_chans,
+            hidden_chans,
+            kernel_size,
+            stride=1,
+            padding=(kernel_size - 1) // 2,
+        )
+        self.w_2 = torch.nn.Conv1d(
+            hidden_chans,
+            in_chans,
+            kernel_size,
+            stride=1,
+            padding=(kernel_size - 1) // 2,
+        )
+        self.dropout = torch.nn.Dropout(dropout_rate)
+
+    def forward(self, x):
+        """Calculate forward propagation.
+
+        Args:
+            x (Tensor): Batch of input tensors (B, ..., in_chans).
+
+        Returns:
+            Tensor: Batch of output tensors (B, ..., in_chans).
+
+        """
+        x = torch.relu(self.w_1(x.transpose(-1, 1))).transpose(-1, 1)
+        return self.w_2(self.dropout(x).transpose(-1, 1)).transpose(-1, 1)
+
+
+class Conv1dLinear(torch.nn.Module):
+    """Conv1D + Linear for Transformer block.
+
+    A variant of MultiLayeredConv1d, which replaces the second conv layer with a linear layer.
+
+    """
+
+    def __init__(self, in_chans, hidden_chans, kernel_size, dropout_rate):
+        """Initialize Conv1dLinear module.
+
+        Args:
+            in_chans (int): Number of input channels.
+            hidden_chans (int): Number of hidden channels.
+            kernel_size (int): Kernel size of conv1d.
+            dropout_rate (float): Dropout rate.
+
+        """
+        super(Conv1dLinear, self).__init__()
+        self.w_1 = torch.nn.Conv1d(
+            in_chans,
+            hidden_chans,
+            kernel_size,
+            stride=1,
+            padding=(kernel_size - 1) // 2,
+        )
+        self.w_2 = torch.nn.Linear(hidden_chans, in_chans)
+        self.dropout = torch.nn.Dropout(dropout_rate)
+
+    def forward(self, x):
+        """Calculate forward propagation.
+
+        Args:
+            x (Tensor): Batch of input tensors (B, ..., in_chans).
+
+        Returns:
+            Tensor: Batch of output tensors (B, ..., in_chans).
+
+        """
+        x = torch.relu(self.w_1(x.transpose(-1, 1))).transpose(-1, 1)
+        return self.w_2(self.dropout(x))
--- a/ppg_extractor/encoder/positionwise_feed_forward.py
+++ b/ppg_extractor/encoder/positionwise_feed_forward.py
@@ -0,0 +1,31 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+
+# Copyright 2019 Shigeki Karita
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+"""Positionwise feed forward layer definition."""
+
+import torch
+
+
+class PositionwiseFeedForward(torch.nn.Module):
+    """Positionwise feed forward layer.
+
+    :param int idim: input dimension
+    :param int hidden_units: number of hidden units
+    :param float dropout_rate: dropout rate
+
+    """
+
+    def __init__(self, idim, hidden_units, dropout_rate, activation=torch.nn.ReLU()):
+        """Construct a PositionwiseFeedForward object."""
+        super(PositionwiseFeedForward, self).__init__()
+        self.w_1 = torch.nn.Linear(idim, hidden_units)
+        self.w_2 = torch.nn.Linear(hidden_units, idim)
+        self.dropout = torch.nn.Dropout(dropout_rate)
+        self.activation = activation
+
+    def forward(self, x):
+        """Forward function."""
+        return self.w_2(self.dropout(self.activation(self.w_1(x))))
--- a/ppg_extractor/encoder/repeat.py
+++ b/ppg_extractor/encoder/repeat.py
@@ -0,0 +1,30 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+
+# Copyright 2019 Shigeki Karita
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+"""Repeat the same layer definition."""
+
+import torch
+
+
+class MultiSequential(torch.nn.Sequential):
+    """Multi-input multi-output torch.nn.Sequential."""
+
+    def forward(self, *args):
+        """Repeat."""
+        for m in self:
+            args = m(*args)
+        return args
+
+
+def repeat(N, fn):
+    """Repeat module N times.
+
+    :param int N: repeat time
+    :param function fn: function to generate module
+    :return: repeated modules
+    :rtype: MultiSequential
+    """
+    return MultiSequential(*[fn(n) for n in range(N)])
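
Review note on `MultiSequential` and `repeat`: the pattern is easier to see without torch. The sketch below mirrors the control flow with plain functions; `chain`, `make_layer`, and `repeat_plain` are hypothetical names used only for illustration. It shows the one constraint the pattern imposes: every layer must return a tuple, because the result is unpacked as the arguments of the next layer.

```python
def chain(*layers):
    """Compose layers so each one receives the previous layer's outputs as arguments."""
    def run(*args):
        for layer in layers:
            args = layer(*args)  # must be a tuple; it is unpacked on the next call
        return args
    return run


def make_layer(n):
    """Build a toy layer that adds its index to x and passes the mask through."""
    def layer(x, mask):
        return x + n, mask
    return layer


def repeat_plain(num, fn):
    """Plain-Python analogue of repeat(N, fn): build num layers and chain them."""
    return chain(*[fn(n) for n in range(num)])


stack = repeat_plain(3, make_layer)
result = stack(10, "mask")
print(result)  # prints (13, 'mask')
```

This is why `EncoderLayer.forward` returns `x, mask` (or `(x, pos_emb), mask`) rather than a bare tensor.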
--- a/ppg_extractor/encoder/subsampling.py
+++ b/ppg_extractor/encoder/subsampling.py
@@ -0,0 +1,218 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+
+# Copyright 2019 Shigeki Karita
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+"""Subsampling layer definition."""
+import logging
+import torch
+
+from espnet.nets.pytorch_backend.transformer.embedding import PositionalEncoding
+
+
+class Conv2dSubsampling(torch.nn.Module):
+    """Convolutional 2D subsampling (to 1/4 length or 1/2 length).
+
+    :param int idim: input dim
+    :param int odim: output dim
+    :param float dropout_rate: dropout rate
+    :param torch.nn.Module pos_enc: custom position encoding layer
+
+    """
+
+    def __init__(
+        self, idim, odim, dropout_rate, pos_enc=None, subsample_by_2=False
+    ):
+        """Construct a Conv2dSubsampling object."""
+        super(Conv2dSubsampling, self).__init__()
+        self.subsample_by_2 = subsample_by_2
+        if subsample_by_2:
+            self.conv = torch.nn.Sequential(
+                torch.nn.Conv2d(1, odim, kernel_size=5, stride=1, padding=2),
+                torch.nn.ReLU(),
+                torch.nn.Conv2d(odim, odim, kernel_size=4, stride=2, padding=1),
+                torch.nn.ReLU(),
+            )
+            self.out = torch.nn.Sequential(
+                torch.nn.Linear(odim * (idim // 2), odim),
+                pos_enc if pos_enc is not None else PositionalEncoding(odim, dropout_rate),
+            )
+        else:
+            self.conv = torch.nn.Sequential(
+                torch.nn.Conv2d(1, odim, kernel_size=4, stride=2, padding=1),
+                torch.nn.ReLU(),
+                torch.nn.Conv2d(odim, odim, kernel_size=4, stride=2, padding=1),
+                torch.nn.ReLU(),
+            )
+            self.out = torch.nn.Sequential(
+                torch.nn.Linear(odim * (idim // 4), odim),
+                pos_enc if pos_enc is not None else PositionalEncoding(odim, dropout_rate),
+            )
+
+    def forward(self, x, x_mask):
+        """Subsample x.
+
+        :param torch.Tensor x: input tensor
+        :param torch.Tensor x_mask: input mask
+        :return: subsampled x and mask
+        :rtype: Tuple[torch.Tensor, torch.Tensor]
+
+        """
+        x = x.unsqueeze(1)  # (b, c, t, f)
+        x = self.conv(x)
+        b, c, t, f = x.size()
+        x = self.out(x.transpose(1, 2).contiguous().view(b, t, c * f))
+        if x_mask is None:
+            return x, None
+        if self.subsample_by_2:
+            return x, x_mask[:, :, ::2]
+        else:
+            return x, x_mask[:, :, ::2][:, :, ::2]
+
+    def __getitem__(self, key):
+        """Return the positional encoding layer.
+
+        When reset_parameters() is called, if use_scaled_pos_enc is used,
+            return the positioning encoding.
+
+        """
+        if key != -1:
+            raise NotImplementedError("Support only `-1` (for `reset_parameters`).")
+        return self.out[key]
+
+
+class Conv2dNoSubsampling(torch.nn.Module):
+    """Convolutional 2D without subsampling.
+
+    :param int idim: input dim
+    :param int odim: output dim
+    :param float dropout_rate: dropout rate
+    :param torch.nn.Module pos_enc: custom position encoding layer
+
+    """
+
+    def __init__(self, idim, odim, dropout_rate, pos_enc=None):
+        """Construct a Conv2dNoSubsampling object."""
+        super().__init__()
+        logging.info("Encoder does not do down-sample on mel-spectrogram.")
+        self.conv = torch.nn.Sequential(
+            torch.nn.Conv2d(1, odim, kernel_size=5, stride=1, padding=2),
+            torch.nn.ReLU(),
+            torch.nn.Conv2d(odim, odim, kernel_size=5, stride=1, padding=2),
+            torch.nn.ReLU(),
+        )
+        self.out = torch.nn.Sequential(
+            torch.nn.Linear(odim * idim, odim),
+            pos_enc if pos_enc is not None else PositionalEncoding(odim, dropout_rate),
+        )
+
+    def forward(self, x, x_mask):
+        """Apply the convolutions to x without subsampling.
+
+        :param torch.Tensor x: input tensor
+        :param torch.Tensor x_mask: input mask
+        :return: transformed x and the unchanged mask
+        :rtype: Tuple[torch.Tensor, torch.Tensor]
+
+        """
+        x = x.unsqueeze(1)  # (b, c, t, f)
+        x = self.conv(x)
+        b, c, t, f = x.size()
+        x = self.out(x.transpose(1, 2).contiguous().view(b, t, c * f))
+        if x_mask is None:
+            return x, None
+        return x, x_mask
+
+    def __getitem__(self, key):
+        """Return the positional encoding layer.
+
+        When reset_parameters() is called, if use_scaled_pos_enc is used,
+            return the positioning encoding.
+
+        """
+        if key != -1:
+            raise NotImplementedError("Support only `-1` (for `reset_parameters`).")
+        return self.out[key]
+
+
+class Conv2dSubsampling6(torch.nn.Module):
+    """Convolutional 2D subsampling (to 1/6 length).
+
+    :param int idim: input dim
+    :param int odim: output dim
+    :param float dropout_rate: dropout rate
+
+    """
+
+    def __init__(self, idim, odim, dropout_rate):
+        """Construct a Conv2dSubsampling6 object."""
+        super(Conv2dSubsampling6, self).__init__()
+        self.conv = torch.nn.Sequential(
+            torch.nn.Conv2d(1, odim, 3, 2),
+            torch.nn.ReLU(),
+            torch.nn.Conv2d(odim, odim, 5, 3),
+            torch.nn.ReLU(),
+        )
+        self.out = torch.nn.Sequential(
+            torch.nn.Linear(odim * (((idim - 1) // 2 - 2) // 3), odim),
+            PositionalEncoding(odim, dropout_rate),
+        )
+
+    def forward(self, x, x_mask):
+        """Subsample x.
+
+        :param torch.Tensor x: input tensor
+        :param torch.Tensor x_mask: input mask
+        :return: subsampled x and mask
+        :rtype: Tuple[torch.Tensor, torch.Tensor]
+        """
+        x = x.unsqueeze(1)  # (b, c, t, f)
+        x = self.conv(x)
+        b, c, t, f = x.size()
+        x = self.out(x.transpose(1, 2).contiguous().view(b, t, c * f))
+        if x_mask is None:
+            return x, None
+        return x, x_mask[:, :, :-2:2][:, :, :-4:3]
+
+
+class Conv2dSubsampling8(torch.nn.Module):
+    """Convolutional 2D subsampling (to 1/8 length).
+
+    :param int idim: input dim
+    :param int odim: output dim
+    :param float dropout_rate: dropout rate
+
+    """
+
+    def __init__(self, idim, odim, dropout_rate):
+        """Construct a Conv2dSubsampling8 object."""
+        super(Conv2dSubsampling8, self).__init__()
+        self.conv = torch.nn.Sequential(
+            torch.nn.Conv2d(1, odim, 3, 2),
+            torch.nn.ReLU(),
+            torch.nn.Conv2d(odim, odim, 3, 2),
+            torch.nn.ReLU(),
+            torch.nn.Conv2d(odim, odim, 3, 2),
+            torch.nn.ReLU(),
+        )
+        self.out = torch.nn.Sequential(
+            torch.nn.Linear(odim * ((((idim - 1) // 2 - 1) // 2 - 1) // 2), odim),
+            PositionalEncoding(odim, dropout_rate),
+        )
+
+    def forward(self, x, x_mask):
+        """Subsample x.
+
+        :param torch.Tensor x: input tensor
+        :param torch.Tensor x_mask: input mask
+        :return: subsampled x and mask
+        :rtype: Tuple[torch.Tensor, torch.Tensor]
+        """
+        x = x.unsqueeze(1)  # (b, c, t, f)
+        x = self.conv(x)
+        b, c, t, f = x.size()
+        x = self.out(x.transpose(1, 2).contiguous().view(b, t, c * f))
+        if x_mask is None:
+            return x, None
+        return x, x_mask[:, :, :-2:2][:, :, :-2:2][:, :, :-2:2]
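The mask slicing in the subsampling classes is meant to track the time length produced by the convolutions. The sketch below checks that relationship with plain Python lists; the helper names are hypothetical, and the length formula is the standard one for a convolution with dilation 1 and floor rounding.

```python
def conv_out_len(n, kernel, stride, padding=0):
    """Output length of a convolution along one axis (floor mode, dilation 1)."""
    return (n + 2 * padding - kernel) // stride + 1


def time_len_subsampling6(n):
    # Conv2d(kernel 3, stride 2) followed by Conv2d(kernel 5, stride 3)
    return conv_out_len(conv_out_len(n, 3, 2), 5, 3)


def mask_len_subsampling6(n):
    # Same slicing as Conv2dSubsampling6.forward applies to x_mask
    mask = list(range(n))
    return len(mask[:-2:2][:-4:3])


def time_len_subsampling4(n):
    # Two Conv2d layers with kernel 4, stride 2, padding 1
    return conv_out_len(conv_out_len(n, 4, 2, 1), 4, 2, 1)


def mask_len_subsampling4(n):
    # Same slicing as Conv2dSubsampling.forward applies to x_mask
    mask = list(range(n))
    return len(mask[::2][::2])


print(time_len_subsampling6(20), mask_len_subsampling6(20))  # 2 2
print(time_len_subsampling4(8), mask_len_subsampling4(8))    # 2 2
print(time_len_subsampling4(7), mask_len_subsampling4(7))    # 1 2
```

In this sketch the 1/6 variant agrees for every input long enough to pass through both convolutions, while the 1/4 variant agrees when the number of frames is a multiple of 4 and can differ otherwise, as the last line shows.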
--- a/ppg_extractor/encoder/swish.py
+++ b/ppg_extractor/encoder/swish.py
@@ -0,0 +1,18 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+
+# Copyright 2020 Johns Hopkins University (Shinji Watanabe)
+#                Northwestern Polytechnical University (Pengcheng Guo)
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+"""Swish() activation function for Conformer."""
+
+import torch
+
+
+class Swish(torch.nn.Module):
+    """Swish activation module."""
+
+    def forward(self, x):
+        """Return the Swish activation of x, i.e. x * sigmoid(x)."""
+        return x * torch.sigmoid(x)
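For readers who want to see the formula in isolation, here is a scalar version of Swish written with only the standard library. It is a sketch for illustration, not a replacement for the tensor module; `swish` is a hypothetical name, and `x * sigmoid(x)` is rewritten as `x / (1 + exp(-x))`.

```python
import math


def swish(x):
    """Swish activation for a scalar: x * sigmoid(x)."""
    # Note: math.exp overflows for very large negative x (below about -709).
    return x / (1.0 + math.exp(-x))


print(swish(0.0))                # 0.0
print(swish(2.0) - swish(-2.0))  # close to 2.0, since sigmoid(x) + sigmoid(-x) == 1
```

The second line uses the identity `swish(x) - swish(-x) == x`, which is a handy sanity check for any implementation.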
--- a/ppg_extractor/encoder/vgg.py
+++ b/ppg_extractor/encoder/vgg.py
@@ -0,0 +1,77 @@
+"""VGG2L definition for transformer-transducer."""
+
+import torch
+
+
+class VGG2L(torch.nn.Module):
+    """VGG2L module for transformer-transducer encoder."""
+
+    def __init__(self, idim, odim):
+        """Construct a VGG2L object.
+
+        Args:
+            idim (int): dimension of inputs
+            odim (int): dimension of outputs
+
+        """
+        super(VGG2L, self).__init__()
+
+        self.vgg2l = torch.nn.Sequential(
+            torch.nn.Conv2d(1, 64, 3, stride=1, padding=1),
+            torch.nn.ReLU(),
+            torch.nn.Conv2d(64, 64, 3, stride=1, padding=1),
+            torch.nn.ReLU(),
+            torch.nn.MaxPool2d((3, 2)),
+            torch.nn.Conv2d(64, 128, 3, stride=1, padding=1),
+            torch.nn.ReLU(),
+            torch.nn.Conv2d(128, 128, 3, stride=1, padding=1),
+            torch.nn.ReLU(),
+            torch.nn.MaxPool2d((2, 2)),
+        )
+
+        self.output = torch.nn.Linear(128 * ((idim // 2) // 2), odim)
+
+    def forward(self, x, x_mask):
+        """VGG2L forward for x.
+
+        Args:
+            x (torch.Tensor): input tensor (B, T, idim)
+            x_mask (torch.Tensor): (B, 1, T)
+
+        Returns:
+            x (torch.Tensor): output tensor (B, sub(T), attention_dim)
+            x_mask (torch.Tensor): (B, 1, sub(T))
+
+        """
+        x = x.unsqueeze(1)
+        x = self.vgg2l(x)
+
+        b, c, t, f = x.size()
+
+        x = self.output(x.transpose(1, 2).contiguous().view(b, t, c * f))
+
+        if x_mask is None:
+            return x, None
+        else:
+            x_mask = self.create_new_mask(x_mask, x)
+
+            return x, x_mask
+
+    def create_new_mask(self, x_mask, x):
+        """Create a subsampled version of x_mask.
+
+        Args:
+            x_mask (torch.Tensor): (B, 1, T)
+            x (torch.Tensor): (B, sub(T), attention_dim)
+
+        Returns:
+            x_mask (torch.Tensor): (B, 1, sub(T))
+
+        """
+        x_t1 = x_mask.size(2) - (x_mask.size(2) % 3)
+        x_mask = x_mask[:, :, :x_t1][:, :, ::3]
+
+        x_t2 = x_mask.size(2) - (x_mask.size(2) % 2)
+        x_mask = x_mask[:, :, :x_t2][:, :, ::2]
+
+        return x_mask
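The two-step mask reduction in `create_new_mask` follows the two max-pooling layers, whose time kernels are 3 and 2. A plain-list analogue for a single utterance, with the hypothetical name `subsample_mask`, makes the index arithmetic easier to follow:

```python
def subsample_mask(mask):
    """Plain-list analogue of VGG2L.create_new_mask for one utterance."""
    # First pooling: time kernel 3, so drop the remainder and keep every 3rd frame.
    t1 = len(mask) - len(mask) % 3
    mask = mask[:t1][::3]
    # Second pooling: time kernel 2, so drop the remainder and keep every 2nd frame.
    t2 = len(mask) - len(mask) % 2
    return mask[:t2][::2]


frames = list(range(13))
print(subsample_mask(frames))  # [0, 6]
```

With 13 input frames the result has `(13 // 3) // 2 == 2` entries, matching the time length that the pooled feature map would have.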
--- a/ppg_extractor/encoders.py
+++ b/ppg_extractor/encoders.py
@@ -0,0 +1,298 @@
+import logging
+import six
+
+import numpy as np
+import torch
+import torch.nn.functional as F
+from torch.nn.utils.rnn import pack_padded_sequence
+from torch.nn.utils.rnn import pad_packed_sequence
+
+from .e2e_asr_common import get_vgg2l_odim
+from .nets_utils import make_pad_mask, to_device
+
+
+class RNNP(torch.nn.Module):
+    """RNN with projection layer module
+
+    :param int idim: dimension of inputs
+    :param int elayers: number of encoder layers
+    :param int cdim: number of rnn units (resulted in cdim * 2 if bidirectional)
+    :param int hdim: number of projection units
+    :param np.ndarray subsample: list of subsampling numbers
+    :param float dropout: dropout rate
+    :param str typ: The RNN type
+    """
+
+    def __init__(self, idim, elayers, cdim, hdim, subsample, dropout, typ="blstm"):
+        super(RNNP, self).__init__()
+        bidir = typ[0] == "b"
+        for i in six.moves.range(elayers):
+            if i == 0:
+                inputdim = idim
+            else:
+                inputdim = hdim
+            rnn = torch.nn.LSTM(inputdim, cdim, dropout=dropout, num_layers=1, bidirectional=bidir,
+                                batch_first=True) if "lstm" in typ \
+                else torch.nn.GRU(inputdim, cdim, dropout=dropout, num_layers=1, bidirectional=bidir, batch_first=True)
+            setattr(self, "%s%d" % ("birnn" if bidir else "rnn", i), rnn)
+            # bottleneck layer to merge
+            if bidir:
+                setattr(self, "bt%d" % i, torch.nn.Linear(2 * cdim, hdim))
+            else:
+                setattr(self, "bt%d" % i, torch.nn.Linear(cdim, hdim))
+
+        self.elayers = elayers
+        self.cdim = cdim
+        self.subsample = subsample
+        self.typ = typ
+        self.bidir = bidir
+
+    def forward(self, xs_pad, ilens, prev_state=None):
+        """RNNP forward
+
+        :param torch.Tensor xs_pad: batch of padded input sequences (B, Tmax, idim)
+        :param torch.Tensor ilens: batch of lengths of input sequences (B)
+        :param torch.Tensor prev_state: batch of previous RNN states
+        :return: batch of hidden state sequences (B, Tmax, hdim)
+        :rtype: torch.Tensor
+        """
+        logging.debug(self.__class__.__name__ + ' input lengths: ' + str(ilens))
+        elayer_states = []
+        for layer in six.moves.range(self.elayers):
+            xs_pack = pack_padded_sequence(xs_pad, ilens, batch_first=True, enforce_sorted=False)
+            rnn = getattr(self, ("birnn" if self.bidir else "rnn") + str(layer))
+            rnn.flatten_parameters()
+            if prev_state is not None and rnn.bidirectional:
+                prev_state = reset_backward_rnn_state(prev_state)
+            ys, states = rnn(xs_pack, hx=None if prev_state is None else prev_state[layer])
+            elayer_states.append(states)
+            # ys: utt list of frame x cdim x 2 (2: means bidirectional)
+            ys_pad, ilens = pad_packed_sequence(ys, batch_first=True)
+            sub = self.subsample[layer + 1]
+            if sub > 1:
+                ys_pad = ys_pad[:, ::sub]
+                ilens = [int(i + 1) // sub for i in ilens]
+            # (sum _utt frame_utt) x dim
+            projected = getattr(self, 'bt' + str(layer)
+                                )(ys_pad.contiguous().view(-1, ys_pad.size(2)))
+            if layer == self.elayers - 1:
+                xs_pad = projected.view(ys_pad.size(0), ys_pad.size(1), -1)
+            else:
+                xs_pad = torch.tanh(projected.view(ys_pad.size(0), ys_pad.size(1), -1))
+
+        return xs_pad, ilens, elayer_states  # x: utt list of frame x dim
+
+
+class RNN(torch.nn.Module):
+    """RNN module
+
+    :param int idim: dimension of inputs
+    :param int elayers: number of encoder layers
+    :param int cdim: number of rnn units (resulted in cdim * 2 if bidirectional)
+    :param int hdim: number of final projection units
+    :param float dropout: dropout rate
+    :param str typ: The RNN type
+    """
+
+    def __init__(self, idim, elayers, cdim, hdim, dropout, typ="blstm"):
+        super(RNN, self).__init__()
+        bidir = typ[0] == "b"
+        self.nbrnn = torch.nn.LSTM(idim, cdim, elayers, batch_first=True,
+                                   dropout=dropout, bidirectional=bidir) if "lstm" in typ \
+            else torch.nn.GRU(idim, cdim, elayers, batch_first=True, dropout=dropout,
+                              bidirectional=bidir)
+        if bidir:
+            self.l_last = torch.nn.Linear(cdim * 2, hdim)
+        else:
+            self.l_last = torch.nn.Linear(cdim, hdim)
+        self.typ = typ
+
+    def forward(self, xs_pad, ilens, prev_state=None):
+        """RNN forward
+
+        :param torch.Tensor xs_pad: batch of padded input sequences (B, Tmax, D)
+        :param torch.Tensor ilens: batch of lengths of input sequences (B)
+        :param torch.Tensor prev_state: batch of previous RNN states
+        :return: batch of hidden state sequences (B, Tmax, eprojs)
+        :rtype: torch.Tensor
+        """
+        logging.debug(self.__class__.__name__ + ' input lengths: ' + str(ilens))
+        xs_pack = pack_padded_sequence(xs_pad, ilens, batch_first=True)
+        self.nbrnn.flatten_parameters()
+        if prev_state is not None and self.nbrnn.bidirectional:
+            # We assume that when previous state is passed, it means that we're streaming the input
+            # and therefore cannot propagate backward BRNN state (otherwise it goes in the wrong direction)
+            prev_state = reset_backward_rnn_state(prev_state)
+        ys, states = self.nbrnn(xs_pack, hx=prev_state)
+        # ys: utt list of frame x cdim x 2 (2: means bidirectional)
+        ys_pad, ilens = pad_packed_sequence(ys, batch_first=True)
+        # (sum _utt frame_utt) x dim
+        projected = torch.tanh(self.l_last(
+            ys_pad.contiguous().view(-1, ys_pad.size(2))))
+        xs_pad = projected.view(ys_pad.size(0), ys_pad.size(1), -1)
+        return xs_pad, ilens, states  # x: utt list of frame x dim
+
+
+def reset_backward_rnn_state(states):
+    """Sets backward BRNN states to zeroes - useful in processing of sliding windows over the inputs"""
+    if isinstance(states, (list, tuple)):
+        for state in states:
+            state[1::2] = 0.
+    else:
+        states[1::2] = 0.
+    return states
+
+
+class VGG2L(torch.nn.Module):
+    """VGG-like module
+
+    :param int in_channel: number of input channels
+    """
+
+    def __init__(self, in_channel=1, downsample=True):
+        super(VGG2L, self).__init__()
+        # CNN layer (VGG motivated)
+        self.conv1_1 = torch.nn.Conv2d(in_channel, 64, 3, stride=1, padding=1)
+        self.conv1_2 = torch.nn.Conv2d(64, 64, 3, stride=1, padding=1)
+        self.conv2_1 = torch.nn.Conv2d(64, 128, 3, stride=1, padding=1)
+        self.conv2_2 = torch.nn.Conv2d(128, 128, 3, stride=1, padding=1)
+
+        self.in_channel = in_channel
+        self.downsample = downsample
+        if downsample:
+            self.stride = 2
+        else:
+            self.stride = 1
+
+    def forward(self, xs_pad, ilens, **kwargs):
+        """VGG2L forward
+
+        :param torch.Tensor xs_pad: batch of padded input sequences (B, Tmax, D)
+        :param torch.Tensor ilens: batch of lengths of input sequences (B)
+        :return: batch of padded hidden state sequences (B, Tmax // 4, 128 * D // 4) if downsample
+        :rtype: torch.Tensor
+        """
+        logging.debug(self.__class__.__name__ + ' input lengths: ' + str(ilens))
+
+        # x: utt x frame x dim
+        # xs_pad = F.pad_sequence(xs_pad)
+
+        # x: utt x 1 (input channel num) x frame x dim
+        xs_pad = xs_pad.view(xs_pad.size(0), xs_pad.size(1), self.in_channel,
+                             xs_pad.size(2) // self.in_channel).transpose(1, 2)
+
+        # NOTE: max_pool1d ?
+        xs_pad = F.relu(self.conv1_1(xs_pad))
+        xs_pad = F.relu(self.conv1_2(xs_pad))
+        if self.downsample:
+            xs_pad = F.max_pool2d(xs_pad, 2, stride=self.stride, ceil_mode=True)
+
+        xs_pad = F.relu(self.conv2_1(xs_pad))
+        xs_pad = F.relu(self.conv2_2(xs_pad))
+        if self.downsample:
+            xs_pad = F.max_pool2d(xs_pad, 2, stride=self.stride, ceil_mode=True)
+        if torch.is_tensor(ilens):
+            ilens = ilens.cpu().numpy()
+        else:
+            ilens = np.array(ilens, dtype=np.float32)
+        if self.downsample:
+            ilens = np.array(np.ceil(ilens / 2), dtype=np.int64)
+            ilens = np.array(
+                np.ceil(np.array(ilens, dtype=np.float32) / 2), dtype=np.int64).tolist()
+
+        # x: utt_list of frame (remove zero-padded frames) x (input channel num x dim)
+        xs_pad = xs_pad.transpose(1, 2)
+        xs_pad = xs_pad.contiguous().view(
+            xs_pad.size(0), xs_pad.size(1), xs_pad.size(2) * xs_pad.size(3))
+        return xs_pad, ilens, None  # no state in this layer
+
+
+class Encoder(torch.nn.Module):
+    """Encoder module
+
+    :param str etype: type of encoder network
+    :param int idim: number of dimensions of encoder network
+    :param int elayers: number of layers of encoder network
+    :param int eunits: number of lstm units of encoder network
+    :param int eprojs: number of projection units of encoder network
+    :param np.ndarray subsample: list of subsampling numbers
+    :param float dropout: dropout rate
+    :param int in_channel: number of input channels
+    """
+
+    def __init__(self, etype, idim, elayers, eunits, eprojs, subsample, dropout, in_channel=1):
+        super(Encoder, self).__init__()
+        typ = (etype[3:] if etype.startswith("vgg") else etype).rstrip("p")
+        if typ not in ['lstm', 'gru', 'blstm', 'bgru']:
+            logging.error("Error: need to specify an appropriate encoder architecture")
+
+        if etype.startswith("vgg"):
+            if etype[-1] == "p":
+                self.enc = torch.nn.ModuleList([VGG2L(in_channel),
+                                                RNNP(get_vgg2l_odim(idim, in_channel=in_channel), elayers, eunits,
+                                                     eprojs,
+                                                     subsample, dropout, typ=typ)])
+                logging.info('Use CNN-VGG + ' + typ.upper() + 'P for encoder')
+            else:
+                self.enc = torch.nn.ModuleList([VGG2L(in_channel),
+                                                RNN(get_vgg2l_odim(idim, in_channel=in_channel), elayers, eunits,
+                                                    eprojs,
+                                                    dropout, typ=typ)])
+                logging.info('Use CNN-VGG + ' + typ.upper() + ' for encoder')
+        else:
+            if etype[-1] == "p":
+                self.enc = torch.nn.ModuleList(
+                    [RNNP(idim, elayers, eunits, eprojs, subsample, dropout, typ=typ)])
+                logging.info(typ.upper() + ' with every-layer projection for encoder')
+            else:
+                self.enc = torch.nn.ModuleList([RNN(idim, elayers, eunits, eprojs, dropout, typ=typ)])
+                logging.info(typ.upper() + ' without projection for encoder')
+
+    def forward(self, xs_pad, ilens, prev_states=None):
+        """Encoder forward
+
+        :param torch.Tensor xs_pad: batch of padded input sequences (B, Tmax, D)
+        :param torch.Tensor ilens: batch of lengths of input sequences (B)
+        :param torch.Tensor prev_states: batch of previous encoder hidden states (?, ...)
+        :return: batch of hidden state sequences (B, Tmax, eprojs)
+        :rtype: torch.Tensor
+        """
+        if prev_states is None:
+            prev_states = [None] * len(self.enc)
+        assert len(prev_states) == len(self.enc)
+
+        current_states = []
+        for module, prev_state in zip(self.enc, prev_states):
+            xs_pad, ilens, states = module(xs_pad, ilens, prev_state=prev_state)
+            current_states.append(states)
+
+        # make mask to remove bias value in padded part
+        mask = to_device(self, make_pad_mask(ilens).unsqueeze(-1))
+
+        return xs_pad.masked_fill(mask, 0.0), ilens, current_states
+
+
+def encoder_for(args, idim, subsample):
+    """Instantiates an encoder module given the program arguments
+
+    :param Namespace args: The arguments
+    :param int or List of integer idim: dimension of input, e.g. 83, or
+                                        List of dimensions of inputs, e.g. [83,83]
+    :param List or List of List subsample: subsample factors, e.g. [1,2,2,1,1], or
+                                        List of subsample factors of each encoder. e.g. [[1,2,2,1,1], [1,2,2,1,1]]
+    :rtype: torch.nn.Module
+    :return: The encoder module
+    """
+    num_encs = getattr(args, "num_encs", 1)  # use getattr to keep compatibility
+    if num_encs == 1:
+        # compatible with single encoder asr mode
+        return Encoder(args.etype, idim, args.elayers, args.eunits, args.eprojs, subsample, args.dropout_rate)
+    elif num_encs > 1:
+        enc_list = torch.nn.ModuleList()
+        for idx in range(num_encs):
+            enc = Encoder(args.etype[idx], idim[idx], args.elayers[idx], args.eunits[idx], args.eprojs, subsample[idx],
+                          args.dropout_rate[idx])
+            enc_list.append(enc)
+        return enc_list
+    else:
+        raise ValueError("Number of encoders needs to be at least one. {}".format(num_encs))
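The `etype` strings accepted by `Encoder` combine an optional `vgg` prefix, an RNN type (`lstm`, `gru`, `blstm`, `bgru`) and an optional `p` suffix for per-layer projection. The sketch below parses such a string with explicit prefix and suffix checks; `parse_etype` is a hypothetical helper, not part of the diff. It avoids `str.lstrip`, which strips a set of characters rather than a prefix and therefore mangles types that start with `g`.

```python
def parse_etype(etype):
    """Split an encoder type string into (use_vgg, rnn_type, use_projection)."""
    use_vgg = etype.startswith("vgg")
    rest = etype[len("vgg"):] if use_vgg else etype
    use_projection = rest.endswith("p")
    rnn_type = rest[:-1] if use_projection else rest
    return use_vgg, rnn_type, use_projection


# str.lstrip removes any leading characters from the given set, not a prefix:
print("gru".lstrip("vgg"))     # ru
print(parse_etype("vgggrup"))  # (True, 'gru', True)
print(parse_etype("blstm"))    # (False, 'blstm', False)
```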
--- a/ppg_extractor/frontend.py
+++ b/ppg_extractor/frontend.py
@@ -0,0 +1,115 @@
+import copy
+from typing import Tuple
+import numpy as np
+import torch
+from torch_complex.tensor import ComplexTensor
+
+from .log_mel import LogMel
+from .stft import Stft
+
+
+class DefaultFrontend(torch.nn.Module):
+    """Conventional frontend structure for ASR
+
+    Stft -> WPE -> MVDR-Beamformer -> Power-spec -> Mel-Fbank -> CMVN
+    """
+
+    def __init__(
+        self,
+        fs: int = 16000,
+        n_fft: int = 1024,
+        win_length: int = 800,
+        hop_length: int = 160,
+        center: bool = True,
+        pad_mode: str = "reflect",
+        normalized: bool = False,
+        onesided: bool = True,
+        n_mels: int = 80,
+        fmin: int = None,
+        fmax: int = None,
+        htk: bool = False,
+        norm=1,
+        frontend_conf=None,  # Optional[dict] = get_default_kwargs(Frontend),
+        kaldi_padding_mode=False,
+        downsample_rate: int = 1,
+    ):
+        super().__init__()
+        self.downsample_rate = downsample_rate
+
+        # Deepcopy (In general, dict shouldn't be used as default arg)
+        frontend_conf = copy.deepcopy(frontend_conf)
+
+        self.stft = Stft(
+            n_fft=n_fft,
+            win_length=win_length,
+            hop_length=hop_length,
+            center=center,
+            pad_mode=pad_mode,
+            normalized=normalized,
+            onesided=onesided,
+            kaldi_padding_mode=kaldi_padding_mode
+        )
+        if frontend_conf is not None:
+            self.frontend = Frontend(idim=n_fft // 2 + 1, **frontend_conf)
+        else:
+            self.frontend = None
+
+        self.logmel = LogMel(
+            fs=fs, n_fft=n_fft, n_mels=n_mels, fmin=fmin, fmax=fmax, htk=htk, norm=norm,
+        )
+        self.n_mels = n_mels
+
+    def output_size(self) -> int:
+        return self.n_mels
+
+    def forward(
+        self, input: torch.Tensor, input_lengths: torch.Tensor
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        # 1. Domain-conversion: e.g. Stft: time -> time-freq
+        input_stft, feats_lens = self.stft(input, input_lengths)
+
+        assert input_stft.dim() >= 4, input_stft.shape
+        # "2" refers to the real/imag parts of Complex
+        assert input_stft.shape[-1] == 2, input_stft.shape
+
+        # Change torch.Tensor to ComplexTensor
+        # input_stft: (..., F, 2) -> (..., F)
+        input_stft = ComplexTensor(input_stft[..., 0], input_stft[..., 1])
+
+        # 2. [Option] Speech enhancement
+        if self.frontend is not None:
+            assert isinstance(input_stft, ComplexTensor), type(input_stft)
+            # input_stft: (Batch, Length, [Channel], Freq)
+            input_stft, _, mask = self.frontend(input_stft, feats_lens)
+
+        # 3. [Multi channel case]: Select a channel
+        if input_stft.dim() == 4:
+            # h: (B, T, C, F) -> h: (B, T, F)
+            if self.training:
+                # Select 1ch randomly
+                ch = np.random.randint(input_stft.size(2))
+                input_stft = input_stft[:, :, ch, :]
+            else:
+                # Use the first channel
+                input_stft = input_stft[:, :, 0, :]
+
+        # 4. STFT -> Power spectrum
+        # h: ComplexTensor(B, T, F) -> torch.Tensor(B, T, F)
+        input_power = input_stft.real ** 2 + input_stft.imag ** 2
+
+        # 5. Feature transform e.g. Stft -> Log-Mel-Fbank
+        # input_power: (Batch, [Channel,] Length, Freq)
+        #       -> input_feats: (Batch, Length, Dim)
+        input_feats, _ = self.logmel(input_power, feats_lens)
+
+        # NOTE(sx): pad
+        max_len = input_feats.size(1)
+        if self.downsample_rate > 1 and max_len % self.downsample_rate != 0:
+            padding = self.downsample_rate - max_len % self.downsample_rate
+            # print("Logmel: ", input_feats.size())
+            input_feats = torch.nn.functional.pad(input_feats, (0, 0, 0, padding),
+                                                  "constant", 0)
+            # print("Logmel(after padding): ",input_feats.size())
+            feats_lens[torch.argmax(feats_lens)] = max_len + padding
+
+        return input_feats, feats_lens
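The padding step at the end of `DefaultFrontend.forward` extends the time axis so that its length becomes a multiple of `downsample_rate`. The arithmetic can be sketched on its own as follows; `pad_amount` is a hypothetical helper that mirrors the condition and formula used above.

```python
def pad_amount(max_len, downsample_rate):
    """Frames to append so that max_len becomes a multiple of downsample_rate."""
    if downsample_rate > 1 and max_len % downsample_rate != 0:
        return downsample_rate - max_len % downsample_rate
    return 0


print(pad_amount(101, 4))  # 3
print(pad_amount(100, 4))  # 0
print(pad_amount(101, 1))  # 0
```

Note that in the diff only the entry of `feats_lens` belonging to the longest utterance is increased by the padding; the lengths of the shorter utterances are left as they were.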
--- a/ppg_extractor/log_mel.py
+++ b/ppg_extractor/log_mel.py
@@ -0,0 +1,74 @@
+import librosa
+import numpy as np
+import torch
+from typing import Tuple
+
+from .nets_utils import make_pad_mask
+
+
+class LogMel(torch.nn.Module):
+    """Convert STFT to fbank feats
+
+    The arguments are the same as those of librosa.filters.mel.
+
+    Args:
+        fs: number > 0 [scalar] sampling rate of the incoming signal
+        n_fft: int > 0 [scalar] number of FFT components
+        n_mels: int > 0 [scalar] number of Mel bands to generate
+        fmin: float >= 0 [scalar] lowest frequency (in Hz)
+        fmax: float >= 0 [scalar] highest frequency (in Hz).
+            If `None`, use `fmax = fs / 2.0`
+        htk: use HTK formula instead of Slaney
+        norm: {None, 1, np.inf} [scalar]
+            if 1, divide the triangular mel weights by the width of the mel band
+            (area normalization).  Otherwise, leave all the triangles aiming for
+            a peak value of 1.0
+
+    """
+
+    def __init__(
+        self,
+        fs: int = 16000,
+        n_fft: int = 512,
+        n_mels: int = 80,
+        fmin: float = None,
+        fmax: float = None,
+        htk: bool = False,
+        norm=1,
+    ):
+        super().__init__()
+
+        fmin = 0 if fmin is None else fmin
+        fmax = fs / 2 if fmax is None else fmax
+        _mel_options = dict(
+            sr=fs, n_fft=n_fft, n_mels=n_mels, fmin=fmin, fmax=fmax, htk=htk, norm=norm
+        )
+        self.mel_options = _mel_options
+
+        # Note(kamo): The mel matrix of librosa is different from kaldi.
+        melmat = librosa.filters.mel(**_mel_options)
+        # melmat: (D2, D1) -> (D1, D2)
+        self.register_buffer("melmat", torch.from_numpy(melmat.T).float())
+        inv_mel = np.linalg.pinv(melmat)
+        self.register_buffer("inv_melmat", torch.from_numpy(inv_mel.T).float())
+
+    def extra_repr(self):
+        return ", ".join(f"{k}={v}" for k, v in self.mel_options.items())
+
+    def forward(
+        self, feat: torch.Tensor, ilens: torch.Tensor = None,
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        # feat: (B, T, D1) x melmat: (D1, D2) -> mel_feat: (B, T, D2)
+        mel_feat = torch.matmul(feat, self.melmat)
+
+        logmel_feat = (mel_feat + 1e-20).log()
+        # Zero padding
+        if ilens is not None:
+            logmel_feat = logmel_feat.masked_fill(
+                make_pad_mask(ilens, logmel_feat, 1), 0.0
+            )
+        else:
+            ilens = feat.new_full(
+                [feat.size(0)], fill_value=feat.size(1), dtype=torch.long
+            )
+        return logmel_feat, ilens
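The core of `LogMel.forward` is a matrix product with the mel filterbank, a floored logarithm, and zeroing of padded frames. The following standard-library sketch reproduces those three steps for a single utterance. The function name, the toy filterbank and the toy power values are all hypothetical; the real module builds its filterbank with `librosa.filters.mel` and works on batched tensors.

```python
import math


def log_mel(power, melmat, length, eps=1e-20):
    """Log-mel features for one utterance.

    power:  T frames, each a list of D1 power-spectrum bins
    melmat: D1 rows, each a list of D2 filter weights
    length: number of valid frames; later frames are zeroed
    """
    n_mels = len(melmat[0])
    feats = []
    for t, frame in enumerate(power):
        if t >= length:
            feats.append([0.0] * n_mels)  # padded frame
            continue
        mel = [sum(frame[i] * melmat[i][j] for i in range(len(frame)))
               for j in range(n_mels)]
        feats.append([math.log(m + eps) for m in mel])
    return feats


# Toy filterbank (hypothetical values): 3 frequency bins -> 2 mel bands.
melmat = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
power = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [9.0, 9.0, 9.0]]
feats = log_mel(power, melmat, length=2)
```

A silent but valid frame ends up at `log(1e-20)`, roughly -46.05, whereas a padded frame is exactly 0.0, so the two are not interchangeable downstream.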
--- a/ppg_extractor/nets_utils.py
+++ b/ppg_extractor/nets_utils.py
@@ -0,0 +1,465 @@
+# -*- coding: utf-8 -*-
+
+"""Network related utility tools."""
+
+import logging
+from typing import Dict
+
+import numpy as np
+import torch
+
+
+def to_device(m, x):
+    """Send tensor into the device of the module.
+
+    Args:
+        m (torch.nn.Module): Torch module.
+        x (Tensor): Torch tensor.
+
+    Returns:
+        Tensor: Torch tensor located in the same place as torch module.
+
+    """
+    assert isinstance(m, torch.nn.Module)
+    device = next(m.parameters()).device
+    return x.to(device)
+
+
+def pad_list(xs, pad_value):
+    """Perform padding for the list of tensors.
+
+    Args:
+        xs (List): List of Tensors [(T_1, `*`), (T_2, `*`), ..., (T_B, `*`)].
+        pad_value (float): Value for padding.
+
+    Returns:
+        Tensor: Padded tensor (B, Tmax, `*`).
+
+    Examples:
+        >>> x = [torch.ones(4), torch.ones(2), torch.ones(1)]
+        >>> x
+        [tensor([1., 1., 1., 1.]), tensor([1., 1.]), tensor([1.])]
+        >>> pad_list(x, 0)
+        tensor([[1., 1., 1., 1.],
+                [1., 1., 0., 0.],
+                [1., 0., 0., 0.]])
+
+    """
+    n_batch = len(xs)
+    max_len = max(x.size(0) for x in xs)
+    pad = xs[0].new(n_batch, max_len, *xs[0].size()[1:]).fill_(pad_value)
+
+    for i in range(n_batch):
+        pad[i, :xs[i].size(0)] = xs[i]
+
+    return pad
+
+
+def make_pad_mask(lengths, xs=None, length_dim=-1):
+    """Make mask tensor containing indices of padded part.
+
+    Args:
+        lengths (LongTensor or List): Batch of lengths (B,).
+        xs (Tensor, optional): The reference tensor. If set, masks will be the same shape as this tensor.
+        length_dim (int, optional): Dimension indicator of the above tensor. See the example.
+
+    Returns:
+        Tensor: Mask tensor containing indices of padded part.
+                dtype=torch.uint8 in PyTorch 1.2-
+                dtype=torch.bool in PyTorch 1.2+ (including 1.2)
+
+    Examples:
+        With only lengths.
+
+        >>> lengths = [5, 3, 2]
+        >>> make_pad_mask(lengths)
+        masks = [[0, 0, 0, 0, 0],
+                 [0, 0, 0, 1, 1],
+                 [0, 0, 1, 1, 1]]
+
+        With the reference tensor.
+
+        >>> xs = torch.zeros((3, 2, 4))
+        >>> make_pad_mask(lengths, xs)
+        tensor([[[0, 0, 0, 0],
+                 [0, 0, 0, 0]],
+                [[0, 0, 0, 1],
+                 [0, 0, 0, 1]],
+                [[0, 0, 1, 1],
+                 [0, 0, 1, 1]]], dtype=torch.uint8)
+        >>> xs = torch.zeros((3, 2, 6))
+        >>> make_pad_mask(lengths, xs)
+        tensor([[[0, 0, 0, 0, 0, 1],
+                 [0, 0, 0, 0, 0, 1]],
+                [[0, 0, 0, 1, 1, 1],
+                 [0, 0, 0, 1, 1, 1]],
+                [[0, 0, 1, 1, 1, 1],
+                 [0, 0, 1, 1, 1, 1]]], dtype=torch.uint8)
+
+        With the reference tensor and dimension indicator.
+
+        >>> xs = torch.zeros((3, 6, 6))
+        >>> make_pad_mask(lengths, xs, 1)
+        tensor([[[0, 0, 0, 0, 0, 0],
+                 [0, 0, 0, 0, 0, 0],
+                 [0, 0, 0, 0, 0, 0],
+                 [0, 0, 0, 0, 0, 0],
+                 [0, 0, 0, 0, 0, 0],
+                 [1, 1, 1, 1, 1, 1]],
+                [[0, 0, 0, 0, 0, 0],
+                 [0, 0, 0, 0, 0, 0],
+                 [0, 0, 0, 0, 0, 0],
+                 [1, 1, 1, 1, 1, 1],
+                 [1, 1, 1, 1, 1, 1],
+                 [1, 1, 1, 1, 1, 1]],
+                [[0, 0, 0, 0, 0, 0],
+                 [0, 0, 0, 0, 0, 0],
+                 [1, 1, 1, 1, 1, 1],
+                 [1, 1, 1, 1, 1, 1],
+                 [1, 1, 1, 1, 1, 1],
+                 [1, 1, 1, 1, 1, 1]]], dtype=torch.uint8)
+        >>> make_pad_mask(lengths, xs, 2)
+        tensor([[[0, 0, 0, 0, 0, 1],
+                 [0, 0, 0, 0, 0, 1],
+                 [0, 0, 0, 0, 0, 1],
+                 [0, 0, 0, 0, 0, 1],
+                 [0, 0, 0, 0, 0, 1],
+                 [0, 0, 0, 0, 0, 1]],
+                [[0, 0, 0, 1, 1, 1],
+                 [0, 0, 0, 1, 1, 1],
+                 [0, 0, 0, 1, 1, 1],
+                 [0, 0, 0, 1, 1, 1],
+                 [0, 0, 0, 1, 1, 1],
+                 [0, 0, 0, 1, 1, 1]],
+                [[0, 0, 1, 1, 1, 1],
+                 [0, 0, 1, 1, 1, 1],
+                 [0, 0, 1, 1, 1, 1],
+                 [0, 0, 1, 1, 1, 1],
+                 [0, 0, 1, 1, 1, 1],
+                 [0, 0, 1, 1, 1, 1]]], dtype=torch.uint8)
+
+    """
+    if length_dim == 0:
+        raise ValueError('length_dim cannot be 0: {}'.format(length_dim))
+
+    if not isinstance(lengths, list):
+        lengths = lengths.tolist()
+    bs = int(len(lengths))
+    if xs is None:
+        maxlen = int(max(lengths))
+    else:
+        maxlen = xs.size(length_dim)
+
+    seq_range = torch.arange(0, maxlen, dtype=torch.int64)
+    seq_range_expand = seq_range.unsqueeze(0).expand(bs, maxlen)
+    seq_length_expand = seq_range_expand.new(lengths).unsqueeze(-1)
+    mask = seq_range_expand >= seq_length_expand
+
+    if xs is not None:
+        assert xs.size(0) == bs, (xs.size(0), bs)
+
+        if length_dim < 0:
+            length_dim = xs.dim() + length_dim
+        # ind = (:, None, ..., None, :, , None, ..., None)
+        ind = tuple(slice(None) if i in (0, length_dim) else None
+                    for i in range(xs.dim()))
+        mask = mask[ind].expand_as(xs).to(xs.device)
+    return mask
+
+
+def make_non_pad_mask(lengths, xs=None, length_dim=-1):
+    """Make mask tensor containing indices of non-padded part.
+
+    Args:
+        lengths (LongTensor or List): Batch of lengths (B,).
+        xs (Tensor, optional): The reference tensor. If set, masks will be the same shape as this tensor.
+        length_dim (int, optional): Dimension indicator of the above tensor. See the example.
+
+    Returns:
+        ByteTensor: mask tensor containing indices of non-padded part.
+                    dtype=torch.uint8 in PyTorch 1.2-
+                    dtype=torch.bool in PyTorch 1.2+ (including 1.2)
+
+    Examples:
+        With only lengths.
+
+        >>> lengths = [5, 3, 2]
+        >>> make_non_pad_mask(lengths)
+        masks = [[1, 1, 1, 1 ,1],
+                 [1, 1, 1, 0, 0],
+                 [1, 1, 0, 0, 0]]
+
+        With the reference tensor.
+
+        >>> xs = torch.zeros((3, 2, 4))
+        >>> make_non_pad_mask(lengths, xs)
+        tensor([[[1, 1, 1, 1],
+                 [1, 1, 1, 1]],
+                [[1, 1, 1, 0],
+                 [1, 1, 1, 0]],
+                [[1, 1, 0, 0],
+                 [1, 1, 0, 0]]], dtype=torch.uint8)
+        >>> xs = torch.zeros((3, 2, 6))
+        >>> make_non_pad_mask(lengths, xs)
+        tensor([[[1, 1, 1, 1, 1, 0],
+                 [1, 1, 1, 1, 1, 0]],
+                [[1, 1, 1, 0, 0, 0],
+                 [1, 1, 1, 0, 0, 0]],
+                [[1, 1, 0, 0, 0, 0],
+                 [1, 1, 0, 0, 0, 0]]], dtype=torch.uint8)
+
+        With the reference tensor and dimension indicator.
+
+        >>> xs = torch.zeros((3, 6, 6))
+        >>> make_non_pad_mask(lengths, xs, 1)
+        tensor([[[1, 1, 1, 1, 1, 1],
+                 [1, 1, 1, 1, 1, 1],
+                 [1, 1, 1, 1, 1, 1],
+                 [1, 1, 1, 1, 1, 1],
+                 [1, 1, 1, 1, 1, 1],
+                 [0, 0, 0, 0, 0, 0]],
+                [[1, 1, 1, 1, 1, 1],
+                 [1, 1, 1, 1, 1, 1],
+                 [1, 1, 1, 1, 1, 1],
+                 [0, 0, 0, 0, 0, 0],
+                 [0, 0, 0, 0, 0, 0],
+                 [0, 0, 0, 0, 0, 0]],
+                [[1, 1, 1, 1, 1, 1],
+                 [1, 1, 1, 1, 1, 1],
+                 [0, 0, 0, 0, 0, 0],
+                 [0, 0, 0, 0, 0, 0],
+                 [0, 0, 0, 0, 0, 0],
+                 [0, 0, 0, 0, 0, 0]]], dtype=torch.uint8)
+        >>> make_non_pad_mask(lengths, xs, 2)
+        tensor([[[1, 1, 1, 1, 1, 0],
+                 [1, 1, 1, 1, 1, 0],
+                 [1, 1, 1, 1, 1, 0],
+                 [1, 1, 1, 1, 1, 0],
+                 [1, 1, 1, 1, 1, 0],
+                 [1, 1, 1, 1, 1, 0]],
+                [[1, 1, 1, 0, 0, 0],
+                 [1, 1, 1, 0, 0, 0],
+                 [1, 1, 1, 0, 0, 0],
+                 [1, 1, 1, 0, 0, 0],
+                 [1, 1, 1, 0, 0, 0],
+                 [1, 1, 1, 0, 0, 0]],
+                [[1, 1, 0, 0, 0, 0],
+                 [1, 1, 0, 0, 0, 0],
+                 [1, 1, 0, 0, 0, 0],
+                 [1, 1, 0, 0, 0, 0],
+                 [1, 1, 0, 0, 0, 0],
+                 [1, 1, 0, 0, 0, 0]]], dtype=torch.uint8)
+
+    """
+    return ~make_pad_mask(lengths, xs, length_dim)
+
+
+def mask_by_length(xs, lengths, fill=0):
+    """Mask tensor according to length.
+
+    Args:
+        xs (Tensor): Batch of input tensor (B, `*`).
+        lengths (LongTensor or List): Batch of lengths (B,).
+        fill (int or float): Value to fill masked part.
+
+    Returns:
+        Tensor: Batch of masked input tensor (B, `*`).
+
+    Examples:
+        >>> x = torch.arange(5).repeat(3, 1) + 1
+        >>> x
+        tensor([[1, 2, 3, 4, 5],
+                [1, 2, 3, 4, 5],
+                [1, 2, 3, 4, 5]])
+        >>> lengths = [5, 3, 2]
+        >>> mask_by_length(x, lengths)
+        tensor([[1, 2, 3, 4, 5],
+                [1, 2, 3, 0, 0],
+                [1, 2, 0, 0, 0]])
+
+    """
+    assert xs.size(0) == len(lengths)
+    ret = xs.data.new(*xs.size()).fill_(fill)
+    for i, l in enumerate(lengths):
+        ret[i, :l] = xs[i, :l]
+    return ret
+
+
+def th_accuracy(pad_outputs, pad_targets, ignore_label):
+    """Calculate accuracy.
+
+    Args:
+        pad_outputs (Tensor): Prediction tensors (B * Lmax, D).
+        pad_targets (LongTensor): Target label tensors (B, Lmax).
+        ignore_label (int): Ignore label id.
+
+    Returns:
+        float: Accuracy value (0.0 - 1.0).
+
+    """
+    pad_pred = pad_outputs.view(
+        pad_targets.size(0),
+        pad_targets.size(1),
+        pad_outputs.size(1)).argmax(2)
+    mask = pad_targets != ignore_label
+    numerator = torch.sum(pad_pred.masked_select(mask) == pad_targets.masked_select(mask))
+    denominator = torch.sum(mask)
+    return float(numerator) / float(denominator)
+
+
+def to_torch_tensor(x):
+    """Change to torch.Tensor or ComplexTensor from numpy.ndarray.
+
+    Args:
+        x: Inputs. It should be one of numpy.ndarray, Tensor, ComplexTensor, and dict.
+
+    Returns:
+        Tensor or ComplexTensor: Type converted inputs.
+
+    Examples:
+        >>> xs = np.ones(3, dtype=np.float32)
+        >>> xs = to_torch_tensor(xs)
+        tensor([1., 1., 1.])
+        >>> xs = torch.ones(3, 4, 5)
+        >>> assert to_torch_tensor(xs) is xs
+        >>> xs = {'real': xs, 'imag': xs}
+        >>> to_torch_tensor(xs)
+        ComplexTensor(
+        Real:
+        tensor([1., 1., 1.])
+        Imag:
+        tensor([1., 1., 1.])
+        )
+
+    """
+    # If numpy, change to torch tensor
+    if isinstance(x, np.ndarray):
+        if x.dtype.kind == 'c':
+            # Dynamically importing because torch_complex requires python3
+            from torch_complex.tensor import ComplexTensor
+            return ComplexTensor(x)
+        else:
+            return torch.from_numpy(x)
+
+    # If {'real': ..., 'imag': ...}, convert to ComplexTensor
+    elif isinstance(x, dict):
+        # Dynamically importing because torch_complex requires python3
+        from torch_complex.tensor import ComplexTensor
+
+        if 'real' not in x or 'imag' not in x:
+            raise ValueError("x must have 'real' and 'imag' keys: {}".format(list(x)))
+        # Relative importing because of using python3 syntax
+        return ComplexTensor(x['real'], x['imag'])
+
+    # If torch.Tensor, as it is
+    elif isinstance(x, torch.Tensor):
+        return x
+
+    else:
+        error = ("x must be numpy.ndarray, torch.Tensor or a dict like "
+                 "{{'real': torch.Tensor, 'imag': torch.Tensor}}, "
+                 "but got {}".format(type(x)))
+        try:
+            from torch_complex.tensor import ComplexTensor
+        except Exception:
+            # If PY2
+            raise ValueError(error)
+        else:
+            # If PY3
+            if isinstance(x, ComplexTensor):
+                return x
+            else:
+                raise ValueError(error)
+
+
+def get_subsample(train_args, mode, arch):
+    """Parse the subsampling factors from the training args for the specified `mode` and `arch`.
+
+    Args:
+        train_args: argument Namespace containing options.
+        mode: one of ('asr', 'mt', 'st')
+        arch: one of ('rnn', 'rnn-t', 'rnn_mix', 'rnn_mulenc', 'transformer')
+
+    Returns:
+        np.ndarray / List[np.ndarray]: subsampling factors.
+    """
+    if arch == 'transformer':
+        return np.array([1])
+
+    elif mode == 'mt' and arch == 'rnn':
+        # +1 accounts for the input (+1) plus each encoder layer output (train_args.elayers)
+        subsample = np.ones(train_args.elayers + 1, dtype=int)
+        logging.warning('Subsampling is not performed for machine translation.')
+        logging.info('subsample: ' + ' '.join([str(x) for x in subsample]))
+        return subsample
+
+    elif (mode == 'asr' and arch in ('rnn', 'rnn-t')) or \
+         (mode == 'mt' and arch == 'rnn') or \
+         (mode == 'st' and arch == 'rnn'):
+        subsample = np.ones(train_args.elayers + 1, dtype=int)
+        if train_args.etype.endswith("p") and not train_args.etype.startswith("vgg"):
+            ss = train_args.subsample.split("_")
+            for j in range(min(train_args.elayers + 1, len(ss))):
+                subsample[j] = int(ss[j])
+        else:
+            logging.warning(
+                'Subsampling is not performed for vgg*. It is performed in max pooling layers at CNN.')
+        logging.info('subsample: ' + ' '.join([str(x) for x in subsample]))
+        return subsample
+
+    elif mode == 'asr' and arch == 'rnn_mix':
+        subsample = np.ones(train_args.elayers_sd + train_args.elayers + 1, dtype=int)
+        if train_args.etype.endswith("p") and not train_args.etype.startswith("vgg"):
+            ss = train_args.subsample.split("_")
+            for j in range(min(train_args.elayers_sd + train_args.elayers + 1, len(ss))):
+                subsample[j] = int(ss[j])
+        else:
+            logging.warning(
+                'Subsampling is not performed for vgg*. It is performed in max pooling layers at CNN.')
+        logging.info('subsample: ' + ' '.join([str(x) for x in subsample]))
+        return subsample
+
+    elif mode == 'asr' and arch == 'rnn_mulenc':
+        subsample_list = []
+        for idx in range(train_args.num_encs):
+            subsample = np.ones(train_args.elayers[idx] + 1, dtype=int)
+            if train_args.etype[idx].endswith("p") and not train_args.etype[idx].startswith("vgg"):
+                ss = train_args.subsample[idx].split("_")
+                for j in range(min(train_args.elayers[idx] + 1, len(ss))):
+                    subsample[j] = int(ss[j])
+            else:
+                logging.warning(
+                    'Encoder %d: Subsampling is not performed for vgg*. '
+                    'It is performed in max pooling layers at CNN.', idx + 1)
+            logging.info('subsample: ' + ' '.join([str(x) for x in subsample]))
+            subsample_list.append(subsample)
+        return subsample_list
+
+    else:
+        raise ValueError('Invalid options: mode={}, arch={}'.format(mode, arch))
+
+
+def rename_state_dict(old_prefix: str, new_prefix: str, state_dict: Dict[str, torch.Tensor]):
+    """Replace keys of old prefix with new prefix in state dict."""
+    # need this list not to break the dict iterator
+    old_keys = [k for k in state_dict if k.startswith(old_prefix)]
+    if len(old_keys) > 0:
+        logging.warning(f'Rename: {old_prefix} -> {new_prefix}')
+    for k in old_keys:
+        v = state_dict.pop(k)
+        # swap only the leading prefix; str.replace would also rewrite later occurrences
+        new_k = new_prefix + k[len(old_prefix):]
+        state_dict[new_k] = v
+
+def get_activation(act):
+    """Return activation function."""
+    # Lazy load to avoid unused import
+    from .encoder.swish import Swish
+
+    activation_funcs = {
+        "hardtanh": torch.nn.Hardtanh,
+        "relu": torch.nn.ReLU,
+        "selu": torch.nn.SELU,
+        "swish": Swish,
+    }
+
+    return activation_funcs[act]()
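
Review note: the padding helpers above (`pad_list`, `make_pad_mask`, `make_non_pad_mask`) all rest on one comparison, position index >= sequence length. The following is a dependency-free sketch of that logic using plain Python lists instead of tensors. The names `pad_lists`, `pad_mask` and `non_pad_mask` are hypothetical and are not part of this change.

```python
def pad_lists(seqs, pad_value):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(s) for s in seqs)
    return [list(s) + [pad_value] * (max_len - len(s)) for s in seqs]


def pad_mask(lengths, max_len=None):
    """Return True at padded positions (position index >= sequence length)."""
    if max_len is None:
        max_len = max(lengths)
    return [[t >= n for t in range(max_len)] for n in lengths]


def non_pad_mask(lengths, max_len=None):
    """Logical inverse of pad_mask: True at valid positions."""
    return [[not m for m in row] for row in pad_mask(lengths, max_len)]


padded = pad_lists([[1, 1, 1, 1], [1, 1], [1]], 0)
mask = pad_mask([5, 3, 2])
```

Passing `max_len` plays the role of the reference tensor `xs` in the tensor version: it widens the mask beyond the longest length in the batch.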
--- a/ppg_extractor/stft.py
+++ b/ppg_extractor/stft.py
@@ -0,0 +1,118 @@
+from typing import Optional
+from typing import Tuple
+from typing import Union
+
+import torch
+
+from .nets_utils import make_pad_mask
+
+
+class Stft(torch.nn.Module):
+    def __init__(
+        self,
+        n_fft: int = 512,
+        win_length: Union[int, None] = 512,
+        hop_length: int = 128,
+        center: bool = True,
+        pad_mode: str = "reflect",
+        normalized: bool = False,
+        onesided: bool = True,
+        kaldi_padding_mode=False,
+    ):
+        super().__init__()
+        self.n_fft = n_fft
+        if win_length is None:
+            self.win_length = n_fft
+        else:
+            self.win_length = win_length
+        self.hop_length = hop_length
+        self.center = center
+        self.pad_mode = pad_mode
+        self.normalized = normalized
+        self.onesided = onesided
+        self.kaldi_padding_mode = kaldi_padding_mode
+        if self.kaldi_padding_mode:
+            self.win_length = 400
+
+    def extra_repr(self):
+        return (
+            f"n_fft={self.n_fft}, "
+            f"win_length={self.win_length}, "
+            f"hop_length={self.hop_length}, "
+            f"center={self.center}, "
+            f"pad_mode={self.pad_mode}, "
+            f"normalized={self.normalized}, "
+            f"onesided={self.onesided}"
+        )
+
+    def forward(
+        self, input: torch.Tensor, ilens: torch.Tensor = None
+    ) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
+        """STFT forward function.
+
+        Args:
+            input: (Batch, Nsamples) or (Batch, Nsample, Channels)
+            ilens: (Batch)
+        Returns:
+            output: (Batch, Frames, Freq, 2) or (Batch, Frames, Channels, Freq, 2)
+
+        """
+        bs = input.size(0)
+        if input.dim() == 3:
+            multi_channel = True
+            # input: (Batch, Nsample, Channels) -> (Batch * Channels, Nsample)
+            input = input.transpose(1, 2).reshape(-1, input.size(1))
+        else:
+            multi_channel = False
+
+        # output: (Batch, Freq, Frames, 2=real_imag)
+        # or (Batch, Channel, Freq, Frames, 2=real_imag)
+        if not self.kaldi_padding_mode:
+            output = torch.stft(
+                input,
+                n_fft=self.n_fft,
+                win_length=self.win_length,
+                hop_length=self.hop_length,
+                center=self.center,
+                pad_mode=self.pad_mode,
+                normalized=self.normalized,
+                onesided=self.onesided,
+                return_complex=False
+            )
+        else:
+            # NOTE(sx): Use Kaldi-fashion padding, maybe wrong
+            num_pads = self.n_fft - self.win_length
+            input = torch.nn.functional.pad(input, (num_pads, 0))
+            output = torch.stft(
+                input,
+                n_fft=self.n_fft,
+                win_length=self.win_length,
+                hop_length=self.hop_length,
+                center=False,
+                pad_mode=self.pad_mode,
+                normalized=self.normalized,
+                onesided=self.onesided,
+                return_complex=False
+            )
+
+        # output: (Batch, Freq, Frames, 2=real_imag)
+        # -> (Batch, Frames, Freq, 2=real_imag)
+        output = output.transpose(1, 2)
+        if multi_channel:
+            # output: (Batch * Channel, Frames, Freq, 2=real_imag)
+            # -> (Batch, Frame, Channel, Freq, 2=real_imag)
+            output = output.view(bs, -1, output.size(1), output.size(2), 2).transpose(
+                1, 2
+            )
+
+        if ilens is not None:
+            if self.center:
+                pad = self.win_length // 2
+                ilens = ilens + 2 * pad
+            olens = torch.div(ilens - self.win_length, self.hop_length, rounding_mode='floor') + 1
+            # olens = ilens - self.win_length // self.hop_length + 1
+            output.masked_fill_(make_pad_mask(olens, output, 1), 0.0)
+        else:
+            olens = None
+
+        return output, olens
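
Review note: the `olens` computation at the end of `Stft.forward` is plain integer arithmetic, so it can be checked without PyTorch. The sketch below mirrors the arithmetic as written in this diff (padding of `win_length // 2` on each side when `center` is set). The function name `stft_num_frames` and the 16000-sample input are assumptions chosen for illustration.

```python
def stft_num_frames(n_samples, win_length=512, hop_length=128, center=True):
    """Number of STFT frames, following the olens formula in Stft.forward."""
    if center:
        # centering pads win_length // 2 samples on both sides
        n_samples = n_samples + 2 * (win_length // 2)
    # floor division matches torch.div(..., rounding_mode='floor')
    return (n_samples - win_length) // hop_length + 1


frames_centered = stft_num_frames(16000)
frames_uncentered = stft_num_frames(16000, center=False)
```

With the defaults, one second of 16 kHz audio yields 126 frames when centered and 122 frames otherwise.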
--- a/ppg_extractor/utterance_mvn.py
+++ b/ppg_extractor/utterance_mvn.py
@@ -0,0 +1,82 @@
+from typing import Tuple
+
+import torch
+
+from .nets_utils import make_pad_mask
+
+
+class UtteranceMVN(torch.nn.Module):
+    def __init__(
+        self, norm_means: bool = True, norm_vars: bool = False, eps: float = 1.0e-20,
+    ):
+        super().__init__()
+        self.norm_means = norm_means
+        self.norm_vars = norm_vars
+        self.eps = eps
+
+    def extra_repr(self):
+        return f"norm_means={self.norm_means}, norm_vars={self.norm_vars}"
+
+    def forward(
+        self, x: torch.Tensor, ilens: torch.Tensor = None
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        """Forward function
+
+        Args:
+            x: (B, L, ...)
+            ilens: (B,)
+
+        """
+        return utterance_mvn(
+            x,
+            ilens,
+            norm_means=self.norm_means,
+            norm_vars=self.norm_vars,
+            eps=self.eps,
+        )
+
+
+def utterance_mvn(
+    x: torch.Tensor,
+    ilens: torch.Tensor = None,
+    norm_means: bool = True,
+    norm_vars: bool = False,
+    eps: float = 1.0e-20,
+) -> Tuple[torch.Tensor, torch.Tensor]:
+    """Apply utterance mean and variance normalization
+
+    Args:
+        x: (B, T, D), assumed zero padded
+        ilens: (B,)
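+        norm_means: If True, subtract the per-utterance mean.
+        norm_vars: If True, divide by the per-utterance standard deviation.
+        eps: Lower bound applied to the standard deviation to avoid division by zero.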
+
+    """
+    if ilens is None:
+        ilens = x.new_full([x.size(0)], x.size(1))
+    ilens_ = ilens.to(x.device, x.dtype).view(-1, *[1 for _ in range(x.dim() - 1)])
+    # Zero padding
+    if x.requires_grad:
+        x = x.masked_fill(make_pad_mask(ilens, x, 1), 0.0)
+    else:
+        x.masked_fill_(make_pad_mask(ilens, x, 1), 0.0)
+    # mean: (B, 1, D)
+    mean = x.sum(dim=1, keepdim=True) / ilens_
+
+    if norm_means:
+        x -= mean
+
+        if norm_vars:
+            var = x.pow(2).sum(dim=1, keepdim=True) / ilens_
+            std = torch.clamp(var.sqrt(), min=eps)
+            x = x / std
+        return x, ilens
+    else:
+        if norm_vars:
+            y = x - mean
+            y.masked_fill_(make_pad_mask(ilens, y, 1), 0.0)
+            var = y.pow(2).sum(dim=1, keepdim=True) / ilens_
+            std = torch.clamp(var.sqrt(), min=eps)
+            x /= std
+        return x, ilens
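
Review note: a list-based sketch of per-utterance mean and variance normalization for a single zero-padded utterance, which may help when reasoning about the tensor code above. It assumes the intended behavior is to compute statistics over the valid frames only, divide by the standard deviation, and leave padded frames at zero. The name `utterance_mvn_sketch` is hypothetical.

```python
import math


def utterance_mvn_sketch(frames, length, norm_vars=False, eps=1.0e-20):
    """Normalize the first `length` frames of a (T, D) list; keep padding at zero."""
    dim = len(frames[0])
    valid = frames[:length]
    # per-dimension mean over valid frames only
    mean = [sum(f[d] for f in valid) / length for d in range(dim)]
    centered = [[f[d] - mean[d] for d in range(dim)] for f in valid]
    if norm_vars:
        var = [sum(f[d] ** 2 for f in centered) / length for d in range(dim)]
        # clamp the standard deviation from below, as torch.clamp(min=eps) does
        std = [max(math.sqrt(v), eps) for v in var]
        centered = [[f[d] / std[d] for d in range(dim)] for f in centered]
    padding = [[0.0] * dim for _ in range(len(frames) - length)]
    return centered + padding


features = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
normalized = utterance_mvn_sketch(features, length=2)
scaled = utterance_mvn_sketch([[0.0, 0.0], [4.0, 8.0]], length=2, norm_vars=True)
```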
--- a/pre4ppg.py
+++ b/pre4ppg.py
@@ -0,0 +1,49 @@
+from pathlib import Path
+import argparse
+
+from ppg2mel.preprocess import preprocess_dataset
+
+recognized_datasets = [
+    "aidatatang_200zh",
+    "aidatatang_200zh_s",  # sample
+]
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(
+        description="Preprocesses audio files from datasets, to be used by the "
+                    "ppg2mel model for training.",
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter
+    )
+    parser.add_argument("datasets_root", type=Path, help=\
+        "Path to the directory containing your datasets.")
+    parser.add_argument("-d", "--dataset", type=str, default="aidatatang_200zh", help=\
+        "Name of the dataset to process. Allowed values: aidatatang_200zh, aidatatang_200zh_s.")
+    parser.add_argument("-o", "--out_dir", type=Path, default=argparse.SUPPRESS, help=\
+        "Path to the output directory that will contain the mel spectrograms, the audios and the "
+        "embeds. Defaults to <datasets_root>/PPGVC/ppg2mel/")
+    parser.add_argument("-n", "--n_processes", type=int, default=8, help=\
+        "Number of processes in parallel.")
+    # parser.add_argument("-s", "--skip_existing", action="store_true", help=\
+    #     "Whether to overwrite existing files with the same name. Useful if the preprocessing was "
+    #     "interrupted. ")
+    # parser.add_argument("--hparams", type=str, default="", help=\
+    #     "Hyperparameter overrides as a comma-separated list of name-value pairs")
+    # parser.add_argument("--no_trim", action="store_true", help=\
+    #     "Preprocess audio without trimming silences (not recommended).")
+    parser.add_argument("-pf", "--ppg_encoder_model_fpath", type=Path, default="ppg_extractor/saved_models/24epoch.pt", help=\
+        "Path to your trained PPG encoder model.")
+    parser.add_argument("-sf", "--speaker_encoder_model", type=Path, default="encoder/saved_models/pretrained_bak_5805000.pt", help=\
+        "Path to your trained speaker encoder model.")
+    args = parser.parse_args()
+
+    assert args.dataset in recognized_datasets, f"{args.dataset} is not supported, file an issue to propose a new one"
+
+    # Create directories
+    assert args.datasets_root.exists()
+    if not hasattr(args, "out_dir"):
+        args.out_dir = args.datasets_root.joinpath("PPGVC", "ppg2mel")
+    args.out_dir.mkdir(exist_ok=True, parents=True)
+
+    preprocess_dataset(**vars(args)) 
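
Review note: `pre4ppg.py` relies on `default=argparse.SUPPRESS` so that `out_dir` is absent from the namespace unless the user passes it, then fills in a path derived from `datasets_root`. A minimal sketch of that pattern is shown below. The wrapper function `parse` and the example argument values are assumptions for illustration.

```python
import argparse
from pathlib import Path


def parse(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("datasets_root", type=Path)
    # SUPPRESS means the attribute is not set at all when the flag is omitted
    parser.add_argument("-o", "--out_dir", type=Path, default=argparse.SUPPRESS)
    args = parser.parse_args(argv)
    if not hasattr(args, "out_dir"):
        # fall back to a location derived from the positional argument
        args.out_dir = args.datasets_root.joinpath("PPGVC", "ppg2mel")
    return args


default_args = parse(["data"])
explicit_args = parse(["data", "-o", "custom_out"])
```

The benefit over a fixed default is that the fallback can depend on another argument, which a static `default=` value cannot express.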
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,6 +1,6 @@
 umap-learn
 visdom
-librosa>=0.8.0
+librosa==0.8.1
 matplotlib>=3.3.0
 numpy==1.19.3; platform_system == "Windows"
 numpy==1.19.4; platform_system != "Windows"
@@ -17,7 +17,10 @@ webrtcvad; platform_system != "Windows"
 pypinyin
 flask
 flask_wtf
-flask_cors
+flask_cors==3.0.10
 gevent==21.8.0
 flask_restx
-tensorboard
+tensorboard
+PyYAML==5.4.1
+torch_complex
+espnet
--- a/specdeno/enhance_speach.py
+++ b/specdeno/enhance_speach.py
@@ -1,188 +0,0 @@
-#!/usr/bin/env python
-import librosa
-import numpy as np
-import wave
-import math
-from synthesizer.hparams import hparams
-import os
-import ctypes as ct
-from encoder import inference as encoder
-from utils import logmmse
-
-
-def enhance(fpath):
-    class FloatBits(ct.Structure):
-        _fields_ = [
-            ('M', ct.c_uint, 23),
-            ('E', ct.c_uint, 8),
-            ('S', ct.c_uint, 1)
-        ]
-
-    class Float(ct.Union):
-        _anonymous_ = ('bits',)
-        _fields_ = [
-            ('value', ct.c_float),
-            ('bits', FloatBits)
-        ]
-
-    def nextpow2(x):
-        if x < 0:
-            x = -x
-        if x == 0:
-            return 0
-        d = Float()
-        d.value = x
-        if d.M == 0:
-            return d.E - 127
-        return d.E - 127 + 1
-
-
-    # Open the WAV file
-    f = wave.open(str(fpath))
-    # Read the format information
-    # (nchannels, sampwidth, framerate, nframes, comptype, compname)
-    params = f.getparams()
-    nchannels, sampwidth, framerate, nframes = params[:4]
-    fs = framerate
-    # Read the waveform data
-    str_data = f.readframes(nframes)
-    f.close()
-    # Convert the waveform data to an array
-    x = np.fromstring(str_data, dtype=np.short)
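-    # Compute parameters
-    len_ = 20 * fs // 1000 # frame size in samples
-    PERC = 50 # window overlap as a percentage of the frame
-    len1 = len_ * PERC // 100  # overlapping part of the window
-    len2 = len_ - len1   # non-overlapping part of the window
-    # Set default parameters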
-    Thres = 3
-    Expnt = 2.0
-    beta = 0.002
-    G = 0.9
-    # Initialize the Hamming window
-    win = np.hamming(len_)
-    # normalization gain for overlap+add with 50% overlap
-    winGain = len2 / sum(win)
-
-    # Noise magnitude calculations - assuming that the first 5 frames is noise/silence
-    nFFT = 2 * 2 ** (nextpow2(len_))
-    noise_mean = np.zeros(nFFT)
-
-    j = 0
-    for k in range(1, 6):
-        noise_mean = noise_mean + abs(np.fft.fft(win * x[j:j + len_], nFFT))
-        j = j + len_
-    noise_mu = noise_mean / 5
-
-    # --- allocate memory and initialize various variables
-    k = 1
-    img = 1j
-    x_old = np.zeros(len1)
-    Nframes = len(x) // len2 - 1
-    xfinal = np.zeros(Nframes * len2)
-
-    # =========================    Start Processing   ===============================
-    for n in range(0, Nframes):
-        # Windowing
-        insign = win * x[k-1:k + len_ - 1]
-        # compute fourier transform of a frame
-        spec = np.fft.fft(insign, nFFT)
-        # compute the magnitude
-        sig = abs(spec)
-
-        # save the noisy phase information
-        theta = np.angle(spec)
-        SNRseg = 10 * np.log10(np.linalg.norm(sig, 2) ** 2 / np.linalg.norm(noise_mu, 2) ** 2)
-
-
-        def berouti(SNR):
-            if -5.0 <= SNR <= 20.0:
-                a = 4 - SNR * 3 / 20
-            else:
-                if SNR < -5.0:
-                    a = 5
-                if SNR > 20:
-                    a = 1
-            return a
-
-
-        def berouti1(SNR):
-            if -5.0 <= SNR <= 20.0:
-                a = 3 - SNR * 2 / 20
-            else:
-                if SNR < -5.0:
-                    a = 4
-                if SNR > 20:
-                    a = 1
-            return a
-
-        if Expnt == 1.0:  # magnitude spectrum
-            alpha = berouti1(SNRseg)
-        else:  # power spectrum
-            alpha = berouti(SNRseg)
-        #############
-        sub_speech = sig ** Expnt - alpha * noise_mu ** Expnt;
-        # When the clean-signal estimate falls below the noise power
-        diffw = sub_speech - beta * noise_mu ** Expnt
-        # beta negative components
-
-        def find_index(x_list):
-            index_list = []
-            for i in range(len(x_list)):
-                if x_list[i] < 0:
-                    index_list.append(i)
-            return index_list
-
-        z = find_index(diffw)
-        if len(z) > 0:
-            # Use the estimated noise as the floor value
-            sub_speech[z] = beta * noise_mu[z] ** Expnt
-            # --- implement a simple VAD detector --------------
-        if SNRseg < Thres:  # Update noise spectrum
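-            noise_temp = G * noise_mu ** Expnt + (1 - G) * sig ** Expnt  # smooth the noise power spectrum
-            noise_mu = noise_temp ** (1 / Expnt)  # updated noise magnitude spectrum
-        # flipud flips the array upside down about its horizontal midline,
-        # swapping vertically symmetric elements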
-        sub_speech[nFFT // 2 + 1:nFFT] = np.flipud(sub_speech[1:nFFT // 2])
-        x_phase = (sub_speech ** (1 / Expnt)) * (np.array([math.cos(x) for x in theta]) + img * (np.array([math.sin(x) for x in theta])))
-        # take the IFFT
-
-        xi = np.fft.ifft(x_phase).real
-        # --- Overlap and add ---------------
-        xfinal[k-1:k + len2 - 1] = x_old + xi[0:len1]
-        x_old = xi[0 + len1:len_]
-        k = k + len2
-    # Save the file
-    wf = wave.open('out.wav', 'wb')
-    # Set the parameters
-    wf.setparams(params)
-    # Write the waveform; .tostring() converts the array to raw bytes
-    wave_data = (winGain * xfinal).astype(np.short)
-    wf.writeframes(wave_data.tostring())
-    wf.close()
-    wav = librosa.load("./out.wav", hparams.sample_rate)[0]
-
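-    # Remove noise from the speech waveform given a noise profile. The waveform must have the same sampling rate as the one used to create the noise profile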
-    if hparams.rescale:
-        wav = wav / np.abs(wav).max() * hparams.rescaling_max
-    # denoise
-    if len(wav) > hparams.sample_rate * (0.3 + 0.1):
-        noise_wav = np.concatenate([wav[:int(hparams.sample_rate * 0.15)],
-                                    wav[-int(hparams.sample_rate * 0.15):]])
-        profile = logmmse.profile_noise(noise_wav, hparams.sample_rate)
-        wav = logmmse.denoise(wav, profile)
-
-    # Trim excessive silences
-    wav = encoder.preprocess_wav(wav)
-
-
-
-
-    # Delete the saved output file
-    os.remove("./out.wav")
-    return wav
-
-
-
-
-
--- a/synthesizer/hparams.py
+++ b/synthesizer/hparams.py
@@ -1,5 +1,6 @@
 import ast
 import pprint
+import json

 class HParams(object):
    def __init__(self, **kwargs): self.__dict__.update(kwargs)
@@ -18,6 +19,18 @@ class HParams(object):
                self.__dict__[k] = ast.literal_eval(values[keys.index(k)])
        return self

+    def loadJson(self, dict):
+        print(f"\nLoading the json with {dict}\n")
+        for k in dict.keys():
+            self.__dict__[k] = dict[k]
+        return self
+
+    def dumpJson(self, fp):
+        print(f"\nSaving the json to {fp}\n")
+        with fp.open("w", encoding="utf-8") as f:
+            json.dump(self.__dict__, f)
+        return self
+
 hparams = HParams(
        ### Signal Processing (used in both synthesizer and vocoder)
        sample_rate = 16000,
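
Review note: the new `loadJson` / `dumpJson` pair makes hyperparameters travel with a checkpoint. The following is a self-contained sketch of that round trip. `HParamsSketch`, the snake_case method names, the file name `my_run.json` and the `n_fft=800` value are stand-ins chosen for illustration, not the repository's class or settings.

```python
import json
import tempfile
from pathlib import Path


class HParamsSketch:
    """Minimal stand-in for an attribute-bag hyperparameter object."""

    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

    def load_json(self, values):
        # overwrite or add every key found in the saved config
        self.__dict__.update(values)
        return self

    def dump_json(self, fp):
        with fp.open("w", encoding="utf-8") as f:
            json.dump(self.__dict__, f)
        return self


with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "my_run.json"
    HParamsSketch(sample_rate=16000, n_fft=800).dump_json(config_path)
    with config_path.open("r", encoding="utf-8") as f:
        restored = HParamsSketch(sample_rate=22050).load_json(json.load(f))
```

Values in the saved file take precedence over the in-code defaults, so a model trained with one setting is reloaded with the same setting. Only JSON-serializable values survive the round trip.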
--- a/synthesizer/inference.py
+++ b/synthesizer/inference.py
@@ -10,6 +10,7 @@ from typing import Union, List
 import numpy as np
 import librosa
 from utils import logmmse
+import json
 from pypinyin import lazy_pinyin, Style

 class Synthesizer:
@@ -44,6 +45,11 @@ class Synthesizer:
        return self._model is not None
    
    def load(self):
         """
         Instantiates and loads the model given the weights file that was passed in the constructor.
         """
+        # Try to scan config file
+        model_config_fpaths = list(self.model_fpath.parent.rglob("*.json"))
+        if len(model_config_fpaths)>0 and model_config_fpaths[0].exists():
+            with model_config_fpaths[0].open("r", encoding="utf-8") as f:
+                hparams.loadJson(json.load(f))
--- a/synthesizer/synthesizer_dataset.py
+++ b/synthesizer/synthesizer_dataset.py
@@ -73,6 +73,7 @@ def collate_synthesizer(batch):

    # Speaker embedding (SV2TTS)
    embeds = [x[2] for x in batch]
+    embeds = np.stack(embeds)

    # Index (for vocoder preprocessing)
    indices = [x[3] for x in batch]
--- a/synthesizer/train.py
+++ b/synthesizer/train.py
@@ -12,6 +12,7 @@ from synthesizer.utils.symbols import symbols
 from synthesizer.utils.text import sequence_to_text
 from vocoder.display import *
 from datetime import datetime
+import json
 import numpy as np
 from pathlib import Path
 import sys
@@ -75,6 +76,13 @@ def train(run_id: str, syn_dir: str, models_dir: str, save_every: int,
        if num_chars != loaded_shape[0]:
             print("WARNING: you are using compatible mode due to wrong symbols length, please modify variable _characters in `utils\\symbols.py`")
             num_chars = loaded_shape[0]
+        # Try to scan config file
+        model_config_fpaths = list(weights_fpath.parent.rglob("*.json"))
+        if len(model_config_fpaths)>0 and model_config_fpaths[0].exists():
+            with model_config_fpaths[0].open("r", encoding="utf-8") as f:
+                hparams.loadJson(json.load(f))
+        else:  # save a config
+            hparams.dumpJson(weights_fpath.parent.joinpath(run_id).with_suffix(".json"))


    model = Tacotron(embed_dims=hparams.tts_embed_dims,
@@ -222,7 +230,7 @@ def train(run_id: str, syn_dir: str, models_dir: str, save_every: int,

                # Backup or save model as appropriate
                if backup_every != 0 and step % backup_every == 0 : 
-                    backup_fpath = Path("{}/{}_{}k.pt".format(str(weights_fpath.parent), run_id, k))
+                    backup_fpath = Path("{}/{}_{}.pt".format(str(weights_fpath.parent), run_id, step))
                    model.save(backup_fpath, optimizer)

                if save_every != 0 and step % save_every == 0 : 
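
Review note on the config handling added to `train()`: the hunk looks for any `*.json` beside the weights file, loads the first match into `hparams`, and otherwise writes the current hyperparameters to `<run_id>.json`. The sketch below mirrors that flow under stated assumptions: a plain `dict` stands in for the project's `hparams` object, `load_or_dump_config` is a hypothetical helper name, and `sorted()` is added here to make the choice of file deterministic (the diff takes whatever `rglob` returns first).

```python
import json
import tempfile
from pathlib import Path


def load_or_dump_config(weights_fpath: Path, run_id: str, defaults: dict) -> dict:
    """Load the first *.json found beside the weights, else write `defaults` there."""
    config_fpaths = sorted(weights_fpath.parent.rglob("*.json"))
    if config_fpaths:
        with config_fpaths[0].open("r", encoding="utf-8") as f:
            # Values stored on disk override the in-code defaults.
            return {**defaults, **json.load(f)}
    config_fpath = weights_fpath.parent.joinpath(run_id).with_suffix(".json")
    with config_fpath.open("w", encoding="utf-8") as f:
        json.dump(defaults, f)
    return dict(defaults)


with tempfile.TemporaryDirectory() as tmp:
    weights = Path(tmp) / "my_run.pt"
    # First call: no config exists yet, so the defaults are written to my_run.json.
    first = load_or_dump_config(weights, "my_run", {"sample_rate": 16000})
    # Second call: my_run.json is found, so its stored value wins over the new default.
    second = load_or_dump_config(weights, "my_run", {"sample_rate": 22050})
```

One consequence of this pattern is that a stale JSON file in the model directory silently overrides edited defaults, which may be worth logging when it happens.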
--- a/toolbox/init.py
+++ b/toolbox/init.py
@@ -3,21 +3,17 @@ from encoder import inference as encoder
 from synthesizer.inference import Synthesizer
 from vocoder.wavernn import inference as rnn_vocoder
 from vocoder.hifigan import inference as gan_vocoder
-from vocoder.fregan import inference as fgan_vocoder
+import ppg_extractor as extractor
+import ppg2mel as convertor
 from pathlib import Path
 from time import perf_counter as timer
 from toolbox.utterance import Utterance
+from utils.f0_utils import compute_f0, f02lf0, compute_mean_std, get_converted_lf0uv
 import numpy as np
 import traceback
 import sys
 import torch
-import librosa
 import re
-from audioread.exceptions import NoBackendError
-from specdeno.enhance_speach import enhance
-import os
-from synthesizer.hparams import hparams
-import soundfile as sf

 # 默认使用wavernn
 vocoder = rnn_vocoder
@@ -54,14 +50,20 @@ recognized_datasets = [
 MAX_WAVES = 15

 class Toolbox:
-    def __init__(self, datasets_root, enc_models_dir, syn_models_dir, voc_models_dir, seed, no_mp3_support):
+    def __init__(self, datasets_root, enc_models_dir, syn_models_dir, voc_models_dir, extractor_models_dir, convertor_models_dir, seed, no_mp3_support, vc_mode):
        self.no_mp3_support = no_mp3_support
+        self.vc_mode = vc_mode
        sys.excepthook = self.excepthook
        self.datasets_root = datasets_root
        self.utterances = set()
        self.current_generated = (None, None, None, None) # speaker_name, spec, breaks, wav
        
        self.synthesizer = None # type: Synthesizer
+
+        # for ppg-based voice conversion
+        self.extractor = None  # PPG extractor
+        self.convertor = None  # PPG-to-mel converter (ppg2mel)
+
        self.current_wav = None
        self.waves_list = []
        self.waves_count = 0
@@ -75,9 +77,9 @@ class Toolbox:
            self.trim_silences = False

        # Initialize the events and the interface
-        self.ui = UI()
+        self.ui = UI(vc_mode)
        self.style_idx = 0
-        self.reset_ui(enc_models_dir, syn_models_dir, voc_models_dir, seed)
+        self.reset_ui(enc_models_dir, syn_models_dir, voc_models_dir, extractor_models_dir, convertor_models_dir, seed)
        self.setup_events()
        self.ui.start()

@@ -101,7 +103,11 @@ class Toolbox:
        self.ui.encoder_box.currentIndexChanged.connect(self.init_encoder)
        def func(): 
            self.synthesizer = None
-        self.ui.synthesizer_box.currentIndexChanged.connect(func)
+        if self.vc_mode:
+            self.ui.extractor_box.currentIndexChanged.connect(self.init_extractor)
+        else:
+            self.ui.synthesizer_box.currentIndexChanged.connect(func)
+
        self.ui.vocoder_box.currentIndexChanged.connect(self.init_vocoder)
        
        # Utterance selection
@@ -114,12 +120,10 @@ class Toolbox:
        self.ui.stop_button.clicked.connect(self.ui.stop)
        self.ui.record_button.clicked.connect(self.record)

-        #添加source_mfcc分析槽
-        func = lambda: self.ui.plot_mfcc(self.ui.selected_utterance.wav, Synthesizer.sample_rate)
-        self.ui.play_button.clicked.connect(func)
-
-
-
+        # Source Utterance selection
+        if self.vc_mode:
+            func = lambda: self.load_soruce_button(self.ui.selected_utterance)
+            self.ui.load_soruce_button.clicked.connect(func)

        #Audio
        self.ui.setup_audio_devices(Synthesizer.sample_rate)
@@ -131,18 +135,18 @@ class Toolbox:
        self.ui.export_wav_button.clicked.connect(func)
        self.ui.waves_cb.currentIndexChanged.connect(self.set_current_wav)

-
-
        # Generation
-        func = lambda: self.synthesize() or self.vocode()
-        self.ui.generate_button.clicked.connect(func)
-        self.ui.synthesize_button.clicked.connect(self.synthesize)
        self.ui.vocode_button.clicked.connect(self.vocode)
        self.ui.random_seed_checkbox.clicked.connect(self.update_seed_textbox)

-        # 添加result_mfcc分析槽,该槽要在语音合成之后
-        func = lambda: self.ui.plot_mfcc1(self.current_wav, Synthesizer.sample_rate)
-        self.ui.generate_button.clicked.connect(func)
+        if self.vc_mode:
+            func = lambda: self.convert() or self.vocode()
+            self.ui.convert_button.clicked.connect(func)
+        else:
+            func = lambda: self.synthesize() or self.vocode()
+            self.ui.generate_button.clicked.connect(func)
+            self.ui.synthesize_button.clicked.connect(self.synthesize)
+
        # UMAP legend
        self.ui.clear_button.clicked.connect(self.clear_utterances)

@@ -155,9 +159,9 @@ class Toolbox:
    def replay_last_wav(self):
        self.ui.play(self.current_wav, Synthesizer.sample_rate)

-    def reset_ui(self, encoder_models_dir, synthesizer_models_dir, vocoder_models_dir, seed):
+    def reset_ui(self, encoder_models_dir, synthesizer_models_dir, vocoder_models_dir, extractor_models_dir, convertor_models_dir, seed):
        self.ui.populate_browser(self.datasets_root, recognized_datasets, 0, True)
-        self.ui.populate_models(encoder_models_dir, synthesizer_models_dir, vocoder_models_dir)
+        self.ui.populate_models(encoder_models_dir, synthesizer_models_dir, vocoder_models_dir, extractor_models_dir, convertor_models_dir, self.vc_mode)
        self.ui.populate_gen_options(seed, self.trim_silences)
        
    def load_from_browser(self, fpath=None):
@@ -184,18 +188,16 @@ class Toolbox:

        # Get the wav from the disk. We take the wav with the vocoder/synthesizer format for
        # playback, so as to have a fair comparison with the generated audio
-        #wav = Synthesizer.load_preprocess_wav(fpath)
-        wav = enhance(fpath)
-
+        wav = Synthesizer.load_preprocess_wav(fpath)
        self.ui.log("Loaded %s" % name)

        self.add_real_utterance(wav, name, speaker_name)
-        
+    
+    def load_soruce_button(self, utterance: Utterance):
+        self.selected_source_utterance = utterance
+
    def record(self):
        wav = self.ui.record_one(encoder.sampling_rate, 5)
-        sf.write('output1.wav', wav, hparams.sample_rate)  # 先将变量wav写为文件的形式
-        wav = enhance('output1.wav')
-        os.remove("./output1.wav")
        if wav is None:
            return 
        self.ui.play(wav, encoder.sampling_rate)
@@ -218,7 +220,7 @@ class Toolbox:
        # Add the utterance
        utterance = Utterance(name, speaker_name, wav, spec, embed, partial_embeds, False)
        self.utterances.add(utterance)
-        self.ui.register_utterance(utterance)
+        self.ui.register_utterance(utterance, self.vc_mode)

        # Plot it
        self.ui.draw_embed(embed, name, "current")
@@ -291,7 +293,7 @@ class Toolbox:
            self.ui.set_loading(i, seq_len)
        if self.ui.current_vocoder_fpath is not None:
            self.ui.log("")
-            wav = vocoder.infer_waveform(spec, progress_callback=vocoder_progress)
+            wav, sample_rate = vocoder.infer_waveform(spec, progress_callback=vocoder_progress)
        else:
            self.ui.log("Waveform generation with Griffin-Lim... ")
            wav = Synthesizer.griffin_lim(spec)
@@ -302,19 +304,16 @@ class Toolbox:
        b_ends = np.cumsum(np.array(breaks) * Synthesizer.hparams.hop_size)
        b_starts = np.concatenate(([0], b_ends[:-1]))
        wavs = [wav[start:end] for start, end, in zip(b_starts, b_ends)]
-        breaks = [np.zeros(int(0.15 * Synthesizer.sample_rate))] * len(breaks)
+        breaks = [np.zeros(int(0.15 * sample_rate))] * len(breaks)
        wav = np.concatenate([i for w, b in zip(wavs, breaks) for i in (w, b)])

        # Trim excessive silences
        if self.ui.trim_silences_checkbox.isChecked():
-            #wav = encoder.preprocess_wav(wav)
-            sf.write('output.wav', wav, hparams.sample_rate)      #先将变量wav写为文件的形式
-            wav = enhance('output.wav')
-            os.remove("./output.wav")
+            wav = encoder.preprocess_wav(wav)

        # Play it
        wav = wav / np.abs(wav).max() * 0.97
-        self.ui.play(wav, Synthesizer.sample_rate)
+        self.ui.play(wav, sample_rate)

        # Name it (history displayed in combobox)
        # TODO better naming for the combobox items?
@@ -356,6 +355,68 @@ class Toolbox:
        self.ui.draw_embed(embed, name, "generated")
        self.ui.draw_umap_projections(self.utterances)
        
+    def convert(self):
+        self.ui.log("Extracting PPG and converting...")
+        self.ui.set_loading(1)
+        
+        # Init
+        if self.convertor is None:
+            self.init_convertor()
+        if self.extractor is None:
+            self.init_extractor()
+        
+        src_wav = self.selected_source_utterance.wav
+
+        # Compute the ppg
+        if self.extractor is not None:
+            ppg = self.extractor.extract_from_wav(src_wav)
+        
+        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+        ref_wav = self.ui.selected_utterance.wav
+        ref_lf0_mean, ref_lf0_std = compute_mean_std(f02lf0(compute_f0(ref_wav)))
+        lf0_uv = get_converted_lf0uv(src_wav, ref_lf0_mean, ref_lf0_std, convert=True)
+        min_len = min(ppg.shape[1], len(lf0_uv))
+        ppg = ppg[:, :min_len]
+        lf0_uv = lf0_uv[:min_len]
+        _, mel_pred, att_ws = self.convertor.inference(
+            ppg,
+            logf0_uv=torch.from_numpy(lf0_uv).unsqueeze(0).float().to(device),
+            spembs=torch.from_numpy(self.ui.selected_utterance.embed).unsqueeze(0).to(device),
+        )
+        mel_pred= mel_pred.transpose(0, 1)
+        breaks = [mel_pred.shape[1]]
+        mel_pred= mel_pred.detach().cpu().numpy()
+        self.ui.draw_spec(mel_pred, "generated")
+        self.current_generated = (self.ui.selected_utterance.speaker_name, mel_pred, breaks, None)
+        self.ui.set_loading(0)
+
+    def init_extractor(self):
+        if self.ui.current_extractor_fpath is None:
+            return
+        model_fpath = self.ui.current_extractor_fpath
+        self.ui.log("Loading the extractor %s... " % model_fpath)
+        self.ui.set_loading(1)
+        start = timer()
+        self.extractor = extractor.load_model(model_fpath)
+        self.ui.log("Done (%dms)." % int(1000 * (timer() - start)), "append")
+        self.ui.set_loading(0)
+
+    def init_convertor(self):
+        if self.ui.current_convertor_fpath is None:
+            return
+        model_fpath = self.ui.current_convertor_fpath
+        # Search for a config file next to the model
+        model_config_fpaths = list(model_fpath.parent.rglob("*.yaml"))
+        if len(model_config_fpaths) == 0:
+            return
+        model_config_fpath = model_config_fpaths[0]
+        self.ui.log("Loading the convertor %s... " % model_fpath)
+        self.ui.set_loading(1)
+        start = timer()
+        self.convertor = convertor.load_model(model_config_fpath, model_fpath)
+        self.ui.log("Done (%dms)." % int(1000 * (timer() - start)), "append")
+        self.ui.set_loading(0)
+        
    def init_encoder(self):
        model_fpath = self.ui.current_encoder_fpath
        
@@ -383,15 +444,16 @@ class Toolbox:
        # Case of Griffin-lim
        if model_fpath is None:
            return 
-        
-
-        # Select vocoder based on model name
-        if model_fpath.name is not None and model_fpath.name.find("hifigan") > -1:
+        # Select vocoder based on model name
+        model_config_fpath = None
+        if model_fpath.name[0] == "g":
            vocoder = gan_vocoder
            self.ui.log("set hifigan as vocoder")
-        elif model_fpath.name is not None and model_fpath.name.find("fregan") > -1:
-            vocoder = fgan_vocoder
-            self.ui.log("set fregan as vocoder")
+            # Search for a config file next to the model
+            model_config_fpaths = list(model_fpath.parent.rglob("*.json"))
+            if len(model_config_fpaths) == 0:
+                return
+            model_config_fpath = model_config_fpaths[0]
        else:
            vocoder = rnn_vocoder
            self.ui.log("set wavernn as vocoder")
@@ -399,11 +461,9 @@ class Toolbox:
        self.ui.log("Loading the vocoder %s... " % model_fpath)
        self.ui.set_loading(1)
        start = timer()
-        vocoder.load_model(model_fpath)
+        vocoder.load_model(model_fpath, model_config_fpath)
        self.ui.log("Done (%dms)." % int(1000 * (timer() - start)), "append")
        self.ui.set_loading(0)

    def update_seed_textbox(self):
-       self.ui.update_seed_textbox()
-
-
+       self.ui.update_seed_textbox() 
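
Review note on the length alignment in `convert()`: the PPG sequence and the log-F0/voicing sequence are computed separately, so their frame counts can differ by a few frames, and both are truncated to the shorter length before inference. A minimal sketch of that step, assuming the PPG is laid out as `(batch, frames, ppg_dim)` and `lf0_uv` as `(frames, 2)`; the concrete sizes 103, 100 and 144 are illustrative only, and NumPy arrays stand in for whatever tensor types the real models return.

```python
import numpy as np

ppg = np.zeros((1, 103, 144), dtype=np.float32)  # (batch, frames, ppg_dim)
lf0_uv = np.zeros((100, 2), dtype=np.float32)    # (frames, [log-F0, voiced flag])

# Truncate both sequences to the shorter frame count so they line up frame by frame.
min_len = min(ppg.shape[1], len(lf0_uv))
ppg = ppg[:, :min_len]
lf0_uv = lf0_uv[:min_len]

print(ppg.shape, lf0_uv.shape)
```

Truncation drops at most the trailing frames of the longer sequence; it does not correct a genuine frame-rate mismatch between the two feature extractors.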
--- a/toolbox/assets/1.png
+++ b/toolbox/assets/1.png
--- a/toolbox/assets/2.png
+++ b/toolbox/assets/2.png
--- a/toolbox/assets/picture1.jpg
+++ b/toolbox/assets/picture1.jpg
--- a/toolbox/assets/按钮控件.png
+++ b/toolbox/assets/按钮控件.png
--- a/toolbox/ui.py
+++ b/toolbox/ui.py
@@ -1,12 +1,9 @@
-import matplotlib.pyplot as plt
-import numpy
-from scipy.fftpack import dct
-from PyQt5.QtGui import QPalette, QBrush, QPixmap
-from matplotlib.backends.backend_qt5agg import FigureCanvasQTAgg as FigureCanvas
-from matplotlib.figure import Figure
 from PyQt5.QtCore import Qt, QStringListModel
 from PyQt5 import QtGui
 from PyQt5.QtWidgets import *
+import matplotlib.pyplot as plt
+from matplotlib.backends.backend_qt5agg import FigureCanvasQTAgg as FigureCanvas
+from matplotlib.figure import Figure
 from encoder.inference import plot_embedding_as_heatmap
 from toolbox.utterance import Utterance
 from pathlib import Path
@@ -19,15 +16,9 @@ from time import sleep
 import umap
 import sys
 from warnings import filterwarnings, warn
-
-
-
 filterwarnings("ignore")


-
-
-
 colormap = np.array([
    [0, 127, 70],
    [255, 0, 0],
@@ -46,7 +37,7 @@ colormap = np.array([
 ], dtype=np.float) / 255 

 default_text = \
-    "请输入需要克隆的语音文本！"
+    "欢迎使用工具箱, 现已支持中文输入！"


   
@@ -58,12 +49,7 @@ class UI(QDialog):
    def draw_utterance(self, utterance: Utterance, which):
        self.draw_spec(utterance.spec, which)
        self.draw_embed(utterance.embed, utterance.name, which)
-
-
-
-
-
-
+    
    def draw_embed(self, embed, name, which):
        embed_ax, _ = self.current_ax if which == "current" else self.gen_ax
        embed_ax.figure.suptitle("" if embed is None else name)
@@ -110,7 +96,7 @@ class UI(QDialog):

        # Display a message if there aren't enough points
        if len(utterances) < self.min_umap_points:
-            self.umap_ax.text(.5, .5, "umap:\nAdd %d more points to\ngenerate the projections" %
+            self.umap_ax.text(.5, .5, "Add %d more points to\ngenerate the projections" % 
                              (self.min_umap_points - len(utterances)), 
                              horizontalalignment='center', fontsize=15)
            self.umap_ax.set_title("")
@@ -241,110 +227,6 @@ class UI(QDialog):
        
        return wav.squeeze()

-
-
-    #添加source_mfcc分析函数
-    def plot_mfcc(self, wav, sample_rate):
-
-        signal = wav
-        print(sample_rate, len(signal))
-        # 读取前3.5s 的数据
-        signal = signal[0:int(3.5 * sample_rate)]
-        print(signal)
-
-        # 预先处理
-        pre_emphasis = 0.97
-        emphasized_signal = numpy.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
-
-        frame_size = 0.025
-        frame_stride = 0.1
-        frame_length, frame_step = frame_size * sample_rate, frame_stride * sample_rate
-        signal_length = len(emphasized_signal)
-        frame_length = int(round(frame_length))
-        frame_step = int(round(frame_step))
-        num_frames = int(numpy.ceil(float(numpy.abs(signal_length - frame_length)) / frame_step))
-
-        pad_signal_length = num_frames * frame_step + frame_length
-        z = numpy.zeros((pad_signal_length - signal_length))
-        pad_signal = numpy.append(emphasized_signal, z)
-
-        indices = numpy.tile(numpy.arange(0, frame_length), (num_frames, 1)) + numpy.tile(
-            numpy.arange(0, num_frames * frame_step, frame_step), (frame_length, 1)).T
-
-        frames = pad_signal[numpy.mat(indices).astype(numpy.int32, copy=False)]
-
-        # 加上汉明窗
-        frames *= numpy.hamming(frame_length)
-        # frames *= 0.54 - 0.46 * numpy.cos((2 * numpy.pi * n) / (frame_length - 1))  # Explicit Implementation **
-
-        # 傅立叶变换和功率谱
-        NFFT = 512
-        mag_frames = numpy.absolute(numpy.fft.rfft(frames, NFFT))  # Magnitude of the FFT
-        # print(mag_frames.shape)
-        pow_frames = ((1.0 / NFFT) * ((mag_frames) ** 2))  # Power Spectrum
-
-        low_freq_mel = 0
-        # 将频率转换为Mel
-        nfilt = 40
-        high_freq_mel = (2595 * numpy.log10(1 + (sample_rate / 2) / 700))
-        mel_points = numpy.linspace(low_freq_mel, high_freq_mel, nfilt + 2)  # Equally spaced in Mel scale
-        hz_points = (700 * (10 ** (mel_points / 2595) - 1))  # Convert Mel to Hz
-
-        bin = numpy.floor((NFFT + 1) * hz_points / sample_rate)
-
-        fbank = numpy.zeros((nfilt, int(numpy.floor(NFFT / 2 + 1))))
-
-        for m in range(1, nfilt + 1):
-            f_m_minus = int(bin[m - 1])  # left
-            f_m = int(bin[m])  # center
-            f_m_plus = int(bin[m + 1])  # right
-            for k in range(f_m_minus, f_m):
-                fbank[m - 1, k] = (k - bin[m - 1]) / (bin[m] - bin[m - 1])
-            for k in range(f_m, f_m_plus):
-                fbank[m - 1, k] = (bin[m + 1] - k) / (bin[m + 1] - bin[m])
-        filter_banks = numpy.dot(pow_frames, fbank.T)
-        filter_banks = numpy.where(filter_banks == 0, numpy.finfo(float).eps, filter_banks)  # Numerical Stability
-        filter_banks = 20 * numpy.log10(filter_banks)  # dB
-
-        # 所得到的倒谱系数2-13被保留，其余的被丢弃
-        num_ceps = 12
-        mfcc = dct(filter_banks, type=2, axis=1, norm='ortho')[:, 1: (num_ceps + 1)]
-        (nframes, ncoeff) = mfcc.shape
-
-        n = numpy.arange(ncoeff)
-        cep_lifter = 22
-        lift = 1 + (cep_lifter / 2) * numpy.sin(numpy.pi * n / cep_lifter)
-        mfcc *= lift  # *
-
-        # filter_banks -= (numpy.mean(filter_banks, axis=0) + 1e-8)
-        mfcc -= (numpy.mean(mfcc, axis=0) + 1e-8)
-        print(mfcc.shape)
-
-        # 创建新的figure
-        fig10 = plt.figure(figsize=(16,8))
-
-        # 绘制1x2两行两列共四个图，编号从1开始
-        ax = fig10.add_subplot(121)
-        plt.plot(mfcc)
-
-        ax = fig10.add_subplot(122)
-        # 平均归一化MFCC
-        mfcc -= (numpy.mean(mfcc, axis=0) + 1e-8)
-        plt.imshow(numpy.flipud(mfcc.T), cmap=plt.cm.jet, aspect=0.2,
-                   extent=[0, mfcc.shape[0], 0, mfcc.shape[1]])  # 热力图
-        #将figure保存为png并显示在新创建的子窗口上
-        plt.savefig("fmcc_source.png")
-        dialog_fault = QDialog()
-        dialog_fault.setWindowTitle("源音频MFCC特征图及MFCC平均归一化热图")  # 设置窗口名
-        pic = QPixmap("fmcc_source.png")
-        label_pic = QLabel("show", dialog_fault)
-        label_pic.setPixmap(pic)
-        label_pic.setGeometry(0,0,1500,800)
-        dialog_fault.exec_()
-
-
-
-
    @property        
    def current_dataset_name(self):
        return self.dataset_box.currentText()
@@ -390,7 +272,7 @@ class UI(QDialog):
                datasets = [d.relative_to(datasets_root) for d in datasets if d.exists()]
                self.browser_load_button.setDisabled(len(datasets) == 0)
            if datasets_root is None or len(datasets) == 0:
-                msg = "Tip: Please " + (" select the voice to be cloned" \
+                msg = "Warning: you d" + ("id not pass a root directory for datasets as argument" \
                    if datasets_root is None else "o not have any of the recognized datasets" \
                                                  " in %s" % datasets_root) 
                self.log(msg)
@@ -444,30 +326,51 @@ class UI(QDialog):
    def current_vocoder_fpath(self):
        return self.vocoder_box.itemData(self.vocoder_box.currentIndex())

+    @property
+    def current_extractor_fpath(self):
+        return self.extractor_box.itemData(self.extractor_box.currentIndex())
+
+    @property
+    def current_convertor_fpath(self):
+        return self.convertor_box.itemData(self.convertor_box.currentIndex())
+
    def populate_models(self, encoder_models_dir: Path, synthesizer_models_dir: Path, 
-                        vocoder_models_dir: Path):
+                        vocoder_models_dir: Path, extractor_models_dir: Path, convertor_models_dir: Path, vc_mode: bool):
        # Encoder
        encoder_fpaths = list(encoder_models_dir.glob("*.pt"))
        if len(encoder_fpaths) == 0:
            raise Exception("No encoder models found in %s" % encoder_models_dir)
        self.repopulate_box(self.encoder_box, [(f.stem, f) for f in encoder_fpaths])
        
-        # Synthesizer
-        synthesizer_fpaths = list(synthesizer_models_dir.glob("**/*.pt"))
-        if len(synthesizer_fpaths) == 0:
-            raise Exception("No synthesizer models found in %s" % synthesizer_models_dir)
-        self.repopulate_box(self.synthesizer_box, [(f.stem, f) for f in synthesizer_fpaths])
+        if vc_mode:
+            # Extractor
+            extractor_fpaths = list(extractor_models_dir.glob("*.pt"))
+            if len(extractor_fpaths) == 0:
+                self.log("No extractor models found in %s" % extractor_models_dir)
+            self.repopulate_box(self.extractor_box, [(f.stem, f) for f in extractor_fpaths])
+            
+            # Convertor
+            convertor_fpaths = list(convertor_models_dir.glob("*.pth"))
+            if len(convertor_fpaths) == 0:
+                self.log("No convertor models found in %s" % convertor_models_dir)
+            self.repopulate_box(self.convertor_box, [(f.stem, f) for f in convertor_fpaths])
+        else:
+            # Synthesizer
+            synthesizer_fpaths = list(synthesizer_models_dir.glob("**/*.pt"))
+            if len(synthesizer_fpaths) == 0:
+                raise Exception("No synthesizer models found in %s" % synthesizer_models_dir)
+            self.repopulate_box(self.synthesizer_box, [(f.stem, f) for f in synthesizer_fpaths])

        # Vocoder
        vocoder_fpaths = list(vocoder_models_dir.glob("**/*.pt"))
        vocoder_items = [(f.stem, f) for f in vocoder_fpaths] + [("Griffin-Lim", None)]
        self.repopulate_box(self.vocoder_box, vocoder_items)
-        
+
    @property
    def selected_utterance(self):
        return self.utterance_history.itemData(self.utterance_history.currentIndex())
        
-    def register_utterance(self, utterance: Utterance):
+    def register_utterance(self, utterance: Utterance, vc_mode):
        self.utterance_history.blockSignals(True)
        self.utterance_history.insertItem(0, utterance.name, utterance)
        self.utterance_history.setCurrentIndex(0)
@@ -477,8 +380,11 @@ class UI(QDialog):
            self.utterance_history.removeItem(self.max_saved_utterances)

        self.play_button.setDisabled(False)
-        self.generate_button.setDisabled(False)
-        self.synthesize_button.setDisabled(False)
+        if vc_mode:
+            self.convert_button.setDisabled(False)
+        else:
+            self.generate_button.setDisabled(False)
+            self.synthesize_button.setDisabled(False)

    def log(self, line, mode="newline"):
        if mode == "newline":
@@ -520,7 +426,7 @@ class UI(QDialog):
        else:
            self.seed_textbox.setEnabled(False)

-    def reset_interface(self):
+    def reset_interface(self, vc_mode):
        self.draw_embed(None, None, "current")
        self.draw_embed(None, None, "generated")
        self.draw_spec(None, "current")
@@ -528,132 +434,26 @@ class UI(QDialog):
        self.draw_umap_projections(set())
        self.set_loading(0)
        self.play_button.setDisabled(True)
-        self.generate_button.setDisabled(True)
-        self.synthesize_button.setDisabled(True)
+        if vc_mode:
+            self.convert_button.setDisabled(True)
+        else:
+            self.generate_button.setDisabled(True)
+            self.synthesize_button.setDisabled(True)
        self.vocode_button.setDisabled(True)
        self.replay_wav_button.setDisabled(True)
        self.export_wav_button.setDisabled(True)
        [self.log("") for _ in range(self.max_log_lines)]

-
-    #添加result_mfcc分析函数
-    def plot_mfcc1(self, wav, sample_rate):
-
-        signal = wav
-        print(sample_rate, len(signal))
-        # 读取前3.5s 的数据
-        signal = signal[0:int(3.5 * sample_rate)]
-        print(signal)
-
-        # 预先处理
-        pre_emphasis = 0.97
-        emphasized_signal = numpy.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
-
-        frame_size = 0.025
-        frame_stride = 0.1
-        frame_length, frame_step = frame_size * sample_rate, frame_stride * sample_rate
-        signal_length = len(emphasized_signal)
-        frame_length = int(round(frame_length))
-        frame_step = int(round(frame_step))
-        num_frames = int(numpy.ceil(float(numpy.abs(signal_length - frame_length)) / frame_step))
-
-        pad_signal_length = num_frames * frame_step + frame_length
-        z = numpy.zeros((pad_signal_length - signal_length))
-        pad_signal = numpy.append(emphasized_signal, z)
-
-        indices = numpy.tile(numpy.arange(0, frame_length), (num_frames, 1)) + numpy.tile(
-            numpy.arange(0, num_frames * frame_step, frame_step), (frame_length, 1)).T
-
-        frames = pad_signal[numpy.mat(indices).astype(numpy.int32, copy=False)]
-
-        # 加上汉明窗
-        frames *= numpy.hamming(frame_length)
-        # frames *= 0.54 - 0.46 * numpy.cos((2 * numpy.pi * n) / (frame_length - 1))  # Explicit Implementation **
-
-        # 傅立叶变换和功率谱
-        NFFT = 512
-        mag_frames = numpy.absolute(numpy.fft.rfft(frames, NFFT))  # Magnitude of the FFT
-        # print(mag_frames.shape)
-        pow_frames = ((1.0 / NFFT) * ((mag_frames) ** 2))  # Power Spectrum
-
-        low_freq_mel = 0
-        # 将频率转换为Mel
-        nfilt = 40
-        high_freq_mel = (2595 * numpy.log10(1 + (sample_rate / 2) / 700))
-        mel_points = numpy.linspace(low_freq_mel, high_freq_mel, nfilt + 2)  # Equally spaced in Mel scale
-        hz_points = (700 * (10 ** (mel_points / 2595) - 1))  # Convert Mel to Hz
-
-        bin = numpy.floor((NFFT + 1) * hz_points / sample_rate)
-
-        fbank = numpy.zeros((nfilt, int(numpy.floor(NFFT / 2 + 1))))
-
-        for m in range(1, nfilt + 1):
-            f_m_minus = int(bin[m - 1])  # left
-            f_m = int(bin[m])  # center
-            f_m_plus = int(bin[m + 1])  # right
-            for k in range(f_m_minus, f_m):
-                fbank[m - 1, k] = (k - bin[m - 1]) / (bin[m] - bin[m - 1])
-            for k in range(f_m, f_m_plus):
-                fbank[m - 1, k] = (bin[m + 1] - k) / (bin[m + 1] - bin[m])
-        filter_banks = numpy.dot(pow_frames, fbank.T)
-        filter_banks = numpy.where(filter_banks == 0, numpy.finfo(float).eps, filter_banks)  # Numerical Stability
-        filter_banks = 20 * numpy.log10(filter_banks)  # dB
-
-        # 所得到的倒谱系数2-13被保留，其余的被丢弃
-        num_ceps = 12
-        mfcc = dct(filter_banks, type=2, axis=1, norm='ortho')[:, 1: (num_ceps + 1)]
-        (nframes, ncoeff) = mfcc.shape
-
-        n = numpy.arange(ncoeff)
-        cep_lifter = 22
-        lift = 1 + (cep_lifter / 2) * numpy.sin(numpy.pi * n / cep_lifter)
-        mfcc *= lift  # *
-
-        # filter_banks -= (numpy.mean(filter_banks, axis=0) + 1e-8)
-        mfcc -= (numpy.mean(mfcc, axis=0) + 1e-8)
-        print(mfcc.shape)
-
-        # 创建新的figure
-        fig11 = plt.figure(figsize=(16,8))
-
-        # 绘制1x2两行两列共四个图，编号从1开始
-        ax = fig11.add_subplot(121)
-        plt.plot(mfcc)
-
-        ax = fig11.add_subplot(122)
-        # 平均归一化MFCC
-        mfcc -= (numpy.mean(mfcc, axis=0) + 1e-8)
-        plt.imshow(numpy.flipud(mfcc.T), cmap=plt.cm.jet, aspect=0.2,
-                   extent=[0, mfcc.shape[0], 0, mfcc.shape[1]])  # 热力图
-        #将figure保存为png并显示在新创建的子窗口上
-        plt.savefig("fmcc_result.png")
-        dialog_fault1 = QDialog()
-        dialog_fault1.setWindowTitle("合成音频MFCC特征图及MFCC平均归一化热图")  # 设置窗口名
-        pic = QPixmap("fmcc_result.png")
-        label_pic = QLabel("show", dialog_fault1)
-        label_pic.setPixmap(pic)
-        label_pic.setGeometry(0,0,1500,800)
-        dialog_fault1.exec_()
-
-
-
-
-    def __init__(self):
+    def __init__(self, vc_mode):
        ## Initialize the application
        self.app = QApplication(sys.argv)
-
-
-
        super().__init__(None)
-        self.setWindowTitle("中文语音克隆系统")
+        self.setWindowTitle("MockingBird GUI")
        self.setWindowIcon(QtGui.QIcon('toolbox\\assets\\mb.png'))
        self.setWindowFlag(Qt.WindowMinimizeButtonHint, True)
        self.setWindowFlag(Qt.WindowMaximizeButtonHint, True)
-
-
-
-
-
+        
+        
        ## Main layouts
        # Root
        root_layout = QGridLayout()
@@ -686,60 +486,46 @@ class UI(QDialog):
        self.projections_layout.addWidget(FigureCanvas(fig))
        self.umap_hot = False
        self.clear_button = QPushButton("Clear")
-        self.clear_button.setStyleSheet('QPushButton{border-image:url(toolbox/assets/2.png)}')
        self.projections_layout.addWidget(self.clear_button)


        ## Browser
        # Dataset, speaker and utterance selection
        i = 0
-
+        
        source_groupbox = QGroupBox('Source(源音频)')
        source_layout = QGridLayout()
        source_groupbox.setLayout(source_layout)
-        browser_layout.addWidget(source_groupbox, i, 0, 1, 4)
+        browser_layout.addWidget(source_groupbox, i, 0, 1, 5)

        self.dataset_box = QComboBox()
-      #  source_layout.addWidget(QLabel("Dataset(数据集):"), i, 0)  #隐藏标签文字
+        source_layout.addWidget(QLabel("Dataset(数据集):"), i, 0)
        source_layout.addWidget(self.dataset_box, i, 1)
        self.random_dataset_button = QPushButton("Random")
        source_layout.addWidget(self.random_dataset_button, i, 2)
-
-        self.random_dataset_button.hide()    #隐藏按钮
-        self.dataset_box.hide()              #隐藏选项条
-
        i += 1
        self.speaker_box = QComboBox()
-      #  source_layout.addWidget(QLabel("Speaker(说话者)"), i, 0)
+        source_layout.addWidget(QLabel("Speaker(说话者)"), i, 0)
        source_layout.addWidget(self.speaker_box, i, 1)
        self.random_speaker_button = QPushButton("Random")
        source_layout.addWidget(self.random_speaker_button, i, 2)
-
-        self.random_speaker_button.hide()
-        self.speaker_box.hide()
-
        i += 1
        self.utterance_box = QComboBox()
-      #  source_layout.addWidget(QLabel("Utterance(音频):"), i, 0)
+        source_layout.addWidget(QLabel("Utterance(音频):"), i, 0)
        source_layout.addWidget(self.utterance_box, i, 1)
        self.random_utterance_button = QPushButton("Random")
        source_layout.addWidget(self.random_utterance_button, i, 2)

-        self.random_utterance_button.hide()
-        self.utterance_box.hide()
-
        i += 1
        source_layout.addWidget(QLabel("<b>Use(使用):</b>"), i, 0)
-        self.browser_load_button = QPushButton("")
+        self.browser_load_button = QPushButton("Load Above(加载上面)")
        source_layout.addWidget(self.browser_load_button, i, 1, 1, 2)
        self.auto_next_checkbox = QCheckBox("Auto select next")
        self.auto_next_checkbox.setChecked(True)
-        source_layout.addWidget(self.auto_next_checkbox, i + 1, 1)
+        source_layout.addWidget(self.auto_next_checkbox, i+1, 1)
        self.browser_browse_button = QPushButton("Browse(打开本地)")
-        self.browser_browse_button.setStyleSheet('QPushButton{border-image:url(toolbox/assets/1.png)}')
        source_layout.addWidget(self.browser_browse_button, i, 3)
        self.record_button = QPushButton("Record(录音)")
-        self.record_button.setStyleSheet('QPushButton{border-image:url(toolbox/assets/1.png)}')
        source_layout.addWidget(self.record_button, i+1, 3)
        
        i += 2
@@ -748,30 +534,38 @@ class UI(QDialog):
        self.utterance_history = QComboBox()
        browser_layout.addWidget(self.utterance_history, i, 1)
        self.play_button = QPushButton("Play(播放)")
-        self.play_button.setStyleSheet('QPushButton{border-image:url(toolbox/assets/1.png)}')
        browser_layout.addWidget(self.play_button, i, 2)
        self.stop_button = QPushButton("Stop(暂停)")
-        self.stop_button.setStyleSheet('QPushButton{border-image:url(toolbox/assets/1.png)}')
        browser_layout.addWidget(self.stop_button, i, 3)
+        if vc_mode:
+            self.load_soruce_button = QPushButton("Select(选择为被转换的语音输入)")
+            browser_layout.addWidget(self.load_soruce_button, i, 4)

        i += 1
        model_groupbox = QGroupBox('Models(模型选择)')
        model_layout = QHBoxLayout()
        model_groupbox.setLayout(model_layout)
-        browser_layout.addWidget(model_groupbox, i, 0, 1, 4)
+        browser_layout.addWidget(model_groupbox, i, 0, 2, 5)

        # Model and audio output selection
        self.encoder_box = QComboBox()
        model_layout.addWidget(QLabel("Encoder:"))
        model_layout.addWidget(self.encoder_box)
        self.synthesizer_box = QComboBox()
-        model_layout.addWidget(QLabel("Synthesizer:"))
-        model_layout.addWidget(self.synthesizer_box)
+        if vc_mode:
+            self.extractor_box = QComboBox()
+            model_layout.addWidget(QLabel("Extractor:"))
+            model_layout.addWidget(self.extractor_box)
+            self.convertor_box = QComboBox()
+            model_layout.addWidget(QLabel("Convertor:"))
+            model_layout.addWidget(self.convertor_box)
+        else:
+            model_layout.addWidget(QLabel("Synthesizer:"))
+            model_layout.addWidget(self.synthesizer_box)
        self.vocoder_box = QComboBox()
        model_layout.addWidget(QLabel("Vocoder:"))
        model_layout.addWidget(self.vocoder_box)
-        
-
+    
        #Replay & Save Audio
        i = 0
        output_layout.addWidget(QLabel("<b>Toolbox Output:</b>"), i, 0)
@@ -780,12 +574,10 @@ class UI(QDialog):
        self.waves_cb.setModel(self.waves_cb_model)
        self.waves_cb.setToolTip("Select one of the last generated waves in this section for replaying or exporting")
        output_layout.addWidget(self.waves_cb, i, 1)
-        self.replay_wav_button = QPushButton("Replay(重播)")
-        self.replay_wav_button.setStyleSheet('QPushButton{border-image:url(toolbox/assets/1.png)}')
+        self.replay_wav_button = QPushButton("Replay")
        self.replay_wav_button.setToolTip("Replay last generated vocoder")
        output_layout.addWidget(self.replay_wav_button, i, 2)
-        self.export_wav_button = QPushButton("Export(导出)")
-        self.export_wav_button.setStyleSheet('QPushButton{border-image:url(toolbox/assets/1.png)}')
+        self.export_wav_button = QPushButton("Export")
        self.export_wav_button.setToolTip("Save last generated vocoder audio in filesystem as a wav file")
        output_layout.addWidget(self.export_wav_button, i, 3)
        self.audio_out_devices_cb=QComboBox()
@@ -795,28 +587,15 @@ class UI(QDialog):

        ## Embed & spectrograms
        vis_layout.addStretch()
-
-        #添加标签控件，设置标签文字格式并且居中
-        label1 = QLabel("source audio")
-        label1.setStyleSheet("QLabel{color:red;font-size:20px;font-weight:bold;font-family:Roman times;}")
-        label1.setAlignment(Qt.AlignCenter)
-        vis_layout.addWidget(label1)      #addwidget:添加控件
-
+        # TODO: add spectrograms for source
        gridspec_kw = {"width_ratios": [1, 4]}
-        fig, self.current_ax = plt.subplots(1, 2, figsize=(10, 2.25), facecolor="#F0F0F0",
+        fig, self.current_ax = plt.subplots(1, 2, figsize=(10, 2.25), facecolor="#F0F0F0", 
                                            gridspec_kw=gridspec_kw)
-        #self.current_ax[1].set_title("source audio", fontsize=50, color='red', fontstyle='italic', fontweight="heavy")
        fig.subplots_adjust(left=0, bottom=0.1, right=1, top=0.8)
        vis_layout.addWidget(FigureCanvas(fig))

-        label2 = QLabel("target audio")
-        label2.setStyleSheet("QLabel{color:red;font-size:20px;font-weight:bold;font-family:Roman times;}")
-        label2.setAlignment(Qt.AlignCenter)
-        vis_layout.addWidget(label2)
-
-        fig, self.gen_ax = plt.subplots(1, 2, figsize=(10, 2.25), facecolor="#F0F0F0",
+        fig, self.gen_ax = plt.subplots(1, 2, figsize=(10, 2.25), facecolor="#F0F0F0", 
                                        gridspec_kw=gridspec_kw)
-        #self.gen_ax[1].set_title("target audio", fontsize=50, color='red', fontstyle='italic', fontweight="heavy")
        fig.subplots_adjust(left=0, bottom=0.1, right=1, top=0.8)
        vis_layout.addWidget(FigureCanvas(fig))

@@ -824,37 +603,36 @@ class UI(QDialog):
            ax.set_facecolor("#F0F0F0")
            for side in ["top", "right", "bottom", "left"]:
                ax.spines[side].set_visible(False)
-
-
-
-
-
+        
        ## Generation
        self.text_prompt = QPlainTextEdit(default_text)
        gen_layout.addWidget(self.text_prompt, stretch=1)
        
-        self.generate_button = QPushButton("Synthesize and vocode(合成并播放)")
-        self.generate_button.setStyleSheet('QPushButton{border-image:url(toolbox/assets/1.png)}')
-        gen_layout.addWidget(self.generate_button)
-        
-        layout = QHBoxLayout()
-        self.synthesize_button = QPushButton("Synthesize only(仅合成)")
-        self.synthesize_button.setStyleSheet('QPushButton{border-image:url(toolbox/assets/1.png)}')
-        layout.addWidget(self.synthesize_button)
-        self.vocode_button = QPushButton("Vocode only(仅播放)")
-        self.vocode_button.setStyleSheet('QPushButton{border-image:url(toolbox/assets/1.png)}')
+        if vc_mode:
+            layout = QHBoxLayout()
+            self.convert_button = QPushButton("Extract and Convert")
+            layout.addWidget(self.convert_button)
+            # layout is added to gen_layout below, after vocode_button joins it
+        else:
+            self.generate_button = QPushButton("Synthesize and vocode")
+            gen_layout.addWidget(self.generate_button)
+            layout = QHBoxLayout()
+            self.synthesize_button = QPushButton("Synthesize only")
+            layout.addWidget(self.synthesize_button)

+        self.vocode_button = QPushButton("Vocode only")
        layout.addWidget(self.vocode_button)
        gen_layout.addLayout(layout)

+
        layout_seed = QGridLayout()
-        self.random_seed_checkbox = QCheckBox("Random seed(随机数种子):")
+        self.random_seed_checkbox = QCheckBox("Random seed:")
        self.random_seed_checkbox.setToolTip("When checked, makes the synthesizer and vocoder deterministic.")
        layout_seed.addWidget(self.random_seed_checkbox, 0, 0)
        self.seed_textbox = QLineEdit()
        self.seed_textbox.setMaximumWidth(80)
        layout_seed.addWidget(self.seed_textbox, 0, 1)
-        self.trim_silences_checkbox = QCheckBox("Enhance vocoder output（语音增强）")
+        self.trim_silences_checkbox = QCheckBox("Enhance vocoder output")
        self.trim_silences_checkbox.setToolTip("When checked, trims excess silence in vocoder output."
            " This feature requires `webrtcvad` to be installed.")
        layout_seed.addWidget(self.trim_silences_checkbox, 0, 2, 1, 2)
@@ -865,7 +643,7 @@ class UI(QDialog):
        self.style_slider.setRange(-1, 9)
        self.style_value_label = QLabel("-1")
        self.style_slider.setValue(-1)
-        layout_seed.addWidget(QLabel("Style(风格):"), 1, 0)
+        layout_seed.addWidget(QLabel("Style:"), 1, 0)

        self.style_slider.valueChanged.connect(lambda s: self.style_value_label.setNum(s))
        layout_seed.addWidget(self.style_value_label, 1, 1)
@@ -876,7 +654,7 @@ class UI(QDialog):
        self.token_slider.setFocusPolicy(Qt.NoFocus)
        self.token_slider.setSingleStep(1)
        self.token_slider.setRange(3, 9)
-        self.token_value_label = QLabel("4")
+        self.token_value_label = QLabel("5")
        self.token_slider.setValue(4)
        layout_seed.addWidget(QLabel("Accuracy(精度):"), 2, 0)

@@ -914,19 +692,8 @@ class UI(QDialog):
        self.resize(max_size)
        
        ## Finalize the display
-        self.reset_interface()
+        self.reset_interface(vc_mode)
        self.show()

-        ##set the picture of background
-        palette1 = QPalette()
-        # palette1.setColor(self.backgroundRole(), QColor(192,253,123))   # 设置背景颜色
-        palette1.setBrush(self.backgroundRole(), QBrush(QPixmap('toolbox\\assets\\picture1.jpg')))  # 设置背景图片
-        self.setPalette(palette1)
-        self.setAutoFillBackground(True)
-
-
-
-
-
    def start(self):
        self.app.exec_()
--- a/utils/audio_utils.py
+++ b/utils/audio_utils.py
@@ -0,0 +1,60 @@
+
+import torch
+import torch.utils.data
+from scipy.io.wavfile import read
+from librosa.filters import mel as librosa_mel_fn
+
+MAX_WAV_VALUE = 32768.0
+
+
+def load_wav(full_path):
+    sampling_rate, data = read(full_path)
+    return data, sampling_rate
+
+def _dynamic_range_compression_torch(x, C=1, clip_val=1e-5):
+    return torch.log(torch.clamp(x, min=clip_val) * C)
+
+
+def _spectral_normalize_torch(magnitudes):
+    output = _dynamic_range_compression_torch(magnitudes)
+    return output
+
+mel_basis = {}
+hann_window = {}
+
+def mel_spectrogram(
+    y, 
+    n_fft, 
+    num_mels, 
+    sampling_rate, 
+    hop_size, 
+    win_size, 
+    fmin, 
+    fmax, 
+    center=False,
+    output_energy=False,
+):
+    if torch.min(y) < -1.:
+        print('min value is ', torch.min(y))
+    if torch.max(y) > 1.:
+        print('max value is ', torch.max(y))
+
+    global mel_basis, hann_window
+    if str(fmax)+'_'+str(y.device) not in mel_basis:  # test the same key that is stored below
+        mel = librosa_mel_fn(sr=sampling_rate, n_fft=n_fft, n_mels=num_mels, fmin=fmin, fmax=fmax)
+        mel_basis[str(fmax)+'_'+str(y.device)] = torch.from_numpy(mel).float().to(y.device)
+        hann_window[str(y.device)] = torch.hann_window(win_size).to(y.device)
+
+    y = torch.nn.functional.pad(y.unsqueeze(1), (int((n_fft-hop_size)/2), int((n_fft-hop_size)/2)), mode='reflect')
+    y = y.squeeze(1)
+
+    spec = torch.stft(y, n_fft, hop_length=hop_size, win_length=win_size, window=hann_window[str(y.device)],
+                      center=center, pad_mode='reflect', normalized=False, onesided=True, return_complex=False)
+    spec = torch.sqrt(spec.pow(2).sum(-1)+(1e-9))
+    mel_spec = torch.matmul(mel_basis[str(fmax)+'_'+str(y.device)], spec)
+    mel_spec = _spectral_normalize_torch(mel_spec)
+    if output_energy:
+        energy = torch.norm(spec, dim=1)
+        return mel_spec, energy
+    else:
+        return mel_spec
--- a/utils/data_load.py
+++ b/utils/data_load.py
@@ -0,0 +1,214 @@
+import random
+import numpy as np
+import torch
+from utils.f0_utils import get_cont_lf0
+import resampy
+from .audio_utils import MAX_WAV_VALUE, load_wav, mel_spectrogram
+from librosa.util import normalize
+import os
+
+
+SAMPLE_RATE=16000
+
+def read_fids(fid_list_f):
+    with open(fid_list_f, 'r') as f:
+        fids = [l.strip().split()[0] for l in f if l.strip()]
+    return fids   
+
+class OneshotVcDataset(torch.utils.data.Dataset):
+    def __init__(
+        self,
+        meta_file: str,
+        vctk_ppg_dir: str,
+        libri_ppg_dir: str,
+        vctk_f0_dir: str,
+        libri_f0_dir: str,
+        vctk_wav_dir: str,
+        libri_wav_dir: str,
+        vctk_spk_dvec_dir: str,
+        libri_spk_dvec_dir: str,
+        min_max_norm_mel: bool = False,
+        mel_min: float = None,
+        mel_max: float = None,
+        ppg_file_ext: str = "ling_feat.npy",
+        f0_file_ext: str = "f0.npy",
+        wav_file_ext: str = "wav",
+    ):
+        self.fid_list = read_fids(meta_file)
+        self.vctk_ppg_dir = vctk_ppg_dir
+        self.libri_ppg_dir = libri_ppg_dir
+        self.vctk_f0_dir = vctk_f0_dir
+        self.libri_f0_dir = libri_f0_dir
+        self.vctk_wav_dir = vctk_wav_dir
+        self.libri_wav_dir = libri_wav_dir
+        self.vctk_spk_dvec_dir = vctk_spk_dvec_dir
+        self.libri_spk_dvec_dir = libri_spk_dvec_dir
+
+        self.ppg_file_ext = ppg_file_ext
+        self.f0_file_ext = f0_file_ext
+        self.wav_file_ext = wav_file_ext
+
+        self.min_max_norm_mel = min_max_norm_mel
+        if min_max_norm_mel:
+            print("[INFO] Min-Max normalize Melspec.")
+            assert mel_min is not None
+            assert mel_max is not None
+            self.mel_max = mel_max
+            self.mel_min = mel_min
+        
+        random.seed(1234)
+        random.shuffle(self.fid_list)
+        print(f'[INFO] Got {len(self.fid_list)} samples.')
+        
+    def __len__(self):
+        return len(self.fid_list)
+    
+    def get_spk_dvec(self, fid):
+        spk_name = fid
+        if spk_name.startswith("p"):
+            spk_dvec_path = f"{self.vctk_spk_dvec_dir}{os.sep}{spk_name}.npy"
+        else:
+            spk_dvec_path = f"{self.libri_spk_dvec_dir}{os.sep}{spk_name}.npy"
+        return torch.from_numpy(np.load(spk_dvec_path))
+    
+    def compute_mel(self, wav_path):
+        audio, sr = load_wav(wav_path)
+        if sr != SAMPLE_RATE:
+            audio = resampy.resample(audio, sr, SAMPLE_RATE)
+        audio = audio / MAX_WAV_VALUE
+        audio = normalize(audio) * 0.95
+        audio = torch.FloatTensor(audio).unsqueeze(0)
+        melspec = mel_spectrogram(
+            audio,
+            n_fft=1024,
+            num_mels=80,
+            sampling_rate=SAMPLE_RATE,
+            hop_size=160,
+            win_size=1024,
+            fmin=80,
+            fmax=8000,
+        )
+        return melspec.squeeze(0).numpy().T
+
+    def bin_level_min_max_norm(self, melspec):
+        # frequency bin level min-max normalization to [-4, 4]
+        mel = (melspec - self.mel_min) / (self.mel_max - self.mel_min) * 8.0 - 4.0
+        return np.clip(mel, -4., 4.)   
+
+    def __getitem__(self, index):
+        fid = self.fid_list[index]
+        
+        # 1. Load features
+        if fid.startswith("p"):
+            # vctk
+            sub = fid.split("_")[0]
+            ppg = np.load(f"{self.vctk_ppg_dir}{os.sep}{fid}.{self.ppg_file_ext}")
+            f0 = np.load(f"{self.vctk_f0_dir}{os.sep}{fid}.{self.f0_file_ext}")
+            mel = self.compute_mel(f"{self.vctk_wav_dir}{os.sep}{sub}{os.sep}{fid}.{self.wav_file_ext}")
+        else:
+            # aidatatang
+            sub = fid[5:10]
+            ppg = np.load(f"{self.libri_ppg_dir}{os.sep}{fid}.{self.ppg_file_ext}")
+            f0 = np.load(f"{self.libri_f0_dir}{os.sep}{fid}.{self.f0_file_ext}")
+            mel = self.compute_mel(f"{self.libri_wav_dir}{os.sep}{sub}{os.sep}{fid}.{self.wav_file_ext}")
+        if self.min_max_norm_mel:
+            mel = self.bin_level_min_max_norm(mel)
+        
+        f0, ppg, mel = self._adjust_lengths(f0, ppg, mel, fid)
+        spk_dvec = self.get_spk_dvec(fid)
+
+        # 2. Convert f0 to continuous log-f0 and u/v flags
+        uv, cont_lf0 = get_cont_lf0(f0, 10.0, False)
+        # cont_lf0 = (cont_lf0 - np.amin(cont_lf0)) / (np.amax(cont_lf0) - np.amin(cont_lf0))
+        # cont_lf0 = self.utt_mvn(cont_lf0)
+        lf0_uv = np.concatenate([cont_lf0[:, np.newaxis], uv[:, np.newaxis]], axis=1)
+
+        # uv, cont_f0 = convert_continuous_f0(f0)
+        # cont_f0 = (cont_f0 - np.amin(cont_f0)) / (np.amax(cont_f0) - np.amin(cont_f0))
+        # lf0_uv = np.concatenate([cont_f0[:, np.newaxis], uv[:, np.newaxis]], axis=1)
+        
+        # 3. Convert numpy array to torch.tensor
+        ppg = torch.from_numpy(ppg)
+        lf0_uv = torch.from_numpy(lf0_uv)
+        mel = torch.from_numpy(mel)
+        
+        return (ppg, lf0_uv, mel, spk_dvec, fid)
+
+    def check_lengths(self, f0, ppg, mel, fid):
+        LEN_THRESH = 10
+        assert abs(len(ppg) - len(f0)) <= LEN_THRESH, \
+            f"{abs(len(ppg) - len(f0))}: for file {fid}"
+        assert abs(len(mel) - len(f0)) <= LEN_THRESH, \
+            f"{abs(len(mel) - len(f0))}: for file {fid}"
+    
+    def _adjust_lengths(self, f0, ppg, mel, fid):
+        self.check_lengths(f0, ppg, mel, fid)
+        min_len = min(
+            len(f0),
+            len(ppg),
+            len(mel),
+        )
+        f0 = f0[:min_len]
+        ppg = ppg[:min_len]
+        mel = mel[:min_len]
+        return f0, ppg, mel
+
+class MultiSpkVcCollate():
+    """Zero-pads model inputs and targets based on number of frames per step
+    """
+    def __init__(self, n_frames_per_step=1, give_uttids=False,
+                 f02ppg_length_ratio=1, use_spk_dvec=False):
+        self.n_frames_per_step = n_frames_per_step
+        self.give_uttids = give_uttids
+        self.f02ppg_length_ratio = f02ppg_length_ratio
+        self.use_spk_dvec = use_spk_dvec
+
+    def __call__(self, batch):
+        batch_size = len(batch)              
+        # Prepare different features 
+        ppgs = [x[0] for x in batch]
+        lf0_uvs = [x[1] for x in batch]
+        mels = [x[2] for x in batch]
+        fids = [x[-1] for x in batch]
+        if len(batch[0]) == 5:
+            spk_ids = [x[3] for x in batch]
+            if self.use_spk_dvec:
+                # use d-vector
+                spk_ids = torch.stack(spk_ids).float()
+            else:
+                # use one-hot ids
+                spk_ids = torch.LongTensor(spk_ids)
+        # Pad features into chunk
+        ppg_lengths = [x.shape[0] for x in ppgs]
+        mel_lengths = [x.shape[0] for x in mels]
+        max_ppg_len = max(ppg_lengths)
+        max_mel_len = max(mel_lengths)
+        if max_mel_len % self.n_frames_per_step != 0:
+            max_mel_len += (self.n_frames_per_step - max_mel_len % self.n_frames_per_step)
+        ppg_dim = ppgs[0].shape[1]
+        mel_dim = mels[0].shape[1]
+        ppgs_padded = torch.FloatTensor(batch_size, max_ppg_len, ppg_dim).zero_()
+        mels_padded = torch.FloatTensor(batch_size, max_mel_len, mel_dim).zero_()
+        lf0_uvs_padded = torch.FloatTensor(batch_size, self.f02ppg_length_ratio * max_ppg_len, 2).zero_()
+        stop_tokens = torch.FloatTensor(batch_size, max_mel_len).zero_()
+        for i in range(batch_size):
+            cur_ppg_len = ppgs[i].shape[0]
+            cur_mel_len = mels[i].shape[0]
+            ppgs_padded[i, :cur_ppg_len, :] = ppgs[i]
+            lf0_uvs_padded[i, :self.f02ppg_length_ratio*cur_ppg_len, :] = lf0_uvs[i]
+            mels_padded[i, :cur_mel_len, :] = mels[i]
+            stop_tokens[i, cur_ppg_len-self.n_frames_per_step:] = 1
+        if len(batch[0]) == 5:
+            ret_tup = (ppgs_padded, lf0_uvs_padded, mels_padded, torch.LongTensor(ppg_lengths), \
+                torch.LongTensor(mel_lengths), spk_ids, stop_tokens)
+            if self.give_uttids:
+                return ret_tup + (fids, )
+            else:
+                return ret_tup
+        else:
+            ret_tup = (ppgs_padded, lf0_uvs_padded, mels_padded, torch.LongTensor(ppg_lengths), \
+                torch.LongTensor(mel_lengths), stop_tokens)
+            if self.give_uttids:
+                return ret_tup + (fids, )
+            else:
+                return ret_tup
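Note on `MultiSpkVcCollate`: the mel length is rounded up to a multiple of `n_frames_per_step`, and the stop targets are set to 1 from the last decoder step of each utterance onward. A plain-Python sketch of those two calculations follows. `padded_len` and `stop_token_row` are hypothetical helper names, and the sketch assumes every utterance has at least `n_frames_per_step` frames.

```python
def padded_len(max_len, n_frames_per_step):
    """Round max_len up to the next multiple of n_frames_per_step."""
    remainder = max_len % n_frames_per_step
    if remainder != 0:
        max_len += n_frames_per_step - remainder
    return max_len


def stop_token_row(cur_len, total_len, n_frames_per_step):
    """Stop targets for one utterance: 0 before the last step, 1 from it onward.

    Assumes cur_len >= n_frames_per_step (hypothetical precondition).
    """
    start = cur_len - n_frames_per_step
    return [1.0 if t >= start else 0.0 for t in range(total_len)]


print(padded_len(10, 4))         # 12
print(stop_token_row(6, 8, 4))   # [0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
```

Padding frames after the end of an utterance are also marked 1, so the decoder is trained to keep predicting "stop" once the real frames run out.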
--- a/utils/f0_utils.py
+++ b/utils/f0_utils.py
@@ -0,0 +1,124 @@
+import logging
+import numpy as np
+import pyworld
+from scipy.interpolate import interp1d
+from scipy.signal import firwin, get_window, lfilter
+
+def compute_mean_std(lf0):
+    nonzero_indices = np.nonzero(lf0)
+    mean = np.mean(lf0[nonzero_indices])
+    std = np.std(lf0[nonzero_indices])
+    return mean, std 
+
+
+def compute_f0(wav, sr=16000, frame_period=10.0):
+    """Compute f0 from wav using pyworld harvest algorithm."""
+    wav = wav.astype(np.float64)
+    f0, _ = pyworld.harvest(
+        wav, sr, frame_period=frame_period, f0_floor=80.0, f0_ceil=600.0)
+    return f0.astype(np.float32)
+
+def f02lf0(f0):
+    lf0 = f0.copy()
+    nonzero_indices = np.nonzero(f0)
+    lf0[nonzero_indices] = np.log(f0[nonzero_indices])
+    return lf0
+
+def get_converted_lf0uv(
+    wav, 
+    lf0_mean_trg, 
+    lf0_std_trg,
+    convert=True,
+):
+    f0_src = compute_f0(wav)
+    if not convert:
+        uv, cont_lf0 = get_cont_lf0(f0_src)
+        lf0_uv = np.concatenate([cont_lf0[:, np.newaxis], uv[:, np.newaxis]], axis=1)
+        return lf0_uv
+
+    lf0_src = f02lf0(f0_src)
+    lf0_mean_src, lf0_std_src = compute_mean_std(lf0_src)
+    
+    lf0_vc = lf0_src.copy()
+    lf0_vc[lf0_src > 0.0] = (lf0_src[lf0_src > 0.0] - lf0_mean_src) / lf0_std_src * lf0_std_trg + lf0_mean_trg
+    f0_vc = lf0_vc.copy()
+    f0_vc[lf0_src > 0.0] = np.exp(lf0_vc[lf0_src > 0.0])
+    
+    uv, cont_lf0_vc = get_cont_lf0(f0_vc)
+    lf0_uv = np.concatenate([cont_lf0_vc[:, np.newaxis], uv[:, np.newaxis]], axis=1)
+    return lf0_uv
+
+def low_pass_filter(x, fs, cutoff=70, padding=True):
+    """FUNCTION TO APPLY LOW PASS FILTER
+
+    Args:
+        x (ndarray): Waveform sequence
+        fs (int): Sampling frequency
+        cutoff (float): Cutoff frequency of low pass filter
+
+    Return:
+        (ndarray): Low pass filtered waveform sequence
+    """
+
+    nyquist = fs // 2
+    norm_cutoff = cutoff / nyquist
+
+    # low cut filter
+    numtaps = 255
+    fil = firwin(numtaps, norm_cutoff)
+    x_pad = np.pad(x, (numtaps, numtaps), 'edge')
+    lpf_x = lfilter(fil, 1, x_pad)
+    lpf_x = lpf_x[numtaps + numtaps // 2: -numtaps // 2]
+
+    return lpf_x
+
+
+def convert_continuos_f0(f0):
+    """CONVERT F0 TO CONTINUOUS F0
+
+    Args:
+        f0 (ndarray): original f0 sequence with the shape (T)
+
+    Return:
+        (ndarray): continuous f0 with the shape (T)
+    """
+    # get uv information as binary
+    uv = np.float32(f0 != 0)
+
+    # get start and end of f0
+    if (f0 == 0).all():
+        logging.warning("all of the f0 values are 0.")
+        return uv, f0
+    start_f0 = f0[f0 != 0][0]
+    end_f0 = f0[f0 != 0][-1]
+
+    # padding start and end of f0 sequence
+    start_idx = np.where(f0 == start_f0)[0][0]
+    end_idx = np.where(f0 == end_f0)[0][-1]
+    f0[:start_idx] = start_f0
+    f0[end_idx:] = end_f0
+
+    # get non-zero frame index
+    nz_frames = np.where(f0 != 0)[0]
+
+    # perform linear interpolation
+    f = interp1d(nz_frames, f0[nz_frames])
+    cont_f0 = f(np.arange(0, f0.shape[0]))
+
+    return uv, cont_f0
+
+
+def get_cont_lf0(f0, frame_period=10.0, lpf=False):
+    uv, cont_f0 = convert_continuos_f0(f0)
+    if lpf:
+        cont_f0_lpf = low_pass_filter(cont_f0, int(1.0 / (frame_period * 0.001)), cutoff=20)
+        cont_lf0_lpf = cont_f0_lpf.copy()
+        nonzero_indices = np.nonzero(cont_lf0_lpf)
+        cont_lf0_lpf[nonzero_indices] = np.log(cont_f0_lpf[nonzero_indices])
+        # cont_lf0_lpf = np.log(cont_f0_lpf)
+        return uv, cont_lf0_lpf 
+    else:
+        nonzero_indices = np.nonzero(cont_f0)
+        cont_lf0 = cont_f0.copy()
+        cont_lf0[cont_f0>0] = np.log(cont_f0[cont_f0>0])
+        return uv, cont_lf0
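Note on `get_converted_lf0uv`: voiced frames are moved from the source speaker's log-F0 statistics to the target's by a mean-variance transform in the log domain, and unvoiced frames (F0 of 0) are left alone. A standard-library sketch of that transform follows. `convert_f0` is a hypothetical name, it treats `f > 0` as voiced (the diff tests `lf0_src > 0.0`), and it assumes at least two distinct voiced values so the source standard deviation is non-zero.

```python
import math
from statistics import fmean, pstdev


def convert_f0(f0_src, lf0_mean_trg, lf0_std_trg):
    """Map voiced F0 values (Hz) onto target log-F0 statistics; 0 marks unvoiced."""
    voiced_lf0 = [math.log(f) for f in f0_src if f > 0]
    mean_src = fmean(voiced_lf0)
    std_src = pstdev(voiced_lf0)  # population std, like np.std's default
    converted = []
    for f in f0_src:
        if f > 0:
            lf0 = (math.log(f) - mean_src) / std_src * lf0_std_trg + lf0_mean_trg
            converted.append(math.exp(lf0))
        else:
            converted.append(0.0)  # keep unvoiced frames at 0
    return converted


# Target statistics chosen so the contour is shifted up by one octave.
src = [0.0, 100.0, 200.0, 0.0]
mean_trg = (math.log(100.0) + math.log(200.0)) / 2 + math.log(2.0)
std_trg = math.log(2.0) / 2
print([round(v, 3) for v in convert_f0(src, mean_trg, std_trg)])
# [0.0, 200.0, 400.0, 0.0]
```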
--- a/utils/load_yaml.py
+++ b/utils/load_yaml.py
@@ -0,0 +1,58 @@
+import yaml
+
+
+def load_hparams(filename):
+    hparams_dict = dict()
+    with open(filename, 'r') as stream:
+        # one dict per YAML document; keys from later documents override earlier ones
+        for doc in yaml.safe_load_all(stream):
+            for k, v in doc.items():
+                hparams_dict[k] = v
+    return hparams_dict
+
+def merge_dict(user, default):
+    if isinstance(user, dict) and isinstance(default, dict):
+        for k, v in default.items():
+            if k not in user:
+                user[k] = v
+            else:
+                user[k] = merge_dict(user[k], v)
+    return user
+
+class Dotdict(dict):
+    """
+    a dictionary that supports dot notation 
+    as well as dictionary access notation 
+    usage: d = Dotdict() or d = Dotdict({'val1':'first'})
+    set attributes: d.val2 = 'second' or d['val2'] = 'second'
+    get attributes: d.val2 or d['val2']
+    """
+    __getattr__ = dict.__getitem__
+    __setattr__ = dict.__setitem__
+    __delattr__ = dict.__delitem__
+
+    def __init__(self, dct=None):
+        dct = dict() if not dct else dct
+        for key, value in dct.items():
+            if hasattr(value, 'keys'):
+                value = Dotdict(value)
+            self[key] = value
+
+class HpsYaml(Dotdict):
+    def __init__(self, yaml_file):
+        super(Dotdict, self).__init__()
+        hps = load_hparams(yaml_file)
+        hp_dict = Dotdict(hps)
+        for k, v in hp_dict.items():
+            setattr(self, k, v)
+
+    __getattr__ = Dotdict.__getitem__
+    __setattr__ = Dotdict.__setitem__
+    __delattr__ = Dotdict.__delitem__
+    
+
+
+
+
+
+
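Note on `Dotdict`: nested mappings are wrapped recursively, so hyperparameters can be read as `hps.section.key`. A reduced copy of the class is repeated here so the sketch runs on its own. The config keys (`train`, `batch_size`, `name`) are made up for illustration. Because `__getattr__` is `dict.__getitem__`, a missing attribute raises `KeyError` rather than `AttributeError`.

```python
class Dotdict(dict):
    """Reduced copy of the class in the diff, for illustration only."""
    __getattr__ = dict.__getitem__
    __setattr__ = dict.__setitem__
    __delattr__ = dict.__delitem__

    def __init__(self, dct=None):
        dct = dict() if not dct else dct
        for key, value in dct.items():
            if hasattr(value, 'keys'):
                value = Dotdict(value)  # wrap nested mappings recursively
            self[key] = value


# Hypothetical config values, not taken from the repo's YAML files.
hps = Dotdict({"train": {"batch_size": 16}, "name": "demo"})
hps.train.batch_size = 32
print(hps.train.batch_size, hps["name"])  # 32 demo

try:
    hps.missing
except KeyError:
    print("missing attribute raises KeyError")
```

One consequence is that `hasattr(hps, "missing")` also raises `KeyError`, since `hasattr` only swallows `AttributeError`; `"missing" in hps` is the safer check.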
--- a/utils/util.py
+++ b/utils/util.py
@@ -0,0 +1,44 @@
+import matplotlib
+matplotlib.use('Agg')
+import time
+
+class Timer():
+    ''' Timer for recording training time distribution. '''
+    def __init__(self):
+        self.prev_t = time.time()
+        self.clear()
+
+    def set(self):
+        self.prev_t = time.time()
+
+    def cnt(self, mode):
+        self.time_table[mode] += time.time()-self.prev_t
+        self.set()
+        if mode == 'bw':
+            self.click += 1
+
+    def show(self):
+        total_time = sum(self.time_table.values())
+        self.time_table['avg'] = total_time/self.click
+        self.time_table['rd'] = 100*self.time_table['rd']/total_time
+        self.time_table['fw'] = 100*self.time_table['fw']/total_time
+        self.time_table['bw'] = 100*self.time_table['bw']/total_time
+        msg = '{avg:.3f} sec/step (rd {rd:.1f}% | fw {fw:.1f}% | bw {bw:.1f}%)'.format(
+            **self.time_table)
+        self.clear()
+        return msg
+
+    def clear(self):
+        self.time_table = {'rd': 0, 'fw': 0, 'bw': 0}
+        self.click = 0
+
+# Reference : https://github.com/espnet/espnet/blob/master/espnet/nets/pytorch_backend/e2e_asr.py#L168
+
+def human_format(num):
+    magnitude = 0
+    while num >= 1000:
+        magnitude += 1
+        num /= 1000.0
+    # add more suffixes if you need them
+    return '{:3.1f}{}'.format(num, [' ', 'K', 'M', 'G', 'T', 'P'][magnitude])
+
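Note on `human_format`: the loop divides by 1000 until the value drops below 1000 and indexes a six-entry suffix list, so a value of 1e18 or more would run past the `'P'` entry. The variant below is a sketch, not the repo's code: it repeats the same loop with the magnitude capped at the last suffix. `SUFFIXES` is a name introduced here.

```python
SUFFIXES = [' ', 'K', 'M', 'G', 'T', 'P']


def human_format(num):
    """Same loop as in the diff, with the magnitude capped at the last suffix."""
    magnitude = 0
    while num >= 1000 and magnitude < len(SUFFIXES) - 1:
        magnitude += 1
        num /= 1000.0
    return '{:3.1f}{}'.format(num, SUFFIXES[magnitude])


print(human_format(1500))     # 1.5K
print(human_format(1234567))  # 1.2M
print(human_format(5e18))     # 5000.0P
```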
--- a/vocoder/fregan/.gitignore
+++ b/vocoder/fregan/.gitignore
@@ -1,129 +0,0 @@
-# Byte-compiled / optimized / DLL files
-__pycache__/
-*.py[cod]
-*$py.class
-
-# C extensions
-*.so
-
-# Distribution / packaging
-.Python
-build/
-develop-eggs/
-dist/
-downloads/
-eggs/
-.eggs/
-lib/
-lib64/
-parts/
-sdist/
-var/
-wheels/
-pip-wheel-metadata/
-share/python-wheels/
-*.egg-info/
-.installed.cfg
-*.egg
-MANIFEST
-
-# PyInstaller
-#  Usually these files are written by a python script from a template
-#  before PyInstaller builds the exe, so as to inject date/other infos into it.
-*.manifest
-*.spec
-
-# Installer logs
-pip-log.txt
-pip-delete-this-directory.txt
-
-# Unit test / coverage reports
-htmlcov/
-.tox/
-.nox/
-.coverage
-.coverage.*
-.cache
-nosetests.xml
-coverage.xml
-*.cover
-*.py,cover
-.hypothesis/
-.pytest_cache/
-
-# Translations
-*.mo
-*.pot
-
-# Django stuff:
-*.log
-local_settings.py
-db.sqlite3
-db.sqlite3-journal
-
-# Flask stuff:
-instance/
-.webassets-cache
-
-# Scrapy stuff:
-.scrapy
-
-# Sphinx documentation
-docs/_build/
-
-# PyBuilder
-target/
-
-# Jupyter Notebook
-.ipynb_checkpoints
-
-# IPython
-profile_default/
-ipython_config.py
-
-# pyenv
-.python-version
-
-# pipenv
-#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
-#   However, in case of collaboration, if having platform-specific dependencies or dependencies
-#   having no cross-platform support, pipenv may install dependencies that don't work, or not
-#   install all needed dependencies.
-#Pipfile.lock
-
-# PEP 582; used by e.g. github.com/David-OConnor/pyflow
-__pypackages__/
-
-# Celery stuff
-celerybeat-schedule
-celerybeat.pid
-
-# SageMath parsed files
-*.sage.py
-
-# Environments
-.env
-.venv
-env/
-venv/
-ENV/
-env.bak/
-venv.bak/
-
-# Spyder project settings
-.spyderproject
-.spyproject
-
-# Rope project settings
-.ropeproject
-
-# mkdocs documentation
-/site
-
-# mypy
-.mypy_cache/
-.dmypy.json
-dmypy.json
-
-# Pyre type checker
-.pyre/
--- a/vocoder/fregan/LICENSE
+++ b/vocoder/fregan/LICENSE
@@ -1,21 +0,0 @@
-MIT License
-
-Copyright (c) 2021 Rishikesh (ऋषिकेश)
-
-Permission is hereby granted, free of charge, to any person obtaining a copy
-of this software and associated documentation files (the "Software"), to deal
-in the Software without restriction, including without limitation the rights
-to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-copies of the Software, and to permit persons to whom the Software is
-furnished to do so, subject to the following conditions:
-
-The above copyright notice and this permission notice shall be included in all
-copies or substantial portions of the Software.
-
-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-SOFTWARE.
--- a/vocoder/fregan/LJSpeech-1.1/training.txt
+++ b/vocoder/fregan/LJSpeech-1.1/training.txt
--- a/vocoder/fregan/LJSpeech-1.1/validation.txt
+++ b/vocoder/fregan/LJSpeech-1.1/validation.txt
@@ -1,150 +0,0 @@
-LJ050-0269|The essential terms of such memoranda might well be embodied in an Executive order.|The essential terms of such memoranda might well be embodied in an Executive order.
-LJ050-0270|This Commission can recommend no procedures for the future protection of our Presidents which will guarantee security.|This Commission can recommend no procedures for the future protection of our Presidents which will guarantee security.
-LJ050-0271|The demands on the President in the execution of His responsibilities in today's world are so varied and complex|The demands on the President in the execution of His responsibilities in today's world are so varied and complex
-LJ050-0272|and the traditions of the office in a democracy such as ours are so deep-seated as to preclude absolute security.|and the traditions of the office in a democracy such as ours are so deep-seated as to preclude absolute security.
-LJ050-0273|The Commission has, however, from its examination of the facts of President Kennedy's assassination|The Commission has, however, from its examination of the facts of President Kennedy's assassination
-LJ050-0274|made certain recommendations which it believes would, if adopted,|made certain recommendations which it believes would, if adopted,
-LJ050-0275|materially improve upon the procedures in effect at the time of President Kennedy's assassination and result in a substantial lessening of the danger.|materially improve upon the procedures in effect at the time of President Kennedy's assassination and result in a substantial lessening of the danger.
-LJ050-0276|As has been pointed out, the Commission has not resolved all the proposals which could be made. The Commission nevertheless is confident that,|As has been pointed out, the Commission has not resolved all the proposals which could be made. The Commission nevertheless is confident that,
-LJ050-0277|with the active cooperation of the responsible agencies and with the understanding of the people of the United States in their demands upon their President,|with the active cooperation of the responsible agencies and with the understanding of the people of the United States in their demands upon their President,
-LJ050-0278|the recommendations we have here suggested would greatly advance the security of the office without any impairment of our fundamental liberties.|the recommendations we have here suggested would greatly advance the security of the office without any impairment of our fundamental liberties.
-LJ001-0028|but by printers in Strasburg, Basle, Paris, Lubeck, and other cities.|but by printers in Strasburg, Basle, Paris, Lubeck, and other cities.
-LJ001-0068|The characteristic Dutch type, as represented by the excellent printer Gerard Leew, is very pronounced and uncompromising Gothic.|The characteristic Dutch type, as represented by the excellent printer Gerard Leew, is very pronounced and uncompromising Gothic.
-LJ002-0149|The latter indeed hung like millstones round the neck of the unhappy insolvent wretches who found themselves in limbo.|The latter indeed hung like millstones round the neck of the unhappy insolvent wretches who found themselves in limbo.
-LJ002-0157|and Susannah Evans, in October the same year, for 2 shillings, with costs of 6 shillings, 8 pence.|and Susannah Evans, in October the same year, for two shillings, with costs of six shillings, eight pence.
-LJ002-0167|quotes a case which came within his own knowledge of a boy sent to prison for non-payment of one penny.|quotes a case which came within his own knowledge of a boy sent to prison for non-payment of one penny.
-LJ003-0042|The completion of this very necessary building was, however, much delayed for want of funds,|The completion of this very necessary building was, however, much delayed for want of funds,
-LJ003-0307|but as yet no suggestion was made to provide prison uniform.|but as yet no suggestion was made to provide prison uniform.
-LJ004-0169|On the dirty bedstead lay a wretched being in the throes of severe illness.|On the dirty bedstead lay a wretched being in the throes of severe illness.
-LJ004-0233|Under the new rule visitors were not allowed to pass into the interior of the prison, but were detained between the grating.|Under the new rule visitors were not allowed to pass into the interior of the prison, but were detained between the grating.
-LJ005-0101|whence it deduced the practice and condition of every prison that replied.|whence it deduced the practice and condition of every prison that replied.
-LJ005-0108|the prisoners, without firing, bedding, or sufficient food, spent their days "in surveying their grotesque prison,|the prisoners, without firing, bedding, or sufficient food, spent their days "in surveying their grotesque prison,
-LJ005-0202|An examination of this report shows how even the most insignificant township had its jail.|An examination of this report shows how even the most insignificant township had its jail.
-LJ005-0234|The visits of friends was once more unreservedly allowed, and these incomers freely brought in extra provisions and beer.|The visits of friends was once more unreservedly allowed, and these incomers freely brought in extra provisions and beer.
-LJ005-0248|and stated that in his opinion Newgate, as the common jail of Middlesex, was wholly inadequate to the proper confinement of its prisoners.|and stated that in his opinion Newgate, as the common jail of Middlesex, was wholly inadequate to the proper confinement of its prisoners.
-LJ006-0001|The Chronicles of Newgate, Volume 2. By Arthur Griffiths. Section 9: The first report of the inspector of prisons.|The Chronicles of Newgate, Volume two. By Arthur Griffiths. Section nine: The first report of the inspector of prisons.
-LJ006-0018|One was Mr. William Crawford, the other the Rev. Whitworth Russell.|One was Mr. William Crawford, the other the Rev. Whitworth Russell.
-LJ006-0034|They attended early and late; they mustered the prisoners, examined into their condition,|They attended early and late; they mustered the prisoners, examined into their condition,
-LJ006-0078|A new prisoner's fate, as to location, rested really with a powerful fellow-prisoner.|A new prisoner's fate, as to location, rested really with a powerful fellow-prisoner.
-LJ007-0217|They go on to say|They go on to say
-LJ007-0243|It was not till the erection of the new prison at Holloway in 1850, and the entire internal reconstruction of Newgate according to new ideas,|It was not till the erection of the new prison at Holloway in eighteen fifty, and the entire internal reconstruction of Newgate according to new ideas,
-LJ008-0087|The change from Tyburn to the Old Bailey had worked no improvement as regards the gathering together of the crowd or its demeanor.|The change from Tyburn to the Old Bailey had worked no improvement as regards the gathering together of the crowd or its demeanor.
-LJ008-0131|the other he kept between his hands.|the other he kept between his hands.
-LJ008-0140|Whenever the public attention had been specially called to a particular crime, either on account of its atrocity,|Whenever the public attention had been specially called to a particular crime, either on account of its atrocity,
-LJ008-0158|The pressure soon became so frightful that many would have willingly escaped from the crowd; but their attempts only increased the general confusion.|The pressure soon became so frightful that many would have willingly escaped from the crowd; but their attempts only increased the general confusion.
-LJ008-0174|One cart-load of spectators having broken down, some of its occupants fell off the vehicle, and were instantly trampled to death.|One cart-load of spectators having broken down, some of its occupants fell off the vehicle, and were instantly trampled to death.
-LJ010-0047|while in 1850 Her Majesty was the victim of another outrage at the hands of one Pate.|while in eighteen fifty Her Majesty was the victim of another outrage at the hands of one Pate.
-LJ010-0061|That some thirty or more needy men should hope to revolutionize England is a sufficient proof of the absurdity of their attempt.|That some thirty or more needy men should hope to revolutionize England is a sufficient proof of the absurdity of their attempt.
-LJ010-0105|Thistlewood was discovered next morning in a mean house in White Street, Moorfields.|Thistlewood was discovered next morning in a mean house in White Street, Moorfields.
-LJ010-0233|Here again probably it was partly the love of notoriety which was the incentive,|Here again probably it was partly the love of notoriety which was the incentive,
-LJ010-0234|backed possibly with the hope that, as in a much more recent case,|backed possibly with the hope that, as in a much more recent case,
-LJ010-0258|As the Queen was driving from Buckingham Palace to the Chapel Royal,|As the Queen was driving from Buckingham Palace to the Chapel Royal,
-LJ010-0262|charged him with the offense.|charged him with the offense.
-LJ010-0270|exactly tallied with that of the deformed person "wanted" for the assault on the Queen.|exactly tallied with that of the deformed person "wanted" for the assault on the Queen.
-LJ010-0293|I have already remarked that as violence was more and more eliminated from crimes against the person,|I have already remarked that as violence was more and more eliminated from crimes against the person,
-LJ011-0009|Nothing more was heard of the affair, although the lady declared that she had never instructed Fauntleroy to sell.|Nothing more was heard of the affair, although the lady declared that she had never instructed Fauntleroy to sell.
-LJ011-0256|By this time the neighbors were aroused, and several people came to the scene of the affray.|By this time the neighbors were aroused, and several people came to the scene of the affray.
-LJ012-0044|When his trade was busiest he set up a second establishment, at the head of which, although he was married,|When his trade was busiest he set up a second establishment, at the head of which, although he was married,
-LJ012-0145|Solomons was now also admitted as a witness, and his evidence, with that of Moss, secured the transportation of the principal actors in the theft.|Solomons was now also admitted as a witness, and his evidence, with that of Moss, secured the transportation of the principal actors in the theft.
-LJ013-0020|he acted in a manner which excited the suspicions of the crew.|he acted in a manner which excited the suspicions of the crew.
-LJ013-0077|Barber and Fletcher were both transported for life, although Fletcher declared that Barber was innocent, and had no guilty knowledge of what was being done.|Barber and Fletcher were both transported for life, although Fletcher declared that Barber was innocent, and had no guilty knowledge of what was being done.
-LJ013-0228|In the pocket of the coat Mr. Cope, the governor, found a neatly-folded cloth, and asked what it was for.|In the pocket of the coat Mr. Cope, the governor, found a neatly-folded cloth, and asked what it was for.
-LJ014-0020|He was soon afterwards arrested on suspicion, and a search of his lodgings brought to light several garments saturated with blood;|He was soon afterwards arrested on suspicion, and a search of his lodgings brought to light several garments saturated with blood;
-LJ014-0054|a maidservant, Sarah Thomas, murdered her mistress, an aged woman, by beating out her brains with a stone.|a maidservant, Sarah Thomas, murdered her mistress, an aged woman, by beating out her brains with a stone.
-LJ014-0101|he found that it was soft and new, while elsewhere it was set and hard.|he found that it was soft and new, while elsewhere it was set and hard.
-LJ014-0103|beneath them was a layer of fresh mortar, beneath that a lot of loose earth, amongst which a stocking was turned up, and presently a human toe.|beneath them was a layer of fresh mortar, beneath that a lot of loose earth, amongst which a stocking was turned up, and presently a human toe.
-LJ014-0263|When other pleasures palled he took a theatre, and posed as a munificent patron of the dramatic art.|When other pleasures palled he took a theatre, and posed as a munificent patron of the dramatic art.
-LJ014-0272|and 1850 to embezzle and apply to his own purposes some £71,000.|and eighteen fifty to embezzle and apply to his own purposes some seventy-one thousand pounds.
-LJ014-0311|His extensive business had been carried on by fraud.|His extensive business had been carried on by fraud.
-LJ015-0197|which at one time spread terror throughout London. Thieves preferred now to use ingenuity rather than brute force.|which at one time spread terror throughout London. Thieves preferred now to use ingenuity rather than brute force.
-LJ016-0089|He was engaged in whitewashing and cleaning; the officer who had him in charge left him on the stairs leading to the gallery.|He was engaged in whitewashing and cleaning; the officer who had him in charge left him on the stairs leading to the gallery.
-LJ016-0407|who generally attended the prison services.|who generally attended the prison services.
-LJ016-0443|He was promptly rescued from his perilous condition, but not before his face and hands were badly scorched.|He was promptly rescued from his perilous condition, but not before his face and hands were badly scorched.
-LJ017-0033|a medical practitioner, charged with doing to death persons who relied upon his professional skill.|a medical practitioner, charged with doing to death persons who relied upon his professional skill.
-LJ017-0038|That the administration of justice should never be interfered with by local prejudice or local feeling|That the administration of justice should never be interfered with by local prejudice or local feeling
-LJ018-0018|he wore gold-rimmed eye-glasses and a gold watch and chain.|he wore gold-rimmed eye-glasses and a gold watch and chain.
-LJ018-0119|His offer was not, however, accepted.|His offer was not, however, accepted.
-LJ018-0280|The commercial experience of these clever rogues was cosmopolitan.|The commercial experience of these clever rogues was cosmopolitan.
-LJ019-0178|and abandoned because of the expense. As to the entire reconstruction of Newgate, nothing had been done as yet.|and abandoned because of the expense. As to the entire reconstruction of Newgate, nothing had been done as yet.
-LJ019-0240|But no structural alterations were made from the date first quoted until the time of closing the prison in 1881.|But no structural alterations were made from the date first quoted until the time of closing the prison in eighteen eighty-one.
-LJ021-0049|and the curtailment of rank stock speculation through the Securities Exchange Act.|and the curtailment of rank stock speculation through the Securities Exchange Act.
-LJ021-0155|both directly on the public works themselves, and indirectly in the industries supplying the materials for these public works.|both directly on the public works themselves, and indirectly in the industries supplying the materials for these public works.
-LJ022-0046|It is true that while business and industry are definitely better our relief rolls are still too large.|It is true that while business and industry are definitely better our relief rolls are still too large.
-LJ022-0173|for the regulation of transportation by water, for the strengthening of our Merchant Marine and Air Transport,|for the regulation of transportation by water, for the strengthening of our Merchant Marine and Air Transport,
-LJ024-0087|I have thus explained to you the reasons that lie behind our efforts to secure results by legislation within the Constitution.|I have thus explained to you the reasons that lie behind our efforts to secure results by legislation within the Constitution.
-LJ024-0110|And the strategy of that last stand is to suggest the time-consuming process of amendment in order to kill off by delay|And the strategy of that last stand is to suggest the time-consuming process of amendment in order to kill off by delay
-LJ024-0119|When before have you found them really at your side in your fights for progress?|When before have you found them really at your side in your fights for progress?
-LJ025-0091|as it was current among contemporary chemists.|as it was current among contemporary chemists.
-LJ026-0029|so in the case under discussion.|so in the case under discussion.
-LJ026-0039|the earliest organisms were protists and that from them animals and plants were evolved along divergent lines of descent.|the earliest organisms were protists and that from them animals and plants were evolved along divergent lines of descent.
-LJ026-0064|but unlike that of the animal, it is not chiefly an income of foods, but only of the raw materials of food.|but unlike that of the animal, it is not chiefly an income of foods, but only of the raw materials of food.
-LJ026-0105|This is done by diastase, an enzyme of plant cells.|This is done by diastase, an enzyme of plant cells.
-LJ026-0137|and be laid down as "reserve starch" in the cells of root or stem or elsewhere.|and be laid down as "reserve starch" in the cells of root or stem or elsewhere.
-LJ027-0006|In all these lines the facts are drawn together by a strong thread of unity.|In all these lines the facts are drawn together by a strong thread of unity.
-LJ028-0134|He also erected what is called a pensile paradise:|He also erected what is called a pensile paradise:
-LJ028-0138|perhaps the tales that travelers told him were exaggerated as travelers' tales are likely to be,|perhaps the tales that travelers told him were exaggerated as travelers' tales are likely to be,
-LJ028-0189|The fall of Babylon with its lofty walls was a most important event in the history of the ancient world.|The fall of Babylon with its lofty walls was a most important event in the history of the ancient world.
-LJ028-0281|Till mules foal ye shall not take our city, he thought, as he reflected on this speech, that Babylon might now be taken,|Till mules foal ye shall not take our city, he thought, as he reflected on this speech, that Babylon might now be taken,
-LJ029-0188|Stevenson was jeered, jostled, and spat upon by hostile demonstrators outside the Dallas Memorial Auditorium Theater.|Stevenson was jeered, jostled, and spat upon by hostile demonstrators outside the Dallas Memorial Auditorium Theater.
-LJ030-0098|The remainder of the motorcade consisted of five cars for other dignitaries, including the mayor of Dallas and Texas Congressmen,|The remainder of the motorcade consisted of five cars for other dignitaries, including the mayor of Dallas and Texas Congressmen,
-LJ031-0007|Chief of Police Curry and police motorcyclists at the head of the motorcade led the way to the hospital.|Chief of Police Curry and police motorcyclists at the head of the motorcade led the way to the hospital.
-LJ031-0091|You have to determine which things, which are immediately life threatening and cope with them, before attempting to evaluate the full extent of the injuries.|You have to determine which things, which are immediately life threatening and cope with them, before attempting to evaluate the full extent of the injuries.
-LJ031-0227|The doctors traced the course of the bullet through the body and, as information was received from Parkland Hospital,|The doctors traced the course of the bullet through the body and, as information was received from Parkland Hospital,
-LJ032-0100|Marina Oswald|Marina Oswald
-LJ032-0165|to the exclusion of all others because there are not enough microscopic characteristics present in fibers.|to the exclusion of all others because there are not enough microscopic characteristics present in fibers.
-LJ032-0198|During the period from March 2, 1963, to April 24, 1963,|During the period from March two, nineteen sixty-three, to April twenty-four, nineteen sixty-three,
-LJ033-0046|went out to the garage to paint some children's blocks, and worked in the garage for half an hour or so.|went out to the garage to paint some children's blocks, and worked in the garage for half an hour or so.
-LJ033-0072|I then stepped off of it and the officer picked it up in the middle and it bent so.|I then stepped off of it and the officer picked it up in the middle and it bent so.
-LJ033-0135|Location of Bag|Location of Bag
-LJ034-0083|The significance of Givens' observation that Oswald was carrying his clipboard|The significance of Givens' observation that Oswald was carrying his clipboard
-LJ034-0179|and, quote, seemed to be sitting a little forward, end quote,|and, quote, seemed to be sitting a little forward, end quote,
-LJ035-0125|Victoria Adams, who worked on the fourth floor of the Depository Building,|Victoria Adams, who worked on the fourth floor of the Depository Building,
-LJ035-0162|approximately 30 to 45 seconds after Oswald's lunchroom encounter with Baker and Truly.|approximately thirty to forty-five seconds after Oswald's lunchroom encounter with Baker and Truly.
-LJ035-0189|Special Agent Forrest V. Sorrels of the Secret Service, who had been in the motorcade,|Special Agent Forrest V. Sorrels of the Secret Service, who had been in the motorcade,
-LJ035-0208|Oswald's known actions in the building immediately after the assassination are consistent with his having been at the southeast corner window of the sixth floor|Oswald's known actions in the building immediately after the assassination are consistent with his having been at the southeast corner window of the sixth floor
-LJ036-0216|Tippit got out and started to walk around the front of the car|Tippit got out and started to walk around the front of the car
-LJ037-0093|William Arthur Smith was about a block east of 10th and Patton when he heard shots.|William Arthur Smith was about a block east of tenth and Patton when he heard shots.
-LJ037-0157|taken from Oswald.|taken from Oswald.
-LJ037-0178|or one used Remington-Peters cartridge case, which may have been in the revolver before the shooting,|or one used Remington-Peters cartridge case, which may have been in the revolver before the shooting,
-LJ037-0219|Oswald's Jacket|Oswald's Jacket
-LJ037-0222|When Oswald was arrested, he did not have a jacket.|When Oswald was arrested, he did not have a jacket.
-LJ038-0017|Attracted by the sound of the sirens, Mrs. Postal stepped out of the box office and walked to the curb.|Attracted by the sound of the sirens, Mrs. Postal stepped out of the box office and walked to the curb.
-LJ038-0052|testified regarding the arrest of Oswald, as did the various police officers who participated in the fight.|testified regarding the arrest of Oswald, as did the various police officers who participated in the fight.
-LJ038-0077|Statements of Oswald during Detention.|Statements of Oswald during Detention.
-LJ038-0161|and he asked me did I know which way he was coming, and I told him, yes, he probably come down Main and turn on Houston and then back again on Elm.|and he asked me did I know which way he was coming, and I told him, yes, he probably come down Main and turn on Houston and then back again on Elm.
-LJ038-0212|which appeared to be the work of a man expecting to be killed, or imprisoned, or to disappear.|which appeared to be the work of a man expecting to be killed, or imprisoned, or to disappear.
-LJ039-0103|Oswald, like all Marine recruits, received training on the rifle range at distances up to 500 yards,|Oswald, like all Marine recruits, received training on the rifle range at distances up to five hundred yards,
-LJ039-0149|established that they had been previously loaded and ejected from the assassination rifle,|established that they had been previously loaded and ejected from the assassination rifle,
-LJ040-0107|but apparently was not able to spend as much time with them as he would have liked, because of the age gaps of 5 and 7 years,|but apparently was not able to spend as much time with them as he would have liked, because of the age gaps of five and seven years,
-LJ040-0119|When Pic returned home, Mrs. Oswald tried to play down the event but Mrs. Pic took a different view and asked the Oswalds to leave.|When Pic returned home, Mrs. Oswald tried to play down the event but Mrs. Pic took a different view and asked the Oswalds to leave.
-LJ040-0161|Dr. Hartogs recommended that Oswald be placed on probation on condition that he seek help and guidance through a child guidance clinic.|Dr. Hartogs recommended that Oswald be placed on probation on condition that he seek help and guidance through a child guidance clinic.
-LJ040-0169|She observed that since Lee's mother worked all day, he made his own meals and spent all his time alone|She observed that since Lee's mother worked all day, he made his own meals and spent all his time alone
-LJ041-0098|All the Marine Corps did was to teach you to kill and after you got out of the Marines you might be good gangsters, end quote.|All the Marine Corps did was to teach you to kill and after you got out of the Marines you might be good gangsters, end quote.
-LJ042-0017|and see for himself how a revolutionary society operates, a Marxist society.|and see for himself how a revolutionary society operates, a Marxist society.
-LJ042-0070|Oswald was discovered in time to thwart his attempt at suicide.|Oswald was discovered in time to thwart his attempt at suicide.
-LJ042-0161|Immediately after serving out his 3 years in the U.S. Marine Corps, he abandoned his American life to seek a new life in the USSR.|Immediately after serving out his three years in the U.S. Marine Corps, he abandoned his American life to seek a new life in the USSR.
-LJ043-0147|He had left a note for his wife telling her what to do in case he were apprehended, as well as his notebook and the pictures of himself holding the rifle.|He had left a note for his wife telling her what to do in case he were apprehended, as well as his notebook and the pictures of himself holding the rifle.
-LJ043-0178|as, in fact, one of them did appear after the assassination.|as, in fact, one of them did appear after the assassination.
-LJ043-0183|Oswald did not lack the determination and other traits required|Oswald did not lack the determination and other traits required
-LJ043-0185|Some idea of what he thought was sufficient reason for such an act may be found in the nature of the motive that he stated for his attack on General Walker.|Some idea of what he thought was sufficient reason for such an act may be found in the nature of the motive that he stated for his attack on General Walker.
-LJ044-0057|extensive investigation was not able to connect Oswald with that address, although it did develop the fact|extensive investigation was not able to connect Oswald with that address, although it did develop the fact
-LJ044-0109|It is good to know that movements in support of fair play for Cuba has developed in New Orleans as well as in other cities.|It is good to know that movements in support of fair play for Cuba has developed in New Orleans as well as in other cities.
-LJ045-0081|Although she denied it in some of her testimony before the Commission,|Although she denied it in some of her testimony before the Commission,
-LJ045-0147|She asked Oswald, quote,|She asked Oswald, quote,
-LJ045-0204|he had never found anything to which he felt he could really belong.|he had never found anything to which he felt he could really belong.
-LJ046-0193|and 12 to 15 of these cases as highly dangerous risks.|and twelve to fifteen of these cases as highly dangerous risks.
-LJ046-0244|PRS should have investigated and been prepared to guard against it.|PRS should have investigated and been prepared to guard against it.
-LJ047-0059|However, pursuant to a regular Bureau practice of interviewing certain immigrants from Iron Curtain countries,|However, pursuant to a regular Bureau practice of interviewing certain immigrants from Iron Curtain countries,
-LJ047-0142|The Bureau had no earlier information suggesting that Oswald had left the United States.|The Bureau had no earlier information suggesting that Oswald had left the United States.
-LJ048-0035|It was against this background and consistent with the criteria followed by the FBI prior to November 22|It was against this background and consistent with the criteria followed by the FBI prior to November twenty-two
-LJ048-0063|The formal FBI instructions to its agents outlining the information to be referred to the Secret Service were too narrow at the time of the assassination.|The formal FBI instructions to its agents outlining the information to be referred to the Secret Service were too narrow at the time of the assassination.
-LJ048-0104|There were far safer routes via freeways directly to the Trade Mart,|There were far safer routes via freeways directly to the Trade Mart,
-LJ048-0187|In addition, Secret Service agents riding in the motorcade were trained to scan buildings as part of their general observation of the crowd of spectators.|In addition, Secret Service agents riding in the motorcade were trained to scan buildings as part of their general observation of the crowd of spectators.
-LJ048-0271|will be cause for removal from the Service, end quote.|will be cause for removal from the Service, end quote.
-LJ049-0031|The Presidential vehicle in use in Dallas, described in chapter 2,|The Presidential vehicle in use in Dallas, described in chapter two,
-LJ049-0059|Agents are instructed that it is not their responsibility to investigate or evaluate a present danger,|Agents are instructed that it is not their responsibility to investigate or evaluate a present danger,
-LJ049-0174|to notify the Secret Service of the substantial information about Lee Harvey Oswald which the FBI had accumulated|to notify the Secret Service of the substantial information about Lee Harvey Oswald which the FBI had accumulated
-LJ050-0049|and from a specialist in psychiatric prognostication at Walter Reed Hospital.|and from a specialist in psychiatric prognostication at Walter Reed Hospital.
-LJ050-0113|Such agreements should describe in detail the information which is sought, the manner in which it will be provided to the Secret Service,|Such agreements should describe in detail the information which is sought, the manner in which it will be provided to the Secret Service,
-LJ050-0150|Its present manual filing system is obsolete;|Its present manual filing system is obsolete;
-LJ050-0189|that written instructions might come into the hands of local newspapers, to the prejudice of the precautions described.|that written instructions might come into the hands of local newspapers, to the prejudice of the precautions described.
--- a/vocoder/fregan/README.md
+++ b/vocoder/fregan/README.md
@@ -1,25 +0,0 @@
-# Fre-GAN Vocoder
-[Fre-GAN: Adversarial Frequency-consistent Audio Synthesis](https://arxiv.org/abs/2106.02297)
-
-## Training:
-```
-python train.py --config config.json
-```
-
-## Citation:
-```
-@misc{kim2021fregan,
-      title={Fre-GAN: Adversarial Frequency-consistent Audio Synthesis}, 
-      author={Ji-Hoon Kim and Sang-Hoon Lee and Ji-Hyun Lee and Seong-Whan Lee},
-      year={2021},
-      eprint={2106.02297},
-      archivePrefix={arXiv},
-      primaryClass={eess.AS}
-}
-```
-## Note
-* For more complete and end to end Voice cloning or Text to Speech (TTS) toolbox please visit [Deepsync Technologies](https://deepsync.co/).
-
-## References:
-* [Hi-Fi-GAN repo](https://github.com/jik876/hifi-gan)
-* [WaveSNet repo](https://github.com/LiQiufu/WaveSNet)
--- a/vocoder/fregan/config.json
+++ b/vocoder/fregan/config.json
@@ -1,41 +0,0 @@
-{
-    "resblock": "1",
-    "num_gpus": 0,
-    "batch_size": 16,
-    "learning_rate": 0.0002,
-    "adam_b1": 0.8,
-    "adam_b2": 0.99,
-    "lr_decay": 0.999,
-    "seed": 1234,
-
-
-    "upsample_rates": [5,5,2,2,2],
-    "upsample_kernel_sizes": [10,10,4,4,4],
-    "upsample_initial_channel": 512,
-    "resblock_kernel_sizes": [3,7,11],
-    "resblock_dilation_sizes": [[1, 3, 5, 7], [1,3,5,7], [1,3,5,7]],
-
-    "segment_size": 6400,
-    "num_mels": 80,
-    "num_freq": 1025,
-    "n_fft": 1024,
-    "hop_size": 200,
-    "win_size": 800,
-
-    "sampling_rate": 16000,
-
-    "fmin": 0,
-    "fmax": 7600,
-    "fmax_for_loss": null,
-
-    "num_workers": 4,
-
-    "dist_config": {
-        "dist_backend": "nccl",
-        "dist_url": "tcp://localhost:54321",
-        "world_size": 1
-    }
-
-
-
-}
--- a/vocoder/fregan/discriminator.py
+++ b/vocoder/fregan/discriminator.py
@@ -1,303 +0,0 @@
-import torch
-import torch.nn.functional as F
-import torch.nn as nn
-from torch.nn import Conv1d, AvgPool1d, Conv2d
-from torch.nn.utils import weight_norm, spectral_norm
-from vocoder.fregan.utils import get_padding
-from vocoder.fregan.stft_loss import stft
-from vocoder.fregan.dwt import DWT_1D
-LRELU_SLOPE = 0.1
-
-
-
-class SpecDiscriminator(nn.Module):
-    """docstring for Discriminator."""
-
-    def __init__(self, fft_size=1024, shift_size=120, win_length=600, window="hann_window", use_spectral_norm=False):
-        super(SpecDiscriminator, self).__init__()
-        norm_f = weight_norm if use_spectral_norm == False else spectral_norm
-        self.fft_size = fft_size
-        self.shift_size = shift_size
-        self.win_length = win_length
-        self.window = getattr(torch, window)(win_length)
-        self.discriminators = nn.ModuleList([
-            norm_f(nn.Conv2d(1, 32, kernel_size=(3, 9), padding=(1, 4))),
-            norm_f(nn.Conv2d(32, 32, kernel_size=(3, 9), stride=(1,2), padding=(1, 4))),
-            norm_f(nn.Conv2d(32, 32, kernel_size=(3, 9), stride=(1,2), padding=(1, 4))),
-            norm_f(nn.Conv2d(32, 32, kernel_size=(3, 9), stride=(1,2), padding=(1, 4))),
-            norm_f(nn.Conv2d(32, 32, kernel_size=(3, 3), stride=(1,1), padding=(1, 1))),
-        ])
-
-        self.out = norm_f(nn.Conv2d(32, 1, 3, 1, 1))
-
-    def forward(self, y):
-
-        fmap = []
-        with torch.no_grad():
-            y = y.squeeze(1)
-            y = stft(y, self.fft_size, self.shift_size, self.win_length, self.window.to(y.get_device()))
-        y = y.unsqueeze(1)
-        for i, d in enumerate(self.discriminators):
-            y = d(y)
-            y = F.leaky_relu(y, LRELU_SLOPE)
-            fmap.append(y)
-
-        y = self.out(y)
-        fmap.append(y)
-
-        return torch.flatten(y, 1, -1), fmap
-
-class MultiResSpecDiscriminator(torch.nn.Module):
-
-    def __init__(self,
-                 fft_sizes=[1024, 2048, 512],
-                 hop_sizes=[120, 240, 50],
-                 win_lengths=[600, 1200, 240],
-                 window="hann_window"):
-
-        super(MultiResSpecDiscriminator, self).__init__()
-        self.discriminators = nn.ModuleList([
-            SpecDiscriminator(fft_sizes[0], hop_sizes[0], win_lengths[0], window),
-            SpecDiscriminator(fft_sizes[1], hop_sizes[1], win_lengths[1], window),
-            SpecDiscriminator(fft_sizes[2], hop_sizes[2], win_lengths[2], window)
-            ])
-
-    def forward(self, y, y_hat):
-        y_d_rs = []
-        y_d_gs = []
-        fmap_rs = []
-        fmap_gs = []
-        for i, d in enumerate(self.discriminators):
-            y_d_r, fmap_r = d(y)
-            y_d_g, fmap_g = d(y_hat)
-            y_d_rs.append(y_d_r)
-            fmap_rs.append(fmap_r)
-            y_d_gs.append(y_d_g)
-            fmap_gs.append(fmap_g)
-
-        return y_d_rs, y_d_gs, fmap_rs, fmap_gs
-
-
-class DiscriminatorP(torch.nn.Module):
-    def __init__(self, period, kernel_size=5, stride=3, use_spectral_norm=False):
-        super(DiscriminatorP, self).__init__()
-        self.period = period
-        norm_f = weight_norm if use_spectral_norm == False else spectral_norm
-        self.dwt1d = DWT_1D()
-        self.dwt_conv1 = norm_f(Conv1d(2, 1, 1))
-        self.dwt_proj1 = norm_f(Conv2d(1, 32, (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0)))
-        self.dwt_conv2 = norm_f(Conv1d(4, 1, 1))
-        self.dwt_proj2 = norm_f(Conv2d(1, 128, (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0)))
-        self.dwt_conv3 = norm_f(Conv1d(8, 1, 1))
-        self.dwt_proj3 = norm_f(Conv2d(1, 512, (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0)))
-        self.convs = nn.ModuleList([
-            norm_f(Conv2d(1, 32, (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0))),
-            norm_f(Conv2d(32, 128, (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0))),
-            norm_f(Conv2d(128, 512, (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0))),
-            norm_f(Conv2d(512, 1024, (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0))),
-            norm_f(Conv2d(1024, 1024, (kernel_size, 1), 1, padding=(2, 0))),
-        ])
-        self.conv_post = norm_f(Conv2d(1024, 1, (3, 1), 1, padding=(1, 0)))
-
-    def forward(self, x):
-        fmap = []
-
-        # DWT 1
-        x_d1_high1, x_d1_low1 = self.dwt1d(x)
-        x_d1 = self.dwt_conv1(torch.cat([x_d1_high1, x_d1_low1], dim=1))
-        # 1d to 2d
-        b, c, t = x_d1.shape
-        if t % self.period != 0:  # pad first
-            n_pad = self.period - (t % self.period)
-            x_d1 = F.pad(x_d1, (0, n_pad), "reflect")
-            t = t + n_pad
-        x_d1 = x_d1.view(b, c, t // self.period, self.period)
-
-        x_d1 = self.dwt_proj1(x_d1)
-
-        # DWT 2
-        x_d2_high1, x_d2_low1 = self.dwt1d(x_d1_high1)
-        x_d2_high2, x_d2_low2 = self.dwt1d(x_d1_low1)
-        x_d2 = self.dwt_conv2(torch.cat([x_d2_high1, x_d2_low1, x_d2_high2, x_d2_low2], dim=1))
-        # 1d to 2d
-        b, c, t = x_d2.shape
-        if t % self.period != 0:  # pad first
-            n_pad = self.period - (t % self.period)
-            x_d2 = F.pad(x_d2, (0, n_pad), "reflect")
-            t = t + n_pad
-        x_d2 = x_d2.view(b, c, t // self.period, self.period)
-
-        x_d2 = self.dwt_proj2(x_d2)
-
-        # DWT 3
-
-        x_d3_high1, x_d3_low1 = self.dwt1d(x_d2_high1)
-        x_d3_high2, x_d3_low2 = self.dwt1d(x_d2_low1)
-        x_d3_high3, x_d3_low3 = self.dwt1d(x_d2_high2)
-        x_d3_high4, x_d3_low4 = self.dwt1d(x_d2_low2)
-        x_d3 = self.dwt_conv3(
-            torch.cat([x_d3_high1, x_d3_low1, x_d3_high2, x_d3_low2, x_d3_high3, x_d3_low3, x_d3_high4, x_d3_low4],
-                      dim=1))
-        # 1d to 2d
-        b, c, t = x_d3.shape
-        if t % self.period != 0:  # pad first
-            n_pad = self.period - (t % self.period)
-            x_d3 = F.pad(x_d3, (0, n_pad), "reflect")
-            t = t + n_pad
-        x_d3 = x_d3.view(b, c, t // self.period, self.period)
-
-        x_d3 = self.dwt_proj3(x_d3)
-
-        # 1d to 2d
-        b, c, t = x.shape
-        if t % self.period != 0:  # pad first
-            n_pad = self.period - (t % self.period)
-            x = F.pad(x, (0, n_pad), "reflect")
-            t = t + n_pad
-        x = x.view(b, c, t // self.period, self.period)
-        i = 0
-        for l in self.convs:
-            x = l(x)
-            x = F.leaky_relu(x, LRELU_SLOPE)
-
-            fmap.append(x)
-            if i == 0:
-                x = torch.cat([x, x_d1], dim=2)
-            elif i == 1:
-                x = torch.cat([x, x_d2], dim=2)
-            elif i == 2:
-                x = torch.cat([x, x_d3], dim=2)
-            else:
-                x = x
-            i = i + 1
-        x = self.conv_post(x)
-        fmap.append(x)
-        x = torch.flatten(x, 1, -1)
-
-        return x, fmap
-
-
-class ResWiseMultiPeriodDiscriminator(torch.nn.Module):
-    def __init__(self):
-        super(ResWiseMultiPeriodDiscriminator, self).__init__()
-        self.discriminators = nn.ModuleList([
-            DiscriminatorP(2),
-            DiscriminatorP(3),
-            DiscriminatorP(5),
-            DiscriminatorP(7),
-            DiscriminatorP(11),
-        ])
-
-    def forward(self, y, y_hat):
-        y_d_rs = []
-        y_d_gs = []
-        fmap_rs = []
-        fmap_gs = []
-        for i, d in enumerate(self.discriminators):
-            y_d_r, fmap_r = d(y)
-            y_d_g, fmap_g = d(y_hat)
-            y_d_rs.append(y_d_r)
-            fmap_rs.append(fmap_r)
-            y_d_gs.append(y_d_g)
-            fmap_gs.append(fmap_g)
-
-        return y_d_rs, y_d_gs, fmap_rs, fmap_gs
-
-
-class DiscriminatorS(torch.nn.Module):
-    def __init__(self, use_spectral_norm=False):
-        super(DiscriminatorS, self).__init__()
-        norm_f = weight_norm if use_spectral_norm == False else spectral_norm
-        self.dwt1d = DWT_1D()
-        self.dwt_conv1 = norm_f(Conv1d(2, 128, 15, 1, padding=7))
-        self.dwt_conv2 = norm_f(Conv1d(4, 128, 41, 2, padding=20))
-        self.convs = nn.ModuleList([
-            norm_f(Conv1d(1, 128, 15, 1, padding=7)),
-            norm_f(Conv1d(128, 128, 41, 2, groups=4, padding=20)),
-            norm_f(Conv1d(128, 256, 41, 2, groups=16, padding=20)),
-            norm_f(Conv1d(256, 512, 41, 4, groups=16, padding=20)),
-            norm_f(Conv1d(512, 1024, 41, 4, groups=16, padding=20)),
-            norm_f(Conv1d(1024, 1024, 41, 1, groups=16, padding=20)),
-            norm_f(Conv1d(1024, 1024, 5, 1, padding=2)),
-        ])
-        self.conv_post = norm_f(Conv1d(1024, 1, 3, 1, padding=1))
-
-    def forward(self, x):
-        fmap = []
-
-        # DWT 1
-        x_d1_high1, x_d1_low1 = self.dwt1d(x)
-        x_d1 = self.dwt_conv1(torch.cat([x_d1_high1, x_d1_low1], dim=1))
-
-        # DWT 2
-        x_d2_high1, x_d2_low1 = self.dwt1d(x_d1_high1)
-        x_d2_high2, x_d2_low2 = self.dwt1d(x_d1_low1)
-        x_d2 = self.dwt_conv2(torch.cat([x_d2_high1, x_d2_low1, x_d2_high2, x_d2_low2], dim=1))
-
-        i = 0
-        for l in self.convs:
-            x = l(x)
-            x = F.leaky_relu(x, LRELU_SLOPE)
-            fmap.append(x)
-            if i == 0:
-                x = torch.cat([x, x_d1], dim=2)
-            if i == 1:
-                x = torch.cat([x, x_d2], dim=2)
-            i = i + 1
-        x = self.conv_post(x)
-        fmap.append(x)
-        x = torch.flatten(x, 1, -1)
-
-        return x, fmap
-
-
-class ResWiseMultiScaleDiscriminator(torch.nn.Module):
-    def __init__(self, use_spectral_norm=False):
-        super(ResWiseMultiScaleDiscriminator, self).__init__()
-        norm_f = weight_norm if use_spectral_norm == False else spectral_norm
-        self.dwt1d = DWT_1D()
-        self.dwt_conv1 = norm_f(Conv1d(2, 1, 1))
-        self.dwt_conv2 = norm_f(Conv1d(4, 1, 1))
-        self.discriminators = nn.ModuleList([
-            DiscriminatorS(use_spectral_norm=True),
-            DiscriminatorS(),
-            DiscriminatorS(),
-        ])
-
-    def forward(self, y, y_hat):
-        y_d_rs = []
-        y_d_gs = []
-        fmap_rs = []
-        fmap_gs = []
-        # DWT 1
-        y_hi, y_lo = self.dwt1d(y)
-        y_1 = self.dwt_conv1(torch.cat([y_hi, y_lo], dim=1))
-        x_d1_high1, x_d1_low1 = self.dwt1d(y_hat)
-        y_hat_1 = self.dwt_conv1(torch.cat([x_d1_high1, x_d1_low1], dim=1))
-
-        # DWT 2
-        x_d2_high1, x_d2_low1 = self.dwt1d(y_hi)
-        x_d2_high2, x_d2_low2 = self.dwt1d(y_lo)
-        y_2 = self.dwt_conv2(torch.cat([x_d2_high1, x_d2_low1, x_d2_high2, x_d2_low2], dim=1))
-
-        x_d2_high1, x_d2_low1 = self.dwt1d(x_d1_high1)
-        x_d2_high2, x_d2_low2 = self.dwt1d(x_d1_low1)
-        y_hat_2 = self.dwt_conv2(torch.cat([x_d2_high1, x_d2_low1, x_d2_high2, x_d2_low2], dim=1))
-
-        for i, d in enumerate(self.discriminators):
-
-            if i == 1:
-                y = y_1
-                y_hat = y_hat_1
-            if i == 2:
-                y = y_2
-                y_hat = y_hat_2
-
-            y_d_r, fmap_r = d(y)
-            y_d_g, fmap_g = d(y_hat)
-            y_d_rs.append(y_d_r)
-            fmap_rs.append(fmap_r)
-            y_d_gs.append(y_d_g)
-            fmap_gs.append(fmap_g)
-
-        return y_d_rs, y_d_gs, fmap_rs, fmap_gs
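Review note: `DiscriminatorP.forward` repeats the same pad-and-reshape step four times (for `x_d1`, `x_d2`, `x_d3` and `x`). The sketch below restates that step on a plain Python list so the padding arithmetic is easy to check. `fold_by_period` is a hypothetical helper name, not part of the removed module, and it assumes the reflect padding of `F.pad(..., "reflect")`, which mirrors around the last sample without repeating it.

```python
def fold_by_period(samples, period):
    """Right-pad by reflection to a multiple of `period`, then split into rows."""
    t = len(samples)
    n_pad = (period - t % period) % period  # 0 when t is already a multiple
    if n_pad >= t:
        raise ValueError("reflect padding needs n_pad < len(samples)")
    # Reflect padding mirrors around the last sample without repeating it:
    # [a, b, c, d] padded by 2 -> [a, b, c, d, c, b]
    padded = list(samples) + [samples[t - 2 - i] for i in range(n_pad)]
    # Each row holds `period` consecutive samples, like view(b, c, t // period, period).
    return [padded[i:i + period] for i in range(0, len(padded), period)]


rows = fold_by_period([0, 1, 2, 3, 4], 3)
print(rows)
```

Factoring the step out like this would also remove the three duplicated `if t % self.period != 0` blocks in the original class.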
--- a/vocoder/fregan/dwt.py
+++ b/vocoder/fregan/dwt.py
@@ -1,76 +0,0 @@
-# Copyright (c) 2019, Adobe Inc. All rights reserved.
-#
-# This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike
-# 4.0 International Public License. To view a copy of this license, visit
-# https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode.
-
-# DWT code borrowed from https://github.com/LiQiufu/WaveSNet/blob/12cb9d24208c3d26917bf953618c30f0c6b0f03d/DWT_IDWT/DWT_IDWT_layer.py
-
-
-import pywt
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-
-__all__ = ['DWT_1D']
-Pad_Mode = ['constant', 'reflect', 'replicate', 'circular']
-
-
-class DWT_1D(nn.Module):
-    def __init__(self, pad_type='reflect', wavename='haar',
-                 stride=2, in_channels=1, out_channels=None, groups=None,
-                 kernel_size=None, trainable=False):
-
-        super(DWT_1D, self).__init__()
-        self.trainable = trainable
-        self.kernel_size = kernel_size
-        if not self.trainable:
-            assert self.kernel_size is None
-        self.in_channels = in_channels
-        self.out_channels = self.in_channels if out_channels is None else out_channels
-        self.groups = self.in_channels if groups is None else groups
-        assert isinstance(self.groups, int) and self.in_channels % self.groups == 0
-        self.stride = stride
-        assert self.stride == 2
-        self.wavename = wavename
-        self.pad_type = pad_type
-        assert self.pad_type in Pad_Mode
-        self.get_filters()
-        self.initialization()
-
-    def get_filters(self):
-        wavelet = pywt.Wavelet(self.wavename)
-        band_low = torch.tensor(wavelet.rec_lo)
-        band_high = torch.tensor(wavelet.rec_hi)
-        length_band = band_low.size()[0]
-        self.kernel_size = length_band if self.kernel_size == None else self.kernel_size
-        assert self.kernel_size >= length_band
-        a = (self.kernel_size - length_band) // 2
-        b = - (self.kernel_size - length_band - a)
-        b = None if b == 0 else b
-        self.filt_low = torch.zeros(self.kernel_size)
-        self.filt_high = torch.zeros(self.kernel_size)
-        self.filt_low[a:b] = band_low
-        self.filt_high[a:b] = band_high
-
-    def initialization(self):
-        self.filter_low = self.filt_low[None, None, :].repeat((self.out_channels, self.in_channels // self.groups, 1))
-        self.filter_high = self.filt_high[None, None, :].repeat((self.out_channels, self.in_channels // self.groups, 1))
-        if torch.cuda.is_available():
-            self.filter_low = self.filter_low.cuda()
-            self.filter_high = self.filter_high.cuda()
-        if self.trainable:
-            self.filter_low = nn.Parameter(self.filter_low)
-            self.filter_high = nn.Parameter(self.filter_high)
-        if self.kernel_size % 2 == 0:
-            self.pad_sizes = [self.kernel_size // 2 - 1, self.kernel_size // 2 - 1]
-        else:
-            self.pad_sizes = [self.kernel_size // 2, self.kernel_size // 2]
-
-    def forward(self, input):
-        assert isinstance(input, torch.Tensor)
-        assert len(input.size()) == 3
-        assert input.size()[1] == self.in_channels
-        input = F.pad(input, pad=self.pad_sizes, mode=self.pad_type)
-        return F.conv1d(input, self.filter_low.to(input.device), stride=self.stride, groups=self.groups), \
-               F.conv1d(input, self.filter_high.to(input.device), stride=self.stride, groups=self.groups)
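Review note: with the default `wavename='haar'` the kernel has two taps, so `pad_sizes` is `[0, 0]` and `DWT_1D.forward` reduces to a stride-2 pairwise sum and difference. The sketch below shows that special case in plain Python. It assumes the Haar taps are (1/√2, 1/√2) for the low band and (1/√2, -1/√2) for the high band, which is what `pywt.Wavelet('haar').rec_lo` and `rec_hi` are expected to return; `haar_dwt` is a hypothetical name.

```python
import math


def haar_dwt(samples):
    """One Haar analysis level with stride 2; returns (low, high) half-length bands."""
    if len(samples) % 2:
        raise ValueError("this sketch expects an even number of samples")
    s = 1.0 / math.sqrt(2.0)
    # Low band: scaled sum of each non-overlapping pair.
    low = [s * (samples[i] + samples[i + 1]) for i in range(0, len(samples), 2)]
    # High band: scaled difference of each non-overlapping pair.
    high = [s * (samples[i] - samples[i + 1]) for i in range(0, len(samples), 2)]
    return low, high


low, high = haar_dwt([1.0, 1.0, 3.0, 5.0])
print(low, high)
```

Note that `forward` returns the low band first, while the discriminators unpack the result as `x_d1_high1, x_d1_low1`. The swapped names look harmless because both bands are concatenated right away.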
--- a/vocoder/fregan/generator.py
+++ b/vocoder/fregan/generator.py
@@ -1,210 +0,0 @@
-import torch
-import torch.nn.functional as F
-import torch.nn as nn
-from torch.nn import Conv1d, ConvTranspose1d, AvgPool1d, Conv2d
-from torch.nn.utils import weight_norm, remove_weight_norm, spectral_norm
-from vocoder.fregan.utils import init_weights, get_padding
-
-LRELU_SLOPE = 0.1
-
-
-class ResBlock1(torch.nn.Module):
-    def __init__(self, h, channels, kernel_size=3, dilation=(1, 3, 5, 7)):
-        super(ResBlock1, self).__init__()
-        self.h = h
-        self.convs1 = nn.ModuleList([
-            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[0],
-                               padding=get_padding(kernel_size, dilation[0]))),
-            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[1],
-                               padding=get_padding(kernel_size, dilation[1]))),
-            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[2],
-                               padding=get_padding(kernel_size, dilation[2]))),
-            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[3],
-                               padding=get_padding(kernel_size, dilation[3])))
-        ])
-        self.convs1.apply(init_weights)
-
-        self.convs2 = nn.ModuleList([
-            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
-                               padding=get_padding(kernel_size, 1))),
-            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
-                               padding=get_padding(kernel_size, 1))),
-            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
-                               padding=get_padding(kernel_size, 1))),
-            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
-                               padding=get_padding(kernel_size, 1)))
-        ])
-        self.convs2.apply(init_weights)
-
-    def forward(self, x):
-        for c1, c2 in zip(self.convs1, self.convs2):
-            xt = F.leaky_relu(x, LRELU_SLOPE)
-            xt = c1(xt)
-            xt = F.leaky_relu(xt, LRELU_SLOPE)
-            xt = c2(xt)
-            x = xt + x
-        return x
-
-    def remove_weight_norm(self):
-        for l in self.convs1:
-            remove_weight_norm(l)
-        for l in self.convs2:
-            remove_weight_norm(l)
-
-
-class ResBlock2(torch.nn.Module):
-    def __init__(self, h, channels, kernel_size=3, dilation=(1, 3)):
-        super(ResBlock2, self).__init__()
-        self.h = h
-        self.convs = nn.ModuleList([
-            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[0],
-                               padding=get_padding(kernel_size, dilation[0]))),
-            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[1],
-                               padding=get_padding(kernel_size, dilation[1])))
-        ])
-        self.convs.apply(init_weights)
-
-    def forward(self, x):
-        for c in self.convs:
-            xt = F.leaky_relu(x, LRELU_SLOPE)
-            xt = c(xt)
-            x = xt + x
-        return x
-
-    def remove_weight_norm(self):
-        for l in self.convs:
-            remove_weight_norm(l)
-
-
-class FreGAN(torch.nn.Module):
-    def __init__(self, h, top_k=4):
-        super(FreGAN, self).__init__()
-        self.h = h
-
-        self.num_kernels = len(h.resblock_kernel_sizes)
-        self.num_upsamples = len(h.upsample_rates)
-        self.upsample_rates = h.upsample_rates
-        self.up_kernels = h.upsample_kernel_sizes
-        self.cond_level = self.num_upsamples - top_k
-        self.conv_pre = weight_norm(Conv1d(80, h.upsample_initial_channel, 7, 1, padding=3))
-        resblock = ResBlock1 if h.resblock == '1' else ResBlock2
-
-        self.ups = nn.ModuleList()
-        self.cond_up = nn.ModuleList()
-        self.res_output = nn.ModuleList()
-        upsample_ = 1
-        kr = 80
-
-        for i, (u, k) in enumerate(zip(self.upsample_rates, self.up_kernels)):
-            # self.ups.append(weight_norm(
-            #     ConvTranspose1d(h.upsample_initial_channel // (2 ** i), h.upsample_initial_channel // (2 ** (i + 1)),
-            #                     k, u, padding=(k - u) // 2)))
-            self.ups.append(weight_norm(ConvTranspose1d(h.upsample_initial_channel//(2**i),
-                            h.upsample_initial_channel//(2**(i+1)),
-                            k, u, padding=(u//2 + u%2), output_padding=u%2)))
-
-            if i > (self.num_upsamples - top_k):
-                self.res_output.append(
-                    nn.Sequential(
-                        nn.Upsample(scale_factor=u, mode='nearest'),
-                        weight_norm(nn.Conv1d(h.upsample_initial_channel // (2 ** i),
-                                              h.upsample_initial_channel // (2 ** (i + 1)), 1))
-                    )
-                )
-            if i >= (self.num_upsamples - top_k):
-                self.cond_up.append(
-                    weight_norm(
-                        ConvTranspose1d(kr, h.upsample_initial_channel // (2 ** i),
-                                        self.up_kernels[i - 1], self.upsample_rates[i - 1],
-                                        padding=(self.upsample_rates[i-1]//2+self.upsample_rates[i-1]%2), output_padding=self.upsample_rates[i-1]%2))
-                )
-                kr = h.upsample_initial_channel // (2 ** i)
-
-            upsample_ *= u
-
-        self.resblocks = nn.ModuleList()
-        for i in range(len(self.ups)):
-            ch = h.upsample_initial_channel // (2 ** (i + 1))
-            for j, (k, d) in enumerate(zip(h.resblock_kernel_sizes, h.resblock_dilation_sizes)):
-                self.resblocks.append(resblock(h, ch, k, d))
-
-        self.conv_post = weight_norm(Conv1d(ch, 1, 7, 1, padding=3))
-        self.ups.apply(init_weights)
-        self.conv_post.apply(init_weights)
-        self.cond_up.apply(init_weights)
-        self.res_output.apply(init_weights)
-
-    def forward(self, x):
-        mel = x
-        x = self.conv_pre(x)
-        output = None
-        for i in range(self.num_upsamples):
-            if i >= self.cond_level:
-                mel = self.cond_up[i - self.cond_level](mel)
-                x += mel
-            if i > self.cond_level:
-                if output is None:
-                    output = self.res_output[i - self.cond_level - 1](x)
-                else:
-                    output = self.res_output[i - self.cond_level - 1](output)
-            x = F.leaky_relu(x, LRELU_SLOPE)
-            x = self.ups[i](x)
-            xs = None
-            for j in range(self.num_kernels):
-                if xs is None:
-                    xs = self.resblocks[i * self.num_kernels + j](x)
-                else:
-                    xs += self.resblocks[i * self.num_kernels + j](x)
-            x = xs / self.num_kernels
-            if output is not None:
-                output = output + x
-
-        x = F.leaky_relu(output)
-        x = self.conv_post(x)
-        x = torch.tanh(x)
-
-        return x
-
-    def remove_weight_norm(self):
-        print('Removing weight norm...')
-        for l in self.ups:
-            remove_weight_norm(l)
-        for l in self.resblocks:
-            l.remove_weight_norm()
-        for l in self.cond_up:
-            remove_weight_norm(l)
-        for l in self.res_output:
-            remove_weight_norm(l[1])
-        remove_weight_norm(self.conv_pre)
-        remove_weight_norm(self.conv_post)
-
-
-'''
-    to run this, fix 
-    from . import ResStack
-    into
-    from res_stack import ResStack
-'''
-if __name__ == '__main__':
-    '''
-    torch.Size([3, 80, 10])
-    torch.Size([3, 1, 2000])
-    4527362
-    '''
-    with open('config.json') as f:
-        data = f.read()
-    from utils import AttrDict
-    import json
-    json_config = json.loads(data)
-    h = AttrDict(json_config)
-    model = FreGAN(h)
-
-    c = torch.randn(3, 80, 10)  # (B, channels, T).
-    print(c.shape)
-
-    y = model(c)  # (B, 1, T * prod(upsample_rates))
-    print(y.shape)
-    assert y.shape == torch.Size([3, 1, 2560])  # For normal melgan torch.Size([3, 1, 2560])
-
-    pytorch_total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
-    print(pytorch_total_params)
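Review note: the shape assertion at the end of `generator.py` depends on each `ConvTranspose1d` in `self.ups` multiplying the length by exactly its stride. The sketch below applies PyTorch's documented output-length formula (dilation 1) with the padding rule from the removed code. The rates and kernel sizes are hypothetical values chosen for illustration, not read from the repository's `config.json`, and the sketch assumes the residual blocks and `conv_post` preserve length.

```python
def conv_transpose1d_out_len(length, kernel_size, stride):
    """Output length of ConvTranspose1d with the padding rule used for `self.ups`."""
    padding = stride // 2 + stride % 2
    output_padding = stride % 2
    # PyTorch formula with dilation = 1.
    return (length - 1) * stride - 2 * padding + (kernel_size - 1) + output_padding + 1


def generator_out_len(frames, upsample_rates, upsample_kernel_sizes):
    """Chain the per-layer lengths for a stack of upsampling layers."""
    length = frames
    for u, k in zip(upsample_rates, upsample_kernel_sizes):
        length = conv_transpose1d_out_len(length, k, u)
    return length


# Hypothetical settings, not taken from the repository's config.json.
rates = (8, 8, 2, 2)
kernels = (16, 16, 4, 4)
print(generator_out_len(10, rates, kernels))  # 2560
```

With this padding rule the length scales by exactly `u` only when the kernel size is `2 * u`. Other kernel sizes shift the output length by a constant per layer.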
--- a/vocoder/fregan/inference.py
+++ b/vocoder/fregan/inference.py
@@ -1,70 +0,0 @@
-from __future__ import absolute_import, division, print_function, unicode_literals
-
-import os
-import json
-import torch
-from scipy.io.wavfile import write
-from vocoder.hifigan.env import AttrDict
-from vocoder.hifigan.meldataset import mel_spectrogram, MAX_WAV_VALUE, load_wav
-from vocoder.fregan.generator import FreGAN
-import soundfile as sf
-
-
-generator = None       # type: FreGAN
-_device = None
-
-
-def load_checkpoint(filepath, device):
-    assert os.path.isfile(filepath)
-    print("Loading '{}'".format(filepath))
-    checkpoint_dict = torch.load(filepath, map_location=device)
-    print("Complete.")
-    return checkpoint_dict
-
-
-def load_model(weights_fpath, verbose=True):
-    global generator, _device
-
-    if verbose:
-        print("Building fregan")
-
-    with open("./vocoder/fregan/config.json") as f:
-        data = f.read()
-    json_config = json.loads(data)
-    h = AttrDict(json_config)
-    torch.manual_seed(h.seed)
-
-    if torch.cuda.is_available():
-        # _model = _model.cuda()
-        _device = torch.device('cuda')
-    else:
-        _device = torch.device('cpu')
-
-    generator = FreGAN(h).to(_device)
-    state_dict_g = load_checkpoint(
-        weights_fpath, _device
-    )
-    generator.load_state_dict(state_dict_g['generator'])
-    generator.eval()
-    generator.remove_weight_norm()
-
-
-def is_loaded():
-    return generator is not None
-
-
-def infer_waveform(mel, progress_callback=None):
-
-    if generator is None:
-        raise Exception("Please load fre-gan in memory before using it")
-
-    mel = torch.FloatTensor(mel).to(_device)
-    mel = mel.unsqueeze(0)
-
-    with torch.no_grad():
-        y_g_hat = generator(mel)
-        audio = y_g_hat.squeeze()
-    audio = audio.cpu().numpy()
-
-    return audio
-
--- a/vocoder/fregan/loss.py
+++ b/vocoder/fregan/loss.py
@@ -1,35 +0,0 @@
-import torch
-
-
-def feature_loss(fmap_r, fmap_g):
-    loss = 0
-    for dr, dg in zip(fmap_r, fmap_g):
-        for rl, gl in zip(dr, dg):
-            loss += torch.mean(torch.abs(rl - gl))
-
-    return loss*2
-
-
-def discriminator_loss(disc_real_outputs, disc_generated_outputs):
-    loss = 0
-    r_losses = []
-    g_losses = []
-    for dr, dg in zip(disc_real_outputs, disc_generated_outputs):
-        r_loss = torch.mean((1-dr)**2)
-        g_loss = torch.mean(dg**2)
-        loss += (r_loss + g_loss)
-        r_losses.append(r_loss.item())
-        g_losses.append(g_loss.item())
-
-    return loss, r_losses, g_losses
-
-
-def generator_loss(disc_outputs):
-    loss = 0
-    gen_losses = []
-    for dg in disc_outputs:
-        l = torch.mean((1-dg)**2)
-        gen_losses.append(l)
-        loss += l
-
-    return loss, gen_losses
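Review note: `discriminator_loss` and `generator_loss` implement a least-squares GAN objective, summed over the sub-discriminators. The sketch below shows the per-discriminator terms on plain lists of scores instead of tensors. The function names mirror the removed ones, but this is a simplified illustration, not the original code.

```python
def mean(values):
    return sum(values) / len(values)


def discriminator_loss(real_scores, fake_scores):
    """Least-squares loss for one discriminator: real -> 1, generated -> 0."""
    r_loss = mean([(1 - r) ** 2 for r in real_scores])
    g_loss = mean([f ** 2 for f in fake_scores])
    return r_loss + g_loss


def generator_loss(fake_scores):
    """The generator is rewarded when generated samples score close to 1."""
    return mean([(1 - f) ** 2 for f in fake_scores])


# A perfect discriminator has zero loss; the generator then pays the full penalty.
print(discriminator_loss([1.0, 1.0], [0.0, 0.0]))  # 0.0
print(generator_loss([0.0, 0.0]))  # 1.0
```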
--- a/vocoder/fregan/meldataset.py
+++ b/vocoder/fregan/meldataset.py
@@ -1,176 +0,0 @@
-import math
-import os
-import random
-import torch
-import torch.utils.data
-import numpy as np
-from librosa.util import normalize
-from scipy.io.wavfile import read
-from librosa.filters import mel as librosa_mel_fn
-
-MAX_WAV_VALUE = 32768.0
-
-
-def load_wav(full_path):
-    sampling_rate, data = read(full_path)
-    return data, sampling_rate
-
-
-def dynamic_range_compression(x, C=1, clip_val=1e-5):
-    return np.log(np.clip(x, a_min=clip_val, a_max=None) * C)
-
-
-def dynamic_range_decompression(x, C=1):
-    return np.exp(x) / C
-
-
-def dynamic_range_compression_torch(x, C=1, clip_val=1e-5):
-    return torch.log(torch.clamp(x, min=clip_val) * C)
-
-
-def dynamic_range_decompression_torch(x, C=1):
-    return torch.exp(x) / C
-
-
-def spectral_normalize_torch(magnitudes):
-    output = dynamic_range_compression_torch(magnitudes)
-    return output
-
-
-def spectral_de_normalize_torch(magnitudes):
-    output = dynamic_range_decompression_torch(magnitudes)
-    return output
-
-
-mel_basis = {}
-hann_window = {}
-
-
-def mel_spectrogram(y, n_fft, num_mels, sampling_rate, hop_size, win_size, fmin, fmax, center=False):
-    if torch.min(y) < -1.:
-        print('min value is ', torch.min(y))
-    if torch.max(y) > 1.:
-        print('max value is ', torch.max(y))
-
-    global mel_basis, hann_window
-    if str(fmax)+'_'+str(y.device) not in mel_basis:
-        mel = librosa_mel_fn(sr=sampling_rate, n_fft=n_fft, n_mels=num_mels, fmin=fmin, fmax=fmax)
-        mel_basis[str(fmax)+'_'+str(y.device)] = torch.from_numpy(mel).float().to(y.device)
-        hann_window[str(y.device)] = torch.hann_window(win_size).to(y.device)
-
-    y = torch.nn.functional.pad(y.unsqueeze(1), (int((n_fft-hop_size)/2), int((n_fft-hop_size)/2)), mode='reflect')
-    y = y.squeeze(1)
-
-    spec = torch.stft(y, n_fft, hop_length=hop_size, win_length=win_size, window=hann_window[str(y.device)],
-                      center=center, pad_mode='reflect', normalized=False, onesided=True)
-
-    spec = torch.sqrt(spec.pow(2).sum(-1)+(1e-9))
-
-    spec = torch.matmul(mel_basis[str(fmax)+'_'+str(y.device)], spec)
-    spec = spectral_normalize_torch(spec)
-
-    return spec
-
-
-def get_dataset_filelist(a):
-    #with open(a.input_training_file, 'r', encoding='utf-8') as fi:
-    #    training_files = [os.path.join(a.input_wavs_dir, x.split('|')[0] + '.wav')
-    #                      for x in fi.read().split('\n') if len(x) > 0]
-
-    #with open(a.input_validation_file, 'r', encoding='utf-8') as fi:
-    #   validation_files = [os.path.join(a.input_wavs_dir, x.split('|')[0] + '.wav')
-    #                        for x in fi.read().split('\n') if len(x) > 0]
-    files = os.listdir(a.input_wavs_dir)
-    random.shuffle(files)
-    files = [os.path.join(a.input_wavs_dir, f) for f in files]
-    n_val = max(1, int(len(files) * 0.05))  # files[:-0] would leave the training set empty
-    training_files, validation_files = files[:-n_val], files[-n_val:]
-    return training_files, validation_files
-
-
-class MelDataset(torch.utils.data.Dataset):
-    def __init__(self, training_files, segment_size, n_fft, num_mels,
-                 hop_size, win_size, sampling_rate,  fmin, fmax, split=True, shuffle=True, n_cache_reuse=1,
-                 device=None, fmax_loss=None, fine_tuning=False, base_mels_path=None):
-        self.audio_files = training_files
-        random.seed(1234)
-        if shuffle:
-            random.shuffle(self.audio_files)
-        self.segment_size = segment_size
-        self.sampling_rate = sampling_rate
-        self.split = split
-        self.n_fft = n_fft
-        self.num_mels = num_mels
-        self.hop_size = hop_size
-        self.win_size = win_size
-        self.fmin = fmin
-        self.fmax = fmax
-        self.fmax_loss = fmax_loss
-        self.cached_wav = None
-        self.n_cache_reuse = n_cache_reuse
-        self._cache_ref_count = 0
-        self.device = device
-        self.fine_tuning = fine_tuning
-        self.base_mels_path = base_mels_path
-
-    def __getitem__(self, index):
-        filename = self.audio_files[index]
-        if self._cache_ref_count == 0:
-            #audio, sampling_rate = load_wav(filename)
-            #audio = audio / MAX_WAV_VALUE
-            audio = np.load(filename)
-            if not self.fine_tuning:
-                audio = normalize(audio) * 0.95
-            self.cached_wav = audio
-            #if sampling_rate != self.sampling_rate:
-            #    raise ValueError("{} SR doesn't match target {} SR".format(
-            #        sampling_rate, self.sampling_rate))
-            self._cache_ref_count = self.n_cache_reuse
-        else:
-            audio = self.cached_wav
-            self._cache_ref_count -= 1
-
-        audio = torch.FloatTensor(audio)
-        audio = audio.unsqueeze(0)
-
-        if not self.fine_tuning:
-            if self.split:
-                if audio.size(1) >= self.segment_size:
-                    max_audio_start = audio.size(1) - self.segment_size
-                    audio_start = random.randint(0, max_audio_start)
-                    audio = audio[:, audio_start:audio_start+self.segment_size]
-                else:
-                    audio = torch.nn.functional.pad(audio, (0, self.segment_size - audio.size(1)), 'constant')
-
-            mel = mel_spectrogram(audio, self.n_fft, self.num_mels,
-                                  self.sampling_rate, self.hop_size, self.win_size, self.fmin, self.fmax,
-                                  center=False)
-        else:
-            mel_path = os.path.join(self.base_mels_path, "mel" + "-" + os.path.basename(filename).split("-")[-1])
-            mel = np.load(mel_path).T
-            #mel = np.load(
-            #    os.path.join(self.base_mels_path, os.path.splitext(os.path.split(filename)[-1])[0] + '.npy'))
-            mel = torch.from_numpy(mel)
-
-            if len(mel.shape) < 3:
-                mel = mel.unsqueeze(0)
-
-            if self.split:
-                frames_per_seg = math.ceil(self.segment_size / self.hop_size)
-
-                if audio.size(1) >= self.segment_size:
-                    mel_start = random.randint(0, mel.size(2) - frames_per_seg - 1)
-                    mel = mel[:, :, mel_start:mel_start + frames_per_seg]
-                    audio = audio[:, mel_start * self.hop_size:(mel_start + frames_per_seg) * self.hop_size]
-                else:
-                    mel = torch.nn.functional.pad(mel, (0, frames_per_seg - mel.size(2)), 'constant')
-                    audio = torch.nn.functional.pad(audio, (0, self.segment_size - audio.size(1)), 'constant')
-
-        mel_loss = mel_spectrogram(audio, self.n_fft, self.num_mels,
-                                   self.sampling_rate, self.hop_size, self.win_size, self.fmin, self.fmax_loss,
-                                   center=False)
-
-        return (mel.squeeze(), audio.squeeze(0), filename, mel_loss.squeeze())
-
-    def __len__(self):
-        return len(self.audio_files)
--- a/vocoder/fregan/modules.py
+++ b/vocoder/fregan/modules.py
@@ -1,201 +0,0 @@
-import torch
-import torch.nn.functional as F
-
-class KernelPredictor(torch.nn.Module):
-    ''' Kernel predictor for the location-variable convolutions
-    '''
-
-    def __init__(self,
-                 cond_channels,
-                 conv_in_channels,
-                 conv_out_channels,
-                 conv_layers,
-                 conv_kernel_size=3,
-                 kpnet_hidden_channels=64,
-                 kpnet_conv_size=3,
-                 kpnet_dropout=0.0,
-                 kpnet_nonlinear_activation="LeakyReLU",
-                 kpnet_nonlinear_activation_params={"negative_slope": 0.1}
-                 ):
-        '''
-        Args:
-            cond_channels (int): number of channel for the conditioning sequence,
-            conv_in_channels (int): number of channel for the input sequence,
-            conv_out_channels (int): number of channel for the output sequence,
-            conv_layers (int):
-            kpnet_
-        '''
-        super().__init__()
-
-        self.conv_in_channels = conv_in_channels
-        self.conv_out_channels = conv_out_channels
-        self.conv_kernel_size = conv_kernel_size
-        self.conv_layers = conv_layers
-
-        l_w = conv_in_channels * conv_out_channels * conv_kernel_size * conv_layers
-        l_b = conv_out_channels * conv_layers
-
-        padding = (kpnet_conv_size - 1) // 2
-        self.input_conv = torch.nn.Sequential(
-            torch.nn.Conv1d(cond_channels, kpnet_hidden_channels, 5, padding=(5 - 1) // 2, bias=True),
-            getattr(torch.nn, kpnet_nonlinear_activation)(**kpnet_nonlinear_activation_params),
-        )
-
-        self.residual_conv = torch.nn.Sequential(
-            torch.nn.Dropout(kpnet_dropout),
-            torch.nn.Conv1d(kpnet_hidden_channels, kpnet_hidden_channels, kpnet_conv_size, padding=padding, bias=True),
-            getattr(torch.nn, kpnet_nonlinear_activation)(**kpnet_nonlinear_activation_params),
-            torch.nn.Conv1d(kpnet_hidden_channels, kpnet_hidden_channels, kpnet_conv_size, padding=padding, bias=True),
-            getattr(torch.nn, kpnet_nonlinear_activation)(**kpnet_nonlinear_activation_params),
-            torch.nn.Dropout(kpnet_dropout),
-            torch.nn.Conv1d(kpnet_hidden_channels, kpnet_hidden_channels, kpnet_conv_size, padding=padding, bias=True),
-            getattr(torch.nn, kpnet_nonlinear_activation)(**kpnet_nonlinear_activation_params),
-            torch.nn.Conv1d(kpnet_hidden_channels, kpnet_hidden_channels, kpnet_conv_size, padding=padding, bias=True),
-            getattr(torch.nn, kpnet_nonlinear_activation)(**kpnet_nonlinear_activation_params),
-            torch.nn.Dropout(kpnet_dropout),
-            torch.nn.Conv1d(kpnet_hidden_channels, kpnet_hidden_channels, kpnet_conv_size, padding=padding, bias=True),
-            getattr(torch.nn, kpnet_nonlinear_activation)(**kpnet_nonlinear_activation_params),
-            torch.nn.Conv1d(kpnet_hidden_channels, kpnet_hidden_channels, kpnet_conv_size, padding=padding, bias=True),
-            getattr(torch.nn, kpnet_nonlinear_activation)(**kpnet_nonlinear_activation_params),
-        )
-
-        self.kernel_conv = torch.nn.Conv1d(kpnet_hidden_channels, l_w, kpnet_conv_size,
-                                           padding=padding, bias=True)
-        self.bias_conv = torch.nn.Conv1d(kpnet_hidden_channels, l_b, kpnet_conv_size, padding=padding,
-                                         bias=True)
-
-    def forward(self, c):
-        '''
-        Args:
-            c (Tensor): the conditioning sequence (batch, cond_channels, cond_length)
-        Returns:
-        '''
-        batch, cond_channels, cond_length = c.shape
-
-        c = self.input_conv(c)
-        c = c + self.residual_conv(c)
-        k = self.kernel_conv(c)
-        b = self.bias_conv(c)
-
-        kernels = k.contiguous().view(batch,
-                                      self.conv_layers,
-                                      self.conv_in_channels,
-                                      self.conv_out_channels,
-                                      self.conv_kernel_size,
-                                      cond_length)
-        bias = b.contiguous().view(batch,
-                                   self.conv_layers,
-                                   self.conv_out_channels,
-                                   cond_length)
-        return kernels, bias
-
-
-class LVCBlock(torch.nn.Module):
-    ''' the location-variable convolutions
-    '''
-
-    def __init__(self,
-                 in_channels,
-                 cond_channels,
-                 upsample_ratio,
-                 conv_layers=4,
-                 conv_kernel_size=3,
-                 cond_hop_length=256,
-                 kpnet_hidden_channels=64,
-                 kpnet_conv_size=3,
-                 kpnet_dropout=0.0
-                 ):
-        super().__init__()
-
-        self.cond_hop_length = cond_hop_length
-        self.conv_layers = conv_layers
-        self.conv_kernel_size = conv_kernel_size
-        self.convs = torch.nn.ModuleList()
-
-        self.upsample = torch.nn.ConvTranspose1d(in_channels, in_channels,
-                                    kernel_size=upsample_ratio*2, stride=upsample_ratio,
-                                    padding=upsample_ratio // 2 + upsample_ratio % 2,
-                                    output_padding=upsample_ratio % 2)
-
-
-        self.kernel_predictor = KernelPredictor(
-            cond_channels=cond_channels,
-            conv_in_channels=in_channels,
-            conv_out_channels=2 * in_channels,
-            conv_layers=conv_layers,
-            conv_kernel_size=conv_kernel_size,
-            kpnet_hidden_channels=kpnet_hidden_channels,
-            kpnet_conv_size=kpnet_conv_size,
-            kpnet_dropout=kpnet_dropout
-        )
-
-
-        for i in range(conv_layers):
-            padding = (3 ** i) * int((conv_kernel_size - 1) / 2)
-            conv = torch.nn.Conv1d(in_channels, in_channels, kernel_size=conv_kernel_size, padding=padding, dilation=3 ** i)
-
-            self.convs.append(conv)
-
-
-    def forward(self, x, c):
-        ''' forward propagation of the location-variable convolutions.
-        Args:
-            x (Tensor): the input sequence (batch, in_channels, in_length)
-            c (Tensor): the conditioning sequence (batch, cond_channels, cond_length)
-
-        Returns:
-            Tensor: the output sequence (batch, in_channels, in_length)
-        '''
-        batch, in_channels, in_length = x.shape
-
-
-        kernels, bias = self.kernel_predictor(c)
-
-        x = F.leaky_relu(x, 0.2)
-        x = self.upsample(x)
-
-        for i in range(self.conv_layers):
-            y = F.leaky_relu(x, 0.2)
-            y = self.convs[i](y)
-            y = F.leaky_relu(y, 0.2)
-
-            k = kernels[:, i, :, :, :, :]
-            b = bias[:, i, :, :]
-            y = self.location_variable_convolution(y, k, b, 1, self.cond_hop_length)
-            x = x + torch.sigmoid(y[:, :in_channels, :]) * torch.tanh(y[:, in_channels:, :])
-        return x
-
-    def location_variable_convolution(self, x, kernel, bias, dilation, hop_size):
-        ''' perform location-variable convolution operation on the input sequence (x) using the local convolution kernl.
-        Time: 414 μs ± 309 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each), test on NVIDIA V100.
-        Args:
-            x (Tensor): the input sequence (batch, in_channels, in_length).
-            kernel (Tensor): the local convolution kernel (batch, in_channel, out_channels, kernel_size, kernel_length)
-            bias (Tensor): the bias for the local convolution (batch, out_channels, kernel_length)
-            dilation (int): the dilation of convolution.
-            hop_size (int): the hop_size of the conditioning sequence.
-        Returns:
-            (Tensor): the output sequence after performing local convolution. (batch, out_channels, in_length).
-        '''
-        batch, in_channels, in_length = x.shape
-        batch, in_channels, out_channels, kernel_size, kernel_length = kernel.shape
-
-
-        assert in_length == (kernel_length * hop_size), "length of (x, kernel) is not matched"
-
-        padding = dilation * int((kernel_size - 1) / 2)
-        x = F.pad(x, (padding, padding), 'constant', 0)  # (batch, in_channels, in_length + 2*padding)
-        x = x.unfold(2, hop_size + 2 * padding, hop_size)  # (batch, in_channels, kernel_length, hop_size + 2*padding)
-
-        if hop_size < dilation:
-            x = F.pad(x, (0, dilation), 'constant', 0)
-        x = x.unfold(3, dilation,
-                     dilation)  # (batch, in_channels, kernel_length, (hop_size + 2*padding)/dilation, dilation)
-        x = x[:, :, :, :, :hop_size]
-        x = x.transpose(3, 4)  # (batch, in_channels, kernel_length, dilation, (hop_size + 2*padding)/dilation)
-        x = x.unfold(4, kernel_size, 1)  # (batch, in_channels, kernel_length, dilation, _, kernel_size)
-
-        o = torch.einsum('bildsk,biokl->bolsd', x, kernel)
-        o = o + bias.unsqueeze(-1).unsqueeze(-1)
-        o = o.contiguous().view(batch, out_channels, -1)
-        return o
--- a/vocoder/fregan/requirements.txt
+++ b/vocoder/fregan/requirements.txt
@@ -1 +0,0 @@
-PyWavelets
--- a/vocoder/fregan/stft_loss.py
+++ b/vocoder/fregan/stft_loss.py
@@ -1,136 +0,0 @@
-# -*- coding: utf-8 -*-
-
-# Copyright 2019 Tomoki Hayashi
-#  MIT License (https://opensource.org/licenses/MIT)
-
-"""STFT-based Loss modules."""
-
-import torch
-import torch.nn.functional as F
-
-
-def stft(x, fft_size, hop_size, win_length, window):
-    """Perform STFT and convert to magnitude spectrogram.
-    Args:
-        x (Tensor): Input signal tensor (B, T).
-        fft_size (int): FFT size.
-        hop_size (int): Hop size.
-        win_length (int): Window length.
-        window (str): Window function type.
-    Returns:
-        Tensor: Magnitude spectrogram (B, #frames, fft_size // 2 + 1).
-    """
-    x_stft = torch.stft(x, fft_size, hop_size, win_length, window)
-    real = x_stft[..., 0]
-    imag = x_stft[..., 1]
-
-    # NOTE(kan-bayashi): clamp is needed to avoid nan or inf
-    return torch.sqrt(torch.clamp(real ** 2 + imag ** 2, min=1e-7)).transpose(2, 1)
-
-
-class SpectralConvergengeLoss(torch.nn.Module):
-    """Spectral convergence loss module."""
-
-    def __init__(self):
-        """Initilize spectral convergence loss module."""
-        super(SpectralConvergengeLoss, self).__init__()
-
-    def forward(self, x_mag, y_mag):
-        """Calculate forward propagation.
-        Args:
-            x_mag (Tensor): Magnitude spectrogram of predicted signal (B, #frames, #freq_bins).
-            y_mag (Tensor): Magnitude spectrogram of groundtruth signal (B, #frames, #freq_bins).
-        Returns:
-            Tensor: Spectral convergence loss value.
-        """
-        return torch.norm(y_mag - x_mag, p="fro") / torch.norm(y_mag, p="fro")
-
-
-class LogSTFTMagnitudeLoss(torch.nn.Module):
-    """Log STFT magnitude loss module."""
-
-    def __init__(self):
-        """Initilize los STFT magnitude loss module."""
-        super(LogSTFTMagnitudeLoss, self).__init__()
-
-    def forward(self, x_mag, y_mag):
-        """Calculate forward propagation.
-        Args:
-            x_mag (Tensor): Magnitude spectrogram of predicted signal (B, #frames, #freq_bins).
-            y_mag (Tensor): Magnitude spectrogram of groundtruth signal (B, #frames, #freq_bins).
-        Returns:
-            Tensor: Log STFT magnitude loss value.
-        """
-        return F.l1_loss(torch.log(y_mag), torch.log(x_mag))
-
-
-class STFTLoss(torch.nn.Module):
-    """STFT loss module."""
-
-    def __init__(self, fft_size=1024, shift_size=120, win_length=600, window="hann_window"):
-        """Initialize STFT loss module."""
-        super(STFTLoss, self).__init__()
-        self.fft_size = fft_size
-        self.shift_size = shift_size
-        self.win_length = win_length
-        self.window = getattr(torch, window)(win_length)
-        self.spectral_convergenge_loss = SpectralConvergengeLoss()
-        self.log_stft_magnitude_loss = LogSTFTMagnitudeLoss()
-
-    def forward(self, x, y):
-        """Calculate forward propagation.
-        Args:
-            x (Tensor): Predicted signal (B, T).
-            y (Tensor): Groundtruth signal (B, T).
-        Returns:
-            Tensor: Spectral convergence loss value.
-            Tensor: Log STFT magnitude loss value.
-        """
-        x_mag = stft(x, self.fft_size, self.shift_size, self.win_length, self.window.to(x.get_device()))
-        y_mag = stft(y, self.fft_size, self.shift_size, self.win_length, self.window.to(x.get_device()))
-        sc_loss = self.spectral_convergenge_loss(x_mag, y_mag)
-        mag_loss = self.log_stft_magnitude_loss(x_mag, y_mag)
-
-        return sc_loss, mag_loss
-
-
-class MultiResolutionSTFTLoss(torch.nn.Module):
-    """Multi resolution STFT loss module."""
-
-    def __init__(self,
-                 fft_sizes=[1024, 2048, 512],
-                 hop_sizes=[120, 240, 50],
-                 win_lengths=[600, 1200, 240],
-                 window="hann_window"):
-        """Initialize Multi resolution STFT loss module.
-        Args:
-            fft_sizes (list): List of FFT sizes.
-            hop_sizes (list): List of hop sizes.
-            win_lengths (list): List of window lengths.
-            window (str): Window function type.
-        """
-        super(MultiResolutionSTFTLoss, self).__init__()
-        assert len(fft_sizes) == len(hop_sizes) == len(win_lengths)
-        self.stft_losses = torch.nn.ModuleList()
-        for fs, ss, wl in zip(fft_sizes, hop_sizes, win_lengths):
-            self.stft_losses += [STFTLoss(fs, ss, wl, window)]
-
-    def forward(self, x, y):
-        """Calculate forward propagation.
-        Args:
-            x (Tensor): Predicted signal (B, T).
-            y (Tensor): Groundtruth signal (B, T).
-        Returns:
-            Tensor: Multi resolution spectral convergence loss value.
-            Tensor: Multi resolution log STFT magnitude loss value.
-        """
-        sc_loss = 0.0
-        mag_loss = 0.0
-        for f in self.stft_losses:
-            sc_l, mag_l = f(x, y)
-            sc_loss += sc_l
-            mag_loss += mag_l
-        sc_loss /= len(self.stft_losses)
-        mag_loss /= len(self.stft_losses)
-
-        return sc_loss, mag_loss
--- a/vocoder/fregan/train.py
+++ b/vocoder/fregan/train.py
@@ -1,253 +0,0 @@
-import warnings
-
-warnings.simplefilter(action='ignore', category=FutureWarning)
-import itertools
-import os
-import time
-import argparse
-import json
-import torch
-import torch.nn.functional as F
-from torch.utils.tensorboard import SummaryWriter
-from torch.utils.data import DistributedSampler, DataLoader
-import torch.multiprocessing as mp
-from torch.distributed import init_process_group
-from torch.nn.parallel import DistributedDataParallel
-from vocoder.fregan.utils import AttrDict, build_env
-from vocoder.fregan.meldataset import MelDataset, mel_spectrogram, get_dataset_filelist
-from vocoder.fregan.generator import FreGAN
-from vocoder.fregan.discriminator import ResWiseMultiPeriodDiscriminator, ResWiseMultiScaleDiscriminator
-from vocoder.fregan.loss import feature_loss, generator_loss, discriminator_loss
-from vocoder.fregan.utils import plot_spectrogram, scan_checkpoint, load_checkpoint, save_checkpoint
-from vocoder.fregan.stft_loss import MultiResolutionSTFTLoss
-
-torch.backends.cudnn.benchmark = True
-
-
-def train(rank, a, h):
-
-    a.checkpoint_path = a.models_dir.joinpath(a.run_id+'_fregan')
-    a.checkpoint_path.mkdir(exist_ok=True)
-    a.training_epochs = 3100
-    a.stdout_interval = 5
-    a.checkpoint_interval = a.backup_every
-    a.summary_interval = 5000
-    a.validation_interval = 1000
-    a.fine_tuning = True
-
-    a.input_wavs_dir = a.syn_dir.joinpath("audio")
-    a.input_mels_dir = a.syn_dir.joinpath("mels")
-
-    if h.num_gpus > 1:
-        init_process_group(backend=h.dist_config['dist_backend'], init_method=h.dist_config['dist_url'],
-                           world_size=h.dist_config['world_size'] * h.num_gpus, rank=rank)
-
-    torch.cuda.manual_seed(h.seed)
-    device = torch.device('cuda:{:d}'.format(rank))
-
-    generator = FreGAN(h).to(device)
-    mpd = ResWiseMultiPeriodDiscriminator().to(device)
-    msd = ResWiseMultiScaleDiscriminator().to(device)
-
-    if rank == 0:
-        print(generator)
-        os.makedirs(a.checkpoint_path, exist_ok=True)
-        print("checkpoints directory : ", a.checkpoint_path)
-
-    if os.path.isdir(a.checkpoint_path):
-        cp_g = scan_checkpoint(a.checkpoint_path, 'g_')
-        cp_do = scan_checkpoint(a.checkpoint_path, 'do_')
-
-    steps = 0
-    if cp_g is None or cp_do is None:
-        state_dict_do = None
-        last_epoch = -1
-    else:
-        state_dict_g = load_checkpoint(cp_g, device)
-        state_dict_do = load_checkpoint(cp_do, device)
-        generator.load_state_dict(state_dict_g['generator'])
-        mpd.load_state_dict(state_dict_do['mpd'])
-        msd.load_state_dict(state_dict_do['msd'])
-        steps = state_dict_do['steps'] + 1
-        last_epoch = state_dict_do['epoch']
-
-    if h.num_gpus > 1:
-        generator = DistributedDataParallel(generator, device_ids=[rank]).to(device)
-        mpd = DistributedDataParallel(mpd, device_ids=[rank]).to(device)
-        msd = DistributedDataParallel(msd, device_ids=[rank]).to(device)
-
-    optim_g = torch.optim.AdamW(generator.parameters(), h.learning_rate, betas=[h.adam_b1, h.adam_b2])
-    optim_d = torch.optim.AdamW(itertools.chain(msd.parameters(), mpd.parameters()),
-                                h.learning_rate, betas=[h.adam_b1, h.adam_b2])
-
-    if state_dict_do is not None:
-        optim_g.load_state_dict(state_dict_do['optim_g'])
-        optim_d.load_state_dict(state_dict_do['optim_d'])
-
-    scheduler_g = torch.optim.lr_scheduler.ExponentialLR(optim_g, gamma=h.lr_decay, last_epoch=last_epoch)
-    scheduler_d = torch.optim.lr_scheduler.ExponentialLR(optim_d, gamma=h.lr_decay, last_epoch=last_epoch)
-
-    training_filelist, validation_filelist = get_dataset_filelist(a)
-
-    trainset = MelDataset(training_filelist, h.segment_size, h.n_fft, h.num_mels,
-                          h.hop_size, h.win_size, h.sampling_rate, h.fmin, h.fmax, n_cache_reuse=0,
-                          shuffle=False if h.num_gpus > 1 else True, fmax_loss=h.fmax_for_loss, device=device,
-                          fine_tuning=a.fine_tuning, base_mels_path=a.input_mels_dir)
-
-    train_sampler = DistributedSampler(trainset) if h.num_gpus > 1 else None
-
-    train_loader = DataLoader(trainset, num_workers=h.num_workers, shuffle=False,
-                              sampler=train_sampler,
-                              batch_size=h.batch_size,
-                              pin_memory=True,
-                              drop_last=True)
-
-    if rank == 0:
-        validset = MelDataset(validation_filelist, h.segment_size, h.n_fft, h.num_mels,
-                              h.hop_size, h.win_size, h.sampling_rate, h.fmin, h.fmax, False, False, n_cache_reuse=0,
-                              fmax_loss=h.fmax_for_loss, device=device, fine_tuning=a.fine_tuning,
-                              base_mels_path=a.input_mels_dir)
-        validation_loader = DataLoader(validset, num_workers=1, shuffle=False,
-                                       sampler=None,
-                                       batch_size=1,
-                                       pin_memory=True,
-                                       drop_last=True)
-
-        sw = SummaryWriter(os.path.join(a.checkpoint_path, 'logs'))
-
-    generator.train()
-    mpd.train()
-    msd.train()
-    for epoch in range(max(0, last_epoch), a.training_epochs):
-        if rank == 0:
-            start = time.time()
-            print("Epoch: {}".format(epoch + 1))
-
-        if h.num_gpus > 1:
-            train_sampler.set_epoch(epoch)
-
-        for i, batch in enumerate(train_loader):
-            if rank == 0:
-                start_b = time.time()
-            x, y, _, y_mel = batch
-            x = torch.autograd.Variable(x.to(device, non_blocking=True))
-            y = torch.autograd.Variable(y.to(device, non_blocking=True))
-            y_mel = torch.autograd.Variable(y_mel.to(device, non_blocking=True))
-
-            y = y.unsqueeze(1)
-
-            y_g_hat = generator(x)
-
-            y_g_hat_mel = mel_spectrogram(y_g_hat.squeeze(1), h.n_fft, h.num_mels, h.sampling_rate, h.hop_size,
-                                          h.win_size,
-                                          h.fmin, h.fmax_for_loss)
-
-
-
-            optim_d.zero_grad()
-
-            # MPD
-            y_df_hat_r, y_df_hat_g, _, _ = mpd(y, y_g_hat.detach())
-            loss_disc_f, losses_disc_f_r, losses_disc_f_g = discriminator_loss(y_df_hat_r, y_df_hat_g)
-
-            # MSD
-            y_ds_hat_r, y_ds_hat_g, _, _ = msd(y, y_g_hat.detach())
-            loss_disc_s, losses_disc_s_r, losses_disc_s_g = discriminator_loss(y_ds_hat_r, y_ds_hat_g)
-
-            loss_disc_all = loss_disc_s + loss_disc_f
-
-            loss_disc_all.backward()
-            optim_d.step()
-
-            # Generator
-            optim_g.zero_grad()
-
-
-            # L1 Mel-Spectrogram Loss
-            loss_mel = F.l1_loss(y_mel, y_g_hat_mel) * 45
-
-            # sc_loss, mag_loss = stft_loss(y_g_hat[:, :, :y.size(2)].squeeze(1), y.squeeze(1))
-            # loss_mel = h.lambda_aux * (sc_loss + mag_loss)  # STFT Loss
-
-
-            y_df_hat_r, y_df_hat_g, fmap_f_r, fmap_f_g = mpd(y, y_g_hat)
-            y_ds_hat_r, y_ds_hat_g, fmap_s_r, fmap_s_g = msd(y, y_g_hat)
-            loss_fm_f = feature_loss(fmap_f_r, fmap_f_g)
-            loss_fm_s = feature_loss(fmap_s_r, fmap_s_g)
-            loss_gen_f, losses_gen_f = generator_loss(y_df_hat_g)
-            loss_gen_s, losses_gen_s = generator_loss(y_ds_hat_g)
-            loss_gen_all = loss_gen_s + loss_gen_f + (2 * (loss_fm_s + loss_fm_f)) + loss_mel
-
-
-            loss_gen_all.backward()
-            optim_g.step()
-
-            if rank == 0:
-                # STDOUT logging
-                if steps % a.stdout_interval == 0:
-                    with torch.no_grad():
-                        mel_error = F.l1_loss(y_mel, y_g_hat_mel).item()
-
-                    print('Steps : {:d}, Gen Loss Total : {:4.3f}, Mel-Spec. Error : {:4.3f}, s/b : {:4.3f}'.
-                          format(steps, loss_gen_all, mel_error, time.time() - start_b))
-
-                # checkpointing
-                if steps % a.checkpoint_interval == 0 and steps != 0:
-                    checkpoint_path = "{}/m_fregan_{:08d}".format(a.checkpoint_path, steps)
-                    save_checkpoint(checkpoint_path,
-                                    {'generator': (generator.module if h.num_gpus > 1 else generator).state_dict()})
-                    checkpoint_path = "{}/do_fregan_{:08d}".format(a.checkpoint_path, steps)
-                    save_checkpoint(checkpoint_path,
-                                    {'mpd': (mpd.module if h.num_gpus > 1
-                                             else mpd).state_dict(),
-                                     'msd': (msd.module if h.num_gpus > 1
-                                             else msd).state_dict(),
-                                     'optim_g': optim_g.state_dict(), 'optim_d': optim_d.state_dict(), 'steps': steps,
-                                     'epoch': epoch})
-
-                # Tensorboard summary logging
-                if steps % a.summary_interval == 0:
-                    sw.add_scalar("training/gen_loss_total", loss_gen_all, steps)
-                    sw.add_scalar("training/mel_spec_error", mel_error, steps)
-
-                # Validation
-                if steps % a.validation_interval == 0:  # and steps != 0:
-                    generator.eval()
-                    torch.cuda.empty_cache()
-                    val_err_tot = 0
-                    with torch.no_grad():
-                        for j, batch in enumerate(validation_loader):
-                            x, y, _, y_mel = batch
-                            y_g_hat = generator(x.to(device))
-                            y_mel = torch.autograd.Variable(y_mel.to(device, non_blocking=True))
-                            y_g_hat_mel = mel_spectrogram(y_g_hat.squeeze(1), h.n_fft, h.num_mels, h.sampling_rate,
-                                                          h.hop_size, h.win_size,
-                                                          h.fmin, h.fmax_for_loss)
-                            #val_err_tot += F.l1_loss(y_mel, y_g_hat_mel).item()
-
-                            if j <= 4:
-                                if steps == 0:
-                                    sw.add_audio('gt/y_{}'.format(j), y[0], steps, h.sampling_rate)
-                                    sw.add_figure('gt/y_spec_{}'.format(j), plot_spectrogram(x[0]), steps)
-
-                                sw.add_audio('generated/y_hat_{}'.format(j), y_g_hat[0], steps, h.sampling_rate)
-                                y_hat_spec = mel_spectrogram(y_g_hat.squeeze(1), h.n_fft, h.num_mels,
-                                                             h.sampling_rate, h.hop_size, h.win_size,
-                                                             h.fmin, h.fmax)
-                                sw.add_figure('generated/y_hat_spec_{}'.format(j),
-                                              plot_spectrogram(y_hat_spec.squeeze(0).cpu().numpy()), steps)
-
-                        val_err = val_err_tot / (j + 1)
-                        sw.add_scalar("validation/mel_spec_error", val_err, steps)
-
-                    generator.train()
-
-            steps += 1
-
-        scheduler_g.step()
-        scheduler_d.step()
-
-        if rank == 0:
-            print('Time taken for epoch {} is {} sec\n'.format(epoch + 1, int(time.time() - start)))
-
-
--- a/vocoder/fregan/utils.py
+++ b/vocoder/fregan/utils.py
@@ -1,71 +0,0 @@
-import glob
-import os
-import matplotlib
-import torch
-from torch.nn.utils import weight_norm
-matplotlib.use("Agg")
-import matplotlib.pylab as plt
-import shutil
-
-
-class AttrDict(dict):
-    def __init__(self, *args, **kwargs):
-        super(AttrDict, self).__init__(*args, **kwargs)
-        self.__dict__ = self
-
-
-def build_env(config, config_name, path):
-    t_path = os.path.join(path, config_name)
-    if config != t_path:
-        os.makedirs(path, exist_ok=True)
-        shutil.copyfile(config, os.path.join(path, config_name))
-
-
-def plot_spectrogram(spectrogram):
-    fig, ax = plt.subplots(figsize=(10, 2))
-    im = ax.imshow(spectrogram, aspect="auto", origin="lower",
-                   interpolation='none')
-    plt.colorbar(im, ax=ax)
-
-    fig.canvas.draw()
-    plt.close()
-
-    return fig
-
-
-def init_weights(m, mean=0.0, std=0.01):
-    classname = m.__class__.__name__
-    if classname.find("Conv") != -1:
-        m.weight.data.normal_(mean, std)
-
-
-def apply_weight_norm(m):
-    classname = m.__class__.__name__
-    if classname.find("Conv") != -1:
-        weight_norm(m)
-
-
-def get_padding(kernel_size, dilation=1):
-    return int((kernel_size*dilation - dilation)/2)
-
-
-def load_checkpoint(filepath, device):
-    assert os.path.isfile(filepath)
-    print("Loading '{}'".format(filepath))
-    checkpoint_dict = torch.load(filepath, map_location=device)
-    print("Complete.")
-    return checkpoint_dict
-
-
-def save_checkpoint(filepath, obj):
-    print("Saving checkpoint to {}".format(filepath))
-    torch.save(obj, filepath)
-    print("Complete.")
-
-
-def scan_checkpoint(cp_dir, prefix):
-    pattern = os.path.join(cp_dir, prefix + '????????')
-    cp_list = glob.glob(pattern)
-    if len(cp_list) == 0:
-        return None
-    return sorted(cp_list)[-1]
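Review note on the removed Fre-GAN helpers: `scan_checkpoint` globs for `prefix` followed by exactly eight characters, while the removed `train.py` writes checkpoints named `m_fregan_{:08d}` and `do_fregan_{:08d}` and then scans with the prefixes `g_` and `do_`. As far as this diff shows, those names never match the patterns, so a resumed run would probably have started from scratch. The sketch below reproduces the behaviour with the standard library only; the temporary directory, the step number, and the `demo` function are made up for illustration.

```python
import glob
import os
import tempfile


def scan_checkpoint(cp_dir, prefix):
    """Same logic as the removed helper: newest file named prefix + 8 characters."""
    pattern = os.path.join(cp_dir, prefix + "????????")
    cp_list = glob.glob(pattern)
    if len(cp_list) == 0:
        return None
    return sorted(cp_list)[-1]


def demo():
    with tempfile.TemporaryDirectory() as cp_dir:
        # File names as written by the removed train.py (step value is illustrative).
        for name in ("m_fregan_00001000", "do_fregan_00001000"):
            open(os.path.join(cp_dir, name), "w").close()

        found_g = scan_checkpoint(cp_dir, "g_")    # prefix used by train.py
        found_do = scan_checkpoint(cp_dir, "do_")  # prefix used by train.py
        # A prefix that includes the "fregan_" tag does match.
        found_tagged = scan_checkpoint(cp_dir, "do_fregan_")
        return found_g, found_do, os.path.basename(found_tagged)


if __name__ == "__main__":
    print(demo())
```

If the Fre-GAN trainer is ever restored, aligning the save names with the scan prefixes (or the reverse) would be the smaller change.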
--- a/vocoder/hifigan/config_16k_.json
+++ b/vocoder/hifigan/config_16k_.json
@@ -27,10 +27,5 @@
    "fmax": 7600,
    "fmax_for_loss": null,

-    "num_workers": 4,
-    "dist_config": {
-        "dist_backend": "nccl",
-        "dist_url": "tcp://localhost:54321",
-        "world_size": 1
-    }
+    "num_workers": 4
 }
--- a/vocoder/hifigan/inference.py
+++ b/vocoder/hifigan/inference.py
@@ -3,14 +3,11 @@ from __future__ import absolute_import, division, print_function, unicode_litera
 import os
 import json
 import torch
-from scipy.io.wavfile import write
 from vocoder.hifigan.env import AttrDict
-from vocoder.hifigan.meldataset import mel_spectrogram, MAX_WAV_VALUE, load_wav
 from vocoder.hifigan.models import Generator
-import soundfile as sf
-

 generator = None       # type: Generator
+output_sample_rate = None     
 _device = None


@@ -22,16 +19,17 @@ def load_checkpoint(filepath, device):
    return checkpoint_dict


-def load_model(weights_fpath, verbose=True):
-    global generator, _device
+def load_model(weights_fpath, config_fpath="./vocoder/saved_models/24k/config.json", verbose=True):
+    global generator, _device, output_sample_rate

    if verbose:
        print("Building hifigan")

-    with open("./vocoder/hifigan/config_16k_.json") as f:
+    with open(config_fpath) as f:
        data = f.read()
    json_config = json.loads(data)
    h = AttrDict(json_config)
+    output_sample_rate = h.sampling_rate
    torch.manual_seed(h.seed)

    if torch.cuda.is_available():
@@ -66,5 +64,5 @@ def infer_waveform(mel, progress_callback=None):
        audio = y_g_hat.squeeze()
    audio = audio.cpu().numpy()

-    return audio
+    return audio, output_sample_rate

--- a/vocoder/hifigan/meldataset.py
+++ b/vocoder/hifigan/meldataset.py
@@ -52,7 +52,6 @@ def mel_spectrogram(y, n_fft, num_mels, sampling_rate, hop_size, win_size, fmin,
    if torch.max(y) > 1.:
        print('max value is ', torch.max(y))

-
    global mel_basis, hann_window
    if fmax not in mel_basis:
        mel = librosa_mel_fn(sampling_rate, n_fft, num_mels, fmin, fmax)
--- a/vocoder/hifigan/models.py
+++ b/vocoder/hifigan/models.py
@@ -71,6 +71,24 @@ class ResBlock2(torch.nn.Module):
        for l in self.convs:
            remove_weight_norm(l)

+class InterpolationBlock(torch.nn.Module):
+    def __init__(self, scale_factor, mode='nearest', align_corners=None, downsample=False):
+        super(InterpolationBlock, self).__init__()
+        self.downsample = downsample
+        self.scale_factor = scale_factor
+        self.mode = mode
+        self.align_corners = align_corners
+    
+    def forward(self, x):
+        outputs = torch.nn.functional.interpolate(
+            x,
+            size=x.shape[-1] * self.scale_factor \
+                if not self.downsample else x.shape[-1] // self.scale_factor,
+            mode=self.mode,
+            align_corners=self.align_corners,
+            recompute_scale_factor=False
+        )
+        return outputs

 class Generator(torch.nn.Module):
    def __init__(self, h):
@@ -82,14 +100,27 @@ class Generator(torch.nn.Module):
        resblock = ResBlock1 if h.resblock == '1' else ResBlock2

        self.ups = nn.ModuleList()
-        for i, (u, k) in enumerate(zip(h.upsample_rates, h.upsample_kernel_sizes)):
-#             self.ups.append(weight_norm(
-#                 ConvTranspose1d(h.upsample_initial_channel//(2**i), h.upsample_initial_channel//(2**(i+1)),
-#                                 k, u, padding=(k-u)//2)))
-            self.ups.append(weight_norm(ConvTranspose1d(h.upsample_initial_channel//(2**i), 
-                h.upsample_initial_channel//(2**(i+1)),
-                k, u, padding=(u//2 + u%2), output_padding=u%2)))
-
+#         for i, (u, k) in enumerate(zip(h.upsample_rates, h.upsample_kernel_sizes)):
+# #             self.ups.append(weight_norm(
+# #                 ConvTranspose1d(h.upsample_initial_channel//(2**i), h.upsample_initial_channel//(2**(i+1)),
+# #                                 k, u, padding=(k-u)//2)))
+        if h.sampling_rate == 24000:
+            for i, (u, k) in enumerate(zip(h.upsample_rates, h.upsample_kernel_sizes)):
+                self.ups.append(
+                    torch.nn.Sequential(
+                        InterpolationBlock(u),
+                        weight_norm(torch.nn.Conv1d(
+                            h.upsample_initial_channel//(2**i),
+                            h.upsample_initial_channel//(2**(i+1)),
+                            k, padding=(k-1)//2,
+                        ))
+                    )
+                )
+        else:
+            for i, (u, k) in enumerate(zip(h.upsample_rates, h.upsample_kernel_sizes)):
+                self.ups.append(weight_norm(ConvTranspose1d(h.upsample_initial_channel//(2**i), 
+                    h.upsample_initial_channel//(2**(i+1)),
+                    k, u, padding=(u//2 + u%2), output_padding=u%2)))
        self.resblocks = nn.ModuleList()
        for i in range(len(self.ups)):
            ch = h.upsample_initial_channel//(2**(i+1))
@@ -121,7 +152,10 @@ class Generator(torch.nn.Module):
    def remove_weight_norm(self):
        print('Removing weight norm...')
        for l in self.ups:
-            remove_weight_norm(l)
+            if self.h.sampling_rate == 24000:
+                remove_weight_norm(l[-1])
+            else:
+                remove_weight_norm(l)
        for l in self.resblocks:
            l.remove_weight_norm()
        remove_weight_norm(self.conv_pre)
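Review note on the `Generator` hunk above: the two upsampling branches do not necessarily produce the same number of output samples. The sketch below applies the output-length formulas documented for `torch.nn.ConvTranspose1d` and `torch.nn.Conv1d` to the paddings used in the diff. It assumes `InterpolationBlock(u)` upsamples by a factor of `u` (the visible `forward` suggests this, but the constructor is not shown in the hunk). The helper names and the `(u, k)` values are illustrative and are not taken from any config in this PR.

```python
def conv_transpose1d_len(length, kernel, stride, padding, output_padding=0, dilation=1):
    """Output length of torch.nn.ConvTranspose1d, per the PyTorch docs formula."""
    return (length - 1) * stride - 2 * padding + dilation * (kernel - 1) + output_padding + 1


def conv1d_len(length, kernel, padding, stride=1, dilation=1):
    """Output length of torch.nn.Conv1d, per the PyTorch docs formula."""
    return (length + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def transposed_branch_len(length, u, k):
    # Mirrors the `else` branch: padding=(u//2 + u%2), output_padding=u%2.
    return conv_transpose1d_len(length, k, u, padding=u // 2 + u % 2, output_padding=u % 2)


def interpolated_branch_len(length, u, k):
    # Mirrors the 24 kHz branch (assumption: interpolation scales the length by u),
    # followed by Conv1d with padding=(k-1)//2 and stride 1.
    return conv1d_len(length * u, k, padding=(k - 1) // 2)


if __name__ == "__main__":
    # Illustrative (u, k) pairs only.
    for u, k in [(8, 16), (5, 10), (8, 15)]:
        print(u, k, transposed_branch_len(100, u, k), interpolated_branch_len(100, u, k))
```

Under these formulas the transposed branch yields `length * u - 2 * u + k`, so it equals `length * u` only when `k == 2 * u`. The interpolated branch yields `length * u` for odd `k` and one sample fewer for even `k`.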
--- a/vocoder/wavernn/inference.py
+++ b/vocoder/wavernn/inference.py
@@ -61,4 +61,4 @@ def infer_waveform(mel, normalize=True,  batched=True, target=8000, overlap=800,
        mel = mel / hp.mel_max_abs_value
    mel = torch.from_numpy(mel[None, ...])
    wav = _model.generate(mel, batched, target, overlap, hp.mu_law, progress_callback)
-    return wav
+    return wav, hp.sample_rate
--- a/vocoder_train.py
+++ b/vocoder_train.py
@@ -2,7 +2,6 @@ from utils.argutils import print_args
 from vocoder.wavernn.train import train
 from vocoder.hifigan.train import train as train_hifigan
 from vocoder.hifigan.env import AttrDict
-from vocoder.fregan.train import train as train_fregan
 from pathlib import Path
 import argparse
 import json
@@ -62,18 +61,11 @@ if __name__ == "__main__":
    # Process the arguments
    if args.vocoder_type == "wavernn":
        # Run the training wavernn
-        delattr(args,'vocoder_type')
-        delattr(args,'config')
        train(**vars(args))
    elif args.vocoder_type == "hifigan":
        with open(args.config) as f:
            json_config = json.load(f)
        h = AttrDict(json_config)
        train_hifigan(0, args, h)
-    elif args.vocoder_type == "fregan":
-        with open('vocoder/fregan/config.json') as f:
-            json_config = json.load(f)
-        h = AttrDict(json_config)
-        train_fregan(0, args, h)

        
--- a/web/init.py
+++ b/web/init.py
@@ -107,14 +107,15 @@ def webApp():
        embeds = [embed] * len(texts)
        specs = current_synt.synthesize_spectrograms(texts, embeds)
        spec = np.concatenate(specs, axis=1)
+        sample_rate = Synthesizer.sample_rate
        if "vocoder" in request.form and request.form["vocoder"] == "WaveRNN":
-            wav = rnn_vocoder.infer_waveform(spec)
+            wav, sample_rate = rnn_vocoder.infer_waveform(spec)
        else:
-            wav = gan_vocoder.infer_waveform(spec)
+            wav, sample_rate = gan_vocoder.infer_waveform(spec)

        # Return cooked wav
        out = io.BytesIO()
-        write(out, Synthesizer.sample_rate, wav.astype(np.float32))
+        write(out, sample_rate, wav.astype(np.float32))
        return Response(out, mimetype="audio/wav")

    @app.route('/', methods=['GET'])
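Review note on the `infer_waveform` changes above: the diff changes the WaveRNN `infer_waveform` to return `(wav, sample_rate)`, and `web/init.py` now unpacks the same shape from `gan_vocoder`. A caller that may meet either a bare waveform or a tuple could normalise the result first. This is a minimal sketch only; `unpack_vocoder_output` is a hypothetical helper and is not part of this PR.

```python
def unpack_vocoder_output(result, default_sample_rate):
    """Return (wav, sample_rate) for either return shape.

    Hypothetical helper: accepts a bare waveform or a
    (wav, sample_rate) tuple as produced by the updated vocoders.
    """
    if isinstance(result, tuple) and len(result) == 2:
        wav, sample_rate = result
        return wav, sample_rate
    # Older-style return value: only the waveform, so fall back to the default rate.
    return result, default_sample_rate


if __name__ == "__main__":
    # Plain lists stand in for numpy arrays to keep the sketch dependency-free.
    print(unpack_vocoder_output(([0.0, 0.1], 24000), 16000))
    print(unpack_vocoder_output([0.0, 0.1], 16000))
```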
Author	SHA1	Message	Date
babysor00	3a2d50c862	Add readme	2022-03-05 00:51:55 +08:00
babysor00	d786e78121	Add UI usage of PPG-vc	2022-03-03 23:34:47 +08:00
babysor00	6befb700e9	Fix sample issues	2022-03-02 23:15:37 +08:00
babysor00	dd3abebc4d	Fix bug of preparing fid	2022-02-27 13:25:58 +08:00
babysor00	eeee32f3e3	Fix length issue	2022-02-26 17:26:27 +08:00
babysor00	8ef5e1411d	Update __init__.py Allow to gen audio	2022-02-24 09:46:24 +08:00
babysor00	20bea3546b	Merge branch 'main' into ppg-vc-init	2022-02-24 00:31:13 +08:00
babysor00	0536874dec	Add config file for pretrained	2022-02-23 09:37:39 +08:00
babysor00	fad5023fca	FIx known issues	2022-02-20 11:56:58 +08:00
babysor00	19eaa68202	add preprocess and training	2022-02-13 11:28:41 +08:00
李子	4529479091	Pin the librosa version (#378) * Support the data_aishell (SLR33) dataset * Update readme * Pin the librosa version	2022-02-10 20:47:26 +08:00
babysor00	379fd2b9fd	Init ppg extractor and ppg2mel	2022-02-09 00:44:43 +08:00
babysor00	8ad9ba2b60	change naming logic of saving trained file for synthesizer to allow shorter interval	2022-01-15 17:56:14 +08:00
D-Blue	b56ec5ee1b	Fix a UserWarning (#273 ) Fix a UserWarning in synthesizer/synthesizer_dataset.py, because of converting list of numpy array to torch tensor at Ln.85.	2021-12-20 20:33:12 +08:00
CrystalRays	0bc34a5bc9	Fix TypeError at line 459 in toolbox/ui.py when both PySide6 (PyQt6) and PyQt5 are installed (#255) ### Error Info Screenshot ![](https://cdn.jsdelivr.net/gh/CrystalRays/CDN@main/img/16389623959301638962395845.png) ### Error Reason matplotlib/backends/qt_compat.py chooses the Qt binding first from the modules already in sys.modules, then from the QT_API environment variable, and finally by trying PyQt6, PySide6, PyQt5, PySide2, etc. in that order. Because ui.py imports matplotlib before PyQt5, no Qt binding is in sys.modules at that point and no environment variable is set, so matplotlib picks PyQt6 or PySide6 ahead of PyQt5 when one of them is installed.	2021-12-15 12:41:10 +08:00
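The Qt binding selection order described in the last commit message above can be modelled roughly as follows. This is a simplified sketch of that order and not matplotlib's actual code: `pick_qt_binding` and its arguments are hypothetical, and the real module inspects `sys.modules` and `os.environ` directly.

```python
QT_BINDING_ORDER = ["PyQt6", "PySide6", "PyQt5", "PySide2"]


def pick_qt_binding(imported_modules, environ, installed):
    """Simplified model of the selection order from the commit message.

    imported_modules: names already imported (stands in for sys.modules)
    environ: mapping of environment variables (stands in for os.environ)
    installed: names of the Qt bindings available on the system
    """
    # 1. A binding that has already been imported wins.
    for name in QT_BINDING_ORDER:
        if name in imported_modules:
            return name
    # 2. Otherwise honour the QT_API environment variable, if set.
    requested = environ.get("QT_API")
    if requested:
        for name in QT_BINDING_ORDER:
            if name.lower() == requested.lower() and name in installed:
                return name
    # 3. Otherwise fall back to the first installed binding in the fixed order.
    for name in QT_BINDING_ORDER:
        if name in installed:
            return name
    return None


if __name__ == "__main__":
    installed = {"PyQt5", "PySide6"}
    # PyQt5 imported first: it is kept.
    print(pick_qt_binding({"PyQt5"}, {}, installed))
    # Nothing imported and no environment variable: a Qt6 binding is chosen.
    print(pick_qt_binding(set(), {}, installed))
```

In this model, importing PyQt5 before matplotlib (or setting `QT_API`) keeps PyQt5 selected even when a Qt6 binding is also installed, which matches the fix the commit describes.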