Mirror of https://github.com/babysor/Realtime-Voice-Clone-Chinese.git (synced 2026-02-03 18:43:41 +08:00)
Compare commits
15 Commits
| Author | SHA1 | Date |
|---|---|---|
| | 724194a4de | |
| | 31bc6656c3 | |
| | aa35fb3139 | |
| | 727eafc51b | |
| | d328ecba81 | |
| | fad574118c | |
| | b0c156a537 | |
| | 724809abf4 | |
| | 05cd1a54ea | |
| | 245099c740 | |
| | 6dd2af49fe | |
| | 8b43ec9a64 | |
| | 2a99f0ff05 | |
| | a824b54122 | |
| | 81befb91b0 | |
README-CN.md (16 lines changed)
@@ -5,10 +5,10 @@
 ### [English](README.md) | Chinese

-### [DEMO VIDEO](https://www.bilibili.com/video/BV17Q4y1B7mY/)
+### [DEMO VIDEO](https://www.bilibili.com/video/BV17Q4y1B7mY/) | [Wiki tutorial](https://github.com/babysor/MockingBird/wiki/Quick-Start-(Newbie)) | [Training tutorial](https://vaj2fgg8yn.feishu.cn/docs/doccn7kAbr3SJz0KM0SIDJ0Xnhd)

 ## Features
-🌍 **Chinese** Mandarin supported, tested with multiple Chinese datasets: aidatatang_200zh, magicdata, aishell3, biaobei, MozillaCommonVoice, etc.
+🌍 **Chinese** Mandarin supported, tested with multiple Chinese datasets: aidatatang_200zh, magicdata, aishell3, biaobei, MozillaCommonVoice, data_aishell, etc.

 🤩 **PyTorch** works with PyTorch, tested with version 1.9.0 (latest as of August 2021), on GPU Tesla T4 and GTX 2060
@@ -18,6 +18,7 @@
 🌍 **Webserver Ready** can serve your trained models for remote calls

 ## Getting Started
 ### 1. Requirements
 > Follow the original repository to check that your whole environment is ready.
 **Python 3.7 or higher** is required to run the toolbox.
@@ -34,8 +35,10 @@
 #### 2.1 Train the synthesizer with a dataset yourself (choose either this or 2.2)
 * Download the dataset and unzip it: make sure you can access all audio files (e.g. .wav) in the *train* folder
 * Preprocess the audio and mel spectrograms:
-`python pre.py <datasets_root>`
-The parameter --dataset `{dataset}` can be passed; supported: aidatatang_200zh, magicdata, aishell3
+`python pre.py <datasets_root> -d {dataset} -n {number}`
+Supported parameters:
+* -d `{dataset}` selects the dataset; supports aidatatang_200zh, magicdata, aishell3, data_aishell; defaults to aidatatang_200zh when omitted
+* -n `{number}` sets the number of parallel processes; 10 worked fine in testing on an 11700K CPU with 32 GB RAM
 > If you downloaded `aidatatang_200zh` to drive D and the `train` folder path is `D:\data\aidatatang_200zh\corpus\train`, then your `datasets_root` is `D:\data\`
 * Train the synthesizer:
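Side note: a minimal Python sketch (a hypothetical helper, not code from this repo) of the `<datasets_root>` convention the note above describes:

```python
from pathlib import Path

# Hypothetical illustration: pre.py expects <datasets_root> to be the folder
# that contains the dataset directory, not the dataset directory itself.
datasets_root = Path(r"D:\data")
train_dir = datasets_root / "aidatatang_200zh" / "corpus" / "train"
print(f"pre.py would scan audio under: {train_dir}")
```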
@@ -48,7 +51,7 @@
 | Author | Download link | Preview | Info |
 | --- | ----------- | ----- | ----- |
 | Author | https://pan.baidu.com/s/1VHSKIbxXQejtxi2at9IrpA [Baidu Pan](https://pan.baidu.com/s/1VHSKIbxXQejtxi2at9IrpA) code: i183 | | 200k steps, trained only on aidatatang_200zh
 | Author | https://pan.baidu.com/s/11FrUYBmLrSs_cQ7s3JTlPQ [Baidu Pan](https://pan.baidu.com/s/11FrUYBmLrSs_cQ7s3JTlPQ) code: gdn5 | | 25k steps, trained on a mix of 3 open-source datasets
 | @FawenYo | https://drive.google.com/file/d/1H-YGOUHpmqKxJ9FRc6vAjPuqQki24UbC/view?usp=sharing [Baidu Pan](https://pan.baidu.com/s/1vSYXO4wsLyjnF3Unl-Xoxg) code: 1024 | [input](https://github.com/babysor/MockingBird/wiki/audio/self_test.mp3) [output](https://github.com/babysor/MockingBird/wiki/audio/export.wav) | 200k steps, Taiwanese accent
 | @miven | https://pan.baidu.com/s/1PI-hM3sn5wbeChRryX-RCQ code: 2021 | https://www.bilibili.com/video/BV1uh411B7AD/ | 150k steps, old version; apply the fix in [issue](https://github.com/babysor/MockingBird/issues/37)
@@ -121,7 +124,7 @@
 | --- | ----------- | ----- | --------------------- |
 | [1803.09017](https://arxiv.org/abs/1803.09017) | GlobalStyleToken (synthesizer) | Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis | 本代码库 |
 | [2010.05646](https://arxiv.org/abs/2010.05646) | HiFi-GAN (vocoder) | Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis | 本代码库 |
-| [**1806.04558**](https://arxiv.org/pdf/1806.04558.pdf) | SV2TTS | Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis | This repo |
+| [**1806.04558**](https://arxiv.org/pdf/1806.04558.pdf) | SV2TTS | Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis | 本代码库 |
 | [1802.08435](https://arxiv.org/pdf/1802.08435.pdf) | WaveRNN (vocoder) | Efficient Neural Audio Synthesis | [fatchord/WaveRNN](https://github.com/fatchord/WaveRNN) |
 | [1703.10135](https://arxiv.org/pdf/1703.10135.pdf) | Tacotron (synthesizer) | Tacotron: Towards End-to-End Speech Synthesis | [fatchord/WaveRNN](https://github.com/fatchord/WaveRNN) |
 | [1710.10467](https://arxiv.org/pdf/1710.10467.pdf) | GE2E (encoder) | Generalized End-To-End Loss for Speaker Verification | 本代码库 |
@@ -133,6 +136,7 @@
 | aidatatang_200zh | [OpenSLR](http://www.openslr.org/62/) | [Google Drive](https://drive.google.com/file/d/110A11KZoVe7vy6kXlLb6zVPLb_J91I_t/view?usp=sharing) |
 | magicdata | [OpenSLR](http://www.openslr.org/68/) | [Google Drive (Dev set)](https://drive.google.com/file/d/1g5bWRUSNH68ycC6eNvtwh07nX3QhOOlo/view?usp=sharing) |
 | aishell3 | [OpenSLR](https://www.openslr.org/93/) | [Google Drive](https://drive.google.com/file/d/1shYp_o4Z0X0cZSKQDtFirct2luFUwKzZ/view?usp=sharing) |
+| data_aishell | [OpenSLR](https://www.openslr.org/33/) | |
 > After unzipping aidatatang_200zh, you also need to select and unzip all the archives under `aidatatang_200zh\corpus\train`

 #### 2. What does `<datasets_root>` mean?
README.md
@@ -6,7 +6,7 @@
 > English | [中文](README-CN.md)

 ## Features
-🌍 **Chinese** supported Mandarin and tested with multiple datasets: aidatatang_200zh, magicdata, aishell3, etc.
+🌍 **Chinese** supported Mandarin and tested with multiple datasets: aidatatang_200zh, magicdata, aishell3, data_aishell, etc.

 🤩 **PyTorch** worked for pytorch, tested with version 1.9.0 (latest as of August 2021), with GPU Tesla T4 and GTX 2060
@@ -36,7 +36,7 @@ You can either train your models or use existing ones:
 * Download dataset and unzip: make sure you can access all .wav in folder
 * Preprocess with the audios and the mel spectrograms:
 `python pre.py <datasets_root>`
-Allowing parameter `--dataset {dataset}` to support aidatatang_200zh, magicdata, aishell3, etc.
+Allowing parameter `--dataset {dataset}` to support aidatatang_200zh, magicdata, aishell3, data_aishell, etc. If this parameter is not passed, the default dataset will be aidatatang_200zh.

 * Train the synthesizer:
 `python synthesizer_train.py mandarin <datasets_root>/SV2TTS/synthesizer`
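As a sanity check between the two steps, a hedged sketch (the `SV2TTS/synthesizer` layout with `train.txt`, `mels` and `audio` is assumed from the SV2TTS convention, not verified against this diff):

```python
from pathlib import Path

# Assumed layout: pre.py writes its output under <datasets_root>/SV2TTS/synthesizer,
# which is exactly the path synthesizer_train.py takes as its data argument.
syn_dir = Path("datasets_root") / "SV2TTS" / "synthesizer"  # substitute your real root
for required in ("train.txt", "mels", "audio"):
    status = "ok" if (syn_dir / required).exists() else "missing"
    print(f"{syn_dir / required}: {status}")
```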
@@ -49,7 +49,7 @@ Allowing parameter `--dataset {dataset}` to support aidatatang_200zh, magicdata,
 | author | Download link | Preview Video | Info |
 | --- | ----------- | ----- | ----- |
 | @myself | https://pan.baidu.com/s/1VHSKIbxXQejtxi2at9IrpA [Baidu](https://pan.baidu.com/s/1VHSKIbxXQejtxi2at9IrpA) code: i183 | | 200k steps only trained by aidatatang_200zh
-| @FawenYo | https://drive.google.com/file/d/1H-YGOUHpmqKxJ9FRc6vAjPuqQki24UbC/view?usp=sharing [Baidu Pan](https://pan.baidu.com/s/1vSYXO4wsLyjnF3Unl-Xoxg) code: 1024 | [input](https://github.com/babysor/MockingBird/wiki/audio/self_test.mp3) [output](https://github.com/babysor/MockingBird/wiki/audio/export.wav) | 200k steps with local accent of Taiwan
+| @FawenYo | https://drive.google.com/file/d/1H-YGOUHpmqKxJ9FRc6vAjPuqQki24UbC/view?usp=sharing https://u.teknik.io/AYxWf.pt | [input](https://github.com/babysor/MockingBird/wiki/audio/self_test.mp3) [output](https://github.com/babysor/MockingBird/wiki/audio/export.wav) | 200k steps with local accent of Taiwan
 | @miven | https://pan.baidu.com/s/1PI-hM3sn5wbeChRryX-RCQ code: 2021 | https://www.bilibili.com/video/BV1uh411B7AD/
#### 2.3 Train vocoder (Optional)
@@ -91,6 +91,7 @@ You can then try the toolbox:
 | aidatatang_200zh | [OpenSLR](http://www.openslr.org/62/) | [Google Drive](https://drive.google.com/file/d/110A11KZoVe7vy6kXlLb6zVPLb_J91I_t/view?usp=sharing) |
 | magicdata | [OpenSLR](http://www.openslr.org/68/) | [Google Drive (Dev set)](https://drive.google.com/file/d/1g5bWRUSNH68ycC6eNvtwh07nX3QhOOlo/view?usp=sharing) |
 | aishell3 | [OpenSLR](https://www.openslr.org/93/) | [Google Drive](https://drive.google.com/file/d/1shYp_o4Z0X0cZSKQDtFirct2luFUwKzZ/view?usp=sharing) |
+| data_aishell | [OpenSLR](https://www.openslr.org/33/) | |
 > After unzipping aidatatang_200zh, you need to unzip all the files under `aidatatang_200zh\corpus\train`

 #### 2. What is `<datasets_root>`?
pre.py (5 lines changed)
@@ -12,7 +12,8 @@ import argparse
 recognized_datasets = [
     "aidatatang_200zh",
     "magicdata",
-    "aishell3"
+    "aishell3",
+    "data_aishell"
 ]

 if __name__ == "__main__":
@@ -40,7 +41,7 @@ if __name__ == "__main__":
         "Use this option when dataset does not include alignments\
         (these are used to split long audio files into sub-utterances.)")
     parser.add_argument("-d", "--dataset", type=str, default="aidatatang_200zh", help=\
-        "Name of the dataset to process, allowing values: magicdata, aidatatang_200zh, aishell3.")
+        "Name of the dataset to process, allowing values: magicdata, aidatatang_200zh, aishell3, data_aishell.")
     parser.add_argument("-e", "--encoder_model_fpath", type=Path, default="encoder/saved_models/pretrained.pt", help=\
         "Path your trained encoder model.")
     parser.add_argument("-ne", "--n_processes_embed", type=int, default=1, help=\
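For illustration, a minimal sketch (hypothetical, not code from the repo) of how a `--dataset` value could be validated against `recognized_datasets` before preprocessing starts:

```python
import argparse

# Hypothetical validation sketch, mirroring the recognized_datasets list above.
recognized_datasets = ["aidatatang_200zh", "magicdata", "aishell3", "data_aishell"]

parser = argparse.ArgumentParser()
parser.add_argument("-d", "--dataset", type=str, default="aidatatang_200zh",
                    choices=recognized_datasets,  # argparse rejects unknown names
                    help="Name of the dataset to process.")
args = parser.parse_args(["-d", "data_aishell"])
print(f"Preprocessing dataset: {args.dataset}")
```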
synthesizer/hparams.py
@@ -62,9 +62,11 @@ hparams = HParams(
     tts_clip_grad_norm = 1.0,   # clips the gradient norm to prevent explosion - set to None if not needed
     tts_eval_interval = 500,    # Number of steps between model evaluation (sample generation)
                                 # Set to -1 to generate after completing epoch, or 0 to disable
     tts_eval_num_samples = 1,   # Makes this number of samples

+    ## For finetune usage, if set, only selected layers will be trained, available: encoder, encoder_proj, gst, decoder, postnet, post_proj
+    tts_finetune_layers = [],

     ### Data Preprocessing
     max_mel_frames = 900,
     rescale = True,
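A hedged sketch of what the new `tts_finetune_layers` switch is meant to do (layer names come from the comment above; the helper below is illustrative, not the repo's implementation):

```python
import torch.nn as nn

# Illustrative only: freeze everything except whitelisted top-level children,
# matching the intent of tts_finetune_layers / Tacotron.finetune_partial below.
def freeze_except(model: nn.Module, whitelist_layers: list):
    for name, child in model.named_children():
        trainable = name in whitelist_layers
        for param in child.parameters():
            param.requires_grad = trainable

# e.g. fine-tune only the decoder and postnet of a synthesizer checkpoint:
# freeze_except(model, ["decoder", "postnet"])
```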
synthesizer/inference.py
@@ -70,7 +70,7 @@ class Synthesizer:

     def synthesize_spectrograms(self, texts: List[str],
                                 embeddings: Union[np.ndarray, List[np.ndarray]],
-                                return_alignments=False, style_idx=0):
+                                return_alignments=False, style_idx=0, min_stop_token=5):
         """
         Synthesizes mel spectrograms from texts and speaker embeddings.

@@ -125,7 +125,7 @@ class Synthesizer:
             speaker_embeddings = torch.tensor(speaker_embeds).float().to(self.device)

             # Inference
-            _, mels, alignments = self._model.generate(chars, speaker_embeddings, style_idx=style_idx)
+            _, mels, alignments = self._model.generate(chars, speaker_embeddings, style_idx=style_idx, min_stop_token=min_stop_token)
             mels = mels.detach().cpu().numpy()
             for m in mels:
                 # Trim silence from end of each spectrogram
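A hedged usage sketch of the new `min_stop_token` knob at the inference API (the checkpoint path is hypothetical, and the 256-dim embedding is a placeholder for a real encoder output):

```python
import numpy as np
from synthesizer.inference import Synthesizer  # import path per this repo's layout

synthesizer = Synthesizer("synthesizer/saved_models/mandarin.pt")  # hypothetical path
embed = np.random.rand(256).astype(np.float32)  # stands in for an encoder embedding

# A higher min_stop_token demands a stronger stop signal from the decoder,
# so generation ends later (fewer premature cut-offs, more trailing frames).
specs = synthesizer.synthesize_spectrograms(["你好世界"], [embed], style_idx=0,
                                            min_stop_token=7)
print(specs[0].shape)  # (n_mels, frames)
```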
synthesizer/models/tacotron.py
@@ -419,7 +419,7 @@ class Tacotron(nn.Module):

         return mel_outputs, linear, attn_scores, stop_outputs

-    def generate(self, x, speaker_embedding=None, steps=200, style_idx=0):
+    def generate(self, x, speaker_embedding=None, steps=200, style_idx=0, min_stop_token=5):
         self.eval()
         device = next(self.parameters()).device  # use same device as parameters

@@ -454,9 +454,9 @@ class Tacotron(nn.Module):
                 scale[:] = 0.3
                 speaker_embedding = (gst_embed[style_idx] * scale).astype(np.float32)
                 speaker_embedding = torch.from_numpy(np.tile(speaker_embedding, (x.shape[0], 1))).to(device)
-            style_embed = self.gst(speaker_embedding)
-            style_embed = style_embed.expand_as(encoder_seq)
-            encoder_seq = encoder_seq + style_embed
+                style_embed = self.gst(speaker_embedding)
+                style_embed = style_embed.expand_as(encoder_seq)
+                encoder_seq = encoder_seq + style_embed
             encoder_seq_proj = self.encoder_proj(encoder_seq)

         # Need a couple of lists for outputs

@@ -472,7 +472,7 @@ class Tacotron(nn.Module):
             attn_scores.append(scores)
             stop_outputs.extend([stop_tokens] * self.r)
             # Stop the loop when all stop tokens in batch exceed threshold
-            if (stop_tokens > 0.5).all() and t > 10: break
+            if (stop_tokens * 10 > min_stop_token).all() and t > 10: break

         # Concat the mel outputs into sequence
         mel_outputs = torch.cat(mel_outputs, dim=2)

@@ -496,6 +496,15 @@ class Tacotron(nn.Module):
         for p in self.parameters():
             if p.dim() > 1: nn.init.xavier_uniform_(p)

+    def finetune_partial(self, whitelist_layers):
+        self.zero_grad()
+        for name, child in self.named_children():
+            if name in whitelist_layers:
+                print("Trainable Layer: %s" % name)
+                print("Trainable Parameters: %.3f" % sum([np.prod(p.size()) for p in child.parameters()]))
+            else:
+                for param in child.parameters():
+                    param.requires_grad = False
+
     def get_step(self):
         return self.step.data.item()
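The rewritten stop condition compares `stop_tokens * 10` against `min_stop_token`, i.e. a UI value of 3..9 maps to a stop-probability threshold of 0.3..0.9, with 5 reproducing the old fixed 0.5. A small sketch of that mapping, for illustration:

```python
import torch

def should_stop(stop_tokens: torch.Tensor, min_stop_token: int, t: int) -> bool:
    # Equivalent reading of the new condition: every stop probability in the
    # batch must exceed min_stop_token / 10, after at least 10 decoder steps.
    return bool((stop_tokens * 10 > min_stop_token).all()) and t > 10

probs = torch.tensor([0.6, 0.7, 0.9])
print(should_stop(probs, 5, t=42))  # True: all probs > 0.5
print(should_stop(probs, 7, t=42))  # False: 0.6 and 0.7 are not > 0.7
```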
synthesizer/preprocess.py
@@ -7,7 +7,7 @@ from tqdm import tqdm
 import numpy as np
 from encoder import inference as encoder
 from synthesizer.preprocess_speaker import preprocess_speaker_general
-from synthesizer.preprocess_transcript import preprocess_transcript_aishell3
+from synthesizer.preprocess_transcript import preprocess_transcript_aishell3, preprocess_transcript_magicdata

 data_info = {
     "aidatatang_200zh": {

@@ -18,13 +18,19 @@ data_info = {
     "magicdata": {
         "subfolders": ["train"],
         "trans_filepath": "train/TRANS.txt",
-        "speak_func": preprocess_speaker_general
+        "speak_func": preprocess_speaker_general,
+        "transcript_func": preprocess_transcript_magicdata,
     },
     "aishell3":{
         "subfolders": ["train/wav"],
         "trans_filepath": "train/content.txt",
         "speak_func": preprocess_speaker_general,
         "transcript_func": preprocess_transcript_aishell3,
     },
+    "data_aishell":{
+        "subfolders": ["wav/train"],
+        "trans_filepath": "transcript/aishell_transcript_v0.8.txt",
+        "speak_func": preprocess_speaker_general
+    }
 }
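For orientation, a hedged sketch of how a `data_info` entry locates a dataset's files (field names are the ones in the diff; the helper itself is illustrative, not the repo's actual loop):

```python
from pathlib import Path

# Illustrative only: each data_info entry records where the audio subfolders
# and the transcript file live, relative to the dataset's directory.
def locate_dataset_files(datasets_root: Path, name: str, data_info: dict):
    info = data_info[name]
    trans_path = datasets_root / name / info["trans_filepath"]
    audio_dirs = [datasets_root / name / sub for sub in info["subfolders"]]
    return trans_path, audio_dirs

# e.g. for data_aishell this yields .../transcript/aishell_transcript_v0.8.txt
# and the audio under .../wav/train.
```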
synthesizer/preprocess_transcript.py
@@ -6,4 +6,13 @@ def preprocess_transcript_aishell3(dict_info, dict_transcript):
         transList = []
         for i in range(2, len(v), 2):
             transList.append(v[i])
         dict_info[v[0]] = " ".join(transList)
+
+
+def preprocess_transcript_magicdata(dict_info, dict_transcript):
+    for v in dict_transcript:
+        if not v:
+            continue
+        v = v.strip().replace("\n","").replace("\t"," ").split(" ")
+        dict_info[v[0]] = " ".join(v[2:])
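A small worked example (the tab-separated `TRANS.txt` line format is assumed from the parser above) of what `preprocess_transcript_magicdata` produces:

```python
# Assumed TRANS.txt format: "<utterance_id>\t<speaker_id>\t<transcript...>"
def preprocess_transcript_magicdata(dict_info, dict_transcript):
    for v in dict_transcript:
        if not v:
            continue
        v = v.strip().replace("\n", "").replace("\t", " ").split(" ")
        dict_info[v[0]] = " ".join(v[2:])  # map utterance id -> transcript text

dict_info = {}
preprocess_transcript_magicdata(dict_info, ["A_0001.wav\tS001\t你好 世界\n"])
print(dict_info)  # {'A_0001.wav': '你好 世界'}
```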
synthesizer/train.py
@@ -93,7 +93,7 @@ def train(run_id: str, syn_dir: str, models_dir: str, save_every: int,
                      speaker_embedding_size=hparams.speaker_embedding_size).to(device)

     # Initialize the optimizer
-    optimizer = optim.Adam(model.parameters())
+    optimizer = optim.Adam(model.parameters(), amsgrad=True)

     # Load the weights
     if force_restart or not weights_fpath.exists():

@@ -146,7 +146,6 @@ def train(run_id: str, syn_dir: str, models_dir: str, save_every: int,
             continue

         model.r = r
-
         # Begin the training
         simple_table([(f"Steps with r={r}", str(training_steps // 1000) + "k Steps"),
                       ("Batch Size", batch_size),

@@ -155,6 +154,8 @@ def train(run_id: str, syn_dir: str, models_dir: str, save_every: int,
         for p in optimizer.param_groups:
             p["lr"] = lr
+        if hparams.tts_finetune_layers is not None and len(hparams.tts_finetune_layers) > 0:
+            model.finetune_partial(hparams.tts_finetune_layers)

         data_loader = DataLoader(dataset,
                                  collate_fn=collate_synthesizer,
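On the optimizer change: `amsgrad=True` switches Adam to the AMSGrad variant, which tracks the running maximum of the second-moment estimate so the effective step size is non-increasing. A minimal sketch of the flag in isolation:

```python
import torch
from torch import optim

model = torch.nn.Linear(10, 10)  # stand-in for the Tacotron model
# AMSGrad keeps max(v_t) in Adam's denominator instead of v_t itself,
# which can stabilize training on noisy losses.
optimizer = optim.Adam(model.parameters(), amsgrad=True)
```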
toolbox/__init__.py
@@ -234,7 +234,8 @@ class Toolbox:
         texts = processed_texts
         embed = self.ui.selected_utterance.embed
         embeds = [embed] * len(texts)
-        specs = self.synthesizer.synthesize_spectrograms(texts, embeds, style_idx=int(self.ui.style_idx_textbox.text()))
+        min_token = int(self.ui.token_slider.value())
+        specs = self.synthesizer.synthesize_spectrograms(texts, embeds, style_idx=int(self.ui.style_slider.value()), min_stop_token=min_token)
         breaks = [spec.shape[1] for spec in specs]
         spec = np.concatenate(specs, axis=1)
toolbox/assets/mb.png (BIN, new file)
Binary file not shown. After Width: | Height: | Size: 5.6 KiB
toolbox/ui.py (158 lines changed)
@@ -2,6 +2,7 @@ import matplotlib.pyplot as plt
 from matplotlib.backends.backend_qt5agg import FigureCanvasQTAgg as FigureCanvas
 from matplotlib.figure import Figure
 from PyQt5.QtCore import Qt, QStringListModel
+from PyQt5 import QtGui
 from PyQt5.QtWidgets import *
 from encoder.inference import plot_embedding_as_heatmap
 from toolbox.utterance import Utterance
@@ -420,7 +421,10 @@ class UI(QDialog):
         ## Initialize the application
         self.app = QApplication(sys.argv)
         super().__init__(None)
-        self.setWindowTitle("SV2TTS toolbox")
+        self.setWindowTitle("MockingBird GUI")
+        self.setWindowIcon(QtGui.QIcon('toolbox\\assets\\mb.png'))
+        self.setWindowFlag(Qt.WindowMinimizeButtonHint, True)
+        self.setWindowFlag(Qt.WindowMaximizeButtonHint, True)

         ## Main layouts
@@ -430,21 +434,24 @@ class UI(QDialog):

         # Browser
         browser_layout = QGridLayout()
-        root_layout.addLayout(browser_layout, 0, 0, 1, 2)
+        root_layout.addLayout(browser_layout, 0, 0, 1, 8)

         # Generation
         gen_layout = QVBoxLayout()
-        root_layout.addLayout(gen_layout, 0, 2, 1, 2)
-
-        # Projections
-        self.projections_layout = QVBoxLayout()
-        root_layout.addLayout(self.projections_layout, 1, 0, 1, 1)
+        root_layout.addLayout(gen_layout, 0, 8)

         # Visualizations
         vis_layout = QVBoxLayout()
-        root_layout.addLayout(vis_layout, 1, 1, 1, 3)
+        root_layout.addLayout(vis_layout, 1, 0, 2, 8)
+
+        # Output
+        output_layout = QGridLayout()
+        vis_layout.addLayout(output_layout, 0)
+
+        # Projections
+        self.projections_layout = QVBoxLayout()
+        root_layout.addLayout(self.projections_layout, 1, 8, 2, 2)

         ## Projections
         # UMap
         fig, self.umap_ax = plt.subplots(figsize=(3, 3), facecolor="#F0F0F0")
@@ -458,80 +465,88 @@ class UI(QDialog):
         ## Browser
         # Dataset, speaker and utterance selection
         i = 0
-        self.dataset_box = QComboBox()
-        browser_layout.addWidget(QLabel("<b>Dataset</b>"), i, 0)
-        browser_layout.addWidget(self.dataset_box, i + 1, 0)
-        self.speaker_box = QComboBox()
-        browser_layout.addWidget(QLabel("<b>Speaker</b>"), i, 1)
-        browser_layout.addWidget(self.speaker_box, i + 1, 1)
-        self.utterance_box = QComboBox()
-        browser_layout.addWidget(QLabel("<b>Utterance</b>"), i, 2)
-        browser_layout.addWidget(self.utterance_box, i + 1, 2)
-        self.browser_load_button = QPushButton("Load")
-        browser_layout.addWidget(self.browser_load_button, i + 1, 3)
-        i += 2
-
-        # Random buttons
+        source_groupbox = QGroupBox('Source(源音频)')
+        source_layout = QGridLayout()
+        source_groupbox.setLayout(source_layout)
+        browser_layout.addWidget(source_groupbox, i, 0, 1, 4)
+
+        self.dataset_box = QComboBox()
+        source_layout.addWidget(QLabel("Dataset(数据集):"), i, 0)
+        source_layout.addWidget(self.dataset_box, i, 1)
         self.random_dataset_button = QPushButton("Random")
-        browser_layout.addWidget(self.random_dataset_button, i, 0)
+        source_layout.addWidget(self.random_dataset_button, i, 2)
+        i += 1
+        self.speaker_box = QComboBox()
+        source_layout.addWidget(QLabel("Speaker(说话者)"), i, 0)
+        source_layout.addWidget(self.speaker_box, i, 1)
         self.random_speaker_button = QPushButton("Random")
-        browser_layout.addWidget(self.random_speaker_button, i, 1)
+        source_layout.addWidget(self.random_speaker_button, i, 2)
+        i += 1
+        self.utterance_box = QComboBox()
+        source_layout.addWidget(QLabel("Utterance(音频):"), i, 0)
+        source_layout.addWidget(self.utterance_box, i, 1)
         self.random_utterance_button = QPushButton("Random")
-        browser_layout.addWidget(self.random_utterance_button, i, 2)
+        source_layout.addWidget(self.random_utterance_button, i, 2)
+
+        i += 1
+        source_layout.addWidget(QLabel("<b>Use(使用):</b>"), i, 0)
+        self.browser_load_button = QPushButton("Load Above(加载上面)")
+        source_layout.addWidget(self.browser_load_button, i, 1, 1, 2)
         self.auto_next_checkbox = QCheckBox("Auto select next")
         self.auto_next_checkbox.setChecked(True)
-        browser_layout.addWidget(self.auto_next_checkbox, i, 3)
-        i += 1
+        source_layout.addWidget(self.auto_next_checkbox, i+1, 1)
+        self.browser_browse_button = QPushButton("Browse(打开本地)")
+        source_layout.addWidget(self.browser_browse_button, i, 3)
+        self.record_button = QPushButton("Record(录音)")
+        source_layout.addWidget(self.record_button, i+1, 3)
+
+        i += 2
         # Utterance box
-        browser_layout.addWidget(QLabel("<b>Use embedding from:</b>"), i, 0)
+        browser_layout.addWidget(QLabel("<b>Current(当前):</b>"), i, 0)
         self.utterance_history = QComboBox()
-        browser_layout.addWidget(self.utterance_history, i, 1, 1, 3)
-        i += 1
-
-        # Random & next utterance buttons
-        self.browser_browse_button = QPushButton("Browse")
-        browser_layout.addWidget(self.browser_browse_button, i, 0)
-        self.record_button = QPushButton("Record")
-        browser_layout.addWidget(self.record_button, i, 1)
-        self.play_button = QPushButton("Play")
+        browser_layout.addWidget(self.utterance_history, i, 1)
+        self.play_button = QPushButton("Play(播放)")
         browser_layout.addWidget(self.play_button, i, 2)
-        self.stop_button = QPushButton("Stop")
+        self.stop_button = QPushButton("Stop(暂停)")
         browser_layout.addWidget(self.stop_button, i, 3)
-        i += 1
+
+        i += 1
+        model_groupbox = QGroupBox('Models(模型选择)')
+        model_layout = QHBoxLayout()
+        model_groupbox.setLayout(model_layout)
+        browser_layout.addWidget(model_groupbox, i, 0, 1, 4)

         # Model and audio output selection
         self.encoder_box = QComboBox()
-        browser_layout.addWidget(QLabel("<b>Encoder</b>"), i, 0)
-        browser_layout.addWidget(self.encoder_box, i + 1, 0)
+        model_layout.addWidget(QLabel("Encoder:"))
+        model_layout.addWidget(self.encoder_box)
         self.synthesizer_box = QComboBox()
-        browser_layout.addWidget(QLabel("<b>Synthesizer</b>"), i, 1)
-        browser_layout.addWidget(self.synthesizer_box, i + 1, 1)
+        model_layout.addWidget(QLabel("Synthesizer:"))
+        model_layout.addWidget(self.synthesizer_box)
         self.vocoder_box = QComboBox()
-        browser_layout.addWidget(QLabel("<b>Vocoder</b>"), i, 2)
-        browser_layout.addWidget(self.vocoder_box, i + 1, 2)
-
-        self.audio_out_devices_cb=QComboBox()
-        browser_layout.addWidget(QLabel("<b>Audio Output</b>"), i, 3)
-        browser_layout.addWidget(self.audio_out_devices_cb, i + 1, 3)
-        i += 2
+        model_layout.addWidget(QLabel("Vocoder:"))
+        model_layout.addWidget(self.vocoder_box)

         #Replay & Save Audio
-        browser_layout.addWidget(QLabel("<b>Toolbox Output:</b>"), i, 0)
+        i = 0
+        output_layout.addWidget(QLabel("<b>Toolbox Output:</b>"), i, 0)
         self.waves_cb = QComboBox()
         self.waves_cb_model = QStringListModel()
         self.waves_cb.setModel(self.waves_cb_model)
         self.waves_cb.setToolTip("Select one of the last generated waves in this section for replaying or exporting")
-        browser_layout.addWidget(self.waves_cb, i, 1)
+        output_layout.addWidget(self.waves_cb, i, 1)
         self.replay_wav_button = QPushButton("Replay")
         self.replay_wav_button.setToolTip("Replay last generated vocoder")
-        browser_layout.addWidget(self.replay_wav_button, i, 2)
+        output_layout.addWidget(self.replay_wav_button, i, 2)
         self.export_wav_button = QPushButton("Export")
         self.export_wav_button.setToolTip("Save last generated vocoder audio in filesystem as a wav file")
-        browser_layout.addWidget(self.export_wav_button, i, 3)
+        output_layout.addWidget(self.export_wav_button, i, 3)
+        self.audio_out_devices_cb=QComboBox()
+        i += 1
+        output_layout.addWidget(QLabel("<b>Audio Output</b>"), i, 0)
+        output_layout.addWidget(self.audio_out_devices_cb, i, 1)

         ## Embed & spectrograms
         vis_layout.addStretch()
@@ -552,7 +567,6 @@ class UI(QDialog):
             for side in ["top", "right", "bottom", "left"]:
                 ax.spines[side].set_visible(False)
-
         ## Generation
         self.text_prompt = QPlainTextEdit(default_text)
         gen_layout.addWidget(self.text_prompt, stretch=1)
@@ -574,14 +588,36 @@ class UI(QDialog):
         self.seed_textbox = QLineEdit()
         self.seed_textbox.setMaximumWidth(80)
         layout_seed.addWidget(self.seed_textbox, 0, 1)
-        layout_seed.addWidget(QLabel("Style#:(0~9)"), 0, 2)
-        self.style_idx_textbox = QLineEdit("-1")
-        self.style_idx_textbox.setMaximumWidth(80)
-        layout_seed.addWidget(self.style_idx_textbox, 0, 3)
         self.trim_silences_checkbox = QCheckBox("Enhance vocoder output")
         self.trim_silences_checkbox.setToolTip("When checked, trims excess silence in vocoder output."
             " This feature requires `webrtcvad` to be installed.")
-        layout_seed.addWidget(self.trim_silences_checkbox, 0, 4, 1, 2)
+        layout_seed.addWidget(self.trim_silences_checkbox, 0, 2, 1, 2)
+        self.style_slider = QSlider(Qt.Horizontal)
+        self.style_slider.setTickInterval(1)
+        self.style_slider.setFocusPolicy(Qt.NoFocus)
+        self.style_slider.setSingleStep(1)
+        self.style_slider.setRange(-1, 9)
+        self.style_value_label = QLabel("-1")
+        self.style_slider.setValue(-1)
+        layout_seed.addWidget(QLabel("Style:"), 1, 0)
+
+        self.style_slider.valueChanged.connect(lambda s: self.style_value_label.setNum(s))
+        layout_seed.addWidget(self.style_value_label, 1, 1)
+        layout_seed.addWidget(self.style_slider, 1, 3)
+
+        self.token_slider = QSlider(Qt.Horizontal)
+        self.token_slider.setTickInterval(1)
+        self.token_slider.setFocusPolicy(Qt.NoFocus)
+        self.token_slider.setSingleStep(1)
+        self.token_slider.setRange(3, 9)
+        self.token_value_label = QLabel("5")
+        self.token_slider.setValue(4)
+        layout_seed.addWidget(QLabel("Accuracy(精度):"), 2, 0)
+
+        self.token_slider.valueChanged.connect(lambda s: self.token_value_label.setNum(s))
+        layout_seed.addWidget(self.token_value_label, 2, 1)
+        layout_seed.addWidget(self.token_slider, 2, 3)

         gen_layout.addLayout(layout_seed)

         self.loading_bar = QProgressBar()
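The slider/label pairing above follows a standard Qt pattern (the slider's `valueChanged(int)` signal drives the label's `setNum` slot). A minimal standalone sketch of that wiring, for reference:

```python
import sys
from PyQt5.QtCore import Qt
from PyQt5.QtWidgets import QApplication, QLabel, QSlider, QVBoxLayout, QWidget

# Minimal standalone version of the Accuracy slider wiring used above.
app = QApplication(sys.argv)
w = QWidget()
layout = QVBoxLayout(w)
label = QLabel("5")
slider = QSlider(Qt.Horizontal)
slider.setRange(3, 9)
slider.setValue(5)
slider.valueChanged.connect(label.setNum)  # keep the label in sync with the slider
layout.addWidget(label)
layout.addWidget(slider)
w.show()
# app.exec_()  # uncomment to run the event loop
```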
@@ -595,7 +631,7 @@ class UI(QDialog):

         ## Set the size of the window and of the elements
-        max_size = QDesktopWidget().availableGeometry(self).size() * 0.8
+        max_size = QDesktopWidget().availableGeometry(self).size() * 0.5
         self.resize(max_size)

         ## Finalize the display