28 Commits

Author SHA1 Message Date
Vega
918933ea86 Update README.md 2025-11-13 10:03:10 +08:00
Vega
1cde29d5f3 Update README.md (#1011) 2024-11-14 21:00:29 -08:00
Vega
4b8fa992b7 Update README.md (#1009) 2024-11-01 12:57:10 +08:00
Bob Conan
42789babd8 Update README.md, fix a typo (#1007) 2024-10-22 10:21:44 +08:00
Vega
2354bb42d1 Update README.md (#1005) 2024-10-16 22:48:15 +08:00
Vega
4358f6f353 Update README.md 2024-08-29 17:52:56 +08:00
xxxxx
5971555319 Update requirements.txt (#747)
Ubuntu 20.04.1, CUDA 11.3: missing dependencies, plus dependency conflicts

Co-authored-by: Vega <babysor00@gmail.com>
2024-08-22 15:06:40 +08:00
Emma Thompson
6f84026c51 Env update: add environment requirement notes (#660)
* Update Readme Doc

Add environment requirement notes

* Update Readme Doc

Add environmental requirement notes

---------

Co-authored-by: Limingrui0 <65227354+Limingrui0@users.noreply.github.com>
2024-07-06 10:13:09 +08:00
Terminal
a30657ecf5 fix:preprocess_audio.py--The .npy file failed to save (#988) 2024-07-06 10:12:36 +08:00
Terminal
cc250af1f6 fix requirements monotonic-align error (#989) 2024-07-06 10:12:06 +08:00
Vega
156723e37c Skip embedding (#950)
* Skip embedding

* Skip earlier

* Remove unused parameter

* Pass param
2023-09-05 23:15:04 +08:00
Vega
1862d2145b Merge pull request #953 from babysor/babysor-patch-3
Update README.md
2023-08-31 11:42:15 +08:00
Vega
72a22d448b Update README.md 2023-08-31 11:42:05 +08:00
Vega
98d38d84c3 Merge pull request #952 from SeaTidesPro/main
add readme-linux-zh
2023-08-31 11:41:10 +08:00
Tide
7ab86c6f4c Update README-LINUX-CN.md 2023-08-30 14:41:45 +08:00
Tide
ab79881480 Update README-LINUX-CN.md 2023-08-30 14:40:30 +08:00
Tide
fd93b40398 Update README-LINUX-CN.md 2023-08-30 14:35:34 +08:00
Tide
dbf01347fc Update README-LINUX-CN.md 2023-08-30 14:35:12 +08:00
Tide
28f9173dfa Update README-LINUX-CN.md 2023-08-30 14:34:20 +08:00
Tide
d073e1f349 Update README-LINUX-CN.md 2023-08-30 14:24:05 +08:00
Tide
baa8b5005d Update README-LINUX-CN.md 2023-08-30 14:05:40 +08:00
Tide
d54f4fb631 Update README-LINUX-CN.md 2023-08-30 13:18:33 +08:00
Tide
7353888d35 Create README-LINUX-CN.md 2023-08-30 12:20:29 +08:00
Vega
e9ce943f6c Merge pull request #947 from FawenYo/doc/update_link
📝 Update model download link
2023-08-11 22:02:41 +08:00
FawenYo
77c145328c 📝 Update model download link 2023-08-11 14:31:39 +08:00
Vega
3bce6bbbe7 Merge pull request #945 from babysor/babysor-patch-1
Update README.md
2023-08-10 15:54:23 +08:00
Vega
9dd8ea11e5 Merge pull request #944 from babysor/babysor-patch-2
Update README-CN.md
2023-08-10 15:53:52 +08:00
Vega
5a0d77e699 Update README-CN.md 2023-08-10 15:53:42 +08:00
7 changed files with 279 additions and 80 deletions

README-CN.md

@@ -29,6 +29,7 @@
> If you hit `ERROR: Could not find a version that satisfies the requirement torch==1.9.0+cu102 (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2)` during pip install, your Python version is probably too old; 3.9 installs successfully
* Install [ffmpeg](https://ffmpeg.org/download.html#get-packages).
* Run `pip install -r requirements.txt` to install the remaining required packages.
+> The recommended environment here is `Repo Tag 0.0.1`, `Pytorch 1.9.0 with Torchvision 0.10.0 and cudatoolkit 10.2`, `requirements.txt`, `webrtcvad-wheels`, because `requirements.txt` was exported a few months ago and does not fit newer versions
* Install webrtcvad: `pip install webrtcvad-wheels`
or
@@ -113,7 +114,7 @@
> If your downloaded `aidatatang_200zh` sits on drive D and the `train` folder path is `D:\data\aidatatang_200zh\corpus\train`, then your `datasets_root` is `D:\data\`
* Train the synthesizer:
-`python synthesizer_train.py mandarin <datasets_root>/SV2TTS/synthesizer`
+`python ./control/cli/synthesizer_train.py mandarin <datasets_root>/SV2TTS/synthesizer`
* When the attention alignment and the loss shown in the training folder *synthesizer/saved_models/* meet your needs, move on to the `Launch the program` step.
@@ -124,7 +125,7 @@
| --- | ----------- | ----- | ----- |
| @author | https://pan.baidu.com/s/1iONvRxmkI-t1nHqxKytY3g [Baidu Drive](https://pan.baidu.com/s/1iONvRxmkI-t1nHqxKytY3g) code: 4j5d | | 75k steps, trained on a mix of 3 open-source datasets
| @author | https://pan.baidu.com/s/1fMh9IlgKJlL2PIiRTYDUvw [Baidu Drive](https://pan.baidu.com/s/1fMh9IlgKJlL2PIiRTYDUvw) code: om7f | | 25k steps, trained on a mix of 3 open-source datasets; switch to tag v0.0.1 to use
-|@FawenYo | https://drive.google.com/file/d/1H-YGOUHpmqKxJ9FRc6vAjPuqQki24UbC/view?usp=sharing [Baidu Drive](https://pan.baidu.com/s/1vSYXO4wsLyjnF3Unl-Xoxg) code: 1024 | [input](https://github.com/babysor/MockingBird/wiki/audio/self_test.mp3) [output](https://github.com/babysor/MockingBird/wiki/audio/export.wav) | 200k steps, Taiwanese accent; switch to tag v0.0.1 to use
+|@FawenYo | https://yisiou-my.sharepoint.com/:u:/g/personal/lawrence_cheng_fawenyo_onmicrosoft_com/EWFWDHzee-NNg9TWdKckCc4BC7bK2j9cCbOWn0-_tK0nOg?e=n0gGgC | [input](https://github.com/babysor/MockingBird/wiki/audio/self_test.mp3) [output](https://github.com/babysor/MockingBird/wiki/audio/export.wav) | 200k steps, Taiwanese accent; switch to tag v0.0.1 to use
|@miven| https://pan.baidu.com/s/1PI-hM3sn5wbeChRryX-RCQ code: 2021 | https://www.bilibili.com/video/BV1uh411B7AD/ | 150k steps; note: apply the fix from [issue](https://github.com/babysor/MockingBird/issues/37) and switch to tag v0.0.1
#### 2.4 Train vocoder (optional)
@@ -135,14 +136,14 @@
* Train the wavernn vocoder:
-`python vocoder_train.py <trainid> <datasets_root>`
+`python ./control/cli/vocoder_train.py <trainid> <datasets_root>`
> Replace `<trainid>` with an identifier of your choice; training again with the same identifier resumes from the existing model
* Train the hifigan vocoder:
-`python vocoder_train.py <trainid> <datasets_root> hifigan`
+`python ./control/cli/vocoder_train.py <trainid> <datasets_root> hifigan`
> Replace `<trainid>` with an identifier of your choice; training again with the same identifier resumes from the existing model
* Train the fregan vocoder:
-`python vocoder_train.py <trainid> <datasets_root> --config config.json fregan`
+`python ./control/cli/vocoder_train.py <trainid> <datasets_root> --config config.json fregan`
> Replace `<trainid>` with an identifier of your choice; training again with the same identifier resumes from the existing model
* To switch GAN vocoder training to multi-GPU mode, modify the "num_gpus" parameter in the .json file under the GAN folder
### 3. Launch the program or toolbox
@@ -173,14 +174,14 @@
* Download the aidatatang_200zh dataset and extract it; make sure you can access all audio files (e.g. .wav) in the *train* folder
* Preprocess audio and mel spectrograms:
-`python pre4ppg.py <datasets_root> -d {dataset} -n {number}`
+`python ./control/cli/pre4ppg.py <datasets_root> -d {dataset} -n {number}`
Accepted arguments:
* `-d {dataset}` specifies the dataset; aidatatang_200zh is supported and is the default when omitted
* `-n {number}` specifies the number of parallel processes; an 11700K CPU at 8 processes needs 12 to 18 hours (to be optimized)
> If your downloaded `aidatatang_200zh` sits on drive D and the `train` folder path is `D:\data\aidatatang_200zh\corpus\train`, then your `datasets_root` is `D:\data\`
* Train the synthesizer; note that `ppg2mel.yaml` must be downloaded in the previous step and its paths edited to point to the pretrained folders:
-`python ppg2mel_train.py --config .\ppg2mel\saved_models\ppg2mel.yaml --oneshotvc `
+`python ./control/cli/ppg2mel_train.py --config .\ppg2mel\saved_models\ppg2mel.yaml --oneshotvc `
* To resume the previous training run, pass `--load .\ppg2mel\saved_models\<old_pt_file>` to specify a pretrained model file.
#### 4.2 Launch the toolbox VC mode

README-LINUX-CN.md Normal file

@@ -0,0 +1,223 @@
## Real-Time Voice Cloning - Chinese/Mandarin
![mockingbird](https://user-images.githubusercontent.com/12797292/131216767-6eb251d6-14fc-4951-8324-2722f0cd4c63.jpg)
[![MIT License](https://img.shields.io/badge/license-MIT-blue.svg?style=flat)](http://choosealicense.com/licenses/mit/)
### [English](README.md) | Chinese
### [DEMO VIDEO](https://www.bilibili.com/video/BV17Q4y1B7mY/) | [Wiki Tutorial](https://github.com/babysor/MockingBird/wiki/Quick-Start-(Newbie)) | [Training Tutorial](https://vaj2fgg8yn.feishu.cn/docs/doccn7kAbr3SJz0KM0SIDJ0Xnhd)
## Features
🌍 **Chinese** supports Mandarin and is tested with multiple Chinese datasets: aidatatang_200zh, magicdata, aishell3, biaobei, MozillaCommonVoice, data_aishell, etc.
🤩 **Easy & Awesome** works well with only a downloaded or newly trained synthesizer, reusing the pretrained encoder/vocoder or a real-time HiFi-GAN as the vocoder
🌍 **Webserver Ready** can serve your training results for remote calls.
🤩 **Thanks for everyone's support; this project is starting a new round of updates**
## 1. Quick Start
### 1.1 Recommended environment
- Ubuntu 18.04
- CUDA 11.7 && cuDNN 8.5.0
- Python 3.8 or 3.9
- Pytorch 2.0.1 (cuda-11.7 build)
### 1.2 Environment setup
```shell
# Consider switching to a domestic (CN) mirror before downloading
conda create -n sound python=3.9
conda activate sound
git clone https://github.com/babysor/MockingBird.git
cd MockingBird
pip install -r requirements.txt
pip install webrtcvad-wheels
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
```
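A quick sanity check after the install (a minimal sketch; it assumes the `sound` env created above is active on a machine with a CUDA 11.7-capable driver):

```shell
# Should print the torch version and True if the cu117 build can see the GPU
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```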
### 1.3 Model preparation
> If you really have no hardware, or don't want to debug slowly, you can use models contributed by the community (ongoing sharing is welcome):

| Author | Download link | Preview | Info |
| --- | ----------- | ----- | ----- |
| @author | https://pan.baidu.com/s/1iONvRxmkI-t1nHqxKytY3g [Baidu Drive](https://pan.baidu.com/s/1iONvRxmkI-t1nHqxKytY3g) code: 4j5d | | 75k steps, trained on a mix of 3 open-source datasets
| @author | https://pan.baidu.com/s/1fMh9IlgKJlL2PIiRTYDUvw [Baidu Drive](https://pan.baidu.com/s/1fMh9IlgKJlL2PIiRTYDUvw) code: om7f | | 25k steps, trained on a mix of 3 open-source datasets; switch to tag v0.0.1 to use
|@FawenYo | https://drive.google.com/file/d/1H-YGOUHpmqKxJ9FRc6vAjPuqQki24UbC/view?usp=sharing [Baidu Drive](https://pan.baidu.com/s/1vSYXO4wsLyjnF3Unl-Xoxg) code: 1024 | [input](https://github.com/babysor/MockingBird/wiki/audio/self_test.mp3) [output](https://github.com/babysor/MockingBird/wiki/audio/export.wav) | 200k steps, Taiwanese accent; switch to tag v0.0.1 to use
|@miven| https://pan.baidu.com/s/1PI-hM3sn5wbeChRryX-RCQ code: 2021 | https://www.bilibili.com/video/BV1uh411B7AD/ | 150k steps; note: apply the fix from [issue](https://github.com/babysor/MockingBird/issues/37) and switch to tag v0.0.1
### 1.4 File structure preparation
Prepare the file structure as shown below; the program automatically walks the .pt model files under synthesizer.
```
# Using the first model, pretrained-11-7-21_75k.pt, as an example
└── data
└── ckpt
└── synthesizer
└── pretrained-11-7-21_75k.pt
```
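Starting from a fresh checkout, the structure above can be set up like this (a sketch; the checkpoint filename follows the table in 1.3):

```shell
mkdir -p data/ckpt/synthesizer
mv pretrained-11-7-21_75k.pt data/ckpt/synthesizer/
```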
### 1.5 Run
```
python web.py
```
## 2. Model Training
### 2.1 Data preparation
#### 2.1.1 Data download
``` shell
# aidatatang_200zh
wget https://openslr.elda.org/resources/62/aidatatang_200zh.tgz
```
``` shell
# MAGICDATA
wget https://openslr.magicdatatech.com/resources/68/train_set.tar.gz
wget https://openslr.magicdatatech.com/resources/68/dev_set.tar.gz
wget https://openslr.magicdatatech.com/resources/68/test_set.tar.gz
```
``` shell
# AISHELL-3
wget https://openslr.elda.org/resources/93/data_aishell3.tgz
```
```shell
# Aishell
wget https://openslr.elda.org/resources/33/data_aishell.tgz
```
#### 2.1.2 Batch data extraction
```shell
# Extract every downloaded archive in the current directory (*.tgz included, which *.gz alone would miss)
for gz in *.gz *.tgz; do tar -zxvf "$gz"; done
```
### 2.2 Encoder model training
#### 2.2.1 Data preprocessing:
First add the following at the top of `pre.py`:
```python
import torch
torch.multiprocessing.set_start_method('spawn', force=True)
```
Then preprocess the data with:
```shell
python pre.py <datasets_root> \
-d <datasets_name>
```
`<datasets_root>` is the path of the raw dataset and `<datasets_name>` is the dataset name.
`librispeech_other`, `voxceleb1` and `aidatatang_200zh` are supported; to process several datasets at once, separate the names with commas, as in the sketch below.
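For instance, one pass over two of the supported datasets might look like this (a sketch; `/data` is a hypothetical `<datasets_root>`):

```shell
# Comma-separated dataset names, no spaces
python pre.py /data -d librispeech_other,voxceleb1
```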
#### 2.2.2 Encoder model training
Hyperparameter file: `models/encoder/hparams.py`
```shell
python encoder_train.py <name> \
<datasets_root>/SV2TTS/encoder
```
`<name>` names the artifacts produced by training; set it as you like.
`<datasets_root>` is the dataset path produced by the preprocessing in Step 2.2.1.
#### 2.2.3 Enable visualization of encoder training data (optional)
```shell
visdom
```
### 2.3 Synthesizer model training
#### 2.3.1 Data preprocessing:
```shell
python pre.py <datasets_root> \
-d <datasets_name> \
-o <datasets_path> \
-n <number>
```
`<datasets_root>` is the raw dataset path; if your `aidatatang_200zh` path is `/data/aidatatang_200zh/corpus/train`, then `<datasets_root>` is `/data/`.
`<datasets_name>` is the dataset name.
`<datasets_path>` is the save path for the processed dataset.
`<number>` is the number of processes used during preprocessing; size it to your CPU. A full invocation is sketched below.
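Putting the arguments together (a sketch; all paths are hypothetical and `-n 8` assumes 8 spare cores):

```shell
# Raw corpus under /data, processed output to /data/SV2TTS/synthesizer, 8 worker processes
python pre.py /data -d aidatatang_200zh -o /data/SV2TTS/synthesizer -n 8
```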
#### 2.3.2 Preprocessing newly added data:
```shell
python pre.py <datasets_root> \
-d <datasets_name> \
-o <datasets_path> \
-n <number> \
-s
```
When adding a new dataset, pass `-s` to append to the existing data; without it, the existing data is overwritten.
#### 2.3.3 Synthesizer model training
Hyperparameter file: `models/synthesizer/hparams.py`. Note that `MockingBird/control/cli/synthesizer_train.py` must first be moved to `MockingBird/synthesizer_train.py`.
```shell
python synthesizer_train.py <name> <datasets_path> \
-m <out_dir>
```
`<name>` names the artifacts produced by training; set it as you like.
`<datasets_path>` is the processed dataset path from Step 2.3.1.
`<out_dir>` is the save path for all training outputs; see the sketch below.
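Continuing the hypothetical paths from 2.3.1 (a sketch; the run name `mandarin` is arbitrary):

```shell
# Checkpoints and logs for the run land under /data/ckpt/synthesizer
python synthesizer_train.py mandarin /data/SV2TTS/synthesizer -m /data/ckpt/synthesizer
```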
### 2.4 Vocoder model training
The vocoder has little effect on generation quality; 3 vocoders come preconfigured.
#### 2.4.1 Data preprocessing
```shell
python vocoder_preprocess.py <datasets_root> \
-m <synthesizer_model_path>
```
`<datasets_root>` is your dataset path.
`<synthesizer_model_path>` is the path of the synthesizer model.
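A concrete run might look like this (a sketch; both paths are hypothetical, with the checkpoint taken from Step 2.3.3):

```shell
python vocoder_preprocess.py /data -m /data/ckpt/synthesizer/mandarin
```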
#### 2.4.2 Train the wavernn vocoder:
```
python vocoder_train.py <name> <datasets_root>
```
#### 2.4.3 Train the hifigan vocoder:
```
python vocoder_train.py <name> <datasets_root> hifigan
```
#### 2.4.4 Train the fregan vocoder:
```
python vocoder_train.py <name> <datasets_root> \
--config config.json fregan
```
To switch GAN vocoder training to multi-GPU mode, modify the `num_gpus` parameter in the `.json` file under the `GAN` folder, e.g. with the one-liner below.
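For example, to claim two GPUs (a sketch; the config path is an assumption and depends on which GAN vocoder you train, so point it at the `.json` your run actually loads):

```shell
# Hypothetical path; fregan shown, other GAN vocoders ship their own config
sed -i 's/"num_gpus": *[0-9]*/"num_gpus": 2/' models/vocoder/fregan/config.json
```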
## 3. Acknowledgements
### 3.1 Project acknowledgements
This repository was originally forked from the English-only [Real-Time-Voice-Cloning](https://github.com/CorentinJ/Real-Time-Voice-Cloning); thanks to its author.
### 3.2 Paper acknowledgements
| URL | Designation | Title | Implementation source |
| --- | ----------- | ----- | --------------------- |
| [1803.09017](https://arxiv.org/abs/1803.09017) | GlobalStyleToken (synthesizer)| Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis | This repo |
| [2010.05646](https://arxiv.org/abs/2010.05646) | HiFi-GAN (vocoder)| Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis | This repo |
| [2106.02297](https://arxiv.org/abs/2106.02297) | Fre-GAN (vocoder)| Fre-GAN: Adversarial Frequency-consistent Audio Synthesis | This repo |
|[**1806.04558**](https://arxiv.org/pdf/1806.04558.pdf) | SV2TTS | Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis | This repo |
|[1802.08435](https://arxiv.org/pdf/1802.08435.pdf) | WaveRNN (vocoder) | Efficient Neural Audio Synthesis | [fatchord/WaveRNN](https://github.com/fatchord/WaveRNN) |
|[1703.10135](https://arxiv.org/pdf/1703.10135.pdf) | Tacotron (synthesizer) | Tacotron: Towards End-to-End Speech Synthesis | [fatchord/WaveRNN](https://github.com/fatchord/WaveRNN) |
|[1710.10467](https://arxiv.org/pdf/1710.10467.pdf) | GE2E (encoder)| Generalized End-To-End Loss for Speaker Verification | This repo |
### 3.3 Developer acknowledgements
As practitioners in the AI field, we enjoy not only developing milestone algorithm projects but also sharing them, along with the joy gained during development.
Your use of the project is therefore its greatest endorsement, and when you run into problems you are welcome to leave a message in the issues at any time; your corrections matter greatly for the project's further improvement.
To show our thanks, we record each developer's information and corresponding contributions in this project.
- ------------------------------------------------ Developer Contributions ---------------------------------------------------------------------------------

README.md

@@ -1,9 +1,11 @@
+> 🚧 While I no longer actively update this repo, you can find me continuously pushing this tech forward for good and in the open. I'm also building an optimized, cloud-hosted version: https://noiz.ai/; it's free but not yet ready for commercial use.
+>
![mockingbird](https://user-images.githubusercontent.com/12797292/131216767-6eb251d6-14fc-4951-8324-2722f0cd4c63.jpg)
+<a href="https://trendshift.io/repositories/3869" target="_blank"><img src="https://trendshift.io/api/badge/repositories/3869" alt="babysor%2FMockingBird | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
[![MIT License](https://img.shields.io/badge/license-MIT-blue.svg?style=flat)](http://choosealicense.com/licenses/mit/)
-> English | [中文](README-CN.md)
+> English | [中文](README-CN.md)| [中文Linux](README-LINUX-CN.md)
## Features
🌍 **Chinese** supported mandarin and tested with multiple datasets: aidatatang_200zh, magicdata, aishell3, data_aishell, and etc.
@@ -29,6 +31,7 @@
> If you get an `ERROR: Could not find a version that satisfies the requirement torch==1.9.0+cu102 (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2)` error, it is probably due to a low Python version; try 3.9 and it will install successfully
* Install [ffmpeg](https://ffmpeg.org/download.html#get-packages).
* Run `pip install -r requirements.txt` to install the remaining necessary packages.
+> The recommended environment here is `Repo Tag 0.0.1`, `Pytorch 1.9.0 with Torchvision 0.10.0 and cudatoolkit 10.2`, `requirements.txt`, `webrtcvad-wheels`, because `requirements.txt` was exported a few months ago and doesn't work with newer versions
* Install webrtcvad: `pip install webrtcvad-wheels` (if you need it)
or
@@ -126,7 +129,7 @@ Allowing parameter `--dataset {dataset}` to support aidatatang_200zh, magicdata,
| --- | ----------- | ----- |----- |
| @author | https://pan.baidu.com/s/1iONvRxmkI-t1nHqxKytY3g [Baidu](https://pan.baidu.com/s/1iONvRxmkI-t1nHqxKytY3g) code: 4j5d | | 75k steps trained by multiple datasets
| @author | https://pan.baidu.com/s/1fMh9IlgKJlL2PIiRTYDUvw [Baidu](https://pan.baidu.com/s/1fMh9IlgKJlL2PIiRTYDUvw) code: om7f | | 25k steps trained by multiple datasets, only works under version 0.0.1
-|@FawenYo | https://yisiou-my.sharepoint.com/:u:/g/personal/lawrence_cheng_yisiou_onmicrosoft_com/EWFWDHzee-NNg9TWdKckCc4BC7bK2j9cCbOWn0-_tK0nOg?e=Cc4EFA https://u.teknik.io/AYxWf.pt | [input](https://github.com/babysor/MockingBird/wiki/audio/self_test.mp3) [output](https://github.com/babysor/MockingBird/wiki/audio/export.wav) | 200k steps with local accent of Taiwan, only works under version 0.0.1
+|@FawenYo | https://yisiou-my.sharepoint.com/:u:/g/personal/lawrence_cheng_fawenyo_onmicrosoft_com/EWFWDHzee-NNg9TWdKckCc4BC7bK2j9cCbOWn0-_tK0nOg?e=n0gGgC | [input](https://github.com/babysor/MockingBird/wiki/audio/self_test.mp3) [output](https://github.com/babysor/MockingBird/wiki/audio/export.wav) | 200k steps with local accent of Taiwan, only works under version 0.0.1
|@miven| https://pan.baidu.com/s/1PI-hM3sn5wbeChRryX-RCQ code: 2021 https://www.aliyundrive.com/s/AwPsbo8mcSP code: z2m0 | https://www.bilibili.com/video/BV1uh411B7AD/ | only works under version 0.0.1
#### 2.4 Train vocoder (Optional)


@@ -39,6 +39,9 @@ data_info = {
    }
}
+def should_skip(fpath: Path, skip_existing: bool) -> bool:
+    return skip_existing and fpath.exists()
def preprocess_dataset(datasets_root: Path, out_dir: Path, n_processes: int,
                       skip_existing: bool, hparams, no_alignments: bool,
                       dataset: str, emotion_extract = False, encoder_model_fpath=None):
@@ -99,7 +102,7 @@ def preprocess_dataset(datasets_root: Path, out_dir: Path, n_processes: int,
    print("Max mel frames length: %d" % max(int(m[4]) for m in metadata))
    print("Max audio timesteps length: %d" % max(int(m[3]) for m in metadata))
-def embed_utterance(fpaths, encoder_model_fpath):
+def _embed_utterance(fpaths: str, encoder_model_fpath: str):
    if not encoder.is_loaded():
        encoder.load_model(encoder_model_fpath)
@@ -110,15 +113,13 @@ def embed_utterance(fpaths, encoder_model_fpath):
    embed = encoder.embed_utterance(wav)
    np.save(embed_fpath, embed, allow_pickle=False)
-def _emo_extract_from_utterance(fpaths, hparams, skip_existing=False):
-    if skip_existing and fpaths.exists():
-        return
+def _emo_extract_from_utterance(fpaths, hparams):
    wav_fpath, emo_fpath = fpaths
    wav = np.load(wav_fpath)
    emo = extract_emo(np.expand_dims(wav, 0), hparams.sample_rate, True)
    np.save(emo_fpath, emo.squeeze(0), allow_pickle=False)
-def create_embeddings(synthesizer_root: Path, encoder_model_fpath: Path, n_processes: int):
+def create_embeddings(synthesizer_root: Path, encoder_model_fpath: Path, n_processes: int, skip_existing: bool):
    wav_dir = synthesizer_root.joinpath("audio")
    metadata_fpath = synthesizer_root.joinpath("train.txt")
    assert wav_dir.exists() and metadata_fpath.exists()
@@ -128,11 +129,11 @@ def create_embeddings(synthesizer_root: Path, encoder_model_fpath: Path, n_proce
    # Gather the input wave filepath and the target output embed filepath
    with metadata_fpath.open("r", encoding="utf-8") as metadata_file:
        metadata = [line.split("|") for line in metadata_file]
-        fpaths = [(wav_dir.joinpath(m[0]), embed_dir.joinpath(m[2])) for m in metadata]
+        fpaths = [(wav_dir.joinpath(m[0]), embed_dir.joinpath(m[2])) for m in metadata if not should_skip(embed_dir.joinpath(m[2]), skip_existing)]
    # TODO: improve on the multiprocessing, it's terrible. Disk I/O is the bottleneck here.
    # Embed the utterances in separate threads
-    func = partial(embed_utterance, encoder_model_fpath=encoder_model_fpath)
+    func = partial(_embed_utterance, encoder_model_fpath=encoder_model_fpath)
    job = Pool(n_processes).imap(func, fpaths)
    tuple(tqdm(job, "Embedding", len(fpaths), unit="utterances"))
@@ -142,14 +143,14 @@ def create_emo(synthesizer_root: Path, n_processes: int, skip_existing: bool, hp
    assert wav_dir.exists() and metadata_fpath.exists()
    emo_dir = synthesizer_root.joinpath("emo")
    emo_dir.mkdir(exist_ok=True)
    # Gather the input wave filepath and the target output embed filepath
    with metadata_fpath.open("r", encoding="utf-8") as metadata_file:
        metadata = [line.split("|") for line in metadata_file]
-        fpaths = [(wav_dir.joinpath(m[0]), emo_dir.joinpath(m[0].replace("audio-", "emo-"))) for m in metadata]
+        fpaths = [(wav_dir.joinpath(m[0]), emo_dir.joinpath(m[0].replace("audio-", "emo-"))) for m in metadata if not should_skip(emo_dir.joinpath(m[0].replace("audio-", "emo-")), skip_existing)]
    # TODO: improve on the multiprocessing, it's terrible. Disk I/O is the bottleneck here.
    # Embed the utterances in separate threads
-    func = partial(_emo_extract_from_utterance, hparams=hparams, skip_existing=skip_existing)
+    func = partial(_emo_extract_from_utterance, hparams=hparams)
    job = Pool(n_processes).imap(func, fpaths)
    tuple(tqdm(job, "Emo", len(fpaths), unit="utterances"))

preprocess_audio.py

@@ -45,7 +45,7 @@ def extract_emo(
    return y
def _process_utterance(wav: np.ndarray, text: str, out_dir: Path, basename: str,
-                       skip_existing: bool, hparams, encoder_model_fpath):
+                       mel_fpath: str, wav_fpath: str, hparams, encoder_model_fpath):
    ## FOR REFERENCE:
    # For you not to lose your head if you ever wish to change things here or implement your own
    # synthesizer.
@@ -58,13 +58,6 @@ def _process_utterance(wav: np.ndarray, text: str, out_dir: Path, basename: str,
    # without extra padding. This means that you won't have an exact relation between the length
    # of the wav and of the mel spectrogram. See the vocoder data loader.
-    # Skip existing utterances if needed
-    mel_fpath = out_dir.joinpath("mels", "mel-%s.npy" % basename)
-    wav_fpath = out_dir.joinpath("audio", "audio-%s.npy" % basename)
-    if skip_existing and mel_fpath.exists() and wav_fpath.exists():
-        return None
    # Trim silence
    if hparams.trim_silence:
        if not encoder.is_loaded():
@@ -112,50 +105,27 @@ def _split_on_silences(wav_fpath, words, hparams):
def preprocess_general(speaker_dir, out_dir: Path, skip_existing: bool, hparams, dict_info, no_alignments: bool, encoder_model_fpath: Path):
    metadata = []
    extensions = ("*.wav", "*.flac", "*.mp3")
-    if skip_existing:
-        for extension in extensions:
-            wav_fpath_list = speaker_dir.glob(extension)
-            # Iterate over each wav
-            for wav_fpath in wav_fpath_list:
-                words = dict_info.get(wav_fpath.name.split(".")[0])
-                if not words:
-                    words = dict_info.get(wav_fpath.name) # try with extension
-                if not words:
-                    print("no wordS")
-                    continue
-                sub_basename = "%s_%02d" % (wav_fpath.name, 0)
-                mel_fpath = out_dir.joinpath("mels", f"mel-{sub_basename}.npy")
-                wav_fpath_ = out_dir.joinpath("audio", f"audio-{sub_basename}.npy")
-                if mel_fpath.exists() and wav_fpath_.exists():
-                    continue
-                wav, text = _split_on_silences(wav_fpath, words, hparams)
-                result = _process_utterance(wav, text, out_dir, sub_basename,
-                                            False, hparams, encoder_model_fpath) # accelarate
-                if result is None:
-                    continue
-                wav_fpath_name, mel_fpath_name, embed_fpath_name, wav, mel_frames, text = result
-                metadata.append ((wav_fpath_name, mel_fpath_name, embed_fpath_name, len(wav), mel_frames, text))
-    else:
-        for extension in extensions:
-            wav_fpath_list = speaker_dir.glob(extension)
-            # Iterate over each wav
-            for wav_fpath in wav_fpath_list:
-                words = dict_info.get(wav_fpath.name.split(".")[0])
-                if not words:
-                    words = dict_info.get(wav_fpath.name) # try with extension
-                if not words:
-                    print("no wordS")
-                    continue
-                sub_basename = "%s_%02d" % (wav_fpath.name, 0)
-                wav, text = _split_on_silences(wav_fpath, words, hparams)
-                result = _process_utterance(wav, text, out_dir, sub_basename,
-                                            False, hparams, encoder_model_fpath)
-                if result is None:
-                    continue
-                wav_fpath_name, mel_fpath_name, embed_fpath_name, wav, mel_frames, text = result
-                metadata.append ((wav_fpath_name, mel_fpath_name, embed_fpath_name, len(wav), mel_frames, text))
+    for extension in extensions:
+        wav_fpath_list = speaker_dir.glob(extension)
+        # Iterate over each wav
+        for wav_fpath in wav_fpath_list:
+            words = dict_info.get(wav_fpath.name.split(".")[0])
+            if not words:
+                words = dict_info.get(wav_fpath.name) # try with extension
+            if not words:
+                print(f"No word found in dict_info for {wav_fpath.name}, skip it")
+                continue
+            sub_basename = "%s_%02d" % (wav_fpath.name, 0)
+            mel_fpath_out = out_dir.joinpath("mels", f"mel-{sub_basename}.npy")
+            wav_fpath_out = out_dir.joinpath("audio", f"audio-{sub_basename}.npy")
+            if skip_existing and mel_fpath_out.exists() and wav_fpath_out.exists():
+                continue
+            wav, text = _split_on_silences(wav_fpath, words, hparams)
+            result = _process_utterance(wav, text, out_dir, sub_basename, mel_fpath_out, wav_fpath_out, hparams, encoder_model_fpath)
+            if result is None:
+                continue
+            wav_fpath_name, mel_fpath_name, embed_fpath_name, wav, mel_frames, text = result
+            metadata.append ((wav_fpath_name, mel_fpath_name, embed_fpath_name, len(wav), mel_frames, text))
    return metadata

pre.py

@@ -71,7 +71,7 @@ if __name__ == "__main__":
    del args.n_processes_embed
    preprocess_dataset(**vars(args))
-    create_embeddings(synthesizer_root=args.out_dir, n_processes=n_processes_embed, encoder_model_fpath=encoder_model_fpath)
+    create_embeddings(synthesizer_root=args.out_dir, n_processes=n_processes_embed, encoder_model_fpath=encoder_model_fpath, skip_existing=args.skip_existing)
    if args.emotion_extract:
        create_emo(synthesizer_root=args.out_dir, n_processes=n_processes_embed, skip_existing=args.skip_existing, hparams=args.hparams)

requirements.txt

@@ -2,7 +2,8 @@ umap-learn
visdom
librosa
matplotlib>=3.3.0
-numpy
+numpy==1.19.3; platform_system == "Windows"
+numpy==1.20.3; platform_system != "Windows"
scipy>=1.0.0
tqdm
sounddevice
@@ -12,8 +13,8 @@ inflect
PyQt5
multiprocess
numba
-webrtcvad
-pypinyin
+webrtcvad; platform_system != "Windows"
+pypinyin==0.44.0
flask
flask_wtf
flask_cors
@@ -25,9 +26,9 @@ PyYAML
torch_complex
espnet
PyWavelets
-monotonic-align==0.0.3
-transformers
fastapi
loguru
-typer[all]
-click
+click==8.0.4
+typer
+monotonic-align==1.0.0
+transformers