118 Commits

Author SHA1 Message Date
Vega
d889235518 Update README.md 2024-11-01 12:54:16 +08:00
Bob Conan
42789babd8 Update README.md, fix a typo (#1007) 2024-10-22 10:21:44 +08:00
Vega
2354bb42d1 Update README.md (#1005) 2024-10-16 22:48:15 +08:00
Vega
4358f6f353 Update README.md 2024-08-29 17:52:56 +08:00
xxxxx
5971555319 Update requirements.txt (#747)
Ubuntu 20.04.1 CUDA 11.3: missing dependencies and dependency conflicts

Co-authored-by: Vega <babysor00@gmail.com>
2024-08-22 15:06:40 +08:00
Emma Thompson
6f84026c51 Env update: add environment requirement notes (#660)
* Update Readme Doc

Add environment requirement notes

* Update Readme Doc

Add environmental requirement notes

---------

Co-authored-by: Limingrui0 <65227354+Limingrui0@users.noreply.github.com>
2024-07-06 10:13:09 +08:00
Terminal
a30657ecf5 fix:preprocess_audio.py--The .npy file failed to save (#988) 2024-07-06 10:12:36 +08:00
Terminal
cc250af1f6 fix requirements monotonic-align error (#989) 2024-07-06 10:12:06 +08:00
Vega
156723e37c Skip embedding (#950)
* Skip embedding

* Skip earlier

* Remove unused parameter

* Pass param
2023-09-05 23:15:04 +08:00
Vega
1862d2145b Merge pull request #953 from babysor/babysor-patch-3
Update README.md
2023-08-31 11:42:15 +08:00
Vega
72a22d448b Update README.md 2023-08-31 11:42:05 +08:00
Vega
98d38d84c3 Merge pull request #952 from SeaTidesPro/main
add readme-linux-zh
2023-08-31 11:41:10 +08:00
Tide
7ab86c6f4c Update README-LINUX-CN.md 2023-08-30 14:41:45 +08:00
Tide
ab79881480 Update README-LINUX-CN.md 2023-08-30 14:40:30 +08:00
Tide
fd93b40398 Update README-LINUX-CN.md 2023-08-30 14:35:34 +08:00
Tide
dbf01347fc Update README-LINUX-CN.md 2023-08-30 14:35:12 +08:00
Tide
28f9173dfa Update README-LINUX-CN.md 2023-08-30 14:34:20 +08:00
Tide
d073e1f349 Update README-LINUX-CN.md 2023-08-30 14:24:05 +08:00
Tide
baa8b5005d Update README-LINUX-CN.md 2023-08-30 14:05:40 +08:00
Tide
d54f4fb631 Update README-LINUX-CN.md 2023-08-30 13:18:33 +08:00
Tide
7353888d35 Create README-LINUX-CN.md 2023-08-30 12:20:29 +08:00
Vega
e9ce943f6c Merge pull request #947 from FawenYo/doc/update_link
📝 Update model download link
2023-08-11 22:02:41 +08:00
FawenYo
77c145328c 📝 Update model download link 2023-08-11 14:31:39 +08:00
Vega
3bce6bbbe7 Merge pull request #945 from babysor/babysor-patch-1
Update README.md
2023-08-10 15:54:23 +08:00
Vega
9dd8ea11e5 Merge pull request #944 from babysor/babysor-patch-2
Update README-CN.md
2023-08-10 15:53:52 +08:00
Vega
5a0d77e699 Update README-CN.md 2023-08-10 15:53:42 +08:00
Vega
4914fc0776 Update README.md 2023-08-10 15:50:21 +08:00
Vega
8f95faa0d3 Merge pull request #914 from cloudxu/readme_update
Removing ongoing work session in README
2023-06-15 17:39:13 +08:00
Cloud Xu
facc97e236 removing ongoing work 2023-06-15 17:00:05 +08:00
Vega
3d1e3dc542 Merge pull request #892 from FawenYo/main
Update FawenYo's shared model
2023-06-10 10:25:04 +08:00
Vega
536ae8899c Merge pull request #908 from 0warning0error/main
Some changes to make it easier to install the dependencies
2023-06-10 10:24:39 +08:00
0warning0error
9f1dbeeecc Some changes to make it easier to install the dependencies 2023-06-03 00:07:36 +08:00
FawenYo
d3bdf816db Update FawenYo's shared model 2023-05-06 14:37:44 +08:00
Vega
b78d0d2a26 Merge pull request #782 from Nier-Y/main
Update README.md and README-CN.md
2023-03-07 16:41:48 +08:00
babysor00
5c17fc8bb0 add pretrained 2023-02-18 09:31:05 +08:00
babysor00
3ce874ab46 Fix issue for training and preprocessing 2023-02-10 20:34:01 +08:00
babysor00
beec0b93ed Fix issues 2023-02-04 17:00:49 +08:00
Vega
9d67b757f0 Merge pull request #822 from babysor/restruct-project
Restruct project
2023-02-04 14:37:48 +08:00
Vega
26331fe019 Merge branch 'main' into restruct-project 2023-02-04 14:22:37 +08:00
babysor00
712a53f557 Add vits 2023-02-04 14:13:38 +08:00
babysor00
24cb262c3f remove used files 2023-02-01 20:16:06 +08:00
babysor00
e469bd06ae init 2023-02-01 19:59:15 +08:00
神楽坂·喵
cd20d21f3d add docker support (#802)
* add docker support

* Fix training-set extraction issue

* Fix web.py startup issue

* Training parameters for mixed datasets
2022-12-16 11:16:25 +08:00
babysor00
74a3fc97d0 Refactor Project to 3 parts: Models, Control, Data
Need readme
2022-12-03 16:54:06 +08:00
李子
b402f9dbdf Fix the web UI VC mode not working; fix spelling errors and add missing f-string prefixes (#790) 2022-11-30 15:32:52 +08:00
yan
d1ba355c5f Adding English version M1 Mac Setup 2022-11-18 00:36:42 +08:00
yan
83d95c6c81 Add M1 Mac environment setup 2022-11-17 23:48:52 +08:00
wei-z
85a53c9e05 update aidatatang_200zh folders (#743)
Co-authored-by: wei-z-git <wei-z>
2022-10-15 11:45:51 +08:00
xxxxx
028b131570 Update streamlit_ui.py (#748) 2022-10-15 11:44:44 +08:00
Dong
2a1890f9e1 Update README-CN.md (#751)
Fix checkbox formatting in the docs
2022-10-15 11:44:24 +08:00
wei-z
c91bc3208e Add an error prompt when specifying a dataset with -d (#741)
* Add an error prompt when specifying a dataset with -d

Warning: you do not have any of the recognized datasets in G:\AI\Dataset\aidatatang_200zh\aidatatang_200zh 
Please note use 'E:\datasets' as root path instead of 'E:\datasetsidatatang_200zh\corpus/test' as a example .
The recognized datasets are:

* Update ui.py

* Update ui.py
2022-09-14 21:37:54 +08:00
XCwosjw
dd1ea3e714 Correct the wrong CPU model (#715)
11770k😂😂😂
2022-09-10 23:56:01 +08:00
Vega
e7313c514f Refactor (#705)
* Refactor model

* Add description for

* update launch json

* Fix #657

* Avoid recursive calls of web ui for M1
2022-08-12 23:13:57 +08:00
Xu Meng
f57d1a69b6 Translate update README-CN.md (#698)
Fix: Traditional Chinese to Simplified Chinese
2022-08-06 23:51:34 +08:00
Vega
ab7d692619 Refactor (#663)
* Refactor model

* Add description for

* update launch json

* Fix #657
2022-07-19 23:43:51 +08:00
Vega
f17e3b04e1 Refactor (#650)
* Refactor model

* Add description for

* update launch json
2022-07-17 14:27:45 +08:00
Vega
6abdd0ebf0 Refactor (#649)
* Refactor model

* Refactor and fix bug to save plots
2022-07-17 09:58:17 +08:00
wenqingl
400a7207e3 Update README.md (#640) 2022-07-14 17:41:00 +08:00
babysor00
6f023e313d Add web gui of training and reconstruct taco model methods 2022-06-26 23:21:32 +08:00
babysor00
a39b6d3117 Remove breaking import for Macos 2022-06-26 11:56:50 +08:00
babysor00
885225045d Fix compatibility - mac + linux 2022-06-25 20:17:06 +08:00
babysor00
ee643d7cbc Fix compatibility issue 2022-06-18 23:46:44 +08:00
flysmart
6a793cea84 Added missing files for Fre-GAN (#579)
* The new vocoder Fre-GAN is now supported

* Improved some fregan details

* Fixed the problem that the existing model could not be loaded to continue training when training GAN

* Updated reference papers

* GAN training now supports DistributedDataParallel (DDP)

* Added requirements.txt

* GAN training uses single card training by default

* Added note about GAN vocoder training with multiple GPUs

* Added missing files for Fre-GAN
2022-05-25 23:29:59 +08:00
Evers
7317ba5ffe add gen_voice.py for handle by python command instead of demon_tool gui. (#560) 2022-05-22 16:28:58 +08:00
flysmart
05f886162c GAN training now supports DistributedDataParallel (DDP) (#558)
* The new vocoder Fre-GAN is now supported

* Improved some fregan details

* Fixed the problem that the existing model could not be loaded to continue training when training GAN

* Updated reference papers

* GAN training now supports DistributedDataParallel (DDP)

* Added requirements.txt

* GAN training uses single card training by default

* Added note about GAN vocoder training with multiple GPUs
2022-05-22 16:24:50 +08:00
babysor00
e726c2eb12 Merge branch 'main' of https://github.com/babysor/Realtime-Voice-Clone-Chinese 2022-05-15 16:09:09 +08:00
babysor00
c00474525a Fix nits of file path 2022-05-15 16:08:58 +08:00
flysmart
350b190662 Solved the problem that the existing model could not be loaded when training the GAN model (#549)
* The new vocoder Fre-GAN is now supported

* Improved some fregan details

* Fixed the problem that the existing model could not be loaded to continue training when training GAN

* Updated reference papers
2022-05-13 13:41:03 +08:00
flysmart
0caed984e3 The new vocoder Fre-GAN is now supported (#546)
* The new vocoder Fre-GAN is now supported

* Improved some fregan details
2022-05-12 12:27:17 +08:00
Vega
c5d03fb3cb Upgrade to new web service (#529)
* Init new GUI

* Remove unused codes

* Reset layout

* Add samples

* Make framework to support multiple pages

* Add vc mode

* Add preprocessing mode

* Add training mode

* Remove text input in vc mode

* Add entry for GUI and revise readme

* Move requirement together

* Add error raise when no model folder found

* Add readme
2022-05-09 18:44:02 +08:00
babysor00
7f799d322f Tell the hifigan type of a vocoder model by searching full text 2022-04-30 10:31:01 +08:00
LZY
a1f2e4a790 Update README-CN.md (#523) 2022-04-28 16:16:37 +08:00
Lix Zhou
b136f80f43 Add an aliyunpan download link (#505)
Baidu Yun Pan is so fxxking slow
2022-04-26 21:20:39 +08:00
Moose W. Oler
f082a82420 fix issue #496 (#498)
pass `wav`, `sampling_rate` (in encoder/audio.py line 59) as keyword args instead of positional args to prevent warning messages from occasionally messing up console output when adopting librosa 0.9.1.
2022-04-11 17:26:52 +08:00
babysor00
7f0d983da7 Merge branch 'main' of https://github.com/babysor/Realtime-Voice-Clone-Chinese 2022-04-02 18:11:52 +08:00
babysor00
0353bfc6e6 New GUI in order to combine web and toolbox in future 2022-04-02 18:11:49 +08:00
Vega
9ec114a7c1 Create FUNDING.yml 2022-04-02 10:16:02 +08:00
1itt1eB0y
ddf612e87c Update README-CN.md (#470)
Fix a minor translation issue
2022-03-24 12:52:47 +08:00
babysor00
374cc89cfa Fix web generate with rnn bug 2022-03-19 12:16:55 +08:00
babysor00
6009da7072 Merge branch 'main' of https://github.com/babysor/Realtime-Voice-Clone-Chinese 2022-03-19 12:14:24 +08:00
babysor00
1c61a601d1 Remove dependency of pyworld for non-vc mode 2022-03-19 12:14:21 +08:00
Vega
02ee514aa3 Update issue templates 2022-03-12 19:28:03 +08:00
babysor00
6532d65153 Add default param for inferencing of vocoder 2022-03-12 19:14:01 +08:00
babysor00
3fe0690cc6 Update README-CN.md 2022-03-09 09:39:24 +08:00
babysor00
79f424d614 Add default path for hifi 2022-03-08 09:17:56 +08:00
babysor00
3c97d22938 Fix bug when searching for vocoder 2022-03-07 20:08:29 +08:00
babysor00
fc26c38152 Fix compatibility issue 2022-03-07 17:05:22 +08:00
babysor00
6c01b92703 Fix bug introduced by config file reading 2022-03-06 09:35:25 +08:00
babysor00
c36f02634a Add link and separate requirement file 2022-03-05 12:20:09 +08:00
Vega
b05e7441ff Fix nit in readme 2022-03-05 00:55:08 +08:00
Vega
693de98f4d Add instruction image 2022-03-05 00:54:31 +08:00
Vega
252a5e11b3 Ppg vc init (#421)
* Init  ppg extractor and ppg2mel

* add preprocess and training

* FIx known issues

* Update __init__.py

Allow to gen audio

* Fix length issue

* Fix bug of preparing fid

* Fix sample issues

* Add UI usage of PPG-vc

* Add readme
2022-03-05 00:52:36 +08:00
Vega
b617a87ee4 Init ppg extractor and ppg2mel (#375)
* Init  ppg extractor and ppg2mel

* add preprocess and training

* FIx known issues

* Update __init__.py

Allow to gen audio

* Fix length issue

* Fix bug of preparing fid

* Fix sample issues

* Add UI usage of PPG-vc
2022-03-03 23:38:12 +08:00
AyahaShirane
ad22997614 fixed the issues #372 (#379)
Fixed some issues caused by parameter passing and replaced the deprecated torch.nn.functional.tanh() with torch.tanh()
2022-02-27 11:02:01 +08:00
hertz
9e072c2619 Hifigan Support train from existed checkpoint. (#389)
* 1k steps to save tmp hifigan model

* hifigan support train from existed ckpt
2022-02-27 11:01:47 +08:00
Alex Newton
b79e9d68e4 Remove an extra None caused by consecutive line breaks (#405)
Minor issue; the GUI doesn't seem to have it. Found it while testing the web interface by calling the function directly.
2022-02-27 10:55:00 +08:00
babysor00
0536874dec Add config file for pretrained 2022-02-23 09:37:39 +08:00
李子
4529479091 Pin the librosa version (#378)
* Support the data_aishell (SLR33) dataset

* Update readme

* Pin the librosa version
2022-02-10 20:47:26 +08:00
babysor00
8ad9ba2b60 change naming logic of saving trained file for synthesizer to allow shorter interval 2022-01-15 17:56:14 +08:00
D-Blue
b56ec5ee1b Fix a UserWarning (#273)
Fix a UserWarning in synthesizer/synthesizer_dataset.py, because of converting list of numpy array to torch tensor at Ln.85.
2021-12-20 20:33:12 +08:00
CrystalRays
0bc34a5bc9 Fix TypeError at line 459 in toolbox/ui.py when both PySide6(PyQt6) and PyQt5 installed (#255)
### Error Info Screenshot
![](https://cdn.jsdelivr.net/gh/CrystalRays/CDN@main/img/16389623959301638962395845.png)

### Error Reason
Matplotlib's backends/qt_compat.py decides which Qt binding to use based on sys.modules first, then the environment variable, and finally the fixed order PyQt6, PySide6, PyQt5, PySide2. Because ui.py imports matplotlib before PyQt5, PyQt5 is not yet in sys.modules, so matplotlib picks PyQt6 or PySide6 first if either is installed.
2021-12-15 12:41:10 +08:00
Wings Music
875fe15069 Update readme for training encoder (#250) 2021-12-07 19:10:29 +08:00
zzxiang
4728863f9d Fix inference on cpu device (#241) 2021-11-29 21:10:07 +08:00
hertz
a4daf42868 1k steps to save tmp hifigan model (#240) 2021-11-29 21:09:54 +08:00
harian
b50c7984ab tacotron.py-Multi GPU with DataParallel (#231) 2021-11-27 20:53:08 +08:00
babysor00
26fe4a047d Differentiate GST token 2021-11-18 22:55:13 +08:00
babysor00
aff1b5313b Order of declared pytorch module matters 2021-11-17 00:12:27 +08:00
babysor
7dca74e032 Change default to use speaker embed for reference 2021-11-13 10:57:45 +08:00
babysor00
a37b26a89c Compatibility enhancement of pretrained models and code base #209 2021-11-10 23:23:13 +08:00
babysor00
902e1eb537 Merge branch 'main' of https://github.com/babysor/Realtime-Voice-Clone-Chinese 2021-11-09 21:08:33 +08:00
babysor00
5c0e53a29a Fix #205 2021-11-09 21:08:28 +08:00
DragonDreamer
4edebdfeba Fix a comment error in synthesizer/models/tacotron.Encoder (#203)
fix Issue#202
2021-11-09 13:59:19 +08:00
babysor00
6c8f3f4515 Allow to select vocoder in web 2021-11-08 23:55:16 +08:00
babysor00
2bd323b7df Update readme 2021-11-07 21:59:03 +08:00
babysor00
3674d8b5c6 Use speaker embedding anyway even with default style 2021-11-07 21:48:15 +08:00
babysor00
80aaf32164 Add max steps control in toolbox 2021-11-06 13:27:11 +08:00
babysor00
c396792b22 Upload new models 2021-10-27 20:19:50 +08:00
babysor00
7c58fe01d1 Concat GST output instead of adding directly with original output 2021-10-23 10:28:32 +08:00
218 changed files with 36478 additions and 28854 deletions

4
.dockerignore Normal file

@@ -0,0 +1,4 @@
*/saved_models
!vocoder/saved_models/pretrained/**
!encoder/saved_models/pretrained.pt
/datasets

1
.github/FUNDING.yml vendored Normal file

@@ -0,0 +1 @@
github: babysor

17
.github/ISSUE_TEMPLATE/issue.md vendored Normal file

@@ -0,0 +1,17 @@
---
name: Issue
about: Create a report to help us improve
title: ''
labels: ''
assignees: ''
---
**Summary[问题简述(一句话)]**
A clear and concise description of what the issue is.
**Env & To Reproduce[复现与环境]**
描述你用的环境、代码版本、模型
**Screenshots[截图(如有)]**
If applicable, add screenshots to help

17
.gitignore vendored

@@ -13,11 +13,14 @@
*.bbl
*.bcf
*.toc
*.wav
*.sh
synthesizer/saved_models/*
vocoder/saved_models/*
encoder/saved_models/*
cp_hifigan/*
!vocoder/saved_models/pretrained/*
!encoder/saved_models/pretrained.pt
data/ckpt/*/*
!data/ckpt/encoder/pretrained.pt
!data/ckpt/vocoder/pretrained/
wavs
log
!/docker-entrypoint.sh
!/datasets_download/*.sh
/datasets
monotonic_align/build
monotonic_align/monotonic_align

38
.vscode/launch.json vendored

@@ -15,7 +15,8 @@
"name": "Python: Vocoder Preprocess",
"type": "python",
"request": "launch",
"program": "vocoder_preprocess.py",
"program": "control\\cli\\vocoder_preprocess.py",
"cwd": "${workspaceFolder}",
"console": "integratedTerminal",
"args": ["..\\audiodata"]
},
@@ -23,7 +24,8 @@
"name": "Python: Vocoder Train",
"type": "python",
"request": "launch",
"program": "vocoder_train.py",
"program": "control\\cli\\vocoder_train.py",
"cwd": "${workspaceFolder}",
"console": "integratedTerminal",
"args": ["dev", "..\\audiodata"]
},
@@ -32,16 +34,44 @@
"type": "python",
"request": "launch",
"program": "demo_toolbox.py",
"cwd": "${workspaceFolder}",
"console": "integratedTerminal",
"args": ["-d","..\\audiodata"]
},
{
"name": "Python: Demo Box VC",
"type": "python",
"request": "launch",
"program": "demo_toolbox.py",
"cwd": "${workspaceFolder}",
"console": "integratedTerminal",
"args": ["-d","..\\audiodata","-vc"]
},
{
"name": "Python: Synth Train",
"type": "python",
"request": "launch",
"program": "synthesizer_train.py",
"program": "train.py",
"console": "integratedTerminal",
"args": ["my_run", "..\\"]
"args": ["--type", "vits"]
},
{
"name": "Python: PPG Convert",
"type": "python",
"request": "launch",
"program": "run.py",
"console": "integratedTerminal",
"args": ["-c", ".\\ppg2mel\\saved_models\\seq2seq_mol_ppg2mel_vctk_libri_oneshotvc_r4_normMel_v2.yaml",
"-m", ".\\ppg2mel\\saved_models\\best_loss_step_304000.pth", "--wav_dir", ".\\wavs\\input", "--ref_wav_path", ".\\wavs\\pkq.mp3", "-o", ".\\wavs\\output\\"
]
},
{
"name": "Python: Vits Train",
"type": "python",
"request": "launch",
"program": "train.py",
"console": "integratedTerminal",
"args": ["--type", "vits"]
},
]
}

17
Dockerfile Normal file

@@ -0,0 +1,17 @@
FROM pytorch/pytorch:latest
RUN apt-get update && apt-get install -y build-essential ffmpeg parallel aria2 && apt-get clean
COPY ./requirements.txt /workspace/requirements.txt
RUN pip install -r requirements.txt && pip install webrtcvad-wheels
COPY . /workspace
VOLUME [ "/datasets", "/workspace/synthesizer/saved_models/" ]
ENV DATASET_MIRROR=default FORCE_RETRAIN=false TRAIN_DATASETS=aidatatang_200zh\ magicdata\ aishell3\ data_aishell TRAIN_SKIP_EXISTING=true
EXPOSE 8080
ENTRYPOINT [ "/workspace/docker-entrypoint.sh" ]

README-CN.md

@@ -18,70 +18,141 @@
🌍 **Webserver Ready** 可伺服你的训练结果,供远程调用
## 开始
### 1. 安装要求
#### 1.1 通用配置
> 按照原始存储库测试您是否已准备好所有环境。
**Python 3.7 或更高版本** 需要运行工具箱
运行工具箱(demo_toolbox.py)需要 **Python 3.7 或更高版本**
* 安装 [PyTorch](https://pytorch.org/get-started/locally/)。
> 如果在用 pip 方式安装的时候出现 `ERROR: Could not find a version that satisfies the requirement torch==1.9.0+cu102 (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2)` 这个错误,可能是 python 版本过低,3.9 可以安装成功
* 安装 [ffmpeg](https://ffmpeg.org/download.html#get-packages)。
* 运行`pip install -r requirements.txt` 来安装剩余的必要包。
> 这里的环境建议使用 `Repo Tag 0.0.1` `Pytorch1.9.0 with Torchvision0.10.0 and cudatoolkit10.2` `requirements.txt` `webrtcvad-wheels`,因为 `requirements.txt` 是在几个月前导出的,所以不适配新版本
* 安装 webrtcvad `pip install webrtcvad-wheels`
或者
- 用 `conda` 或者 `mamba` 安装依赖
```conda env create -n env_name -f env.yml```
```mamba env create -n env_name -f env.yml```
会创建新环境安装必须的依赖. 之后用 `conda activate env_name` 切换环境就完成了.
> env.yml只包含了运行时必要的依赖,暂时不包括monotonic-align,如果想要装GPU版本的pytorch可以查看官网教程。
#### 1.2 M1芯片Mac环境配置(Inference Time)
> 以下环境按x86-64搭建,使用原生的`demo_toolbox.py`,可作为在不改代码情况下快速使用的workaround。
>
> 如需使用M1芯片训练,因`demo_toolbox.py`依赖的`PyQt5`不支持M1,则应按需修改代码,或者尝试使用`web.py`。
* 安装`PyQt5`,参考[这个链接](https://stackoverflow.com/a/68038451/20455983)
* 用Rosetta打开Terminal,参考[这个链接](https://dev.to/courier/tips-and-tricks-to-setup-your-apple-m1-for-development-547g)
* 用系统Python创建项目虚拟环境
```
/usr/bin/python3 -m venv /PathToMockingBird/venv
source /PathToMockingBird/venv/bin/activate
```
* 升级pip并安装`PyQt5`
```
pip install --upgrade pip
pip install pyqt5
```
* 安装`pyworld`和`ctc-segmentation`
> 这里两个文件直接`pip install`的时候找不到wheel,尝试从c里build时找不到`Python.h`报错
* 安装`pyworld`
* `brew install python` 通过brew安装python时会自动安装`Python.h`
* `export CPLUS_INCLUDE_PATH=/opt/homebrew/Frameworks/Python.framework/Headers` 对于M1,brew安装`Python.h`到上述路径。把路径添加到环境变量里
* `pip install pyworld`
* 安装`ctc-segmentation`
> 因上述方法没有成功,选择从[github](https://github.com/lumaku/ctc-segmentation) clone源码手动编译
* `git clone https://github.com/lumaku/ctc-segmentation.git` 克隆到任意位置
* `cd ctc-segmentation`
* `source /PathToMockingBird/venv/bin/activate` 假设一开始未开启,先打开MockingBird项目的虚拟环境
* `cythonize -3 ctc_segmentation/ctc_segmentation_dyn.pyx`
* `/usr/bin/arch -x86_64 python setup.py build` 要注意明确用x86-64架构编译
* `/usr/bin/arch -x86_64 python setup.py install --optimize=1 --skip-build` 用x86-64架构安装
* 安装其他依赖
* `/usr/bin/arch -x86_64 pip install torch torchvision torchaudio` 这里用pip安装`PyTorch`,明确架构是x86
* `pip install ffmpeg` 安装ffmpeg
* `pip install -r requirements.txt`
* 运行
> 参考[这个链接](https://youtrack.jetbrains.com/issue/PY-46290/Allow-running-Python-under-Rosetta-2-in-PyCharm-for-Apple-Silicon)
让项目跑在x86架构环境上
* `vim /PathToMockingBird/venv/bin/pythonM1`
* 写入以下代码
```
#!/usr/bin/env zsh
mydir=${0:a:h}
/usr/bin/arch -x86_64 $mydir/python "$@"
```
* `chmod +x pythonM1` 设为可执行文件
* 如果使用PyCharm,则把Interpreter指向`pythonM1`,否则也可命令行运行`/PathToMockingBird/venv/bin/pythonM1 demo_toolbox.py`
### 2. 准备预训练模型
考虑训练您自己专属的模型或者下载社区他人训练好的模型:
> 近期创建了[知乎专题](https://www.zhihu.com/column/c_1425605280340504576),将不定期更新炼丹小技巧or心得,也欢迎提问
#### 2.1 使用数据集自己训练合成器模型(与2.2二选一)
#### 2.1 使用数据集自己训练encoder模型 (可选)
* 进行音频和梅尔频谱图预处理:
`python encoder_preprocess.py <datasets_root>`
使用`-d {dataset}` 指定数据集,支持 librispeech_other、voxceleb1、aidatatang_200zh,使用逗号分割处理多数据集。
* 训练encoder: `python encoder_train.py my_run <datasets_root>/SV2TTS/encoder`
> 训练encoder使用了visdom。你可以加上`-no_visdom`禁用visdom,但是有可视化会更好。在单独的命令行/进程中运行"visdom"来启动visdom服务器。
#### 2.2 使用数据集自己训练合成器模型(与2.3二选一)
* 下载数据集并解压:确保您可以访问 *train* 文件夹中的所有音频文件(如.wav)
* 进行音频和梅尔频谱图预处理:
`python pre.py <datasets_root> -d {dataset} -n {number}`
可传入参数:
* -d`{dataset}` 指定数据集,支持 aidatatang_200zh, magicdata, aishell3, data_aishell, 不传默认为aidatatang_200zh
* -n `{number}` 指定并行数,CPU 11770k + 32GB实测10没有问题
* `-d {dataset}` 指定数据集,支持 aidatatang_200zh, magicdata, aishell3, data_aishell, 不传默认为aidatatang_200zh
* `-n {number}` 指定并行数,CPU 11770k + 32GB实测10没有问题
> 假如你下载的 `aidatatang_200zh`文件放在D盘,`train`文件路径为 `D:\data\aidatatang_200zh\corpus\train` , 你的`datasets_root`就是 `D:\data\`
* 训练合成器:
`python synthesizer_train.py mandarin <datasets_root>/SV2TTS/synthesizer`
`python ./control/cli/synthesizer_train.py mandarin <datasets_root>/SV2TTS/synthesizer`
* 当您在训练文件夹 *synthesizer/saved_models/* 中看到注意线显示和损失满足您的需要时,请转到`启动程序`一步。
#### 2.2使用社区预先训练好的合成器(与2.1二选一)
#### 2.3使用社区预先训练好的合成器(与2.2二选一)
> 当实在没有设备或者不想慢慢调试,可以使用社区贡献的模型(欢迎持续分享):
| 作者 | 下载链接 | 效果预览 | 信息 |
| --- | ----------- | ----- | ----- |
| 作者 | https://pan.baidu.com/s/11FrUYBmLrSs_cQ7s3JTlPQ [百度盘链接](https://pan.baidu.com/s/11FrUYBmLrSs_cQ7s3JTlPQ) 提取码gdn5 | | 25k steps 用3个开源数据集混合训练
|@FawenYo | https://drive.google.com/file/d/1H-YGOUHpmqKxJ9FRc6vAjPuqQki24UbC/view?usp=sharing [百度盘链接](https://pan.baidu.com/s/1vSYXO4wsLyjnF3Unl-Xoxg) 提取码1024 | [input](https://github.com/babysor/MockingBird/wiki/audio/self_test.mp3) [output](https://github.com/babysor/MockingBird/wiki/audio/export.wav) | 200k steps 台湾口音
|@miven| https://pan.baidu.com/s/1PI-hM3sn5wbeChRryX-RCQ 提取码2021 | https://www.bilibili.com/video/BV1uh411B7AD/ | 150k steps 旧版需根据[issue](https://github.com/babysor/MockingBird/issues/37)修复
| 作者 | https://pan.baidu.com/s/1iONvRxmkI-t1nHqxKytY3g [百度盘链接](https://pan.baidu.com/s/1iONvRxmkI-t1nHqxKytY3g) 4j5d | | 75k steps 用3个开源数据集混合训练
| 作者 | https://pan.baidu.com/s/1fMh9IlgKJlL2PIiRTYDUvw [百度盘链接](https://pan.baidu.com/s/1fMh9IlgKJlL2PIiRTYDUvw) 提取码om7f | | 25k steps 用3个开源数据集混合训练, 切换到tag v0.0.1使用
|@FawenYo | https://yisiou-my.sharepoint.com/:u:/g/personal/lawrence_cheng_fawenyo_onmicrosoft_com/EWFWDHzee-NNg9TWdKckCc4BC7bK2j9cCbOWn0-_tK0nOg?e=n0gGgC | [input](https://github.com/babysor/MockingBird/wiki/audio/self_test.mp3) [output](https://github.com/babysor/MockingBird/wiki/audio/export.wav) | 200k steps 台湾口音需切换到tag v0.0.1使用
|@miven| https://pan.baidu.com/s/1PI-hM3sn5wbeChRryX-RCQ 提取码2021 | https://www.bilibili.com/video/BV1uh411B7AD/ | 150k steps 注意:根据[issue](https://github.com/babysor/MockingBird/issues/37)修复 并切换到tag v0.0.1使用
#### 2.3训练声码器 (可选)
#### 2.4训练声码器 (可选)
对效果影响不大,已经预置3款,如果希望自己训练可以参考以下命令。
* 预处理数据:
`python vocoder_preprocess.py <datasets_root> -m <synthesizer_model_path>`
> `<datasets_root>`替换为你的数据集目录,`<synthesizer_model_path>`替换为一个你最好的synthesizer模型目录例如 *sythensizer\saved_mode\xxx*
> `<datasets_root>`替换为你的数据集目录,`<synthesizer_model_path>`替换为一个你最好的synthesizer模型目录例如 *sythensizer\saved_models\xxx*
* 训练wavernn声码器:
`python vocoder_train.py <trainid> <datasets_root>`
`python ./control/cli/vocoder_train.py <trainid> <datasets_root>`
> `<trainid>`替换为你想要的标识,同一标识再次训练时会延续原模型
* 训练hifigan声码器:
`python vocoder_train.py <trainid> <datasets_root> hifigan`
`python ./control/cli/vocoder_train.py <trainid> <datasets_root> hifigan`
> `<trainid>`替换为你想要的标识,同一标识再次训练时会延续原模型
* 训练fregan声码器:
`python ./control/cli/vocoder_train.py <trainid> <datasets_root> --config config.json fregan`
> `<trainid>`替换为你想要的标识,同一标识再次训练时会延续原模型
* 将GAN声码器的训练切换为多GPU模式:修改GAN文件夹下.json文件中的"num_gpus"参数
### 3. 启动程序或工具箱
您可以尝试使用以下命令:
### 3.1 启动Web程序
### 3.1 启动Web程序v2
`python web.py`
运行成功后在浏览器打开地址, 默认为 `http://localhost:8080`
![123](https://user-images.githubusercontent.com/12797292/135494044-ae59181c-fe3a-406f-9c7d-d21d12fdb4cb.png)
> 目前界面比较buggy,
> * 第一次点击`录制`要等待几秒浏览器正常启动录音,否则会有重音
> * 录制结束不要再点`录制`而是`停止`
> * 仅支持手动新录音(16khz), 不支持超过4MB的录音,最佳长度在5~15秒
> * 默认使用第一个找到的模型,有动手能力的可以看代码修改 `web\__init__.py`。
### 3.2 启动工具箱:
`python demo_toolbox.py -d <datasets_root>`
@@ -89,33 +160,35 @@
<img width="1042" alt="d48ea37adf3660e657cfb047c10edbc" src="https://user-images.githubusercontent.com/7423248/134275227-c1ddf154-f118-4b77-8949-8c4c7daf25f0.png">
## 文件结构(目标读者:开发者)
```
├─archived_untest_files 废弃文件
├─encoder encoder模型
│ ├─data_objects
│ └─saved_models 预训练好的模型
├─samples 样例语音
├─synthesizer synthesizer模型
├─models
│ ├─saved_models 预训练好的模型
│ └─utils 工具类库
├─toolbox 图形化工具箱
├─utils 工具类库
├─vocoder vocoder模型目前包含hifi-gan、wavrnn
│ ├─hifigan
│ ├─saved_models 预训练好的模型
│ └─wavernn
└─web
├─api
│ └─Web端接口
├─config
│ └─ Web端配置文件
├─static 前端静态脚本
│ └─js
├─templates 前端模板
└─__init__.py Web端入口文件
```
### 4. 番外语音转换Voice Conversion(PPG based)
想像柯南拿着变声器然后发出毛利小五郎的声音吗?本项目现基于PPG-VC,引入额外两个模块(PPG extractor + PPG2Mel), 可以实现变声功能。(文档不全,尤其是训练部分,正在努力补充中)
#### 4.0 准备环境
* 确保项目以上环境已经安装ok,运行`pip install espnet` 来安装剩余的必要包。
* 下载以下模型,链接:https://pan.baidu.com/s/1bl_x_DHJSAUyN2fma-Q_Wg
提取码:gh41
* 24K采样率专用的vocoder(hifigan)到 *vocoder\saved_models\xxx*
* 预训练的ppg特征encoder(ppg_extractor)到 *ppg_extractor\saved_models\xxx*
* 预训练的PPG2Mel到 *ppg2mel\saved_models\xxx*
#### 4.1 使用数据集自己训练PPG2Mel模型 (可选)
* 下载aidatatang_200zh数据集并解压:确保您可以访问 *train* 文件夹中的所有音频文件(如.wav)
* 进行音频和梅尔频谱图预处理:
`python ./control/cli/pre4ppg.py <datasets_root> -d {dataset} -n {number}`
可传入参数:
* `-d {dataset}` 指定数据集,支持 aidatatang_200zh, 不传默认为aidatatang_200zh
* `-n {number}` 指定并行数,CPU 11700k在8的情况下需要运行12到18小时,待优化
> 假如你下载的 `aidatatang_200zh`文件放在D盘`train`文件路径为 `D:\data\aidatatang_200zh\corpus\train` , 你的`datasets_root`就是 `D:\data\`
* 训练合成器, 注意在上一步先下载好`ppg2mel.yaml`, 修改里面的地址指向预训练好的文件夹:
`python ./control/cli/ppg2mel_train.py --config .\ppg2mel\saved_models\ppg2mel.yaml --oneshotvc `
* 如果想要继续上一次的训练,可以通过`--load .\ppg2mel\saved_models\<old_pt_file>` 参数指定一个预训练模型文件。
#### 4.2 启动工具箱VC模式
您可以尝试使用以下命令:
`python demo_toolbox.py -vc -d <datasets_root>`
> 请指定一个可用的数据集文件路径,如果有支持的数据集则会自动加载供调试,也同时会作为手动录制音频的存储目录。
<img width="971" alt="微信图片_20220305005351" src="https://user-images.githubusercontent.com/7423248/156805733-2b093dbc-d989-4e68-8609-db11f365886a.png">
## 引用及论文
> 该库一开始从仅支持英语的[Real-Time-Voice-Cloning](https://github.com/CorentinJ/Real-Time-Voice-Cloning) 分叉出来的,鸣谢作者。
@@ -124,35 +197,36 @@
| --- | ----------- | ----- | --------------------- |
| [1803.09017](https://arxiv.org/abs/1803.09017) | GlobalStyleToken (synthesizer)| Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis | 本代码库 |
| [2010.05646](https://arxiv.org/abs/2010.05646) | HiFi-GAN (vocoder)| Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis | 本代码库 |
| [2106.02297](https://arxiv.org/abs/2106.02297) | Fre-GAN (vocoder)| Fre-GAN: Adversarial Frequency-consistent Audio Synthesis | 本代码库 |
|[**1806.04558**](https://arxiv.org/pdf/1806.04558.pdf) | SV2TTS | Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis | 本代码库 |
|[1802.08435](https://arxiv.org/pdf/1802.08435.pdf) | WaveRNN (vocoder) | Efficient Neural Audio Synthesis | [fatchord/WaveRNN](https://github.com/fatchord/WaveRNN) |
|[1703.10135](https://arxiv.org/pdf/1703.10135.pdf) | Tacotron (synthesizer) | Tacotron: Towards End-to-End Speech Synthesis | [fatchord/WaveRNN](https://github.com/fatchord/WaveRNN)
|[1710.10467](https://arxiv.org/pdf/1710.10467.pdf) | GE2E (encoder)| Generalized End-To-End Loss for Speaker Verification | 本代码库 |
## 常見問題(FQ&A)
#### 1.數據集哪裡下載?
## 常见问题(FQ&A)
#### 1.数据集在哪里下载?
| 数据集 | OpenSLR地址 | 其他源 (Google Drive, Baidu网盘等) |
| --- | ----------- | ---------------|
| aidatatang_200zh | [OpenSLR](http://www.openslr.org/62/) | [Google Drive](https://drive.google.com/file/d/110A11KZoVe7vy6kXlLb6zVPLb_J91I_t/view?usp=sharing) |
| magicdata | [OpenSLR](http://www.openslr.org/68/) | [Google Drive (Dev set)](https://drive.google.com/file/d/1g5bWRUSNH68ycC6eNvtwh07nX3QhOOlo/view?usp=sharing) |
| aishell3 | [OpenSLR](https://www.openslr.org/93/) | [Google Drive](https://drive.google.com/file/d/1shYp_o4Z0X0cZSKQDtFirct2luFUwKzZ/view?usp=sharing) |
| data_aishell | [OpenSLR](https://www.openslr.org/33/) | |
> 下載 aidatatang_200zh 後,還需將 `aidatatang_200zh\corpus\train`下的檔案全選解壓縮
> 下载 aidatatang_200zh 后,还需将 `aidatatang_200zh\corpus\train`下的文件全选解压缩
#### 2.`<datasets_root>`是什麼意思?
假如數據集路徑為 `D:\data\aidatatang_200zh`,那 `<datasets_root>`就是 `D:\data`
假如数据集路径为 `D:\data\aidatatang_200zh`,那 `<datasets_root>`就是 `D:\data`
#### 3.訓練模型顯存不足
訓練合成器時:將 `synthesizer/hparams.py`中的batch_size參數調小
#### 3.训练模型显存不足
训练合成器时:将 `synthesizer/hparams.py`中的batch_size参数调小
```
//調整前
//调整前
tts_schedule = [(2, 1e-3, 20_000, 12), # Progressive training schedule
(2, 5e-4, 40_000, 12), # (r, lr, step, batch_size)
(2, 2e-4, 80_000, 12), #
(2, 1e-4, 160_000, 12), # r = reduction factor (# of mel frames
(2, 3e-5, 320_000, 12), # synthesized for each decoder iteration)
(2, 1e-5, 640_000, 12)], # lr = learning rate
//調整後
//调整后
tts_schedule = [(2, 1e-3, 20_000, 8), # Progressive training schedule
(2, 5e-4, 40_000, 8), # (r, lr, step, batch_size)
(2, 2e-4, 80_000, 8), #
@@ -161,15 +235,15 @@ tts_schedule = [(2, 1e-3, 20_000, 8), # Progressive training schedule
(2, 1e-5, 640_000, 8)], # lr = learning rate
```
聲碼器-預處理數據集時:將 `synthesizer/hparams.py`中的batch_size參數調小
声码器-预处理数据集时:将 `synthesizer/hparams.py`中的batch_size参数调小
```
//調整前
//调整前
### Data Preprocessing
max_mel_frames = 900,
rescale = True,
rescaling_max = 0.9,
synthesis_batch_size = 16, # For vocoder preprocessing and inference.
//調整後
//调整后
### Data Preprocessing
max_mel_frames = 900,
rescale = True,
@@ -177,16 +251,16 @@ tts_schedule = [(2, 1e-3, 20_000, 8), # Progressive training schedule
synthesis_batch_size = 8, # For vocoder preprocessing and inference.
```
聲碼器-訓練聲碼器時:將 `vocoder/wavernn/hparams.py`中的batch_size參數調小
声码器-训练声码器时:将 `vocoder/wavernn/hparams.py`中的batch_size参数调小
```
//調整前
//调整前
# Training
voc_batch_size = 100
voc_lr = 1e-4
voc_gen_at_checkpoint = 5
voc_pad = 2
//調整後
//调整后
# Training
voc_batch_size = 6
voc_lr = 1e-4
@@ -195,17 +269,16 @@ voc_pad =2
```
#### 4.碰到`RuntimeError: Error(s) in loading state_dict for Tacotron: size mismatch for encoder.embedding.weight: copying a param with shape torch.Size([70, 512]) from checkpoint, the shape in current model is torch.Size([75, 512]).`
請參照 issue [#37](https://github.com/babysor/MockingBird/issues/37)
请参照 issue [#37](https://github.com/babysor/MockingBird/issues/37)
#### 5.如何改善CPU、GPU用率?
適情況調整batch_size參數來改善
#### 5.如何改善CPU、GPU用率?
视情况调整batch_size参数来改善
#### 6.發生 `頁面文件太小,無法完成操作`
請參考這篇[文章](https://blog.csdn.net/qq_17755303/article/details/112564030),將虛擬內存更改為100G(102400),例如:档案放置D盘就更改D盘的虚拟内存
#### 6.发生 `页面文件太小,无法完成操作`
请参考这篇[文章](https://blog.csdn.net/qq_17755303/article/details/112564030),将虚拟内存更改为100G(102400),例如:文件放置D盘就更改D盘的虚拟内存
#### 7.什么时候算训练完成?
首先一定要出现注意力模型,其次是loss足够低,取决于硬件设备和数据集。拿本人的供参考,我的注意力是在 18k 步之后出现的,并且在 50k 步之后损失变得低于 0.4
![attention_step_20500_sample_1](https://user-images.githubusercontent.com/7423248/128587252-f669f05a-f411-4811-8784-222156ea5e9d.png)
![step-135500-mel-spectrogram_sample_1](https://user-images.githubusercontent.com/7423248/128587255-4945faa0-5517-46ea-b173-928eff999330.png)

223
README-LINUX-CN.md Normal file

@@ -0,0 +1,223 @@
## 实时语音克隆 - 中文/普通话
![mockingbird](https://user-images.githubusercontent.com/12797292/131216767-6eb251d6-14fc-4951-8324-2722f0cd4c63.jpg)
[![MIT License](https://img.shields.io/badge/license-MIT-blue.svg?style=flat)](http://choosealicense.com/licenses/mit/)
### [English](README.md) | 中文
### [DEMO VIDEO](https://www.bilibili.com/video/BV17Q4y1B7mY/) | [Wiki教程](https://github.com/babysor/MockingBird/wiki/Quick-Start-(Newbie)) [训练教程](https://vaj2fgg8yn.feishu.cn/docs/doccn7kAbr3SJz0KM0SIDJ0Xnhd)
## 特性
🌍 **中文** 支持普通话并使用多种中文数据集进行测试aidatatang_200zh, magicdata, aishell3, biaobei, MozillaCommonVoice, data_aishell 等
🤩 **Easy & Awesome** 仅需下载或新训练合成器(synthesizer)就有良好效果,复用预训练的编码器/声码器,或实时的HiFi-GAN作为vocoder
🌍 **Webserver Ready** 可伺服你的训练结果,供远程调用。
🤩 **感谢各位小伙伴的支持,本项目将开启新一轮的更新**
## 1.快速开始
### 1.1 建议环境
- Ubuntu 18.04
- Cuda 11.7 && CuDNN 8.5.0
- Python 3.8 或 3.9
- Pytorch 2.0.1 <post cuda-11.7>
### 1.2 环境配置
```shell
# 下载前建议更换国内镜像源
conda create -n sound python=3.9
conda activate sound
git clone https://github.com/babysor/MockingBird.git
cd MockingBird
pip install -r requirements.txt
pip install webrtcvad-wheels
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
```
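环境装好后,可以先用下面几行 Python 做个简单自检(仅作示意,非项目自带脚本),确认 torch 能正常导入、CUDA 是否可见:
```python
# 环境自检示意(假设按上文安装了对应 CUDA 版本的 PyTorch;本片段非项目自带)
import torch

print("torch:", torch.__version__)
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```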
### 1.3 模型准备
> 当实在没有设备或者不想慢慢调试,可以使用社区贡献的模型(欢迎持续分享):
| 作者 | 下载链接 | 效果预览 | 信息 |
| --- | ----------- | ----- | ----- |
| 作者 | https://pan.baidu.com/s/1iONvRxmkI-t1nHqxKytY3g [百度盘链接](https://pan.baidu.com/s/1iONvRxmkI-t1nHqxKytY3g) 4j5d | | 75k steps 用3个开源数据集混合训练
| 作者 | https://pan.baidu.com/s/1fMh9IlgKJlL2PIiRTYDUvw [百度盘链接](https://pan.baidu.com/s/1fMh9IlgKJlL2PIiRTYDUvw) 提取码om7f | | 25k steps 用3个开源数据集混合训练, 切换到tag v0.0.1使用
|@FawenYo | https://drive.google.com/file/d/1H-YGOUHpmqKxJ9FRc6vAjPuqQki24UbC/view?usp=sharing [百度盘链接](https://pan.baidu.com/s/1vSYXO4wsLyjnF3Unl-Xoxg) 提取码1024 | [input](https://github.com/babysor/MockingBird/wiki/audio/self_test.mp3) [output](https://github.com/babysor/MockingBird/wiki/audio/export.wav) | 200k steps 台湾口音需切换到tag v0.0.1使用
|@miven| https://pan.baidu.com/s/1PI-hM3sn5wbeChRryX-RCQ 提取码2021 | https://www.bilibili.com/video/BV1uh411B7AD/ | 150k steps 注意:根据[issue](https://github.com/babysor/MockingBird/issues/37)修复 并切换到tag v0.0.1使用
### 1.4 文件结构准备
文件结构准备如下所示,算法将自动遍历synthesizer下的.pt模型文件。
```
# 以第一个 pretrained-11-7-21_75k.pt 为例
└── data
└── ckpt
└── synthesizer
└── pretrained-11-7-21_75k.pt
```
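运行前也可以用下面这段小脚本确认上述 checkpoint 目录已就位(仅作示意,按上文 `data/ckpt/synthesizer` 的约定,非项目自带):
```python
# 示意:检查 data/ckpt/synthesizer 下是否已放置 .pt 模型文件(非项目自带脚本)
from pathlib import Path

ckpts = sorted(Path("data/ckpt/synthesizer").glob("*.pt"))
if not ckpts:
    raise SystemExit("data/ckpt/synthesizer 下没有找到 .pt 模型,请先按 1.3/1.4 准备模型")
print("检测到的模型:", ", ".join(p.name for p in ckpts))
```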
### 1.5 运行
```
python web.py
```
## 2.模型训练
### 2.1 数据准备
#### 2.1.1 数据下载
``` shell
# aidatatang_200zh
wget https://openslr.elda.org/resources/62/aidatatang_200zh.tgz
```
``` shell
# MAGICDATA
wget https://openslr.magicdatatech.com/resources/68/train_set.tar.gz
wget https://openslr.magicdatatech.com/resources/68/dev_set.tar.gz
wget https://openslr.magicdatatech.com/resources/68/test_set.tar.gz
```
``` shell
# AISHELL-3
wget https://openslr.elda.org/resources/93/data_aishell3.tgz
```
```shell
# Aishell
wget https://openslr.elda.org/resources/33/data_aishell.tgz
```
#### 2.1.2 数据批量解压
```shell
# 该指令为解压当前目录下的所有压缩文件
for gz in *.gz; do tar -zxvf $gz; done
```
### 2.2 encoder模型训练
#### 2.2.1 数据预处理:
需要先在`pre.py`头部加入:
```python
import torch
torch.multiprocessing.set_start_method('spawn', force=True)
```
使用以下指令对数据预处理:
```shell
python pre.py <datasets_root> \
-d <datasets_name>
```
其中`<datasets_root>`为原数据集路径,`<datasets_name>` 为数据集名称。
支持 `librispeech_other`、`voxceleb1`、`aidatatang_200zh`,使用逗号分割处理多数据集。
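上面要求在 `pre.py` 头部强制 `spawn` 启动方式,通常是为了避免子进程以 fork 方式继承已初始化的 CUDA 上下文而报错(此处为一般性解释)。下面是一个独立的小示例(仅作示意,非项目代码),演示该写法:
```python
# 独立示例:在程序入口强制 spawn 启动方式后,子进程可以各自安全地初始化 CUDA
# (仅作示意,非项目代码)
import torch
import torch.multiprocessing as mp

def worker(idx: int) -> None:
    # 每个子进程独立检查/初始化 CUDA
    print(f"worker {idx}: cuda={torch.cuda.is_available()}")

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)
    mp.spawn(worker, nprocs=2)
```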
#### 2.2.2 encoder模型训练
超参数文件路径:`models/encoder/hparams.py`
```shell
python encoder_train.py <name> \
<datasets_root>/SV2TTS/encoder
```
其中 `<name>` 是训练产生文件的名称,可自行修改。
其中 `<datasets_root>` 是经过 `Step 2.1.1` 处理过后的数据集路径。
#### 2.2.3 开启encoder模型训练数据可视化(可选)
```shell
visdom
```
### 2.3 synthesizer模型训练
#### 2.3.1 数据预处理:
```shell
python pre.py <datasets_root> \
-d <datasets_name> \
-o <datasets_path> \
-n <number>
```
`<datasets_root>` 为原数据集路径,当你的`aidatatang_200zh`路径为`/data/aidatatang_200zh/corpus/train`时,`<datasets_root>` 为 `/data/`。
`<datasets_name>` 为数据集名称。
`<datasets_path>` 为数据集处理后的保存路径。
`<number>` 为数据集处理时进程数,根据CPU情况调整大小。
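`<datasets_root>` 的取法可以用下面几行 Python 直观验证(仅作示意):
```python
# 按上文约定:train 目录为 /data/aidatatang_200zh/corpus/train 时,datasets_root 为 /data
from pathlib import Path

train_dir = Path("/data/aidatatang_200zh/corpus/train")
datasets_root = train_dir.parents[2]
print(datasets_root)  # /data
```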
#### 2.3.2 新增数据预处理:
```shell
python pre.py <datasets_root> \
-d <datasets_name> \
-o <datasets_path> \
-n <number> \
-s
```
当新增数据集时,应加 `-s` 选择数据拼接,不加则为覆盖。
#### 2.3.3 synthesizer模型训练
超参数文件路径:`models/synthesizer/hparams.py`,需将`MockingBird/control/cli/synthesizer_train.py`移成`MockingBird/synthesizer_train.py`结构。
```shell
python synthesizer_train.py <name> <datasets_path> \
-m <out_dir>
```
其中 `<name>` 是训练产生文件的名称,可自行修改。
其中 `<datasets_path>` 是经过 `Step 2.2.1` 处理过后的数据集路径。
其中 `<out_dir> `为训练时所有数据的保存路径。
### 2.4 vocoder模型训练
vocoder模型对生成效果影响不大,已预置3款。
#### 2.4.1 数据预处理
```shell
python vocoder_preprocess.py <datasets_root> \
-m <synthesizer_model_path>
```
其中`<datasets_root>`为你数据集路径。
其中 `<synthesizer_model_path>`为synthesizer模型地址。
#### 2.4.2 wavernn声码器训练:
```
python vocoder_train.py <name> <datasets_root>
```
#### 2.4.3 hifigan声码器训练:
```
python vocoder_train.py <name> <datasets_root> hifigan
```
#### 2.4.4 fregan声码器训练:
```
python vocoder_train.py <name> <datasets_root> \
--config config.json fregan
```
将GAN声码器的训练切换为多GPU模式:修改`GAN`文件夹下`.json`文件中的`num_gpus`参数。
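例如可以用下面这段脚本修改该参数(仅作示意,`config.json` 的具体路径以你本地 GAN 声码器目录为准,此处路径为假设):
```python
# 示意:把 GAN 声码器配置中的 num_gpus 改为 2(config 路径为假设值,请按实际目录调整)
import json
from pathlib import Path

cfg_path = Path("models/vocoder/fregan/config.json")  # 假设的路径
cfg = json.loads(cfg_path.read_text(encoding="utf-8"))
cfg["num_gpus"] = 2
cfg_path.write_text(json.dumps(cfg, indent=2, ensure_ascii=False), encoding="utf-8")
```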
## 3.致谢
### 3.1 项目致谢
该库一开始从仅支持英语的[Real-Time-Voice-Cloning](https://github.com/CorentinJ/Real-Time-Voice-Cloning) 分叉出来的,鸣谢作者。
### 3.2 论文致谢
| URL | Designation | 标题 | 实现源码 |
| --- | ----------- | ----- | --------------------- |
| [1803.09017](https://arxiv.org/abs/1803.09017) | GlobalStyleToken (synthesizer)| Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis | 本代码库 |
| [2010.05646](https://arxiv.org/abs/2010.05646) | HiFi-GAN (vocoder)| Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis | 本代码库 |
| [2106.02297](https://arxiv.org/abs/2106.02297) | Fre-GAN (vocoder)| Fre-GAN: Adversarial Frequency-consistent Audio Synthesis | 本代码库 |
|[**1806.04558**](https://arxiv.org/pdf/1806.04558.pdf) | SV2TTS | Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis | 本代码库 |
|[1802.08435](https://arxiv.org/pdf/1802.08435.pdf) | WaveRNN (vocoder) | Efficient Neural Audio Synthesis | [fatchord/WaveRNN](https://github.com/fatchord/WaveRNN) |
|[1703.10135](https://arxiv.org/pdf/1703.10135.pdf) | Tacotron (synthesizer) | Tacotron: Towards End-to-End Speech Synthesis | [fatchord/WaveRNN](https://github.com/fatchord/WaveRNN)
|[1710.10467](https://arxiv.org/pdf/1710.10467.pdf) | GE2E (encoder)| Generalized End-To-End Loss for Speaker Verification | 本代码库 |
### 3.3 开发者致谢
作为AI领域的从业者,我们不仅乐于开发一些具有里程碑意义的算法项目,同时也乐于分享项目以及开发过程中收获的喜悦。
因此,你们的使用是对我们项目的最大认可。同时,当你们在项目使用中遇到一些问题时,欢迎你们随时在issue上留言,你们的指正对于项目的后续优化具有十分重大的意义。
为了表示感谢,我们将在本项目中留下各位开发者信息以及相对应的贡献。
- ------------------------------------------------ 开 发 者 贡 献 内 容 ---------------------------------------------------------------------------------

104
README.md

@@ -1,9 +1,11 @@
> 🚧 While I no longer actively update this repo, you can find me continuing to push this tech forward for good and in the open source. Join me at [MaskGCT](https://github.com/open-mmlab/Amphion/tree/main/models/tts/maskgct). I'm also building an optimized, cloud-hosted version: https://noiz.ai/
>
![mockingbird](https://user-images.githubusercontent.com/12797292/131216767-6eb251d6-14fc-4951-8324-2722f0cd4c63.jpg)
[![MIT License](https://img.shields.io/badge/license-MIT-blue.svg?style=flat)](http://choosealicense.com/licenses/mit/)
> English | [中文](README-CN.md)
> English | [中文](README-CN.md)| [中文Linux](README-LINUX-CN.md)
## Features
🌍 **Chinese** supported mandarin and tested with multiple datasets: aidatatang_200zh, magicdata, aishell3, data_aishell, etc.
@@ -21,6 +23,7 @@
## Quick Start
### 1. Install Requirements
#### 1.1 General Setup
> Follow the original repo to test if you got all environment ready.
**Python 3.7 or higher** is needed to run the toolbox.
@@ -28,31 +31,108 @@
> If you get an `ERROR: Could not find a version that satisfies the requirement torch==1.9.0+cu102 (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2 )` This error is probably due to a low version of python, try using 3.9 and it will install successfully
* Install [ffmpeg](https://ffmpeg.org/download.html#get-packages).
* Run `pip install -r requirements.txt` to install the remaining necessary packages.
> The recommended environment here is `Repo Tag 0.0.1` `Pytorch1.9.0 with Torchvision0.10.0 and cudatoolkit10.2` `requirements.txt` `webrtcvad-wheels`, because `requirements.txt` was exported a few months ago and doesn't work with newer versions
* Install webrtcvad `pip install webrtcvad-wheels` (if you need it)
> Note that we are using the pretrained encoder/vocoder but synthesizer, since the original model is incompatible with the Chinese sympols. It means the demo_cli is not working at this moment.
or
- install dependencies with `conda` or `mamba`
```conda env create -n env_name -f env.yml```
```mamba env create -n env_name -f env.yml```
will create a virtual environment where necessary dependencies are installed. Switch to the new environment by `conda activate env_name` and enjoy it.
> env.yml only includes the necessary dependencies to run the project, temporarily without monotonic-align. You can check the official website to install the GPU version of pytorch.
#### 1.2 Setup with a M1 Mac
> The following steps are a workaround to directly use the original `demo_toolbox.py` without changing the code.
>
> Since the major issue comes from the `PyQt5` package used in `demo_toolbox.py` not being compatible with M1 chips, anyone attempting to train models on an M1 chip can either forgo `demo_toolbox.py` or try `web.py` in the project.
##### 1.2.1 Install `PyQt5`, with [ref](https://stackoverflow.com/a/68038451/20455983) here.
* Create and open a Rosetta Terminal, with [ref](https://dev.to/courier/tips-and-tricks-to-setup-your-apple-m1-for-development-547g) here.
* Use system Python to create a virtual environment for the project
```
/usr/bin/python3 -m venv /PathToMockingBird/venv
source /PathToMockingBird/venv/bin/activate
```
* Upgrade pip and install `PyQt5`
```
pip install --upgrade pip
pip install pyqt5
```
##### 1.2.2 Install `pyworld` and `ctc-segmentation`
> Both packages seem to be unique to this project and are not seen in the original [Real-Time Voice Cloning](https://github.com/CorentinJ/Real-Time-Voice-Cloning) project. When installing with `pip install`, both packages lack wheels, so the build tries to compile directly from C code and cannot find `Python.h`.
* Install `pyworld`
* `brew install python` (`Python.h` comes with the Python installed by brew)
* `export CPLUS_INCLUDE_PATH=/opt/homebrew/Frameworks/Python.framework/Headers` The filepath of the brew-installed `Python.h` is unique to M1 macOS and listed above. One needs to manually add the path to the environment variables.
* `pip install pyworld` should then work.
* Install `ctc-segmentation`
> Same method does not apply to `ctc-segmentation`, and one needs to compile it from the source code on [github](https://github.com/lumaku/ctc-segmentation).
* `git clone https://github.com/lumaku/ctc-segmentation.git`
* `cd ctc-segmentation`
* `source /PathToMockingBird/venv/bin/activate` If the virtual environment hasn't been deployed, activate it.
* `cythonize -3 ctc_segmentation/ctc_segmentation_dyn.pyx`
* `/usr/bin/arch -x86_64 python setup.py build` Build with x86 architecture.
* `/usr/bin/arch -x86_64 python setup.py install --optimize=1 --skip-build` Install with x86 architecture.
##### 1.2.3 Other dependencies
* `/usr/bin/arch -x86_64 pip install torch torchvision torchaudio` Install `PyTorch` with pip, being explicit that the x86 build is used
* `pip install ffmpeg` Install ffmpeg
* `pip install -r requirements.txt` Install other requirements.
##### 1.2.4 Run the Inference Time (with Toolbox)
> To run the project on x86 architecture. [ref](https://youtrack.jetbrains.com/issue/PY-46290/Allow-running-Python-under-Rosetta-2-in-PyCharm-for-Apple-Silicon).
* `vim /PathToMockingBird/venv/bin/pythonM1` Create an executable file `pythonM1` that wraps the Python interpreter at `/PathToMockingBird/venv/bin`.
* Write in the following content:
```
#!/usr/bin/env zsh
mydir=${0:a:h}
/usr/bin/arch -x86_64 $mydir/python "$@"
```
* `chmod +x pythonM1` Set the file as executable.
* If using PyCharm IDE, configure project interpreter to `pythonM1`([steps here](https://www.jetbrains.com/help/pycharm/configuring-python-interpreter.html#add-existing-interpreter)), if using command line python, run `/PathToMockingBird/venv/bin/pythonM1 demo_toolbox.py`
### 2. Prepare your models
> Note that we are using the pretrained encoder/vocoder but not synthesizer, since the original model is incompatible with the Chinese symbols. It means the demo_cli is not working at this moment, so additional synthesizer models are required.
You can either train your models or use existing ones:
#### 2.1. Train synthesizer with your dataset
#### 2.1 Train encoder with your dataset (Optional)
* Preprocess with the audios and the mel spectrograms:
`python encoder_preprocess.py <datasets_root>` Allowing parameter `--dataset {dataset}` to support the datasets you want to preprocess. Only the train set of these datasets will be used. Possible names: librispeech_other, voxceleb1, voxceleb2. Use comma to separate multiple datasets.
* Train the encoder: `python encoder_train.py my_run <datasets_root>/SV2TTS/encoder`
> For training, the encoder uses visdom. You can disable it with `--no_visdom`, but it's nice to have. Run "visdom" in a separate CLI/process to start your visdom server.
#### 2.2 Train synthesizer with your dataset
* Download dataset and unzip: make sure you can access all .wav files in the *train* folder
* Preprocess with the audios and the mel spectrograms:
`python pre.py <datasets_root>`
Allowing parameter `--dataset {dataset}` to support aidatatang_200zh, magicdata, aishell3, data_aishell, etc. If this parameter is not passed, the default dataset will be aidatatang_200zh.
* Train the synthesizer:
`python synthesizer_train.py mandarin <datasets_root>/SV2TTS/synthesizer`
`python train.py --type=synth mandarin <datasets_root>/SV2TTS/synthesizer`
* Go to next step when you see attention line show and loss meet your need in training folder *synthesizer/saved_models/*.
#### 2.2 Use pretrained model of synthesizer
#### 2.3 Use pretrained model of synthesizer
> Thanks to the community, some models will be shared:
| author | Download link | Preview Video | Info |
| --- | ----------- | ----- |----- |
| @myself | https://pan.baidu.com/s/1VHSKIbxXQejtxi2at9IrpA [Baidu](https://pan.baidu.com/s/1VHSKIbxXQejtxi2at9IrpA ) code: i183 | | 200k steps only trained by aidatatang_200zh
|@FawenYo | https://drive.google.com/file/d/1H-YGOUHpmqKxJ9FRc6vAjPuqQki24UbC/view?usp=sharing https://u.teknik.io/AYxWf.pt | [input](https://github.com/babysor/MockingBird/wiki/audio/self_test.mp3) [output](https://github.com/babysor/MockingBird/wiki/audio/export.wav) | 200k steps with local accent of Taiwan
|@miven| https://pan.baidu.com/s/1PI-hM3sn5wbeChRryX-RCQ code: 2021 | https://www.bilibili.com/video/BV1uh411B7AD/
| @author | https://pan.baidu.com/s/1iONvRxmkI-t1nHqxKytY3g [Baidu](https://pan.baidu.com/s/1iONvRxmkI-t1nHqxKytY3g) 4j5d | | 75k steps trained by multiple datasets
| @author | https://pan.baidu.com/s/1fMh9IlgKJlL2PIiRTYDUvw [Baidu](https://pan.baidu.com/s/1fMh9IlgKJlL2PIiRTYDUvw) code: om7f | | 25k steps trained by multiple datasets, only works under version 0.0.1
|@FawenYo | https://yisiou-my.sharepoint.com/:u:/g/personal/lawrence_cheng_fawenyo_onmicrosoft_com/EWFWDHzee-NNg9TWdKckCc4BC7bK2j9cCbOWn0-_tK0nOg?e=n0gGgC | [input](https://github.com/babysor/MockingBird/wiki/audio/self_test.mp3) [output](https://github.com/babysor/MockingBird/wiki/audio/export.wav) | 200k steps with local accent of Taiwan, only works under version 0.0.1
|@miven| https://pan.baidu.com/s/1PI-hM3sn5wbeChRryX-RCQ code: 2021 https://www.aliyundrive.com/s/AwPsbo8mcSP code: z2m0 | https://www.bilibili.com/video/BV1uh411B7AD/ | only works under version 0.0.1
#### 2.3 Train vocoder (Optional)
#### 2.4 Train vocoder (Optional)
> note: vocoder has little difference in effect, so you may not need to train a new one.
* Preprocess the data:
`python vocoder_preprocess.py <datasets_root> -m <synthesizer_model_path>`
@@ -72,6 +152,11 @@ You can then try to run:`python web.py` and open it in browser, default as `http
You can then try the toolbox:
`python demo_toolbox.py -d <datasets_root>`
#### 3.3 Using the command line
You can then try the command:
`python gen_voice.py <text_file.txt> your_wav_file.wav`
you may need to install cn2an via `pip install cn2an` for better handling of digits and numbers.
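As a rough, hedged illustration of why cn2an helps (assuming its documented `an2cn`/`transform` helpers): it rewrites Arabic numerals into Chinese numerals before synthesis, which usually sounds more natural.
```python
# Illustration only (not part of this repo): convert Arabic digits to Chinese numerals with cn2an.
import cn2an

print(cn2an.an2cn("123"))                        # -> 一百二十三
print(cn2an.transform("今天卖出100本书", "an2cn"))  # digits converted inside a sentence
```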
## Reference
> This repository is forked from [Real-Time-Voice-Cloning](https://github.com/CorentinJ/Real-Time-Voice-Cloning) which only support English.
@@ -79,6 +164,7 @@ You can then try the toolbox:
| --- | ----------- | ----- | --------------------- |
| [1803.09017](https://arxiv.org/abs/1803.09017) | GlobalStyleToken (synthesizer)| Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis | This repo |
| [2010.05646](https://arxiv.org/abs/2010.05646) | HiFi-GAN (vocoder)| Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis | This repo |
| [2106.02297](https://arxiv.org/abs/2106.02297) | Fre-GAN (vocoder)| Fre-GAN: Adversarial Frequency-consistent Audio Synthesis | This repo |
|[**1806.04558**](https://arxiv.org/pdf/1806.04558.pdf) | **SV2TTS** | **Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis** | This repo |
|[1802.08435](https://arxiv.org/pdf/1802.08435.pdf) | WaveRNN (vocoder) | Efficient Neural Audio Synthesis | [fatchord/WaveRNN](https://github.com/fatchord/WaveRNN) |
|[1703.10135](https://arxiv.org/pdf/1703.10135.pdf) | Tacotron (synthesizer) | Tacotron: Towards End-to-End Speech Synthesis | [fatchord/WaveRNN](https://github.com/fatchord/WaveRNN)


@@ -1,9 +1,9 @@
from encoder.params_model import model_embedding_size as speaker_embedding_size
from models.encoder.params_model import model_embedding_size as speaker_embedding_size
from utils.argutils import print_args
from utils.modelutils import check_model_paths
from synthesizer.inference import Synthesizer
from encoder import inference as encoder
from vocoder import inference as vocoder
from models.synthesizer.inference import Synthesizer
from models.encoder import inference as encoder
from models.vocoder import inference as vocoder
from pathlib import Path
import numpy as np
import soundfile as sf


@@ -1,7 +1,10 @@
from encoder.preprocess import preprocess_librispeech, preprocess_voxceleb1, preprocess_voxceleb2, preprocess_aidatatang_200zh
from utils.argutils import print_args
from pathlib import Path
import argparse
from pathlib import Path
from models.encoder.preprocess import (preprocess_aidatatang_200zh,
preprocess_librispeech, preprocess_voxceleb1,
preprocess_voxceleb2)
from utils.argutils import print_args
if __name__ == "__main__":
class MyFormatter(argparse.ArgumentDefaultsHelpFormatter, argparse.RawDescriptionHelpFormatter):


@@ -1,5 +1,5 @@
from utils.argutils import print_args
from encoder.train import train
from models.encoder.train import train
from pathlib import Path
import argparse


@@ -0,0 +1,67 @@
import sys
import torch
import argparse
import numpy as np
from utils.hparams import HpsYaml
from models.ppg2mel.train.train_linglf02mel_seq2seq_oneshotvc import Solver
# For reproducibility, comment these may speed up training
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
def main():
# Arguments
parser = argparse.ArgumentParser(description=
'Training PPG2Mel VC model.')
parser.add_argument('--config', type=str,
help='Path to experiment config, e.g., config/vc.yaml')
parser.add_argument('--name', default=None, type=str, help='Name for logging.')
parser.add_argument('--logdir', default='log/', type=str,
help='Logging path.', required=False)
parser.add_argument('--ckpdir', default='ppg2mel/saved_models/', type=str,
help='Checkpoint path.', required=False)
parser.add_argument('--outdir', default='result/', type=str,
help='Decode output path.', required=False)
parser.add_argument('--load', default=None, type=str,
help='Load pre-trained model (for training only)', required=False)
parser.add_argument('--warm_start', action='store_true',
help='Load model weights only, ignore specified layers.')
parser.add_argument('--seed', default=0, type=int,
help='Random seed for reproducable results.', required=False)
parser.add_argument('--njobs', default=8, type=int,
help='Number of threads for dataloader/decoding.', required=False)
parser.add_argument('--cpu', action='store_true', help='Disable GPU training.')
parser.add_argument('--no-pin', action='store_true',
help='Disable pin-memory for dataloader')
parser.add_argument('--test', action='store_true', help='Test the model.')
parser.add_argument('--no-msg', action='store_true', help='Hide all messages.')
parser.add_argument('--finetune', action='store_true', help='Finetune model')
parser.add_argument('--oneshotvc', action='store_true', help='Oneshot VC model')
parser.add_argument('--bilstm', action='store_true', help='BiLSTM VC model')
parser.add_argument('--lsa', action='store_true', help='Use location-sensitive attention (LSA)')
###
paras = parser.parse_args()
setattr(paras, 'gpu', not paras.cpu)
setattr(paras, 'pin_memory', not paras.no_pin)
setattr(paras, 'verbose', not paras.no_msg)
# Make the config dict dot visitable
config = HpsYaml(paras.config)
np.random.seed(paras.seed)
torch.manual_seed(paras.seed)
if torch.cuda.is_available():
torch.cuda.manual_seed_all(paras.seed)
print(">>> OneShot VC training ...")
mode = "train"
solver = Solver(config, paras, mode)
solver.load_data()
solver.set_model()
solver.exec()
print(">>> Oneshot VC train finished!")
sys.exit(0)
if __name__ == "__main__":
main()

49
control/cli/pre4ppg.py Normal file

@@ -0,0 +1,49 @@
from pathlib import Path
import argparse
from models.ppg2mel.preprocess import preprocess_dataset
from pathlib import Path
import argparse
recognized_datasets = [
"aidatatang_200zh",
"aidatatang_200zh_s", # sample
]
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Preprocesses audio files from datasets, to be used by the "
"ppg2mel model for training.",
formatter_class=argparse.ArgumentDefaultsHelpFormatter
)
parser.add_argument("datasets_root", type=Path, help=\
"Path to the directory containing your datasets.")
parser.add_argument("-d", "--dataset", type=str, default="aidatatang_200zh", help=\
"Name of the dataset to process, allowing values: aidatatang_200zh.")
parser.add_argument("-o", "--out_dir", type=Path, default=argparse.SUPPRESS, help=\
"Path to the output directory that will contain the mel spectrograms, the audios and the "
"embeds. Defaults to <datasets_root>/PPGVC/ppg2mel/")
parser.add_argument("-n", "--n_processes", type=int, default=8, help=\
"Number of processes in parallel.")
# parser.add_argument("-s", "--skip_existing", action="store_true", help=\
# "Whether to overwrite existing files with the same name. Useful if the preprocessing was "
# "interrupted. ")
# parser.add_argument("--hparams", type=str, default="", help=\
# "Hyperparameter overrides as a comma-separated list of name-value pairs")
# parser.add_argument("--no_trim", action="store_true", help=\
# "Preprocess audio without trimming silences (not recommended).")
parser.add_argument("-pf", "--ppg_encoder_model_fpath", type=Path, default="ppg_extractor/saved_models/24epoch.pt", help=\
"Path your trained ppg encoder model.")
parser.add_argument("-sf", "--speaker_encoder_model", type=Path, default="encoder/saved_models/pretrained_bak_5805000.pt", help=\
"Path your trained speaker encoder model.")
args = parser.parse_args()
assert args.dataset in recognized_datasets, 'is not supported, file a issue to propose a new one'
# Create directories
assert args.datasets_root.exists()
if not hasattr(args, "out_dir"):
args.out_dir = args.datasets_root.joinpath("PPGVC", "ppg2mel")
args.out_dir.mkdir(exist_ok=True, parents=True)
preprocess_dataset(**vars(args))


@@ -1,10 +1,9 @@
from synthesizer.hparams import hparams
from synthesizer.train import train
from models.synthesizer.hparams import hparams
from models.synthesizer.train import train
from utils.argutils import print_args
import argparse
if __name__ == "__main__":
def new_train():
parser = argparse.ArgumentParser()
parser.add_argument("run_id", type=str, help= \
"Name for this model instance. If a model state from the same run ID was previously "
@@ -13,7 +12,7 @@ if __name__ == "__main__":
parser.add_argument("syn_dir", type=str, default=argparse.SUPPRESS, help= \
"Path to the synthesizer directory that contains the ground truth mel spectrograms, "
"the wavs and the embeds.")
parser.add_argument("-m", "--models_dir", type=str, default="synthesizer/saved_models/", help=\
parser.add_argument("-m", "--models_dir", type=str, default=f"data/ckpt/synthesizer/", help=\
"Path to the output directory that will contain the saved model weights and the logs.")
parser.add_argument("-s", "--save_every", type=int, default=1000, help= \
"Number of steps between updates of the model on the disk. Set to 0 to never save the "
@@ -28,10 +27,14 @@ if __name__ == "__main__":
parser.add_argument("--hparams", default="",
help="Hyperparameter overrides as a comma-separated list of name=value "
"pairs")
args = parser.parse_args()
args, _ = parser.parse_known_args()
print_args(args, parser)
args.hparams = hparams.parse(args.hparams)
# Run the training
train(**vars(args))
if __name__ == "__main__":
new_train()


@@ -0,0 +1,66 @@
import sys
import torch
import argparse
import numpy as np
from utils.hparams import HpsYaml
from models.ppg2mel.train.train_linglf02mel_seq2seq_oneshotvc import Solver
# For reproducibility; commenting these two lines out may speed up training
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
def main():
# Arguments
parser = argparse.ArgumentParser(description=
'Training PPG2Mel VC model.')
parser.add_argument('--config', type=str,
help='Path to experiment config, e.g., config/vc.yaml')
parser.add_argument('--name', default=None, type=str, help='Name for logging.')
parser.add_argument('--logdir', default='log/', type=str,
help='Logging path.', required=False)
parser.add_argument('--ckpdir', default='ppg2mel/saved_models/', type=str,
help='Checkpoint path.', required=False)
parser.add_argument('--outdir', default='result/', type=str,
help='Decode output path.', required=False)
parser.add_argument('--load', default=None, type=str,
help='Load pre-trained model (for training only)', required=False)
parser.add_argument('--warm_start', action='store_true',
help='Load model weights only, ignore specified layers.')
parser.add_argument('--seed', default=0, type=int,
                    help='Random seed for reproducible results.', required=False)
parser.add_argument('--njobs', default=8, type=int,
help='Number of threads for dataloader/decoding.', required=False)
parser.add_argument('--cpu', action='store_true', help='Disable GPU training.')
parser.add_argument('--no-pin', action='store_true',
help='Disable pin-memory for dataloader')
parser.add_argument('--test', action='store_true', help='Test the model.')
parser.add_argument('--no-msg', action='store_true', help='Hide all messages.')
parser.add_argument('--finetune', action='store_true', help='Finetune model')
parser.add_argument('--oneshotvc', action='store_true', help='Oneshot VC model')
parser.add_argument('--bilstm', action='store_true', help='BiLSTM VC model')
parser.add_argument('--lsa', action='store_true', help='Use location-sensitive attention (LSA)')
###
paras = parser.parse_args()
setattr(paras, 'gpu', not paras.cpu)
setattr(paras, 'pin_memory', not paras.no_pin)
setattr(paras, 'verbose', not paras.no_msg)
    # Make the config dict accessible with dot notation
config = HpsYaml(paras.config)
np.random.seed(paras.seed)
torch.manual_seed(paras.seed)
if torch.cuda.is_available():
torch.cuda.manual_seed_all(paras.seed)
print(">>> OneShot VC training ...")
mode = "train"
solver = Solver(config, paras, mode)
solver.load_data()
solver.set_model()
solver.exec()
print(">>> Oneshot VC train finished!")
sys.exit(0)
if __name__ == "__main__":
main()


@@ -1,5 +1,5 @@
from synthesizer.synthesize import run_synthesis
from synthesizer.hparams import hparams
from models.synthesizer.synthesize import run_synthesis
from models.synthesizer.hparams import hparams
from utils.argutils import print_args
import argparse
import os


@@ -1,11 +1,13 @@
from utils.argutils import print_args
from vocoder.wavernn.train import train
from vocoder.hifigan.train import train as train_hifigan
from vocoder.hifigan.env import AttrDict
from models.vocoder.wavernn.train import train
from models.vocoder.hifigan.train import train as train_hifigan
from models.vocoder.fregan.train import train as train_fregan
from utils.util import AttrDict
from pathlib import Path
import argparse
import json
import torch
import torch.multiprocessing as mp
if __name__ == "__main__":
parser = argparse.ArgumentParser(
@@ -61,11 +63,30 @@ if __name__ == "__main__":
# Process the arguments
if args.vocoder_type == "wavernn":
# Run the training wavernn
delattr(args, 'vocoder_type')
delattr(args, 'config')
train(**vars(args))
elif args.vocoder_type == "hifigan":
with open(args.config) as f:
json_config = json.load(f)
h = AttrDict(json_config)
train_hifigan(0, args, h)
if h.num_gpus > 1:
h.num_gpus = torch.cuda.device_count()
h.batch_size = int(h.batch_size / h.num_gpus)
print('Batch size per GPU :', h.batch_size)
mp.spawn(train_hifigan, nprocs=h.num_gpus, args=(args, h,))
else:
train_hifigan(0, args, h)
elif args.vocoder_type == "fregan":
with Path('vocoder/fregan/config.json').open() as f:
json_config = json.load(f)
h = AttrDict(json_config)
if h.num_gpus > 1:
h.num_gpus = torch.cuda.device_count()
h.batch_size = int(h.batch_size / h.num_gpus)
print('Batch size per GPU :', h.batch_size)
mp.spawn(train_fregan, nprocs=h.num_gpus, args=(args, h,))
else:
train_fregan(0, args, h)


151
control/mkgui/app.py Normal file

@@ -0,0 +1,151 @@
from pydantic import BaseModel, Field
import os
from pathlib import Path
from enum import Enum
from models.encoder import inference as encoder
import librosa
from scipy.io.wavfile import write
import re
import numpy as np
from control.mkgui.base.components.types import FileContent
from models.vocoder.hifigan import inference as gan_vocoder
from models.synthesizer.inference import Synthesizer
from typing import Any, Tuple
import matplotlib.pyplot as plt
# Constants
AUDIO_SAMPLES_DIR = f"data{os.sep}samples{os.sep}"
SYN_MODELS_DIRT = f"data{os.sep}ckpt{os.sep}synthesizer"
ENC_MODELS_DIRT = f"data{os.sep}ckpt{os.sep}encoder"
VOC_MODELS_DIRT = f"data{os.sep}ckpt{os.sep}vocoder"
TEMP_SOURCE_AUDIO = f"wavs{os.sep}temp_source.wav"
TEMP_RESULT_AUDIO = f"wavs{os.sep}temp_result.wav"
if not os.path.isdir("wavs"):
os.makedirs("wavs")
# Load local sample audio as options TODO: load dataset
if os.path.isdir(AUDIO_SAMPLES_DIR):
audio_input_selection = Enum('samples', list((file.name, file) for file in Path(AUDIO_SAMPLES_DIR).glob("*.wav")))
# Pre-Load models
if os.path.isdir(SYN_MODELS_DIRT):
synthesizers = Enum('synthesizers', list((file.name, file) for file in Path(SYN_MODELS_DIRT).glob("**/*.pt")))
print("Loaded synthesizer models: " + str(len(synthesizers)))
else:
raise Exception(f"Model folder {SYN_MODELS_DIRT} doesn't exist. 请将模型文件位置移动到上述位置中进行重试!")
if os.path.isdir(ENC_MODELS_DIRT):
encoders = Enum('encoders', list((file.name, file) for file in Path(ENC_MODELS_DIRT).glob("**/*.pt")))
print("Loaded encoders models: " + str(len(encoders)))
else:
raise Exception(f"Model folder {ENC_MODELS_DIRT} doesn't exist.")
if os.path.isdir(VOC_MODELS_DIRT):
vocoders = Enum('vocoders', list((file.name, file) for file in Path(VOC_MODELS_DIRT).glob("**/*gan*.pt")))
print("Loaded vocoders models: " + str(len(synthesizers)))
else:
raise Exception(f"Model folder {VOC_MODELS_DIRT} doesn't exist.")
class Input(BaseModel):
message: str = Field(
..., example="欢迎使用工具箱, 现已支持中文输入!", alias="文本内容"
)
local_audio_file: audio_input_selection = Field(
..., alias="选择语音本地wav",
description="选择本地语音文件."
)
record_audio_file: FileContent = Field(default=None, alias="录制语音",
description="录音.", is_recorder=True, mime_type="audio/wav")
upload_audio_file: FileContent = Field(default=None, alias="或上传语音",
description="拖拽或点击上传.", mime_type="audio/wav")
encoder: encoders = Field(
..., alias="编码模型",
description="选择语音编码模型文件."
)
synthesizer: synthesizers = Field(
..., alias="合成模型",
description="选择语音合成模型文件."
)
vocoder: vocoders = Field(
..., alias="语音解码模型",
description="选择语音解码模型文件(目前只支持HifiGan类型)."
)
class AudioEntity(BaseModel):
content: bytes
mel: Any
class Output(BaseModel):
__root__: Tuple[AudioEntity, AudioEntity]
def render_output_ui(self, streamlit_app, input) -> None: # type: ignore
"""Custom output UI.
        If this method is implemented, it will be used instead of the default Output UI renderer.
"""
src, result = self.__root__
streamlit_app.subheader("Synthesized Audio")
streamlit_app.audio(result.content, format="audio/wav")
fig, ax = plt.subplots()
ax.imshow(src.mel, aspect="equal", interpolation="none")
ax.set_title("mel spectrogram(Source Audio)")
streamlit_app.pyplot(fig)
fig, ax = plt.subplots()
ax.imshow(result.mel, aspect="equal", interpolation="none")
ax.set_title("mel spectrogram(Result Audio)")
streamlit_app.pyplot(fig)
def synthesize(input: Input) -> Output:
"""synthesize(合成)"""
# load models
encoder.load_model(Path(input.encoder.value))
current_synt = Synthesizer(Path(input.synthesizer.value))
gan_vocoder.load_model(Path(input.vocoder.value))
# load file
    if input.record_audio_file is not None:
with open(TEMP_SOURCE_AUDIO, "w+b") as f:
f.write(input.record_audio_file.as_bytes())
f.seek(0)
wav, sample_rate = librosa.load(TEMP_SOURCE_AUDIO)
    elif input.upload_audio_file is not None:
with open(TEMP_SOURCE_AUDIO, "w+b") as f:
f.write(input.upload_audio_file.as_bytes())
f.seek(0)
wav, sample_rate = librosa.load(TEMP_SOURCE_AUDIO)
else:
wav, sample_rate = librosa.load(input.local_audio_file.value)
write(TEMP_SOURCE_AUDIO, sample_rate, wav) #Make sure we get the correct wav
source_spec = Synthesizer.make_spectrogram(wav)
# preprocess
encoder_wav = encoder.preprocess_wav(wav, sample_rate)
embed, _, _ = encoder.embed_utterance(encoder_wav, return_partials=True)
# Load input text
texts = filter(None, input.message.split("\n"))
punctuation = '!,。、,' # punctuate and split/clean text
processed_texts = []
for text in texts:
for processed_text in re.sub(r'[{}]+'.format(punctuation), '\n', text).split('\n'):
if processed_text:
processed_texts.append(processed_text.strip())
texts = processed_texts
# synthesize and vocode
embeds = [embed] * len(texts)
specs = current_synt.synthesize_spectrograms(texts, embeds)
spec = np.concatenate(specs, axis=1)
sample_rate = Synthesizer.sample_rate
wav, sample_rate = gan_vocoder.infer_waveform(spec)
# write and output
write(TEMP_RESULT_AUDIO, sample_rate, wav) #Make sure we get the correct wav
with open(TEMP_SOURCE_AUDIO, "rb") as f:
source_file = f.read()
with open(TEMP_RESULT_AUDIO, "rb") as f:
result_file = f.read()
return Output(__root__=(AudioEntity(content=source_file, mel=source_spec), AudioEntity(content=result_file, mel=spec)))
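Stripped of the GUI and temp-file plumbing, synthesize above is a three-stage pipeline: speaker encoder → synthesizer → HiFi-GAN vocoder. A condensed sketch of that core, assuming the same modules imported at the top of this file (checkpoint and sample paths are illustrative):

import numpy as np
import librosa
from pathlib import Path
from models.encoder import inference as encoder
from models.synthesizer.inference import Synthesizer
from models.vocoder.hifigan import inference as gan_vocoder

encoder.load_model(Path("data/ckpt/encoder/pretrained.pt"))            # illustrative checkpoints
synthesizer = Synthesizer(Path("data/ckpt/synthesizer/pretrained.pt"))
gan_vocoder.load_model(Path("data/ckpt/vocoder/pretrained_gan.pt"))

ref_wav, sample_rate = librosa.load("data/samples/reference.wav")      # illustrative reference clip
embed, _, _ = encoder.embed_utterance(encoder.preprocess_wav(ref_wav, sample_rate), return_partials=True)
specs = synthesizer.synthesize_spectrograms(["欢迎使用工具箱"], [embed])
wav, out_sample_rate = gan_vocoder.infer_waveform(np.concatenate(specs, axis=1))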

166
control/mkgui/app_vc.py Normal file

@@ -0,0 +1,166 @@
import os
from enum import Enum
from pathlib import Path
from typing import Any, Tuple
import librosa
import matplotlib.pyplot as plt
import torch
from pydantic import BaseModel, Field
from scipy.io.wavfile import write
import models.ppg2mel as Convertor
import models.ppg_extractor as Extractor
from control.mkgui.base.components.types import FileContent
from models.encoder import inference as speacker_encoder
from models.synthesizer.inference import Synthesizer
from models.vocoder.hifigan import inference as gan_vocoder
# Constants
AUDIO_SAMPLES_DIR = f'data{os.sep}samples{os.sep}'
EXT_MODELS_DIRT = f'data{os.sep}ckpt{os.sep}ppg_extractor'
CONV_MODELS_DIRT = f'data{os.sep}ckpt{os.sep}ppg2mel'
VOC_MODELS_DIRT = f'data{os.sep}ckpt{os.sep}vocoder'
TEMP_SOURCE_AUDIO = f'wavs{os.sep}temp_source.wav'
TEMP_TARGET_AUDIO = f'wavs{os.sep}temp_target.wav'
TEMP_RESULT_AUDIO = f'wavs{os.sep}temp_result.wav'
# Load local sample audio as options TODO: load dataset
if os.path.isdir(AUDIO_SAMPLES_DIR):
audio_input_selection = Enum('samples', list((file.name, file) for file in Path(AUDIO_SAMPLES_DIR).glob("*.wav")))
# Pre-Load models
if os.path.isdir(EXT_MODELS_DIRT):
extractors = Enum('extractors', list((file.name, file) for file in Path(EXT_MODELS_DIRT).glob("**/*.pt")))
print("Loaded extractor models: " + str(len(extractors)))
else:
raise Exception(f"Model folder {EXT_MODELS_DIRT} doesn't exist.")
if os.path.isdir(CONV_MODELS_DIRT):
convertors = Enum('convertors', list((file.name, file) for file in Path(CONV_MODELS_DIRT).glob("**/*.pth")))
print("Loaded convertor models: " + str(len(convertors)))
else:
raise Exception(f"Model folder {CONV_MODELS_DIRT} doesn't exist.")
if os.path.isdir(VOC_MODELS_DIRT):
vocoders = Enum('vocoders', list((file.name, file) for file in Path(VOC_MODELS_DIRT).glob("**/*gan*.pt")))
print("Loaded vocoders models: " + str(len(vocoders)))
else:
raise Exception(f"Model folder {VOC_MODELS_DIRT} doesn't exist.")
class Input(BaseModel):
local_audio_file: audio_input_selection = Field(
..., alias="输入语音本地wav",
description="选择本地语音文件."
)
upload_audio_file: FileContent = Field(default=None, alias="或上传语音",
description="拖拽或点击上传.", mime_type="audio/wav")
local_audio_file_target: audio_input_selection = Field(
..., alias="目标语音本地wav",
description="选择本地语音文件."
)
upload_audio_file_target: FileContent = Field(default=None, alias="或上传目标语音",
description="拖拽或点击上传.", mime_type="audio/wav")
extractor: extractors = Field(
..., alias="编码模型",
description="选择语音编码模型文件."
)
convertor: convertors = Field(
..., alias="转换模型",
description="选择语音转换模型文件."
)
vocoder: vocoders = Field(
..., alias="语音解码模型",
description="选择语音解码模型文件(目前只支持HifiGan类型)."
)
class AudioEntity(BaseModel):
content: bytes
mel: Any
class Output(BaseModel):
__root__: Tuple[AudioEntity, AudioEntity, AudioEntity]
def render_output_ui(self, streamlit_app, input) -> None: # type: ignore
"""Custom output UI.
        If this method is implemented, it will be used instead of the default Output UI renderer.
"""
src, target, result = self.__root__
streamlit_app.subheader("Synthesized Audio")
streamlit_app.audio(result.content, format="audio/wav")
fig, ax = plt.subplots()
ax.imshow(src.mel, aspect="equal", interpolation="none")
ax.set_title("mel spectrogram(Source Audio)")
streamlit_app.pyplot(fig)
fig, ax = plt.subplots()
ax.imshow(target.mel, aspect="equal", interpolation="none")
ax.set_title("mel spectrogram(Target Audio)")
streamlit_app.pyplot(fig)
fig, ax = plt.subplots()
ax.imshow(result.mel, aspect="equal", interpolation="none")
ax.set_title("mel spectrogram(Result Audio)")
streamlit_app.pyplot(fig)
def convert(input: Input) -> Output:
"""convert(转换)"""
# load models
extractor = Extractor.load_model(Path(input.extractor.value))
convertor = Convertor.load_model(Path(input.convertor.value))
# current_synt = Synthesizer(Path(input.synthesizer.value))
gan_vocoder.load_model(Path(input.vocoder.value))
# load file
    if input.upload_audio_file is not None:
with open(TEMP_SOURCE_AUDIO, "w+b") as f:
f.write(input.upload_audio_file.as_bytes())
f.seek(0)
src_wav, sample_rate = librosa.load(TEMP_SOURCE_AUDIO)
else:
src_wav, sample_rate = librosa.load(input.local_audio_file.value)
write(TEMP_SOURCE_AUDIO, sample_rate, src_wav) #Make sure we get the correct wav
    if input.upload_audio_file_target is not None:
with open(TEMP_TARGET_AUDIO, "w+b") as f:
f.write(input.upload_audio_file_target.as_bytes())
f.seek(0)
ref_wav, _ = librosa.load(TEMP_TARGET_AUDIO)
else:
ref_wav, _ = librosa.load(input.local_audio_file_target.value)
write(TEMP_TARGET_AUDIO, sample_rate, ref_wav) #Make sure we get the correct wav
ppg = extractor.extract_from_wav(src_wav)
# Import necessary dependency of Voice Conversion
from utils.f0_utils import (compute_f0, compute_mean_std, f02lf0,
get_converted_lf0uv)
ref_lf0_mean, ref_lf0_std = compute_mean_std(f02lf0(compute_f0(ref_wav)))
speacker_encoder.load_model(Path(f"data{os.sep}ckpt{os.sep}encoder{os.sep}pretrained_bak_5805000.pt"))
embed = speacker_encoder.embed_utterance(ref_wav)
lf0_uv = get_converted_lf0uv(src_wav, ref_lf0_mean, ref_lf0_std, convert=True)
min_len = min(ppg.shape[1], len(lf0_uv))
ppg = ppg[:, :min_len]
lf0_uv = lf0_uv[:min_len]
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
_, mel_pred, att_ws = convertor.inference(
ppg,
logf0_uv=torch.from_numpy(lf0_uv).unsqueeze(0).float().to(device),
spembs=torch.from_numpy(embed).unsqueeze(0).to(device),
)
    mel_pred = mel_pred.transpose(0, 1)
    breaks = [mel_pred.shape[1]]
    mel_pred = mel_pred.detach().cpu().numpy()
# synthesize and vocode
wav, sample_rate = gan_vocoder.infer_waveform(mel_pred)
# write and output
write(TEMP_RESULT_AUDIO, sample_rate, wav) #Make sure we get the correct wav
with open(TEMP_SOURCE_AUDIO, "rb") as f:
source_file = f.read()
with open(TEMP_TARGET_AUDIO, "rb") as f:
target_file = f.read()
with open(TEMP_RESULT_AUDIO, "rb") as f:
result_file = f.read()
return Output(__root__=(AudioEntity(content=source_file, mel=Synthesizer.make_spectrogram(src_wav)), AudioEntity(content=target_file, mel=Synthesizer.make_spectrogram(ref_wav)), AudioEntity(content=result_file, mel=Synthesizer.make_spectrogram(wav))))
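One easy-to-miss detail above is the frame alignment: the extracted PPG sequence and the converted lf0/UV track can differ slightly in length, so both are truncated to the shorter one before convertor.inference. The same step in isolation (shapes are illustrative):

import numpy as np

ppg = np.zeros((1, 503, 144))   # illustrative (batch, frames, ppg_dim) output of the extractor
lf0_uv = np.zeros((501, 2))     # illustrative (frames, [lf0, uv]) track

min_len = min(ppg.shape[1], len(lf0_uv))
ppg, lf0_uv = ppg[:, :min_len], lf0_uv[:min_len]
assert ppg.shape[1] == len(lf0_uv) == 501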


@@ -0,0 +1,2 @@
from .core import Opyrator


@@ -0,0 +1 @@
from .fastapi_app import create_api


@@ -0,0 +1,102 @@
"""Collection of utilities for FastAPI apps."""
import inspect
from typing import Any, Type
from fastapi import FastAPI, Form
from pydantic import BaseModel
def as_form(cls: Type[BaseModel]) -> Any:
"""Adds an as_form class method to decorated models.
The as_form class method can be used with FastAPI endpoints
"""
new_params = [
inspect.Parameter(
field.alias,
inspect.Parameter.POSITIONAL_ONLY,
default=(Form(field.default) if not field.required else Form(...)),
)
for field in cls.__fields__.values()
]
async def _as_form(**data): # type: ignore
return cls(**data)
sig = inspect.signature(_as_form)
sig = sig.replace(parameters=new_params)
_as_form.__signature__ = sig # type: ignore
setattr(cls, "as_form", _as_form)
return cls
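A usage sketch for the decorator above (the model and route below are illustrative, not part of this repo): once decorated, the generated as_form method can be passed to FastAPI's Depends so the Pydantic model is populated from HTML form fields.

from fastapi import Depends, FastAPI
from pydantic import BaseModel

@as_form
class EchoRequest(BaseModel):  # illustrative model
    message: str
    repeat: int = 1

demo_app = FastAPI()

@demo_app.post("/echo")
async def echo(req: EchoRequest = Depends(EchoRequest.as_form)):  # form fields -> pydantic model
    return {"echo": req.message * req.repeat}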
def patch_fastapi(app: FastAPI) -> None:
"""Patch function to allow relative url resolution.
This patch is required to make fastapi fully functional with a relative url path.
    This code snippet can be copy-pasted to any FastAPI application.
"""
from fastapi.openapi.docs import get_redoc_html, get_swagger_ui_html
from starlette.requests import Request
from starlette.responses import HTMLResponse
async def redoc_ui_html(req: Request) -> HTMLResponse:
assert app.openapi_url is not None
redoc_ui = get_redoc_html(
openapi_url="./" + app.openapi_url.lstrip("/"),
title=app.title + " - Redoc UI",
)
return HTMLResponse(redoc_ui.body.decode("utf-8"))
async def swagger_ui_html(req: Request) -> HTMLResponse:
assert app.openapi_url is not None
swagger_ui = get_swagger_ui_html(
openapi_url="./" + app.openapi_url.lstrip("/"),
title=app.title + " - Swagger UI",
oauth2_redirect_url=app.swagger_ui_oauth2_redirect_url,
)
        # insert a request interceptor so that all requests run on a relative path
request_interceptor = (
"requestInterceptor: (e) => {"
"\n\t\t\tvar url = window.location.origin + window.location.pathname"
'\n\t\t\turl = url.substring( 0, url.lastIndexOf( "/" ) + 1);'
"\n\t\t\turl = e.url.replace(/http(s)?:\/\/[^/]*\//i, url);" # noqa: W605
"\n\t\t\te.contextUrl = url"
"\n\t\t\te.url = url"
"\n\t\t\treturn e;}"
)
return HTMLResponse(
swagger_ui.body.decode("utf-8").replace(
"dom_id: '#swagger-ui',",
"dom_id: '#swagger-ui',\n\t\t" + request_interceptor + ",",
)
)
# remove old docs route and add our patched route
routes_new = []
for app_route in app.routes:
if app_route.path == "/docs": # type: ignore
continue
if app_route.path == "/redoc": # type: ignore
continue
routes_new.append(app_route)
app.router.routes = routes_new
assert app.docs_url is not None
app.add_route(app.docs_url, swagger_ui_html, include_in_schema=False)
assert app.redoc_url is not None
app.add_route(app.redoc_url, redoc_ui_html, include_in_schema=False)
    # Make graphql relative
from starlette import graphql
graphql.GRAPHIQL = graphql.GRAPHIQL.replace(
"({{REQUEST_PATH}}", '("." + {{REQUEST_PATH}}'
)
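A sketch of applying the patch (the app name is illustrative): after patching, the /docs and /redoc pages resolve openapi.json relative to the current path, so they keep working when the app is mounted behind a path prefix or reverse proxy.

from fastapi import FastAPI

api = FastAPI(title="MockingBird API")  # illustrative app
patch_fastapi(api)                      # swagger/redoc now use "./openapi.json"-style relative URLs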


@@ -0,0 +1,43 @@
from typing import List
from pydantic import BaseModel
class ScoredLabel(BaseModel):
label: str
score: float
class ClassificationOutput(BaseModel):
__root__: List[ScoredLabel]
def __iter__(self): # type: ignore
return iter(self.__root__)
def __getitem__(self, item): # type: ignore
return self.__root__[item]
def render_output_ui(self, streamlit) -> None: # type: ignore
import plotly.express as px
sorted_predictions = sorted(
[prediction.dict() for prediction in self.__root__],
key=lambda k: k["score"],
)
num_labels = len(sorted_predictions)
if len(sorted_predictions) > 10:
num_labels = streamlit.slider(
"Maximum labels to show: ",
min_value=1,
max_value=len(sorted_predictions),
value=len(sorted_predictions),
)
fig = px.bar(
sorted_predictions[len(sorted_predictions) - num_labels :],
x="score",
y="label",
orientation="h",
)
streamlit.plotly_chart(fig, use_container_width=True)
# fig.show()


@@ -0,0 +1,46 @@
import base64
from typing import Any, Dict, overload
class FileContent(str):
def as_bytes(self) -> bytes:
return base64.b64decode(self, validate=True)
def as_str(self) -> str:
return self.as_bytes().decode()
@classmethod
def __modify_schema__(cls, field_schema: Dict[str, Any]) -> None:
field_schema.update(format="byte")
@classmethod
def __get_validators__(cls) -> Any: # type: ignore
yield cls.validate
@classmethod
def validate(cls, value: Any) -> "FileContent":
if isinstance(value, FileContent):
return value
elif isinstance(value, str):
return FileContent(value)
elif isinstance(value, (bytes, bytearray, memoryview)):
return FileContent(base64.b64encode(value).decode())
else:
raise Exception("Wrong type")
# # Not usable for now: the browser provides no way to select a directory
# class DirectoryContent(FileContent):
# @classmethod
# def __modify_schema__(cls, field_schema: Dict[str, Any]) -> None:
# field_schema.update(format="path")
# @classmethod
# def validate(cls, value: Any) -> "DirectoryContent":
# if isinstance(value, DirectoryContent):
# return value
# elif isinstance(value, str):
# return DirectoryContent(value)
# elif isinstance(value, (bytes, bytearray, memoryview)):
# return DirectoryContent(base64.b64encode(value).decode())
# else:
# raise Exception("Wrong type")

203
control/mkgui/base/core.py Normal file

@@ -0,0 +1,203 @@
import importlib
import inspect
import re
from typing import Any, Callable, Type, Union, get_type_hints
from pydantic import BaseModel, parse_raw_as
from pydantic.tools import parse_obj_as
def name_to_title(name: str) -> str:
"""Converts a camelCase or snake_case name to title case."""
# If camelCase -> convert to snake case
name = re.sub("(.)([A-Z][a-z]+)", r"\1_\2", name)
name = re.sub("([a-z0-9])([A-Z])", r"\1_\2", name).lower()
# Convert to title case
return name.replace("_", " ").strip().title()
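A couple of illustrative conversions (both snake_case property keys and camelCase class names pass through here):

assert name_to_title("local_audio_file") == "Local Audio File"
assert name_to_title("ClassificationOutput") == "Classification Output"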
def is_compatible_type(type: Type) -> bool:
"""Returns `True` if the type is opyrator-compatible."""
try:
if issubclass(type, BaseModel):
return True
except Exception:
pass
try:
# valid list type
if type.__origin__ is list and issubclass(type.__args__[0], BaseModel):
return True
except Exception:
pass
return False
def get_input_type(func: Callable) -> Type:
"""Returns the input type of a given function (callable).
Args:
func: The function for which to get the input type.
Raises:
ValueError: If the function does not have a valid input type annotation.
"""
type_hints = get_type_hints(func)
if "input" not in type_hints:
raise ValueError(
"The callable MUST have a parameter with the name `input` with typing annotation. "
"For example: `def my_opyrator(input: InputModel) -> OutputModel:`."
)
input_type = type_hints["input"]
if not is_compatible_type(input_type):
raise ValueError(
"The `input` parameter MUST be a subclass of the Pydantic BaseModel or a list of Pydantic models."
)
# TODO: return warning if more than one input parameters
return input_type
def get_output_type(func: Callable) -> Type:
"""Returns the output type of a given function (callable).
Args:
func: The function for which to get the output type.
Raises:
ValueError: If the function does not have a valid output type annotation.
"""
type_hints = get_type_hints(func)
if "return" not in type_hints:
raise ValueError(
"The return type of the callable MUST be annotated with type hints."
"For example: `def my_opyrator(input: InputModel) -> OutputModel:`."
)
output_type = type_hints["return"]
if not is_compatible_type(output_type):
raise ValueError(
"The return value MUST be a subclass of the Pydantic BaseModel or a list of Pydantic models."
)
return output_type
def get_callable(import_string: str) -> Callable:
"""Import a callable from an string."""
callable_seperator = ":"
if callable_seperator not in import_string:
        # Fall back to '.' as the separator
callable_seperator = "."
if callable_seperator not in import_string:
raise ValueError("The callable path MUST specify the function. ")
mod_name, callable_name = import_string.rsplit(callable_seperator, 1)
mod = importlib.import_module(mod_name)
return getattr(mod, callable_name)
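The import string accepted here is the same one Opyrator takes below, "package.module:callable", with a plain dot accepted as a fallback separator. An illustrative resolution using a function defined in this module:

fn = get_callable("control.mkgui.base.core:name_to_title")
assert fn("render_ui") == "Render Ui"
fn = get_callable("control.mkgui.base.core.name_to_title")  # '.' fallback when no ':' is present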
class Opyrator:
def __init__(self, func: Union[Callable, str]) -> None:
if isinstance(func, str):
            # Try to load the function from a string notation
self.function = get_callable(func)
else:
self.function = func
self._action = "Execute"
self._input_type = None
self._output_type = None
if not callable(self.function):
raise ValueError("The provided function parameters is not a callable.")
if inspect.isclass(self.function):
raise ValueError(
"The provided callable is an uninitialized Class. This is not allowed."
)
if inspect.isfunction(self.function):
# The provided callable is a function
self._input_type = get_input_type(self.function)
self._output_type = get_output_type(self.function)
try:
# Get name
self._name = name_to_title(self.function.__name__)
except Exception:
pass
try:
# Get description from function
doc_string = inspect.getdoc(self.function)
if doc_string:
self._action = doc_string
except Exception:
pass
elif hasattr(self.function, "__call__"):
# The provided callable is a function
self._input_type = get_input_type(self.function.__call__) # type: ignore
self._output_type = get_output_type(self.function.__call__) # type: ignore
try:
# Get name
self._name = name_to_title(type(self.function).__name__)
except Exception:
pass
try:
                # Get action from the __call__ docstring
doc_string = inspect.getdoc(self.function.__call__) # type: ignore
if doc_string:
self._action = doc_string
if (
not self._action
or self._action == "Call"
):
# Get docstring from class instead of __call__ function
doc_string = inspect.getdoc(self.function)
if doc_string:
self._action = doc_string
except Exception:
pass
else:
raise ValueError("Unknown callable type.")
@property
def name(self) -> str:
return self._name
@property
def action(self) -> str:
return self._action
@property
def input_type(self) -> Any:
return self._input_type
@property
def output_type(self) -> Any:
return self._output_type
def __call__(self, input: Any, **kwargs: Any) -> Any:
input_obj = input
if isinstance(input, str):
# Allow json input
input_obj = parse_raw_as(self.input_type, input)
if isinstance(input, dict):
# Allow dict input
input_obj = parse_obj_as(self.input_type, input)
return self.function(input_obj, **kwargs)
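A minimal end-to-end sketch of wrapping a callable (the models and values are illustrative); note that __call__ accepts either a JSON string or a plain dict:

class GreetInput(BaseModel):   # illustrative models
    name: str

class GreetOutput(BaseModel):
    greeting: str

def greet(input: GreetInput) -> GreetOutput:
    """Greet(问候)"""
    return GreetOutput(greeting="Hello " + input.name)

op = Opyrator(greet)
print(op.name, "/", op.action)                  # "Greet" / the docstring
print(op('{"name": "MockingBird"}').greeting)   # JSON string input via parse_raw_as
print(op({"name": "MockingBird"}).greeting)     # dict input via parse_obj_as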


@@ -0,0 +1 @@
from .streamlit_ui import render_streamlit_ui


@@ -0,0 +1,135 @@
from typing import Dict
def resolve_reference(reference: str, references: Dict) -> Dict:
return references[reference.split("/")[-1]]
def get_single_reference_item(property: Dict, references: Dict) -> Dict:
# Ref can either be directly in the properties or the first element of allOf
reference = property.get("$ref")
if reference is None:
reference = property["allOf"][0]["$ref"]
return resolve_reference(reference, references)
def is_single_string_property(property: Dict) -> bool:
return property.get("type") == "string"
def is_single_datetime_property(property: Dict) -> bool:
if property.get("type") != "string":
return False
return property.get("format") in ["date-time", "time", "date"]
def is_single_boolean_property(property: Dict) -> bool:
return property.get("type") == "boolean"
def is_single_number_property(property: Dict) -> bool:
return property.get("type") in ["integer", "number"]
def is_single_file_property(property: Dict) -> bool:
if property.get("type") != "string":
return False
# TODO: binary?
return property.get("format") == "byte"
def is_single_autio_property(property: Dict) -> bool:
if property.get("type") != "string":
return False
# TODO: binary?
return property.get("format") == "bytes"
def is_single_directory_property(property: Dict) -> bool:
if property.get("type") != "string":
return False
return property.get("format") == "path"
def is_multi_enum_property(property: Dict, references: Dict) -> bool:
if property.get("type") != "array":
return False
if property.get("uniqueItems") is not True:
        # Only relevant for sets or other data structures with unique items
return False
try:
_ = resolve_reference(property["items"]["$ref"], references)["enum"]
return True
except Exception:
return False
def is_single_enum_property(property: Dict, references: Dict) -> bool:
try:
_ = get_single_reference_item(property, references)["enum"]
return True
except Exception:
return False
def is_single_dict_property(property: Dict) -> bool:
if property.get("type") != "object":
return False
return "additionalProperties" in property
def is_single_reference(property: Dict) -> bool:
if property.get("type") is not None:
return False
return bool(property.get("$ref"))
def is_multi_file_property(property: Dict) -> bool:
if property.get("type") != "array":
return False
if property.get("items") is None:
return False
try:
# TODO: binary
return property["items"]["format"] == "byte"
except Exception:
return False
def is_single_object(property: Dict, references: Dict) -> bool:
try:
object_reference = get_single_reference_item(property, references)
if object_reference["type"] != "object":
return False
return "properties" in object_reference
except Exception:
return False
def is_property_list(property: Dict) -> bool:
if property.get("type") != "array":
return False
if property.get("items") is None:
return False
try:
return property["items"]["type"] in ["string", "number", "integer"]
except Exception:
return False
def is_object_list_property(property: Dict, references: Dict) -> bool:
if property.get("type") != "array":
return False
try:
object_reference = resolve_reference(property["items"]["$ref"], references)
if object_reference["type"] != "object":
return False
return "properties" in object_reference
except Exception:
return False
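A few illustrative schema fragments and the predicates above that they satisfy, mirroring the shapes pydantic emits for the GUI's Input models:

refs = {"vocoders": {"title": "vocoders", "enum": ["pretrained_gan.pt"]}}  # illustrative definitions

assert is_single_file_property({"type": "string", "format": "byte"})                   # FileContent upload
assert is_single_number_property({"type": "integer", "title": "处理线程数"})            # e.g. n_processes
assert is_single_enum_property({"allOf": [{"$ref": "#/definitions/vocoders"}]}, refs)  # model selectbox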


@@ -0,0 +1,933 @@
import datetime
import inspect
import mimetypes
import sys
from os import getcwd, unlink, path
from platform import system
from tempfile import NamedTemporaryFile
from typing import Any, Callable, Dict, List, Type
from PIL import Image
import pandas as pd
import streamlit as st
from fastapi.encoders import jsonable_encoder
from loguru import logger
from pydantic import BaseModel, ValidationError, parse_obj_as
from control.mkgui.base import Opyrator
from control.mkgui.base.core import name_to_title
from . import schema_utils
from .streamlit_utils import CUSTOM_STREAMLIT_CSS
STREAMLIT_RUNNER_SNIPPET = """
from control.mkgui.base.ui import render_streamlit_ui
import streamlit as st
# TODO: Make it configurable
# Page config can only be setup once
st.set_page_config(
page_title="MockingBird",
page_icon="🧊",
layout="wide")
render_streamlit_ui()
"""
# with st.spinner("Loading MockingBird GUI. Please wait..."):
# opyrator = Opyrator("{opyrator_path}")
def launch_ui(port: int = 8501) -> None:
with NamedTemporaryFile(
suffix=".py", mode="w", encoding="utf-8", delete=False
) as f:
f.write(STREAMLIT_RUNNER_SNIPPET)
f.seek(0)
import subprocess
python_path = f'PYTHONPATH="$PYTHONPATH:{getcwd()}"'
if system() == "Windows":
python_path = f"set PYTHONPATH=%PYTHONPATH%;{getcwd()} &&"
subprocess.run(
f"""set STREAMLIT_GLOBAL_SHOW_WARNING_ON_DIRECT_EXECUTION=false""",
shell=True,
)
subprocess.run(
f"""{python_path} "{sys.executable}" -m streamlit run --server.port={port} --server.headless=True --runner.magicEnabled=False --server.maxUploadSize=50 --browser.gatherUsageStats=False {f.name}""",
shell=True,
)
f.close()
unlink(f.name)
def function_has_named_arg(func: Callable, parameter: str) -> bool:
    try:
        sig = inspect.signature(func)
        for param in sig.parameters.values():
            if param.name == parameter:
                return True
    except Exception:
        return False
    return False
def has_output_ui_renderer(data_item: BaseModel) -> bool:
return hasattr(data_item, "render_output_ui")
def has_input_ui_renderer(input_class: Type[BaseModel]) -> bool:
return hasattr(input_class, "render_input_ui")
def is_compatible_audio(mime_type: str) -> bool:
return mime_type in ["audio/mpeg", "audio/ogg", "audio/wav"]
def is_compatible_image(mime_type: str) -> bool:
return mime_type in ["image/png", "image/jpeg"]
def is_compatible_video(mime_type: str) -> bool:
return mime_type in ["video/mp4"]
class InputUI:
def __init__(self, session_state, input_class: Type[BaseModel]):
self._session_state = session_state
self._input_class = input_class
self._schema_properties = input_class.schema(by_alias=True).get(
"properties", {}
)
self._schema_references = input_class.schema(by_alias=True).get(
"definitions", {}
)
def render_ui(self, streamlit_app_root) -> None:
if has_input_ui_renderer(self._input_class):
# The input model has a rendering function
# The rendering also returns the current state of input data
self._session_state.input_data = self._input_class.render_input_ui( # type: ignore
st, self._session_state.input_data
)
return
# print(self._schema_properties)
for property_key in self._schema_properties.keys():
property = self._schema_properties[property_key]
if not property.get("title"):
# Set property key as fallback title
property["title"] = name_to_title(property_key)
try:
if "input_data" in self._session_state:
self._store_value(
property_key,
self._render_property(streamlit_app_root, property_key, property),
)
except Exception as e:
print("Exception!", e)
pass
def _get_default_streamlit_input_kwargs(self, key: str, property: Dict) -> Dict:
streamlit_kwargs = {
"label": property.get("title"),
"key": key,
}
if property.get("description"):
streamlit_kwargs["help"] = property.get("description")
return streamlit_kwargs
def _store_value(self, key: str, value: Any) -> None:
data_element = self._session_state.input_data
key_elements = key.split(".")
for i, key_element in enumerate(key_elements):
if i == len(key_elements) - 1:
# add value to this element
data_element[key_element] = value
return
if key_element not in data_element:
data_element[key_element] = {}
data_element = data_element[key_element]
def _get_value(self, key: str) -> Any:
data_element = self._session_state.input_data
key_elements = key.split(".")
for i, key_element in enumerate(key_elements):
if i == len(key_elements) - 1:
# add value to this element
if key_element not in data_element:
return None
return data_element[key_element]
if key_element not in data_element:
data_element[key_element] = {}
data_element = data_element[key_element]
return None
def _render_single_datetime_input(
self, streamlit_app: st, key: str, property: Dict
) -> Any:
streamlit_kwargs = self._get_default_streamlit_input_kwargs(key, property)
if property.get("format") == "time":
if property.get("default"):
try:
streamlit_kwargs["value"] = datetime.time.fromisoformat( # type: ignore
property.get("default")
)
except Exception:
pass
return streamlit_app.time_input(**streamlit_kwargs)
elif property.get("format") == "date":
if property.get("default"):
try:
streamlit_kwargs["value"] = datetime.date.fromisoformat( # type: ignore
property.get("default")
)
except Exception:
pass
return streamlit_app.date_input(**streamlit_kwargs)
elif property.get("format") == "date-time":
if property.get("default"):
try:
streamlit_kwargs["value"] = datetime.datetime.fromisoformat( # type: ignore
property.get("default")
)
except Exception:
pass
with streamlit_app.container():
streamlit_app.subheader(streamlit_kwargs.get("label"))
if streamlit_kwargs.get("description"):
streamlit_app.text(streamlit_kwargs.get("description"))
selected_date = None
selected_time = None
date_col, time_col = streamlit_app.columns(2)
with date_col:
date_kwargs = {"label": "Date", "key": key + "-date-input"}
if streamlit_kwargs.get("value"):
try:
date_kwargs["value"] = streamlit_kwargs.get( # type: ignore
"value"
).date()
except Exception:
pass
selected_date = streamlit_app.date_input(**date_kwargs)
with time_col:
time_kwargs = {"label": "Time", "key": key + "-time-input"}
if streamlit_kwargs.get("value"):
try:
time_kwargs["value"] = streamlit_kwargs.get( # type: ignore
"value"
).time()
except Exception:
pass
selected_time = streamlit_app.time_input(**time_kwargs)
return datetime.datetime.combine(selected_date, selected_time)
else:
streamlit_app.warning(
"Date format is not supported: " + str(property.get("format"))
)
def _render_single_file_input(
self, streamlit_app: st, key: str, property: Dict
) -> Any:
streamlit_kwargs = self._get_default_streamlit_input_kwargs(key, property)
file_extension = None
if "mime_type" in property:
file_extension = mimetypes.guess_extension(property["mime_type"])
if "is_recorder" in property:
from audio_recorder_streamlit import audio_recorder
audio_bytes = audio_recorder()
if audio_bytes:
streamlit_app.audio(audio_bytes, format="audio/wav")
return audio_bytes
uploaded_file = streamlit_app.file_uploader(
**streamlit_kwargs, accept_multiple_files=False, type=file_extension
)
if uploaded_file is None:
return None
bytes = uploaded_file.getvalue()
if property.get("mime_type"):
if is_compatible_audio(property["mime_type"]):
# Show audio
streamlit_app.audio(bytes, format=property.get("mime_type"))
if is_compatible_image(property["mime_type"]):
# Show image
streamlit_app.image(bytes)
if is_compatible_video(property["mime_type"]):
# Show video
streamlit_app.video(bytes, format=property.get("mime_type"))
return bytes
def _render_single_audio_input(
self, streamlit_app: st, key: str, property: Dict
) -> Any:
# streamlit_kwargs = self._get_default_streamlit_input_kwargs(key, property)
from audio_recorder_streamlit import audio_recorder
audio_bytes = audio_recorder()
if audio_bytes:
streamlit_app.audio(audio_bytes, format="audio/wav")
return audio_bytes
# file_extension = None
# if "mime_type" in property:
# file_extension = mimetypes.guess_extension(property["mime_type"])
# uploaded_file = streamlit_app.file_uploader(
# **streamlit_kwargs, accept_multiple_files=False, type=file_extension
# )
# if uploaded_file is None:
# return None
# bytes = uploaded_file.getvalue()
# if property.get("mime_type"):
# if is_compatible_audio(property["mime_type"]):
# # Show audio
# streamlit_app.audio(bytes, format=property.get("mime_type"))
# if is_compatible_image(property["mime_type"]):
# # Show image
# streamlit_app.image(bytes)
# if is_compatible_video(property["mime_type"]):
# # Show video
# streamlit_app.video(bytes, format=property.get("mime_type"))
# return bytes
def _render_single_string_input(
self, streamlit_app: st, key: str, property: Dict
) -> Any:
streamlit_kwargs = self._get_default_streamlit_input_kwargs(key, property)
if property.get("default"):
streamlit_kwargs["value"] = property.get("default")
elif property.get("example"):
# TODO: also use example for other property types
# Use example as value if it is provided
streamlit_kwargs["value"] = property.get("example")
if property.get("maxLength") is not None:
streamlit_kwargs["max_chars"] = property.get("maxLength")
if (
property.get("format")
or (
property.get("maxLength") is not None
and int(property.get("maxLength")) < 140 # type: ignore
)
or property.get("writeOnly")
):
# If any format is set, use single text input
# If max chars is set to less than 140, use single text input
# If write only -> password field
if property.get("writeOnly"):
streamlit_kwargs["type"] = "password"
return streamlit_app.text_input(**streamlit_kwargs)
else:
# Otherwise use multiline text area
return streamlit_app.text_area(**streamlit_kwargs)
def _render_multi_enum_input(
self, streamlit_app: st, key: str, property: Dict
) -> Any:
streamlit_kwargs = self._get_default_streamlit_input_kwargs(key, property)
reference_item = schema_utils.resolve_reference(
property["items"]["$ref"], self._schema_references
)
# TODO: how to select defaults
return streamlit_app.multiselect(
**streamlit_kwargs, options=reference_item["enum"]
)
def _render_single_enum_input(
self, streamlit_app: st, key: str, property: Dict
) -> Any:
streamlit_kwargs = self._get_default_streamlit_input_kwargs(key, property)
reference_item = schema_utils.get_single_reference_item(
property, self._schema_references
)
if property.get("default") is not None:
try:
streamlit_kwargs["index"] = reference_item["enum"].index(
property.get("default")
)
except Exception:
# Use default selection
pass
return streamlit_app.selectbox(
**streamlit_kwargs, options=reference_item["enum"]
)
def _render_single_dict_input(
self, streamlit_app: st, key: str, property: Dict
) -> Any:
# Add title and subheader
streamlit_app.subheader(property.get("title"))
if property.get("description"):
streamlit_app.markdown(property.get("description"))
streamlit_app.markdown("---")
current_dict = self._get_value(key)
if not current_dict:
current_dict = {}
key_col, value_col = streamlit_app.columns(2)
with key_col:
updated_key = streamlit_app.text_input(
"Key", value="", key=key + "-new-key"
)
with value_col:
# TODO: also add boolean?
value_kwargs = {"label": "Value", "key": key + "-new-value"}
if property["additionalProperties"].get("type") == "integer":
value_kwargs["value"] = 0 # type: ignore
updated_value = streamlit_app.number_input(**value_kwargs)
elif property["additionalProperties"].get("type") == "number":
value_kwargs["value"] = 0.0 # type: ignore
value_kwargs["format"] = "%f"
updated_value = streamlit_app.number_input(**value_kwargs)
else:
value_kwargs["value"] = ""
updated_value = streamlit_app.text_input(**value_kwargs)
streamlit_app.markdown("---")
with streamlit_app.container():
clear_col, add_col = streamlit_app.columns([1, 2])
with clear_col:
if streamlit_app.button("Clear Items", key=key + "-clear-items"):
current_dict = {}
with add_col:
if (
streamlit_app.button("Add Item", key=key + "-add-item")
and updated_key
):
current_dict[updated_key] = updated_value
streamlit_app.write(current_dict)
return current_dict
def _render_single_reference(
self, streamlit_app: st, key: str, property: Dict
) -> Any:
reference_item = schema_utils.get_single_reference_item(
property, self._schema_references
)
return self._render_property(streamlit_app, key, reference_item)
def _render_multi_file_input(
self, streamlit_app: st, key: str, property: Dict
) -> Any:
streamlit_kwargs = self._get_default_streamlit_input_kwargs(key, property)
file_extension = None
if "mime_type" in property:
file_extension = mimetypes.guess_extension(property["mime_type"])
uploaded_files = streamlit_app.file_uploader(
**streamlit_kwargs, accept_multiple_files=True, type=file_extension
)
uploaded_files_bytes = []
if uploaded_files:
for uploaded_file in uploaded_files:
uploaded_files_bytes.append(uploaded_file.read())
return uploaded_files_bytes
def _render_single_boolean_input(
self, streamlit_app: st, key: str, property: Dict
) -> Any:
streamlit_kwargs = self._get_default_streamlit_input_kwargs(key, property)
if property.get("default"):
streamlit_kwargs["value"] = property.get("default")
return streamlit_app.checkbox(**streamlit_kwargs)
def _render_single_number_input(
self, streamlit_app: st, key: str, property: Dict
) -> Any:
streamlit_kwargs = self._get_default_streamlit_input_kwargs(key, property)
number_transform = int
if property.get("type") == "number":
number_transform = float # type: ignore
streamlit_kwargs["format"] = "%f"
if "multipleOf" in property:
# Set stepcount based on multiple of parameter
streamlit_kwargs["step"] = number_transform(property["multipleOf"])
elif number_transform == int:
# Set step size to 1 as default
streamlit_kwargs["step"] = 1
elif number_transform == float:
# Set step size to 0.01 as default
# TODO: adapt to default value
streamlit_kwargs["step"] = 0.01
if "minimum" in property:
streamlit_kwargs["min_value"] = number_transform(property["minimum"])
if "exclusiveMinimum" in property:
streamlit_kwargs["min_value"] = number_transform(
property["exclusiveMinimum"] + streamlit_kwargs["step"]
)
if "maximum" in property:
streamlit_kwargs["max_value"] = number_transform(property["maximum"])
if "exclusiveMaximum" in property:
streamlit_kwargs["max_value"] = number_transform(
property["exclusiveMaximum"] - streamlit_kwargs["step"]
)
if property.get("default") is not None:
streamlit_kwargs["value"] = number_transform(property.get("default")) # type: ignore
else:
if "min_value" in streamlit_kwargs:
streamlit_kwargs["value"] = streamlit_kwargs["min_value"]
elif number_transform == int:
streamlit_kwargs["value"] = 0
else:
# Set default value to step
streamlit_kwargs["value"] = number_transform(streamlit_kwargs["step"])
if "min_value" in streamlit_kwargs and "max_value" in streamlit_kwargs:
# TODO: Only if less than X steps
return streamlit_app.slider(**streamlit_kwargs)
else:
return streamlit_app.number_input(**streamlit_kwargs)
def _render_object_input(self, streamlit_app: st, key: str, property: Dict) -> Any:
properties = property["properties"]
object_inputs = {}
for property_key in properties:
property = properties[property_key]
if not property.get("title"):
# Set property key as fallback title
property["title"] = name_to_title(property_key)
# construct full key based on key parts -> required later to get the value
full_key = key + "." + property_key
object_inputs[property_key] = self._render_property(
streamlit_app, full_key, property
)
return object_inputs
def _render_single_object_input(
self, streamlit_app: st, key: str, property: Dict
) -> Any:
# Add title and subheader
title = property.get("title")
streamlit_app.subheader(title)
if property.get("description"):
streamlit_app.markdown(property.get("description"))
object_reference = schema_utils.get_single_reference_item(
property, self._schema_references
)
return self._render_object_input(streamlit_app, key, object_reference)
def _render_property_list_input(
self, streamlit_app: st, key: str, property: Dict
) -> Any:
# Add title and subheader
streamlit_app.subheader(property.get("title"))
if property.get("description"):
streamlit_app.markdown(property.get("description"))
streamlit_app.markdown("---")
current_list = self._get_value(key)
if not current_list:
current_list = []
value_kwargs = {"label": "Value", "key": key + "-new-value"}
if property["items"]["type"] == "integer":
value_kwargs["value"] = 0 # type: ignore
new_value = streamlit_app.number_input(**value_kwargs)
elif property["items"]["type"] == "number":
value_kwargs["value"] = 0.0 # type: ignore
value_kwargs["format"] = "%f"
new_value = streamlit_app.number_input(**value_kwargs)
else:
value_kwargs["value"] = ""
new_value = streamlit_app.text_input(**value_kwargs)
streamlit_app.markdown("---")
with streamlit_app.container():
clear_col, add_col = streamlit_app.columns([1, 2])
with clear_col:
if streamlit_app.button("Clear Items", key=key + "-clear-items"):
current_list = []
with add_col:
if (
streamlit_app.button("Add Item", key=key + "-add-item")
and new_value is not None
):
current_list.append(new_value)
streamlit_app.write(current_list)
return current_list
def _render_object_list_input(
self, streamlit_app: st, key: str, property: Dict
) -> Any:
# TODO: support max_items, and min_items properties
# Add title and subheader
streamlit_app.subheader(property.get("title"))
if property.get("description"):
streamlit_app.markdown(property.get("description"))
streamlit_app.markdown("---")
current_list = self._get_value(key)
if not current_list:
current_list = []
object_reference = schema_utils.resolve_reference(
property["items"]["$ref"], self._schema_references
)
input_data = self._render_object_input(streamlit_app, key, object_reference)
streamlit_app.markdown("---")
with streamlit_app.container():
clear_col, add_col = streamlit_app.columns([1, 2])
with clear_col:
if streamlit_app.button("Clear Items", key=key + "-clear-items"):
current_list = []
with add_col:
if (
streamlit_app.button("Add Item", key=key + "-add-item")
and input_data
):
current_list.append(input_data)
streamlit_app.write(current_list)
return current_list
def _render_property(self, streamlit_app: st, key: str, property: Dict) -> Any:
if schema_utils.is_single_enum_property(property, self._schema_references):
return self._render_single_enum_input(streamlit_app, key, property)
if schema_utils.is_multi_enum_property(property, self._schema_references):
return self._render_multi_enum_input(streamlit_app, key, property)
if schema_utils.is_single_file_property(property):
return self._render_single_file_input(streamlit_app, key, property)
if schema_utils.is_multi_file_property(property):
return self._render_multi_file_input(streamlit_app, key, property)
if schema_utils.is_single_datetime_property(property):
return self._render_single_datetime_input(streamlit_app, key, property)
if schema_utils.is_single_boolean_property(property):
return self._render_single_boolean_input(streamlit_app, key, property)
if schema_utils.is_single_dict_property(property):
return self._render_single_dict_input(streamlit_app, key, property)
if schema_utils.is_single_number_property(property):
return self._render_single_number_input(streamlit_app, key, property)
if schema_utils.is_single_string_property(property):
return self._render_single_string_input(streamlit_app, key, property)
if schema_utils.is_single_object(property, self._schema_references):
return self._render_single_object_input(streamlit_app, key, property)
if schema_utils.is_object_list_property(property, self._schema_references):
return self._render_object_list_input(streamlit_app, key, property)
if schema_utils.is_property_list(property):
return self._render_property_list_input(streamlit_app, key, property)
if schema_utils.is_single_reference(property):
return self._render_single_reference(streamlit_app, key, property)
streamlit_app.warning(
"The type of the following property is currently not supported: "
+ str(property.get("title"))
)
raise Exception("Unsupported property")
class OutputUI:
def __init__(self, output_data: Any, input_data: Any):
self._output_data = output_data
self._input_data = input_data
def render_ui(self, streamlit_app) -> None:
try:
if isinstance(self._output_data, BaseModel):
self._render_single_output(streamlit_app, self._output_data)
return
            if isinstance(self._output_data, list):
self._render_list_output(streamlit_app, self._output_data)
return
except Exception as ex:
streamlit_app.exception(ex)
        # Fall back to JSON rendering
streamlit_app.json(jsonable_encoder(self._output_data))
def _render_single_text_property(
self, streamlit: st, property_schema: Dict, value: Any
) -> None:
# Add title and subheader
streamlit.subheader(property_schema.get("title"))
if property_schema.get("description"):
streamlit.markdown(property_schema.get("description"))
if value is None or value == "":
streamlit.info("No value returned!")
else:
streamlit.code(str(value), language="plain")
def _render_single_file_property(
self, streamlit: st, property_schema: Dict, value: Any
) -> None:
# Add title and subheader
streamlit.subheader(property_schema.get("title"))
if property_schema.get("description"):
streamlit.markdown(property_schema.get("description"))
if value is None or value == "":
streamlit.info("No value returned!")
else:
# TODO: Detect if it is a FileContent instance
# TODO: detect if it is base64
file_extension = ""
if "mime_type" in property_schema:
mime_type = property_schema["mime_type"]
file_extension = mimetypes.guess_extension(mime_type) or ""
if is_compatible_audio(mime_type):
streamlit.audio(value.as_bytes(), format=mime_type)
return
if is_compatible_image(mime_type):
streamlit.image(value.as_bytes())
return
if is_compatible_video(mime_type):
streamlit.video(value.as_bytes(), format=mime_type)
return
filename = (
(property_schema["title"] + file_extension)
.lower()
.strip()
.replace(" ", "-")
)
streamlit.markdown(
f'<a href="data:application/octet-stream;base64,{value}" download="{filename}"><input type="button" value="Download File"></a>',
unsafe_allow_html=True,
)
def _render_single_complex_property(
self, streamlit: st, property_schema: Dict, value: Any
) -> None:
# Add title and subheader
streamlit.subheader(property_schema.get("title"))
if property_schema.get("description"):
streamlit.markdown(property_schema.get("description"))
streamlit.json(jsonable_encoder(value))
def _render_single_output(self, streamlit: st, output_data: BaseModel) -> None:
try:
if has_output_ui_renderer(output_data):
if function_has_named_arg(output_data.render_output_ui, "input"): # type: ignore
# render method also requests the input data
output_data.render_output_ui(streamlit, input=self._input_data) # type: ignore
else:
output_data.render_output_ui(streamlit) # type: ignore
return
except Exception:
# Use default auto-generation methods if the custom rendering throws an exception
logger.exception(
"Failed to execute custom render_output_ui function. Using auto-generation instead"
)
model_schema = output_data.schema(by_alias=False)
model_properties = model_schema.get("properties")
definitions = model_schema.get("definitions")
if model_properties:
            for property_key in output_data.__dict__:
                property_schema = model_properties.get(property_key)
                if property_schema and not property_schema.get("title"):
                    # Set property key as fallback title
                    property_schema["title"] = property_key
output_property_value = output_data.__dict__[property_key]
if has_output_ui_renderer(output_property_value):
output_property_value.render_output_ui(streamlit) # type: ignore
continue
if isinstance(output_property_value, BaseModel):
                    # Render output recursively
streamlit.subheader(property_schema.get("title"))
if property_schema.get("description"):
streamlit.markdown(property_schema.get("description"))
self._render_single_output(streamlit, output_property_value)
continue
if property_schema:
if schema_utils.is_single_file_property(property_schema):
self._render_single_file_property(
streamlit, property_schema, output_property_value
)
continue
if (
schema_utils.is_single_string_property(property_schema)
or schema_utils.is_single_number_property(property_schema)
or schema_utils.is_single_datetime_property(property_schema)
or schema_utils.is_single_boolean_property(property_schema)
):
self._render_single_text_property(
streamlit, property_schema, output_property_value
)
continue
if definitions and schema_utils.is_single_enum_property(
property_schema, definitions
):
self._render_single_text_property(
streamlit, property_schema, output_property_value.value
)
continue
# TODO: render dict as table
self._render_single_complex_property(
streamlit, property_schema, output_property_value
)
return
def _render_list_output(self, streamlit: st, output_data: List) -> None:
try:
data_items: List = []
for data_item in output_data:
if has_output_ui_renderer(data_item):
# Render using the render function
data_item.render_output_ui(streamlit) # type: ignore
continue
data_items.append(data_item.dict())
# Try to show as dataframe
streamlit.table(pd.DataFrame(data_items))
except Exception:
            # Fall back to JSON rendering
streamlit.json(jsonable_encoder(output_data))
def getOpyrator(mode: str) -> Opyrator:
    if mode is None or mode.startswith('VC'):
        from control.mkgui.app_vc import convert
        return Opyrator(convert)
    if mode.startswith('预处理'):
        from control.mkgui.preprocess import preprocess
        return Opyrator(preprocess)
    # Check the more specific "模型训练(VC)" prefix first, otherwise "模型训练" shadows it
    if mode.startswith('模型训练(VC)'):
        from control.mkgui.train_vc import train_vc
        return Opyrator(train_vc)
    if mode.startswith('模型训练'):
        from control.mkgui.train import train
        return Opyrator(train)
    from control.mkgui.app import synthesize
    return Opyrator(synthesize)
def render_streamlit_ui() -> None:
# init
session_state = st.session_state
session_state.input_data = {}
# Add custom css settings
st.markdown(f"<style>{CUSTOM_STREAMLIT_CSS}</style>", unsafe_allow_html=True)
with st.spinner("Loading MockingBird GUI. Please wait..."):
session_state.mode = st.sidebar.selectbox(
'模式选择',
( "AI拟音", "VC拟音", "预处理", "模型训练", "模型训练(VC)")
)
if "mode" in session_state:
mode = session_state.mode
else:
mode = ""
opyrator = getOpyrator(mode)
title = opyrator.name + mode
col1, col2, _ = st.columns(3)
col2.title(title)
col2.markdown("欢迎使用MockingBird Web 2")
image = Image.open(path.join('control','mkgui', 'static', 'mb.png'))
col1.image(image)
st.markdown("---")
left, right = st.columns([0.4, 0.6])
with left:
st.header("Control 控制")
# if session_state.mode in ["AI拟音", "VC拟音"] :
# from audiorecorder import audiorecorder
# audio = audiorecorder("Click to record", "Recording...")
# if len(audio) > 0:
# # To play audio in frontend:
# st.audio(audio.tobytes())
InputUI(session_state=session_state, input_class=opyrator.input_type).render_ui(st)
execute_selected = st.button(opyrator.action)
if execute_selected:
with st.spinner("Executing operation. Please wait..."):
try:
input_data_obj = parse_obj_as(
opyrator.input_type, session_state.input_data
)
session_state.output_data = opyrator(input=input_data_obj)
session_state.latest_operation_input = input_data_obj # should this really be saved as additional session object?
except ValidationError as ex:
st.error(ex)
else:
# st.success("Operation executed successfully.")
pass
with right:
st.header("Result 结果")
if 'output_data' in session_state:
OutputUI(
session_state.output_data, session_state.latest_operation_input
).render_ui(st)
if st.button("Clear"):
# Clear all state
for key in list(st.session_state.keys()):
del st.session_state[key]
session_state.input_data = {}
st.experimental_rerun()
else:
# placeholder
st.caption("请使用左侧控制板进行输入并运行获得结果")
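Taken together, this module is a thin Streamlit front end over Opyrator: each mode in getOpyrator returns an Opyrator wrapping a plain function whose pydantic Input/Output models drive the auto-generated form, and an Output model can opt out of the default renderer by defining render_output_ui. A minimal, hedged sketch of that contract, mirroring the pattern used by the mkgui modules below (the echo function and its fields are illustrative, not part of this change):

from pydantic import BaseModel, Field


class Input(BaseModel):
    text: str = Field(..., title="Text to echo")


class Output(BaseModel):
    __root__: int

    def render_output_ui(self, streamlit_app) -> None:
        # Picked up by OutputUI instead of the default per-property renderer
        streamlit_app.subheader(f"Finished with code: {self.__root__}")


def echo(input: Input) -> Output:
    """Echo(示例)"""
    print(input.text)
    return Output(__root__=0)

# Wrapping it the same way getOpyrator does would be: Opyrator(echo)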


@@ -0,0 +1,13 @@
CUSTOM_STREAMLIT_CSS = """
div[data-testid="stBlock"] button {
width: 100% !important;
margin-bottom: 20px !important;
border-color: #bfbfbf !important;
}
section[data-testid="stSidebar"] div {
max-width: 10rem;
}
pre code {
white-space: pre-wrap;
}
"""


@@ -0,0 +1,96 @@
from pydantic import BaseModel, Field
import os
from pathlib import Path
from enum import Enum
from typing import Any, Tuple
# Constants
EXT_MODELS_DIRT = f"data{os.sep}ckpt{os.sep}ppg_extractor"
ENC_MODELS_DIRT = f"data{os.sep}ckpt{os.sep}encoder"
if os.path.isdir(EXT_MODELS_DIRT):
extractors = Enum('extractors', list((file.name, file) for file in Path(EXT_MODELS_DIRT).glob("**/*.pt")))
print("Loaded extractor models: " + str(len(extractors)))
else:
raise Exception(f"Model folder {EXT_MODELS_DIRT} doesn't exist.")
if os.path.isdir(ENC_MODELS_DIRT):
encoders = Enum('encoders', list((file.name, file) for file in Path(ENC_MODELS_DIRT).glob("**/*.pt")))
print("Loaded encoders models: " + str(len(encoders)))
else:
raise Exception(f"Model folder {ENC_MODELS_DIRT} doesn't exist.")
class Model(str, Enum):
VC_PPG2MEL = "ppg2mel"
class Dataset(str, Enum):
AIDATATANG_200ZH = "aidatatang_200zh"
AIDATATANG_200ZH_S = "aidatatang_200zh_s"
class Input(BaseModel):
# def render_input_ui(st, input) -> Dict:
# input["selected_dataset"] = st.selectbox(
# '选择数据集',
# ("aidatatang_200zh", "aidatatang_200zh_s")
# )
# return input
model: Model = Field(
Model.VC_PPG2MEL, title="目标模型",
)
dataset: Dataset = Field(
Dataset.AIDATATANG_200ZH, title="数据集选择",
)
datasets_root: str = Field(
..., alias="数据集根目录", description="输入数据集根目录(相对/绝对)",
format=True,
example="..\\trainning_data\\"
)
output_root: str = Field(
..., alias="输出根目录", description="输出结果根目录(相对/绝对)",
format=True,
example="..\\trainning_data\\"
)
n_processes: int = Field(
2, alias="处理线程数", description="根据CPU线程数来设置",
le=32, ge=1
)
extractor: extractors = Field(
..., alias="特征提取模型",
description="选择PPG特征提取模型文件."
)
encoder: encoders = Field(
..., alias="语音编码模型",
description="选择语音编码模型文件."
)
class AudioEntity(BaseModel):
content: bytes
mel: Any
class Output(BaseModel):
__root__: Tuple[str, int]
def render_output_ui(self, streamlit_app, input) -> None: # type: ignore
"""Custom output UI.
If this method is implemented, it will be used instead of the default Output UI renderer.
"""
dataset, count = self.__root__
streamlit_app.subheader(f"Dataset {dataset} preprocessing finished: processed {count} files in total")
def preprocess(input: Input) -> Output:
"""Preprocess(预处理)"""
finished = 0
if input.model == Model.VC_PPG2MEL:
from models.ppg2mel.preprocess import preprocess_dataset
finished = preprocess_dataset(
datasets_root=Path(input.datasets_root),
dataset=input.dataset,
out_dir=Path(input.output_root),
n_processes=input.n_processes,
ppg_encoder_model_fpath=Path(input.extractor.value),
speaker_encoder_model=Path(input.encoder.value)
)
# TODO: pass useful return code
return Output(__root__=(input.dataset, finished))
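The checkpoint dropdowns above rely on one small pattern: an Enum is built at import time from the *.pt files found under data/ckpt/..., and that Enum is then used directly as the type of a pydantic field, so the generated form shows one option per checkpoint and field.value is the Path handed to the loader. A hedged, stripped-down sketch of the same idea (directory and field names are illustrative):

import os
from enum import Enum
from pathlib import Path

from pydantic import BaseModel, Field

CKPT_DIR = f"data{os.sep}ckpt{os.sep}encoder"  # same layout as the constants above

# One Enum member per checkpoint file; the member value is the Path itself
encoders = Enum("encoders", [(f.name, f) for f in Path(CKPT_DIR).glob("**/*.pt")])


class Input(BaseModel):
    encoder: encoders = Field(..., description="Speaker encoder checkpoint")

# Later: Path(selected_input.encoder.value) is what gets passed to the model loader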


106
control/mkgui/train.py Normal file

@@ -0,0 +1,106 @@
from pydantic import BaseModel, Field
import os
from pathlib import Path
from enum import Enum
from typing import Any
from models.synthesizer.hparams import hparams
from models.synthesizer.train import train as synt_train
# Constants
SYN_MODELS_DIRT = f"data{os.sep}ckpt{os.sep}synthesizer"
ENC_MODELS_DIRT = f"data{os.sep}ckpt{os.sep}encoder"
# EXT_MODELS_DIRT = f"data{os.sep}ckpt{os.sep}ppg_extractor"
# CONV_MODELS_DIRT = f"data{os.sep}ckpt{os.sep}ppg2mel"
# ENC_MODELS_DIRT = f"data{os.sep}ckpt{os.sep}encoder"
# Pre-Load models
if os.path.isdir(SYN_MODELS_DIRT):
synthesizers = Enum('synthesizers', list((file.name, file) for file in Path(SYN_MODELS_DIRT).glob("**/*.pt")))
print("Loaded synthesizer models: " + str(len(synthesizers)))
else:
raise Exception(f"Model folder {SYN_MODELS_DIRT} doesn't exist.")
if os.path.isdir(ENC_MODELS_DIRT):
encoders = Enum('encoders', list((file.name, file) for file in Path(ENC_MODELS_DIRT).glob("**/*.pt")))
print("Loaded encoders models: " + str(len(encoders)))
else:
raise Exception(f"Model folder {ENC_MODELS_DIRT} doesn't exist.")
class Model(str, Enum):
DEFAULT = "default"
class Input(BaseModel):
model: Model = Field(
Model.DEFAULT, title="模型类型",
)
# datasets_root: str = Field(
# ..., alias="预处理数据根目录", description="输入目录(相对/绝对),不适用于ppg2mel模型",
# format=True,
# example="..\\trainning_data\\"
# )
input_root: str = Field(
..., alias="输入目录", description="预处理数据根目录",
format=True,
example=f"..{os.sep}audiodata{os.sep}SV2TTS{os.sep}synthesizer"
)
run_id: str = Field(
"", alias="新模型名/运行ID", description="使用新ID进行重新训练否则选择下面的模型进行继续训练",
)
synthesizer: synthesizers = Field(
..., alias="已有合成模型",
description="选择语音合成模型文件."
)
gpu: bool = Field(
True, alias="GPU训练", description="选择“是”则使用GPU训练",
)
verbose: bool = Field(
True, alias="打印详情", description="选择“是”,输出更多详情",
)
encoder: encoders = Field(
..., alias="语音编码模型",
description="选择语音编码模型文件."
)
save_every: int = Field(
1000, alias="更新间隔", description="每隔n步则更新一次模型",
)
backup_every: int = Field(
10000, alias="保存间隔", description="每隔n步则保存一次模型",
)
log_every: int = Field(
500, alias="打印间隔", description="每隔n步则打印一次训练统计",
)
class AudioEntity(BaseModel):
content: bytes
mel: Any
class Output(BaseModel):
__root__: int
def render_output_ui(self, streamlit_app) -> None: # type: ignore
"""Custom output UI.
If this method is implemented, it will be used instead of the default Output UI renderer.
"""
streamlit_app.subheader(f"Training started with code: {self.__root__}")
def train(input: Input) -> Output:
"""Train(训练)"""
print(">>> Start training ...")
force_restart = len(input.run_id) > 0
if not force_restart:
input.run_id = Path(input.synthesizer.value).name.split('.')[0]
synt_train(
input.run_id,
input.input_root,
f"data{os.sep}ckpt{os.sep}synthesizer",
input.save_every,
input.backup_every,
input.log_every,
force_restart,
hparams
)
return Output(__root__=0)
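Note how the run id doubles as the restart switch: a non-empty 新模型名/运行ID means a fresh run (force_restart=True), while an empty one continues training and takes its run id from the selected synthesizer checkpoint's filename. A one-line illustration of that derivation (the checkpoint name is only an example):

from pathlib import Path

ckpt = Path("data/ckpt/synthesizer/mandarin.pt")  # illustrative checkpoint
run_id = ckpt.name.split(".")[0]                  # -> "mandarin", reused as the run id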

155
control/mkgui/train_vc.py Normal file

@@ -0,0 +1,155 @@
from pydantic import BaseModel, Field
import os
from pathlib import Path
from enum import Enum
from typing import Any, Tuple
import numpy as np
from utils.hparams import HpsYaml
from utils.util import AttrDict
import torch
# Constants
EXT_MODELS_DIRT = f"data{os.sep}ckpt{os.sep}ppg_extractor"
CONV_MODELS_DIRT = f"data{os.sep}ckpt{os.sep}ppg2mel"
ENC_MODELS_DIRT = f"data{os.sep}ckpt{os.sep}encoder"
if os.path.isdir(EXT_MODELS_DIRT):
extractors = Enum('extractors', list((file.name, file) for file in Path(EXT_MODELS_DIRT).glob("**/*.pt")))
print("Loaded extractor models: " + str(len(extractors)))
else:
raise Exception(f"Model folder {EXT_MODELS_DIRT} doesn't exist.")
if os.path.isdir(CONV_MODELS_DIRT):
convertors = Enum('convertors', list((file.name, file) for file in Path(CONV_MODELS_DIRT).glob("**/*.pth")))
print("Loaded convertor models: " + str(len(convertors)))
else:
raise Exception(f"Model folder {CONV_MODELS_DIRT} doesn't exist.")
if os.path.isdir(ENC_MODELS_DIRT):
encoders = Enum('encoders', list((file.name, file) for file in Path(ENC_MODELS_DIRT).glob("**/*.pt")))
print("Loaded encoders models: " + str(len(encoders)))
else:
raise Exception(f"Model folder {ENC_MODELS_DIRT} doesn't exist.")
class Model(str, Enum):
VC_PPG2MEL = "ppg2mel"
class Dataset(str, Enum):
AIDATATANG_200ZH = "aidatatang_200zh"
AIDATATANG_200ZH_S = "aidatatang_200zh_s"
class Input(BaseModel):
# def render_input_ui(st, input) -> Dict:
# input["selected_dataset"] = st.selectbox(
# '选择数据集',
# ("aidatatang_200zh", "aidatatang_200zh_s")
# )
# return input
model: Model = Field(
Model.VC_PPG2MEL, title="模型类型",
)
# datasets_root: str = Field(
# ..., alias="预处理数据根目录", description="输入目录(相对/绝对),不适用于ppg2mel模型",
# format=True,
# example="..\\trainning_data\\"
# )
output_root: str = Field(
..., alias="输出目录(可选)", description="建议不填,保持默认",
format=True,
example=""
)
continue_mode: bool = Field(
True, alias="继续训练模式", description="选择“是”,则从下面选择的模型中继续训练",
)
gpu: bool = Field(
True, alias="GPU训练", description="选择“是”则使用GPU训练",
)
verbose: bool = Field(
True, alias="打印详情", description="选择“是”,输出更多详情",
)
# TODO: Move to hidden fields by default
convertor: convertors = Field(
..., alias="转换模型",
description="选择语音转换模型文件."
)
extractor: extractors = Field(
..., alias="特征提取模型",
description="选择PPG特征提取模型文件."
)
encoder: encoders = Field(
..., alias="语音编码模型",
description="选择语音编码模型文件."
)
njobs: int = Field(
8, alias="进程数", description="适用于ppg2mel",
)
seed: int = Field(
default=0, alias="初始随机数", description="适用于ppg2mel",
)
model_name: str = Field(
..., alias="新模型名", description="仅在重新训练时生效,选中继续训练时无效",
example="test"
)
model_config: str = Field(
..., alias="新模型配置", description="仅在重新训练时生效,选中继续训练时无效",
example=".\\ppg2mel\\saved_models\\seq2seq_mol_ppg2mel_vctk_libri_oneshotvc_r4_normMel_v2"
)
class AudioEntity(BaseModel):
content: bytes
mel: Any
class Output(BaseModel):
__root__: Tuple[str, int]
def render_output_ui(self, streamlit_app, input) -> None: # type: ignore
"""Custom output UI.
If this method is implemented, it will be used instead of the default Output UI renderer.
"""
name, code = self.__root__
streamlit_app.subheader(f"Training of {name} finished with code: {code}")
def train_vc(input: Input) -> Output:
"""Train VC(训练 VC)"""
print(">>> OneShot VC training ...")
params = AttrDict()
params.update({
"gpu": input.gpu,
"cpu": not input.gpu,
"njobs": input.njobs,
"seed": input.seed,
"verbose": input.verbose,
"load": input.convertor.value,
"warm_start": False,
})
if input.continue_mode:
# trace old model and config
p = Path(input.convertor.value)
params.name = p.parent.name
# search a config file
model_config_fpaths = list(p.parent.rglob("*.yaml"))
if len(model_config_fpaths) == 0:
raise Exception("No model yaml config found for convertor")
config = HpsYaml(model_config_fpaths[0])
params.ckpdir = p.parent.parent
params.config = model_config_fpaths[0]
params.logdir = os.path.join(p.parent, "log")
else:
# Make the config dict dot visitable
config = HpsYaml(input.model_config)
np.random.seed(input.seed)
torch.manual_seed(input.seed)
if torch.cuda.is_available():
torch.cuda.manual_seed_all(input.seed)
mode = "train"
from models.ppg2mel.train.train_linglf02mel_seq2seq_oneshotvc import Solver
solver = Solver(config, params, mode)
solver.load_data()
solver.set_model()
solver.exec()
print(">>> Oneshot VC train finished!")
# TODO: pass useful return code
return Output(__root__=(input.dataset, 0))
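In continue mode everything is recovered from where the selected convertor checkpoint lives: the run name is its parent directory, the YAML config is searched next to it, checkpoints stay under the grandparent directory and logs go to <parent>/log. A small path-resolution sketch of that convention (the checkpoint path is illustrative):

import os
from pathlib import Path

ckpt = Path("data/ckpt/ppg2mel/my_vc_run/best_loss_step_50000.pth")  # illustrative

name = ckpt.parent.name                                # "my_vc_run"
config_candidates = list(ckpt.parent.rglob("*.yaml"))  # the first match is used as the config
ckpdir = ckpt.parent.parent                            # data/ckpt/ppg2mel
logdir = os.path.join(ckpt.parent, "log")              # data/ckpt/ppg2mel/my_vc_run/log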


@@ -1,18 +1,17 @@
from toolbox.ui import UI
from encoder import inference as encoder
from synthesizer.inference import Synthesizer
from vocoder.wavernn import inference as rnn_vocoder
from vocoder.hifigan import inference as gan_vocoder
from control.toolbox.ui import UI
from models.encoder import inference as encoder
from models.synthesizer.inference import Synthesizer
from models.vocoder.wavernn import inference as rnn_vocoder
from models.vocoder.hifigan import inference as gan_vocoder
from models.vocoder.fregan import inference as fgan_vocoder
from pathlib import Path
from time import perf_counter as timer
from toolbox.utterance import Utterance
from control.toolbox.utterance import Utterance
import numpy as np
import traceback
import sys
import torch
import librosa
import re
from audioread.exceptions import NoBackendError
# Use WaveRNN as the default vocoder
vocoder = rnn_vocoder
@@ -39,8 +38,8 @@ recognized_datasets = [
"VoxCeleb2/dev/aac",
"VoxCeleb2/test/aac",
"VCTK-Corpus/wav48",
"aidatatang_200zh/corpus/dev",
"aidatatang_200zh/corpus/test",
"aidatatang_200zh/corpus/train",
"aishell3/test/wav",
"magicdata/train",
]
@@ -49,14 +48,20 @@ recognized_datasets = [
MAX_WAVES = 15
class Toolbox:
def __init__(self, datasets_root, enc_models_dir, syn_models_dir, voc_models_dir, seed, no_mp3_support):
def __init__(self, datasets_root, enc_models_dir, syn_models_dir, voc_models_dir, extractor_models_dir, convertor_models_dir, seed, no_mp3_support, vc_mode):
self.no_mp3_support = no_mp3_support
self.vc_mode = vc_mode
sys.excepthook = self.excepthook
self.datasets_root = datasets_root
self.utterances = set()
self.current_generated = (None, None, None, None) # speaker_name, spec, breaks, wav
self.synthesizer = None # type: Synthesizer
# for ppg-based voice conversion
self.extractor = None
self.convertor = None # ppg2mel
self.current_wav = None
self.waves_list = []
self.waves_count = 0
@@ -70,9 +75,9 @@ class Toolbox:
self.trim_silences = False
# Initialize the events and the interface
self.ui = UI()
self.ui = UI(vc_mode)
self.style_idx = 0
self.reset_ui(enc_models_dir, syn_models_dir, voc_models_dir, seed)
self.reset_ui(enc_models_dir, syn_models_dir, voc_models_dir, extractor_models_dir, convertor_models_dir, seed)
self.setup_events()
self.ui.start()
@@ -96,7 +101,11 @@ class Toolbox:
self.ui.encoder_box.currentIndexChanged.connect(self.init_encoder)
def func():
self.synthesizer = None
self.ui.synthesizer_box.currentIndexChanged.connect(func)
if self.vc_mode:
self.ui.extractor_box.currentIndexChanged.connect(self.init_extractor)
else:
self.ui.synthesizer_box.currentIndexChanged.connect(func)
self.ui.vocoder_box.currentIndexChanged.connect(self.init_vocoder)
# Utterance selection
@@ -109,6 +118,11 @@ class Toolbox:
self.ui.stop_button.clicked.connect(self.ui.stop)
self.ui.record_button.clicked.connect(self.record)
# Source Utterance selection
if self.vc_mode:
func = lambda: self.load_soruce_button(self.ui.selected_utterance)
self.ui.load_soruce_button.clicked.connect(func)
#Audio
self.ui.setup_audio_devices(Synthesizer.sample_rate)
@@ -120,12 +134,17 @@ class Toolbox:
self.ui.waves_cb.currentIndexChanged.connect(self.set_current_wav)
# Generation
func = lambda: self.synthesize() or self.vocode()
self.ui.generate_button.clicked.connect(func)
self.ui.synthesize_button.clicked.connect(self.synthesize)
self.ui.vocode_button.clicked.connect(self.vocode)
self.ui.random_seed_checkbox.clicked.connect(self.update_seed_textbox)
if self.vc_mode:
func = lambda: self.convert() or self.vocode()
self.ui.convert_button.clicked.connect(func)
else:
func = lambda: self.synthesize() or self.vocode()
self.ui.generate_button.clicked.connect(func)
self.ui.synthesize_button.clicked.connect(self.synthesize)
# UMAP legend
self.ui.clear_button.clicked.connect(self.clear_utterances)
@@ -138,9 +157,9 @@ class Toolbox:
def replay_last_wav(self):
self.ui.play(self.current_wav, Synthesizer.sample_rate)
def reset_ui(self, encoder_models_dir, synthesizer_models_dir, vocoder_models_dir, seed):
def reset_ui(self, encoder_models_dir, synthesizer_models_dir, vocoder_models_dir, extractor_models_dir, convertor_models_dir, seed):
self.ui.populate_browser(self.datasets_root, recognized_datasets, 0, True)
self.ui.populate_models(encoder_models_dir, synthesizer_models_dir, vocoder_models_dir)
self.ui.populate_models(encoder_models_dir, synthesizer_models_dir, vocoder_models_dir, extractor_models_dir, convertor_models_dir, self.vc_mode)
self.ui.populate_gen_options(seed, self.trim_silences)
def load_from_browser(self, fpath=None):
@@ -171,7 +190,10 @@ class Toolbox:
self.ui.log("Loaded %s" % name)
self.add_real_utterance(wav, name, speaker_name)
def load_soruce_button(self, utterance: Utterance):
self.selected_source_utterance = utterance
def record(self):
wav = self.ui.record_one(encoder.sampling_rate, 5)
if wav is None:
@@ -196,7 +218,7 @@ class Toolbox:
# Add the utterance
utterance = Utterance(name, speaker_name, wav, spec, embed, partial_embeds, False)
self.utterances.add(utterance)
self.ui.register_utterance(utterance)
self.ui.register_utterance(utterance, self.vc_mode)
# Plot it
self.ui.draw_embed(embed, name, "current")
@@ -235,7 +257,7 @@ class Toolbox:
embed = self.ui.selected_utterance.embed
embeds = [embed] * len(texts)
min_token = int(self.ui.token_slider.value())
specs = self.synthesizer.synthesize_spectrograms(texts, embeds, style_idx=int(self.ui.style_slider.value()), min_stop_token=min_token)
specs = self.synthesizer.synthesize_spectrograms(texts, embeds, style_idx=int(self.ui.style_slider.value()), min_stop_token=min_token, steps=int(self.ui.length_slider.value())*200)
breaks = [spec.shape[1] for spec in specs]
spec = np.concatenate(specs, axis=1)
@@ -269,7 +291,7 @@ class Toolbox:
self.ui.set_loading(i, seq_len)
if self.ui.current_vocoder_fpath is not None:
self.ui.log("")
wav = vocoder.infer_waveform(spec, progress_callback=vocoder_progress)
wav, sample_rate = vocoder.infer_waveform(spec, progress_callback=vocoder_progress)
else:
self.ui.log("Waveform generation with Griffin-Lim... ")
wav = Synthesizer.griffin_lim(spec)
@@ -280,7 +302,7 @@ class Toolbox:
b_ends = np.cumsum(np.array(breaks) * Synthesizer.hparams.hop_size)
b_starts = np.concatenate(([0], b_ends[:-1]))
wavs = [wav[start:end] for start, end, in zip(b_starts, b_ends)]
breaks = [np.zeros(int(0.15 * Synthesizer.sample_rate))] * len(breaks)
breaks = [np.zeros(int(0.15 * sample_rate))] * len(breaks)
wav = np.concatenate([i for w, b in zip(wavs, breaks) for i in (w, b)])
# Trim excessive silences
@@ -289,7 +311,7 @@ class Toolbox:
# Play it
wav = wav / np.abs(wav).max() * 0.97
self.ui.play(wav, Synthesizer.sample_rate)
self.ui.play(wav, sample_rate)
# Name it (history displayed in combobox)
# TODO better naming for the combobox items?
@@ -331,6 +353,67 @@ class Toolbox:
self.ui.draw_embed(embed, name, "generated")
self.ui.draw_umap_projections(self.utterances)
def convert(self):
self.ui.log("Extract PPG and Converting...")
self.ui.set_loading(1)
# Init
if self.convertor is None:
self.init_convertor()
if self.extractor is None:
self.init_extractor()
src_wav = self.selected_source_utterance.wav
# Compute the ppg
if not self.extractor is None:
ppg = self.extractor.extract_from_wav(src_wav)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
ref_wav = self.ui.selected_utterance.wav
# Import necessary dependency of Voice Conversion
from utils.f0_utils import compute_f0, f02lf0, compute_mean_std, get_converted_lf0uv
ref_lf0_mean, ref_lf0_std = compute_mean_std(f02lf0(compute_f0(ref_wav)))
lf0_uv = get_converted_lf0uv(src_wav, ref_lf0_mean, ref_lf0_std, convert=True)
min_len = min(ppg.shape[1], len(lf0_uv))
ppg = ppg[:, :min_len]
lf0_uv = lf0_uv[:min_len]
_, mel_pred, att_ws = self.convertor.inference(
ppg,
logf0_uv=torch.from_numpy(lf0_uv).unsqueeze(0).float().to(device),
spembs=torch.from_numpy(self.ui.selected_utterance.embed).unsqueeze(0).to(device),
)
mel_pred= mel_pred.transpose(0, 1)
breaks = [mel_pred.shape[1]]
mel_pred= mel_pred.detach().cpu().numpy()
self.ui.draw_spec(mel_pred, "generated")
self.current_generated = (self.ui.selected_utterance.speaker_name, mel_pred, breaks, None)
self.ui.set_loading(0)
def init_extractor(self):
if self.ui.current_extractor_fpath is None:
return
model_fpath = self.ui.current_extractor_fpath
self.ui.log("Loading the extractor %s... " % model_fpath)
self.ui.set_loading(1)
start = timer()
import models.ppg_extractor as extractor
self.extractor = extractor.load_model(model_fpath)
self.ui.log("Done (%dms)." % int(1000 * (timer() - start)), "append")
self.ui.set_loading(0)
def init_convertor(self):
if self.ui.current_convertor_fpath is None:
return
model_fpath = self.ui.current_convertor_fpath
self.ui.log("Loading the convertor %s... " % model_fpath)
self.ui.set_loading(1)
start = timer()
import models.ppg2mel as convertor
self.convertor = convertor.load_model( model_fpath)
self.ui.log("Done (%dms)." % int(1000 * (timer() - start)), "append")
self.ui.set_loading(0)
def init_encoder(self):
model_fpath = self.ui.current_encoder_fpath
@@ -358,12 +441,26 @@ class Toolbox:
# Case of Griffin-lim
if model_fpath is None:
return
# Select vocoder based on model name
if model_fpath.name[0] == "g":
model_config_fpath = None
if model_fpath.name is not None and model_fpath.name.find("hifigan") > -1:
vocoder = gan_vocoder
self.ui.log("set hifigan as vocoder")
# search a config file
model_config_fpaths = list(model_fpath.parent.rglob("*.json"))
if self.vc_mode and self.ui.current_extractor_fpath is None:
return
if len(model_config_fpaths) > 0:
model_config_fpath = model_config_fpaths[0]
elif model_fpath.name is not None and model_fpath.name.find("fregan") > -1:
vocoder = fgan_vocoder
self.ui.log("set fregan as vocoder")
# search a config file
model_config_fpaths = list(model_fpath.parent.rglob("*.json"))
if self.vc_mode and self.ui.current_extractor_fpath is None:
return
if len(model_config_fpaths) > 0:
model_config_fpath = model_config_fpaths[0]
else:
vocoder = rnn_vocoder
self.ui.log("set wavernn as vocoder")
@@ -371,7 +468,7 @@ class Toolbox:
self.ui.log("Loading the vocoder %s... " % model_fpath)
self.ui.set_loading(1)
start = timer()
vocoder.load_model(model_fpath)
vocoder.load_model(model_fpath, model_config_fpath)
self.ui.log("Done (%dms)." % int(1000 * (timer() - start)), "append")
self.ui.set_loading(0)
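The vocoder is now picked from the checkpoint filename ('hifigan' selects HiFi-GAN, 'fregan' selects Fre-GAN, anything else falls back to WaveRNN), with the matching *.json config searched beside the model file. The same selection, factored out as a hedged standalone helper that only reuses the imports already at the top of this file:

from pathlib import Path

from models.vocoder.wavernn import inference as rnn_vocoder
from models.vocoder.hifigan import inference as gan_vocoder
from models.vocoder.fregan import inference as fgan_vocoder


def select_vocoder(model_fpath: Path):
    """Return (vocoder_module, config_path) based on the checkpoint filename."""
    name = model_fpath.name
    if "hifigan" in name:
        vocoder = gan_vocoder
    elif "fregan" in name:
        vocoder = fgan_vocoder
    else:
        return rnn_vocoder, None  # WaveRNN is loaded without a JSON config here
    configs = list(model_fpath.parent.rglob("*.json"))
    return vocoder, (configs[0] if configs else None)

# ckpt = Path("data/ckpt/vocoder/pretrained/g_hifigan.pt")   # illustrative
# vocoder, cfg = select_vocoder(ckpt)
# vocoder.load_model(ckpt, cfg)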



@@ -1,11 +1,10 @@
import matplotlib.pyplot as plt
from matplotlib.backends.backend_qt5agg import FigureCanvasQTAgg as FigureCanvas
from matplotlib.figure import Figure
from PyQt5.QtCore import Qt, QStringListModel
from PyQt5 import QtGui
from PyQt5.QtWidgets import *
from encoder.inference import plot_embedding_as_heatmap
from toolbox.utterance import Utterance
import matplotlib.pyplot as plt
from matplotlib.backends.backend_qt5agg import FigureCanvasQTAgg as FigureCanvas
from models.encoder.inference import plot_embedding_as_heatmap
from control.toolbox.utterance import Utterance
from pathlib import Path
from typing import List, Set
import sounddevice as sd
@@ -34,7 +33,7 @@ colormap = np.array([
[0, 0, 0],
[183, 183, 183],
[76, 255, 0],
], dtype=np.float) / 255
], dtype=float) / 255
default_text = \
"欢迎使用工具箱, 现已支持中文输入!"
@@ -274,7 +273,9 @@ class UI(QDialog):
if datasets_root is None or len(datasets) == 0:
msg = "Warning: you d" + ("id not pass a root directory for datasets as argument" \
if datasets_root is None else "o not have any of the recognized datasets" \
" in %s" % datasets_root)
" in %s \n" \
"Please note use 'E:\datasets' as root path " \
"instead of 'E:\datasets\aidatatang_200zh\corpus\test' as an example " % datasets_root)
self.log(msg)
msg += ".\nThe recognized datasets are:\n\t%s\nFeel free to add your own. You " \
"can still use the toolbox by recording samples yourself." % \
@@ -326,30 +327,51 @@ class UI(QDialog):
def current_vocoder_fpath(self):
return self.vocoder_box.itemData(self.vocoder_box.currentIndex())
@property
def current_extractor_fpath(self):
return self.extractor_box.itemData(self.extractor_box.currentIndex())
@property
def current_convertor_fpath(self):
return self.convertor_box.itemData(self.convertor_box.currentIndex())
def populate_models(self, encoder_models_dir: Path, synthesizer_models_dir: Path,
vocoder_models_dir: Path):
vocoder_models_dir: Path, extractor_models_dir: Path, convertor_models_dir: Path, vc_mode: bool):
# Encoder
encoder_fpaths = list(encoder_models_dir.glob("*.pt"))
if len(encoder_fpaths) == 0:
raise Exception("No encoder models found in %s" % encoder_models_dir)
self.repopulate_box(self.encoder_box, [(f.stem, f) for f in encoder_fpaths])
# Synthesizer
synthesizer_fpaths = list(synthesizer_models_dir.glob("**/*.pt"))
if len(synthesizer_fpaths) == 0:
raise Exception("No synthesizer models found in %s" % synthesizer_models_dir)
self.repopulate_box(self.synthesizer_box, [(f.stem, f) for f in synthesizer_fpaths])
if vc_mode:
# Extractor
extractor_fpaths = list(extractor_models_dir.glob("*.pt"))
if len(extractor_fpaths) == 0:
self.log("No extractor models found in %s" % extractor_fpaths)
self.repopulate_box(self.extractor_box, [(f.stem, f) for f in extractor_fpaths])
# Convertor
convertor_fpaths = list(convertor_models_dir.glob("*.pth"))
if len(convertor_fpaths) == 0:
self.log("No convertor models found in %s" % convertor_fpaths)
self.repopulate_box(self.convertor_box, [(f.stem, f) for f in convertor_fpaths])
else:
# Synthesizer
synthesizer_fpaths = list(synthesizer_models_dir.glob("**/*.pt"))
if len(synthesizer_fpaths) == 0:
raise Exception("No synthesizer models found in %s" % synthesizer_models_dir)
self.repopulate_box(self.synthesizer_box, [(f.stem, f) for f in synthesizer_fpaths])
# Vocoder
vocoder_fpaths = list(vocoder_models_dir.glob("**/*.pt"))
vocoder_items = [(f.stem, f) for f in vocoder_fpaths] + [("Griffin-Lim", None)]
self.repopulate_box(self.vocoder_box, vocoder_items)
@property
def selected_utterance(self):
return self.utterance_history.itemData(self.utterance_history.currentIndex())
def register_utterance(self, utterance: Utterance):
def register_utterance(self, utterance: Utterance, vc_mode):
self.utterance_history.blockSignals(True)
self.utterance_history.insertItem(0, utterance.name, utterance)
self.utterance_history.setCurrentIndex(0)
@@ -359,8 +381,11 @@ class UI(QDialog):
self.utterance_history.removeItem(self.max_saved_utterances)
self.play_button.setDisabled(False)
self.generate_button.setDisabled(False)
self.synthesize_button.setDisabled(False)
if vc_mode:
self.convert_button.setDisabled(False)
else:
self.generate_button.setDisabled(False)
self.synthesize_button.setDisabled(False)
def log(self, line, mode="newline"):
if mode == "newline":
@@ -377,8 +402,8 @@ class UI(QDialog):
self.app.processEvents()
def set_loading(self, value, maximum=1):
self.loading_bar.setValue(value * 100)
self.loading_bar.setMaximum(maximum * 100)
self.loading_bar.setValue(int(value * 100))
self.loading_bar.setMaximum(int(maximum * 100))
self.loading_bar.setTextVisible(value != 0)
self.app.processEvents()
@@ -402,7 +427,7 @@ class UI(QDialog):
else:
self.seed_textbox.setEnabled(False)
def reset_interface(self):
def reset_interface(self, vc_mode):
self.draw_embed(None, None, "current")
self.draw_embed(None, None, "generated")
self.draw_spec(None, "current")
@@ -410,14 +435,17 @@ class UI(QDialog):
self.draw_umap_projections(set())
self.set_loading(0)
self.play_button.setDisabled(True)
self.generate_button.setDisabled(True)
self.synthesize_button.setDisabled(True)
if vc_mode:
self.convert_button.setDisabled(True)
else:
self.generate_button.setDisabled(True)
self.synthesize_button.setDisabled(True)
self.vocode_button.setDisabled(True)
self.replay_wav_button.setDisabled(True)
self.export_wav_button.setDisabled(True)
[self.log("") for _ in range(self.max_log_lines)]
def __init__(self):
def __init__(self, vc_mode):
## Initialize the application
self.app = QApplication(sys.argv)
super().__init__(None)
@@ -469,7 +497,7 @@ class UI(QDialog):
source_groupbox = QGroupBox('Source(源音频)')
source_layout = QGridLayout()
source_groupbox.setLayout(source_layout)
browser_layout.addWidget(source_groupbox, i, 0, 1, 4)
browser_layout.addWidget(source_groupbox, i, 0, 1, 5)
self.dataset_box = QComboBox()
source_layout.addWidget(QLabel("Dataset(数据集):"), i, 0)
@@ -510,25 +538,35 @@ class UI(QDialog):
browser_layout.addWidget(self.play_button, i, 2)
self.stop_button = QPushButton("Stop(暂停)")
browser_layout.addWidget(self.stop_button, i, 3)
if vc_mode:
self.load_soruce_button = QPushButton("Select(选择为被转换的语音输入)")
browser_layout.addWidget(self.load_soruce_button, i, 4)
i += 1
model_groupbox = QGroupBox('Models(模型选择)')
model_layout = QHBoxLayout()
model_groupbox.setLayout(model_layout)
browser_layout.addWidget(model_groupbox, i, 0, 1, 4)
browser_layout.addWidget(model_groupbox, i, 0, 2, 5)
# Model and audio output selection
self.encoder_box = QComboBox()
model_layout.addWidget(QLabel("Encoder:"))
model_layout.addWidget(self.encoder_box)
self.synthesizer_box = QComboBox()
model_layout.addWidget(QLabel("Synthesizer:"))
model_layout.addWidget(self.synthesizer_box)
if vc_mode:
self.extractor_box = QComboBox()
model_layout.addWidget(QLabel("Extractor:"))
model_layout.addWidget(self.extractor_box)
self.convertor_box = QComboBox()
model_layout.addWidget(QLabel("Convertor:"))
model_layout.addWidget(self.convertor_box)
else:
model_layout.addWidget(QLabel("Synthesizer:"))
model_layout.addWidget(self.synthesizer_box)
self.vocoder_box = QComboBox()
model_layout.addWidget(QLabel("Vocoder:"))
model_layout.addWidget(self.vocoder_box)
#Replay & Save Audio
i = 0
output_layout.addWidget(QLabel("<b>Toolbox Output:</b>"), i, 0)
@@ -550,7 +588,7 @@ class UI(QDialog):
## Embed & spectrograms
vis_layout.addStretch()
# TODO: add spectrograms for source
gridspec_kw = {"width_ratios": [1, 4]}
fig, self.current_ax = plt.subplots(1, 2, figsize=(10, 2.25), facecolor="#F0F0F0",
gridspec_kw=gridspec_kw)
@@ -571,16 +609,23 @@ class UI(QDialog):
self.text_prompt = QPlainTextEdit(default_text)
gen_layout.addWidget(self.text_prompt, stretch=1)
self.generate_button = QPushButton("Synthesize and vocode")
gen_layout.addWidget(self.generate_button)
layout = QHBoxLayout()
self.synthesize_button = QPushButton("Synthesize only")
layout.addWidget(self.synthesize_button)
if vc_mode:
layout = QHBoxLayout()
self.convert_button = QPushButton("Extract and Convert")
layout.addWidget(self.convert_button)
gen_layout.addLayout(layout)
else:
self.generate_button = QPushButton("Synthesize and vocode")
gen_layout.addWidget(self.generate_button)
layout = QHBoxLayout()
self.synthesize_button = QPushButton("Synthesize only")
layout.addWidget(self.synthesize_button)
self.vocode_button = QPushButton("Vocode only")
layout.addWidget(self.vocode_button)
gen_layout.addLayout(layout)
layout_seed = QGridLayout()
self.random_seed_checkbox = QCheckBox("Random seed:")
self.random_seed_checkbox.setToolTip("When checked, makes the synthesizer and vocoder deterministic.")
@@ -618,6 +663,19 @@ class UI(QDialog):
layout_seed.addWidget(self.token_value_label, 2, 1)
layout_seed.addWidget(self.token_slider, 2, 3)
self.length_slider = QSlider(Qt.Horizontal)
self.length_slider.setTickInterval(1)
self.length_slider.setFocusPolicy(Qt.NoFocus)
self.length_slider.setSingleStep(1)
self.length_slider.setRange(1, 10)
self.length_value_label = QLabel("2")
self.length_slider.setValue(2)
layout_seed.addWidget(QLabel("MaxLength(最大句长):"), 3, 0)
self.length_slider.valueChanged.connect(lambda s: self.length_value_label.setNum(s))
layout_seed.addWidget(self.length_value_label, 3, 1)
layout_seed.addWidget(self.length_slider, 3, 3)
gen_layout.addLayout(layout_seed)
self.loading_bar = QProgressBar()
@@ -635,7 +693,7 @@ class UI(QDialog):
self.resize(max_size)
## Finalize the display
self.reset_interface()
self.reset_interface(vc_mode)
self.show()
def start(self):


8
datasets_download/CN.txt Normal file

@@ -0,0 +1,8 @@
https://openslr.magicdatatech.com/resources/62/aidatatang_200zh.tgz
out=download/aidatatang_200zh.tgz
https://openslr.magicdatatech.com/resources/68/train_set.tar.gz
out=download/magicdata.tgz
https://openslr.magicdatatech.com/resources/93/data_aishell3.tgz
out=download/aishell3.tgz
https://openslr.magicdatatech.com/resources/33/data_aishell.tgz
out=download/data_aishell.tgz

8
datasets_download/EU.txt Normal file

@@ -0,0 +1,8 @@
https://openslr.elda.org/resources/62/aidatatang_200zh.tgz
out=download/aidatatang_200zh.tgz
https://openslr.elda.org/resources/68/train_set.tar.gz
out=download/magicdata.tgz
https://openslr.elda.org/resources/93/data_aishell3.tgz
out=download/aishell3.tgz
https://openslr.elda.org/resources/33/data_aishell.tgz
out=download/data_aishell.tgz

8
datasets_download/US.txt Normal file

@@ -0,0 +1,8 @@
https://us.openslr.org/resources/62/aidatatang_200zh.tgz
out=download/aidatatang_200zh.tgz
https://us.openslr.org/resources/68/train_set.tar.gz
out=download/magicdata.tgz
https://us.openslr.org/resources/93/data_aishell3.tgz
out=download/aishell3.tgz
https://us.openslr.org/resources/33/data_aishell.tgz
out=download/data_aishell.tgz


@@ -0,0 +1,4 @@
0c0ace77fe8ee77db8d7542d6eb0b7ddf09b1bfb880eb93a7fbdbf4611e9984b /datasets/download/aidatatang_200zh.tgz
be2507d431ad59419ec871e60674caedb2b585f84ffa01fe359784686db0e0cc /datasets/download/aishell3.tgz
a4a0313cde0a933e0e01a451f77de0a23d6c942f4694af5bb7f40b9dc38143fe /datasets/download/data_aishell.tgz
1d2647c614b74048cfe16492570cc5146d800afdc07483a43b31809772632143 /datasets/download/magicdata.tgz


@@ -0,0 +1,8 @@
https://www.openslr.org/resources/62/aidatatang_200zh.tgz
out=download/aidatatang_200zh.tgz
https://www.openslr.org/resources/68/train_set.tar.gz
out=download/magicdata.tgz
https://www.openslr.org/resources/93/data_aishell3.tgz
out=download/aishell3.tgz
https://www.openslr.org/resources/33/data_aishell.tgz
out=download/data_aishell.tgz

8
datasets_download/download.sh Executable file

@@ -0,0 +1,8 @@
#!/usr/bin/env bash
set -Eeuo pipefail
aria2c -x 10 --disable-ipv6 --input-file /workspace/datasets_download/${DATASET_MIRROR}.txt --dir /datasets --continue
echo "Verifying sha256sum..."
parallel --will-cite -a /workspace/datasets_download/datasets.sha256sum "echo -n {} | sha256sum -c"
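download.sh verifies the archives against datasets_download/datasets.sha256sum using GNU parallel and sha256sum -c. Where those tools are not available, the same check is easy to reproduce; a hedged Python equivalent, assuming the checksum file format shown above (one "<sha256> <path>" pair per line):

import hashlib
from pathlib import Path


def verify_sha256sums(sum_file: str) -> bool:
    """Minimal stand-in for `sha256sum -c` over the datasets.sha256sum file."""
    all_ok = True
    for line in Path(sum_file).read_text().splitlines():
        if not line.strip():
            continue
        expected, path = line.split(maxsplit=1)
        digest = hashlib.sha256()
        with open(path.strip(), "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        ok = digest.hexdigest() == expected
        all_ok = all_ok and ok
        print(f"{path.strip()}: {'OK' if ok else 'FAILED'}")
    return all_ok

# verify_sha256sums("/workspace/datasets_download/datasets.sha256sum")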

29
datasets_download/extract.sh Executable file

@@ -0,0 +1,29 @@
#!/usr/bin/env bash
set -Eeuo pipefail
mkdir -p /datasets/aidatatang_200zh
if [ -z "$(ls -A /datasets/aidatatang_200zh)" ] ; then
tar xvz --directory /datasets/ -f /datasets/download/aidatatang_200zh.tgz --exclude 'aidatatang_200zh/corpus/dev/*' --exclude 'aidatatang_200zh/corpus/test/*'
cd /datasets/aidatatang_200zh/corpus/train/
cat *.tar.gz | tar zxvf - -i
rm -f *.tar.gz
fi
mkdir -p /datasets/magicdata
if [ -z "$(ls -A /datasets/magicdata)" ] ; then
tar xvz --directory /datasets/magicdata -f /datasets/download/magicdata.tgz train/
fi
mkdir -p /datasets/aishell3
if [ -z "$(ls -A /datasets/aishell3)" ] ; then
tar xvz --directory /datasets/aishell3 -f /datasets/download/aishell3.tgz train/
fi
mkdir -p /datasets/data_aishell
if [ -z "$(ls -A /datasets/data_aishell)" ] ; then
tar xvz --directory /datasets/ -f /datasets/download/data_aishell.tgz
cd /datasets/data_aishell/wav/
cat *.tar.gz | tar zxvf - -i --exclude 'dev/*' --exclude 'test/*'
rm -f *.tar.gz
fi


@@ -1,5 +1,5 @@
from pathlib import Path
from toolbox import Toolbox
from control.toolbox import Toolbox
from utils.argutils import print_args
from utils.modelutils import check_model_paths
import argparse
@@ -15,12 +15,18 @@ if __name__ == '__main__':
parser.add_argument("-d", "--datasets_root", type=Path, help= \
"Path to the directory containing your datasets. See toolbox/__init__.py for a list of "
"supported datasets.", default=None)
parser.add_argument("-e", "--enc_models_dir", type=Path, default="encoder/saved_models",
parser.add_argument("-vc", "--vc_mode", action="store_true",
help="Voice Conversion Mode(PPG based)")
parser.add_argument("-e", "--enc_models_dir", type=Path, default=f"data{os.sep}ckpt{os.sep}encoder",
help="Directory containing saved encoder models")
parser.add_argument("-s", "--syn_models_dir", type=Path, default="synthesizer/saved_models",
parser.add_argument("-s", "--syn_models_dir", type=Path, default=f"data{os.sep}ckpt{os.sep}synthesizer",
help="Directory containing saved synthesizer models")
parser.add_argument("-v", "--voc_models_dir", type=Path, default="vocoder/saved_models",
parser.add_argument("-v", "--voc_models_dir", type=Path, default=f"data{os.sep}ckpt{os.sep}vocoder",
help="Directory containing saved vocoder models")
parser.add_argument("-ex", "--extractor_models_dir", type=Path, default=f"data{os.sep}ckpt{os.sep}ppg_extractor",
help="Directory containing saved extrator models")
parser.add_argument("-cv", "--convertor_models_dir", type=Path, default=f"data{os.sep}ckpt{os.sep}ppg2mel",
help="Directory containing saved convert models")
parser.add_argument("--cpu", action="store_true", help=\
"If True, processing is done on CPU, even when a GPU is available.")
parser.add_argument("--seed", type=int, default=None, help=\

23
docker-compose.yml Normal file

@@ -0,0 +1,23 @@
version: '3.8'
services:
server:
image: mockingbird:latest
build: .
volumes:
- ./datasets:/datasets
- ./synthesizer/saved_models:/workspace/synthesizer/saved_models
environment:
- DATASET_MIRROR=US
- FORCE_RETRAIN=false
- TRAIN_DATASETS=aidatatang_200zh magicdata aishell3 data_aishell
- TRAIN_SKIP_EXISTING=true
ports:
- 8080:8080
deploy:
resources:
reservations:
devices:
- driver: nvidia
device_ids: [ '0' ]
capabilities: [ gpu ]

17
docker-entrypoint.sh Executable file

@@ -0,0 +1,17 @@
#!/usr/bin/env bash
if [ -z "$(ls -A /workspace/synthesizer/saved_models)" ] || [ "$FORCE_RETRAIN" = true ] ; then
/workspace/datasets_download/download.sh
/workspace/datasets_download/extract.sh
for DATASET in ${TRAIN_DATASETS}
do
if [ "$TRAIN_SKIP_EXISTING" = true ] ; then
python pre.py /datasets -d ${DATASET} -n $(nproc) --skip_existing
else
python pre.py /datasets -d ${DATASET} -n $(nproc)
fi
done
python synthesizer_train.py mandarin /datasets/SV2TTS/synthesizer
fi
python web.py


@@ -1,2 +0,0 @@
from encoder.data_objects.speaker_verification_dataset import SpeakerVerificationDataset
from encoder.data_objects.speaker_verification_dataset import SpeakerVerificationDataLoader

BIN
env.yml Normal file

Binary file not shown.

120
gen_voice.py Normal file

@@ -0,0 +1,120 @@
from models.synthesizer.inference import Synthesizer
from models.encoder import inference as encoder
from models.vocoder.hifigan import inference as gan_vocoder
from pathlib import Path
import numpy as np
import soundfile as sf
import torch
import sys
import os
import re
import cn2an
vocoder = gan_vocoder
def gen_one_wav(synthesizer, in_fpath, embed, texts, file_name, seq):
embeds = [embed] * len(texts)
# If you know what the attention layer alignments are, you can retrieve them here by
# passing return_alignments=True
specs = synthesizer.synthesize_spectrograms(texts, embeds, style_idx=-1, min_stop_token=4, steps=400)
#spec = specs[0]
breaks = [spec.shape[1] for spec in specs]
spec = np.concatenate(specs, axis=1)
# If seed is specified, reset torch seed and reload vocoder
# Synthesizing the waveform is fairly straightforward. Remember that the longer the
# spectrogram, the more time-efficient the vocoder.
generated_wav, output_sample_rate = vocoder.infer_waveform(spec)
# Add breaks
b_ends = np.cumsum(np.array(breaks) * synthesizer.hparams.hop_size)
b_starts = np.concatenate(([0], b_ends[:-1]))
wavs = [generated_wav[start:end] for start, end, in zip(b_starts, b_ends)]
breaks = [np.zeros(int(0.15 * synthesizer.sample_rate))] * len(breaks)
generated_wav = np.concatenate([i for w, b in zip(wavs, breaks) for i in (w, b)])
## Post-generation
# There's a bug with sounddevice that makes the audio cut one second earlier, so we
# pad it.
# Trim excess silences to compensate for gaps in spectrograms (issue #53)
generated_wav = encoder.preprocess_wav(generated_wav)
generated_wav = generated_wav / np.abs(generated_wav).max() * 0.97
# Save it on the disk
model=os.path.basename(in_fpath)
filename = "%s_%d_%s.wav" %(file_name, seq, model)
sf.write(filename, generated_wav, synthesizer.sample_rate)
print("\nSaved output as %s\n\n" % filename)
def generate_wav(enc_model_fpath, syn_model_fpath, voc_model_fpath, in_fpath, input_txt, file_name):
if torch.cuda.is_available():
device_id = torch.cuda.current_device()
gpu_properties = torch.cuda.get_device_properties(device_id)
## Print some environment information (for debugging purposes)
print("Found %d GPUs available. Using GPU %d (%s) of compute capability %d.%d with "
"%.1fGb total memory.\n" %
(torch.cuda.device_count(),
device_id,
gpu_properties.name,
gpu_properties.major,
gpu_properties.minor,
gpu_properties.total_memory / 1e9))
else:
print("Using CPU for inference.\n")
print("Preparing the encoder, the synthesizer and the vocoder...")
encoder.load_model(enc_model_fpath)
synthesizer = Synthesizer(syn_model_fpath)
vocoder.load_model(voc_model_fpath)
encoder_wav = synthesizer.load_preprocess_wav(in_fpath)
embed, partial_embeds, _ = encoder.embed_utterance(encoder_wav, return_partials=True)
texts = input_txt.split("\n")
seq=0
each_num=1500
punctuation = '!,。、,' # punctuation marks used to split and clean the text
processed_texts = []
cur_num = 0
for text in texts:
for processed_text in re.sub(r'[{}]+'.format(punctuation), '\n', text).split('\n'):
if processed_text:
processed_texts.append(processed_text.strip())
cur_num += len(processed_text.strip())
if cur_num > each_num:
seq = seq +1
gen_one_wav(synthesizer, in_fpath, embed, processed_texts, file_name, seq)
processed_texts = []
cur_num = 0
if len(processed_texts)>0:
seq = seq +1
gen_one_wav(synthesizer, in_fpath, embed, processed_texts, file_name, seq)
if (len(sys.argv)>=3):
my_txt = ""
print("reading from :", sys.argv[1])
with open(sys.argv[1], "r") as f:
for line in f.readlines():
#line = line.strip('\n')
my_txt += line
txt_file_name = sys.argv[1]
wav_file_name = sys.argv[2]
output = cn2an.transform(my_txt, "an2cn")
print(output)
generate_wav(
Path("encoder/saved_models/pretrained.pt"),
Path("synthesizer/saved_models/mandarin.pt"),
Path("vocoder/saved_models/pretrained/g_hifigan.pt"), wav_file_name, output, txt_file_name
)
else:
print("please input the file name")
exit(1)
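gen_voice.py is invoked from the command line with a UTF-8 text file and a reference wav. Before synthesis the text is run through cn2an so Arabic numerals are spelled out as Chinese numerals, then split on the punctuation set above into chunks of at most 1500 characters. A hedged illustration of that normalization step (the sample sentence and its expected output are only indicative of how cn2an.transform(..., "an2cn") behaves):

import cn2an

text = "这本书售价123元"                   # illustrative input
normalized = cn2an.transform(text, "an2cn")
print(normalized)                          # expected to read roughly "这本书售价一百二十三元"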


@@ -1,5 +1,5 @@
from scipy.ndimage.morphology import binary_dilation
from encoder.params_data import *
from models.encoder.params_data import *
from pathlib import Path
from typing import Optional, Union
from warnings import warn
@@ -39,7 +39,7 @@ def preprocess_wav(fpath_or_wav: Union[str, Path, np.ndarray],
# Resample the wav if needed
if source_sr is not None and source_sr != sampling_rate:
wav = librosa.resample(wav, source_sr, sampling_rate)
wav = librosa.resample(wav, orig_sr = source_sr, target_sr = sampling_rate)
# Apply the preprocessing: normalize volume and shorten long silences
if normalize:
@@ -56,8 +56,8 @@ def wav_to_mel_spectrogram(wav):
Note: this is not a log-mel spectrogram.
"""
frames = librosa.feature.melspectrogram(
wav,
sampling_rate,
y=wav,
sr=sampling_rate,
n_fft=int(sampling_rate * mel_window_length / 1000),
hop_length=int(sampling_rate * mel_window_step / 1000),
n_mels=mel_n_channels
@@ -99,7 +99,7 @@ def trim_long_silences(wav):
return ret[width - 1:] / width
audio_mask = moving_average(voice_flags, vad_moving_average_width)
audio_mask = np.round(audio_mask).astype(np.bool)
audio_mask = np.round(audio_mask).astype(bool)
# Dilate the voiced regions
audio_mask = binary_dilation(audio_mask, np.ones(vad_max_silence_length + 1))
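These audio changes track newer librosa and NumPy releases: librosa.resample and librosa.feature.melspectrogram now expect keyword arguments, and the deprecated np.bool alias is replaced by the built-in bool. A short hedged sketch of the updated call style (window/hop values are illustrative; parameter names match librosa 0.10+):

import librosa
import numpy as np

wav, source_sr = librosa.load("example.wav", sr=None)  # illustrative input file
sampling_rate = 16000

# Old: librosa.resample(wav, source_sr, sampling_rate)
wav = librosa.resample(wav, orig_sr=source_sr, target_sr=sampling_rate)

# Old: librosa.feature.melspectrogram(wav, sampling_rate, ...)
frames = librosa.feature.melspectrogram(
    y=wav,
    sr=sampling_rate,
    n_fft=int(sampling_rate * 25 / 1000),       # 25 ms window (illustrative)
    hop_length=int(sampling_rate * 10 / 1000),  # 10 ms hop (illustrative)
    n_mels=40,
)

# Old: np.round(mask).astype(np.bool)
mask = np.round(np.random.rand(16)).astype(bool)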


@@ -0,0 +1,2 @@
from models.encoder.data_objects.speaker_verification_dataset import SpeakerVerificationDataset
from models.encoder.data_objects.speaker_verification_dataset import SpeakerVerificationDataLoader


@@ -1,5 +1,5 @@
from encoder.data_objects.random_cycler import RandomCycler
from encoder.data_objects.utterance import Utterance
from models.encoder.data_objects.random_cycler import RandomCycler
from models.encoder.data_objects.utterance import Utterance
from pathlib import Path
# Contains the set of utterances of a single speaker


@@ -1,6 +1,6 @@
import numpy as np
from typing import List
from encoder.data_objects.speaker import Speaker
from models.encoder.data_objects.speaker import Speaker
class SpeakerBatch:
def __init__(self, speakers: List[Speaker], utterances_per_speaker: int, n_frames: int):


@@ -1,7 +1,7 @@
from encoder.data_objects.random_cycler import RandomCycler
from encoder.data_objects.speaker_batch import SpeakerBatch
from encoder.data_objects.speaker import Speaker
from encoder.params_data import partials_n_frames
from models.encoder.data_objects.random_cycler import RandomCycler
from models.encoder.data_objects.speaker_batch import SpeakerBatch
from models.encoder.data_objects.speaker import Speaker
from models.encoder.params_data import partials_n_frames
from torch.utils.data import Dataset, DataLoader
from pathlib import Path


@@ -1,8 +1,8 @@
from encoder.params_data import *
from encoder.model import SpeakerEncoder
from encoder.audio import preprocess_wav # We want to expose this function from here
from models.encoder.params_data import *
from models.encoder.model import SpeakerEncoder
from models.encoder.audio import preprocess_wav # We want to expose this function from here
from matplotlib import cm
from encoder import audio
from models.encoder import audio
from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np
@@ -34,8 +34,16 @@ def load_model(weights_fpath: Path, device=None):
_model.load_state_dict(checkpoint["model_state"])
_model.eval()
print("Loaded encoder \"%s\" trained to step %d" % (weights_fpath.name, checkpoint["step"]))
return _model
def set_model(model, device=None):
global _model, _device
_model = model
if device is None:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
_device = device
_model.to(device)
def is_loaded():
return _model is not None
@@ -57,7 +65,7 @@ def embed_frames_batch(frames_batch):
def compute_partial_slices(n_samples, partial_utterance_n_frames=partials_n_frames,
min_pad_coverage=0.75, overlap=0.5):
min_pad_coverage=0.75, overlap=0.5, rate=None):
"""
Computes where to split an utterance waveform and its corresponding mel spectrogram to obtain
partial utterances of <partial_utterance_n_frames> each. Both the waveform and the mel
@@ -85,9 +93,18 @@ def compute_partial_slices(n_samples, partial_utterance_n_frames=partials_n_fram
assert 0 <= overlap < 1
assert 0 < min_pad_coverage <= 1
samples_per_frame = int((sampling_rate * mel_window_step / 1000))
n_frames = int(np.ceil((n_samples + 1) / samples_per_frame))
frame_step = max(int(np.round(partial_utterance_n_frames * (1 - overlap))), 1)
if rate is not None:
samples_per_frame = int((sampling_rate * mel_window_step / 1000))
n_frames = int(np.ceil((n_samples + 1) / samples_per_frame))
frame_step = int(np.round((sampling_rate / rate) / samples_per_frame))
else:
samples_per_frame = int((sampling_rate * mel_window_step / 1000))
n_frames = int(np.ceil((n_samples + 1) / samples_per_frame))
frame_step = max(int(np.round(partial_utterance_n_frames * (1 - overlap))), 1)
assert 0 < frame_step, "The rate is too high"
assert frame_step <= partials_n_frames, "The rate is too low, it should be %f at least" % \
(sampling_rate / (samples_per_frame * partials_n_frames))
# Compute the slices
wav_slices, mel_slices = [], []
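When a rate is supplied, the stride between partial utterances comes from it instead of from the overlap: frame_step = round((sampling_rate / rate) / samples_per_frame). A rough worked example, assuming the encoder's usual 16 kHz sampling rate and 10 ms mel window step (both live in params_data, which is not shown here):

import numpy as np

sampling_rate = 16000   # assumed params_data value
mel_window_step = 10    # ms, assumed params_data value
rate = 1.3              # the value the ppg2mel preprocessing below passes in

samples_per_frame = int(sampling_rate * mel_window_step / 1000)          # 160
frame_step = int(np.round((sampling_rate / rate) / samples_per_frame))   # ~77
print(samples_per_frame, frame_step)  # one partial slice roughly every 77 frames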


@@ -1,5 +1,5 @@
from encoder.params_model import *
from encoder.params_data import *
from models.encoder.params_model import *
from models.encoder.params_data import *
from scipy.interpolate import interp1d
from sklearn.metrics import roc_curve
from torch.nn.utils import clip_grad_norm_


@@ -1,8 +1,8 @@
from multiprocess.pool import ThreadPool
from encoder.params_data import *
from encoder.config import librispeech_datasets, anglophone_nationalites
from models.encoder.params_data import *
from models.encoder.config import librispeech_datasets, anglophone_nationalites
from datetime import datetime
from encoder import audio
from models.encoder import audio
from pathlib import Path
from tqdm import tqdm
import numpy as np
@@ -22,7 +22,7 @@ class DatasetLog:
self._log_params()
def _log_params(self):
from encoder import params_data
from models.encoder import params_data
self.write_line("Parameter values:")
for param_name in (p for p in dir(params_data) if not p.startswith("__")):
value = getattr(params_data, param_name)


@@ -1,7 +1,7 @@
from encoder.visualizations import Visualizations
from encoder.data_objects import SpeakerVerificationDataLoader, SpeakerVerificationDataset
from encoder.params_model import *
from encoder.model import SpeakerEncoder
from models.encoder.visualizations import Visualizations
from models.encoder.data_objects import SpeakerVerificationDataLoader, SpeakerVerificationDataset
from models.encoder.params_model import *
from models.encoder.model import SpeakerEncoder
from utils.profiler import Profiler
from pathlib import Path
import torch


@@ -1,4 +1,4 @@
from encoder.data_objects.speaker_verification_dataset import SpeakerVerificationDataset
from models.encoder.data_objects.speaker_verification_dataset import SpeakerVerificationDataset
from datetime import datetime
from time import perf_counter as timer
import matplotlib.pyplot as plt
@@ -21,7 +21,7 @@ colormap = np.array([
[33, 0, 127],
[0, 0, 0],
[183, 183, 183],
], dtype=np.float) / 255
], dtype=float) / 255
class Visualizations:
@@ -65,8 +65,8 @@ class Visualizations:
def log_params(self):
if self.disabled:
return
from encoder import params_data
from encoder import params_model
from models.encoder import params_data
from models.encoder import params_model
param_string = "<b>Model parameters</b>:<br>"
for param_name in (p for p in dir(params_model) if not p.startswith("__")):
value = getattr(params_model, param_name)

209
models/ppg2mel/__init__.py Normal file

@@ -0,0 +1,209 @@
#!/usr/bin/env python3
# Copyright 2020 Songxiang Liu
# Apache 2.0
from typing import List
import torch
import torch.nn.functional as F
import numpy as np
from .utils.abs_model import AbsMelDecoder
from .rnn_decoder_mol import Decoder
from .utils.cnn_postnet import Postnet
from .utils.vc_utils import get_mask_from_lengths
from utils.hparams import HpsYaml
class MelDecoderMOLv2(AbsMelDecoder):
"""Use an encoder to preprocess ppg."""
def __init__(
self,
num_speakers: int,
spk_embed_dim: int,
bottle_neck_feature_dim: int,
encoder_dim: int = 256,
encoder_downsample_rates: List = [2, 2],
attention_rnn_dim: int = 512,
decoder_rnn_dim: int = 512,
num_decoder_rnn_layer: int = 1,
concat_context_to_last: bool = True,
prenet_dims: List = [256, 128],
num_mixtures: int = 5,
frames_per_step: int = 2,
mask_padding: bool = True,
):
super().__init__()
self.mask_padding = mask_padding
self.bottle_neck_feature_dim = bottle_neck_feature_dim
self.num_mels = 80
self.encoder_down_factor=np.cumprod(encoder_downsample_rates)[-1]
self.frames_per_step = frames_per_step
self.use_spk_dvec = True
input_dim = bottle_neck_feature_dim
# Downsampling convolution
self.bnf_prenet = torch.nn.Sequential(
torch.nn.Conv1d(input_dim, encoder_dim, kernel_size=1, bias=False),
torch.nn.LeakyReLU(0.1),
torch.nn.InstanceNorm1d(encoder_dim, affine=False),
torch.nn.Conv1d(
encoder_dim, encoder_dim,
kernel_size=2*encoder_downsample_rates[0],
stride=encoder_downsample_rates[0],
padding=encoder_downsample_rates[0]//2,
),
torch.nn.LeakyReLU(0.1),
torch.nn.InstanceNorm1d(encoder_dim, affine=False),
torch.nn.Conv1d(
encoder_dim, encoder_dim,
kernel_size=2*encoder_downsample_rates[1],
stride=encoder_downsample_rates[1],
padding=encoder_downsample_rates[1]//2,
),
torch.nn.LeakyReLU(0.1),
torch.nn.InstanceNorm1d(encoder_dim, affine=False),
)
decoder_enc_dim = encoder_dim
self.pitch_convs = torch.nn.Sequential(
torch.nn.Conv1d(2, encoder_dim, kernel_size=1, bias=False),
torch.nn.LeakyReLU(0.1),
torch.nn.InstanceNorm1d(encoder_dim, affine=False),
torch.nn.Conv1d(
encoder_dim, encoder_dim,
kernel_size=2*encoder_downsample_rates[0],
stride=encoder_downsample_rates[0],
padding=encoder_downsample_rates[0]//2,
),
torch.nn.LeakyReLU(0.1),
torch.nn.InstanceNorm1d(encoder_dim, affine=False),
torch.nn.Conv1d(
encoder_dim, encoder_dim,
kernel_size=2*encoder_downsample_rates[1],
stride=encoder_downsample_rates[1],
padding=encoder_downsample_rates[1]//2,
),
torch.nn.LeakyReLU(0.1),
torch.nn.InstanceNorm1d(encoder_dim, affine=False),
)
self.reduce_proj = torch.nn.Linear(encoder_dim + spk_embed_dim, encoder_dim)
# Decoder
self.decoder = Decoder(
enc_dim=decoder_enc_dim,
num_mels=self.num_mels,
frames_per_step=frames_per_step,
attention_rnn_dim=attention_rnn_dim,
decoder_rnn_dim=decoder_rnn_dim,
num_decoder_rnn_layer=num_decoder_rnn_layer,
prenet_dims=prenet_dims,
num_mixtures=num_mixtures,
use_stop_tokens=True,
concat_context_to_last=concat_context_to_last,
encoder_down_factor=self.encoder_down_factor,
)
# Mel-Spec Postnet: some residual CNN layers
self.postnet = Postnet()
def parse_output(self, outputs, output_lengths=None):
if self.mask_padding and output_lengths is not None:
mask = ~get_mask_from_lengths(output_lengths, outputs[0].size(1))
mask = mask.unsqueeze(2).expand(mask.size(0), mask.size(1), self.num_mels)
outputs[0].data.masked_fill_(mask, 0.0)
outputs[1].data.masked_fill_(mask, 0.0)
return outputs
def forward(
self,
bottle_neck_features: torch.Tensor,
feature_lengths: torch.Tensor,
speech: torch.Tensor,
speech_lengths: torch.Tensor,
logf0_uv: torch.Tensor = None,
spembs: torch.Tensor = None,
output_att_ws: bool = False,
):
decoder_inputs = self.bnf_prenet(
bottle_neck_features.transpose(1, 2)
).transpose(1, 2)
logf0_uv = self.pitch_convs(logf0_uv.transpose(1, 2)).transpose(1, 2)
decoder_inputs = decoder_inputs + logf0_uv
assert spembs is not None
spk_embeds = F.normalize(
spembs).unsqueeze(1).expand(-1, decoder_inputs.size(1), -1)
decoder_inputs = torch.cat([decoder_inputs, spk_embeds], dim=-1)
decoder_inputs = self.reduce_proj(decoder_inputs)
# (B, num_mels, T_dec)
T_dec = torch.div(feature_lengths, int(self.encoder_down_factor), rounding_mode='floor')
mel_outputs, predicted_stop, alignments = self.decoder(
decoder_inputs, speech, T_dec)
## Post-processing
mel_outputs_postnet = self.postnet(mel_outputs.transpose(1, 2)).transpose(1, 2)
mel_outputs_postnet = mel_outputs + mel_outputs_postnet
if output_att_ws:
return self.parse_output(
[mel_outputs, mel_outputs_postnet, predicted_stop, alignments], speech_lengths)
else:
return self.parse_output(
[mel_outputs, mel_outputs_postnet, predicted_stop], speech_lengths)
# return mel_outputs, mel_outputs_postnet
def inference(
self,
bottle_neck_features: torch.Tensor,
logf0_uv: torch.Tensor = None,
spembs: torch.Tensor = None,
):
decoder_inputs = self.bnf_prenet(bottle_neck_features.transpose(1, 2)).transpose(1, 2)
logf0_uv = self.pitch_convs(logf0_uv.transpose(1, 2)).transpose(1, 2)
decoder_inputs = decoder_inputs + logf0_uv
assert spembs is not None
spk_embeds = F.normalize(
spembs).unsqueeze(1).expand(-1, decoder_inputs.size(1), -1)
bottle_neck_features = torch.cat([decoder_inputs, spk_embeds], dim=-1)
bottle_neck_features = self.reduce_proj(bottle_neck_features)
## Decoder
if bottle_neck_features.size(0) > 1:
mel_outputs, alignments = self.decoder.inference_batched(bottle_neck_features)
else:
mel_outputs, alignments = self.decoder.inference(bottle_neck_features,)
## Post-processing
mel_outputs_postnet = self.postnet(mel_outputs.transpose(1, 2)).transpose(1, 2)
mel_outputs_postnet = mel_outputs + mel_outputs_postnet
# outputs = mel_outputs_postnet[0]
return mel_outputs[0], mel_outputs_postnet[0], alignments[0]
def load_model(model_file, device=None):
# search a config file
model_config_fpaths = list(model_file.parent.rglob("*.yaml"))
if len(model_config_fpaths) == 0:
raise Exception("No model yaml config found for convertor")
if device is None:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_config = HpsYaml(model_config_fpaths[0])
ppg2mel_model = MelDecoderMOLv2(
**model_config["model"]
).to(device)
ckpt = torch.load(model_file, map_location=device)
ppg2mel_model.load_state_dict(ckpt["model"])
ppg2mel_model.eval()
return ppg2mel_model
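load_model finds the YAML config sitting next to the checkpoint, builds MelDecoderMOLv2 from its model section and loads the weights in eval mode. A hedged usage sketch mirroring how the toolbox's convert() calls it (the checkpoint path and all tensor sizes are illustrative):

from pathlib import Path

import numpy as np
import torch

import models.ppg2mel as convertor  # same import the toolbox uses

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = convertor.load_model(Path("data/ckpt/ppg2mel/example.pth"), device)  # illustrative path

# Dummy inputs shaped like the toolbox's: PPG frames, converted log-F0/UV, speaker embedding
T, bnf_dim, spk_dim = 200, 144, 256  # illustrative dimensions
ppg = torch.randn(1, T, bnf_dim).to(device)
lf0_uv = np.random.rand(T, 2).astype(np.float32)
embed = np.random.rand(spk_dim).astype(np.float32)

_, mel_pred, att_ws = model.inference(
    ppg,
    logf0_uv=torch.from_numpy(lf0_uv).unsqueeze(0).float().to(device),
    spembs=torch.from_numpy(embed).unsqueeze(0).to(device),
)
mel = mel_pred.transpose(0, 1).detach().cpu().numpy()  # same post-processing as the toolbox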


@@ -0,0 +1,113 @@
import os
import torch
import numpy as np
from tqdm import tqdm
from pathlib import Path
import soundfile
import resampy
from models.ppg_extractor import load_model
from models.encoder import inference as Encoder
from models.encoder.audio import preprocess_wav
from models.encoder import audio
from utils.f0_utils import compute_f0
from torch.multiprocessing import Pool, cpu_count
from functools import partial
SAMPLE_RATE=16000
def _compute_bnf(
wav: any,
output_fpath: str,
device: torch.device,
ppg_model_local: any,
):
"""
Compute CTC-Attention Seq2seq ASR encoder bottle-neck features (BNF).
"""
ppg_model_local.to(device)
wav_tensor = torch.from_numpy(wav).float().to(device).unsqueeze(0)
wav_length = torch.LongTensor([wav.shape[0]]).to(device)
with torch.no_grad():
bnf = ppg_model_local(wav_tensor, wav_length)
bnf_npy = bnf.squeeze(0).cpu().numpy()
np.save(output_fpath, bnf_npy, allow_pickle=False)
return bnf_npy, len(bnf_npy)
def _compute_f0_from_wav(wav, output_fpath):
"""Compute merged f0 values."""
f0 = compute_f0(wav, SAMPLE_RATE)
np.save(output_fpath, f0, allow_pickle=False)
return f0, len(f0)
def _compute_spkEmbed(wav, output_fpath, encoder_model_local, device):
Encoder.set_model(encoder_model_local)
# Compute where to split the utterance into partials and pad if necessary
wave_slices, mel_slices = Encoder.compute_partial_slices(len(wav), rate=1.3, min_pad_coverage=0.75)
max_wave_length = wave_slices[-1].stop
if max_wave_length >= len(wav):
wav = np.pad(wav, (0, max_wave_length - len(wav)), "constant")
# Split the utterance into partials
frames = audio.wav_to_mel_spectrogram(wav)
frames_batch = np.array([frames[s] for s in mel_slices])
partial_embeds = Encoder.embed_frames_batch(frames_batch)
# Compute the utterance embedding from the partial embeddings
raw_embed = np.mean(partial_embeds, axis=0)
embed = raw_embed / np.linalg.norm(raw_embed, 2)
np.save(output_fpath, embed, allow_pickle=False)
return embed, len(embed)
def preprocess_one(wav_path, out_dir, device, ppg_model_local, encoder_model_local):
# wav = preprocess_wav(wav_path)
# try:
wav, sr = soundfile.read(wav_path)
if len(wav) < sr:
return None, sr, len(wav)
if sr != SAMPLE_RATE:
wav = resampy.resample(wav, sr, SAMPLE_RATE)
sr = SAMPLE_RATE
utt_id = os.path.splitext(os.path.basename(wav_path))[0]  # rstrip(".wav") would also strip trailing 'w'/'a'/'v' characters
_, length_bnf = _compute_bnf(output_fpath=f"{out_dir}/bnf/{utt_id}.ling_feat.npy", wav=wav, device=device, ppg_model_local=ppg_model_local)
_, length_f0 = _compute_f0_from_wav(output_fpath=f"{out_dir}/f0/{utt_id}.f0.npy", wav=wav)
_, length_embed = _compute_spkEmbed(output_fpath=f"{out_dir}/embed/{utt_id}.npy", device=device, encoder_model_local=encoder_model_local, wav=wav)
def preprocess_dataset(datasets_root, dataset, out_dir, n_processes, ppg_encoder_model_fpath, speaker_encoder_model):
# Glob wav files
wav_file_list = sorted(Path(f"{datasets_root}/{dataset}").glob("**/*.wav"))
print(f"Globbed {len(wav_file_list)} wav files.")
out_dir.joinpath("bnf").mkdir(exist_ok=True, parents=True)
out_dir.joinpath("f0").mkdir(exist_ok=True, parents=True)
out_dir.joinpath("embed").mkdir(exist_ok=True, parents=True)
ppg_model_local = load_model(ppg_encoder_model_fpath, "cpu")
encoder_model_local = Encoder.load_model(speaker_encoder_model, "cpu")
if n_processes is None:
n_processes = cpu_count()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
func = partial(preprocess_one, out_dir=out_dir, ppg_model_local=ppg_model_local, encoder_model_local=encoder_model_local, device=device)
job = Pool(n_processes).imap(func, wav_file_list)
list(tqdm(job, "Preprocessing", len(wav_file_list), unit="wav"))
# finish processing and mark
t_fid_file = out_dir.joinpath("train_fidlist.txt").open("w", encoding="utf-8")
d_fid_file = out_dir.joinpath("dev_fidlist.txt").open("w", encoding="utf-8")
e_fid_file = out_dir.joinpath("eval_fidlist.txt").open("w", encoding="utf-8")
for file in sorted(out_dir.joinpath("f0").glob("*.npy")):
id = os.path.basename(file).split(".f0.npy")[0]
if id.endswith("01"):
d_fid_file.write(id + "\n")
elif id.endswith("09"):
e_fid_file.write(id + "\n")
else:
t_fid_file.write(id + "\n")
t_fid_file.close()
d_fid_file.close()
e_fid_file.close()
return len(wav_file_list)
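# --- Illustrative usage sketch (added for documentation; not in the original file). ---
# Dataset name and model paths below are assumptions; adapt them to your setup.
if __name__ == "__main__":
    from pathlib import Path
    n_wavs = preprocess_dataset(
        datasets_root=Path("datasets_root"),
        dataset="my_corpus",                                                      # assumed folder of wav files
        out_dir=Path("datasets_root/ppg2mel_data"),
        n_processes=4,
        ppg_encoder_model_fpath=Path("ppg_extractor/saved_models/ppg_model.pt"),  # assumed checkpoint path
        speaker_encoder_model=Path("encoder/saved_models/pretrained.pt"),         # assumed checkpoint path
    )
    print(f"Preprocessed {n_wavs} wav files.")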


@@ -0,0 +1,374 @@
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from .utils.mol_attention import MOLAttention
from .utils.basic_layers import Linear
from .utils.vc_utils import get_mask_from_lengths
class DecoderPrenet(nn.Module):
def __init__(self, in_dim, sizes):
super().__init__()
in_sizes = [in_dim] + sizes[:-1]
self.layers = nn.ModuleList(
[Linear(in_size, out_size, bias=False)
for (in_size, out_size) in zip(in_sizes, sizes)])
def forward(self, x):
for linear in self.layers:
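# (added note) Dropout stays active even at inference time (training=True), a Tacotron-2-style choice that keeps the autoregressive decoder from over-relying on the previous frame.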
x = F.dropout(F.relu(linear(x)), p=0.5, training=True)
return x
class Decoder(nn.Module):
"""Mixture of Logistic (MoL) attention-based RNN Decoder."""
def __init__(
self,
enc_dim,
num_mels,
frames_per_step,
attention_rnn_dim,
decoder_rnn_dim,
prenet_dims,
num_mixtures,
encoder_down_factor=1,
num_decoder_rnn_layer=1,
use_stop_tokens=False,
concat_context_to_last=False,
):
super().__init__()
self.enc_dim = enc_dim
self.encoder_down_factor = encoder_down_factor
self.num_mels = num_mels
self.frames_per_step = frames_per_step
self.attention_rnn_dim = attention_rnn_dim
self.decoder_rnn_dim = decoder_rnn_dim
self.prenet_dims = prenet_dims
self.use_stop_tokens = use_stop_tokens
self.num_decoder_rnn_layer = num_decoder_rnn_layer
self.concat_context_to_last = concat_context_to_last
# Mel prenet
self.prenet = DecoderPrenet(num_mels, prenet_dims)
self.prenet_pitch = DecoderPrenet(num_mels, prenet_dims)
# Attention RNN
self.attention_rnn = nn.LSTMCell(
prenet_dims[-1] + enc_dim,
attention_rnn_dim
)
# Attention
self.attention_layer = MOLAttention(
attention_rnn_dim,
r=frames_per_step/encoder_down_factor,
M=num_mixtures,
)
# Decoder RNN
self.decoder_rnn_layers = nn.ModuleList()
for i in range(num_decoder_rnn_layer):
if i == 0:
self.decoder_rnn_layers.append(
nn.LSTMCell(
enc_dim + attention_rnn_dim,
decoder_rnn_dim))
else:
self.decoder_rnn_layers.append(
nn.LSTMCell(
decoder_rnn_dim,
decoder_rnn_dim))
# self.decoder_rnn = nn.LSTMCell(
# 2 * enc_dim + attention_rnn_dim,
# decoder_rnn_dim
# )
if concat_context_to_last:
self.linear_projection = Linear(
enc_dim + decoder_rnn_dim,
num_mels * frames_per_step
)
else:
self.linear_projection = Linear(
decoder_rnn_dim,
num_mels * frames_per_step
)
# Stop-token layer
if self.use_stop_tokens:
if concat_context_to_last:
self.stop_layer = Linear(
enc_dim + decoder_rnn_dim, 1, bias=True, w_init_gain="sigmoid"
)
else:
self.stop_layer = Linear(
decoder_rnn_dim, 1, bias=True, w_init_gain="sigmoid"
)
def get_go_frame(self, memory):
B = memory.size(0)
go_frame = torch.zeros((B, self.num_mels), dtype=torch.float,
device=memory.device)
return go_frame
def initialize_decoder_states(self, memory, mask):
device = next(self.parameters()).device
B = memory.size(0)
# attention rnn states
self.attention_hidden = torch.zeros(
(B, self.attention_rnn_dim), device=device)
self.attention_cell = torch.zeros(
(B, self.attention_rnn_dim), device=device)
# decoder rnn states
self.decoder_hiddens = []
self.decoder_cells = []
for i in range(self.num_decoder_rnn_layer):
self.decoder_hiddens.append(
torch.zeros((B, self.decoder_rnn_dim),
device=device)
)
self.decoder_cells.append(
torch.zeros((B, self.decoder_rnn_dim),
device=device)
)
# self.decoder_hidden = torch.zeros(
# (B, self.decoder_rnn_dim), device=device)
# self.decoder_cell = torch.zeros(
# (B, self.decoder_rnn_dim), device=device)
self.attention_context = torch.zeros(
(B, self.enc_dim), device=device)
self.memory = memory
# self.processed_memory = self.attention_layer.memory_layer(memory)
self.mask = mask
def parse_decoder_inputs(self, decoder_inputs):
"""Prepare decoder inputs, i.e. gt mel
Args:
decoder_inputs:(B, T_out, n_mel_channels) inputs used for teacher-forced training.
"""
decoder_inputs = decoder_inputs.reshape(
decoder_inputs.size(0),
int(decoder_inputs.size(1)/self.frames_per_step), -1)
# (B, T_out//r, r*num_mels) -> (T_out//r, B, r*num_mels)
decoder_inputs = decoder_inputs.transpose(0, 1)
# (T_out//r, B, num_mels)
decoder_inputs = decoder_inputs[:,:,-self.num_mels:]
return decoder_inputs
def parse_decoder_outputs(self, mel_outputs, alignments, stop_outputs):
""" Prepares decoder outputs for output
Args:
mel_outputs:
alignments:
"""
# (T_out//r, B, T_enc) -> (B, T_out//r, T_enc)
alignments = torch.stack(alignments).transpose(0, 1)
# (T_out//r, B) -> (B, T_out//r)
if stop_outputs is not None:
if alignments.size(0) == 1:
stop_outputs = torch.stack(stop_outputs).unsqueeze(0)
else:
stop_outputs = torch.stack(stop_outputs).transpose(0, 1)
stop_outputs = stop_outputs.contiguous()
# (T_out//r, B, num_mels*r) -> (B, T_out//r, num_mels*r)
mel_outputs = torch.stack(mel_outputs).transpose(0, 1).contiguous()
# decouple frames per step
# (B, T_out, num_mels)
mel_outputs = mel_outputs.view(
mel_outputs.size(0), -1, self.num_mels)
return mel_outputs, alignments, stop_outputs
def attend(self, decoder_input):
cell_input = torch.cat((decoder_input, self.attention_context), -1)
self.attention_hidden, self.attention_cell = self.attention_rnn(
cell_input, (self.attention_hidden, self.attention_cell))
self.attention_context, attention_weights = self.attention_layer(
self.attention_hidden, self.memory, None, self.mask)
decoder_rnn_input = torch.cat(
(self.attention_hidden, self.attention_context), -1)
return decoder_rnn_input, self.attention_context, attention_weights
def decode(self, decoder_input):
for i in range(self.num_decoder_rnn_layer):
if i == 0:
self.decoder_hiddens[i], self.decoder_cells[i] = self.decoder_rnn_layers[i](
decoder_input, (self.decoder_hiddens[i], self.decoder_cells[i]))
else:
self.decoder_hiddens[i], self.decoder_cells[i] = self.decoder_rnn_layers[i](
self.decoder_hiddens[i-1], (self.decoder_hiddens[i], self.decoder_cells[i]))
return self.decoder_hiddens[-1]
def forward(self, memory, mel_inputs, memory_lengths):
""" Decoder forward pass for training
Args:
memory: (B, T_enc, enc_dim) Encoder outputs
decoder_inputs: (B, T, num_mels) Decoder inputs for teacher forcing.
memory_lengths: (B, ) Encoder output lengths for attention masking.
Returns:
mel_outputs: (B, T, num_mels) mel outputs from the decoder
alignments: (B, T//r, T_enc) attention weights.
"""
# [1, B, num_mels]
go_frame = self.get_go_frame(memory).unsqueeze(0)
# [T//r, B, num_mels]
mel_inputs = self.parse_decoder_inputs(mel_inputs)
# [T//r + 1, B, num_mels]
mel_inputs = torch.cat((go_frame, mel_inputs), dim=0)
# [T//r + 1, B, prenet_dim]
decoder_inputs = self.prenet(mel_inputs)
# decoder_inputs_pitch = self.prenet_pitch(decoder_inputs__)
self.initialize_decoder_states(
memory, mask=~get_mask_from_lengths(memory_lengths),
)
self.attention_layer.init_states(memory)
# self.attention_layer_pitch.init_states(memory_pitch)
mel_outputs, alignments = [], []
if self.use_stop_tokens:
stop_outputs = []
else:
stop_outputs = None
while len(mel_outputs) < decoder_inputs.size(0) - 1:
decoder_input = decoder_inputs[len(mel_outputs)]
# decoder_input_pitch = decoder_inputs_pitch[len(mel_outputs)]
decoder_rnn_input, context, attention_weights = self.attend(decoder_input)
decoder_rnn_output = self.decode(decoder_rnn_input)
if self.concat_context_to_last:
decoder_rnn_output = torch.cat(
(decoder_rnn_output, context), dim=1)
mel_output = self.linear_projection(decoder_rnn_output)
if self.use_stop_tokens:
stop_output = self.stop_layer(decoder_rnn_output)
stop_outputs += [stop_output.squeeze()]
mel_outputs += [mel_output.squeeze(1)] #? perhaps don't need squeeze
alignments += [attention_weights]
# alignments_pitch += [attention_weights_pitch]
mel_outputs, alignments, stop_outputs = self.parse_decoder_outputs(
mel_outputs, alignments, stop_outputs)
if stop_outputs is None:
return mel_outputs, alignments
else:
return mel_outputs, stop_outputs, alignments
def inference(self, memory, stop_threshold=0.5):
""" Decoder inference
Args:
memory: (1, T_enc, D_enc) Encoder outputs
Returns:
mel_outputs: mel outputs from the decoder
alignments: sequence of attention weights from the decoder
"""
# [1, num_mels]
decoder_input = self.get_go_frame(memory)
self.initialize_decoder_states(memory, mask=None)
self.attention_layer.init_states(memory)
mel_outputs, alignments = [], []
# NOTE(sx): heuristic
max_decoder_step = memory.size(1)*self.encoder_down_factor//self.frames_per_step
min_decoder_step = memory.size(1)*self.encoder_down_factor // self.frames_per_step - 5
while True:
decoder_input = self.prenet(decoder_input)
decoder_input_final, context, alignment = self.attend(decoder_input)
#mel_output, stop_output, alignment = self.decode(decoder_input)
decoder_rnn_output = self.decode(decoder_input_final)
if self.concat_context_to_last:
decoder_rnn_output = torch.cat(
(decoder_rnn_output, context), dim=1)
mel_output = self.linear_projection(decoder_rnn_output)
stop_output = self.stop_layer(decoder_rnn_output)
mel_outputs += [mel_output.squeeze(1)]
alignments += [alignment]
if torch.sigmoid(stop_output.data) > stop_threshold and len(mel_outputs) >= min_decoder_step:
break
if len(mel_outputs) >= max_decoder_step:
# print("Warning! Decoding steps reaches max decoder steps.")
break
decoder_input = mel_output[:,-self.num_mels:]
mel_outputs, alignments, _ = self.parse_decoder_outputs(
mel_outputs, alignments, None)
return mel_outputs, alignments
def inference_batched(self, memory, stop_threshold=0.5):
""" Decoder inference
Args:
memory: (B, T_enc, D_enc) Encoder outputs
Returns:
mel_outputs: mel outputs from the decoder
alignments: sequence of attention weights from the decoder
"""
# [1, num_mels]
decoder_input = self.get_go_frame(memory)
self.initialize_decoder_states(memory, mask=None)
self.attention_layer.init_states(memory)
mel_outputs, alignments = [], []
stop_outputs = []
# NOTE(sx): heuristic
max_decoder_step = memory.size(1)*self.encoder_down_factor//self.frames_per_step
min_decoder_step = memory.size(1)*self.encoder_down_factor // self.frames_per_step - 5
while True:
decoder_input = self.prenet(decoder_input)
decoder_input_final, context, alignment = self.attend(decoder_input)
#mel_output, stop_output, alignment = self.decode(decoder_input)
decoder_rnn_output = self.decode(decoder_input_final)
if self.concat_context_to_last:
decoder_rnn_output = torch.cat(
(decoder_rnn_output, context), dim=1)
mel_output = self.linear_projection(decoder_rnn_output)
# (B, 1)
stop_output = self.stop_layer(decoder_rnn_output)
stop_outputs += [stop_output.squeeze()]
# stop_outputs.append(stop_output)
mel_outputs += [mel_output.squeeze(1)]
alignments += [alignment]
# print(stop_output.shape)
if torch.all(torch.sigmoid(stop_output.squeeze().data) > stop_threshold) \
and len(mel_outputs) >= min_decoder_step:
break
if len(mel_outputs) >= max_decoder_step:
# print("Warning! Decoding steps reaches max decoder steps.")
break
decoder_input = mel_output[:,-self.num_mels:]
mel_outputs, alignments, stop_outputs = self.parse_decoder_outputs(
mel_outputs, alignments, stop_outputs)
mel_outputs_stacked = []
for mel, stop_logit in zip(mel_outputs, stop_outputs):
idx = np.argwhere(torch.sigmoid(stop_logit.cpu()) > stop_threshold)[0][0].item()
mel_outputs_stacked.append(mel[:idx,:])
mel_outputs = torch.cat(mel_outputs_stacked, dim=0).unsqueeze(0)
return mel_outputs, alignments

models/ppg2mel/train.py

@@ -0,0 +1,62 @@
import sys
import torch
import argparse
import numpy as np
from utils.hparams import HpsYaml
from models.ppg2mel.train.train_linglf02mel_seq2seq_oneshotvc import Solver
# For reproducibility; commenting these out may speed up training
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
def main():
# Arguments
parser = argparse.ArgumentParser(description=
'Training PPG2Mel VC model.')
parser.add_argument('--config', type=str,
help='Path to experiment config, e.g., config/vc.yaml')
parser.add_argument('--name', default=None, type=str, help='Name for logging.')
parser.add_argument('--logdir', default='log/', type=str,
help='Logging path.', required=False)
parser.add_argument('--ckpdir', default='ckpt/', type=str,
help='Checkpoint path.', required=False)
parser.add_argument('--outdir', default='result/', type=str,
help='Decode output path.', required=False)
parser.add_argument('--load', default=None, type=str,
help='Load pre-trained model (for training only)', required=False)
parser.add_argument('--warm_start', action='store_true',
help='Load model weights only, ignore specified layers.')
parser.add_argument('--seed', default=0, type=int,
help='Random seed for reproducible results.', required=False)
parser.add_argument('--njobs', default=8, type=int,
help='Number of threads for dataloader/decoding.', required=False)
parser.add_argument('--cpu', action='store_true', help='Disable GPU training.')
parser.add_argument('--no-pin', action='store_true',
help='Disable pin-memory for dataloader')
parser.add_argument('--no-msg', action='store_true', help='Hide all messages.')
###
paras = parser.parse_args()
setattr(paras, 'gpu', not paras.cpu)
setattr(paras, 'pin_memory', not paras.no_pin)
setattr(paras, 'verbose', not paras.no_msg)
# Make the config dict dot visitable
config = HpsYaml(paras.config)
np.random.seed(paras.seed)
torch.manual_seed(paras.seed)
if torch.cuda.is_available():
torch.cuda.manual_seed_all(paras.seed)
print(">>> OneShot VC training ...")
mode = "train"
solver = Solver(config, paras, mode)
solver.load_data()
solver.set_model()
solver.exec()
print(">>> Oneshot VC train finished!")
sys.exit(0)
if __name__ == "__main__":
main()
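# Example invocation (added note; module path and config name are assumptions):
#   python -m models.ppg2mel.train --config config/vc.yaml --name ppg2mel_vc --njobs 8
# The yaml config is expected to provide the `data`, `model` and `hparas` sections consumed by the Solver.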


@@ -0,0 +1,50 @@
from typing import Dict
from typing import Tuple
import torch
import torch.nn as nn
import torch.nn.functional as F
from ..utils.nets_utils import make_pad_mask
class MaskedMSELoss(nn.Module):
def __init__(self, frames_per_step):
super().__init__()
self.frames_per_step = frames_per_step
self.mel_loss_criterion = nn.MSELoss(reduction='none')
# self.loss = nn.MSELoss()
self.stop_loss_criterion = nn.BCEWithLogitsLoss(reduction='none')
def get_mask(self, lengths, max_len=None):
# lengths: [B,]
if max_len is None:
max_len = torch.max(lengths)
batch_size = lengths.size(0)
seq_range = torch.arange(0, max_len).long()
seq_range_expand = seq_range.unsqueeze(0).expand(batch_size, max_len).to(lengths.device)
seq_length_expand = lengths.unsqueeze(1).expand_as(seq_range_expand)
return (seq_range_expand < seq_length_expand).float()
def forward(self, mel_pred, mel_pred_postnet, mel_trg, lengths,
stop_target, stop_pred):
## process stop_target
B = stop_target.size(0)
stop_target = stop_target.reshape(B, -1, self.frames_per_step)[:, :, 0]
stop_lengths = torch.ceil(lengths.float() / self.frames_per_step).long()
stop_mask = self.get_mask(stop_lengths, int(mel_trg.size(1)/self.frames_per_step))
mel_trg.requires_grad = False
# (B, T, 1)
mel_mask = self.get_mask(lengths, mel_trg.size(1)).unsqueeze(-1)
# (B, T, D)
mel_mask = mel_mask.expand_as(mel_trg)
mel_loss_pre = (self.mel_loss_criterion(mel_pred, mel_trg) * mel_mask).sum() / mel_mask.sum()
mel_loss_post = (self.mel_loss_criterion(mel_pred_postnet, mel_trg) * mel_mask).sum() / mel_mask.sum()
mel_loss = mel_loss_pre + mel_loss_post
# stop token loss
stop_loss = torch.sum(self.stop_loss_criterion(stop_pred, stop_target) * stop_mask) / stop_mask.sum()
return mel_loss, stop_loss
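# Minimal sketch (added for documentation; not in the original file) of the length
# mask built by `get_mask`, here for lengths = [4, 2] and max_len = 5:
#     seq = torch.arange(5).unsqueeze(0).expand(2, 5)
#     (seq < torch.tensor([4, 2]).unsqueeze(1)).float()
#     # -> [[1., 1., 1., 1., 0.],
#     #     [1., 1., 0., 0., 0.]]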


@@ -0,0 +1,45 @@
import torch
import numpy as np
class Optimizer():
def __init__(self, parameters, optimizer, lr, eps, lr_scheduler,
**kwargs):
# Setup torch optimizer
self.opt_type = optimizer
self.init_lr = lr
self.sch_type = lr_scheduler
opt = getattr(torch.optim, optimizer)
if lr_scheduler == 'warmup':
warmup_step = 4000.0
init_lr = lr
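# (added note) Noam/Transformer-style warmup: the schedule below rises roughly linearly to init_lr over the first `warmup_step` steps, then decays proportionally to step**-0.5.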
self.lr_scheduler = lambda step: init_lr * warmup_step ** 0.5 * \
np.minimum((step+1)*warmup_step**-1.5, (step+1)**-0.5)
self.opt = opt(parameters, lr=1.0)
else:
self.lr_scheduler = None
self.opt = opt(parameters, lr=lr, eps=eps) # ToDo: 1e-8 better?
def get_opt_state_dict(self):
return self.opt.state_dict()
def load_opt_state_dict(self, state_dict):
self.opt.load_state_dict(state_dict)
def pre_step(self, step):
if self.lr_scheduler is not None:
cur_lr = self.lr_scheduler(step)
for param_group in self.opt.param_groups:
param_group['lr'] = cur_lr
else:
cur_lr = self.init_lr
self.opt.zero_grad()
return cur_lr
def step(self):
self.opt.step()
def create_msg(self):
return ['Optim.Info.| Algo. = {}\t| Lr = {}\t (schedule = {})'
.format(self.opt_type, self.init_lr, self.sch_type)]


@@ -0,0 +1,10 @@
# Default parameters which will be imported by solver
default_hparas = {
'GRAD_CLIP': 5.0, # Grad. clip threshold
'PROGRESS_STEP': 100, # Std. output refresh freq.
# Decode steps for objective validation (step = ratio*input_txt_len)
'DEV_STEP_RATIO': 1.2,
# Number of examples (alignment/text) to show in tensorboard
'DEV_N_EXAMPLE': 4,
'TB_FLUSH_FREQ': 180 # Update frequency of tensorboard (secs)
}


@@ -0,0 +1,216 @@
import os
import sys
import abc
import math
import yaml
import torch
from torch.utils.tensorboard import SummaryWriter
from .option import default_hparas
from utils.util import human_format, Timer
class BaseSolver():
'''
Prototype Solver for all kinds of tasks
Arguments
config - yaml-styled config
paras - argparse outcome
mode - "train"/"test"
'''
def __init__(self, config, paras, mode="train"):
# General Settings
self.config = config # load from yaml file
self.paras = paras # command line args
self.mode = mode # 'train' or 'test'
for k, v in default_hparas.items():
setattr(self, k, v)
self.device = torch.device('cuda') if self.paras.gpu and torch.cuda.is_available() \
else torch.device('cpu')
# Name experiment
self.exp_name = paras.name
if self.exp_name is None:
if 'exp_name' in self.config:
self.exp_name = self.config.exp_name
else:
# By default, exp is named after config file
self.exp_name = paras.config.split('/')[-1].replace('.yaml', '')
if mode == 'train':
self.exp_name += '_seed{}'.format(paras.seed)
if mode == 'train':
# Filepath setup
os.makedirs(paras.ckpdir, exist_ok=True)
self.ckpdir = os.path.join(paras.ckpdir, self.exp_name)
os.makedirs(self.ckpdir, exist_ok=True)
# Logger settings
self.logdir = os.path.join(paras.logdir, self.exp_name)
self.log = SummaryWriter(
self.logdir, flush_secs=self.TB_FLUSH_FREQ)
self.timer = Timer()
# Hyper-parameters
self.step = 0
self.valid_step = config.hparas.valid_step
self.max_step = config.hparas.max_step
self.verbose('Exp. name : {}'.format(self.exp_name))
self.verbose('Loading data... a large corpus may take a while.')
# elif mode == 'test':
# # Output path
# os.makedirs(paras.outdir, exist_ok=True)
# self.ckpdir = os.path.join(paras.outdir, self.exp_name)
# Load training config to get acoustic feat and build model
# self.src_config = HpsYaml(config.src.config)
# self.paras.load = config.src.ckpt
# self.verbose('Evaluating result of tr. config @ {}'.format(
# config.src.config))
def backward(self, loss):
'''
Standard backward step with self.timer and debugger
Arguments
loss - the loss to perform loss.backward()
'''
self.timer.set()
loss.backward()
grad_norm = torch.nn.utils.clip_grad_norm_(
self.model.parameters(), self.GRAD_CLIP)
if math.isnan(grad_norm):
self.verbose('Error : grad norm is NaN @ step '+str(self.step))
else:
self.optimizer.step()
self.timer.cnt('bw')
return grad_norm
def load_ckpt(self):
''' Load ckpt if --load option is specified '''
print(self.paras)
if self.paras.load is not None:
if self.paras.warm_start:
self.verbose(f"Warm starting model from checkpoint {self.paras.load}.")
ckpt = torch.load(
self.paras.load, map_location=self.device if self.mode == 'train'
else 'cpu')
model_dict = ckpt['model']
if "ignore_layers" in self.config.model and len(self.config.model.ignore_layers) > 0:
model_dict = {k:v for k, v in model_dict.items()
if k not in self.config.model.ignore_layers}
dummy_dict = self.model.state_dict()
dummy_dict.update(model_dict)
model_dict = dummy_dict
self.model.load_state_dict(model_dict)
else:
# Load weights
ckpt = torch.load(
self.paras.load, map_location=self.device if self.mode == 'train'
else 'cpu')
self.model.load_state_dict(ckpt['model'])
# Load task-dependent items
if self.mode == 'train':
self.step = ckpt['global_step']
self.optimizer.load_opt_state_dict(ckpt['optimizer'])
self.verbose('Load ckpt from {}, restarting at step {}'.format(
self.paras.load, self.step))
else:
for k, v in ckpt.items():
if type(v) is float:
metric, score = k, v
self.model.eval()
self.verbose('Evaluation target = {} (recorded {} = {:.2f} %)'.format(
self.paras.load, metric, score))
def verbose(self, msg):
''' Verbose function for printing information to stdout '''
if self.paras.verbose:
if type(msg) == list:
for m in msg:
print('[INFO]', m.ljust(100))
else:
print('[INFO]', msg.ljust(100))
def progress(self, msg):
''' Verbose function for updating progress on stdout (do not include newline) '''
if self.paras.verbose:
sys.stdout.write("\033[K") # Clear line
print('[{}] {}'.format(human_format(self.step), msg), end='\r')
def write_log(self, log_name, log_dict):
'''
Write log to TensorBoard
log_name - <str> Name of tensorboard variable
log_dict - <dict>/<array> Value of the variable (e.g. dict of losses); skipped if None
'''
if type(log_dict) is dict:
log_dict = {key: val for key, val in log_dict.items() if (
val is not None and not math.isnan(val))}
if log_dict is None:
pass
elif len(log_dict) > 0:
if 'align' in log_name or 'spec' in log_name:
img, form = log_dict
self.log.add_image(
log_name, img, global_step=self.step, dataformats=form)
elif 'text' in log_name or 'hyp' in log_name:
self.log.add_text(log_name, log_dict, self.step)
else:
self.log.add_scalars(log_name, log_dict, self.step)
def save_checkpoint(self, f_name, metric, score, show_msg=True):
'''
Ckpt saver
f_name - <str> the name of the ckpt file to store (w/o prefix); overwritten if it exists
score  - <float> The value of the metric used to evaluate the model
'''
ckpt_path = os.path.join(self.ckpdir, f_name)
full_dict = {
"model": self.model.state_dict(),
"optimizer": self.optimizer.get_opt_state_dict(),
"global_step": self.step,
metric: score
}
torch.save(full_dict, ckpt_path)
if show_msg:
self.verbose("Saved checkpoint (step = {}, {} = {:.2f}) and status @ {}".
format(human_format(self.step), metric, score, ckpt_path))
# ----------------------------------- Abstract Methods ------------------------------------------ #
@abc.abstractmethod
def load_data(self):
'''
Called by main to load all data
After this call, data related attributes should be setup (e.g. self.tr_set, self.dev_set)
No return value
'''
raise NotImplementedError
@abc.abstractmethod
def set_model(self):
'''
Called by main to set models
After this call, model related attributes should be setup (e.g. self.l2_loss)
The followings MUST be setup
- self.model (torch.nn.Module)
- self.optimizer (src.Optimizer),
init. w/ self.optimizer = src.Optimizer(self.model.parameters(),**self.config['hparas'])
Loading pre-trained model should also be performed here
No return value
'''
raise NotImplementedError
@abc.abstractmethod
def exec(self):
'''
Called by main to execute training/inference
'''
raise NotImplementedError


@@ -0,0 +1,288 @@
import os, sys
# sys.path.append('/home/shaunxliu/projects/nnsp')
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
import torch
from torch.utils.data import DataLoader
import numpy as np
from .solver import BaseSolver
from utils.data_load import OneshotVcDataset, MultiSpkVcCollate
# from src.rnn_ppg2mel import BiRnnPpg2MelModel
# from src.mel_decoder_mol_encAddlf0 import MelDecoderMOL
from .loss import MaskedMSELoss
from .optim import Optimizer
from utils.util import human_format
from models.ppg2mel import MelDecoderMOLv2
class Solver(BaseSolver):
"""Customized Solver."""
def __init__(self, config, paras, mode):
super().__init__(config, paras, mode)
self.num_att_plots = 5
self.att_ws_dir = f"{self.logdir}/att_ws"
os.makedirs(self.att_ws_dir, exist_ok=True)
self.best_loss = np.inf
def fetch_data(self, data):
"""Move data to device"""
data = [i.to(self.device) for i in data]
return data
def load_data(self):
""" Load data for training/validation/plotting."""
train_dataset = OneshotVcDataset(
meta_file=self.config.data.train_fid_list,
vctk_ppg_dir=self.config.data.vctk_ppg_dir,
libri_ppg_dir=self.config.data.libri_ppg_dir,
vctk_f0_dir=self.config.data.vctk_f0_dir,
libri_f0_dir=self.config.data.libri_f0_dir,
vctk_wav_dir=self.config.data.vctk_wav_dir,
libri_wav_dir=self.config.data.libri_wav_dir,
vctk_spk_dvec_dir=self.config.data.vctk_spk_dvec_dir,
libri_spk_dvec_dir=self.config.data.libri_spk_dvec_dir,
ppg_file_ext=self.config.data.ppg_file_ext,
min_max_norm_mel=self.config.data.min_max_norm_mel,
mel_min=self.config.data.mel_min,
mel_max=self.config.data.mel_max,
)
dev_dataset = OneshotVcDataset(
meta_file=self.config.data.dev_fid_list,
vctk_ppg_dir=self.config.data.vctk_ppg_dir,
libri_ppg_dir=self.config.data.libri_ppg_dir,
vctk_f0_dir=self.config.data.vctk_f0_dir,
libri_f0_dir=self.config.data.libri_f0_dir,
vctk_wav_dir=self.config.data.vctk_wav_dir,
libri_wav_dir=self.config.data.libri_wav_dir,
vctk_spk_dvec_dir=self.config.data.vctk_spk_dvec_dir,
libri_spk_dvec_dir=self.config.data.libri_spk_dvec_dir,
ppg_file_ext=self.config.data.ppg_file_ext,
min_max_norm_mel=self.config.data.min_max_norm_mel,
mel_min=self.config.data.mel_min,
mel_max=self.config.data.mel_max,
)
self.train_dataloader = DataLoader(
train_dataset,
num_workers=self.paras.njobs,
shuffle=True,
batch_size=self.config.hparas.batch_size,
pin_memory=False,
drop_last=True,
collate_fn=MultiSpkVcCollate(self.config.model.frames_per_step,
use_spk_dvec=True),
)
self.dev_dataloader = DataLoader(
dev_dataset,
num_workers=self.paras.njobs,
shuffle=False,
batch_size=self.config.hparas.batch_size,
pin_memory=False,
drop_last=False,
collate_fn=MultiSpkVcCollate(self.config.model.frames_per_step,
use_spk_dvec=True),
)
self.plot_dataloader = DataLoader(
dev_dataset,
num_workers=self.paras.njobs,
shuffle=False,
batch_size=1,
pin_memory=False,
drop_last=False,
collate_fn=MultiSpkVcCollate(self.config.model.frames_per_step,
use_spk_dvec=True,
give_uttids=True),
)
msg = "Have prepared training set and dev set."
self.verbose(msg)
def load_pretrained_params(self):
print("Load pretrained model from: ", self.config.data.pretrain_model_file)
ignore_layer_prefixes = ["speaker_embedding_table"]
pretrain_model_file = self.config.data.pretrain_model_file
pretrain_ckpt = torch.load(
pretrain_model_file, map_location=self.device
)["model"]
model_dict = self.model.state_dict()
print(self.model)
# 1. filter out unnecessary keys
for prefix in ignore_layer_prefixes:
pretrain_ckpt = {k : v
for k, v in pretrain_ckpt.items() if not k.startswith(prefix)
}
# 2. overwrite entries in the existing state dict
model_dict.update(pretrain_ckpt)
# 3. load the new state dict
self.model.load_state_dict(model_dict)
def set_model(self):
"""Setup model and optimizer"""
# Model
print("[INFO] Model name: ", self.config["model_name"])
self.model = MelDecoderMOLv2(
**self.config["model"]
).to(self.device)
# self.load_pretrained_params()
# model_params = [{'params': self.model.spk_embedding.weight}]
model_params = [{'params': self.model.parameters()}]
# Loss criterion
self.loss_criterion = MaskedMSELoss(self.config.model.frames_per_step)
# Optimizer
self.optimizer = Optimizer(model_params, **self.config["hparas"])
self.verbose(self.optimizer.create_msg())
# Automatically load pre-trained model if self.paras.load is given
self.load_ckpt()
def exec(self):
self.verbose("Total training steps {}.".format(
human_format(self.max_step)))
mel_loss = None
n_epochs = 0
# Set as current time
self.timer.set()
while self.step < self.max_step:
for data in self.train_dataloader:
# Pre-step: update lr_rate and do zero_grad
lr_rate = self.optimizer.pre_step(self.step)
total_loss = 0
# data to device
ppgs, lf0_uvs, mels, in_lengths, \
out_lengths, spk_ids, stop_tokens = self.fetch_data(data)
self.timer.cnt("rd")
mel_outputs, mel_outputs_postnet, predicted_stop = self.model(
ppgs,
in_lengths,
mels,
out_lengths,
lf0_uvs,
spk_ids
)
mel_loss, stop_loss = self.loss_criterion(
mel_outputs,
mel_outputs_postnet,
mels,
out_lengths,
stop_tokens,
predicted_stop
)
loss = mel_loss + stop_loss
self.timer.cnt("fw")
# Back-prop
grad_norm = self.backward(loss)
self.step += 1
# Logger
if (self.step == 1) or (self.step % self.PROGRESS_STEP == 0):
self.progress("Tr|loss:{:.4f},mel-loss:{:.4f},stop-loss:{:.4f}|Grad.Norm-{:.2f}|{}"
.format(loss.cpu().item(), mel_loss.cpu().item(),
stop_loss.cpu().item(), grad_norm, self.timer.show()))
self.write_log('loss', {'tr/loss': loss,
'tr/mel-loss': mel_loss,
'tr/stop-loss': stop_loss})
# Validation
if (self.step == 1) or (self.step % self.valid_step == 0):
self.validate()
# End of step
# https://github.com/pytorch/pytorch/issues/13246#issuecomment-529185354
torch.cuda.empty_cache()
self.timer.set()
if self.step > self.max_step:
break
n_epochs += 1
self.log.close()
def validate(self):
self.model.eval()
dev_loss, dev_mel_loss, dev_stop_loss = 0.0, 0.0, 0.0
for i, data in enumerate(self.dev_dataloader):
self.progress('Valid step - {}/{}'.format(i+1, len(self.dev_dataloader)))
# Fetch data
ppgs, lf0_uvs, mels, in_lengths, \
out_lengths, spk_ids, stop_tokens = self.fetch_data(data)
with torch.no_grad():
mel_outputs, mel_outputs_postnet, predicted_stop = self.model(
ppgs,
in_lengths,
mels,
out_lengths,
lf0_uvs,
spk_ids
)
mel_loss, stop_loss = self.loss_criterion(
mel_outputs,
mel_outputs_postnet,
mels,
out_lengths,
stop_tokens,
predicted_stop
)
loss = mel_loss + stop_loss
dev_loss += loss.cpu().item()
dev_mel_loss += mel_loss.cpu().item()
dev_stop_loss += stop_loss.cpu().item()
dev_loss = dev_loss / (i + 1)
dev_mel_loss = dev_mel_loss / (i + 1)
dev_stop_loss = dev_stop_loss / (i + 1)
self.save_checkpoint(f'step_{self.step}.pth', 'loss', dev_loss, show_msg=False)
if dev_loss < self.best_loss:
self.best_loss = dev_loss
self.save_checkpoint(f'best_loss_step_{self.step}.pth', 'loss', dev_loss)
self.write_log('loss', {'dv/loss': dev_loss,
'dv/mel-loss': dev_mel_loss,
'dv/stop-loss': dev_stop_loss})
# plot attention
for i, data in enumerate(self.plot_dataloader):
if i == self.num_att_plots:
break
# Fetch data
ppgs, lf0_uvs, mels, in_lengths, \
out_lengths, spk_ids, stop_tokens = self.fetch_data(data[:-1])
fid = data[-1][0]
with torch.no_grad():
_, _, _, att_ws = self.model(
ppgs,
in_lengths,
mels,
out_lengths,
lf0_uvs,
spk_ids,
output_att_ws=True
)
att_ws = att_ws.squeeze(0).cpu().numpy()
att_ws = att_ws[None]
w, h = plt.figaspect(1.0 / len(att_ws))
fig = plt.Figure(figsize=(w * 1.3, h * 1.3))
axes = fig.subplots(1, len(att_ws))
if len(att_ws) == 1:
axes = [axes]
for ax, aw in zip(axes, att_ws):
ax.imshow(aw.astype(np.float32), aspect="auto")
ax.set_title(f"{fid}")
ax.set_xlabel("Input")
ax.set_ylabel("Output")
ax.xaxis.set_major_locator(MaxNLocator(integer=True))
ax.yaxis.set_major_locator(MaxNLocator(integer=True))
fig_name = f"{self.att_ws_dir}/{fid}_step{self.step}.png"
fig.savefig(fig_name)
# Resume training
self.model.train()


@@ -0,0 +1,23 @@
from abc import ABC
from abc import abstractmethod
import torch
class AbsMelDecoder(torch.nn.Module, ABC):
"""The abstract PPG-based voice conversion class
This "model" is one of mediator objects for "Task" class.
"""
@abstractmethod
def forward(
self,
bottle_neck_features: torch.Tensor,
feature_lengths: torch.Tensor,
speech: torch.Tensor,
speech_lengths: torch.Tensor,
logf0_uv: torch.Tensor = None,
spembs: torch.Tensor = None,
styleembs: torch.Tensor = None,
) -> torch.Tensor:
raise NotImplementedError


@@ -0,0 +1,79 @@
import torch
from torch import nn
from torch.nn import functional as F
from torch.autograd import Function
def tile(x, count, dim=0):
"""
Tiles x on dimension dim count times.
"""
perm = list(range(len(x.size())))
if dim != 0:
perm[0], perm[dim] = perm[dim], perm[0]
x = x.permute(perm).contiguous()
out_size = list(x.size())
out_size[0] *= count
batch = x.size(0)
x = x.view(batch, -1) \
.transpose(0, 1) \
.repeat(count, 1) \
.transpose(0, 1) \
.contiguous() \
.view(*out_size)
if dim != 0:
x = x.permute(perm).contiguous()
return x
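# (added example) tile(torch.tensor([[1, 2], [3, 4]]), 2)
# -> [[1, 2], [1, 2], [3, 4], [3, 4]]   (each row repeated `count` times in place)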
class Linear(torch.nn.Module):
def __init__(self, in_dim, out_dim, bias=True, w_init_gain='linear'):
super(Linear, self).__init__()
self.linear_layer = torch.nn.Linear(in_dim, out_dim, bias=bias)
torch.nn.init.xavier_uniform_(
self.linear_layer.weight,
gain=torch.nn.init.calculate_gain(w_init_gain))
def forward(self, x):
return self.linear_layer(x)
class Conv1d(torch.nn.Module):
def __init__(self, in_channels, out_channels, kernel_size=1, stride=1,
padding=None, dilation=1, bias=True, w_init_gain='linear', param=None):
super(Conv1d, self).__init__()
if padding is None:
assert(kernel_size % 2 == 1)
padding = int(dilation * (kernel_size - 1)/2)
self.conv = torch.nn.Conv1d(in_channels, out_channels,
kernel_size=kernel_size, stride=stride,
padding=padding, dilation=dilation,
bias=bias)
torch.nn.init.xavier_uniform_(
self.conv.weight, gain=torch.nn.init.calculate_gain(w_init_gain, param=param))
def forward(self, x):
# x: BxDxT
return self.conv(x)


@@ -0,0 +1,52 @@
import torch
import torch.nn as nn
import torch.nn.functional as F
from .basic_layers import Linear, Conv1d
class Postnet(nn.Module):
"""Postnet
- Five 1-d convolutions with 512 channels and kernel size 5
"""
def __init__(self, num_mels=80,
num_layers=5,
hidden_dim=512,
kernel_size=5):
super(Postnet, self).__init__()
self.convolutions = nn.ModuleList()
self.convolutions.append(
nn.Sequential(
Conv1d(
num_mels, hidden_dim,
kernel_size=kernel_size, stride=1,
padding=int((kernel_size - 1) / 2),
dilation=1, w_init_gain='tanh'),
nn.BatchNorm1d(hidden_dim)))
for i in range(1, num_layers - 1):
self.convolutions.append(
nn.Sequential(
Conv1d(
hidden_dim,
hidden_dim,
kernel_size=kernel_size, stride=1,
padding=int((kernel_size - 1) / 2),
dilation=1, w_init_gain='tanh'),
nn.BatchNorm1d(hidden_dim)))
self.convolutions.append(
nn.Sequential(
Conv1d(
hidden_dim, num_mels,
kernel_size=kernel_size, stride=1,
padding=int((kernel_size - 1) / 2),
dilation=1, w_init_gain='linear'),
nn.BatchNorm1d(num_mels)))
def forward(self, x):
# x: (B, num_mels, T_dec)
for i in range(len(self.convolutions) - 1):
x = F.dropout(torch.tanh(self.convolutions[i](x)), 0.5, self.training)
x = F.dropout(self.convolutions[-1](x), 0.5, self.training)
return x
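# (added note) The postnet output is a residual correction; the caller adds it back to the decoder mel output (see MelDecoderMOLv2.forward / inference).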


@@ -0,0 +1,123 @@
import torch
import torch.nn as nn
import torch.nn.functional as F
class MOLAttention(nn.Module):
""" Discretized Mixture of Logistic (MOL) attention.
C.f. Section 5 of "MelNet: A Generative Model for Audio in the Frequency Domain" and
GMMv2b model in "Location-relative attention mechanisms for robust long-form speech synthesis".
"""
def __init__(
self,
query_dim,
r=1,
M=5,
):
"""
Args:
query_dim: attention_rnn_dim.
M: number of mixtures.
"""
super().__init__()
if r < 1:
self.r = float(r)
else:
self.r = int(r)
self.M = M
self.score_mask_value = 0.0 # -float("inf")
self.eps = 1e-5
# Position array for encoder time steps
self.J = None
# Query layer: [w, sigma,]
self.query_layer = torch.nn.Sequential(
nn.Linear(query_dim, 256, bias=True),
nn.ReLU(),
nn.Linear(256, 3*M, bias=True)
)
self.mu_prev = None
self.initialize_bias()
def initialize_bias(self):
"""Initialize sigma and Delta."""
# sigma
torch.nn.init.constant_(self.query_layer[2].bias[self.M:2*self.M], 1.0)
# Delta: softplus(1.8545) = 2.0; softplus(3.9815) = 4.0; softplus(0.5413) = 1.0
# softplus(-0.432) = 0.5003
if self.r == 2:
torch.nn.init.constant_(self.query_layer[2].bias[2*self.M:3*self.M], 1.8545)
elif self.r == 4:
torch.nn.init.constant_(self.query_layer[2].bias[2*self.M:3*self.M], 3.9815)
elif self.r == 1:
torch.nn.init.constant_(self.query_layer[2].bias[2*self.M:3*self.M], 0.5413)
else:
torch.nn.init.constant_(self.query_layer[2].bias[2*self.M:3*self.M], -0.432)
def init_states(self, memory):
"""Initialize mu_prev and J.
This function should be called by the decoder before decoding one batch.
Args:
memory: (B, T, D_enc) encoder output.
"""
B, T_enc, _ = memory.size()
device = memory.device
self.J = torch.arange(0, T_enc + 2.0).to(device) + 0.5 # NOTE: for discretize usage
# self.J = memory.new_tensor(np.arange(T_enc), dtype=torch.float)
self.mu_prev = torch.zeros(B, self.M).to(device)
def forward(self, att_rnn_h, memory, memory_pitch=None, mask=None):
"""
att_rnn_h: attention rnn hidden state.
memory: encoder outputs (B, T_enc, D).
mask: binary mask for padded data (B, T_enc).
"""
# [B, 3M]
mixture_params = self.query_layer(att_rnn_h)
# [B, M]
w_hat = mixture_params[:, :self.M]
sigma_hat = mixture_params[:, self.M:2*self.M]
Delta_hat = mixture_params[:, 2*self.M:3*self.M]
# print("w_hat: ", w_hat)
# print("sigma_hat: ", sigma_hat)
# print("Delta_hat: ", Delta_hat)
# Dropout to de-correlate attention heads
w_hat = F.dropout(w_hat, p=0.5, training=self.training) # NOTE(sx): needed?
# Mixture parameters
w = torch.softmax(w_hat, dim=-1) + self.eps
sigma = F.softplus(sigma_hat) + self.eps
Delta = F.softplus(Delta_hat)
mu_cur = self.mu_prev + Delta
# print("w:", w)
j = self.J[:memory.size(1) + 1]
# Attention weights
# CDF of logistic distribution
phi_t = w.unsqueeze(-1) * (1 / (1 + torch.sigmoid(
(mu_cur.unsqueeze(-1) - j) / sigma.unsqueeze(-1))))
# print("phi_t:", phi_t)
# Discretize attention weights
# (B, T_enc + 1)
alpha_t = torch.sum(phi_t, dim=1)
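# (added note) Differencing the cumulative values at consecutive half-integer positions in self.J yields, per encoder step, the probability mass the mixture assigns to that step (GMMv2b-style location-relative attention).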
alpha_t = alpha_t[:, 1:] - alpha_t[:, :-1]
alpha_t[alpha_t == 0] = self.eps
# print("alpha_t: ", alpha_t.size())
# Apply masking
if mask is not None:
alpha_t.data.masked_fill_(mask, self.score_mask_value)
context = torch.bmm(alpha_t.unsqueeze(1), memory).squeeze(1)
if memory_pitch is not None:
context_pitch = torch.bmm(alpha_t.unsqueeze(1), memory_pitch).squeeze(1)
self.mu_prev = mu_cur
if memory_pitch is not None:
return context, context_pitch, alpha_t
return context, alpha_t


@@ -0,0 +1,451 @@
# -*- coding: utf-8 -*-
"""Network related utility tools."""
import logging
from typing import Dict
import numpy as np
import torch
def to_device(m, x):
"""Send tensor into the device of the module.
Args:
m (torch.nn.Module): Torch module.
x (Tensor): Torch tensor.
Returns:
Tensor: Torch tensor located in the same place as torch module.
"""
assert isinstance(m, torch.nn.Module)
device = next(m.parameters()).device
return x.to(device)
def pad_list(xs, pad_value):
"""Perform padding for the list of tensors.
Args:
xs (List): List of Tensors [(T_1, `*`), (T_2, `*`), ..., (T_B, `*`)].
pad_value (float): Value for padding.
Returns:
Tensor: Padded tensor (B, Tmax, `*`).
Examples:
>>> x = [torch.ones(4), torch.ones(2), torch.ones(1)]
>>> x
[tensor([1., 1., 1., 1.]), tensor([1., 1.]), tensor([1.])]
>>> pad_list(x, 0)
tensor([[1., 1., 1., 1.],
[1., 1., 0., 0.],
[1., 0., 0., 0.]])
"""
n_batch = len(xs)
max_len = max(x.size(0) for x in xs)
pad = xs[0].new(n_batch, max_len, *xs[0].size()[1:]).fill_(pad_value)
for i in range(n_batch):
pad[i, :xs[i].size(0)] = xs[i]
return pad
def make_pad_mask(lengths, xs=None, length_dim=-1):
"""Make mask tensor containing indices of padded part.
Args:
lengths (LongTensor or List): Batch of lengths (B,).
xs (Tensor, optional): The reference tensor. If set, masks will be the same shape as this tensor.
length_dim (int, optional): Dimension indicator of the above tensor. See the example.
Returns:
Tensor: Mask tensor containing indices of padded part.
dtype=torch.uint8 in PyTorch < 1.2
dtype=torch.bool in PyTorch >= 1.2
Examples:
With only lengths.
>>> lengths = [5, 3, 2]
>>> make_pad_mask(lengths)
masks = [[0, 0, 0, 0, 0],
[0, 0, 0, 1, 1],
[0, 0, 1, 1, 1]]
With the reference tensor.
>>> xs = torch.zeros((3, 2, 4))
>>> make_pad_mask(lengths, xs)
tensor([[[0, 0, 0, 0],
[0, 0, 0, 0]],
[[0, 0, 0, 1],
[0, 0, 0, 1]],
[[0, 0, 1, 1],
[0, 0, 1, 1]]], dtype=torch.uint8)
>>> xs = torch.zeros((3, 2, 6))
>>> make_pad_mask(lengths, xs)
tensor([[[0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 1]],
[[0, 0, 0, 1, 1, 1],
[0, 0, 0, 1, 1, 1]],
[[0, 0, 1, 1, 1, 1],
[0, 0, 1, 1, 1, 1]]], dtype=torch.uint8)
With the reference tensor and dimension indicator.
>>> xs = torch.zeros((3, 6, 6))
>>> make_pad_mask(lengths, xs, 1)
tensor([[[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1]],
[[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1]],
[[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1]]], dtype=torch.uint8)
>>> make_pad_mask(lengths, xs, 2)
tensor([[[0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 1]],
[[0, 0, 0, 1, 1, 1],
[0, 0, 0, 1, 1, 1],
[0, 0, 0, 1, 1, 1],
[0, 0, 0, 1, 1, 1],
[0, 0, 0, 1, 1, 1],
[0, 0, 0, 1, 1, 1]],
[[0, 0, 1, 1, 1, 1],
[0, 0, 1, 1, 1, 1],
[0, 0, 1, 1, 1, 1],
[0, 0, 1, 1, 1, 1],
[0, 0, 1, 1, 1, 1],
[0, 0, 1, 1, 1, 1]]], dtype=torch.uint8)
"""
if length_dim == 0:
raise ValueError('length_dim cannot be 0: {}'.format(length_dim))
if not isinstance(lengths, list):
lengths = lengths.tolist()
bs = int(len(lengths))
if xs is None:
maxlen = int(max(lengths))
else:
maxlen = xs.size(length_dim)
seq_range = torch.arange(0, maxlen, dtype=torch.int64)
seq_range_expand = seq_range.unsqueeze(0).expand(bs, maxlen)
seq_length_expand = seq_range_expand.new(lengths).unsqueeze(-1)
mask = seq_range_expand >= seq_length_expand
if xs is not None:
assert xs.size(0) == bs, (xs.size(0), bs)
if length_dim < 0:
length_dim = xs.dim() + length_dim
# ind = (:, None, ..., None, :, , None, ..., None)
ind = tuple(slice(None) if i in (0, length_dim) else None
for i in range(xs.dim()))
mask = mask[ind].expand_as(xs).to(xs.device)
return mask
def make_non_pad_mask(lengths, xs=None, length_dim=-1):
"""Make mask tensor containing indices of non-padded part.
Args:
lengths (LongTensor or List): Batch of lengths (B,).
xs (Tensor, optional): The reference tensor. If set, masks will be the same shape as this tensor.
length_dim (int, optional): Dimension indicator of the above tensor. See the example.
Returns:
ByteTensor: Mask tensor containing indices of non-padded part.
dtype=torch.uint8 in PyTorch < 1.2
dtype=torch.bool in PyTorch >= 1.2
Examples:
With only lengths.
>>> lengths = [5, 3, 2]
>>> make_non_pad_mask(lengths)
masks = [[1, 1, 1, 1, 1],
[1, 1, 1, 0, 0],
[1, 1, 0, 0, 0]]
With the reference tensor.
>>> xs = torch.zeros((3, 2, 4))
>>> make_non_pad_mask(lengths, xs)
tensor([[[1, 1, 1, 1],
[1, 1, 1, 1]],
[[1, 1, 1, 0],
[1, 1, 1, 0]],
[[1, 1, 0, 0],
[1, 1, 0, 0]]], dtype=torch.uint8)
>>> xs = torch.zeros((3, 2, 6))
>>> make_non_pad_mask(lengths, xs)
tensor([[[1, 1, 1, 1, 1, 0],
[1, 1, 1, 1, 1, 0]],
[[1, 1, 1, 0, 0, 0],
[1, 1, 1, 0, 0, 0]],
[[1, 1, 0, 0, 0, 0],
[1, 1, 0, 0, 0, 0]]], dtype=torch.uint8)
With the reference tensor and dimension indicator.
>>> xs = torch.zeros((3, 6, 6))
>>> make_non_pad_mask(lengths, xs, 1)
tensor([[[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 0]],
[[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0]],
[[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0]]], dtype=torch.uint8)
>>> make_non_pad_mask(lengths, xs, 2)
tensor([[[1, 1, 1, 1, 1, 0],
[1, 1, 1, 1, 1, 0],
[1, 1, 1, 1, 1, 0],
[1, 1, 1, 1, 1, 0],
[1, 1, 1, 1, 1, 0],
[1, 1, 1, 1, 1, 0]],
[[1, 1, 1, 0, 0, 0],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 0, 0, 0]],
[[1, 1, 0, 0, 0, 0],
[1, 1, 0, 0, 0, 0],
[1, 1, 0, 0, 0, 0],
[1, 1, 0, 0, 0, 0],
[1, 1, 0, 0, 0, 0],
[1, 1, 0, 0, 0, 0]]], dtype=torch.uint8)
"""
return ~make_pad_mask(lengths, xs, length_dim)
def mask_by_length(xs, lengths, fill=0):
"""Mask tensor according to length.
Args:
xs (Tensor): Batch of input tensor (B, `*`).
lengths (LongTensor or List): Batch of lengths (B,).
fill (int or float): Value to fill masked part.
Returns:
Tensor: Batch of masked input tensor (B, `*`).
Examples:
>>> x = torch.arange(5).repeat(3, 1) + 1
>>> x
tensor([[1, 2, 3, 4, 5],
[1, 2, 3, 4, 5],
[1, 2, 3, 4, 5]])
>>> lengths = [5, 3, 2]
>>> mask_by_length(x, lengths)
tensor([[1, 2, 3, 4, 5],
[1, 2, 3, 0, 0],
[1, 2, 0, 0, 0]])
"""
assert xs.size(0) == len(lengths)
ret = xs.data.new(*xs.size()).fill_(fill)
for i, l in enumerate(lengths):
ret[i, :l] = xs[i, :l]
return ret
def th_accuracy(pad_outputs, pad_targets, ignore_label):
"""Calculate accuracy.
Args:
pad_outputs (Tensor): Prediction tensors (B * Lmax, D).
pad_targets (LongTensor): Target label tensors (B, Lmax, D).
ignore_label (int): Ignore label id.
Returns:
float: Accuracy value (0.0 - 1.0).
"""
pad_pred = pad_outputs.view(
pad_targets.size(0),
pad_targets.size(1),
pad_outputs.size(1)).argmax(2)
mask = pad_targets != ignore_label
numerator = torch.sum(pad_pred.masked_select(mask) == pad_targets.masked_select(mask))
denominator = torch.sum(mask)
return float(numerator) / float(denominator)
def to_torch_tensor(x):
"""Change to torch.Tensor or ComplexTensor from numpy.ndarray.
Args:
x: Inputs. It should be one of numpy.ndarray, Tensor, ComplexTensor, and dict.
Returns:
Tensor or ComplexTensor: Type converted inputs.
Examples:
>>> xs = np.ones(3, dtype=np.float32)
>>> xs = to_torch_tensor(xs)
tensor([1., 1., 1.])
>>> xs = torch.ones(3, 4, 5)
>>> assert to_torch_tensor(xs) is xs
>>> xs = {'real': xs, 'imag': xs}
>>> to_torch_tensor(xs)
ComplexTensor(
Real:
tensor([1., 1., 1.])
Imag;
tensor([1., 1., 1.])
)
"""
# If numpy, change to torch tensor
if isinstance(x, np.ndarray):
if x.dtype.kind == 'c':
# Dynamically importing because torch_complex requires python3
from torch_complex.tensor import ComplexTensor
return ComplexTensor(x)
else:
return torch.from_numpy(x)
# If {'real': ..., 'imag': ...}, convert to ComplexTensor
elif isinstance(x, dict):
# Dynamically importing because torch_complex requires python3
from torch_complex.tensor import ComplexTensor
if 'real' not in x or 'imag' not in x:
raise ValueError("has 'real' and 'imag' keys: {}".format(list(x)))
# Relative importing because of using python3 syntax
return ComplexTensor(x['real'], x['imag'])
# If torch.Tensor, as it is
elif isinstance(x, torch.Tensor):
return x
else:
error = ("x must be numpy.ndarray, torch.Tensor or a dict like "
"{{'real': torch.Tensor, 'imag': torch.Tensor}}, "
"but got {}".format(type(x)))
try:
from torch_complex.tensor import ComplexTensor
except Exception:
# If PY2
raise ValueError(error)
else:
# If PY3
if isinstance(x, ComplexTensor):
return x
else:
raise ValueError(error)
def get_subsample(train_args, mode, arch):
"""Parse the subsampling factors from the training args for the specified `mode` and `arch`.
Args:
train_args: argument Namespace containing options.
mode: one of ('asr', 'mt', 'st')
arch: one of ('rnn', 'rnn-t', 'rnn_mix', 'rnn_mulenc', 'transformer')
Returns:
np.ndarray / List[np.ndarray]: subsampling factors.
"""
if arch == 'transformer':
return np.array([1])
elif mode == 'mt' and arch == 'rnn':
# +1 means the input layer (+1) plus the encoder layer outputs (train_args.elayers)
subsample = np.ones(train_args.elayers + 1, dtype=int)  # np.int was removed in recent NumPy
logging.warning('Subsampling is not performed for machine translation.')
logging.info('subsample: ' + ' '.join([str(x) for x in subsample]))
return subsample
elif (mode == 'asr' and arch in ('rnn', 'rnn-t')) or \
(mode == 'mt' and arch == 'rnn') or \
(mode == 'st' and arch == 'rnn'):
subsample = np.ones(train_args.elayers + 1, dtype=int)
if train_args.etype.endswith("p") and not train_args.etype.startswith("vgg"):
ss = train_args.subsample.split("_")
for j in range(min(train_args.elayers + 1, len(ss))):
subsample[j] = int(ss[j])
else:
logging.warning(
'Subsampling is not performed for vgg*. It is performed in max pooling layers at CNN.')
logging.info('subsample: ' + ' '.join([str(x) for x in subsample]))
return subsample
elif mode == 'asr' and arch == 'rnn_mix':
subsample = np.ones(train_args.elayers_sd + train_args.elayers + 1, dtype=int)
if train_args.etype.endswith("p") and not train_args.etype.startswith("vgg"):
ss = train_args.subsample.split("_")
for j in range(min(train_args.elayers_sd + train_args.elayers + 1, len(ss))):
subsample[j] = int(ss[j])
else:
logging.warning(
'Subsampling is not performed for vgg*. It is performed in max pooling layers at CNN.')
logging.info('subsample: ' + ' '.join([str(x) for x in subsample]))
return subsample
elif mode == 'asr' and arch == 'rnn_mulenc':
subsample_list = []
for idx in range(train_args.num_encs):
subsample = np.ones(train_args.elayers[idx] + 1, dtype=int)
if train_args.etype[idx].endswith("p") and not train_args.etype[idx].startswith("vgg"):
ss = train_args.subsample[idx].split("_")
for j in range(min(train_args.elayers[idx] + 1, len(ss))):
subsample[j] = int(ss[j])
else:
logging.warning(
'Encoder %d: Subsampling is not performed for vgg*. '
'It is performed in max pooling layers at CNN.', idx + 1)
logging.info('subsample: ' + ' '.join([str(x) for x in subsample]))
subsample_list.append(subsample)
return subsample_list
else:
raise ValueError('Invalid options: mode={}, arch={}'.format(mode, arch))
def rename_state_dict(old_prefix: str, new_prefix: str, state_dict: Dict[str, torch.Tensor]):
"""Replace keys of old prefix with new prefix in state dict."""
# need this list not to break the dict iterator
old_keys = [k for k in state_dict if k.startswith(old_prefix)]
if len(old_keys) > 0:
logging.warning(f'Rename: {old_prefix} -> {new_prefix}')
for k in old_keys:
v = state_dict.pop(k)
new_k = k.replace(old_prefix, new_prefix)
state_dict[new_k] = v


@@ -0,0 +1,22 @@
import torch
def gcd(a, b):
"""Greatest common divisor."""
a, b = (a, b) if a >= b else (b, a)
if a % b == 0:
return b
else:
return gcd(b, a % b)
def lcm(a, b):
"""Least common multiple"""
return a * b // gcd(a, b)
def get_mask_from_lengths(lengths, max_len=None):
if max_len is None:
max_len = torch.max(lengths).item()
ids = torch.arange(0, max_len, device=lengths.device)  # avoid hard-coding CUDA; follow the device of `lengths`
mask = (ids < lengths.unsqueeze(1)).bool()
return mask


@@ -0,0 +1,102 @@
import argparse
import torch
from pathlib import Path
import yaml
from .frontend import DefaultFrontend
from .utterance_mvn import UtteranceMVN
from .encoder.conformer_encoder import ConformerEncoder
_model = None # type: PPGModel
_device = None
class PPGModel(torch.nn.Module):
def __init__(
self,
frontend,
normalizer,
encoder,
):
super().__init__()
self.frontend = frontend
self.normalize = normalizer
self.encoder = encoder
def forward(self, speech, speech_lengths):
"""
Args:
speech (tensor): (B, L)
speech_lengths (tensor): (B, )
Returns:
bottle_neck_feats (tensor): (B, L//hop_size, 144)
"""
feats, feats_lengths = self._extract_feats(speech, speech_lengths)
feats, feats_lengths = self.normalize(feats, feats_lengths)
encoder_out, encoder_out_lens, _ = self.encoder(feats, feats_lengths)
return encoder_out
def _extract_feats(
self, speech: torch.Tensor, speech_lengths: torch.Tensor
):
assert speech_lengths.dim() == 1, speech_lengths.shape
# for data-parallel
speech = speech[:, : speech_lengths.max()]
if self.frontend is not None:
# Frontend
# e.g. STFT and Feature extract
# data_loader may send time-domain signal in this case
# speech (Batch, NSamples) -> feats: (Batch, NFrames, Dim)
feats, feats_lengths = self.frontend(speech, speech_lengths)
else:
# No frontend and no feature extract
feats, feats_lengths = speech, speech_lengths
return feats, feats_lengths
def extract_from_wav(self, src_wav):
src_wav_tensor = torch.from_numpy(src_wav).unsqueeze(0).float().to(_device)
src_wav_lengths = torch.LongTensor([len(src_wav)]).to(_device)
return self(src_wav_tensor, src_wav_lengths)
def build_model(args):
normalizer = UtteranceMVN(**args.normalize_conf)
frontend = DefaultFrontend(**args.frontend_conf)
encoder = ConformerEncoder(input_size=80, **args.encoder_conf)
model = PPGModel(frontend, normalizer, encoder)
return model
def load_model(model_file, device=None):
global _model, _device
if device is None:
_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
else:
_device = device
# search a config file
model_config_fpaths = list(model_file.parent.rglob("*.yaml"))
if len(model_config_fpaths) == 0:
raise FileNotFoundError("No yaml config found next to the PPG extractor checkpoint")
config_file = model_config_fpaths[0]
with config_file.open("r", encoding="utf-8") as f:
args = yaml.safe_load(f)
args = argparse.Namespace(**args)
model = build_model(args)
model_state_dict = model.state_dict()
ckpt_state_dict = torch.load(model_file, map_location=_device)
ckpt_state_dict = {k:v for k,v in ckpt_state_dict.items() if 'encoder' in k}
model_state_dict.update(ckpt_state_dict)
model.load_state_dict(model_state_dict)
_model = model.eval().to(_device)
return _model
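# --- Illustrative usage sketch (added for documentation; not in the original file). ---
# Assumes a 16 kHz mono waveform and a checkpoint with its *.yaml config alongside it.
if __name__ == "__main__":
    import numpy as np
    _extractor = load_model(Path("ppg_extractor/saved_models/ppg_model.pt"))  # path is an assumption
    _wav = np.random.randn(16000).astype(np.float32)   # 1 s of dummy audio at 16 kHz
    _ppg = _extractor.extract_from_wav(_wav)           # (1, n_frames, 144)
    print(_ppg.shape)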


@@ -0,0 +1,398 @@
#!/usr/bin/env python3
# Copyright 2017 Johns Hopkins University (Shinji Watanabe)
# Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0)
"""Common functions for ASR."""
import argparse
import editdistance
import json
import logging
import numpy as np
import six
import sys
from itertools import groupby
def end_detect(ended_hyps, i, M=3, D_end=np.log(1 * np.exp(-10))):
"""End detection.
described in Eq. (50) of S. Watanabe et al.
"Hybrid CTC/Attention Architecture for End-to-End Speech Recognition"
:param ended_hyps:
:param i:
:param M:
:param D_end:
:return:
"""
if len(ended_hyps) == 0:
return False
count = 0
best_hyp = sorted(ended_hyps, key=lambda x: x['score'], reverse=True)[0]
for m in six.moves.range(M):
# get ended_hyps whose length is i - m
hyp_length = i - m
hyps_same_length = [x for x in ended_hyps if len(x['yseq']) == hyp_length]
if len(hyps_same_length) > 0:
best_hyp_same_length = sorted(hyps_same_length, key=lambda x: x['score'], reverse=True)[0]
if best_hyp_same_length['score'] - best_hyp['score'] < D_end:
count += 1
if count == M:
return True
else:
return False
# TODO(takaaki-hori): add different smoothing methods
def label_smoothing_dist(odim, lsm_type, transcript=None, blank=0):
"""Obtain label distribution for loss smoothing.
:param odim:
:param lsm_type:
:param blank:
:param transcript:
:return:
"""
if transcript is not None:
with open(transcript, 'rb') as f:
trans_json = json.load(f)['utts']
if lsm_type == 'unigram':
assert transcript is not None, 'transcript is required for %s label smoothing' % lsm_type
labelcount = np.zeros(odim)
for k, v in trans_json.items():
ids = np.array([int(n) for n in v['output'][0]['tokenid'].split()])
# to avoid an error when there is no text in an utterance
if len(ids) > 0:
labelcount[ids] += 1
labelcount[odim - 1] = len(transcript) # count <eos>
labelcount[labelcount == 0] = 1 # flooring
labelcount[blank] = 0 # remove counts for blank
labeldist = labelcount.astype(np.float32) / np.sum(labelcount)
else:
logging.error(
"Error: unexpected label smoothing type: %s" % lsm_type)
sys.exit()
return labeldist
def get_vgg2l_odim(idim, in_channel=3, out_channel=128, downsample=True):
"""Return the output size of the VGG frontend.
:param in_channel: input channel size
:param out_channel: output channel size
:return: output size
:rtype: int
"""
idim = idim / in_channel
if downsample:
idim = np.ceil(np.array(idim, dtype=np.float32) / 2) # 1st max pooling
idim = np.ceil(np.array(idim, dtype=np.float32) / 2) # 2nd max pooling
return int(idim) * out_channel # number of channels
class ErrorCalculator(object):
"""Calculate CER and WER for E2E_ASR and CTC models during training.
:param y_hats: numpy array with predicted text
:param y_pads: numpy array with true (target) text
:param char_list:
:param sym_space:
:param sym_blank:
:return:
"""
def __init__(self, char_list, sym_space, sym_blank, report_cer=False, report_wer=False,
trans_type="char"):
"""Construct an ErrorCalculator object."""
super(ErrorCalculator, self).__init__()
self.report_cer = report_cer
self.report_wer = report_wer
self.trans_type = trans_type
self.char_list = char_list
self.space = sym_space
self.blank = sym_blank
self.idx_blank = self.char_list.index(self.blank)
if self.space in self.char_list:
self.idx_space = self.char_list.index(self.space)
else:
self.idx_space = None
def __call__(self, ys_hat, ys_pad, is_ctc=False):
"""Calculate sentence-level WER/CER score.
:param torch.Tensor ys_hat: prediction (batch, seqlen)
:param torch.Tensor ys_pad: reference (batch, seqlen)
:param bool is_ctc: calculate CER score for CTC
:return: sentence-level WER score
:rtype: float
:return: sentence-level CER score
:rtype: float
"""
cer, wer = None, None
if is_ctc:
return self.calculate_cer_ctc(ys_hat, ys_pad)
elif not self.report_cer and not self.report_wer:
return cer, wer
seqs_hat, seqs_true = self.convert_to_char(ys_hat, ys_pad)
if self.report_cer:
cer = self.calculate_cer(seqs_hat, seqs_true)
if self.report_wer:
wer = self.calculate_wer(seqs_hat, seqs_true)
return cer, wer
def calculate_cer_ctc(self, ys_hat, ys_pad):
"""Calculate sentence-level CER score for CTC.
:param torch.Tensor ys_hat: prediction (batch, seqlen)
:param torch.Tensor ys_pad: reference (batch, seqlen)
:return: average sentence-level CER score
:rtype: float
"""
cers, char_ref_lens = [], []
for i, y in enumerate(ys_hat):
y_hat = [x[0] for x in groupby(y)]
y_true = ys_pad[i]
seq_hat, seq_true = [], []
for idx in y_hat:
idx = int(idx)
if idx != -1 and idx != self.idx_blank and idx != self.idx_space:
seq_hat.append(self.char_list[int(idx)])
for idx in y_true:
idx = int(idx)
if idx != -1 and idx != self.idx_blank and idx != self.idx_space:
seq_true.append(self.char_list[int(idx)])
if self.trans_type == "char":
hyp_chars = "".join(seq_hat)
ref_chars = "".join(seq_true)
else:
hyp_chars = " ".join(seq_hat)
ref_chars = " ".join(seq_true)
if len(ref_chars) > 0:
cers.append(editdistance.eval(hyp_chars, ref_chars))
char_ref_lens.append(len(ref_chars))
cer_ctc = float(sum(cers)) / sum(char_ref_lens) if cers else None
return cer_ctc
def convert_to_char(self, ys_hat, ys_pad):
"""Convert index to character.
:param torch.Tensor seqs_hat: prediction (batch, seqlen)
:param torch.Tensor seqs_true: reference (batch, seqlen)
:return: token list of prediction
:rtype: list
:return: token list of reference
:rtype: list
"""
seqs_hat, seqs_true = [], []
for i, y_hat in enumerate(ys_hat):
y_true = ys_pad[i]
eos_true = np.where(y_true == -1)[0]
eos_true = eos_true[0] if len(eos_true) > 0 else len(y_true)
# To avoid wrong higher WER than the one obtained from the decoding
# eos from y_true is used to mark the eos in y_hat
# because of that y_hats has not padded outs with -1.
seq_hat = [self.char_list[int(idx)] for idx in y_hat[:eos_true]]
seq_true = [self.char_list[int(idx)] for idx in y_true if int(idx) != -1]
# seq_hat_text = "".join(seq_hat).replace(self.space, ' ')
seq_hat_text = " ".join(seq_hat).replace(self.space, ' ')
seq_hat_text = seq_hat_text.replace(self.blank, '')
# seq_true_text = "".join(seq_true).replace(self.space, ' ')
seq_true_text = " ".join(seq_true).replace(self.space, ' ')
seqs_hat.append(seq_hat_text)
seqs_true.append(seq_true_text)
return seqs_hat, seqs_true
def calculate_cer(self, seqs_hat, seqs_true):
"""Calculate sentence-level CER score.
:param list seqs_hat: prediction
:param list seqs_true: reference
:return: average sentence-level CER score
:rtype: float
"""
char_eds, char_ref_lens = [], []
for i, seq_hat_text in enumerate(seqs_hat):
seq_true_text = seqs_true[i]
hyp_chars = seq_hat_text.replace(' ', '')
ref_chars = seq_true_text.replace(' ', '')
char_eds.append(editdistance.eval(hyp_chars, ref_chars))
char_ref_lens.append(len(ref_chars))
return float(sum(char_eds)) / sum(char_ref_lens)
def calculate_wer(self, seqs_hat, seqs_true):
"""Calculate sentence-level WER score.
:param list seqs_hat: prediction
:param list seqs_true: reference
:return: average sentence-level WER score
:rtype: float
"""
word_eds, word_ref_lens = [], []
for i, seq_hat_text in enumerate(seqs_hat):
seq_true_text = seqs_true[i]
hyp_words = seq_hat_text.split()
ref_words = seq_true_text.split()
word_eds.append(editdistance.eval(hyp_words, ref_words))
word_ref_lens.append(len(ref_words))
return float(sum(word_eds)) / sum(word_ref_lens)
class ErrorCalculatorTrans(object):
"""Calculate CER and WER for transducer models.
Args:
decoder (nn.Module): decoder module
args (Namespace): argument Namespace containing options
report_cer (boolean): compute CER option
report_wer (boolean): compute WER option
"""
def __init__(self, decoder, args, report_cer=False, report_wer=False):
"""Construct an ErrorCalculator object for transducer model."""
super(ErrorCalculatorTrans, self).__init__()
self.dec = decoder
recog_args = {'beam_size': args.beam_size,
'nbest': args.nbest,
'space': args.sym_space,
'score_norm_transducer': args.score_norm_transducer}
self.recog_args = argparse.Namespace(**recog_args)
self.char_list = args.char_list
self.space = args.sym_space
self.blank = args.sym_blank
self.report_cer = args.report_cer
self.report_wer = args.report_wer
def __call__(self, hs_pad, ys_pad):
"""Calculate sentence-level WER/CER score for transducer models.
Args:
hs_pad (torch.Tensor): batch of padded input sequence (batch, T, D)
ys_pad (torch.Tensor): reference (batch, seqlen)
Returns:
(float): sentence-level CER score
(float): sentence-level WER score
"""
cer, wer = None, None
if not self.report_cer and not self.report_wer:
return cer, wer
batchsize = int(hs_pad.size(0))
batch_nbest = []
for b in six.moves.range(batchsize):
if self.recog_args.beam_size == 1:
nbest_hyps = self.dec.recognize(hs_pad[b], self.recog_args)
else:
nbest_hyps = self.dec.recognize_beam(hs_pad[b], self.recog_args)
batch_nbest.append(nbest_hyps)
ys_hat = [nbest_hyp[0]['yseq'][1:] for nbest_hyp in batch_nbest]
seqs_hat, seqs_true = self.convert_to_char(ys_hat, ys_pad.cpu())
if self.report_cer:
cer = self.calculate_cer(seqs_hat, seqs_true)
if self.report_wer:
wer = self.calculate_wer(seqs_hat, seqs_true)
return cer, wer
def convert_to_char(self, ys_hat, ys_pad):
"""Convert index to character.
Args:
ys_hat (torch.Tensor): prediction (batch, seqlen)
ys_pad (torch.Tensor): reference (batch, seqlen)
Returns:
(list): token list of prediction
(list): token list of reference
"""
seqs_hat, seqs_true = [], []
for i, y_hat in enumerate(ys_hat):
y_true = ys_pad[i]
eos_true = np.where(y_true == -1)[0]
eos_true = eos_true[0] if len(eos_true) > 0 else len(y_true)
seq_hat = [self.char_list[int(idx)] for idx in y_hat[:eos_true]]
seq_true = [self.char_list[int(idx)] for idx in y_true if int(idx) != -1]
seq_hat_text = "".join(seq_hat).replace(self.space, ' ')
seq_hat_text = seq_hat_text.replace(self.blank, '')
seq_true_text = "".join(seq_true).replace(self.space, ' ')
seqs_hat.append(seq_hat_text)
seqs_true.append(seq_true_text)
return seqs_hat, seqs_true
def calculate_cer(self, seqs_hat, seqs_true):
"""Calculate sentence-level CER score for transducer model.
Args:
seqs_hat (torch.Tensor): prediction (batch, seqlen)
seqs_true (torch.Tensor): reference (batch, seqlen)
Returns:
(float): average sentence-level CER score
"""
char_eds, char_ref_lens = [], []
for i, seq_hat_text in enumerate(seqs_hat):
seq_true_text = seqs_true[i]
hyp_chars = seq_hat_text.replace(' ', '')
ref_chars = seq_true_text.replace(' ', '')
char_eds.append(editdistance.eval(hyp_chars, ref_chars))
char_ref_lens.append(len(ref_chars))
return float(sum(char_eds)) / sum(char_ref_lens)
def calculate_wer(self, seqs_hat, seqs_true):
"""Calculate sentence-level WER score for transducer model.
Args:
seqs_hat (torch.Tensor): prediction (batch, seqlen)
seqs_true (torch.Tensor): reference (batch, seqlen)
Returns:
(float): average sentence-level WER score
"""
word_eds, word_ref_lens = [], []
for i, seq_hat_text in enumerate(seqs_hat):
seq_true_text = seqs_true[i]
hyp_words = seq_hat_text.split()
ref_words = seq_true_text.split()
word_eds.append(editdistance.eval(hyp_words, ref_words))
word_ref_lens.append(len(ref_words))
return float(sum(word_eds)) / sum(word_ref_lens)
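
A toy sanity check of the ErrorCalculator defined above, run in the same module context; the vocabulary and id sequences are invented purely for illustration.

import torch
# Index 0 is the CTC blank, "<space>" marks word boundaries, -1 pads the reference.
char_list = ["<blank>", "a", "b", "c", "<space>"]
calc = ErrorCalculator(char_list, sym_space="<space>", sym_blank="<blank>",
                       report_cer=True, report_wer=True)
ys_pad = torch.tensor([[1, 2, 4, 3, -1]])  # reference:  "ab c"
ys_hat = torch.tensor([[1, 2, 4, 2, -1]])  # hypothesis: "ab b"
cer, wer = calc(ys_hat, ys_pad)
print(cer, wer)  # both 1/3: one substituted character, one substituted word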

View File

@@ -0,0 +1,183 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Copyright 2019 Shigeki Karita
# Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0)
"""Multi-Head Attention layer definition."""
import math
import numpy
import torch
from torch import nn
class MultiHeadedAttention(nn.Module):
"""Multi-Head Attention layer.
:param int n_head: the number of heads
:param int n_feat: the number of features
:param float dropout_rate: dropout rate
"""
def __init__(self, n_head, n_feat, dropout_rate):
"""Construct an MultiHeadedAttention object."""
super(MultiHeadedAttention, self).__init__()
assert n_feat % n_head == 0
# We assume d_v always equals d_k
self.d_k = n_feat // n_head
self.h = n_head
self.linear_q = nn.Linear(n_feat, n_feat)
self.linear_k = nn.Linear(n_feat, n_feat)
self.linear_v = nn.Linear(n_feat, n_feat)
self.linear_out = nn.Linear(n_feat, n_feat)
self.attn = None
self.dropout = nn.Dropout(p=dropout_rate)
def forward_qkv(self, query, key, value):
"""Transform query, key and value.
:param torch.Tensor query: (batch, time1, size)
:param torch.Tensor key: (batch, time2, size)
:param torch.Tensor value: (batch, time2, size)
:return torch.Tensor transformed query, key and value
"""
n_batch = query.size(0)
q = self.linear_q(query).view(n_batch, -1, self.h, self.d_k)
k = self.linear_k(key).view(n_batch, -1, self.h, self.d_k)
v = self.linear_v(value).view(n_batch, -1, self.h, self.d_k)
q = q.transpose(1, 2) # (batch, head, time1, d_k)
k = k.transpose(1, 2) # (batch, head, time2, d_k)
v = v.transpose(1, 2) # (batch, head, time2, d_k)
return q, k, v
def forward_attention(self, value, scores, mask):
"""Compute attention context vector.
:param torch.Tensor value: (batch, head, time2, size)
:param torch.Tensor scores: (batch, head, time1, time2)
:param torch.Tensor mask: (batch, 1, time2) or (batch, time1, time2)
:return torch.Tensor transformed `value` (batch, time1, d_model)
weighted by the attention score (batch, time1, time2)
"""
n_batch = value.size(0)
if mask is not None:
mask = mask.unsqueeze(1).eq(0) # (batch, 1, *, time2)
min_value = float(
numpy.finfo(torch.tensor(0, dtype=scores.dtype).numpy().dtype).min
)
scores = scores.masked_fill(mask, min_value)
self.attn = torch.softmax(scores, dim=-1).masked_fill(
mask, 0.0
) # (batch, head, time1, time2)
else:
self.attn = torch.softmax(scores, dim=-1) # (batch, head, time1, time2)
p_attn = self.dropout(self.attn)
x = torch.matmul(p_attn, value) # (batch, head, time1, d_k)
x = (
x.transpose(1, 2).contiguous().view(n_batch, -1, self.h * self.d_k)
) # (batch, time1, d_model)
return self.linear_out(x) # (batch, time1, d_model)
def forward(self, query, key, value, mask):
"""Compute 'Scaled Dot Product Attention'.
:param torch.Tensor query: (batch, time1, size)
:param torch.Tensor key: (batch, time2, size)
:param torch.Tensor value: (batch, time2, size)
:param torch.Tensor mask: (batch, 1, time2) or (batch, time1, time2)
:return torch.Tensor: attention output (batch, time1, d_model)
"""
q, k, v = self.forward_qkv(query, key, value)
scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
return self.forward_attention(v, scores, mask)
class RelPositionMultiHeadedAttention(MultiHeadedAttention):
"""Multi-Head Attention layer with relative position encoding.
Paper: https://arxiv.org/abs/1901.02860
:param int n_head: the number of heads
:param int n_feat: the number of features
:param float dropout_rate: dropout rate
"""
def __init__(self, n_head, n_feat, dropout_rate):
"""Construct a RelPositionMultiHeadedAttention object."""
super().__init__(n_head, n_feat, dropout_rate)
# linear transformation for positional encoding
self.linear_pos = nn.Linear(n_feat, n_feat, bias=False)
# these two learnable biases are used in matrix c and matrix d
# as described in https://arxiv.org/abs/1901.02860 Section 3.3
self.pos_bias_u = nn.Parameter(torch.Tensor(self.h, self.d_k))
self.pos_bias_v = nn.Parameter(torch.Tensor(self.h, self.d_k))
torch.nn.init.xavier_uniform_(self.pos_bias_u)
torch.nn.init.xavier_uniform_(self.pos_bias_v)
def rel_shift(self, x, zero_triu=False):
"""Compute relative positinal encoding.
:param torch.Tensor x: (batch, time, size)
:param bool zero_triu: return the lower triangular part of the matrix
"""
zero_pad = torch.zeros((*x.size()[:3], 1), device=x.device, dtype=x.dtype)
x_padded = torch.cat([zero_pad, x], dim=-1)
x_padded = x_padded.view(*x.size()[:2], x.size(3) + 1, x.size(2))
x = x_padded[:, :, 1:].view_as(x)
if zero_triu:
ones = torch.ones((x.size(2), x.size(3)))
x = x * torch.tril(ones, x.size(3) - x.size(2))[None, None, :, :]
return x
def forward(self, query, key, value, pos_emb, mask):
"""Compute 'Scaled Dot Product Attention' with rel. positional encoding.
:param torch.Tensor query: (batch, time1, size)
:param torch.Tensor key: (batch, time2, size)
:param torch.Tensor value: (batch, time2, size)
:param torch.Tensor pos_emb: (batch, time1, size)
:param torch.Tensor mask: (batch, time1, time2)
:return torch.Tensor: attention output (batch, time1, d_model)
"""
q, k, v = self.forward_qkv(query, key, value)
q = q.transpose(1, 2) # (batch, time1, head, d_k)
n_batch_pos = pos_emb.size(0)
p = self.linear_pos(pos_emb).view(n_batch_pos, -1, self.h, self.d_k)
p = p.transpose(1, 2) # (batch, head, time1, d_k)
# (batch, head, time1, d_k)
q_with_bias_u = (q + self.pos_bias_u).transpose(1, 2)
# (batch, head, time1, d_k)
q_with_bias_v = (q + self.pos_bias_v).transpose(1, 2)
# compute attention score
# first compute matrix a and matrix c
# as described in https://arxiv.org/abs/1901.02860 Section 3.3
# (batch, head, time1, time2)
matrix_ac = torch.matmul(q_with_bias_u, k.transpose(-2, -1))
# compute matrix b and matrix d
# (batch, head, time1, time2)
matrix_bd = torch.matmul(q_with_bias_v, p.transpose(-2, -1))
matrix_bd = self.rel_shift(matrix_bd)
scores = (matrix_ac + matrix_bd) / math.sqrt(
self.d_k
) # (batch, head, time1, time2)
return self.forward_attention(v, scores, mask)
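
A quick shape check of the plain MultiHeadedAttention defined above; the sizes are arbitrary, and d_k works out to n_feat // n_head = 4 here.

import torch
mha = MultiHeadedAttention(n_head=2, n_feat=8, dropout_rate=0.0)
x = torch.randn(2, 5, 8)                      # (batch, time, n_feat)
mask = torch.ones(2, 1, 5, dtype=torch.bool)  # (batch, 1, time2): nothing masked
out = mha(x, x, x, mask)                      # self-attention
print(out.shape)                              # torch.Size([2, 5, 8])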

View File

@@ -0,0 +1,262 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Copyright 2019 Shigeki Karita
# Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0)
"""Encoder definition."""
import logging
import torch
from typing import Callable
from typing import Collection
from typing import Dict
from typing import List
from typing import Optional
from typing import Tuple
from .convolution import ConvolutionModule
from .encoder_layer import EncoderLayer
from ..nets_utils import get_activation, make_pad_mask
from .vgg import VGG2L
from .attention import MultiHeadedAttention, RelPositionMultiHeadedAttention
from .embedding import PositionalEncoding, ScaledPositionalEncoding, RelPositionalEncoding
from .layer_norm import LayerNorm
from .multi_layer_conv import Conv1dLinear, MultiLayeredConv1d
from .positionwise_feed_forward import PositionwiseFeedForward
from .repeat import repeat
from .subsampling import Conv2dNoSubsampling, Conv2dSubsampling
class ConformerEncoder(torch.nn.Module):
"""Conformer encoder module.
:param int input_size: input feature dimension
:param int attention_dim: dimension of attention
:param int attention_heads: the number of heads of multi head attention
:param int linear_units: the number of units of position-wise feed forward
:param int num_blocks: the number of encoder blocks
:param float dropout_rate: dropout rate
:param float attention_dropout_rate: dropout rate in attention
:param float positional_dropout_rate: dropout rate after adding positional encoding
:param str or torch.nn.Module input_layer: input layer type
:param bool normalize_before: whether to use layer_norm before the first block
:param bool concat_after: whether to concat attention layer's input and output
if True, additional linear will be applied.
i.e. x -> x + linear(concat(x, att(x)))
if False, no additional linear will be applied. i.e. x -> x + att(x)
:param str positionwise_layer_type: "linear" or "conv1d"
:param int positionwise_conv_kernel_size: kernel size of positionwise conv1d layer
:param str pos_enc_layer_type: encoder positional encoding layer type
:param str selfattention_layer_type: encoder attention layer type
:param str activation_type: encoder activation function type
:param bool macaron_style: whether to use macaron style for positionwise layer
:param bool use_cnn_module: whether to use convolution module
:param int cnn_module_kernel: kernel size of convolution module
:param int padding_idx: padding_idx for input_layer=embed
"""
def __init__(
self,
input_size,
attention_dim=256,
attention_heads=4,
linear_units=2048,
num_blocks=6,
dropout_rate=0.1,
positional_dropout_rate=0.1,
attention_dropout_rate=0.0,
input_layer="conv2d",
normalize_before=True,
concat_after=False,
positionwise_layer_type="linear",
positionwise_conv_kernel_size=1,
macaron_style=False,
pos_enc_layer_type="abs_pos",
selfattention_layer_type="selfattn",
activation_type="swish",
use_cnn_module=False,
cnn_module_kernel=31,
padding_idx=-1,
no_subsample=False,
subsample_by_2=False,
):
"""Construct an Encoder object."""
super().__init__()
self._output_size = attention_dim
idim = input_size
activation = get_activation(activation_type)
if pos_enc_layer_type == "abs_pos":
pos_enc_class = PositionalEncoding
elif pos_enc_layer_type == "scaled_abs_pos":
pos_enc_class = ScaledPositionalEncoding
elif pos_enc_layer_type == "rel_pos":
assert selfattention_layer_type == "rel_selfattn"
pos_enc_class = RelPositionalEncoding
else:
raise ValueError("unknown pos_enc_layer: " + pos_enc_layer_type)
if input_layer == "linear":
self.embed = torch.nn.Sequential(
torch.nn.Linear(idim, attention_dim),
torch.nn.LayerNorm(attention_dim),
torch.nn.Dropout(dropout_rate),
pos_enc_class(attention_dim, positional_dropout_rate),
)
elif input_layer == "conv2d":
logging.info("Encoder input layer type: conv2d")
if no_subsample:
self.embed = Conv2dNoSubsampling(
idim,
attention_dim,
dropout_rate,
pos_enc_class(attention_dim, positional_dropout_rate),
)
else:
self.embed = Conv2dSubsampling(
idim,
attention_dim,
dropout_rate,
pos_enc_class(attention_dim, positional_dropout_rate),
subsample_by_2, # NOTE(Sx): added by songxiang
)
elif input_layer == "vgg2l":
self.embed = VGG2L(idim, attention_dim)
elif input_layer == "embed":
self.embed = torch.nn.Sequential(
torch.nn.Embedding(idim, attention_dim, padding_idx=padding_idx),
pos_enc_class(attention_dim, positional_dropout_rate),
)
elif isinstance(input_layer, torch.nn.Module):
self.embed = torch.nn.Sequential(
input_layer,
pos_enc_class(attention_dim, positional_dropout_rate),
)
elif input_layer is None:
self.embed = torch.nn.Sequential(
pos_enc_class(attention_dim, positional_dropout_rate)
)
else:
raise ValueError("unknown input_layer: " + input_layer)
self.normalize_before = normalize_before
if positionwise_layer_type == "linear":
positionwise_layer = PositionwiseFeedForward
positionwise_layer_args = (
attention_dim,
linear_units,
dropout_rate,
activation,
)
elif positionwise_layer_type == "conv1d":
positionwise_layer = MultiLayeredConv1d
positionwise_layer_args = (
attention_dim,
linear_units,
positionwise_conv_kernel_size,
dropout_rate,
)
elif positionwise_layer_type == "conv1d-linear":
positionwise_layer = Conv1dLinear
positionwise_layer_args = (
attention_dim,
linear_units,
positionwise_conv_kernel_size,
dropout_rate,
)
else:
raise NotImplementedError("Support only linear or conv1d.")
if selfattention_layer_type == "selfattn":
logging.info("encoder self-attention layer type = self-attention")
encoder_selfattn_layer = MultiHeadedAttention
encoder_selfattn_layer_args = (
attention_heads,
attention_dim,
attention_dropout_rate,
)
elif selfattention_layer_type == "rel_selfattn":
assert pos_enc_layer_type == "rel_pos"
encoder_selfattn_layer = RelPositionMultiHeadedAttention
encoder_selfattn_layer_args = (
attention_heads,
attention_dim,
attention_dropout_rate,
)
else:
raise ValueError("unknown encoder_attn_layer: " + selfattention_layer_type)
convolution_layer = ConvolutionModule
convolution_layer_args = (attention_dim, cnn_module_kernel, activation)
self.encoders = repeat(
num_blocks,
lambda lnum: EncoderLayer(
attention_dim,
encoder_selfattn_layer(*encoder_selfattn_layer_args),
positionwise_layer(*positionwise_layer_args),
positionwise_layer(*positionwise_layer_args) if macaron_style else None,
convolution_layer(*convolution_layer_args) if use_cnn_module else None,
dropout_rate,
normalize_before,
concat_after,
),
)
if self.normalize_before:
self.after_norm = LayerNorm(attention_dim)
def output_size(self) -> int:
return self._output_size
def forward(
self,
xs_pad: torch.Tensor,
ilens: torch.Tensor,
prev_states: torch.Tensor = None,
) -> Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]]:
"""
Args:
xs_pad: input tensor (B, L, D)
ilens: input lengths (B)
prev_states: Not to be used now.
Returns:
Encoded tensor (B, L', attention_dim), output lengths (B,), and None (no states)
"""
masks = (~make_pad_mask(ilens)[:, None, :]).to(xs_pad.device)
if isinstance(self.embed, (Conv2dSubsampling, Conv2dNoSubsampling, VGG2L)):
xs_pad, masks = self.embed(xs_pad, masks)
else:
xs_pad = self.embed(xs_pad)
xs_pad, masks = self.encoders(xs_pad, masks)
if isinstance(xs_pad, tuple):
xs_pad = xs_pad[0]
if self.normalize_before:
xs_pad = self.after_norm(xs_pad)
olens = masks.squeeze(1).sum(1)
return xs_pad, olens, None
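
A rough smoke test of the ConformerEncoder above with made-up sizes; the real hyperparameters come from the yaml shipped with the PPG checkpoint, and 80-dim fbank frames are assumed as input.

import torch
enc = ConformerEncoder(input_size=80, attention_dim=144, attention_heads=4,
                       linear_units=576, num_blocks=2)
feats = torch.randn(2, 100, 80)      # (B, NFrames, 80)
feat_lens = torch.tensor([100, 80])  # true frame counts per utterance
out, out_lens, _ = enc(feats, feat_lens)
print(out.shape, out_lens)           # roughly (2, 100 // 4, 144) after conv2d subsampling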

View File

@@ -0,0 +1,74 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Copyright 2020 Johns Hopkins University (Shinji Watanabe)
# Northwestern Polytechnical University (Pengcheng Guo)
# Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0)
"""ConvolutionModule definition."""
from torch import nn
class ConvolutionModule(nn.Module):
"""ConvolutionModule in Conformer model.
:param int channels: channels of cnn
:param int kernel_size: kernel size of cnn
"""
def __init__(self, channels, kernel_size, activation=nn.ReLU(), bias=True):
"""Construct a ConvolutionModule object."""
super(ConvolutionModule, self).__init__()
# kernel_size should be an odd number for 'SAME' padding
assert (kernel_size - 1) % 2 == 0
self.pointwise_conv1 = nn.Conv1d(
channels,
2 * channels,
kernel_size=1,
stride=1,
padding=0,
bias=bias,
)
self.depthwise_conv = nn.Conv1d(
channels,
channels,
kernel_size,
stride=1,
padding=(kernel_size - 1) // 2,
groups=channels,
bias=bias,
)
self.norm = nn.BatchNorm1d(channels)
self.pointwise_conv2 = nn.Conv1d(
channels,
channels,
kernel_size=1,
stride=1,
padding=0,
bias=bias,
)
self.activation = activation
def forward(self, x):
"""Compute convolution module.
:param torch.Tensor x: (batch, time, size)
:return torch.Tensor: convolved output (batch, time, channels)
"""
# exchange the temporal dimension and the feature dimension
x = x.transpose(1, 2)
# GLU mechanism
x = self.pointwise_conv1(x) # (batch, 2*channels, time)
x = nn.functional.glu(x, dim=1) # (batch, channels, time)
# 1D Depthwise Conv
x = self.depthwise_conv(x)
x = self.activation(self.norm(x))
x = self.pointwise_conv2(x)
return x.transpose(1, 2)
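
The convolution module is shape-preserving along the time axis, which the sketch below illustrates; channel count and kernel size are arbitrary.

import torch
conv = ConvolutionModule(channels=144, kernel_size=31)
x = torch.randn(2, 100, 144)  # (batch, time, channels)
y = conv(x)
print(y.shape)                # torch.Size([2, 100, 144])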

View File

@@ -0,0 +1,166 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Copyright 2019 Shigeki Karita
# Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0)
"""Positonal Encoding Module."""
import math
import torch
def _pre_hook(
state_dict,
prefix,
local_metadata,
strict,
missing_keys,
unexpected_keys,
error_msgs,
):
"""Perform pre-hook in load_state_dict for backward compatibility.
Note:
We saved self.pe until v.0.5.2 but we have omitted it later.
Therefore, we remove the item "pe" from `state_dict` for backward compatibility.
"""
k = prefix + "pe"
if k in state_dict:
state_dict.pop(k)
class PositionalEncoding(torch.nn.Module):
"""Positional encoding.
:param int d_model: embedding dim
:param float dropout_rate: dropout rate
:param int max_len: maximum input length
:param reverse: whether to reverse the input position
"""
def __init__(self, d_model, dropout_rate, max_len=5000, reverse=False):
"""Construct an PositionalEncoding object."""
super(PositionalEncoding, self).__init__()
self.d_model = d_model
self.reverse = reverse
self.xscale = math.sqrt(self.d_model)
self.dropout = torch.nn.Dropout(p=dropout_rate)
self.pe = None
self.extend_pe(torch.tensor(0.0).expand(1, max_len))
self._register_load_state_dict_pre_hook(_pre_hook)
def extend_pe(self, x):
"""Reset the positional encodings."""
if self.pe is not None:
if self.pe.size(1) >= x.size(1):
if self.pe.dtype != x.dtype or self.pe.device != x.device:
self.pe = self.pe.to(dtype=x.dtype, device=x.device)
return
pe = torch.zeros(x.size(1), self.d_model)
if self.reverse:
position = torch.arange(
x.size(1) - 1, -1, -1.0, dtype=torch.float32
).unsqueeze(1)
else:
position = torch.arange(0, x.size(1), dtype=torch.float32).unsqueeze(1)
div_term = torch.exp(
torch.arange(0, self.d_model, 2, dtype=torch.float32)
* -(math.log(10000.0) / self.d_model)
)
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
pe = pe.unsqueeze(0)
self.pe = pe.to(device=x.device, dtype=x.dtype)
def forward(self, x: torch.Tensor):
"""Add positional encoding.
Args:
x (torch.Tensor): Input. Its shape is (batch, time, ...)
Returns:
torch.Tensor: Encoded tensor. Its shape is (batch, time, ...)
"""
self.extend_pe(x)
x = x * self.xscale + self.pe[:, : x.size(1)]
return self.dropout(x)
class ScaledPositionalEncoding(PositionalEncoding):
"""Scaled positional encoding module.
See also: Sec. 3.2 https://arxiv.org/pdf/1809.08895.pdf
"""
def __init__(self, d_model, dropout_rate, max_len=5000):
"""Initialize class.
:param int d_model: embedding dim
:param float dropout_rate: dropout rate
:param int max_len: maximum input length
"""
super().__init__(d_model=d_model, dropout_rate=dropout_rate, max_len=max_len)
self.alpha = torch.nn.Parameter(torch.tensor(1.0))
def reset_parameters(self):
"""Reset parameters."""
self.alpha.data = torch.tensor(1.0)
def forward(self, x):
"""Add positional encoding.
Args:
x (torch.Tensor): Input. Its shape is (batch, time, ...)
Returns:
torch.Tensor: Encoded tensor. Its shape is (batch, time, ...)
"""
self.extend_pe(x)
x = x + self.alpha * self.pe[:, : x.size(1)]
return self.dropout(x)
class RelPositionalEncoding(PositionalEncoding):
"""Relitive positional encoding module.
See : Appendix B in https://arxiv.org/abs/1901.02860
:param int d_model: embedding dim
:param float dropout_rate: dropout rate
:param int max_len: maximum input length
"""
def __init__(self, d_model, dropout_rate, max_len=5000):
"""Initialize class.
:param int d_model: embedding dim
:param float dropout_rate: dropout rate
:param int max_len: maximum input length
"""
super().__init__(d_model, dropout_rate, max_len, reverse=True)
def forward(self, x):
"""Compute positional encoding.
Args:
x (torch.Tensor): Input. Its shape is (batch, time, ...)
Returns:
torch.Tensor: x. Its shape is (batch, time, ...)
torch.Tensor: pos_emb. Its shape is (1, time, ...)
"""
self.extend_pe(x)
x = x * self.xscale
pos_emb = self.pe[:, : x.size(1)]
return self.dropout(x), self.dropout(pos_emb)
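
Unlike the absolute variants, RelPositionalEncoding returns both the scaled input and a separate positional embedding tensor, which RelPositionMultiHeadedAttention consumes as pos_emb; a small sketch with arbitrary sizes:

import torch
pos_enc = RelPositionalEncoding(d_model=144, dropout_rate=0.0)
x = torch.randn(2, 50, 144)
x_scaled, pos_emb = pos_enc(x)
print(x_scaled.shape, pos_emb.shape)  # torch.Size([2, 50, 144]) torch.Size([1, 50, 144])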

View File

@@ -0,0 +1,217 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Copyright 2019 Shigeki Karita
# Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0)
"""Encoder definition."""
import logging
import torch
from espnet.nets.pytorch_backend.conformer.convolution import ConvolutionModule
from espnet.nets.pytorch_backend.conformer.encoder_layer import EncoderLayer
from espnet.nets.pytorch_backend.nets_utils import get_activation
from espnet.nets.pytorch_backend.transducer.vgg import VGG2L
from espnet.nets.pytorch_backend.transformer.attention import (
MultiHeadedAttention, # noqa: H301
RelPositionMultiHeadedAttention, # noqa: H301
)
from espnet.nets.pytorch_backend.transformer.embedding import (
PositionalEncoding, # noqa: H301
ScaledPositionalEncoding, # noqa: H301
RelPositionalEncoding, # noqa: H301
)
from espnet.nets.pytorch_backend.transformer.layer_norm import LayerNorm
from espnet.nets.pytorch_backend.transformer.multi_layer_conv import Conv1dLinear
from espnet.nets.pytorch_backend.transformer.multi_layer_conv import MultiLayeredConv1d
from espnet.nets.pytorch_backend.transformer.positionwise_feed_forward import (
PositionwiseFeedForward, # noqa: H301
)
from espnet.nets.pytorch_backend.transformer.repeat import repeat
from espnet.nets.pytorch_backend.transformer.subsampling import Conv2dSubsampling
class Encoder(torch.nn.Module):
"""Conformer encoder module.
:param int idim: input dim
:param int attention_dim: dimension of attention
:param int attention_heads: the number of heads of multi head attention
:param int linear_units: the number of units of position-wise feed forward
:param int num_blocks: the number of encoder blocks
:param float dropout_rate: dropout rate
:param float attention_dropout_rate: dropout rate in attention
:param float positional_dropout_rate: dropout rate after adding positional encoding
:param str or torch.nn.Module input_layer: input layer type
:param bool normalize_before: whether to use layer_norm before the first block
:param bool concat_after: whether to concat attention layer's input and output
if True, additional linear will be applied.
i.e. x -> x + linear(concat(x, att(x)))
if False, no additional linear will be applied. i.e. x -> x + att(x)
:param str positionwise_layer_type: "linear" or "conv1d"
:param int positionwise_conv_kernel_size: kernel size of positionwise conv1d layer
:param str pos_enc_layer_type: encoder positional encoding layer type
:param str selfattention_layer_type: encoder attention layer type
:param str activation_type: encoder activation function type
:param bool macaron_style: whether to use macaron style for positionwise layer
:param bool use_cnn_module: whether to use convolution module
:param int cnn_module_kernel: kernel size of convolution module
:param int padding_idx: padding_idx for input_layer=embed
"""
def __init__(
self,
idim,
attention_dim=256,
attention_heads=4,
linear_units=2048,
num_blocks=6,
dropout_rate=0.1,
positional_dropout_rate=0.1,
attention_dropout_rate=0.0,
input_layer="conv2d",
normalize_before=True,
concat_after=False,
positionwise_layer_type="linear",
positionwise_conv_kernel_size=1,
macaron_style=False,
pos_enc_layer_type="abs_pos",
selfattention_layer_type="selfattn",
activation_type="swish",
use_cnn_module=False,
cnn_module_kernel=31,
padding_idx=-1,
):
"""Construct an Encoder object."""
super(Encoder, self).__init__()
activation = get_activation(activation_type)
if pos_enc_layer_type == "abs_pos":
pos_enc_class = PositionalEncoding
elif pos_enc_layer_type == "scaled_abs_pos":
pos_enc_class = ScaledPositionalEncoding
elif pos_enc_layer_type == "rel_pos":
assert selfattention_layer_type == "rel_selfattn"
pos_enc_class = RelPositionalEncoding
else:
raise ValueError("unknown pos_enc_layer: " + pos_enc_layer_type)
if input_layer == "linear":
self.embed = torch.nn.Sequential(
torch.nn.Linear(idim, attention_dim),
torch.nn.LayerNorm(attention_dim),
torch.nn.Dropout(dropout_rate),
pos_enc_class(attention_dim, positional_dropout_rate),
)
elif input_layer == "conv2d":
self.embed = Conv2dSubsampling(
idim,
attention_dim,
dropout_rate,
pos_enc_class(attention_dim, positional_dropout_rate),
)
elif input_layer == "vgg2l":
self.embed = VGG2L(idim, attention_dim)
elif input_layer == "embed":
self.embed = torch.nn.Sequential(
torch.nn.Embedding(idim, attention_dim, padding_idx=padding_idx),
pos_enc_class(attention_dim, positional_dropout_rate),
)
elif isinstance(input_layer, torch.nn.Module):
self.embed = torch.nn.Sequential(
input_layer,
pos_enc_class(attention_dim, positional_dropout_rate),
)
elif input_layer is None:
self.embed = torch.nn.Sequential(
pos_enc_class(attention_dim, positional_dropout_rate)
)
else:
raise ValueError("unknown input_layer: " + input_layer)
self.normalize_before = normalize_before
if positionwise_layer_type == "linear":
positionwise_layer = PositionwiseFeedForward
positionwise_layer_args = (
attention_dim,
linear_units,
dropout_rate,
activation,
)
elif positionwise_layer_type == "conv1d":
positionwise_layer = MultiLayeredConv1d
positionwise_layer_args = (
attention_dim,
linear_units,
positionwise_conv_kernel_size,
dropout_rate,
)
elif positionwise_layer_type == "conv1d-linear":
positionwise_layer = Conv1dLinear
positionwise_layer_args = (
attention_dim,
linear_units,
positionwise_conv_kernel_size,
dropout_rate,
)
else:
raise NotImplementedError("Support only linear or conv1d.")
if selfattention_layer_type == "selfattn":
logging.info("encoder self-attention layer type = self-attention")
encoder_selfattn_layer = MultiHeadedAttention
encoder_selfattn_layer_args = (
attention_heads,
attention_dim,
attention_dropout_rate,
)
elif selfattention_layer_type == "rel_selfattn":
assert pos_enc_layer_type == "rel_pos"
encoder_selfattn_layer = RelPositionMultiHeadedAttention
encoder_selfattn_layer_args = (
attention_heads,
attention_dim,
attention_dropout_rate,
)
else:
raise ValueError("unknown encoder_attn_layer: " + selfattention_layer_type)
convolution_layer = ConvolutionModule
convolution_layer_args = (attention_dim, cnn_module_kernel, activation)
self.encoders = repeat(
num_blocks,
lambda lnum: EncoderLayer(
attention_dim,
encoder_selfattn_layer(*encoder_selfattn_layer_args),
positionwise_layer(*positionwise_layer_args),
positionwise_layer(*positionwise_layer_args) if macaron_style else None,
convolution_layer(*convolution_layer_args) if use_cnn_module else None,
dropout_rate,
normalize_before,
concat_after,
),
)
if self.normalize_before:
self.after_norm = LayerNorm(attention_dim)
def forward(self, xs, masks):
"""Encode input sequence.
:param torch.Tensor xs: input tensor
:param torch.Tensor masks: input mask
:return: position embedded tensor and mask
:rtype: Tuple[torch.Tensor, torch.Tensor]
"""
if isinstance(self.embed, (Conv2dSubsampling, VGG2L)):
xs, masks = self.embed(xs, masks)
else:
xs = self.embed(xs)
xs, masks = self.encoders(xs, masks)
if isinstance(xs, tuple):
xs = xs[0]
if self.normalize_before:
xs = self.after_norm(xs)
return xs, masks

Some files were not shown because too many files have changed in this diff.