Mirror of https://github.com/babysor/Realtime-Voice-Clone-Chinese.git (synced 2026-02-10 05:46:06 +08:00)

Compare commits: webtoolbox...Add-GST (11 commits)
Commits: 178787887b, 43c86eb411, 37f11ab9ce, e2017d0314, 547ac816df, 6b4ab39601, b46e7a7866, 8a384a1191, 11154783d8, d52db0444e, 790d11a58b
.gitignore (vendored) · 4
@@ -17,5 +17,7 @@
 *.sh
 synthesizer/saved_models/*
 vocoder/saved_models/*
+encoder/saved_models/*
 cp_hifigan/*
 !vocoder/saved_models/pretrained/*
+!encoder/saved_models/pretrained.pt
.vscode/launch.json (vendored) · 18
@@ -17,7 +17,7 @@
             "request": "launch",
             "program": "vocoder_preprocess.py",
             "console": "integratedTerminal",
-            "args": ["..\\..\\chs1"]
+            "args": ["..\\audiodata"]
         },
         {
             "name": "Python: Vocoder Train",
@@ -25,15 +25,23 @@
             "request": "launch",
             "program": "vocoder_train.py",
             "console": "integratedTerminal",
-            "args": ["dev", "..\\..\\chs1"]
+            "args": ["dev", "..\\audiodata"]
         },
         {
-            "name": "Python: demo box",
+            "name": "Python: Demo Box",
             "type": "python",
             "request": "launch",
             "program": "demo_toolbox.py",
             "console": "integratedTerminal",
-            "args": ["-d", "..\\..\\chs"]
-        }
+            "args": ["-d","..\\audiodata"]
+        },
+        {
+            "name": "Python: Synth Train",
+            "type": "python",
+            "request": "launch",
+            "program": "synthesizer_train.py",
+            "console": "integratedTerminal",
+            "args": ["my_run", "..\\"]
+        },
     ]
 }
README-CN.md · 13
@@ -5,7 +5,7 @@
 
 ### [English](README.md) | 中文
 
-### [DEMO VIDEO](https://www.bilibili.com/video/BV1sA411P7wM/)
+### [DEMO VIDEO](https://www.bilibili.com/video/BV17Q4y1B7mY/)
 
 ## Features
 🌍 **Chinese** Supports Mandarin and is tested on multiple Chinese datasets: aidatatang_200zh, magicdata, aishell3, biaobei, MozillaCommonVoice, etc.
@@ -73,7 +73,7 @@
 ### 3.1 Start the web server:
 `python web.py`
 Once it is running, open the address in a browser; the default is `http://localhost:8080`
-<img width="578" alt="bd64cd80385754afa599e3840504f45" src="https://user-images.githubusercontent.com/7423248/134275205-c95e6bd8-4f41-4eb5-9143-0390627baee1.png">
+
 > Note: the interface is still somewhat buggy.
 > * After clicking `Record` for the first time, wait a few seconds for the browser to start recording properly, otherwise the recording will contain doubled audio
 > * When recording is done, click `Stop`, not `Record` again
@@ -119,15 +119,20 @@
 
 | URL | Designation | Title | Implementation source |
 | --- | ----------- | ----- | --------------------- |
+| [1803.09017](https://arxiv.org/abs/1803.09017) | GlobalStyleToken (synthesizer) | Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis | This repo |
 | [2010.05646](https://arxiv.org/abs/2010.05646) | HiFi-GAN (vocoder) | Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis | This repo |
-| [**1806.04558**](https://arxiv.org/pdf/1806.04558.pdf) | **SV2TTS** | **Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis** | This repo |
+| [**1806.04558**](https://arxiv.org/pdf/1806.04558.pdf) | SV2TTS | Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis | This repo |
 | [1802.08435](https://arxiv.org/pdf/1802.08435.pdf) | WaveRNN (vocoder) | Efficient Neural Audio Synthesis | [fatchord/WaveRNN](https://github.com/fatchord/WaveRNN) |
 | [1703.10135](https://arxiv.org/pdf/1703.10135.pdf) | Tacotron (synthesizer) | Tacotron: Towards End-to-End Speech Synthesis | [fatchord/WaveRNN](https://github.com/fatchord/WaveRNN) |
 | [1710.10467](https://arxiv.org/pdf/1710.10467.pdf) | GE2E (encoder) | Generalized End-To-End Loss for Speaker Verification | This repo |
 
 ## FAQ
 #### 1. Where can I download the datasets?
-[aidatatang_200zh](http://www.openslr.org/62/), [magicdata](http://www.openslr.org/68/), [aishell3](http://www.openslr.org/93/)
+| Dataset | OpenSLR link | Other sources (Google Drive, Baidu Netdisk, etc.) |
+| --- | ----------- | ---------------|
+| aidatatang_200zh | [OpenSLR](http://www.openslr.org/62/) | [Google Drive](https://drive.google.com/file/d/110A11KZoVe7vy6kXlLb6zVPLb_J91I_t/view?usp=sharing) |
+| magicdata | [OpenSLR](http://www.openslr.org/68/) | [Google Drive (Dev set)](https://drive.google.com/file/d/1g5bWRUSNH68ycC6eNvtwh07nX3QhOOlo/view?usp=sharing) |
+| aishell3 | [OpenSLR](https://www.openslr.org/93/) | [Google Drive](https://drive.google.com/file/d/1shYp_o4Z0X0cZSKQDtFirct2luFUwKzZ/view?usp=sharing) |
 > After extracting aidatatang_200zh, you also need to select and extract all the archives under `aidatatang_200zh\corpus\train`
 
 #### 2. What does `<datasets_root>` mean?
README.md
@@ -16,7 +16,7 @@
 
 🌍 **Webserver Ready** to serve your result with remote calling
 
-### [DEMO VIDEO](https://www.bilibili.com/video/BV1sA411P7wM/)
+### [DEMO VIDEO](https://www.bilibili.com/video/BV17Q4y1B7mY/)
 
 ## Quick Start
 
@@ -77,6 +77,7 @@ You can then try the toolbox:
 
 | URL | Designation | Title | Implementation source |
 | --- | ----------- | ----- | --------------------- |
+| [1803.09017](https://arxiv.org/abs/1803.09017) | GlobalStyleToken (synthesizer) | Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis | This repo |
 | [2010.05646](https://arxiv.org/abs/2010.05646) | HiFi-GAN (vocoder) | Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis | This repo |
 | [**1806.04558**](https://arxiv.org/pdf/1806.04558.pdf) | **SV2TTS** | **Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis** | This repo |
 | [1802.08435](https://arxiv.org/pdf/1802.08435.pdf) | WaveRNN (vocoder) | Efficient Neural Audio Synthesis | [fatchord/WaveRNN](https://github.com/fatchord/WaveRNN) |
@@ -85,7 +86,11 @@ You can then try the toolbox:
 
 ## FAQ
 #### 1. Where can I download the datasets?
-[aidatatang_200zh](http://www.openslr.org/62/), [magicdata](http://www.openslr.org/68/), [aishell3](http://www.openslr.org/93/)
+| Dataset | Original Source | Alternative Sources |
+| --- | ----------- | ---------------|
+| aidatatang_200zh | [OpenSLR](http://www.openslr.org/62/) | [Google Drive](https://drive.google.com/file/d/110A11KZoVe7vy6kXlLb6zVPLb_J91I_t/view?usp=sharing) |
+| magicdata | [OpenSLR](http://www.openslr.org/68/) | [Google Drive (Dev set)](https://drive.google.com/file/d/1g5bWRUSNH68ycC6eNvtwh07nX3QhOOlo/view?usp=sharing) |
+| aishell3 | [OpenSLR](https://www.openslr.org/93/) | [Google Drive](https://drive.google.com/file/d/1shYp_o4Z0X0cZSKQDtFirct2luFUwKzZ/view?usp=sharing) |
 > After unzipping aidatatang_200zh, you also need to unzip all the archives under `aidatatang_200zh\corpus\train`
 
 #### 2. What is `<datasets_root>`?
encoder/preprocess.py
@@ -117,6 +117,15 @@ def _preprocess_speaker_dirs(speaker_dirs, dataset_name, datasets_root, out_dir,
     logger.finalize()
     print("Done preprocessing %s.\n" % dataset_name)
 
+def preprocess_aidatatang_200zh(datasets_root: Path, out_dir: Path, skip_existing=False):
+    dataset_name = "aidatatang_200zh"
+    dataset_root, logger = _init_preprocess_dataset(dataset_name, datasets_root, out_dir)
+    if not dataset_root:
+        return
+    # Preprocess all speakers
+    speaker_dirs = list(dataset_root.joinpath("corpus", "train").glob("*"))
+    _preprocess_speaker_dirs(speaker_dirs, dataset_name, datasets_root, out_dir, "wav",
+                             skip_existing, logger)
+
 def preprocess_librispeech(datasets_root: Path, out_dir: Path, skip_existing=False):
     for dataset_name in librispeech_datasets["train"]["other"]:
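The new entry point treats every subdirectory of `aidatatang_200zh/corpus/train` as one speaker, which matches the FAQ note above about extracting the per-speaker archives first. A minimal sketch of calling it directly (paths are hypothetical; the output directory mirrors the `<datasets_root>/SV2TTS/encoder/` default used by `encoder_preprocess.py` below):

```python
from pathlib import Path
from encoder.preprocess import preprocess_aidatatang_200zh

datasets_root = Path("../audiodata")            # hypothetical; must contain aidatatang_200zh/corpus/train/
out_dir = datasets_root / "SV2TTS" / "encoder"  # default output location from encoder_preprocess.py
preprocess_aidatatang_200zh(datasets_root, out_dir, skip_existing=True)
```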
BIN · binary file changed (not shown)
encoder_preprocess.py
@@ -1,4 +1,4 @@
-from encoder.preprocess import preprocess_librispeech, preprocess_voxceleb1, preprocess_voxceleb2
+from encoder.preprocess import preprocess_librispeech, preprocess_voxceleb1, preprocess_voxceleb2, preprocess_aidatatang_200zh
 from utils.argutils import print_args
 from pathlib import Path
 import argparse
@@ -10,17 +10,7 @@ if __name__ == "__main__":
     parser = argparse.ArgumentParser(
         description="Preprocesses audio files from datasets, encodes them as mel spectrograms and "
                     "writes them to the disk. This will allow you to train the encoder. The "
-                    "datasets required are at least one of VoxCeleb1, VoxCeleb2 and LibriSpeech. "
-                    "Ideally, you should have all three. You should extract them as they are "
-                    "after having downloaded them and put them in a same directory, e.g.:\n"
-                    "-[datasets_root]\n"
-                    "  -LibriSpeech\n"
-                    "    -train-other-500\n"
-                    "  -VoxCeleb1\n"
-                    "    -wav\n"
-                    "    -vox1_meta.csv\n"
-                    "  -VoxCeleb2\n"
-                    "    -dev",
+                    "datasets required are at least one of LibriSpeech, VoxCeleb1, VoxCeleb2, aidatatang_200zh. ",
         formatter_class=MyFormatter
     )
     parser.add_argument("datasets_root", type=Path, help=\
@@ -29,7 +19,7 @@ if __name__ == "__main__":
         "Path to the output directory that will contain the mel spectrograms. If left out, "
         "defaults to <datasets_root>/SV2TTS/encoder/")
     parser.add_argument("-d", "--datasets", type=str,
-                        default="librispeech_other,voxceleb1,voxceleb2", help=\
+                        default="librispeech_other,voxceleb1,aidatatang_200zh", help=\
         "Comma-separated list of the name of the datasets you want to preprocess. Only the train "
         "set of these datasets will be used. Possible names: librispeech_other, voxceleb1, "
         "voxceleb2.")
@@ -63,6 +53,7 @@ if __name__ == "__main__":
         "librispeech_other": preprocess_librispeech,
         "voxceleb1": preprocess_voxceleb1,
         "voxceleb2": preprocess_voxceleb2,
+        "aidatatang_200zh": preprocess_aidatatang_200zh,
     }
     args = vars(args)
     for dataset in args.pop("datasets"):
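With these changes, encoder preprocessing for the Chinese dataset can presumably be launched as `python encoder_preprocess.py <datasets_root> -d aidatatang_200zh` (and since the `-d` default now includes it, plain `python encoder_preprocess.py <datasets_root>` should cover it too). Note that the help string still lists only librispeech_other, voxceleb1, and voxceleb2 as possible names; aidatatang_200zh is accepted but left undocumented there.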
requirements.txt
@@ -19,4 +19,5 @@ flask
 flask_wtf
 flask_cors
 gevent==21.8.0
 flask_restx
+tensorboard
synthesizer/gst_hyperparameters.py · 13 (new file)

class GSTHyperparameters():
    E = 512

    # reference encoder
    ref_enc_filters = [32, 32, 64, 64, 128, 128]

    # style token layer
    token_num = 10
    # token_emb_size = 256
    num_heads = 8

    n_mels = 256  # Number of Mel banks to generate
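These constants have to be mutually consistent with the reference encoder defined in `synthesizer/models/global_style_token.py` below: six stride-2 convolutions reduce the 256 mel bins down to 4, so the GRU input is 128 * 4 = 512 = E. A standalone sanity check of that arithmetic (not part of the repo):

```python
E, num_heads, n_mels = 512, 8, 256
ref_enc_filters = [32, 32, 64, 64, 128, 128]

L = n_mels
for _ in ref_enc_filters:         # each Conv2d: kernel 3, stride 2, padding 1
    L = (L - 3 + 2 * 1) // 2 + 1  # same formula as ReferenceEncoder.calculate_channels
print(L)                          # 4 mel bins remain after the conv stack
print(ref_enc_filters[-1] * L)    # 512 -> GRU input size
print(E // 2)                     # 256 -> GRU hidden size
print(E // num_heads)             # 64  -> per-token embedding width in STL
```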
synthesizer/hparams.py
@@ -49,12 +49,15 @@ hparams = HParams(
                                 # frame that has all values < -3.4
 
     ### Tacotron Training
-    tts_schedule = [(2, 1e-3, 20_000, 24),   # Progressive training schedule
-                    (2, 5e-4, 40_000, 24),   # (r, lr, step, batch_size)
-                    (2, 2e-4, 80_000, 24),   #
-                    (2, 1e-4, 160_000, 24),  # r = reduction factor (# of mel frames
-                    (2, 3e-5, 320_000, 24),  # synthesized for each decoder iteration)
-                    (2, 1e-5, 640_000, 24)], # lr = learning rate
+    tts_schedule = [(2, 1e-3, 10_000, 12),   # Progressive training schedule
+                    (2, 5e-4, 15_000, 12),   # (r, lr, step, batch_size)
+                    (2, 2e-4, 20_000, 12),   # (r, lr, step, batch_size)
+                    (2, 1e-4, 30_000, 12),   #
+                    (2, 5e-5, 40_000, 12),   #
+                    (2, 1e-5, 60_000, 12),   #
+                    (2, 5e-6, 160_000, 12),  # r = reduction factor (# of mel frames
+                    (2, 3e-6, 320_000, 12),  # synthesized for each decoder iteration)
+                    (2, 1e-6, 640_000, 12)], # lr = learning rate
 
     tts_clip_grad_norm = 1.0,  # clips the gradient norm to prevent explosion - set to None if not needed
     tts_eval_interval = 500,   # Number of steps between model evaluation (sample generation)
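The new schedule halves the batch size (24 to 12) and adds finer-grained learning-rate steps early in training. A minimal sketch of how a `(r, lr, step, batch_size)` schedule like this is typically consumed (assumed logic for illustration, not the repo's trainer):

```python
def current_settings(step, tts_schedule):
    """Return the (r, lr, batch_size) in effect at a given training step.

    Each entry is (r, lr, step_threshold, batch_size); the first entry whose
    threshold is still ahead of the current step applies.
    """
    for r, lr, threshold, batch_size in tts_schedule:
        if step < threshold:
            return r, lr, batch_size
    r, lr, _, batch_size = tts_schedule[-1]
    return r, lr, batch_size

# e.g. at step 25_000 the new schedule yields lr = 1e-4 and batch_size = 12
```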
synthesizer/inference.py
@@ -70,7 +70,7 @@ class Synthesizer:
 
     def synthesize_spectrograms(self, texts: List[str],
                                 embeddings: Union[np.ndarray, List[np.ndarray]],
-                                return_alignments=False):
+                                return_alignments=False, style_idx=0):
         """
         Synthesizes mel spectrograms from texts and speaker embeddings.
 
@@ -125,7 +125,7 @@ class Synthesizer:
             speaker_embeddings = torch.tensor(speaker_embeds).float().to(self.device)
 
             # Inference
-            _, mels, alignments = self._model.generate(chars, speaker_embeddings)
+            _, mels, alignments = self._model.generate(chars, speaker_embeddings, style_idx=style_idx)
             mels = mels.detach().cpu().numpy()
             for m in mels:
                 # Trim silence from end of each spectrogram
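A hedged sketch of the new call path from user code (the checkpoint path and embedding file are assumptions; the embedding is a GE2E speaker embedding as produced by `encoder.embed_utterance`):

```python
import numpy as np
from pathlib import Path
from synthesizer.inference import Synthesizer

synth = Synthesizer(Path("synthesizer/saved_models/my_run/my_run.pt"))  # hypothetical checkpoint
embed = np.load("speaker_embed.npy")  # precomputed GE2E embedding of the reference voice
# style_idx selects one of the 10 learned style tokens; values outside 0..9 skip the style branch
specs = synth.synthesize_spectrograms(["欢迎使用拟声鸟工具箱"], [embed], style_idx=3)
```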
synthesizer/models/global_style_token.py · 135 (new file)

import torch
import torch.nn as nn
import torch.nn.init as init
import torch.nn.functional as tFunctional
from synthesizer.gst_hyperparameters import GSTHyperparameters as hp


class GlobalStyleToken(nn.Module):

    def __init__(self):
        super().__init__()
        self.encoder = ReferenceEncoder()
        self.stl = STL()

    def forward(self, inputs):
        enc_out = self.encoder(inputs)
        style_embed = self.stl(enc_out)

        return style_embed


class ReferenceEncoder(nn.Module):
    '''
    inputs --- [N, Ty/r, n_mels*r]  mels
    outputs --- [N, ref_enc_gru_size]
    '''

    def __init__(self):
        super().__init__()
        K = len(hp.ref_enc_filters)
        filters = [1] + hp.ref_enc_filters
        convs = [nn.Conv2d(in_channels=filters[i],
                           out_channels=filters[i + 1],
                           kernel_size=(3, 3),
                           stride=(2, 2),
                           padding=(1, 1)) for i in range(K)]
        self.convs = nn.ModuleList(convs)
        self.bns = nn.ModuleList([nn.BatchNorm2d(num_features=hp.ref_enc_filters[i]) for i in range(K)])

        out_channels = self.calculate_channels(hp.n_mels, 3, 2, 1, K)
        self.gru = nn.GRU(input_size=hp.ref_enc_filters[-1] * out_channels,
                          hidden_size=hp.E // 2,
                          batch_first=True)

    def forward(self, inputs):
        N = inputs.size(0)
        out = inputs.view(N, 1, -1, hp.n_mels)  # [N, 1, Ty, n_mels]
        for conv, bn in zip(self.convs, self.bns):
            out = conv(out)
            out = bn(out)
            out = tFunctional.relu(out)  # [N, 128, Ty//2^K, n_mels//2^K]

        out = out.transpose(1, 2)  # [N, Ty//2^K, 128, n_mels//2^K]
        T = out.size(1)
        N = out.size(0)
        out = out.contiguous().view(N, T, -1)  # [N, Ty//2^K, 128*n_mels//2^K]

        self.gru.flatten_parameters()
        memory, out = self.gru(out)  # out --- [1, N, E//2]

        return out.squeeze(0)

    def calculate_channels(self, L, kernel_size, stride, pad, n_convs):
        for i in range(n_convs):
            L = (L - kernel_size + 2 * pad) // stride + 1
        return L


class STL(nn.Module):
    '''
    inputs --- [N, E//2]
    '''

    def __init__(self):
        super().__init__()
        self.embed = nn.Parameter(torch.FloatTensor(hp.token_num, hp.E // hp.num_heads))
        d_q = hp.E // 2
        d_k = hp.E // hp.num_heads
        # self.attention = MultiHeadAttention(hp.num_heads, d_model, d_q, d_v)
        self.attention = MultiHeadAttention(query_dim=d_q, key_dim=d_k, num_units=hp.E, num_heads=hp.num_heads)

        init.normal_(self.embed, mean=0, std=0.5)

    def forward(self, inputs):
        N = inputs.size(0)
        query = inputs.unsqueeze(1)  # [N, 1, E//2]
        keys = tFunctional.tanh(self.embed).unsqueeze(0).expand(N, -1, -1)  # [N, token_num, E // num_heads]
        style_embed = self.attention(query, keys)

        return style_embed


class MultiHeadAttention(nn.Module):
    '''
    input:
        query --- [N, T_q, query_dim]
        key --- [N, T_k, key_dim]
    output:
        out --- [N, T_q, num_units]
    '''

    def __init__(self, query_dim, key_dim, num_units, num_heads):
        super().__init__()
        self.num_units = num_units
        self.num_heads = num_heads
        self.key_dim = key_dim

        self.W_query = nn.Linear(in_features=query_dim, out_features=num_units, bias=False)
        self.W_key = nn.Linear(in_features=key_dim, out_features=num_units, bias=False)
        self.W_value = nn.Linear(in_features=key_dim, out_features=num_units, bias=False)

    def forward(self, query, key):
        querys = self.W_query(query)  # [N, T_q, num_units]
        keys = self.W_key(key)  # [N, T_k, num_units]
        values = self.W_value(key)

        split_size = self.num_units // self.num_heads
        querys = torch.stack(torch.split(querys, split_size, dim=2), dim=0)  # [h, N, T_q, num_units/h]
        keys = torch.stack(torch.split(keys, split_size, dim=2), dim=0)  # [h, N, T_k, num_units/h]
        values = torch.stack(torch.split(values, split_size, dim=2), dim=0)  # [h, N, T_k, num_units/h]

        # score = softmax(QK^T / (d_k ** 0.5))
        scores = torch.matmul(querys, keys.transpose(2, 3))  # [h, N, T_q, T_k]
        scores = scores / (self.key_dim ** 0.5)
        scores = tFunctional.softmax(scores, dim=3)

        # out = score * V
        out = torch.matmul(scores, values)  # [h, N, T_q, num_units/h]
        out = torch.cat(torch.split(out, 1, dim=0), dim=3).squeeze(0)  # [N, T_q, num_units]

        return out
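A quick standalone shape check of the module above (assumed usage; with the hyperparameters from `gst_hyperparameters.py`, a batch of mels [N, Ty, 256] comes out as a style embedding [N, 1, 512]):

```python
import torch
from synthesizer.models.global_style_token import GlobalStyleToken

gst = GlobalStyleToken()
mels = torch.randn(4, 128, 256)  # [N, Ty, n_mels]; Ty must survive six stride-2 convs
style = gst(mels)
print(style.shape)  # torch.Size([4, 1, 512]) = [N, 1, E]
```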
synthesizer/models/tacotron.py
@@ -3,8 +3,7 @@ import numpy as np
 import torch
 import torch.nn as nn
 import torch.nn.functional as F
-from pathlib import Path
-from typing import Union
+from synthesizer.models.global_style_token import GlobalStyleToken
 
 
 class HighwayNetwork(nn.Module):
@@ -338,6 +337,7 @@ class Tacotron(nn.Module):
         self.encoder = Encoder(embed_dims, num_chars, encoder_dims,
                                encoder_K, num_highways, dropout)
         self.encoder_proj = nn.Linear(encoder_dims + speaker_embedding_size, decoder_dims, bias=False)
+        self.gst = GlobalStyleToken()
         self.decoder = Decoder(n_mels, encoder_dims, decoder_dims, lstm_dims,
                                dropout, speaker_embedding_size)
         self.postnet = CBHG(postnet_K, n_mels, postnet_dims,
@@ -358,11 +358,11 @@ class Tacotron(nn.Module):
     def r(self, value):
         self.decoder.r = self.decoder.r.new_tensor(value, requires_grad=False)
 
-    def forward(self, x, m, speaker_embedding):
+    def forward(self, texts, mels, speaker_embedding):
         device = next(self.parameters()).device  # use same device as parameters
 
         self.step += 1
-        batch_size, _, steps = m.size()
+        batch_size, _, steps = mels.size()
 
         # Initialise all hidden states and pack into tuple
         attn_hidden = torch.zeros(batch_size, self.decoder_dims, device=device)
@@ -383,7 +383,12 @@ class Tacotron(nn.Module):
 
         # SV2TTS: Run the encoder with the speaker embedding
         # The projection avoids unnecessary matmuls in the decoder loop
-        encoder_seq = self.encoder(x, speaker_embedding)
+        encoder_seq = self.encoder(texts, speaker_embedding)
+        # put after encoder
+        if self.gst is not None:
+            style_embed = self.gst(speaker_embedding)
+            style_embed = style_embed.expand_as(encoder_seq)
+            encoder_seq = encoder_seq + style_embed
         encoder_seq_proj = self.encoder_proj(encoder_seq)
 
         # Need a couple of lists for outputs
@@ -391,10 +396,10 @@ class Tacotron(nn.Module):
 
         # Run the decoder loop
         for t in range(0, steps, self.r):
-            prenet_in = m[:, :, t - 1] if t > 0 else go_frame
+            prenet_in = mels[:, :, t - 1] if t > 0 else go_frame
             mel_frames, scores, hidden_states, cell_states, context_vec, stop_tokens = \
                 self.decoder(encoder_seq, encoder_seq_proj, prenet_in,
-                             hidden_states, cell_states, context_vec, t, x)
+                             hidden_states, cell_states, context_vec, t, texts)
             mel_outputs.append(mel_frames)
             attn_scores.append(scores)
             stop_outputs.extend([stop_tokens] * self.r)
@@ -414,7 +419,7 @@ class Tacotron(nn.Module):
 
         return mel_outputs, linear, attn_scores, stop_outputs
 
-    def generate(self, x, speaker_embedding=None, steps=2000):
+    def generate(self, x, speaker_embedding=None, steps=200, style_idx=0):
         self.eval()
         device = next(self.parameters()).device  # use same device as parameters
 
@@ -440,6 +445,18 @@ class Tacotron(nn.Module):
         # SV2TTS: Run the encoder with the speaker embedding
         # The projection avoids unnecessary matmuls in the decoder loop
         encoder_seq = self.encoder(x, speaker_embedding)
+
+        # put after encoder
+        if self.gst is not None and style_idx >= 0 and style_idx < 10:
+            gst_embed = self.gst.stl.embed.cpu().data.numpy()  # [token_num, E // num_heads]
+            gst_embed = np.tile(gst_embed, (1, 8))
+            scale = np.zeros(512)
+            scale[:] = 0.3
+            speaker_embedding = (gst_embed[style_idx] * scale).astype(np.float32)
+            speaker_embedding = torch.from_numpy(np.tile(speaker_embedding, (x.shape[0], 1))).to(device)
+            style_embed = self.gst(speaker_embedding)
+            style_embed = style_embed.expand_as(encoder_seq)
+            encoder_seq = encoder_seq + style_embed
         encoder_seq_proj = self.encoder_proj(encoder_seq)
 
         # Need a couple of lists for outputs
@@ -494,7 +511,7 @@ class Tacotron(nn.Module):
         # Use device of model params as location for loaded state
         device = next(self.parameters()).device
         checkpoint = torch.load(str(path), map_location=device)
-        self.load_state_dict(checkpoint["model_state"])
+        self.load_state_dict(checkpoint["model_state"], strict=False)
 
         if "optimizer_state" in checkpoint and optimizer is not None:
             optimizer.load_state_dict(checkpoint["optimizer_state"])
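In `generate`, the style branch replaces the speaker embedding with a scaled copy of the selected token: `stl.embed` is [10, 64], tiling by (1, 8) widens it to [10, 512], and row `style_idx` scaled by 0.3 becomes a pseudo speaker embedding for the whole batch. A standalone numpy walk-through of those shapes (illustrative only):

```python
import numpy as np

token_num, token_width, batch = 10, 64, 2  # token_num and E // num_heads from GSTHyperparameters
gst_embed = np.random.randn(token_num, token_width).astype(np.float32)
gst_embed = np.tile(gst_embed, (1, 8))     # -> [10, 512]
pseudo_speaker = (gst_embed[3] * 0.3).astype(np.float32)  # style_idx = 3 -> [512]
speaker_embedding = np.tile(pseudo_speaker, (batch, 1))   # -> [batch, 512], one copy per input text
print(speaker_embedding.shape)  # (2, 512)
```

Note also that the default `steps` for generation drops from 2000 to 200, which caps the maximum decoded mel length, and that `strict=False` in `load` lets pre-GST checkpoints load even though they lack the new `gst` parameters.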
toolbox/__init__.py
@@ -71,6 +71,7 @@ class Toolbox:
 
         # Initialize the events and the interface
         self.ui = UI()
+        self.style_idx = 0
         self.reset_ui(enc_models_dir, syn_models_dir, voc_models_dir, seed)
         self.setup_events()
         self.ui.start()
@@ -233,7 +234,7 @@ class Toolbox:
         texts = processed_texts
         embed = self.ui.selected_utterance.embed
         embeds = [embed] * len(texts)
-        specs = self.synthesizer.synthesize_spectrograms(texts, embeds)
+        specs = self.synthesizer.synthesize_spectrograms(texts, embeds, style_idx=int(self.ui.style_idx_textbox.text()))
         breaks = [spec.shape[1] for spec in specs]
         spec = np.concatenate(specs, axis=1)
 
toolbox/ui.py
@@ -574,10 +574,14 @@ class UI(QDialog):
         self.seed_textbox = QLineEdit()
         self.seed_textbox.setMaximumWidth(80)
         layout_seed.addWidget(self.seed_textbox, 0, 1)
+        layout_seed.addWidget(QLabel("Style#:(0~9)"), 0, 2)
+        self.style_idx_textbox = QLineEdit("-1")
+        self.style_idx_textbox.setMaximumWidth(80)
+        layout_seed.addWidget(self.style_idx_textbox, 0, 3)
         self.trim_silences_checkbox = QCheckBox("Enhance vocoder output")
         self.trim_silences_checkbox.setToolTip("When checked, trims excess silence in vocoder output."
             " This feature requires `webrtcvad` to be installed.")
-        layout_seed.addWidget(self.trim_silences_checkbox, 0, 2, 1, 2)
+        layout_seed.addWidget(self.trim_silences_checkbox, 0, 4, 1, 2)
         gen_layout.addLayout(layout_seed)
 
         self.loading_bar = QProgressBar()
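The style textbox defaults to "-1", which deliberately falls outside the 0..9 range checked in `Tacotron.generate`, so toolbox synthesis behaves exactly as before unless the user types a valid token index.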
utils/modelutils.py
@@ -11,7 +11,6 @@ def check_model_paths(encoder_path: Path, synthesizer_path: Path, vocoder_path:
 
     # If none of the paths exist, remind the user to download models if needed
     print("********************************************************************************")
-    print("Error: Model files not found. Follow these instructions to get and install the models:")
-    print("https://github.com/CorentinJ/Real-Time-Voice-Cloning/wiki/Pretrained-models")
+    print("Error: Model files not found. Please download the models")
     print("********************************************************************************\n")
     quit(-1)
web/__init__.py
@@ -9,10 +9,12 @@ from vocoder.wavernn import inference as rnn_vocoder
 import numpy as np
 import re
 from scipy.io.wavfile import write
+import librosa
 import io
 import base64
 from flask_cors import CORS
 from flask_wtf import CSRFProtect
+import webbrowser
 
 def webApp():
     # Init and load config
@@ -29,6 +31,7 @@ def webApp():
     synthesizers = list(Path(syn_models_dirt).glob("**/*.pt"))
     synthesizers_cache = {}
     encoder.load_model(Path("encoder/saved_models/pretrained.pt"))
+    # rnn_vocoder.load_model(Path("vocoder/saved_models/pretrained/pretrained.pt"))
    gan_vocoder.load_model(Path("vocoder/saved_models/pretrained/g_hifigan.pt"))
 
     def pcm2float(sig, dtype='float32'):
@@ -65,7 +68,6 @@ def webApp():
     @app.route("/api/synthesize", methods=["POST"])
     def synthesize():
         # TODO Implementation with json to support more platform
-
         # Load synthesizer
         if "synt_path" in request.form:
             synt_path = request.form["synt_path"]
@@ -79,10 +81,16 @@ def webApp():
             current_synt = synthesizers_cache[synt_path]
         print("using synthesizer model: " + str(synt_path))
         # Load input wav
-        wav_base64 = request.form["upfile_b64"]
-        wav = base64.b64decode(bytes(wav_base64, 'utf-8'))
-        wav = pcm2float(np.frombuffer(wav, dtype=np.int16), dtype=np.float32)
-        encoder_wav = encoder.preprocess_wav(wav, 16000)
+        if "upfile_b64" in request.form:
+            wav_base64 = request.form["upfile_b64"]
+            wav = base64.b64decode(bytes(wav_base64, 'utf-8'))
+            wav = pcm2float(np.frombuffer(wav, dtype=np.int16), dtype=np.float32)
+            sample_rate = Synthesizer.sample_rate
+        else:
+            wav, sample_rate, = librosa.load(request.files['file'])
+            write("temp.wav", sample_rate, wav)  # Make sure we get the correct wav
+
+        encoder_wav = encoder.preprocess_wav(wav, sample_rate)
         embed, _, _ = encoder.embed_utterance(encoder_wav, return_partials=True)
 
         # Load input text
@@ -99,6 +107,7 @@ def webApp():
         embeds = [embed] * len(texts)
         specs = current_synt.synthesize_spectrograms(texts, embeds)
         spec = np.concatenate(specs, axis=1)
+        # wav = rnn_vocoder.infer_waveform(spec)
         wav = gan_vocoder.infer_waveform(spec)
 
         # Return cooked wav
@@ -112,10 +121,11 @@ def webApp():
 
     host = app.config.get("HOST")
     port = app.config.get("PORT")
-    print(f"Web server: http://{host}:{port}")
+    web_address = 'http://{}:{}'.format(host, port)
+    print(f"Web server:" + web_address)
+    webbrowser.open(web_address)
     server = wsgi.WSGIServer((host, port), app)
     server.serve_forever()
 
     return app
 
 if __name__ == "__main__":
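With the new `request.files['file']` branch, the endpoint accepts an ordinary multipart upload in addition to the old base64 form field. A hedged client sketch (server address, checkpoint path, and file names are assumptions; since the app enables CSRF protection, a raw client may also need a valid token, which the bundled web page supplies via `{{ csrf_token() }}`):

```python
import requests

resp = requests.post(
    "http://localhost:8080/api/synthesize",  # default HOST/PORT from web/config.py
    data={
        "text": "欢迎使用拟声鸟工具箱",  # text to synthesize
        "synt_path": "synthesizer/saved_models/my_run/my_run.pt",  # optional; hypothetical checkpoint
    },
    files={"file": open("reference.wav", "rb")},  # reference voice, read by librosa on the server
)
with open("result.wav", "wb") as f:
    f.write(resp.content)  # the endpoint returns the synthesized wav
```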
web/config.py
@@ -5,3 +5,4 @@ PORT = 8080
 MAX_CONTENT_PATH =1024 * 1024 * 4  # uploaded mp3 files are limited to 4 MB
 SECRET_KEY = "mockingbird_key"
 WTF_CSRF_SECRET_KEY = "mockingbird_key"
+TEMPLATES_AUTO_RELOAD = True
web/static/img/bird-sm.png · BIN (new file, 40 KiB, not shown)
web/static/img/bird.png · BIN (new file, 39 KiB, not shown)
web/static/img/mockingbird.png · BIN (new file, 89 KiB, not shown)
web/templates/index.html
@@ -4,8 +4,7 @@
 <head>
     <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
     <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=0">
-    <link rel="shortcut icon" type="image/png"
-        href="https://cdn.jsdelivr.net/gh/xiangyuecn/Recorder@latest/assets/icon.png">
+    <link rel="shortcut icon" type="image/png" href="../static/img/bird-sm.png">
 
     <title>MockingBird Web Server</title>
 
@@ -24,50 +23,105 @@
     <div class="main">
 
         <div class="mainBox">
-            <div class="pd btns">
+            <div class="title" >
+                <div style="width: 15%;float: left;margin-left: 5%;">
+                    <img src="../static/img/bird.png" style="width: 100%;border-radius:50%;"></img>
+                </div>
+                <div style="width: 80% ;height: 15%;; margin-left: 15%;overflow: hidden;">
+                    <div style="margin-left: 5%;margin-top: 15px;font-size: xx-large;font-weight: bolder;">
+                        拟声鸟工具箱
+                    </div>
+                    <div style="margin-left: 5%;margin-top: 3px;font-size: large;">
+                        <a href="https://github.com/babysor/MockingBird" target="_blank">https://github.com/babysor/MockingBird</a>
+                    </div>
+                </div>
+            </div>
+
+            <div style="margin-left: 5%;margin-top: 50px;width: 90%;">
+                <div style="font-size: larger;font-weight: bolder;">1. 请输入中文</div>
+                <textarea id="user_input_text"
+                    style="border:1px solid #ccc; width: 100%; height: 100px; font-size: 15px; margin-top: 10px;"></textarea>
+            </div>
+            <div class="pd btns" style="margin-left: 5%;margin-top: 20px;width: 90%; ">
                 <!-- <div>
                     <button onclick="recOpen()" style="margin-right:10px">打开录音,请求权限</button>
                     <button onclick="recClose()" style="margin-right:0">关闭录音,释放资源</button>
                 </div> -->
-                <button onclick="recStart()" style="margin-left:100px">录制</button>
-                <button onclick="recStop()" style="margin-left:100px">停止</button>
-                <button onclick="recPlay()" style="margin-left:100px">播放</button>
+                <div style="font-size: larger;font-weight: bolder;">2. 请直接录音,点击停止结束</div>
+                <button onclick="recStart()" >录制</button>
+                <button onclick="recStop()">停止</button>
+                <button onclick="recPlay()" >播放</button>
             </div>
+            <div class="pd btns" style="margin-left: 5%;margin-top: 20px;width: 90%; ">
+                <div style="font-size: larger;font-weight: bolder;">或上传音频</div>
+                <input type="file" id="fileInput" accept=".wav" />
+                <label for="fileInput">选择音频</label>
+                <div id="audio1"></div>
+            </div>
+            <div class="pd btns" style="margin-left: 5%;margin-top: 20px;width: 90%; ">
+                <div style="font-size: larger;font-weight: bolder;">3. 选择Synthesizer模型</div>
+                <span class="box">
+                    <select id="select">
+                    </select>
+                </span>
+            </div>
+            <div class="pd btns" style="margin-left: 5%;margin-top: 20px;width: 90%; text-align:right;">
+                <button id="upload" onclick="recUpload()">上传合成</button>
+            </div>
 
             <!-- waveform drawing area -->
-            <div class="pd recpower">
+            <!-- <div class="pd recpower">
                 <div style="height:40px;width:100%;background:#fff;position:relative;">
                     <div class="recpowerx" style="height:40px;background:#ff3295;position:absolute;"></div>
                     <div class="recpowert" style="padding-left:50px; line-height:40px; position: relative;"></div>
                 </div>
-            </div>
+            </div> -->
-            <div class="pd waveBox" style="height:100px;">
+            <!-- <div class="pd waveBox" style="height:100px;">
                 <div style="border:1px solid #ccc;display:inline-block; width: 100%; height: 100px;">
-                    <div style="height:100px; width: 100%; background-color: #FE76B8; position: relative;left: 0px;top: 0px;z-index: 10;"
+                    <div style="height:100px; width: 100%; background-color: #5da1f5; position: relative;left: 0px;top: 0px;z-index: 10;"
                        class="recwave"></div>
                     <div
                         style="background-color: transparent;position: relative;top: -80px;left: 30%;z-index: 20;font-size: 48px;color: #fff;">
                         音频预览</div>
                 </div>
-            </div>
+            </div> -->
-            <div>
-                <div>请输入文本:</div>
-                <input type="text" id="user_input_text"
-                    style="border:1px solid #ccc; width: 100%; height: 20px; font-size: 18px;" />
-            </div>
-            <div class="pd btns">
-                <button onclick="recUpload()" style="margin-left: 300px; margin-top: 15px;">上传</button>
-            </div>
-        </div>
-
-        <!-- log output area -->
-        <div class="mainBox">
-            <div class="reclog"></div>
+            <div class="reclog" style="margin-left: 5%;margin-top: 20px;width: 90%;"></div>
         </div>
     </div>
 
 
     <script>
 
+        $("#fileInput").change(function(){
+            var file = $("#fileInput").get(0).files;
+            if (file.length > 0) {
+                var path = URL.createObjectURL(file[0]);
+                var audio = document.createElement('audio');
+                audio.src = path;
+                audio.controls = true;
+                $('#audio1').empty().append(audio);
+            }
+        });
+
+        fetch("/api/synthesizers", {
+            method: 'get',
+            headers: {
+                "X-CSRFToken": "{{ csrf_token() }}"
+            }
+        }).then(function (res) {
+            if (!res.ok) throw Error(res.statusText);
+            return res.json();
+        }).then(function (data) {
+            for (var synt of data) {
+                var option = document.createElement('option');
+                option.text = synt.name
+                option.value = synt.path
+                $("#select").append(option);
+            }
+        }).catch(function (err) {
+            console.log('Error: ' + err.message);
+        })
+
         var rec, wave, recBlob;
         /** call open to request microphone permission up front **/
         var recOpen = function () { // typically called when the recording UI is shown, so the user's later click on Record proceeds without a prompt
@@ -78,11 +132,11 @@
             type: "wav", bitRate: 16, sampleRate: 16000
             , onProcess: function (buffers, powerLevel, bufferDuration, bufferSampleRate, newBufferIdx, asyncEnd) {
                 // real-time recording callback, invoked roughly 12 times per second
-                document.querySelector(".recpowerx").style.width = powerLevel + "%";
-                document.querySelector(".recpowert").innerText = bufferDuration + " / " + powerLevel;
+                // document.querySelector(".recpowerx").style.width = powerLevel + "%";
+                // document.querySelector(".recpowert").innerText = bufferDuration + " / " + powerLevel;
 
                 // waveform visualization
-                wave.input(buffers[buffers.length - 1], powerLevel, bufferSampleRate);
+                // wave.input(buffers[buffers.length - 1], powerLevel, bufferSampleRate);
             }
         });
 
@@ -93,7 +147,7 @@
             rec = newRec;
 
            // create the audio visualization; browser support for this is solid
-            wave = Recorder.FrequencyHistogramView({ elem: ".recwave" });
+            // wave = Recorder.FrequencyHistogramView({ elem: ".recwave" });
 
            reclog("已打开录音,可以点击录制开始录音了", 2);
         }, function (msg, isUserNotAllow) { // user denied permission, or recording is unsupported
@@ -186,15 +240,21 @@
 
         /** upload **/
         function recUpload() {
-            var blob = recBlob;
+            var blob
+            var loadedAudios = $("#fileInput").get(0).files
+            if (loadedAudios.length > 0) {
+                blob = loadedAudios[0];
+            } else {
+                blob = recBlob;
+            }
             if (!blob) {
-                reclog("请先录音,然后停止后再上传", 1);
+                reclog("请先录音或选择音频,然后停止后再上传", 1);
                 return;
             };
 
             // this example assumes a raw XMLHttpRequest-style request; adapt it to your own request method
             // when recording ends you have a blob; read it with FileReader, or upload it with FormData
-            var api = "http://127.0.0.1:8080/api/synthesize";
+            var api = "/api/synthesize";
 
             reclog("开始上传到" + api + ",请求稍后...");
 
@@ -203,15 +263,18 @@
             var csrftoken = "{{ csrf_token() }}";
             var user_input_text = document.getElementById("user_input_text");
             var input_text = user_input_text.value;
-            var postData = "";
-            postData += "mime=" + encodeURIComponent(blob.type); // tell the backend the recording format; may be unnecessary if both sides assume a fixed format
-            postData += "&upfile_b64=" + encodeURIComponent((/.+;\s*base64\s*,\s*(.+)$/i.exec(reader.result) || [])[1]) // audio payload; the backend base64-decodes it to binary
-            postData += "&text=" + encodeURIComponent(input_text);
+            var postData = new FormData();
+            postData.append("text", input_text)
+            postData.append("file", blob)
+            var sel = document.getElementById("select");
+            var path = sel.options[sel.selectedIndex].value;
+            if (!!path) {
+                postData.append("synt_path", path);
+            }
 
             fetch(api, {
                 method: 'post',
                 headers: {
-                    "Content-type": "application/x-www-form-urlencoded; charset=UTF-8",
                     "X-CSRFToken": csrftoken
                 },
                 body: postData
@@ -277,7 +340,7 @@
             var div = document.createElement("div");
             var elem = document.querySelector(".reclog");
             elem.insertBefore(div, elem.firstChild);
-            div.innerHTML = '<div style="color:' + (!color ? "" : color == 1 ? "red" : color == 2 ? "#FE76B8" : color) + '">[' + t + ']' + s + '</div>';
+            div.innerHTML = '<div style="color:' + (!color ? "" : color == 1 ? "#327de8" : color == 2 ? "#5da1f5" : color) + '">[' + t + ']' + s + '</div>';
         };
         window.onerror = function (message, url, lineNo, columnNo, error) {
             reclog('<span style="color:red">【Uncaught Error】' + message + '<pre>' + "at:" + lineNo + ":" + columnNo + " url:" + url + "\n" + (error && error.stack || "不能获得错误堆栈") + '</pre></span>');
@@ -312,11 +375,11 @@
 
         a {
             text-decoration: none;
-            color: #FE76B8;
+            color: #327de8;
         }
 
         a:hover {
-            color: #f00;
+            color: #5da1f5;
         }
 
         .main {
@@ -330,7 +393,6 @@
             padding: 12px;
             border-radius: 6px;
             background: #fff;
-            --border: 1px solid #f60;
             box-shadow: 2px 2px 3px #aaa;
         }
 
@@ -340,20 +402,31 @@
             cursor: pointer;
             border: none;
             border-radius: 3px;
-            background: #FE76B8;
+            background: #5698c3;
             color: #fff;
             padding: 0 15px;
-            margin: 3px 20px 3px 0;
+            margin: 3px 10px 3px 0;
+            width: 70px;
             line-height: 36px;
             height: 36px;
             overflow: hidden;
             vertical-align: middle;
         }
 
-        .btns button:active {
-            background: #fd54a6
+        .btns #upload {
+            background: #5698c3;
+            color: #fff;
+            width: 100px;
+            height: 42px;
         }
 
+        .btns button:active {
+            background: #5da1f5
+        }
+
+        .btns button:hover {
+            background: #5da1f5
+        }
         .pd {
             padding: 0 0 6px 0;
         }
@@ -361,12 +434,74 @@
         .lb {
            display: inline-block;
             vertical-align: middle;
-            background: #ff3d9b;
+            background: #327de8;
             color: #fff;
             font-size: 14px;
             padding: 2px 8px;
             border-radius: 99px;
         }
 
+        #fileInput {
+            width: 0.1px;
+            height: 0.1px;
+            opacity: 0;
+            overflow: hidden;
+            position: absolute;
+            z-index: -1;
+        }
+        #fileInput + label {
+            padding: 0 15px;
+            border-radius: 4px;
+            color: white;
+            background-color: #5698c3;
+            display: inline-block;
+            width: 70px;
+            line-height: 36px;
+            height: 36px;
+        }
+        #fileInput + label {
+            cursor: pointer; /* "hand" cursor */
+        }
+        #fileInput:focus + label,
+        #fileInput + label:hover {
+            background-color: #5da1f5;
+        }
+
+        .box select {
+            background-color: #5698c3;
+            color: white;
+            padding: 8px;
+            width: 120px;
+            border: none;
+            border-radius: 4px;
+            font-size: 0.5em;
+            outline: none;
+            margin: 3px 10px 3px 0;
+        }
+
+        .box::before {
+            content: "\f13a";
+            position: absolute;
+            top: 0;
+            right: 0;
+            width: 20%;
+            height: 100%;
+            text-align: center;
+            font-size: 28px;
+            line-height: 45px;
+            color: rgba(255, 255, 255, 0.5);
+            background-color: rgba(255, 255, 255, 0.1);
+            pointer-events: none;
+        }
+
+        .box:hover::before {
+            color: rgba(255, 255, 255, 0.6);
+            background-color: rgba(255, 255, 255, 0.2);
+        }
+
+        .box select option {
+            padding: 30px;
+        }
     </style>
 
 </body>