
VITS

VITS is an end-to-end TTS model based on variational inference; it merges the acoustic model and the vocoder into a single network.

Paper

https://arxiv.org/abs/2106.06103

https://github.com/jaywalnut310/vits

The overall VITS architecture is shown below:
[Figure: VITS framework diagram]

Experiments

Chinese scenario (with prosody embedding)

Training uses the Chinese Biaobei (DataBaker) dataset. Example synthesis inputs (kept in Chinese, since they are the actual test utterances):

  • 祝大家中秋节快乐
  • 我说点什么好呢?念一个绕口令吧。八百标兵奔北坡,炮兵并排北边跑,炮兵怕把标兵碰,标兵怕碰炮兵炮。八百标兵奔北坡,北坡八百炮兵炮,标兵怕碰炮兵炮,炮兵怕把标兵碰。八了百了标了兵了奔了北了坡,炮了兵了并了排了北了边了跑,炮了兵了怕了把了标了兵了碰,标了兵了怕了碰了炮了兵了炮。

Mixed Chinese-English scenario

  • 我刚刚去 Starbucks 买了杯 Vanilla Latte 和两块 Oatmeal Raisin Cookie, 搭配起来还蛮不错的。
  • 你多吃一点 means “Have some more.” 而慢慢吃 expresses politeness to someone when eating.
  • 这是一个【bad case】,共同成长。我可以读共同富裕,就是不能说共同成长.

Model and Code Walkthrough

TextEncoder

The Text Encoder maps $c_{text}$ to $h_{text}$. It consists of 6 transformer encoder layers, with n_heads = 2 in the MultiHeadAttention and kernel_size = 3 in the FFN. Note that it uses relative positional representations instead of absolute positional encoding. The final proj layer maps $h_{text}$ to the mean and variance of the prior distribution.

The Text Encoder has 6,353,664 parameters; the exact number varies slightly with the size of the phoneme vocabulary (n_vocab).
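The parameter counts quoted in this post can be reproduced by summing the element counts of a module's parameters, the same way as the commented-out num_param lines in the code below. A minimal, self-contained sketch (the Conv1d here is just a stand-in module, not part of VITS):

import torch.nn as nn

def count_parameters(module: nn.Module) -> int:
    # total number of scalar parameters in a module
    return sum(p.numel() for p in module.parameters())

# stand-in example: a 1-D convolution with 192 input/output channels and kernel size 1
conv = nn.Conv1d(192, 192, 1)
print(count_parameters(conv))  # 192*192 weights + 192 biases = 37,056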

The detailed code is as follows:

class TextEncoder(nn.Module):
    def __init__(self,
                 n_vocab,
                 out_channels,
                 hidden_channels,
                 filter_channels,
                 n_heads,
                 n_layers,
                 kernel_size,
                 p_dropout):
        super().__init__()
        self.n_vocab = n_vocab
        self.out_channels = out_channels
        self.hidden_channels = hidden_channels
        self.filter_channels = filter_channels
        self.n_heads = n_heads
        self.n_layers = n_layers
        self.kernel_size = kernel_size
        self.p_dropout = p_dropout

        self.emb = nn.Embedding(n_vocab, hidden_channels)
        nn.init.normal_(self.emb.weight, 0.0, hidden_channels**-0.5)

        # prosody embedding added for the Chinese prosody-label experiments
        self.prosody_emb = nn.Embedding(5, hidden_channels)
        nn.init.normal_(self.prosody_emb.weight, 0.0, hidden_channels**-0.5)

        self.encoder = attentions.Encoder(
            hidden_channels,
            filter_channels,
            n_heads,
            n_layers,
            kernel_size,
            p_dropout)
        self.proj = nn.Conv1d(hidden_channels, out_channels * 2, 1)

    def forward(self, x, x_lengths, prosody):
        # x: [8, 101], x_lengths: [8], prosody: [8, 101]
        x = self.emb(x) * math.sqrt(self.hidden_channels)  # [b, t, h]
        prosody = self.prosody_emb(prosody) * math.sqrt(self.hidden_channels)
        x = x + prosody
        x = torch.transpose(x, 1, -1)  # [b, h, t]
        x_mask = torch.unsqueeze(commons.sequence_mask(x_lengths, x.size(2)), 1).to(x.dtype)
        # x_mask: [8, 1, 101]
        x = self.encoder(x * x_mask, x_mask)
        stats = self.proj(x) * x_mask

        m, logs = torch.split(stats, self.out_channels, dim=1)  # mean and log-scale of the prior
        return x, m, logs, x_mask

PosteriorEncoder

Its main role is to map the linear spectrogram to the mean and variance of the posterior distribution. It is built from non-causal WaveNet residual blocks; a WaveNet residual block consists of layers of dilated convolutions with a gated activation unit and skip connections.

Why a linear spectrogram rather than a mel spectrogram? The authors' reply: "In our problem setting, we aim to provide more high-resolution information for the posterior encoder. We, therefore, use the linear-scale spectrogram of target speech $x_{lin}$ as input rather than the mel-spectrogram."

  • pre: a 1-D convolution with kernel size 1 that maps the 513-dim linear spectrogram to 192 channels
  • enc: the WaveNet (WN) module; here dilation_rate is 1
    The data flow of the WN module is shown below:
    [Figure: WN module]
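The gated activation unit mentioned above is what commons.fused_add_tanh_sigmoid_multiply implements inside WN: the pre-activation (input plus conditioning) is split into two halves along the channel axis, one half goes through tanh, the other through sigmoid, and the two are multiplied. A rough, self-contained sketch of that behaviour (the helper in the repo may differ in detail):

import torch

def gated_activation(x_in, g_l, n_channels):
    # x_in, g_l: [b, 2*n_channels, t]; roughly what commons.fused_add_tanh_sigmoid_multiply does
    in_act = x_in + g_l
    t_act = torch.tanh(in_act[:, :n_channels, :])
    s_act = torch.sigmoid(in_act[:, n_channels:, :])
    return t_act * s_act  # [b, n_channels, t]

x_in = torch.randn(8, 2 * 192, 654)   # output of one in_layer
g_l = torch.zeros_like(x_in)          # no speaker conditioning (gin_channels = 0)
print(gated_activation(x_in, g_l, 192).shape)  # torch.Size([8, 192, 654])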

The WN module code is as follows:

class WN(torch.nn.Module):
    def __init__(self, hidden_channels, kernel_size, dilation_rate, n_layers, gin_channels=0, p_dropout=0):
        super(WN, self).__init__()
        assert kernel_size % 2 == 1
        self.hidden_channels = hidden_channels
        self.kernel_size = kernel_size
        self.dilation_rate = dilation_rate
        self.n_layers = n_layers
        self.gin_channels = gin_channels
        self.p_dropout = p_dropout

        self.in_layers = torch.nn.ModuleList()
        self.res_skip_layers = torch.nn.ModuleList()
        self.drop = nn.Dropout(p_dropout)

        if gin_channels != 0:
            cond_layer = torch.nn.Conv1d(gin_channels, 2*hidden_channels*n_layers, 1)
            self.cond_layer = torch.nn.utils.weight_norm(cond_layer, name='weight')

        for i in range(n_layers):
            dilation = dilation_rate ** i
            padding = int((kernel_size * dilation - dilation) / 2)
            in_layer = torch.nn.Conv1d(hidden_channels, 2*hidden_channels, kernel_size,
                                       dilation=dilation, padding=padding)
            in_layer = torch.nn.utils.weight_norm(in_layer, name='weight')
            self.in_layers.append(in_layer)

            # last one is not necessary
            if i < n_layers - 1:
                res_skip_channels = 2 * hidden_channels
            else:
                res_skip_channels = hidden_channels

            res_skip_layer = torch.nn.Conv1d(hidden_channels, res_skip_channels, 1)
            res_skip_layer = torch.nn.utils.weight_norm(res_skip_layer, name='weight')
            self.res_skip_layers.append(res_skip_layer)

    def forward(self, x, x_mask, g=None, **kwargs):
        output = torch.zeros_like(x)
        n_channels_tensor = torch.IntTensor([self.hidden_channels])  # tensor([192], dtype=torch.int32)

        if g is not None:
            g = self.cond_layer(g)

        for i in range(self.n_layers):
            x_in = self.in_layers[i](x)
            if g is not None:
                cond_offset = i * 2 * self.hidden_channels
                g_l = g[:, cond_offset:cond_offset+2*self.hidden_channels, :]
            else:
                g_l = torch.zeros_like(x_in)

            # gated activation unit: tanh(.) * sigmoid(.)
            acts = commons.fused_add_tanh_sigmoid_multiply(
                x_in,
                g_l,
                n_channels_tensor)
            acts = self.drop(acts)

            res_skip_acts = self.res_skip_layers[i](acts)
            if i < self.n_layers - 1:
                res_acts = res_skip_acts[:, :self.hidden_channels, :]
                x = (x + res_acts) * x_mask
                output = output + res_skip_acts[:, self.hidden_channels:, :]
            else:
                output = output + res_skip_acts
        return output * x_mask

    def remove_weight_norm(self):
        if self.gin_channels != 0:
            torch.nn.utils.remove_weight_norm(self.cond_layer)
        for l in self.in_layers:
            torch.nn.utils.remove_weight_norm(l)
        for l in self.res_skip_layers:
            torch.nn.utils.remove_weight_norm(l)

  • proj: a 1x1 convolution that maps the 192 channels to 192*2, which are then split into the mean and variance of the posterior distribution

The PosteriorEncoder has 7,238,016 parameters.

The detailed code is as follows:

class PosteriorEncoder(nn.Module):
    def __init__(self,
                 in_channels,
                 out_channels,
                 hidden_channels,
                 kernel_size,
                 dilation_rate,
                 n_layers,
                 gin_channels=0):
        super().__init__()
        self.in_channels = in_channels          # 513
        self.out_channels = out_channels        # 192
        self.hidden_channels = hidden_channels  # 192
        self.kernel_size = kernel_size          # 5
        self.dilation_rate = dilation_rate      # 1
        self.n_layers = n_layers                # 16
        self.gin_channels = gin_channels        # 0

        self.pre = nn.Conv1d(in_channels, hidden_channels, 1)
        self.enc = modules.WN(hidden_channels, kernel_size, dilation_rate, n_layers, gin_channels=gin_channels)
        self.proj = nn.Conv1d(hidden_channels, out_channels * 2, 1)
        # num_param = sum(param.numel() for param in self.parameters())  # parameter count: 7,238,016
        # print(num_param)

    def forward(self, x, x_lengths, g=None):
        x_mask = torch.unsqueeze(commons.sequence_mask(x_lengths, x.size(2)), 1).to(x.dtype)
        x = self.pre(x) * x_mask
        # print(x.shape)  # [8, 192, 654]
        x = self.enc(x, x_mask, g=g)
        # print(x.shape)  # [8, 192, 654] -- the shape is unchanged by WN
        stats = self.proj(x) * x_mask
        m, logs = torch.split(stats, self.out_channels, dim=1)
        # reparameterized sample from N(m, exp(logs)^2)
        z = (m + torch.randn_like(m) * torch.exp(logs)) * x_mask
        return z, m, logs, x_mask
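The last two lines of forward are the reparameterization trick: instead of sampling z directly from N(m, exp(logs)^2), a standard-normal sample is scaled by exp(logs) and shifted by m, which keeps the sampling step differentiable with respect to m and logs. A small standalone illustration:

import torch

m = torch.zeros(8, 192, 654)     # posterior mean
logs = torch.zeros(8, 192, 654)  # posterior log-std (exp(logs) = 1 here)
eps = torch.randn_like(m)        # standard normal noise

z = m + eps * torch.exp(logs)    # a sample from N(m, exp(logs)^2), differentiable w.r.t. m and logs
print(z.shape)                   # torch.Size([8, 192, 654])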

Generator

The Generator upsamples the latent variable obtained by passing the linear spectrogram through the PosteriorEncoder into the raw waveform. Its structure is exactly the same as the generator in HiFi-GAN; the only difference is that HiFi-GAN upsamples mel-spectrograms instead. See the earlier HiFi-GAN article for details.
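As a quick sanity check on the upsampling factors: in the LJSpeech-style configs shipped with the repo (an assumption here, not something stated in this post), the generator's upsample_rates are [8, 8, 2, 2], whose product equals the STFT hop length of 256, so each latent frame is turned into exactly 256 waveform samples:

from functools import reduce

upsample_rates = [8, 8, 2, 2]  # assumed default config values
hop_length = 256               # assumed default config value

total_upsampling = reduce(lambda a, b: a * b, upsample_rates)
assert total_upsampling == hop_length  # 8*8*2*2 = 256 audio samples per latent frame

latent_frames = 32  # e.g. a training slice of 8192 samples / 256
print(latent_frames * total_upsampling)  # 8192 waveform samples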

flow

Why add a flow? The authors' reply: "We found that increasing the expressiveness of the prior distribution is important for generating realistic samples." Intuitively, the flow maps a simple normal distribution onto a more complex one.

The flow module has 7,102,080 parameters in total.

ResidualCouplingBlock is composed of ResidualCouplingLayer and Flip modules.
Its data flow is shown below:
[Figure: flow module]

The detailed code is as follows:

class ResidualCouplingBlock(nn.Module):
    def __init__(self,
                 channels,
                 hidden_channels,
                 kernel_size,
                 dilation_rate,
                 n_layers,
                 n_flows=4,
                 gin_channels=0):
        super().__init__()
        self.channels = channels                # 192
        self.hidden_channels = hidden_channels  # 192
        self.kernel_size = kernel_size          # 5
        self.dilation_rate = dilation_rate      # 1
        self.n_layers = n_layers                # 4
        self.n_flows = n_flows                  # 4
        self.gin_channels = gin_channels        # 0

        self.flows = nn.ModuleList()
        for i in range(n_flows):
            self.flows.append(modules.ResidualCouplingLayer(channels, hidden_channels, kernel_size, dilation_rate, n_layers, gin_channels=gin_channels, mean_only=True))
            self.flows.append(modules.Flip())
        # num_param = sum(param.numel() for param in self.parameters())  # 7,102,080
        # print(num_param)

    def forward(self, x, x_mask, g=None, reverse=False):
        if not reverse:
            for flow in self.flows:
                x, _ = flow(x, x_mask, g=g, reverse=reverse)
        else:
            for flow in reversed(self.flows):
                x = flow(x, x_mask, g=g, reverse=reverse)
        return x

The ResidualCouplingLayer module:

class ResidualCouplingLayer(nn.Module):
    def __init__(self,
                 channels,
                 hidden_channels,
                 kernel_size,
                 dilation_rate,
                 n_layers,
                 p_dropout=0,
                 gin_channels=0,
                 mean_only=False):
        assert channels % 2 == 0, "channels should be divisible by 2"
        super().__init__()
        self.channels = channels
        self.hidden_channels = hidden_channels
        self.kernel_size = kernel_size
        self.dilation_rate = dilation_rate
        self.n_layers = n_layers
        self.half_channels = channels // 2
        self.mean_only = mean_only

        self.pre = nn.Conv1d(self.half_channels, hidden_channels, 1)
        self.enc = WN(hidden_channels, kernel_size, dilation_rate, n_layers, p_dropout=p_dropout, gin_channels=gin_channels)
        self.post = nn.Conv1d(hidden_channels, self.half_channels * (2 - mean_only), 1)
        self.post.weight.data.zero_()
        self.post.bias.data.zero_()

    def forward(self, x, x_mask, g=None, reverse=False):
        # first split the latent channels in half
        x0, x1 = torch.split(x, [self.half_channels]*2, 1)
        # x0: [8, 96, 654], x1: [8, 96, 654]
        h = self.pre(x0) * x_mask     # 96 --> 192
        h = self.enc(h, x_mask, g=g)  # WN module, 192 --> 192
        stats = self.post(h) * x_mask  # 192 --> 96 (mean_only = True)
        if not self.mean_only:
            m, logs = torch.split(stats, [self.half_channels]*2, 1)
        else:
            m = stats
            logs = torch.zeros_like(m)

        if not reverse:
            x1 = m + x1 * torch.exp(logs) * x_mask
            x = torch.cat([x0, x1], 1)
            logdet = torch.sum(logs, [1, 2])
            return x, logdet
        else:
            x1 = (x1 - m) * torch.exp(-logs) * x_mask
            x = torch.cat([x0, x1], 1)
            return x
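The forward/reverse branches above form an affine coupling: x0 passes through unchanged, while x1 is shifted (and, when mean_only is False, scaled) by statistics computed only from x0, so the transform can be inverted exactly. A toy round-trip check of that property, with a fixed m and logs standing in for the pre -> WN -> post path:

import torch

x0 = torch.randn(2, 96, 10)
x1 = torch.randn(2, 96, 10)
m = torch.randn_like(x1)     # pretend output of pre -> WN -> post, computed only from x0
logs = torch.zeros_like(x1)  # mean_only=True in VITS, so the log-scale is zero

y1 = m + x1 * torch.exp(logs)         # forward direction
x1_rec = (y1 - m) * torch.exp(-logs)  # reverse direction
print(torch.allclose(x1, x1_rec, atol=1e-6))  # True: the coupling is exactly invertible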

Stochastic Duration Predictor

Why introduce this module?

To improve the expressiveness of the synthesized speech and to handle the one-to-many mapping problem: the same text can be spoken with many different rhythms.

It consists of two main parts (a minimal sketch of the depth-separable convolution is given after the figure below):

  • stacked residual blocks with dilated and depth-separable (grouped) convolutional layers
  • neural spline flows

[Figure: Stochastic Duration Predictor]
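A depth-separable convolution factorizes a standard convolution into a per-channel (depthwise, groups=channels) convolution followed by a 1x1 pointwise convolution. The sketch below is only an illustration of that idea, not the repo's modules.DDSConv (which additionally stacks such layers with normalization, activations and residual connections):

import torch
import torch.nn as nn

class DepthSeparableConv1d(nn.Module):
    # hypothetical minimal layer illustrating the idea behind modules.DDSConv
    def __init__(self, channels, kernel_size, dilation=1):
        super().__init__()
        padding = (kernel_size - 1) * dilation // 2
        # depthwise: each channel is convolved independently (groups=channels)
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   groups=channels, dilation=dilation, padding=padding)
        # pointwise: a 1x1 convolution mixes information across channels
        self.pointwise = nn.Conv1d(channels, channels, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(8, 192, 101)
print(DepthSeparableConv1d(192, kernel_size=3, dilation=2)(x).shape)  # torch.Size([8, 192, 101])

The full StochasticDurationPredictor code: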


class StochasticDurationPredictor(nn.Module):
    def __init__(self, in_channels, filter_channels, kernel_size, p_dropout, n_flows=4, gin_channels=0):
        super().__init__()
        filter_channels = in_channels  # it needs to be removed from future version.
        self.in_channels = in_channels  # 192
        self.filter_channels = filter_channels
        self.kernel_size = kernel_size  # 3
        self.p_dropout = p_dropout      # 0.5
        self.n_flows = n_flows          # 4
        self.gin_channels = gin_channels

        self.log_flow = modules.Log()
        self.flows = nn.ModuleList()
        self.flows.append(modules.ElementwiseAffine(2))
        for i in range(n_flows):
            self.flows.append(modules.ConvFlow(2, filter_channels, kernel_size, n_layers=3))
            self.flows.append(modules.Flip())

        self.post_pre = nn.Conv1d(1, filter_channels, 1)
        self.post_proj = nn.Conv1d(filter_channels, filter_channels, 1)
        self.post_convs = modules.DDSConv(filter_channels, kernel_size, n_layers=3, p_dropout=p_dropout)
        self.post_flows = nn.ModuleList()
        self.post_flows.append(modules.ElementwiseAffine(2))
        for i in range(4):
            self.post_flows.append(modules.ConvFlow(2, filter_channels, kernel_size, n_layers=3))
            self.post_flows.append(modules.Flip())

        self.pre = nn.Conv1d(in_channels, filter_channels, 1)
        self.proj = nn.Conv1d(filter_channels, filter_channels, 1)
        self.convs = modules.DDSConv(filter_channels, kernel_size, n_layers=3, p_dropout=p_dropout)
        if gin_channels != 0:
            self.cond = nn.Conv1d(gin_channels, filter_channels, 1)

    def forward(self, x, x_mask, w=None, g=None, reverse=False, noise_scale=1.0):
        x = torch.detach(x)
        # stop gradients here so this module does not update the Text Encoder parameters
        x = self.pre(x)  # [8, 192, 101] [b, h, t]
        if g is not None:
            # speaker conditioning for multi-speaker models, since rhythm/prosody differs per speaker
            g = torch.detach(g)
            x = x + self.cond(g)
        x = self.convs(x, x_mask)  # [8, 192, 101]
        x = self.proj(x) * x_mask
        # x is the processed Text Encoder output (the condition)

        if not reverse:
            flows = self.flows
            assert w is not None
            # w is the duration (from MAS) being modeled
            logdet_tot_q = 0
            h_w = self.post_pre(w)             # [8, 192, 123]
            h_w = self.post_convs(h_w, x_mask)
            h_w = self.post_proj(h_w) * x_mask
            e_q = torch.randn(w.size(0), 2, w.size(2)).to(device=x.device, dtype=x.dtype) * x_mask  # [8, 2, 123]
            # sample from a standard normal
            z_q = e_q  # [8, 2, 123]
            for flow in self.post_flows:
                z_q, logdet_q = flow(z_q, x_mask, g=(x + h_w))
                logdet_tot_q += logdet_q
            # z_q: [8, 2, 123], logdet_tot_q: [8]
            z_u, z1 = torch.split(z_q, [1, 1], 1)  # u and v in the paper
            u = torch.sigmoid(z_u) * x_mask
            z0 = (w - u) * x_mask
            logdet_tot_q += torch.sum((F.logsigmoid(z_u) + F.logsigmoid(-z_u)) * x_mask, [1, 2])
            logq = torch.sum(-0.5 * (math.log(2*math.pi) + (e_q**2)) * x_mask, [1, 2]) - logdet_tot_q

            logdet_tot = 0
            z0, logdet = self.log_flow(z0, x_mask)
            logdet_tot += logdet
            z = torch.cat([z0, z1], 1)
            for flow in flows:
                z, logdet = flow(z, x_mask, g=x, reverse=reverse)
                logdet_tot = logdet_tot + logdet
            nll = torch.sum(0.5 * (math.log(2*math.pi) + (z**2)) * x_mask, [1, 2]) - logdet_tot
            return nll + logq  # [b]
        else:
            flows = list(reversed(self.flows))
            flows = flows[:-2] + [flows[-1]]  # remove a useless vflow
            z = torch.randn(x.size(0), 2, x.size(2)).to(device=x.device, dtype=x.dtype) * noise_scale
            for flow in flows:
                z = flow(z, x_mask, g=x, reverse=reverse)
            z0, z1 = torch.split(z, [1, 1], 1)
            logw = z0
            return logw
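At inference time the module is called with reverse=True and its output logw is converted into integer frame durations. A small self-contained sketch of that post-processing (mirroring how the repo's infer code consumes logw; the numbers below are made up):

import torch

# hypothetical predicted log-durations for one utterance with 5 phonemes
logw = torch.tensor([[[0.1, 0.5, -0.2, 0.9, 0.3]]])
x_mask = torch.ones_like(logw)
length_scale = 1.0  # >1 slows speech down, <1 speeds it up

w = torch.exp(logw) * x_mask * length_scale  # durations in (fractional) frames
w_ceil = torch.ceil(w)                       # integer number of frames per phoneme
y_length = torch.clamp_min(torch.sum(w_ceil), 1).long()  # total number of output frames
print(w_ceil, y_length)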

Monotonic Alignment Search

This alignment method was first proposed in Glow-TTS (by the same author as VITS); it removes the need for a separate external alignment step. It works in two stages:

  • Compute the likelihood matrix
with torch.no_grad():
    # negative cross-entropy
    s_p_sq_r = torch.exp(-2 * logs_p)  # [b, d, t]
    neg_cent1 = torch.sum(-0.5 * math.log(2 * math.pi) - logs_p, [1], keepdim=True)  # [b, 1, t_s]
    neg_cent2 = torch.matmul(-0.5 * (z_p ** 2).transpose(1, 2), s_p_sq_r)  # [b, t_t, d] x [b, d, t_s] = [b, t_t, t_s]
    neg_cent3 = torch.matmul(z_p.transpose(1, 2), (m_p * s_p_sq_r))  # [b, t_t, d] x [b, d, t_s] = [b, t_t, t_s]
    neg_cent4 = torch.sum(-0.5 * (m_p ** 2) * s_p_sq_r, [1], keepdim=True)  # [b, 1, t_s]
    neg_cent = neg_cent1 + neg_cent2 + neg_cent3 + neg_cent4
  • Find the alignment path from the likelihood matrix (a dynamic-programming algorithm; thinking of it like DTW helps)
import numpy as np

value = np.load("/home/admin/yuanxin/Bil-vits/attn.npy")  # the saved likelihood matrix
value = value[0].T

t_x, t_y = value.shape

path = np.zeros([t_x, t_y])

Q = float('-inf') * np.ones_like(value)

for y in range(t_y):
    for x in range(max(0, t_x + y - t_y), min(t_x, y + 1)):
        if y == 0:
            Q[x, 0] = value[x, 0]
        else:
            if x == 0:
                v_prev = float('-inf')
            else:
                v_prev = Q[x-1, y-1]
            v_cur = Q[x, y-1]
            Q[x, y] = value[x, y] + max(v_prev, v_cur)

# Backtrack from last observation
index = t_x - 1
for y in range(t_y - 1, -1, -1):
    path[index, y] = 1
    if index != 0 and (index == y or Q[index, y-1] < Q[index-1, y-1]):
        index = index - 1

# np.save("/home/admin/yuanxin/Bil-vits/path.npy", path)

The left figure shows the likelihood matrix; the right figure shows the resulting alignment.

[Figure: MAS likelihood matrix and alignment]
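For reference, the four neg_cent terms in the first step above are just the expansion of the Gaussian log-density of the (flow-transformed) posterior sample $z_p$ under the prior, summed over the channel dimension:

$$
\log \mathcal{N}(z_p;\, \mu_p, \sigma_p^2) = \underbrace{-\tfrac{1}{2}\log(2\pi) - \log\sigma_p}_{neg\_cent1}\;\underbrace{-\;\tfrac{z_p^2}{2\sigma_p^2}}_{neg\_cent2}\; + \underbrace{\tfrac{z_p\,\mu_p}{\sigma_p^2}}_{neg\_cent3}\;\underbrace{-\;\tfrac{\mu_p^2}{2\sigma_p^2}}_{neg\_cent4}
$$

with s_p_sq_r = exp(-2 * logs_p) playing the role of $1/\sigma_p^2$.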

Slice-based training

Adversarial training is not applied to the whole utterance; instead a random slice is taken, just as in HiFi-GAN. segment_size = 8192 samples, which is roughly 0.37 s of audio (at 22,050 Hz).
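A self-contained sketch of what this random slicing amounts to (in the repo the slicing is done on the latent z and the matching ground-truth audio via commons helpers; here the idea is shown directly on a waveform tensor for simplicity):

import torch

def rand_slice(wav, segment_size=8192):
    # pick a random start position per utterance and cut out segment_size samples
    b, t = wav.shape
    starts = torch.randint(0, t - segment_size + 1, (b,))
    return torch.stack([wav[i, int(s):int(s) + segment_size] for i, s in enumerate(starts)]), starts

wav = torch.randn(8, 70000)   # a batch of ~3 s clips at 22,050 Hz
segments, starts = rand_slice(wav)
print(segments.shape)         # torch.Size([8, 8192]), about 0.37 s each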

Adversarial training

Very similar to HiFi-GAN: the discriminator is still a multi-period discriminator (identical to the one in HiFi-GAN) plus a multi-scale discriminator (only a single scale is used here).

  • Loss functions

The main difference from HiFi-GAN is that the loss adds a KL divergence between the prior and the posterior.
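In the paper, the generator-side training objective combines these terms:

$$
\mathcal{L} = \mathcal{L}_{recon} + \mathcal{L}_{kl} + \mathcal{L}_{dur} + \mathcal{L}_{adv}(G) + \mathcal{L}_{fm}(G)
$$

where $\mathcal{L}_{recon}$ is the L1 mel reconstruction loss, $\mathcal{L}_{dur}$ the duration loss from the Stochastic Duration Predictor, $\mathcal{L}_{adv}$ the adversarial loss and $\mathcal{L}_{fm}$ the feature-matching loss (in the released training code the reconstruction and KL terms are additionally scaled by the c_mel and c_kl config values). The KL term is computed as follows: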

def kl_loss(z_p, logs_q, m_p, logs_p, z_mask):
    """
    z_p, logs_q: [b, h, t_t]
    m_p, logs_p: [b, h, t_t]
    """
    z_p = z_p.float()
    logs_q = logs_q.float()
    m_p = m_p.float()
    logs_p = logs_p.float()
    z_mask = z_mask.float()

    kl = logs_p - logs_q - 0.5
    kl += 0.5 * ((z_p - m_p)**2) * torch.exp(-2. * logs_p)
    kl = torch.sum(kl * z_mask)
    l = kl / torch.sum(z_mask)
    return l
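Written out, the per-element quantity the code accumulates is

$$
\log\sigma_p - \log\sigma_q - \tfrac{1}{2} + \frac{(z_p - \mu_p)^2}{2\sigma_p^2},
$$

which is $\log q(z\mid x) - \log p(z_p\mid c)$ with the $\tfrac{1}{2}\log(2\pi)$ terms cancelling, the entropy part of $q$ taken in closed form (hence the constant $-\tfrac{1}{2}$), and the cross-entropy part evaluated at the single flow-transformed sample $z_p$; the masking and the final division average this over all valid positions.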