VITS

VITS is an end-to-end TTS model based on variational inference; it merges the acoustic model and the vocoder into a single network.

Paper

https://arxiv.org/abs/2106.06103

https://github.com/jaywalnut310/vits

The VITS architecture diagram is shown below:

Experiments

Chinese (with prosody embedding)

The model is trained on the Chinese Biaobei (Databaker) dataset.

祝大家中秋节快乐

我说点什么好呢？念一个绕口令吧。八百标兵奔北坡，炮兵并排北边跑，炮兵怕把标兵碰，标兵怕碰炮兵炮。八百标兵奔北坡，北坡八百炮兵炮，标兵怕碰炮兵炮，炮兵怕把标兵碰。八了百了标了兵了奔了北了坡，炮了兵了并了排了北了边了跑，炮了兵了怕了把了标了兵了碰，标了兵了怕了碰了炮了兵了炮。

Mixed Chinese and English

我刚刚去 Starbucks 买了杯 Vanilla Latte 和两块 Oatmeal Raisin Cookie, 搭配起来还蛮不错的。

你多吃一点 means “Have some more.” 而慢慢吃 expresses politeness to someone when eating.

这是一个【bad case】，共同成长。我可以读共同富裕，就是不能说共同成长.

Model and code walkthrough

TextEncoder

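The Text Encoder maps $c_{text}$ to $h_{text}$. It consists of a 6-layer transformer encoder, with n_heads set to 2 in MultiHeadAttention and kernel_size set to 3 in the FFN. Note that it uses relative positional representations instead of absolute positional encoding. The proj layer maps $h_{text}$ to the mean and log standard deviation (logs in the code) of the prior distribution.

The Text Encoder has 6,353,664 parameters; the exact number varies slightly with the size of the phoneme vocabulary.

The full code is as follows: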

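import math

import torch
from torch import nn

import attentions  # module from the VITS repository
import commons     # module from the VITS repository


class TextEncoder(nn.Module):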
  def __init__(self,
      n_vocab,
      out_channels,
      hidden_channels,
      filter_channels,
      n_heads,
      n_layers,
      kernel_size,
      p_dropout):
    super().__init__()
    self.n_vocab = n_vocab
    self.out_channels = out_channels
    self.hidden_channels = hidden_channels
    self.filter_channels = filter_channels
    self.n_heads = n_heads
    self.n_layers = n_layers
    self.kernel_size = kernel_size
    self.p_dropout = p_dropout

    self.emb = nn.Embedding(n_vocab, hidden_channels)
    nn.init.normal_(self.emb.weight, 0.0, hidden_channels**-0.5)

    self.prosody_emb = nn.Embedding(5, hidden_channels)
    nn.init.normal_(self.prosody_emb.weight, 0.0, hidden_channels**-0.5)

    self.encoder = attentions.Encoder(
      hidden_channels,
      filter_channels,
      n_heads,
      n_layers,
      kernel_size,
      p_dropout)
    self.proj = nn.Conv1d(hidden_channels, out_channels * 2, 1)

  def forward(self, x, x_lengths, prosody):
    # print(x.shape)          # [8, 101],
    # print(x_lengths.shape)  # [8]
    # print(prosody.shape)    # [8, 101],

    x = self.emb(x) * math.sqrt(self.hidden_channels)  # [b, t, h]
    prosody = self.prosody_emb(prosody) * math.sqrt(self.hidden_channels)
    x = x + prosody
    x = torch.transpose(x, 1, -1)  # [b, h, t]
    x_mask = torch.unsqueeze(commons.sequence_mask(x_lengths, x.size(2)), 1).to(x.dtype)
    # print(x_mask.shape)          # [8, 1, 101]
    x = self.encoder(x * x_mask, x_mask)
    stats = self.proj(x) * x_mask

    m, logs = torch.split(stats, self.out_channels, dim=1)  # mean and log standard deviation of the prior
    return x, m, logs, x_mask
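
The x_mask used in forward marks which positions of each padded sequence hold real tokens. A minimal pure-Python sketch of that idea follows; the repository helper works on torch tensors, and `sequence_mask_sketch` is a hypothetical name used only for illustration.

```python
def sequence_mask_sketch(lengths, max_length=None):
    """Return a 0/1 mask of shape [batch, max_length].

    Position t of row b is 1.0 when t < lengths[b], i.e. when it holds
    a real token rather than padding.
    """
    if max_length is None:
        max_length = max(lengths)
    return [[1.0 if t < n else 0.0 for t in range(max_length)] for n in lengths]


# Two sequences of length 3 and 5, padded to 5 positions.
mask = sequence_mask_sketch([3, 5], max_length=5)
print(mask)
```

Multiplying activations by this mask, as the encoder does before and after each stage, keeps padded positions at zero.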

PosteriorEncoder

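Its main role is to map the linear spectrogram to the mean and log standard deviation of the posterior distribution. It is built mainly from non-causal WaveNet residual blocks; a WaveNet residual block consists of layers of dilated convolutions with a gated activation unit and a skip connection.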
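
The gated activation unit can be sketched for a single time step as follows. This is a simplified illustration, not the repository code: it assumes the convolution output carries twice the hidden channels, with the first half passed through tanh and the second half through sigmoid, which is how I read the `fused_add_tanh_sigmoid_multiply` call in the WN code further down. The function name is hypothetical.

```python
import math


def gated_activation(x_in, cond, n_channels):
    """Gated activation unit for one time step.

    x_in and cond are lists of 2 * n_channels values. After adding the
    conditioning, the first half goes through tanh and the second half
    through sigmoid; the two halves are multiplied element-wise.
    """
    summed = [a + b for a, b in zip(x_in, cond)]
    tanh_part = [math.tanh(v) for v in summed[:n_channels]]
    sigm_part = [1.0 / (1.0 + math.exp(-v)) for v in summed[n_channels:]]
    return [t * s for t, s in zip(tanh_part, sigm_part)]


# 2 hidden channels, so the dilated convolution yields 4 values per frame.
acts = gated_activation([0.0, 1.0, 0.0, 100.0], [0.0] * 4, n_channels=2)
print(acts)
```

The sigmoid half acts as a soft gate in [0, 1] on the tanh half, so a channel can be switched off without changing the other channels.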

Why the linear spectrogram rather than the mel-spectrogram? The authors explain: In our problem setting, we aim to provide more high-resolution information for the posterior encoder. We, therefore, use the linear-scale spectrogram of target speech $x_{lin}$ as input rather than the mel-spectrogram.

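pre: a 1-D convolution with kernel size 1 that maps the 513-dimensional input to 192 dimensions
enc: the WaveNet module, here with dilation_rate set to 1
The WaveNet module flowchart is shown below:

The WaveNet module code is as follows: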

class WN(torch.nn.Module):
  def __init__(self, hidden_channels, kernel_size, dilation_rate, n_layers, gin_channels=0, p_dropout=0):
    super(WN, self).__init__()
    assert(kernel_size % 2 == 1)
    self.hidden_channels = hidden_channels
    self.kernel_size = kernel_size  # no trailing comma, which would store a 1-tuple
    self.dilation_rate = dilation_rate
    self.n_layers = n_layers
    self.gin_channels = gin_channels
    self.p_dropout = p_dropout

    self.in_layers = torch.nn.ModuleList()
    self.res_skip_layers = torch.nn.ModuleList()
    self.drop = nn.Dropout(p_dropout)

    if gin_channels != 0:
      cond_layer = torch.nn.Conv1d(gin_channels, 2*hidden_channels*n_layers, 1)
      self.cond_layer = torch.nn.utils.weight_norm(cond_layer, name='weight')

    for i in range(n_layers):
      dilation = dilation_rate ** i
      padding = int((kernel_size * dilation - dilation) / 2)
      in_layer = torch.nn.Conv1d(hidden_channels, 2*hidden_channels, kernel_size,
                                 dilation=dilation, padding=padding)
      in_layer = torch.nn.utils.weight_norm(in_layer, name='weight')
      self.in_layers.append(in_layer)

      # last one is not necessary
      if i < n_layers - 1:
        res_skip_channels = 2 * hidden_channels
      else:
        res_skip_channels = hidden_channels

      res_skip_layer = torch.nn.Conv1d(hidden_channels, res_skip_channels, 1)
      res_skip_layer = torch.nn.utils.weight_norm(res_skip_layer, name='weight')
      self.res_skip_layers.append(res_skip_layer)

  def forward(self, x, x_mask, g=None, **kwargs):
    output = torch.zeros_like(x)
    n_channels_tensor = torch.IntTensor([self.hidden_channels])  # tensor([192], dtype=torch.int32)

    if g is not None:
      g = self.cond_layer(g)

    for i in range(self.n_layers):
      x_in = self.in_layers[i](x)
      if g is not None:
        cond_offset = i * 2 * self.hidden_channels
        g_l = g[:,cond_offset:cond_offset+2*self.hidden_channels,:]
      else:
        g_l = torch.zeros_like(x_in)

      acts = commons.fused_add_tanh_sigmoid_multiply(
          x_in,
          g_l,
          n_channels_tensor)
      acts = self.drop(acts)

      res_skip_acts = self.res_skip_layers[i](acts)
      if i < self.n_layers - 1:
        res_acts = res_skip_acts[:,:self.hidden_channels,:]
        x = (x + res_acts) * x_mask
        output = output + res_skip_acts[:,self.hidden_channels:,:]
      else:
        output = output + res_skip_acts
    return output * x_mask

  def remove_weight_norm(self):
    if self.gin_channels != 0:
      torch.nn.utils.remove_weight_norm(self.cond_layer)
    for l in self.in_layers:
      torch.nn.utils.remove_weight_norm(l)
    for l in self.res_skip_layers:
      torch.nn.utils.remove_weight_norm(l)

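proj: maps 192 dimensions to 192*2, which are then split into the mean and log standard deviation of the distribution

The PosteriorEncoder has 7,238,016 parameters.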
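
The parameter count can be checked by hand. The sketch below assumes the hyperparameters given in the code comments (513 input channels, 192 hidden channels, kernel size 5, 16 WN layers, no speaker conditioning) and assumes that weight normalization stores one extra gain value per output channel. The helper names are hypothetical.

```python
def conv1d_params(c_in, c_out, kernel_size, weight_norm=False):
    """Parameter count of a 1-D convolution with bias.

    With weight normalization, one extra gain value per output channel
    is stored next to the direction tensor.
    """
    n = c_in * c_out * kernel_size + c_out
    if weight_norm:
        n += c_out
    return n


def wn_params(hidden, kernel_size, n_layers):
    """Parameter count of the WN stack without speaker conditioning."""
    total = 0
    for i in range(n_layers):
        total += conv1d_params(hidden, 2 * hidden, kernel_size, weight_norm=True)
        # The last layer only produces the skip half, so it is narrower.
        out = 2 * hidden if i < n_layers - 1 else hidden
        total += conv1d_params(hidden, out, 1, weight_norm=True)
    return total


pre = conv1d_params(513, 192, 1)
enc = wn_params(192, 5, 16)
proj = conv1d_params(192, 2 * 192, 1)
print(pre, enc, proj, pre + enc + proj)
```

Almost all of the parameters sit in the 16 dilated convolutions of the WN stack; pre and proj together contribute well under 200,000.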

The full code is as follows:

class PosteriorEncoder(nn.Module):
  def __init__(self,
      in_channels,
      out_channels,
      hidden_channels,
      kernel_size,
      dilation_rate,
      n_layers,
      gin_channels=0):
    super().__init__()
    self.in_channels = in_channels          # 513
    self.out_channels = out_channels        # 192
    self.hidden_channels = hidden_channels  # 192
    self.kernel_size = kernel_size          # 5
    self.dilation_rate = dilation_rate      # 1
    self.n_layers = n_layers                # 16
    self.gin_channels = gin_channels        # 0

    self.pre = nn.Conv1d(in_channels, hidden_channels, 1)
    self.enc = modules.WN(hidden_channels, kernel_size, dilation_rate, n_layers, gin_channels=gin_channels)
    self.proj = nn.Conv1d(hidden_channels, out_channels * 2, 1)
    # num_param = sum(param.numel() for param in self.parameters())  # parameter count: 7,238,016
    # print(num_param)

  def forward(self, x, x_lengths, g=None):
    x_mask = torch.unsqueeze(commons.sequence_mask(x_lengths, x.size(2)), 1).to(x.dtype)
    x = self.pre(x) * x_mask
    # print(x.shape)                  # [8, 192, 654]
    x = self.enc(x, x_mask, g=g)      #
    # print(x.shape)                  # [8, 192, 654], shape unchanged
    stats = self.proj(x) * x_mask     #
    m, logs = torch.split(stats, self.out_channels, dim=1)
    z = (m + torch.randn_like(m) * torch.exp(logs)) * x_mask
    return z, m, logs, x_mask
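
The line that computes z is the reparameterization trick: noise is drawn from a standard normal and then scaled and shifted, so that gradients can flow into m and logs. The scalar sketch below (hypothetical helper, standard library only) also shows that logs behaves as a log standard deviation, since exp(logs) multiplies the noise directly.

```python
import math
import random
import statistics


def sample_latent(m, logs, rng):
    """Draw z = m + eps * exp(logs) with eps ~ N(0, 1)."""
    return m + rng.gauss(0.0, 1.0) * math.exp(logs)


rng = random.Random(0)
samples = [sample_latent(1.5, -1.0, rng) for _ in range(20000)]
# The empirical mean should be close to m and the empirical standard
# deviation close to exp(logs).
print(statistics.mean(samples), statistics.stdev(samples), math.exp(-1.0))
```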

Generator

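The Generator upsamples the latent variable, obtained by passing the linear spectrogram through the PosteriorEncoder, to the waveform. Its structure is identical to the HiFi-GAN generator; the only difference is that HiFi-GAN upsamples mel-spectrograms. See the HiFi-GAN paper for details.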

flow

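Why add a flow? The authors explain: We found that increasing the expressiveness of the prior distribution is important for generating realistic samples. Put simply, the flow maps a normal distribution to a more complex distribution.

The flow module has 7,102,080 parameters in total.

ResidualCouplingBlock is built from ResidualCouplingLayer and Flip.
The flowchart of this module is shown below:

The full code is as follows: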

class ResidualCouplingBlock(nn.Module):
  def __init__(self,
      channels,
      hidden_channels,
      kernel_size,
      dilation_rate,
      n_layers,
      n_flows=4,
      gin_channels=0):
    super().__init__()
    self.channels = channels                 # 192
    self.hidden_channels = hidden_channels   # 192
    self.kernel_size = kernel_size           # 5
    self.dilation_rate = dilation_rate       # 1
    self.n_layers = n_layers                 # 4
    self.n_flows = n_flows                   # 4
    self.gin_channels = gin_channels         # 0

    self.flows = nn.ModuleList()
    for i in range(n_flows):
      self.flows.append(modules.ResidualCouplingLayer(channels, hidden_channels, kernel_size, dilation_rate, n_layers, gin_channels=gin_channels, mean_only=True))
      self.flows.append(modules.Flip())
    # num_param = sum(param.numel() for param in self.parameters())  # 7,102,080
    # print(num_param)

  def forward(self, x, x_mask, g=None, reverse=False):
    if not reverse:
      for flow in self.flows:
        x, _ = flow(x, x_mask, g=g, reverse=reverse)
    else:
      for flow in reversed(self.flows):
        x = flow(x, x_mask, g=g, reverse=reverse)
    return x

The ResidualCouplingLayer module

class ResidualCouplingLayer(nn.Module):
  def __init__(self,
      channels,
      hidden_channels,
      kernel_size,
      dilation_rate,
      n_layers,
      p_dropout=0,
      gin_channels=0,
      mean_only=False):
    assert channels % 2 == 0, "channels should be divisible by 2"
    super().__init__()
    self.channels = channels
    self.hidden_channels = hidden_channels
    self.kernel_size = kernel_size
    self.dilation_rate = dilation_rate
    self.n_layers = n_layers
    self.half_channels = channels // 2
    self.mean_only = mean_only

    self.pre = nn.Conv1d(self.half_channels, hidden_channels, 1)
    self.enc = WN(hidden_channels, kernel_size, dilation_rate, n_layers, p_dropout=p_dropout, gin_channels=gin_channels)
    self.post = nn.Conv1d(hidden_channels, self.half_channels * (2 - mean_only), 1)
    self.post.weight.data.zero_()
    self.post.bias.data.zero_()

  def forward(self, x, x_mask, g=None, reverse=False):
    x0, x1 = torch.split(x, [self.half_channels]*2, 1)
    # print(x0.shape)  # [8, 96, 654]
    # print(x1.shape)  # [8, 96, 654]
    # x has been split in half along the channel dimension
    h = self.pre(x0) * x_mask        # 96 --> 192
    h = self.enc(h, x_mask, g=g)     # WN module 192 --> 192
    stats = self.post(h) * x_mask    # 192 --> 96 (mean_only = True)
    if not self.mean_only:
      m, logs = torch.split(stats, [self.half_channels]*2, 1)
    else:
      m = stats
      logs = torch.zeros_like(m)

    if not reverse:
      x1 = m + x1 * torch.exp(logs) * x_mask
      x = torch.cat([x0, x1], 1)
      logdet = torch.sum(logs, [1,2])
      return x, logdet
    else:
      x1 = (x1 - m) * torch.exp(-logs) * x_mask
      x = torch.cat([x0, x1], 1)
      return x
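
The key property of this layer is that it is exactly invertible, because the shift m is computed only from the half that is left unchanged. The sketch below illustrates this on a single frame with a toy function standing in for pre, WN and post; it covers only the mean-only case used here, where logs is zero and the log-determinant is therefore zero. All names are hypothetical.

```python
import math


def shift_net(x0):
    """Toy stand-in for pre -> WN -> post: any function of x0 works."""
    return [math.sin(3.0 * v) + 0.5 for v in x0]


def coupling(x, reverse=False):
    """Mean-only coupling on a single frame given as a list of channels."""
    half = len(x) // 2
    x0, x1 = x[:half], x[half:]
    m = shift_net(x0)  # depends only on the untouched half
    if not reverse:
        x1 = [b + s for b, s in zip(x1, m)]
    else:
        x1 = [b - s for b, s in zip(x1, m)]
    return x0 + x1


def flip(x):
    """Reverse the channel order so the other half is transformed next."""
    return x[::-1]


x = [0.1, -0.4, 2.0, 0.7]
z = flip(coupling(x))                    # forward: coupling, then Flip
x_rec = coupling(flip(z), reverse=True)  # reverse: undo Flip, then coupling
print(z, x_rec)
```

Stacking several coupling and Flip pairs, as ResidualCouplingBlock does, lets every channel be transformed while keeping the whole block invertible.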

Stochastic Duration Predictor

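Why introduce this module?

It improves the expressiveness of the synthesized speech and addresses the one-to-many mapping problem: the same text can be spoken with many different rhythms.

It consists of two main parts:

stacked residual blocks with dilated and depth-separable convolutional layers, which combine residual connections, grouped convolutions, and dilation
neural spline flows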


class StochasticDurationPredictor(nn.Module):
  def __init__(self, in_channels, filter_channels, kernel_size, p_dropout, n_flows=4, gin_channels=0):
    super().__init__()
    filter_channels = in_channels # it needs to be removed from future version.
    self.in_channels = in_channels              # 192
    self.filter_channels = filter_channels
    self.kernel_size = kernel_size              # 3
    self.p_dropout = p_dropout                  # 0.5
    self.n_flows = n_flows                      # 4
    self.gin_channels = gin_channels

    self.log_flow = modules.Log()
    self.flows = nn.ModuleList()
    self.flows.append(modules.ElementwiseAffine(2))
    for i in range(n_flows):
      self.flows.append(modules.ConvFlow(2, filter_channels, kernel_size, n_layers=3))
      self.flows.append(modules.Flip())

    self.post_pre = nn.Conv1d(1, filter_channels, 1)
    self.post_proj = nn.Conv1d(filter_channels, filter_channels, 1)
    self.post_convs = modules.DDSConv(filter_channels, kernel_size, n_layers=3, p_dropout=p_dropout)
    self.post_flows = nn.ModuleList()
    self.post_flows.append(modules.ElementwiseAffine(2))
    for i in range(4):
      self.post_flows.append(modules.ConvFlow(2, filter_channels, kernel_size, n_layers=3))
      self.post_flows.append(modules.Flip())

    self.pre = nn.Conv1d(in_channels, filter_channels, 1)
    self.proj = nn.Conv1d(filter_channels, filter_channels, 1)
    self.convs = modules.DDSConv(filter_channels, kernel_size, n_layers=3, p_dropout=p_dropout)
    if gin_channels != 0:
      self.cond = nn.Conv1d(gin_channels, filter_channels, 1)

  def forward(self, x, x_mask, w=None, g=None, reverse=False, noise_scale=1.0):
    x = torch.detach(x)
    # stop gradient: this module does not affect the parameter updates of the Text Encoder
    x = self.pre(x)             # [8, 192, 101] [b, h, t]
    if g is not None:
      # for multi-speaker models the speaker embedding is added here, because rhythm and prosody differ between speakers
      g = torch.detach(g)
      x = x + self.cond(g)
    x = self.convs(x, x_mask)   # [8, 192, 101]
    x = self.proj(x) * x_mask   #
    # up to here, x is the processed Text Encoder output

    if not reverse:
      flows = self.flows
      assert w is not None
      # w holds the durations
      logdet_tot_q = 0
      h_w = self.post_pre(w)               # [8, 192, 123]
      h_w = self.post_convs(h_w, x_mask)   #
      h_w = self.post_proj(h_w) * x_mask   #
      e_q = torch.randn(w.size(0), 2, w.size(2)).to(device=x.device, dtype=x.dtype) * x_mask  # [8, 2, 123]
      # standard normal noise
      z_q = e_q  # [8, 2, 123]
      for flow in self.post_flows:
        z_q, logdet_q = flow(z_q, x_mask, g=(x + h_w))
        logdet_tot_q += logdet_q
      # print(z_q.shape)     [8, 2, 123]
      # print(logdet_tot_q)  [8]
      z_u, z1 = torch.split(z_q, [1, 1], 1)    # u and v
      u = torch.sigmoid(z_u) * x_mask
      z0 = (w - u) * x_mask
      logdet_tot_q += torch.sum((F.logsigmoid(z_u) + F.logsigmoid(-z_u)) * x_mask, [1,2])
      logq = torch.sum(-0.5 * (math.log(2*math.pi) + (e_q**2)) * x_mask, [1,2]) - logdet_tot_q

      logdet_tot = 0
      z0, logdet = self.log_flow(z0, x_mask)
      logdet_tot += logdet
      z = torch.cat([z0, z1], 1)
      for flow in flows:
        z, logdet = flow(z, x_mask, g=x, reverse=reverse)
        logdet_tot = logdet_tot + logdet
      nll = torch.sum(0.5 * (math.log(2*math.pi) + (z**2)) * x_mask, [1,2]) - logdet_tot
      return nll + logq # [b]
    else:
      flows = list(reversed(self.flows))
      flows = flows[:-2] + [flows[-1]] # remove a useless vflow
      z = torch.randn(x.size(0), 2, x.size(2)).to(device=x.device, dtype=x.dtype) * noise_scale
      for flow in flows:
        z = flow(z, x_mask, g=x, reverse=reverse)
      z0, z1 = torch.split(z, [1, 1], 1)
      logw = z0
      return logw
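
In reverse mode the module returns log-durations only; turning them into frame counts and an alignment happens outside it. The sketch below assumes the usual recipe of exponentiating, masking, scaling by a length factor and rounding up, and then expanding the counts into a 0/1 path. The function names and the `length_scale` parameter are illustrative assumptions, not code from this post.

```python
import math


def durations_from_logw(logw, mask, length_scale=1.0):
    """Convert log-durations to whole frame counts per input token."""
    return [math.ceil(math.exp(lw) * m * length_scale) for lw, m in zip(logw, mask)]


def expand_path(durations):
    """Build a [tokens x frames] 0/1 alignment from frame counts."""
    total = sum(durations)
    path, start = [], 0
    for d in durations:
        row = [0] * total
        for t in range(start, start + d):
            row[t] = 1
        path.append(row)
        start += d
    return path


logw = [0.2, 0.9, 0.0]   # the last token is padding
mask = [1.0, 1.0, 0.0]
durs = durations_from_logw(logw, mask)
path = expand_path(durs)
print(durs)
print(path)
```

Because the noise z is sampled, repeated calls give different durations for the same text; a smaller noise_scale makes the rhythm less variable.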

Monotonic Alignment Search

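This method was first proposed in Glow-TTS (by the same authors as VITS); its purpose is to remove the need for a separate alignment step.

Computing the likelihood matrix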

with torch.no_grad():
  # negative cross-entropy
  s_p_sq_r = torch.exp(-2 * logs_p) # [b, d, t]
  neg_cent1 = torch.sum(-0.5 * math.log(2 * math.pi) - logs_p, [1], keepdim=True) # [b, 1, t_s]
  neg_cent2 = torch.matmul(-0.5 * (z_p ** 2).transpose(1, 2), s_p_sq_r) # [b, t_t, d] x [b, d, t_s] = [b, t_t, t_s]
  neg_cent3 = torch.matmul(z_p.transpose(1, 2), (m_p * s_p_sq_r)) # [b, t_t, d] x [b, d, t_s] = [b, t_t, t_s]
  neg_cent4 = torch.sum(-0.5 * (m_p ** 2) * s_p_sq_r, [1], keepdim=True) # [b, 1, t_s]
  neg_cent = neg_cent1 + neg_cent2 + neg_cent3 + neg_cent4
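
The four terms are the expanded log density of a diagonal Gaussian: each entry of neg_cent is the log-likelihood of one latent frame under the prior of one text position. The scalar sketch below checks this identity for a single pair; the helper names and the numbers are made up for illustration.

```python
import math


def log_normal(z, m, logs):
    """Log density of N(m, exp(logs)^2) at z, summed over channels."""
    return sum(
        -0.5 * math.log(2 * math.pi) - s - 0.5 * ((x - mu) ** 2) * math.exp(-2 * s)
        for x, mu, s in zip(z, m, logs)
    )


def neg_cent_entry(z, m, logs):
    """The same value assembled from the four terms used in the snippet."""
    inv_var = [math.exp(-2 * s) for s in logs]
    t1 = sum(-0.5 * math.log(2 * math.pi) - s for s in logs)
    t2 = sum(-0.5 * x * x * r for x, r in zip(z, inv_var))
    t3 = sum(x * mu * r for x, mu, r in zip(z, m, inv_var))
    t4 = sum(-0.5 * mu * mu * r for mu, r in zip(m, inv_var))
    return t1 + t2 + t3 + t4


z_frame = [0.3, -1.2, 0.8]      # one latent frame (d = 3 channels)
m_text = [0.0, -1.0, 1.0]       # prior mean of one text position
logs_text = [0.1, -0.3, 0.0]    # prior log standard deviation
print(log_normal(z_frame, m_text, logs_text), neg_cent_entry(z_frame, m_text, logs_text))
```

Expanding the square is what allows the full matrix to be computed with two matrix multiplications instead of a loop over all frame and text pairs.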

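Obtaining the alignment path from the likelihood matrix (a dynamic programming algorithm; DTW is a useful reference for understanding it)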

import numpy as np

value = np.load("/home/admin/yuanxin/Bil-vits/attn.npy")  # the saved likelihood matrix
value = value[0].T

t_x, t_y = value.shape

path = np.zeros([t_x, t_y])

Q = float('-inf') * np.ones_like(value)

for y in range(t_y):
    for x in range(max(0, t_x + y - t_y), min(t_x, y + 1)):
        if y == 0:
            Q[x, 0] = value[x, 0]
        else:
            if x == 0:
                v_prev = float('-inf')
            else:
                v_prev = Q[x-1, y-1]
            v_cur = Q[x, y-1]
            Q[x, y] = value[x, y] + max(v_prev, v_cur)

# Backtrack from last observation
index = t_x - 1
for y in range(t_y - 1, -1, -1):
    path[index, y] = 1
    if index !=0 and (index == y or Q[index, y-1] < Q[index-1, y-1]):
        index = index - 1

# np.save("/home/admin/yuanxin/Bil-vits/path.npy", path)

Left: the likelihood matrix. Right: the resulting alignment.
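
Since the script above depends on a saved .npy file, here is the same dynamic program as a self-contained function on a toy matrix. It is a pure-Python sketch with a hypothetical function name, and it assumes there are at least as many frames as text positions.

```python
def monotonic_alignment(value):
    """Best monotonic path through a [t_x, t_y] log-likelihood matrix.

    t_x indexes text positions and t_y indexes frames. Returns a 0/1
    matrix of the same shape with exactly one text position per frame.
    """
    t_x, t_y = len(value), len(value[0])
    neg_inf = float("-inf")
    q = [[neg_inf] * t_y for _ in range(t_x)]
    for y in range(t_y):
        for x in range(max(0, t_x + y - t_y), min(t_x, y + 1)):
            if y == 0:
                q[x][0] = value[x][0]
                continue
            v_prev = q[x - 1][y - 1] if x > 0 else neg_inf
            v_cur = q[x][y - 1]
            q[x][y] = value[x][y] + max(v_prev, v_cur)

    # Backtrack from the last frame, which must map to the last text position.
    path = [[0] * t_y for _ in range(t_x)]
    index = t_x - 1
    for y in range(t_y - 1, -1, -1):
        path[index][y] = 1
        if index != 0 and (index == y or q[index][y - 1] < q[index - 1][y - 1]):
            index -= 1
    return path


# Text position 0 fits frames 0-1 best, text position 1 fits frames 2-3.
toy = [[0.0, 0.0, -5.0, -5.0],
       [-5.0, -5.0, 0.0, 0.0]]
path = monotonic_alignment(toy)
print(path)
```

Each frame either stays on the current text position or advances by one, which is what makes the path monotonic and prevents skipped tokens.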

Training on slices

Adversarial training is not run on whole utterances; random slices are used instead, as in HiFi-GAN. segment_size=8192, which is about 0.37 seconds.
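
The 0.37-second figure can be reproduced with a little arithmetic. The sketch below assumes a 22,050 Hz sampling rate and a hop length of 256 samples; neither value is stated in this post, so treat both as assumptions. The 654-frame utterance length is taken from the shape comments in the code above, and `random_slice` is a hypothetical helper.

```python
import random

segment_size = 8192      # samples per training slice (from the text above)
sampling_rate = 22050    # assumption: Hz
hop_length = 256         # assumption: samples per spectrogram frame

seconds = segment_size / sampling_rate
frames = segment_size // hop_length
print(f"{seconds:.3f} s, {frames} latent frames")


def random_slice(frames_total, frames_per_slice, rng):
    """Pick a random window of latent frames inside one utterance."""
    start = rng.randint(0, frames_total - frames_per_slice)
    return start, start + frames_per_slice


start, end = random_slice(654, frames, random.Random(0))
# The matching waveform slice covers these sample indices.
wave_start, wave_end = start * hop_length, end * hop_length
print(start, end, wave_start, wave_end)
```

Slicing in the latent domain and decoding only that window keeps the memory cost of the waveform decoder and the discriminators bounded, regardless of utterance length.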

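Adversarial training

This is very similar to HiFi-GAN: the discriminators are again a multi-period discriminator (with the same periods as in HiFi-GAN) plus a multi-scale discriminator (only one scale is used here).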
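
The core idea of a period discriminator is to fold the 1-D waveform into a 2-D grid whose width is the period, so that samples one period apart line up in a column. A rough pure-Python sketch follows, assuming reflection padding at the tail and the period set [2, 3, 5, 7, 11] from HiFi-GAN; the function name is hypothetical.

```python
def to_period_view(wave, period):
    """Fold a 1-D waveform into rows of `period` samples.

    If the length is not a multiple of the period, the tail is padded by
    reflection (without repeating the last sample) first.
    """
    remainder = len(wave) % period
    if remainder != 0:
        n_pad = period - remainder
        wave = wave + [wave[-2 - i] for i in range(n_pad)]
    return [wave[i:i + period] for i in range(0, len(wave), period)]


periods = [2, 3, 5, 7, 11]  # assumption: the HiFi-GAN periods
wave = list(range(10))
view = to_period_view(wave, 3)
print(view)
```

Using prime periods means that the different sub-discriminators look at largely non-overlapping periodic patterns of the signal.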

Loss function

The difference from HiFi-GAN is that the loss adds the KL divergence between the posterior and the prior.

def kl_loss(z_p, logs_q, m_p, logs_p, z_mask):
  """
  z_p, logs_q: [b, h, t_t]
  m_p, logs_p: [b, h, t_t]
  """
  z_p = z_p.float()
  logs_q = logs_q.float()
  m_p = m_p.float()
  logs_p = logs_p.float()
  z_mask = z_mask.float()

  kl = logs_p - logs_q - 0.5
  kl += 0.5 * ((z_p - m_p)**2) * torch.exp(-2. * logs_p)
  kl = torch.sum(kl * z_mask)
  l = kl / torch.sum(z_mask)
  return l
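
The expression inside kl_loss is a single-sample estimate of the KL divergence between two Gaussians: the entropy part of the posterior is computed analytically, and the prior log density is evaluated at the sampled latent. The scalar sketch below ignores the flow and checks that the average of this term over many samples approaches the closed-form KL. The helper names and the numbers are chosen for illustration only.

```python
import math
import random


def kl_term(z, logs_q, m_p, logs_p):
    """Per-element term of kl_loss for one sample z drawn from the posterior."""
    return logs_p - logs_q - 0.5 + 0.5 * (z - m_p) ** 2 * math.exp(-2.0 * logs_p)


def kl_closed_form(m_q, logs_q, m_p, logs_p):
    """Analytic KL( N(m_q, s_q^2) || N(m_p, s_p^2) ) for scalars."""
    var_q, var_p = math.exp(2 * logs_q), math.exp(2 * logs_p)
    return logs_p - logs_q - 0.5 + (var_q + (m_q - m_p) ** 2) / (2 * var_p)


rng = random.Random(0)
m_q, logs_q, m_p, logs_p = 0.3, math.log(0.5), 0.0, 0.0
n = 100000
estimate = sum(
    kl_term(m_q + rng.gauss(0.0, 1.0) * math.exp(logs_q), logs_q, m_p, logs_p)
    for _ in range(n)
) / n
print(estimate, kl_closed_form(m_q, logs_q, m_p, logs_p))
```

A single sample can give a negative value for this term, which is why the loss is averaged over all unmasked positions.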