NVC-Net

A voice conversion system that operates directly on raw audio waveforms and supports zero-shot voice conversion.

Introduction

Paper: https://arxiv.org/abs/2106.00992

Code: https://github.com/sony/ai-research-code

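Strengths and weaknesses of the model:

Training is painfully slow, which all but rules it out for anyone with only a few GPUs. I suspect the main culprits are the data augmentation and the Speaker encoder module; on top of that, the MelGAN-style vocoder has relatively weak modeling capacity.
GPU inference is very fast: the model is fully convolutional and skips the feature-extraction step entirely.
How could it be made to train faster and sound better? My idea is to give up on zero-shot conversion, which is more of a novelty feature, and replace the Speaker encoder with a one-hot Speaker Embedding (the paper's NVC-Net+ variant already does this). I would also swap the Generator for the HiFi-GAN generator and replace the discriminator with a stronger one. To speed up training, the model could first extract spectral features and then model those instead of the raw waveform. With those changes the model looks a lot like VITS: drop the Text Encoder from VITS and you get roughly this model.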

Model architecture diagram:

Code

Content Encoder

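The content encoder is essentially a downsampling stack that extracts the content information from the audio.
Each downsampling block has two parts. The first is a stack of dilated convolutions, which enlarge the receptive field without adding parameters while keeping the time resolution unchanged; they are wrapped in residual connections so that the network can go deeper. The second is an ordinary strided convolution that downsamples (with strides 2, 2, 8, 8 across the four blocks) and doubles the channel count.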
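
The length arithmetic behind those four strided convolutions can be checked with a few lines of plain Python. This is only a sketch: `conv_out_len` and `encoder_lengths` are hypothetical helper names, not part of the NVC-Net code, and the kernel, stride and padding values are read off the `DownBlock` listing (kernel `2 * r`, stride `r`, padding `r // 2 + r % 2`).

```python
def conv_out_len(n, kernel, stride, pad, dilation=1):
    """Output length of a 1-D convolution, using floor division."""
    return (n + 2 * pad - dilation * (kernel - 1) - 1) // stride + 1


def encoder_lengths(n, ratios=(8, 8, 2, 2)):
    """Trace the time-axis length through the four strided convolutions."""
    lengths = []
    for r in reversed(ratios):  # the encoder applies the ratios as 2, 2, 8, 8
        n = conv_out_len(n, kernel=2 * r, stride=r, pad=r // 2 + r % 2)
        lengths.append(n)
    return lengths


lengths = encoder_lengths(54474)
print(lengths)  # [27237, 13618, 1702, 212]
```

The result matches the shapes noted in the encoder listing: a 54474-sample waveform ends up as 212 content frames, roughly one frame per 256 samples.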

class Encoder(Module):
    r"""Implementation of the content encoder.

    Args:
        hp (HParams): Hyper-parameters.
    """

    def __init__(self, hp):
        self.hp = hp
        # n_residual_layers = 4,
        # ratios = [8, 8, 2, 2],
        for i in range(len(hp.ratios)):
            setattr(self, f"block_{i}", DownBlock(hp))  # register one DownBlock per ratio

    def call(self, x):
        hp = self.hp
        # print(x.shape)  # [1, 1, 54474]; 54474 is the length of the raw waveform
        with nn.parameter_scope("first_layer"):
            x = F.pad(x, (0, 0, 3, 3), 'reflect')  # reflect-pad 3 samples on each side: [1, 1, 54480]
            x = wn_conv(x, hp.ngf, (7,))           # 1-D conv with 32 output channels: [1, 32, 54474]

        for i, r in enumerate(reversed(hp.ratios)):
            x = getattr(self, f"block_{i}")(x, r, 2**(i + 1))  # apply the four downsampling blocks in turn
            # shapes: (1, 64, 27237) --> (1, 128, 13618) --> (1, 256, 1702) --> (1, 512, 212)


        with nn.parameter_scope("last_layer"):
            x = F.gelu(x)
            x = F.pad(x, (0, 0, 3, 3), 'reflect')
            x = wn_conv(x, x.shape[1], (7,))
            # (1, 512, 212)

        with nn.parameter_scope("content"):
            x = F.gelu(x)
            x = F.pad(x, (0, 0, 3, 3), 'reflect')
            x = wn_conv(x, hp.bottleneck_dim, (7,), with_bias=False)  # (1, 4, 212)
            # bottleneck_dim = 4
            x = x / F.sum(x**2 + 1e-12, axis=1, keepdims=True)**0.5
            # L2-normalize each time step along the channel axis

        return x

class DownBlock(Module):
    def __init__(self, hp):
        self.hp = hp
        for i in range(hp.n_residual_layers):
            setattr(self, f"resblock_{i}", ResnetBlock())

    def call(self, x, r, mult):
        hp = self.hp
        for i in range(hp.n_residual_layers):
            x = getattr(self, f"resblock_{i}")(x, None, 3 ** i)
        # dilated convolutions (dilation rate 3 ** i)
        with nn.parameter_scope("conv"):
            # ngf = 32
            # r takes the values 2, 2, 8, 8 (one per block)
            # mult = 2 ** (i + 1), i.e. 2, 4, 8, 16
            x = F.gelu(x)
            x = wn_conv(
                x, mult * hp.ngf, (r * 2,),
                stride=(r,), pad=(r // 2 + r % 2,),
            )
        # widen the channels (e.g. 32 --> 64) and downsample the time axis T by a factor of r
        return x

class ResnetBlock(Module):
    def call(self, x, spk_emb, dilation):
        dim = x.shape[1]
        with nn.parameter_scope('shortcut'):
            s = wn_conv(x, dim, (1,))  # 1x1 conv shortcut with the same shape as x (the residual path, added to the block output below)
        with nn.parameter_scope('block'):
            b = F.pad(x, (0, 0, dilation, dilation), 'reflect')
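            # Output length of a dilated convolution (dilation enlarges the receptive field
            # without adding parameters while keeping the feature-map resolution unchanged):
            # out_dim = (in_dim + 2 * padding - (kernel_size - 1) * dilation - 1) // stride + 1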
            b = wn_conv(b, 2 * dim, (3,), dilation=(dilation,), name='conv_1')
            if spk_emb is not None:
                b = b + wn_conv(spk_emb, 2 * dim, (1,), name="spk_emb")
            b = F.tanh(b[:, :dim, ...]) * F.sigmoid(b[:, dim:, ...])
            b = wn_conv(b, dim, (1,), dilation=(dilation,), name='conv_2')
        return s + b
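
To make the residual block more concrete, here is a small NumPy sketch of two things it relies on: how far a stack of four dilated convolutions (kernel 3, dilation `3 ** i`) can see, and the tanh/sigmoid gate applied after `conv_1`. The function names are hypothetical, and the speaker term is assumed to be already projected to `2 * dim` channels (the listing does that with a 1x1 `wn_conv`).

```python
import numpy as np


def receptive_field(n_layers=4, kernel=3, base=3):
    """Receptive field of stacked dilated convs with dilation base ** i."""
    rf = 1
    for i in range(n_layers):
        rf += (kernel - 1) * base ** i
    return rf


def gated_activation(b, spk=None):
    """tanh/sigmoid gate over a (batch, 2 * dim, time) array.

    spk, if given, has shape (batch, 2 * dim, 1) and is broadcast over time.
    """
    if spk is not None:
        b = b + spk
    dim = b.shape[1] // 2
    gate = 1.0 / (1.0 + np.exp(-b[:, dim:]))  # sigmoid half
    return np.tanh(b[:, :dim]) * gate         # tanh half, scaled by the gate


print(receptive_field())  # 81
out = gated_activation(np.zeros((2, 8, 5)), spk=np.zeros((2, 8, 1)))
print(out.shape)          # (2, 4, 5)
```

With dilations 1, 3, 9 and 27 the four blocks cover 81 time steps at their own resolution, which is why the dilation grows geometrically instead of stacking many plain convolutions.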

Generator

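The generator is essentially an upsampling stack: given the content extracted from the source and the target speaker information, it produces the target audio. Each block has two parts: a transposed convolution for upsampling, followed by dilated convolutions (the same structure as in the Content Encoder, except that the speaker information is added to the output of the dilated convolution, just before the gated activation).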

class Decoder(Module):
    r"""Implementation of the generator.

    Args:
        hp (HParams): Hyper-parameters.
    """

    def __init__(self, hp):
        self.hp = hp
        for i in range(len(hp.ratios)):
            setattr(self, f"block_{i}", UpBlock(hp))

    def call(self, x, spk_emb):
        hp = self.hp
        self.hop_length = np.prod(hp.ratios)  # 256
        mult = int(2 ** len(hp.ratios))       # 16

        with nn.parameter_scope("upsample"):
            x = F.pad(x, (0, 0, 3, 3), 'reflect')
            x = wn_conv(x, mult * hp.ngf, (7,))
            # [1, 512, 212]

        with nn.parameter_scope("first_layer"):
            x = F.gelu(x)
            x = F.pad(x, (0, 0, 3, 3), 'reflect')
            x = wn_conv(x, x.shape[1], (7,))

        for i, r in enumerate(hp.ratios):
            x = getattr(self, f"block_{i}")(x, spk_emb, r, mult // (2**i))
        # upsampling blocks: (1, 256, 1696) --> (1, 128, 13568) --> (1, 64, 27136) --> (1, 32, 54272)

        with nn.parameter_scope("waveform"):
            x = F.gelu(x)
            x = F.pad(x, (0, 0, 3, 3), 'reflect')
            x = wn_conv(x, 1, (7,))
            x = F.tanh(x)
        # [1, 1, 54272]
        return x

class UpBlock(Module):
    def __init__(self, hp):
        self.hp = hp
        for i in range(hp.n_residual_layers):
            setattr(self, f"resblock_{i}", ResnetBlock())

    def call(self, x, spk_emb, r, mult):
        hp = self.hp
        with nn.parameter_scope("deconv"):
            x = F.gelu(x)
            # print(x.shape)   # [1, 512, 212]
            x = wn_deconv(
                x, mult * hp.ngf // 2, (r * 2, ),
                stride=(r,), pad=(r // 2 + r % 2,),
            )
            # wn_deconv upsamples with a transposed convolution, whose output length is
            # out_dim = (in_dim - 1) * stride - 2 * padding + kernel_size
            # [1, 256, 1696]

        for i in range(hp.n_residual_layers):
            x = getattr(self, f"resblock_{i}")(x, spk_emb, 3 ** i)

        return x
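
The transposed-convolution formula quoted in `UpBlock` can be traced the same way. The sketch below uses hypothetical helper names and takes kernel `2 * r`, stride `r` and padding `r // 2 + r % 2` from the listing above.

```python
def deconv_out_len(n, kernel, stride, pad):
    """Output length of a 1-D transposed convolution."""
    return (n - 1) * stride - 2 * pad + kernel


def decoder_lengths(n, ratios=(8, 8, 2, 2)):
    """Trace the time-axis length through the four upsampling blocks."""
    lengths = []
    for r in ratios:  # the decoder applies the ratios as 8, 8, 2, 2
        n = deconv_out_len(n, kernel=2 * r, stride=r, pad=r // 2 + r % 2)
        lengths.append(n)
    return lengths


print(decoder_lengths(212))  # [1696, 13568, 27136, 54272]
```

Each stage multiplies the length by exactly `r`, so 212 content frames become 212 * 256 = 54272 samples. That is 202 samples shorter than the 54474-sample input, because the encoder's strided convolutions round down.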

Speaker Encoder

Derives the target speaker's embedding from the mel spectrogram.

class Speaker(Module):
    r"""Implementation of the speaker encoder.

    Args:
        hp (HParams): Hyper-parameters.
    """

    def __init__(self, hp):
        self.hp = hp
        self.rng = np.random.RandomState(hp.seed)

    def call(self, x):
        hp = self.hp
        dim = hp.n_speaker_embedding    # 128
        kernel, pad = (3,), (1,)

        with nn.parameter_scope('melspectrogram'):
            # sr = 22050
            # build the speaker embedding from the mel spectrogram
            out = log_mel_spectrogram(x, hp.sr, 1024)  # [1, 80, 219]
            if self.training:
                out = random_split(
                    out, axis=2, rng=self.rng,
                    lo=hp.split_low,    # 30
                    hi=hp.split_hight,  # 45
                )
                # A neat trick for robustness: split the mel spectrogram into small chunks, then concatenate them back together (in random order)

        with nn.parameter_scope('init_conv'):
            out = wn_conv(out, 32, kernel=kernel, pad=pad)  # [1, 32, 219]

        # n_spk_layers = 5

        for i in range(hp.n_spk_layers):
            dim_out = min(out.shape[1] * 2, 512)
            out = res_block(
                out, dim_out,
                kernel=kernel, pad=pad,
                scope=f'downsample_{i}',
                training=self.training
            )
        # after each block of convolution + avg_pool:
        # (1, 64, 109) --> (1, 128, 54) --> (1, 256, 27)  --> (1, 512, 13) --> (1, 512, 6)

        with nn.parameter_scope('last_layer'):
            out = F.average_pooling(out, kernel=(1, out.shape[-1]))
            out = F.leaky_relu(out, 0.2)
        # [1, 512, 1]

        with nn.parameter_scope('mean'):
            mu = wn_conv(out, dim, kernel=(1,))
        # [1, 128, 1]
        with nn.parameter_scope('logvar'):
            logvar = wn_conv(out, dim, kernel=(1,))
        # [1, 128, 1]
        return mu, logvar
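
The encoder returns a mean and a log-variance rather than a single vector. A common way to turn such a pair into an embedding is the reparameterization trick; the sketch below assumes that is what happens downstream (the sampling step itself is not shown in the listing above), and `sample_speaker_embedding` is a hypothetical name.

```python
import numpy as np


def sample_speaker_embedding(mu, logvar, rng):
    """Draw z ~ N(mu, exp(logvar)) as mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps


rng = np.random.default_rng(0)
mu = np.zeros((1, 128, 1))      # same shape as the `mu` returned above
logvar = np.zeros((1, 128, 1))  # log-variance 0 means unit variance
z = sample_speaker_embedding(mu, logvar, rng)
print(z.shape)  # (1, 128, 1)
```

At conversion time one would typically use `mu` directly instead of sampling, but that is a design choice rather than something the listing confirms.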

Discriminator

A multi-scale discriminator similar to the one in MelGAN.

class Discriminator(Module):
    r"""Implementation of the multi-scale discriminator.

    Args:
        hp (HParams): Hyper-parameters.
    """

    def __init__(self, hp):
        self.hp = hp
        for i in range(hp.num_D):
            setattr(self, f'dis_{i}', NLayerDiscriminator(hp))

    def call(self, x, y):
        hp = self.hp
        results = []
        for i in range(hp.num_D):
            # (8, 1, 32768): input to the first discriminator
            results.append(getattr(self, f'dis_{i}')(x, y))
            x = F.average_pooling(
                x, (1, 4),
                stride=(1, 2),
                pad=(0, 1),
                including_pad=False
            )
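            # (8, 1, 16384): input to the second discriminator
            # (8, 1, 8192): input to the third discriminator
            # (8, 1, 4096): pooled once more after the third discriminator; not consumed when num_D = 3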
        return results

class NLayerDiscriminator(Module):
    r"""A single discriminator.

    Args:
        hp (HParams): Hyper-parameters.
    """

    def __init__(self, hp):
        self.hp = hp

    def call(self, x, y):
        # x: [1, 1, 32768] raw waveform
        # y: [1, 1] speaker label for x
        hp = self.hp
        results = []
        with nn.parameter_scope('layer_0'):
            x = F.pad(x, (0, 0, 7, 7), 'reflect')
            x = wn_conv(x, hp.ndf, (15,))
            x = F.leaky_relu(x, 0.2, inplace=True)
            # x: [1, 16, 32768]
            results.append(x)

        nf = hp.ndf  # 16
        stride = hp.downsamp_factor  # 4
        # n_layers_D: 4
        for i in range(1, hp.n_layers_D + 1):
            nf_prev = nf
            nf = min(nf * stride, 1024)
            with nn.parameter_scope(f'layer_{i}'):
                x = wn_conv(
                    x, nf, (stride * 10 + 1,),
                    stride=(stride,),
                    pad=(stride * 5,),
                    group=nf_prev // 4,
                )
                x = F.leaky_relu(x, 0.2, inplace=True)
                # (1, 64, 8192)
                # (1, 256, 2048)
                # (1, 1024, 512)
                # (1, 1024, 128)
                results.append(x)
        with nn.parameter_scope(f'layer_{hp.n_layers_D + 1}'):
            nf = min(nf * 2, 1024)
            x = wn_conv(x, nf, kernel=(5,), pad=(2,))
            x = F.leaky_relu(x, 0.2, inplace=True)
            # (1, 1024, 128)
            results.append(x)

        with nn.parameter_scope(f'layer_{hp.n_layers_D + 2}'):
            x = wn_conv(x, hp.n_speakers, kernel=(3,), pad=(1,))
            # (1, 103:n_speakers, 128)
            if y is not None:
                # stack: like concatenate, but joins the inputs along a new axis
                idx = F.stack(
                    F.arange(0, hp.batch_size),
                    y.reshape((hp.batch_size,))
                )
                # idx: [2, 8]
                # x: [8, 103, 128]
                x = F.gather_nd(x, idx)
                # for each sample, pick the one channel out of 103 that matches its label y
                # x: [8, 128]
            results.append(x)

        return results
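
The `F.stack` plus `F.gather_nd` step at the end is the least obvious part of the discriminator. As far as the shape comments indicate, it is equivalent to the NumPy fancy indexing below, which keeps one speaker channel per sample. This is a sketch with a hypothetical function name, not the library call itself.

```python
import numpy as np


def select_speaker_channel(logits, labels):
    """Pick, for every sample, the channel that matches its speaker label.

    logits: (batch, n_speakers, time); labels: (batch,) integer speaker ids.
    Returns an array of shape (batch, time).
    """
    batch_idx = np.arange(logits.shape[0])
    return logits[batch_idx, labels]


logits = np.arange(2 * 3 * 4).reshape(2, 3, 4)
labels = np.array([2, 0])
picked = select_speaker_channel(logits, labels)
print(picked.shape)  # (2, 4)
```

Each discriminator therefore acts as one binary real/fake classifier per speaker, and only the classifier of the labeled speaker contributes to the loss.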

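Training

Discriminator loss

Adversarial loss (cross-entropy, with target 1 for real speech and 0 for converted speech): it sharpens the discriminator so that it can tell real speech from synthesized speech.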

def adversarial_loss(self, results, v):
    r"""Returns the adversarial loss.

    Args:
        results (list): Output from discriminator.
        v (int, optional): Target value. Real=1.0, fake=0.0.

    Returns:
        nn.Variable: Output variable.
    """
    loss = []
    # cross-entropy over the final outputs of the three discriminators, with shapes:
    # (8, 128)
    # (8, 64)
    # (8, 32)
    for out in results:
        t = F.constant(v, shape=out[-1].shape)
        r = F.sigmoid_cross_entropy(out[-1], t)
        loss.append(F.mean(r))
    return sum(loss)
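
For readers without nnabla at hand, the same computation can be sketched in NumPy. The sigmoid cross-entropy is written in its usual numerically stable form; the function names are hypothetical, and the output shapes are the ones noted in the comments above.

```python
import numpy as np


def sigmoid_cross_entropy(logits, target):
    """Element-wise binary cross-entropy on raw logits (stable form)."""
    return (np.maximum(logits, 0.0) - logits * target
            + np.log1p(np.exp(-np.abs(logits))))


def adversarial_loss_np(final_outputs, target):
    """Sum over discriminators of the mean cross-entropy against `target`."""
    return sum(float(np.mean(sigmoid_cross_entropy(o, target)))
               for o in final_outputs)


# Final outputs of three discriminators for a batch of 8.
outs = [np.zeros((8, 128)), np.zeros((8, 64)), np.zeros((8, 32))]
loss_real = adversarial_loss_np(outs, 1.0)  # the discriminator sees real speech
loss_fake = adversarial_loss_np(outs, 0.0)  # the discriminator sees converted speech
print(loss_real, loss_fake)
```

With all-zero logits the discriminator is maximally unsure, so each of the three terms equals log 2 regardless of the target.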

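Generator loss

Adversarial loss (cross-entropy, with target 1 for converted speech): it pushes the generator to produce speech that fools the discriminator.
Content preservation loss (the MSE between the content encoder's output for the real speech and its output for the converted speech): it ensures that the content encoder really captures content.
KL divergence loss: I don't fully understand this part, or rather I suspect there is a problem with it. The paper says that, in order to sample from the latent space, it assumes the latent variable follows a standard normal distribution. But then why constrain the speaker encoder's posterior directly with a KL divergence against the standard normal? If that constraint is weighted too heavily, wouldn't every speaker end up with the same embedding?
Reconstruction loss: a feature matching loss plus a spectral loss.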
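
Two of these terms have simple closed forms, sketched below in NumPy with hypothetical function names. The KL term uses the standard formula for a diagonal Gaussian against N(0, I); the content term is the plain MSE described above. How the terms are weighted against each other is not covered here.

```python
import numpy as np


def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, diag(exp(logvar))) || N(0, I)), averaged over the batch."""
    kl = 0.5 * np.sum(mu ** 2 + np.exp(logvar) - logvar - 1.0, axis=1)
    return float(np.mean(kl))


def content_preservation_loss(c_src, c_conv):
    """MSE between content codes of the source and of the converted speech."""
    return float(np.mean((c_src - c_conv) ** 2))


mu = np.ones((2, 128, 1))
logvar = np.zeros((2, 128, 1))
print(kl_to_standard_normal(mu, logvar))  # 64.0
```

The KL term is zero only when every speaker maps to mean 0 and unit variance, so the size of its weight decides how much room the embeddings have to differ between speakers.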

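Data augmentation during training

Training this GAN involves quite a lot of data augmentation:

Shifting the phase of a signal by 180 degrees does not affect human auditory perception, so the sign of the input can be flipped by multiplying it by -1 to obtain a different input.
Random amplitude scaling.
A small amount of temporal jitter.
Random shuffling: the input signal is first split into segments whose lengths are drawn uniformly from 0.35 to 0.45 seconds; the segments are then shuffled and concatenated, forming a new input whose linguistic content is scrambled. The paper reports that this random shuffling helps keep content information from leaking into the speaker embedding, which leads to better disentanglement.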
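
Three of these augmentations are easy to sketch on a raw waveform with NumPy. Everything below is illustrative: the function names are hypothetical, the amplitude range `(0.25, 1.0)` is an assumed placeholder and not a value taken from the paper or the code, and time jitter is left out.

```python
import numpy as np


def random_flip(x, rng):
    """Multiply the waveform by -1 with probability 0.5 (180-degree phase shift)."""
    sign = -1.0 if rng.random() < 0.5 else 1.0
    return x * sign


def random_scale(x, rng, lo=0.25, hi=1.0):
    """Scale the amplitude by a random factor (the range is an assumption)."""
    return x * rng.uniform(lo, hi)


def random_shuffle(x, sr, rng, lo=0.35, hi=0.45):
    """Cut x into segments of lo..hi seconds, then concatenate them in random order."""
    segments, start = [], 0
    while start < len(x):
        size = int(rng.uniform(lo, hi) * sr)
        segments.append(x[start:start + size])
        start += size
    order = rng.permutation(len(segments))
    return np.concatenate([segments[i] for i in order])


rng = np.random.default_rng(0)
wave = np.arange(54474, dtype=np.float32)  # stand-in for a real waveform
shuffled = random_shuffle(wave, sr=22050, rng=rng)
print(shuffled.shape)  # (54474,)
```

Shuffling keeps every sample and only changes the order of the segments, so speaker characteristics such as timbre survive while the sentence content is scrambled.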