Speech Decomposition with Source Filter Model#
In this example, we’re going to decompose a speech signal into its source \(e[n]\) and filter components \(a_k\), following the LPC model (6) we introduced in the section Differentiable Implementation of IIR Filters.
We’ll first use the traditional method to estimate the LPC filter, and then we’ll use our differentiable LPC to do end-to-end decomposition.
Again, let’s first import the necessary packages and define some helper functions.
Show code cell source
import torch
import torch.nn.functional as F
import torchaudio
import math
import numpy as np
from torchaudio.functional import lfilter
import matplotlib.pyplot as plt
from typing import Optional, Tuple, List, Union
from IPython.display import Audio
import diffsptk
Show code cell source
def plot_t(
title: str,
ys: List[np.ndarray],
labels: List[str] = None,
scatter: bool = False,
axhline: bool = False,
x_label: str = "Samples",
y_label: str = "Ampitude",
):
for y, label in (
zip(ys, labels) if labels is not None else zip(ys, [None] * len(ys))
):
plt.plot(y, label=label) if not scatter else plt.scatter(
np.arange(len(y)) + 1, y, label=label
)
plt.xlabel(x_label)
plt.ylabel(y_label)
plt.title(title)
if label is not None:
plt.legend()
if axhline:
plt.axhline(y=0, color="r", linestyle="dashed", alpha=0.5)
def plot_f(
ys: List[np.ndarray] = None,
paired_ys: List[Tuple[np.ndarray, np.ndarray]] = None,
ys_labels: List[str] = None,
paired_ys_labels: List[str] = None,
sr: int = None,
):
if ys is not None:
for y, label in (
zip(ys, ys_labels) if ys_labels is not None else zip(ys, [None] * len(ys))
):
plt.magnitude_spectrum(
y, Fs=sr, scale="dB", window=np.hanning(len(y)), label=label
)
if paired_ys is not None:
for (f, y), label in (
zip(paired_ys, paired_ys_labels)
if paired_ys_labels is not None
else zip(paired_ys, [None] * len(paired_ys))
):
plt.plot(f, 20 * np.log10(np.abs(y)), label=label)
plt.xlabel("Frequency (Hz)")
plt.ylabel("Magnitude (dB)")
plt.xlim(20, sr // 2)
plt.title("Frequency spectrum")
if ys_labels is not None or paired_ys_labels is not None:
plt.legend()
We’re going to use a speech sample from the CMU Arctic speech synthesis database.
!wget "http://festvox.org/cmu_arctic/cmu_arctic/cmu_us_awb_arctic/wav/arctic_a0007.wav"
Show code cell output
--2024-03-10 19:30:48-- http://festvox.org/cmu_arctic/cmu_arctic/cmu_us_awb_arctic/wav/arctic_a0007.wav
Resolving festvox.org (festvox.org)...
199.4.150.153
Connecting to festvox.org (festvox.org)|199.4.150.153|:80...
connected.
HTTP request sent, awaiting response...
200 OK
Length: 128044 (125K) [audio/x-wav]
Saving to: ‘arctic_a0007.wav’
arctic_a0007.wav 0%[ ] 0 --.-KB/s
arctic_a0007.wav 100%[===================>] 125.04K --.-KB/s in 0.1s
2024-03-10 19:30:48 (842 KB/s) - ‘arctic_a0007.wav’ saved [128044/128044]
Show code cell source
y, sr = torchaudio.load("arctic_a0007.wav")
y = y.squeeze()
plt.plot(np.arange(y.shape[0]) / sr, y.numpy())
plt.xlabel("Time [s]")
plt.ylabel("Amplitude")
plt.show()
Audio(y.numpy(), rate=sr)
Let’s pick one short segment of the speech with relatively stable pitch and formants, so that a stationary model is a reasonable fit.
Show code cell source
target = y[10000:11024]
fig = plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plot_t("Target signal", [target.numpy()])
plt.subplot(1, 2, 2)
plot_f([target.numpy()], sr=sr)
plt.show()
Classic LPC Estimation#
The common way to estimate the LPC filter is to assume that the current sample \(s[n]\) can be approximated from its past samples only. This amounts to minimising the energy of the prediction error \(e[n]\):
\[
e[n] = s[n] - \sum_{k=1}^{M} a_k s[n-k].
\]
Its least squares solution can be computed from the autocorrelation of the signal [Mak75]. We’ll use the diffsptk
package to compute this.
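For reference, a rough sketch of the Levinson-Durbin recursion that solves these autocorrelation equations is shown below. It is only illustrative: the helper name, the rectangular-window autocorrelation, and the gain handling are simplifications of ours, not diffsptk’s API; in practice we simply call diffsptk.LPC.
def levinson_durbin(x: torch.Tensor, order: int):
    n = x.shape[-1]
    # autocorrelation lags r[0..order] of the (ideally windowed) frame
    r = torch.stack([(x[: n - k] * x[k:]).sum() for k in range(order + 1)])
    a = x.new_zeros(order)  # running estimate of the predictor coefficients a_1..a_M
    err = r[0]              # prediction-error energy
    for i in range(order):
        # reflection (PARCOR) coefficient for order i + 1
        k = (r[i + 1] - (a[:i] * r[1 : i + 1].flip(0)).sum()) / err
        a_prev = a.clone()
        a[i] = k
        a[:i] = a_prev[:i] - k * a_prev[:i].flip(0)
        err = err * (1 - k**2)
    # negated so they follow the same sign convention as the `coeffs` tensor used
    # in the cells below (i.e. A(z) = 1 + sum_k a_k z^{-k}); the gain reported by
    # diffsptk is derived from the residual energy `err`
    return -a, err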
lpc_order = 18
frame_length = 1024
lpc = diffsptk.LPC(lpc_order, frame_length)
gain, coeffs = lpc(target).split([1, lpc_order], dim=-1)
print(f"Gain: {gain.item()}")
Gain: 0.23411035537719727
If we plot the spectrum of the LPC filter, we’d see that it approximates the spectral envelope of the signal.
Show code cell source
freq_response = (
gain
/ torch.fft.rfft(torch.cat([coeffs.new_ones(1), coeffs]), n=frame_length)
/ frame_length
)
fig = plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plot_t("LPC Coefficients", [coeffs.numpy()], scatter=True, axhline=True, x_label="LPC order")
plt.ylim(-2, 2)
plt.subplot(1, 2, 2)
plot_f(
ys=[target.numpy()],
ys_labels=["target signal"],
paired_ys=[
(
np.arange(frame_length // 2 + 1) / frame_length * sr,
freq_response.numpy(),
)
],
paired_ys_labels=["filter response"],
sr=sr,
)
plt.show()
We can get the source (or residual) \(e[n]\) by inverse filtering the signal with the LPC coefficients, which is equivalent to filtering the signal with the FIR filter \([1, -a_1, -a_2, \dots, -a_M]\).
e = (
target
+ F.conv1d(
F.pad(target[None, None, :-1], (lpc_order, 0)), coeffs.flip(0)[None, None, :]
).squeeze()
)
e = e / gain
After cancelling the spectral envelope, the frequency response of the residual becomes much flatter, with roughly equal energy across the spectrum. This is a consequence of the least squares optimisation, which assumes the prediction error is white noise.
Show code cell source
fig = plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plot_t("Residual", [e.numpy()])
plt.subplot(1, 2, 2)
plot_f([e.numpy()], sr=sr)
plt.show()
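As an optional aside (not part of the original analysis), we can quantify this flattening with the spectral flatness measure, the ratio of the geometric to the arithmetic mean of the power spectrum, which approaches 1 for white-noise-like signals. The helper below is a minimal sketch.
def spectral_flatness(x: torch.Tensor) -> float:
    # power spectrum of the Hann-windowed frame
    p = torch.fft.rfft(x * torch.hann_window(len(x))).abs() ** 2 + 1e-12
    # geometric mean / arithmetic mean; 1.0 means a perfectly flat spectrum
    return (torch.exp(torch.mean(torch.log(p))) / torch.mean(p)).item()

print(f"Spectral flatness of the target:   {spectral_flatness(target):.3f}")
print(f"Spectral flatness of the residual: {spectral_flatness(e):.3f}")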
Decomposing Speech with Differentiable LPC and a Glottal Flow Model#
In the example above, we made very few assumptions about the source \(e[n]\); we only assumed that it is white-noise-like. In the next example, we’re going to incorporate a glottal flow model to put more constraints on the source.
The model we’re going to use is the transformed-LF model [Fan95], which describes the periodic vibration of the vocal folds. Specifically, we use the derivative of the glottal flow, which folds lip radiation into the source by assuming that lip radiation acts as a first-order differentiator. The model has only one parameter, \(R_d\), which is strongly correlated with perceived vocal effort. Although the model itself is differentiable, for computational efficiency we approximate it with a pre-computed lookup table.
Show code cell source
def transformed_lf(Rd: torch.Tensor, points: int = 1024):
# the implementation is adapted from https://github.com/dsuedholt/vocal-tract-grad/blob/main/glottis.py
# Ra, Rk, and Rg are called R parameters in glottal flow modeling
# We can infer the values of Ra, Rk, and Rg from Rd
Rd = torch.as_tensor(Rd).view(-1, 1)
Ra = -0.01 + 0.048 * Rd
Rk = 0.224 + 0.118 * Rd
Rg = (Rk / 4) * (0.5 + 1.2 * Rk) / (0.11 * Rd - Ra * (0.5 + 1.2 * Rk))
# convert R parameters to Ta, Tp, and Te
# Ta: The return phase duration
# Tp: Time of the maximum of the pulse
# Te: Time of the minimum of the time-derivative of the pulse
Ta = Ra
Tp = 1 / (2 * Rg)
Te = Tp + Tp * Rk
epsilon = 1 / Ta
shift = torch.exp(-epsilon * (1 - Te))
delta = 1 - shift
rhs_integral = (1 / epsilon) * (shift - 1) + (1 - Te) * shift
rhs_integral /= delta
lower_integral = -(Te - Tp) / 2 + rhs_integral
upper_integral = -lower_integral
omega = torch.pi / Tp
s = torch.sin(omega * Te)
y = -torch.pi * s * upper_integral / (Tp * 2)
z = torch.log(y)
alpha = z / (Tp / 2 - Te)
EO = -1 / (s * torch.exp(alpha * Te))
t = torch.linspace(0, 1, points + 1)[None, :-1]
before = EO * torch.exp(alpha * t) * torch.sin(omega * t)
after = (-torch.exp(-epsilon * (t - Te)) + shift) / delta
return torch.where(t < Te, before, after).squeeze()
Show code cell source
t = torch.linspace(0, 1, 1024)
plt.plot(t, transformed_lf(0.3).numpy(), label="Rd = 0.3")
plt.plot(t, transformed_lf(0.5).numpy(), label="Rd = 0.5")
plt.plot(t, transformed_lf(0.8).numpy(), label="Rd = 0.8")
plt.plot(t, transformed_lf(2.7).numpy(), label="Rd = 2.7")
plt.title("Transformed LF")
plt.legend()
plt.xlabel("T (period)")
plt.ylabel("Amplitude")
plt.show()
Show code cell source
# 0.3 <= Rd <= 2.7 is a reasonable range for Rd
# we sampled them logarithmically for better resolution at lower values
table = transformed_lf(torch.exp(torch.linspace(math.log(0.3), math.log(2.7), 100)))
# align the peaks of the transformed LF for better optimisation
peaks = table.argmin(dim=-1)
shifts = peaks.max() - peaks
aligned_table = torch.stack(
[torch.roll(table[i], shifts[i].item(), 0) for i in range(table.shape[0])]
)
plt.title("Transformed LF wavetables")
plt.imshow(aligned_table, aspect="auto", origin="lower")
plt.xlabel("T (samples)")
plt.ylabel("Table index")
plt.colorbar()
plt.show()
The full model we’re going to use is:
\[
s[n] = \sum_{k=1}^{M} a_k s[n-k] + g\, w\!\left(\left(\frac{f_0}{f_s} n + \phi\right) \bmod 1;\; R_d\right),
\]
where we replace the source \(e[n]\) with the following parameters: gain \(g\), fundamental frequency \(f_0\), phase offset \(\phi\), and \(R_d\). Here, \(w\) is the pre-computed glottal flow wavetable and \(f_s\) is the sampling rate. The short sketch below illustrates the wavetable read in isolation; after that, let’s define the full model in code.
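Here is a minimal standalone sketch (illustrative only) of that wavetable read: accumulate phase at \(f_0 / f_s\), wrap it to \([0, 1)\), and linearly interpolate the pre-computed table. The values of f0, the phase offset, the gain, and the table index are arbitrary choices of ours; aligned_table is the table computed above. The SourceFilter module defined next wraps the same logic with learnable parameters.
f0_demo, phi_demo, g_demo = 130.0, 0.0, 1.0
w = aligned_table[50]                                # one fixed Rd for illustration
phase = (torch.arange(1024) * f0_demo / sr + phi_demo) % 1
idx = phase * w.shape[0]
w_wrap = torch.cat([w, w[:1]])                       # wrap-around sample for interpolation
frac = idx - idx.long()
e_demo = g_demo * (w_wrap[idx.long()] * (1 - frac) + w_wrap[idx.long() + 1] * frac)
plt.plot(e_demo.numpy())
plt.title("Wavetable source sketch at 130 Hz")
plt.xlabel("Samples")
plt.ylabel("Amplitude")
plt.show()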
class SourceFilter(torch.nn.Module):
def __init__(
self,
lpc_order: int,
sr: int,
table_points=1024,
num_tables=100,
init_f0: float = 100.0,
init_offset: float = 0.0,
init_log_gain: float = 0.0,
):
super().__init__()
Rd_sampled = torch.exp(torch.linspace(math.log(0.3), math.log(2.7), num_tables))
table = transformed_lf(Rd_sampled, points=table_points)
peaks = table.argmin(dim=-1)
shifts = peaks.max() - peaks
aligned_table = torch.stack(
[torch.roll(table[i], shifts[i].item(), 0) for i in range(table.shape[0])]
)
self.register_buffer("table", aligned_table)
self.register_buffer("Rd_sampled", Rd_sampled)
self.f0 = torch.nn.Parameter(torch.tensor(init_f0))
self.offset = torch.nn.Parameter(torch.tensor(init_offset))
self.Rd_index_logits = torch.nn.Parameter(torch.zeros(1))
self.log_gain = torch.nn.Parameter(torch.tensor(init_log_gain))
# we use the reflection coefficients parameterisation for stable optimisation
self.log_area_ratios = torch.nn.Parameter(torch.zeros(lpc_order))
self.logits2lpc = torch.nn.Sequential(
diffsptk.LogAreaRatioToParcorCoefficients(lpc_order),
diffsptk.ParcorCoefficientsToLinearPredictiveCoefficients(lpc_order),
)
self.lpc_order = lpc_order
self.table_points = table_points
self.num_tables = num_tables
self.sr = sr
@property
def Rd_index(self):
return torch.sigmoid(self.Rd_index_logits) * (self.num_tables - 1)
@property
def Rd(self):
return self.Rd_sampled[torch.round(self.Rd_index).long().item()]
@property
def gain(self):
return torch.exp(self.log_gain)
@property
def filter_coeffs(self):
return self.logits2lpc(
torch.cat([self.log_gain.view(1), self.log_area_ratios])
).split([1, self.lpc_order])
def source(self, steps):
"""
Generate the glottal pulse source signal
"""
# select the wavetable using linear interpolation
select_index_floor = self.Rd_index.long().item()
p = self.Rd_index - select_index_floor
selected_table = (
self.table[select_index_floor] * (1 - p) + self.table[select_index_floor + 1] * p
)
# generate the source signal by interpolating the wavetable
phase = (
torch.arange(
steps, device=selected_table.device, dtype=selected_table.dtype
)
/ self.sr
* self.f0
+ self.offset
) % 1
phase_index = phase * self.table_points
# append the first sample to the end for easier interpolation
padded_table = torch.cat([selected_table, selected_table[:1]])
phase_index_floor = phase_index.long()
phase_index_ceil = phase_index_floor + 1
p = phase_index - phase_index_floor
glottal_pulse = (
padded_table[phase_index_floor] * (1 - p)
+ padded_table[phase_index_ceil] * p
)
return glottal_pulse
def forward_filt(self, e):
"""
Apply the LPC filter to the input signal
"""
# get filter coefficients
log_gain, lpc_coeffs = self.filter_coeffs
# IIR filtering
b = log_gain.new_zeros(1 + lpc_coeffs.shape[-1])
b[0] = torch.exp(log_gain)
a = torch.cat([lpc_coeffs.new_ones(1), lpc_coeffs])
return lfilter(e, a, b, clamp=False)
def forward(self, steps):
"""
Generate the speech signal
"""
return self.forward_filt(self.source(steps))
def inverse_filt(self, s):
"""
Inverse filtering
"""
# get filter coefficients
_, lpc_coeffs = self.filter_coeffs
e = (
s
+ F.conv1d(
F.pad(s[None, None, :-1], (self.lpc_order, 0)),
lpc_coeffs.flip(0)[None, None, :],
).squeeze()
)
e = e / self.gain
return e
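As a quick, optional check (not from the original notebook) of why the reflection-coefficient parameterisation mentioned in the code comment keeps the optimisation stable: log area ratios map to PARCOR coefficients strictly inside \((-1, 1)\), so any parameter vector, however extreme, yields an all-pole filter with poles inside the unit circle.
lar2lpc = torch.nn.Sequential(
    diffsptk.LogAreaRatioToParcorCoefficients(lpc_order),
    diffsptk.ParcorCoefficientsToLinearPredictiveCoefficients(lpc_order),
)
# random, deliberately large log area ratios (the first element is the gain term)
_, random_lpc = lar2lpc(
    torch.cat([torch.zeros(1), torch.randn(lpc_order) * 3.0])
).split([1, lpc_order])
# poles of 1 / A(z) with A(z) = 1 + sum_k a_k z^{-k}
poles = np.roots(np.concatenate([[1.0], random_lpc.detach().numpy()]))
print("Maximum pole magnitude:", np.abs(poles).max())  # stays below 1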
Proper initialisation of the parameters plays an important role in the optimisation; we’re going to start from the following values.
model = SourceFilter(lpc_order, sr, init_f0=130.0, init_offset=0.0, init_log_gain=-1.3)
print(f"Gain: {model.gain.item()}")
print(f"Rd: {model.Rd.item()}")
print(f"f0: {model.f0.item()}")
print(f"Offset: {model.offset.item() % 1}")
Gain: 0.27253180742263794
Rd: 0.9100430011749268
f0: 130.0
Offset: 0.0
Show code cell source
with torch.no_grad():
output = model(1024)
fig = plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plot_t("Initial prediction", [output.numpy(), target.numpy()], labels=["predict (initial)", "target"])
plt.subplot(1, 2, 2)
plot_f(
ys=[output.numpy(), target.numpy()],
ys_labels=["predict (initial)", "target"],
sr=sr,
)
plt.show()
Let’s optimise the parameters with gradient descent. We’re going to use the famous Adam optimiser with a learning rate of 0.001 and run it for 2000 iterations. The loss function we’re going to use is the L1 loss between the original signal and the modelled signal.
Show code cell source
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
losses = []
for _ in range(2000):
optimizer.zero_grad()
output = model(1024)
loss = F.l1_loss(output, target)
loss.backward()
optimizer.step()
losses.append(loss.item())
plt.plot(losses)
plt.xlabel("Iterations")
plt.ylabel("Loss")
plt.show()
Show code cell source
with torch.no_grad():
final_output = model(1024)
print(f"Gain: {model.gain.item()}")
print(f"Rd: {model.Rd.item()}")
print(f"f0: {model.f0.item()}")
print(f"Offset: {model.offset.item() % 1}")
fig = plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plot_t("Final prediction", [final_output.numpy(), target.numpy()], labels=["predict (optimised)", "target"])
plt.subplot(1, 2, 2)
plot_f(
ys=[final_output.numpy(), target.numpy()],
ys_labels=["predict (optimised)", "target"],
sr=sr,
)
plt.show()
Gain: 0.18434563279151917
Rd: 1.5502203702926636
f0: 131.02642822265625
Offset: 0.9482915364205837
Wow, this is pretty good! We can see that the model reconstructs the original signal quite well, with very similar waveforms. Moreover, the model tells us the optimal parameters for constructing the source signal. Let’s see what the source signal looks like.
Show code cell source
with torch.no_grad():
e = model.source(1024)
s = model.forward_filt(e)
fig = plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plot_t("Waveform", [e.numpy() / 4, s.numpy()], labels=["e[n]", "s[n]"])
plt.subplot(1, 2, 2)
plot_f(
ys=[e.numpy() / 4, s.numpy()],
ys_labels=["e[n]", "s[n]"],
sr=sr,
)
plt.show()
Let’s compare the spectrum of the two filters.
Show code cell source
_, lpc_coeffs = model.filter_coeffs
with torch.no_grad():
freq_response_opt = (
model.gain
/ torch.fft.rfft(
torch.cat([lpc_coeffs.new_ones(1), lpc_coeffs]), n=frame_length
)
/ frame_length
)
fig = plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plot_t(
"LPC Coefficients",
[coeffs.numpy(), lpc_coeffs.detach().numpy()],
labels=["least squares LPC", "differentiable LPC"],
scatter=True,
axhline=True,
x_label="LPC order",
)
plt.ylim(-2, 2)
plt.subplot(1, 2, 2)
freqs = np.arange(frame_length // 2 + 1) / frame_length * sr
plot_f(
paired_ys=[
(
freqs,
freq_response.numpy(),
),
(
freqs,
freq_response_opt.numpy(),
),
],
paired_ys_labels=["least squares LPC", "differentiable LPC"],
sr=sr,
)
plt.show()
Interestingly, the two filters look very different. The main reason is that we restricted the source signal to specific glottal-pulse shapes. The gradient-based method also cannot achieve a lossless decomposition, while the classic LPC method can (the quick check below illustrates this). However, the source signal we get from the gradient-based method is much more interpretable. In fact, the latter method is a simplified version of the synthesiser used in the GOLF vocoder proposed by Yu et al. [YF23].
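To back up the remark about losslessness, here is a quick, optional check (not in the original notebook; it recomputes the classic residual because the variable e was reused above): re-synthesising from the classic LPC residual reproduces the target up to numerical error, while the optimised model only approximates it.
with torch.no_grad():
    # recompute the classic LPC residual and filter it back through gain / A(z)
    residual = (
        target
        + F.conv1d(
            F.pad(target[None, None, :-1], (lpc_order, 0)),
            coeffs.flip(0)[None, None, :],
        ).squeeze()
    ) / gain
    a = torch.cat([coeffs.new_ones(1), coeffs])
    b = torch.zeros_like(a)
    b[0] = gain
    reconstruction = lfilter(residual, a, b, clamp=False)
    print(f"Classic LPC reconstruction L1 error:          {F.l1_loss(reconstruction, target).item():.2e}")
    print(f"Differentiable model reconstruction L1 error: {F.l1_loss(final_output, target).item():.2e}")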
References#
- Fan95
Gunnar Fant. The LF-model revisited: transformations and frequency domain analysis. Speech Trans. Lab. Q. Rep., Royal Inst. of Tech. Stockholm, 2(3):40, 1995.
- Mak75
John Makhoul. Linear prediction: a tutorial review. Proceedings of the IEEE, 63(4):561–580, 1975.
- YF23
Chin-Yun Yu and György Fazekas. Singing voice synthesis using differentiable lpc and glottal-flow-inspired wavetables. arXiv preprint arXiv:2306.17252, 2023.