Module imbe::unvoiced

Unvoiced spectrum synthesis.

At a high level, the unvoiced signal is generated with these steps:

1. Construct the frequency-domain representation of a white noise signal using the Discrete Fourier Transform (DFT).
2. Bandpass the spectrum to contain only the frequencies marked as unvoiced in the current frame.
3. Perform an Inverse Discrete Fourier Transform (IDFT) on this spectrum to produce a white noise signal containing only the unvoiced frequency content.
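
As a minimal sketch of the bandpass step only, the function below zeroes every DFT point that is not flagged as unvoiced. The `(f64, f64)` pair layout for (Re, Im) and the per-point `unvoiced` flag slice are assumptions made for illustration; they are not this module's types, and the band edges that would set those flags are not computed here.

```rust
/// Zero every DFT point that is not flagged as unvoiced.
///
/// `spectrum` holds (Re, Im) pairs and `unvoiced[m]` says whether point `m`
/// lies in an unvoiced band. Both layouts are hypothetical.
fn bandpass_unvoiced(spectrum: &mut [(f64, f64)], unvoiced: &[bool]) {
    for (point, &keep) in spectrum.iter_mut().zip(unvoiced) {
        if !keep {
            // Voiced (or out-of-band) points contribute nothing to the noise.
            *point = (0.0, 0.0);
        }
    }
}
```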

Rather than performing both DFT and IDFT operations, which are relatively expensive and were found to be a bottleneck, this implementation computes an equivalent result using only a partial IDFT.

DFT of Noise

Under certain circumstances the DFT of a noise signal can have its points sampled from a probability distribution, rather than each computed with an O(N) procedure.

According to the standard [p58], the DFT is generated from a windowed white noise signal u(n). The standard specifies that the signal can have arbitrary mean but doesn't specify whether there are any constraints on the variance or whether the signal must be real or complex (although the given example u(n) is real with sample mean μ_x ≈ 26562 and variance σ_x² ≈ 235198690).

This implementation assumes the white noise signal is real with samples drawn from a Gaussian distribution having mean μ_x = 0 and variance σ_x² = 1, i.e., u(n) ~ N(0, 1). The resulting DFT of this signal then has points with real and imaginary parts that are sampled from a Gaussian distribution with mean μ = 0 and variance σ² = E_w σ_x² / 2 = E_w / 2, i.e., Re[U_w(m)], Im[U_w(m)] ~ N(0, E_w / 2), where E_w is the energy of the speech synthesis window w_s(n).

Note that the given source only derives this result for a complex signal with equal real and imaginary sample variances, but empirical evaluations show that the result is the same with a real signal.
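
For a real signal the variances can also be checked without drawing any samples. If the u(n) are independent with unit variance, then Var(Re[U_w(m)]) = Σ w(n)² cos²(2πmn/N) and Var(Im[U_w(m)]) = Σ w(n)² sin²(2πmn/N). The sketch below evaluates these sums and the window energy E_w. It uses a periodic Hann window purely as a hypothetical stand-in for w_s(n), and all function names are illustrative.

```rust
use std::f64::consts::PI;

/// Hypothetical stand-in for the synthesis window: a periodic Hann window.
fn hann_window(len: usize) -> Vec<f64> {
    (0..len)
        .map(|n| 0.5 - 0.5 * (2.0 * PI * n as f64 / len as f64).cos())
        .collect()
}

/// Window energy E_w, the sum of the squared window samples.
fn window_energy(w: &[f64]) -> f64 {
    w.iter().map(|x| x * x).sum()
}

/// Variances of (Re[U_w(m)], Im[U_w(m)]) when the windowed signal is
/// w(n) u(n) with independent u(n) ~ N(0, 1).
fn dft_point_variances(w: &[f64], m: usize) -> (f64, f64) {
    let len = w.len() as f64;
    let mut re = 0.0;
    let mut im = 0.0;
    for (n, &x) in w.iter().enumerate() {
        let phase = 2.0 * PI * (m * n) as f64 / len;
        re += (x * phase.cos()).powi(2);
        im += (x * phase.sin()).powi(2);
    }
    (re, im)
}
```

With this stand-in window and N = 256, both variances equal E_w / 2 for m = 2 through 126. Near DC and the Nyquist point they differ: at m = 0, for example, the real part has variance E_w and the imaginary part has variance 0.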

DFT Symmetry

According to Eq 125, the IDFT is defined as

u_w(n) = [U_w(-128) exp(j 2π(-128)n/256) + U_w(-127) exp(j 2π(-127)n/256) + ··· + U_w(126) exp(j 2π(126)n/256) + U_w(127) exp(j 2π(127)n/256)] / 256

Since u(n) is a real signal, U_w(-m) = U_w(m)^* (i.e., the complex conjugate), so the DFT magnitude is symmetric around DC. Writing U_w(m) = a + j b and φ = 2πn/256, for all m, 0 ≤ m ≤ 127,

U_w(m) exp(j mφ) + U_w(-m) exp(-j mφ) =

(a + j b)(cos mφ + j sin mφ) + (a - j b)(cos mφ - j sin mφ) =

2a cos mφ - 2b sin mφ = 2 Re[U_w(m)] cos mφ - 2 Im[U_w(m)] sin mφ
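
The identity can be sanity-checked numerically. The sketch below evaluates the left-hand side with explicit complex arithmetic on `(re, im)` tuples and the right-hand side with real arithmetic only. The tuple representation and the function names are illustrative assumptions, not part of this module.

```rust
/// Multiply two complex numbers given as (re, im) tuples.
fn cmul(x: (f64, f64), y: (f64, f64)) -> (f64, f64) {
    (x.0 * y.0 - x.1 * y.1, x.0 * y.1 + x.1 * y.0)
}

/// Left-hand side: U exp(j mφ) + conj(U) exp(-j mφ), where `angle` is mφ.
fn pair_complex(u: (f64, f64), angle: f64) -> (f64, f64) {
    let pos = cmul(u, (angle.cos(), angle.sin()));
    let neg = cmul((u.0, -u.1), (angle.cos(), -angle.sin()));
    (pos.0 + neg.0, pos.1 + neg.1)
}

/// Right-hand side: 2a cos mφ - 2b sin mφ, where u = (a, b) and `angle` is mφ.
fn pair_real(u: (f64, f64), angle: f64) -> f64 {
    2.0 * u.0 * angle.cos() - 2.0 * u.1 * angle.sin()
}
```

For any `u` and `angle`, the real part returned by `pair_complex` should match `pair_real` up to rounding, and the imaginary part should be zero up to rounding.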

Additionally, the definition of a_l and b_l in Eqs 122 and 123 guarantees that for all L and ω₀ parameters, a₁ ≥ 2 and b_L ≤ 125. So according to Eq 124, at least U_w(-128) and U_w(0) are zero in every frame.

Using these results, it can be seen that the sum for u_w(n) can be "simplified" to

u_w(n) =

[0 + U_w(-127) exp(j 2π(-127)n/256) + ··· + U_w(-1) exp(j 2π(-1)n/256) + 0 + U_w(1) exp(j 2π(1)n/256) + ··· + U_w(127) exp(j 2π(127)n/256)] / 256 =

2 [Re[U_w(0)] cos(2π(0)n/256) - Im[U_w(0)] sin(2π(0)n/256) + Re[U_w(1)] cos(2π(1)n/256) - Im[U_w(1)] sin(2π(1)n/256) + ··· + Re[U_w(127)] cos(2π(127)n/256) - Im[U_w(127)] sin(2π(127)n/256)] / 256

which requires half as many U_w(m) values and performs no complex arithmetic.
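
A minimal sketch of this real-only sum, alongside the full complex sum it replaces, might look like the following. It assumes the DFT points for m = 0 through 127 are stored as `(re, im)` pairs and that point 0 is zero, as derived above. The names `partial_idft` and `full_idft` are hypothetical and not necessarily how `UnvoicedDft` is organized.

```rust
use std::f64::consts::PI;

/// DFT size used in the sums above.
const N: usize = 256;

/// Real-only partial IDFT at sample index `n`.
///
/// `points[m]` holds (Re[U_w(m)], Im[U_w(m)]) for m = 0..=127, and
/// `points[0]` is assumed to be zero.
fn partial_idft(points: &[(f64, f64); N / 2], n: i32) -> f64 {
    let mut sum = 0.0;
    for (m, &(re, im)) in points.iter().enumerate() {
        let phase = 2.0 * PI * (m as f64) * (n as f64) / N as f64;
        sum += re * phase.cos() - im * phase.sin();
    }
    2.0 * sum / N as f64
}

/// Reference: the full complex sum over m = -127..=127, returned as (Re, Im).
/// U_w(-128) is zero and is skipped; U_w(-m) is the conjugate of U_w(m).
fn full_idft(points: &[(f64, f64); N / 2], n: i32) -> (f64, f64) {
    let (mut re_sum, mut im_sum) = (0.0, 0.0);
    for m in -127i32..=127 {
        let (a, b) = points[m.unsigned_abs() as usize];
        let b = if m < 0 { -b } else { b };
        let phase = 2.0 * PI * (m as f64) * (n as f64) / N as f64;
        // (a + j b)(cos + j sin), accumulated component-wise.
        re_sum += a * phase.cos() - b * phase.sin();
        im_sum += a * phase.sin() + b * phase.cos();
    }
    (re_sum / N as f64, im_sum / N as f64)
}
```

Given the same points, the two functions should agree on the real part up to rounding, and the imaginary part of the full sum should vanish. The partial form evaluates 128 terms per sample instead of 255.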

Structs

Unvoiced	Synthesizes unvoiced spectrum signal s_uv(n).
UnvoicedDft	Constructs unvoiced DFT/IDFT.