Module imbe::unvoiced
Unvoiced spectrum synthesis.
At a high level, the unvoiced signal is generated with these steps:
- Construct the frequency-domain representation of a white noise signal using the Discrete Fourier Transform (DFT).
- Bandpass the spectrum to contain only the frequencies marked as unvoiced in the current frame.
- Perform an Inverse Discrete Fourier Transform (IDFT) on this spectrum to produce a white noise signal containing only the unvoiced frequency content.
Rather than performing both DFT and IDFT operations, which are relatively expensive and were found to be a bottleneck, this implementation computes an equivalent result using only a partial IDFT.
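For reference, the three steps above can be sketched directly. This is a naive O(N²) illustration of the approach the module avoids, not code from this crate: the function names, the `(re, im)` tuple representation, and the per-bin `unvoiced` mask are all assumptions made for the example, and windowing and spectral scaling are omitted.

```rust
use std::f64::consts::PI;

/// Naive O(N^2) DFT of a real signal; returns (re, im) for bins k = 0..N-1.
fn naive_dft(x: &[f64]) -> Vec<(f64, f64)> {
    let n = x.len();
    (0..n)
        .map(|k| {
            let mut acc = (0.0f64, 0.0f64);
            for (i, &v) in x.iter().enumerate() {
                let ang = -2.0 * PI * (k * i) as f64 / n as f64;
                acc.0 += v * ang.cos();
                acc.1 += v * ang.sin();
            }
            acc
        })
        .collect()
}

/// Naive O(N^2) inverse DFT; returns (re, im) for each output sample.
fn naive_idft(spec: &[(f64, f64)]) -> Vec<(f64, f64)> {
    let n = spec.len();
    (0..n)
        .map(|i| {
            let mut acc = (0.0f64, 0.0f64);
            for (k, &(re, im)) in spec.iter().enumerate() {
                let (s, c) = (2.0 * PI * (k * i) as f64 / n as f64).sin_cos();
                acc.0 += re * c - im * s;
                acc.1 += re * s + im * c;
            }
            (acc.0 / n as f64, acc.1 / n as f64)
        })
        .collect()
}

/// Zeroes every bin whose mask entry is `false`, keeping only unvoiced bins.
fn keep_unvoiced(spec: &mut [(f64, f64)], unvoiced: &[bool]) {
    for (bin, &keep) in spec.iter_mut().zip(unvoiced) {
        if !keep {
            *bin = (0.0, 0.0);
        }
    }
}

/// DFT, then bandpass, then IDFT. The real part of the result is returned;
/// the imaginary part vanishes when the mask is symmetric around DC.
fn naive_unvoiced(noise: &[f64], unvoiced: &[bool]) -> Vec<f64> {
    let mut spec = naive_dft(noise);
    keep_unvoiced(&mut spec, unvoiced);
    naive_idft(&spec).into_iter().map(|(re, _)| re).collect()
}
```

With every bin marked unvoiced the round trip reproduces the input noise, and with no bins marked it produces silence, which makes the sketch easy to sanity-check.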
DFT of Noise
Under certain circumstances the DFT of a noise signal can have its points sampled from a probability distribution, rather than each computed with an O(N) procedure.
According to the standard [p58], the DFT is generated from a windowed white noise signal u(n). The standard specifies that the signal can have arbitrary mean but doesn't specify if there are any constraints on the variance or if the signal must be real or complex (although the given example u(n) is real with sample mean μx ≈ 26562 and variance σx² ≈ 235198690).
This implementation assumes the white noise signal is real with samples drawn from a Gaussian distribution having mean μx = 0 and variance σx² = 1, i.e., u(n) ~ N(0, 1). The resulting DFT of this signal then has points with real and imaginary parts that are sampled from a Gaussian distribution with mean μ = 0 and variance σ² = Ew σx² / 2 = Ew / 2, i.e., Re[Uw(m)], Im[Uw(m)] ~ N(0, Ew / 2), where Ew is the energy of the speech synthesis window ws(n).
Note that the given source only derives this result for a complex signal with equal real and imaginary sample variances, but empirical evaluations show that the result is the same with a real signal.
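The kind of empirical evaluation mentioned above can be sketched as follows. This is an illustrative experiment, not the crate's code: the `XorShift` generator, the Box-Muller sampler, the Hann window, and all function names are assumptions chosen to keep the example free of dependencies, and the window is not the standard's ws(n).

```rust
use std::f64::consts::PI;

/// Tiny deterministic xorshift64 generator (hypothetical helper; seed must be nonzero).
struct XorShift(u64);

impl XorShift {
    /// Uniform sample strictly inside (0, 1), built from the top 52 bits.
    fn uniform(&mut self) -> f64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        ((self.0 >> 12) as f64 + 0.5) / (1u64 << 52) as f64
    }

    /// Standard normal sample via the Box-Muller transform.
    fn gaussian(&mut self) -> f64 {
        let (u1, u2) = (self.uniform(), self.uniform());
        (-2.0 * u1.ln()).sqrt() * (2.0 * PI * u2).cos()
    }
}

/// Periodic Hann window, used here only as a stand-in for ws(n).
fn hann_window(len: usize) -> Vec<f64> {
    (0..len)
        .map(|i| 0.5 - 0.5 * (2.0 * PI * i as f64 / len as f64).cos())
        .collect()
}

/// Window energy Ew: the sum of the squared window samples.
fn window_energy(w: &[f64]) -> f64 {
    w.iter().map(|v| v * v).sum()
}

/// Real and imaginary parts of DFT bin `m` of the windowed signal w(n) u(n).
fn windowed_dft_bin(w: &[f64], u: &[f64], m: usize) -> (f64, f64) {
    let len = w.len() as f64;
    let mut acc = (0.0f64, 0.0f64);
    for (i, (&wi, &ui)) in w.iter().zip(u).enumerate() {
        let ang = -2.0 * PI * (m * i) as f64 / len;
        acc.0 += wi * ui * ang.cos();
        acc.1 += wi * ui * ang.sin();
    }
    acc
}

/// Estimates Var(Re) and Var(Im) of bin `m` over `trials` frames of N(0, 1)
/// noise (taking the mean as zero) and returns them alongside Ew / 2.
fn estimate_bin_variance(w: &[f64], m: usize, trials: usize, seed: u64) -> (f64, f64, f64) {
    let mut rng = XorShift(seed);
    let (mut sum_re2, mut sum_im2) = (0.0f64, 0.0f64);
    for _ in 0..trials {
        let u: Vec<f64> = (0..w.len()).map(|_| rng.gaussian()).collect();
        let (re, im) = windowed_dft_bin(w, &u, m);
        sum_re2 += re * re;
        sum_im2 += im * im;
    }
    (
        sum_re2 / trials as f64,
        sum_im2 / trials as f64,
        window_energy(w) / 2.0,
    )
}
```

Calling `estimate_bin_variance(&hann_window(64), 5, 4000, seed)` with any nonzero seed should return two variance estimates close to the third value, Ew / 2. Bins at or near DC and the Nyquist bin are not expected to follow this rule for a real signal, so the check is only meaningful for interior bins.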
DFT Symmetry
According to Eq 125, the IDFT is defined as
uw(n) = [Uw(-128) exp(j 2π(-128)n/256) + Uw(-127) exp(j 2π(-127)n/256) + ··· + Uw(126) exp(j 2π(126)n/256) + Uw(127) exp(j 2π(127)n/256)] / 256
Since u(n) is a real signal, Uw(-m) = Uw(m)* (i.e., the complex conjugate), so the DFT magnitude is symmetric around DC. Writing Uw(m) = a + j b and φ = 2πn/256, for all m, 0 ≤ m ≤ 127,
Uw(m) exp(j mφ) + Uw(-m) exp(-j mφ) =
(a + j b)(cos mφ + j sin mφ) + (a - j b)(cos mφ - j sin mφ) =
2a cos mφ - 2b sin mφ = 2 Re[Uw(m)] cos mφ - 2 Im[Uw(m)] sin mφ
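The identity above can be checked numerically with a small sketch. The function names are hypothetical, `a` and `b` stand for the real and imaginary parts of one DFT point, and `m_phi` stands for the product mφ.

```rust
/// Left-hand side: (a + jb) e^{j m phi} + (a - jb) e^{-j m phi}, computed with
/// explicit complex arithmetic. Returns the (re, im) parts of the sum.
fn conjugate_pair_complex(a: f64, b: f64, m_phi: f64) -> (f64, f64) {
    let (s, c) = m_phi.sin_cos();
    // (a + jb)(c + js) = (ac - bs) + j(as + bc)
    let pos = (a * c - b * s, a * s + b * c);
    // (a - jb)(c - js) = (ac - bs) - j(as + bc)
    let neg = (a * c - b * s, -(a * s + b * c));
    (pos.0 + neg.0, pos.1 + neg.1)
}

/// Right-hand side: 2a cos(m phi) - 2b sin(m phi), which is purely real.
fn conjugate_pair_real(a: f64, b: f64, m_phi: f64) -> f64 {
    2.0 * a * m_phi.cos() - 2.0 * b * m_phi.sin()
}
```

For any inputs, the imaginary part of the first function cancels and its real part matches the second function, so each conjugate pair costs two real multiplications once the sine and cosine are known.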
Additionally, the definition of al and bl in Eqs 122 and 123 guarantees that for all L and ω0 parameters, a1 ≥ 2 and bL ≤ 125. So according to Eq 124, every frame has at least Uw(-128) = Uw(0) = 0.
Using these results, it can be seen that the sum for uw(n) can be "simplified" to
uw(n) =
[0 + Uw(-127) exp(j 2π(-127)n/256) + ··· + Uw(-1) exp(j 2π(-1)n/256) + 0 + Uw(1) exp(j 2π(1)n/256) + ··· + Uw(127) exp(j 2π(127)n/256)] / 256 =
2 [Re[Uw(0)] cos(2π(0)n/256) - Im[Uw(0)] sin(2π(0)n/256) + Re[Uw(1)] cos(2π(1)n/256) - Im[Uw(1)] sin(2π(1)n/256) + ··· + Re[Uw(127)] cos(2π(127)n/256) - Im[Uw(127)] sin(2π(127)n/256)] / 256
which requires half as many Uw(m) values and performs no complex arithmetic.
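A minimal sketch of this simplification, compared against the full complex sum, might look like the following. It is not the crate's implementation: the function names and the `(re, im)` slice layout (index m holds Uw(m) for 0 ≤ m ≤ 127) are assumptions, and the sketch relies on Uw(0) = Uw(-128) = 0 as stated above.

```rust
use std::f64::consts::PI;

const N: usize = 256;

/// Real-only synthesis from bins m = 0..=127. Negative bins are implied by
/// conjugate symmetry; Uw(0) must be zero or it would be counted twice.
fn idft_real_half(half: &[(f64, f64)]) -> Vec<f64> {
    assert_eq!(half.len(), N / 2);
    assert!(half[0] == (0.0, 0.0), "this sketch assumes Uw(0) = 0");
    (0..N)
        .map(|n| {
            let mut acc = 0.0f64;
            for (m, &(re, im)) in half.iter().enumerate() {
                let (s, c) = (2.0 * PI * (m * n) as f64 / N as f64).sin_cos();
                acc += re * c - im * s;
            }
            2.0 * acc / N as f64
        })
        .collect()
}

/// Reference: the full sum over m = -128..=127 using complex arithmetic.
/// Returns (re, im) for each sample so the imaginary residue can be inspected.
fn idft_full(half: &[(f64, f64)]) -> Vec<(f64, f64)> {
    assert_eq!(half.len(), N / 2);
    (0..N)
        .map(|n| {
            let mut acc = (0.0f64, 0.0f64);
            for m in -128i32..=127 {
                // Uw(-128) = 0; other negative bins are conjugates of positive ones.
                let (re, im) = match m {
                    -128 => (0.0, 0.0),
                    m if m < 0 => {
                        let (r, i) = half[(-m) as usize];
                        (r, -i)
                    }
                    m => half[m as usize],
                };
                let (s, c) = (2.0 * PI * m as f64 * n as f64 / N as f64).sin_cos();
                acc.0 += re * c - im * s;
                acc.1 += re * s + im * c;
            }
            (acc.0 / N as f64, acc.1 / N as f64)
        })
        .collect()
}
```

Both functions are O(N²) here for clarity. The point of the comparison is only that the real-valued half sum reproduces the real part of the full sum while the imaginary part of the full sum cancels.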
Structs
Unvoiced | Synthesizes unvoiced spectrum signal suv(n).
UnvoicedDft | Constructs unvoiced DFT/IDFT.