The I-Vector Algorithm

The derivation here is based on leon晋: https://www.zhihu.com/question/63978977

Variable and dimension declarations
The training set contains $S$ utterances, indexed $s \in \{1,2,\dots,S\}$.

Utterance $s$ has $H(s)$ frames, and each frame is a $D$-dimensional MFCC vector, so the acoustic feature matrix of one utterance is $X(s)=[x_1,x_2,\dots,x_i,\dots,x_{H(s)}]$.

The MFCC feature matrix has shape $(n, D)$, i.e. $n$ frames of $D$-dimensional features.

The parameters of the $K$ Gaussian components are $\lambda = \{\pi_k,\mu_k,\Sigma_k\}_{k=1}^K$.

The Gaussian mean supervector (a column vector) is $m=[\mu_1;\mu_2;\dots;\mu_C]$, of size $(K\cdot D,\,1)$.

Putting these together, we arrive at the key i-vector equation:
$$
M(s) = m + T\,w(s)
$$

- $M(s)$ is the mean supervector obtained by MAP-adapting the UBM to a speaker; it used to be used directly for recognition, and here it is likewise stacked into a supervector.
- $m$ is the UBM mean supervector.
- $T$ is the total-variability (projection) space; I think of it as the mapping or transformation rule.
- $w(s)$ is the i-vector.
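As a quick sanity check on the sizes involved, here is a minimal Matlab sketch of $M(s) = m + T\,w(s)$; the values $K = 512$, $D = 39$ and $tv\_dim = 100$ are only illustrative assumptions, not values from the text.

```matlab
% Minimal dimension sketch of M(s) = m + T*w(s) (illustrative sizes only).
K = 512; D = 39; tv_dim = 100;
m = randn(K * D, 1);        % UBM mean supervector, size (K*D, 1)
T = randn(K * D, tv_dim);   % total-variability matrix, size (K*D, tv_dim)
w = randn(tv_dim, 1);       % i-vector of one utterance, size (tv_dim, 1)
M = m + T * w;              % speaker-adapted mean supervector, size (K*D, 1)
```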
The three Baum-Welch sufficient statistics

These Baum-Welch (B-W) statistics are derived from the GMM, using the posterior probability of each mixture component. They are computed here so that the formulas that follow can be written compactly.
$$
\begin{aligned}
N_{k}(s)&=\sum_{i=1}^{H(s)} P_{\lambda}\left(k=z_i \mid x_{i}\right)\\
F_{k}(s)&=\sum_{i=1}^{H(s)} P_{\lambda}\left(k=z_i \mid x_{i}\right)\left(x_{i}-\mu_{k}\right)\\
S_{k}(s)&=\sum_{i=1}^{H(s)} P_{\lambda}\left(k=z_i \mid x_{i}\right)\left(x_{i}-\mu_{k}\right)\left(x_{i}-\mu_{k}\right)^{T}
\end{aligned}
$$
Note: these differ from the EM update equations, mainly in $F_k(s)$, which is mean-removed here. Strictly speaking, $F_k(s)$ and $S_k(s)$ should also be divided by the soft count of component $k$, i.e. $N_k(s)$. To vectorize the later formulas, $N_k(s)$ and $F(s)$ are expanded into matrix form:
$$
N(s)=\begin{bmatrix}
N_{1}(s)\, I & & 0 \\
& \ddots & \\
0 & & N_{C}(s)\, I
\end{bmatrix}
$$

$$
F(s)=\begin{bmatrix}
F_{1}(s) \\
\vdots \\
F_{C}(s)
\end{bmatrix}
$$
```matlab
[N, F] = expectation(data, ubm);

function [N, F] = expectation(data, gmm)
% compute the zeroth- and first-order sufficient statistics
post = postprob(data, gmm.mu, gmm.sigma, gmm.w(:));
N = sum(post, 2);
F = data * post';

function [post, llk] = postprob(data, mu, sigma, w)
% posterior probability of each mixture component for every frame
post = lgmmprob(data, mu, sigma, w);
llk  = logsumexp(post, 1);
post = exp(bsxfun(@minus, post, llk));

function logprob = lgmmprob(data, mu, sigma, w)
% log probability of the observations under each Gaussian component
ndim = size(data, 1);
C = sum(mu.*mu./sigma) + sum(log(sigma));
D = (1./sigma)' * (data .* data) - 2 * (mu./sigma)' * data + ndim * log(2 * pi);
logprob = -0.5 * (bsxfun(@plus, C', D));
logprob = bsxfun(@plus, logprob, log(w));

function y = logsumexp(x, dim)
% numerically stable log(sum(exp(x), dim))
xmax = max(x, [], dim);
y    = xmax + log(sum(exp(bsxfun(@minus, x, xmax)), dim));
ind  = find(~isfinite(xmax));
if ~isempty(ind)
    y(ind) = xmax(ind);
end

% stack F into a supervector and center it with the UBM means
F = reshape(F, ndim * nmix, 1);
m = reshape(ubm.mu, ndim * nmix, 1);
F = F - N(idx_sv) .* m;
```
EM algorithm iterations

The EM objective function:

$$
Q\left(T \mid T^{(t)}\right)=\sum_{s=1}^{S} E\left(\widehat{w}^{(t)}(s)^{T} T^{T} \Sigma^{-1} F(s)\right)-\frac{1}{2} \sum_{s=1}^{S} E\left(\widehat{w}^{(t)}(s)^{T} T^{T} \Sigma^{-1} N(s)\, T\, \widehat{w}^{(t)}(s)\right)
$$
E-Step:
The essence of EM is to estimate the latent variable; the core quantity is its posterior distribution:

$$
y(s)\mid x \;\sim\; N\!\left(l^{-1}(s)\,T^T\Sigma^{-1}F(s),\; l^{-1}(s)\right)
$$

where the precision matrix $l(s)$ (whose inverse $l^{-1}(s)$ is the posterior covariance) is:

$$
\sigma(s) \equiv l(s) = I + T^T\Sigma^{-1}N(s)\,T
$$
With this posterior distribution we can write the new "B-W"-style expectation formulas:

$$
\begin{aligned}
E\left(\widehat{w}^{(t)}(s)^{T}\right) &= \sigma^{(t)}(s)^{-1}\, T^{T} \Sigma^{-1} F(s) \\
E\left(\widehat{w}^{(t)}(s)^{T} \cdot \widehat{w}^{(t)}(s)\right) &= \sigma^{(t)}(s)^{-1} + E\left(\widehat{w}^{(t)}(s)^{T}\right) \cdot E\left(\widehat{w}^{(t)}(s)^{T}\right)^{T}
\end{aligned}
$$

Note 1: the first formula is simply the posterior mean.

Note 2: how is the second expectation obtained? From $E[X\cdot Y] = \operatorname{cov}(X,Y)+E[X]\,E[Y]$.
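To make the E-step concrete, here is a per-utterance sketch that mirrors the batched expectation_tv code shown later. It assumes the following (hypothetical) variables are already defined: K, D and tv_dim (number of mixtures, feature dimension and i-vector dimension), T stored as (tv_dim, K*D) as in the code, Sigma as the (K*D, 1) UBM covariance column, Ns as the (K, 1) zeroth-order statistic and Fs as the (K*D, 1) centered first-order statistic of one utterance.

```matlab
% Sketch: posterior mean and second moment of w(s) for a single utterance.
T_invS = bsxfun(@rdivide, T, Sigma');                 % T^T * Sigma^{-1}
idx_sv = reshape(repmat(1:K, D, 1), K*D, 1);          % expand Ns to supervector length
Lmat   = eye(tv_dim) + bsxfun(@times, T_invS, Ns(idx_sv)') * T';  % precision sigma(s)
Cxx    = pinv(Lmat);                                  % posterior covariance sigma(s)^{-1}
Ex     = Cxx * (T_invS * Fs);                         % posterior mean   E[w(s)]
Exx    = Cxx + Ex * Ex';                              % second moment    E[w(s) w(s)^T]
```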
M-Step:
$$
T^{(t+1)}=\underset{T}{\operatorname{argmax}}\; Q(T \mid T^{(t)})
$$

Setting the derivative of $Q$ to zero gives:

$$
\sum_{s=1}^S \Sigma^{-1}F(s)\,E\!\left(w^{(t)}(s)^T\right) = \sum_{s=1}^S \Sigma^{-1}N(s)\,T^{(t+1)}\,E\!\left(w^{(t)}(s)^T \cdot w^{(t)}(s)\right)
$$

This equation cannot be implemented directly, because the $T$ matrix is sandwiched between two other matrices and cannot simply be isolated by inversion; the update therefore has to be carried out per mixture component.

Restricting the equation to component $c$ in the same way gives:
$$
\begin{aligned}
\sum_{s=1}^S F_c(s)\,E\!\left(w^{(t)}(s)^T\right) &= T^{(t+1)}\sum_{s=1}^S N_c(s)\,E\!\left(w^{(t)}(s)^T \cdot w^{(t)}(s)\right) \\
T^{(t+1)} &= \left(\sum_{s=1}^S F_c(s)\,E\!\left(w^{(t)}(s)^T\right)\right)\left(\sum_{s=1}^S N_c(s)\,E\!\left(w^{(t)}(s)^T \cdot w^{(t)}(s)\right)\right)^{-1}
\end{aligned}
$$

This equation differs slightly from the one implemented in the Matlab code, where the size of the $T$ matrix is different.
Algorithm flow

1. Randomly initialize $T$.
2. Using $T^{(t)}$, compute $E\left(\widehat{w}^{(t)}(s)^{T}\right)$ and $E\left(\widehat{w}^{(t)}(s)^{T} \cdot \widehat{w}^{(t)}(s)\right)$.
3. Re-estimate $T^{(t+1)}$.
4. Repeat steps 2 and 3; roughly 5 to 10 iterations are enough.
5. Compute the i-vector, i.e. $E\left(\widehat{w}^{(t)}(s)^{T}\right)$:

$$
w = \sigma^{(t)}(s)^{-1}\, T^{T} \Sigma^{-1} F(s)
$$
Matlab implementation

Data preparation: stats(:)

```matlab
stats = cell(nTra_Speakers, nTra_Channels);
for s = 1 : nTra_Speakers
    for c = 1 : nTra_Channels
        [N, F] = compute_bw_stats(trainSpeakerData{s, c}, ubm);
        stats{s, c} = [N; F];
    end
end
```

The sufficient statistics are stored as two stacked column vectors: $N$ is $(nmix, 1)$ and $F$ is the supervector of size $(nmix \cdot ndim,\, 1)$.
Main function: train_tv_space

```matlab
tvDim = 100;
niter = 5;
T = train_tv_space(stats(:), ubm, tvDim, niter, nworkers);
```

Function

Explanation:

```matlab
function T = train_tv_space(dataList, ubmFilename, tv_dim, niter, nworkers, tvFilename)
[ndim, nmix] = size(ubm.mu);
S = reshape(ubm.sigma, ndim * nmix, 1);
[N, F] = load_data(dataList, ndim, nmix);
T = randn(tv_dim, ndim * nmix) * sum(S) * 0.001;   % random initialization of T
fprintf('Re-estimating the total subspace with %d factors ...\n', tv_dim);
for iter = 1 : niter,
    fprintf('EM iter#: %d \t', iter);
    tim = tic;
    [LU, RU] = expectation_tv(T, N, F, S, tv_dim, nmix, ndim, nworkers);  % E-step
    T = maximization_tv(LU, RU, ndim, nmix);                              % M-step
    tim = toc(tim);
    fprintf('[elaps = %.2f s]\n', tim);
end
```
The dataList must be parsed and split back into the corresponding $N$ and $F$. Each entry is a stats{speaker, channel} cell; note that stats(:) linearizes the cell array column by column (Matlab's default ordering).

```matlab
nfiles = length(datalist);
N = zeros(nfiles, nmix, 'single');
F = zeros(nfiles, ndim * nmix, 'single');
for file = 1 : nfiles,
    N(file, :) = datalist{file}(1 : nmix);
    F(file, :) = datalist{file}(nmix + 1 : end);
end
```
E-Step ⭐️

$$
\sigma(s) = l(s) = I + T^T\Sigma^{-1}N(s)\,T
$$

Here $\Sigma$ is a $KD\times KD$ block-diagonal matrix whose diagonal blocks are $\Sigma_1,\dots,\Sigma_C$; $N(s)$ is likewise a $KD\times KD$ block-diagonal matrix with diagonal blocks $N_1 I,\dots,N_C I$.

Dimension check: $\sigma(s)$ is the precision (inverse posterior covariance) of the latent variable, so

$$
(tv,tv)=(tv,tv)+(KD,tv)'\,(KD,KD)\,(KD,KD)\,(KD,tv)
$$

In Matlab the equation becomes:

$$
\sigma(s) = I + (T./\Sigma).*N*T'
$$
where $\Sigma =\begin{bmatrix} \Sigma_1 & & \\ & \ddots & \\ & & \Sigma_C \end{bmatrix}$ is stored in simplified form as a column, $\Sigma =\begin{bmatrix} \Sigma_1 \\ \vdots \\ \Sigma_C \end{bmatrix}$,

and $N =\begin{bmatrix} N_1 I & & \\ & \ddots & \\ & & N_C I \end{bmatrix}$ is stored as a column, $N_2 =\begin{bmatrix} N_1 I \\ \vdots \\ N_C I \end{bmatrix}$; to save further memory this can be reduced to $N_1 =\begin{bmatrix} N_1 \\ \vdots \\ N_C \end{bmatrix}$.

```matlab
idx_sv = reshape(repmat(1 : nmix, ndim, 1), ndim * nmix, 1);   % expand per-mixture counts to supervector length
N2 = N1(ix, idx_sv);
```
Simplifying the matrices for the implementation:

$$
\begin{aligned}
T\_invS = T^T\Sigma^{-1} &=
\begin{bmatrix}
a_{11}& \cdots& a_{1,KD}\\
\vdots& \ddots& \vdots\\
a_{tv,1}& \cdots& a_{tv,KD}
\end{bmatrix}_{(tv,KD)}
\cdot
\begin{bmatrix}
\Sigma_1^{-1}& & \\
& \ddots& \\
& & \Sigma_C^{-1}
\end{bmatrix}_{(KD,KD)} \\
&=
\begin{bmatrix}
a_{11}/\Sigma_1& \cdots& a_{1,KD}/\Sigma_C\\
\vdots& \ddots& \vdots\\
a_{tv,1}/\Sigma_1& \cdots& a_{tv,KD}/\Sigma_C
\end{bmatrix} \\
&= T \;./\; repmat\!\left( \begin{bmatrix} \Sigma_1\\ \vdots\\ \Sigma_C \end{bmatrix}^{T},\,(tv,1) \right) \\
&= T./\Sigma \qquad\text{(of dimension } (tv,KD)\text{)}
\end{aligned}
$$

Multiplying $T^T$ on the right by the diagonal covariance matrix therefore just scales (here, divides) each of its columns by the corresponding variance, which in Matlab is a broadcast division:

```matlab
T_invS = bsxfun(@rdivide, T, S');
```
Similarly:

$$
T\_invS_{(tv,KD)} \cdot
\begin{bmatrix}
N_1 I& & \\
& \ddots& \\
& & N_C I
\end{bmatrix}_{(KD,KD)}
= T\_invS \;.*\; repmat\!\left( \begin{bmatrix} N_1 I\\ \vdots\\ N_C I \end{bmatrix}_{(KD,1)}^{T},\,(tv,1) \right)
$$

```matlab
bsxfun(@times, T_invS, N1(ix, idx_sv))
```
Putting this together, the Matlab implementation of the first formula,

$$
\sigma(s) = I + (T./\Sigma).*N*T'
$$

is:

```matlab
I = eye(tv_dim);
L = I + bsxfun(@times, T_invS, N1(ix, idx_sv)) * T';
```
The expectation formulas:

$$
\begin{aligned}
E\left(\widehat{w}^{(t)}(s)^{T}\right) &= \sigma^{(t)}(s)^{-1}\, T^{T} \Sigma^{-1} F(s) \\
E\left(\widehat{w}^{(t)}(s)^{T} \cdot \widehat{w}^{(t)}(s)\right) &= \sigma^{(t)}(s)^{-1} + E\left(\widehat{w}^{(t)}(s)^{T}\right) \cdot E\left(\widehat{w}^{(t)}(s)^{T}\right)^{T}
\end{aligned}
$$

To accumulate the statistics for every Gaussian component, we loop over all utterances and sum up all the Ex and Exx terms.

Without batching, the core loop is:

```matlab
parfor (ix = 1 : size(F, 1), nworkers)
    L = I + bsxfun(@times, T_invS, N1(ix, idx_sv)) * T';
    Cxx = pinv(L);                          % posterior covariance
    B = T_invS * F1(ix, :)';
    Ex(:, ix) = Cxx * B;                    % posterior mean E[w(s)]
    Exx(:, :, ix) = Cxx + Ex(:, ix) * Ex(:, ix)';
end
RU = RU + Ex * F1;
parfor (mix = 1 : nmix, nworkers)
    tmp = bsxfun(@times, Exx, reshape(N1(:, mix), [1 1 len]));
    LU{mix} = LU{mix} + sum(tmp, 3);
end
```
⭐️ The accumulation over utterances in Matlab is carried out via matrix multiplication!

- $Ex$ ($w$): $(tv, nfiles)$
- $Exx$ ($w^Tw$): $(tv, tv, nfiles)$
- $F$: $(KD, nfiles)$

⭐️ The accumulation $\sum_{s=1}^S$ is implemented as $Ex * F'$ and as sum(Exx, 3).

The complete code:

```matlab
function [LU, RU] = expectation_tv(T, N, F, S, tv_dim, nmix, ndim, nworkers)
% E-step: accumulate the posterior statistics of the factors
idx_sv = reshape(repmat(1 : nmix, ndim, 1), ndim * nmix, 1);
nfiles = size(F, 1);
LU = cell(nmix, 1);
LU(:) = {zeros(tv_dim)};
RU = zeros(tv_dim, nmix * ndim);
I = eye(tv_dim);
T_invS = bsxfun(@rdivide, T, S');
parts = 250;   % process the utterances in batches to limit memory use
nbatch = floor(nfiles/parts + 0.99999);
for batch = 1 : nbatch,
    start = 1 + (batch - 1) * parts;
    fin = min(batch * parts, nfiles);
    len = fin - start + 1;
    index = start : fin;
    N1 = N(index, :);
    F1 = F(index, :);
    Ex = zeros(tv_dim, len);
    Exx = zeros(tv_dim, tv_dim, len);
    parfor (ix = 1 : len, nworkers)
        L = I + bsxfun(@times, T_invS, N1(ix, idx_sv)) * T';
        Cxx = pinv(L);
        B = T_invS * F1(ix, :)';
        Ex(:, ix) = Cxx * B;
        Exx(:, :, ix) = Cxx + Ex(:, ix) * Ex(:, ix)';
    end
    RU = RU + Ex * F1;
    parfor (mix = 1 : nmix, nworkers)
        tmp = bsxfun(@times, Exx, reshape(N1(:, mix), [1 1 len]));
        LU{mix} = LU{mix} + sum(tmp, 3);
    end
end
```
M-Step

The update formula for the $T$ matrix:

$$
T^{(t+1)} = \left(\sum_{s=1}^S F_c(s)\,E\!\left(w^{(t)}(s)^T\right)\right)\left(\sum_{s=1}^S N_c(s)\,E\!\left(w^{(t)}(s)^T \cdot w^{(t)}(s)\right)\right)^{-1}
$$

- In the code, $RU$ is $\sum_{s=1}^S F(s)E(w^{(t)}(s)^T)$ and can be computed directly; its size is $(tv, KD)$. In the equation above, $F$ carries the subscript $c$, so when updating $T$ we only take the $D$ columns belonging to the current Gaussian component, of size $(tv, D)$.
- In the code, $LU$ is $\sum_{s=1}^S N_c(s)E(w^{(t)}(s)^T \cdot w^{(t)}(s))$, which can only be accumulated per component, so it is an $(nmix, 1)$ cell array whose entries are $(tv, tv)$ matrices.
- The $T$ matrix is updated block by block as $T = LU \backslash RU$, with dimensions $(tv,D) = (tv,tv) * (tv,D)$.

```matlab
function RU = maximization_tv(LU, RU, ndim, nmix)
% M-step: update the total-variability matrix one mixture block at a time
for mix = 1 : nmix
    idx = (mix - 1) * ndim + 1 : mix * ndim;
    RU(:, idx) = LU{mix} \ RU(:, idx);
end
```
The content above is based on: https://blog.csdn.net/JackyTintin/article/details/79803501
Understanding the model: the probabilistic data-generation view

PLDA is the probabilistic version of LDA. Both decompose the variability of the data into within-class and between-class variability, but PLDA models them from a probabilistic point of view.

Definition: a latent variable $y$ determines how each Gaussian is generated; the variability among these means can be viewed as the between-class covariance $\Phi_b$, while the variability of the data within each Gaussian is the within-class covariance $\Phi_w$. This gives the probabilistic model:

$$
\begin{aligned}
p(\mathbf{x} \mid \mathbf{y}) &= \mathcal{N}\left(\mathbf{x} \mid \mathbf{y}, \Phi_{w}\right) \\
p(\mathbf{y}) &= \mathcal{N}\left(\mathbf{y} \mid \mathbf{m}, \Phi_{b}\right)
\end{aligned}
$$

LDA assumes the class centres follow a discrete distribution whose number of centres is fixed and equal to the number of classes in the training data; PLDA assumes the class centres follow a continuous (Gaussian) distribution. PLDA can therefore generalize to unseen classes and be used for open-set recognition and verification.

Here the covariance matrix $\Phi_w$ is required to be symmetric positive definite and captures the within-class variability; $\Phi_b$ is symmetric positive semi-definite and captures the between-class variability. PLDA thus models the data-generation process and, like LDA, explicitly accounts for the within-class and between-class variance.

To simplify the derivation, the within-class covariance is normalized to the identity: there exists a matrix $V$ that simultaneously diagonalizes $\Phi_w$ and $\Phi_b$ by congruence, i.e. an invertible $V$ with $V^T\Phi_bV = \Psi$ (diagonal) and $V^T\Phi_wV = I$.
Equivalent formulation of PLDA:

$$
x = m + A u
$$

where

$$
\begin{aligned}
u &\sim N(\cdot \mid v, I) \\
v &\sim N(\cdot \mid 0, \Psi) \\
A &= (V^T)^{-1}
\end{aligned}
$$
$u$ is the projection of the data point into the latent (projected) space. $\Psi$ captures the between-class variability, and $I$ the within-class variability (normalized here to the identity).
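A minimal generative sketch of this equivalent formulation may help; the sizes and the sampled parameters below are purely illustrative assumptions.

```matlab
% Sketch: generate a few same-class points from x = m + A*u, u ~ N(v, I), v ~ N(0, Psi).
d   = 3;
Psi = diag([5 2 0.5]);                    % between-class variances (diagonal, assumed)
A   = randn(d);                           % A = (V^T)^{-1}, back-projection to data space
m   = randn(d, 1);                        % global mean
v   = sqrt(diag(Psi)) .* randn(d, 1);     % class centre      v ~ N(0, Psi)
u   = bsxfun(@plus, v, randn(d, 4));      % four samples      u ~ N(v, I)
x   = bsxfun(@plus, m, A * u);            % observed data     x = m + A*u
```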
Inference with PLDA

For every observation $\mathbf{x}$ we can compute the corresponding $u=A^{-1}(x-m)$; PLDA inference is carried out in this projected space.

Given a group of same-class observations $\mathbf{u}_{1,\dots,n}$, the posterior distribution of $\mathbf{v}$ is:
$$
p\left(\mathbf{v} \mid \mathbf{u}_{1, \ldots, n}\right)=\mathcal{N}\left(\mathbf{v} \;\Big|\; \frac{n \Psi}{n \Psi+I} \overline{\mathbf{u}},\; \frac{\Psi}{n \Psi+I}\right)
$$
How do we obtain this posterior of the latent variable?

Given $n$ observations $x_{1,\dots,n}$ of one class, the posterior of $y$ is:
$$
p\left(y \mid x_{1, \ldots, n}\right)=p\left(x_{1, \ldots, n} \mid y\right) p(y) / p\left(x_{1, \ldots, n}\right) \propto p\left(x_{1, \ldots, n} \mid y\right) p(y)
$$
The posterior is a product of Gaussians and is therefore Gaussian itself, so we only need its mean vector and covariance matrix to determine it.
$$
\begin{aligned}
\ln p\left(y \mid x_{1, \dots, n}\right) &= \ln p\left(x_{1, \ldots, n} \mid y\right)+\ln p(y) \\
&= \sum_{i} \ln p\left(x_{i} \mid y\right)+\ln p(y)\\
&= C-0.5 \sum_{i} y^{T} \Phi_{w}^{-1} y+\sum_{i} x_{i}^{T} \Phi_{w}^{-1} y-0.5\, y^{T} \Phi_{b}^{-1} y+m^{T} \Phi_{b}^{-1} y \\
&= C - 0.5\, y^T\left(n\Phi_w^{-1}+\Phi_b^{-1}\right)y+\sum_{i} x_{i}^{T} \Phi_{w}^{-1}y+m^T\Phi_b^{-1}y
\end{aligned}
$$
The third equality follows from
$$
\begin{aligned}
p(x \mid y) &\sim N(x \mid y,\Phi_w) \\
p(y) &\sim N(y \mid m,\Phi_b)
\end{aligned}
$$
together with the Gaussian density:
$$
\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \mathbf{\Sigma})=\frac{1}{(2 \pi)^{D / 2}} \frac{1}{|\mathbf{\Sigma}|^{1 / 2}} \exp \left\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{T} \mathbf{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right\}
$$
Expanding and dropping the constant terms, only the coefficients inside the exponent matter:
$$
\begin{aligned}
& \exp\left\{ -0.5\,x^T\Sigma^{-1}x + x^T \Sigma^{-1}\mu \right\} \\
=\;& \exp\left\{ -0.5\,x^T\Sigma^{-1}x + \mu^T(\Sigma^{-1})^T x \right\} \\
=\;& \exp\left\{ -0.5\,x^T\Sigma^{-1}x + \mu^T\Sigma^{-1} x \right\}
\end{aligned}
$$
Matching the linear and quadratic terms of a Gaussian yields the mean and covariance:
$$
\begin{aligned}
\hat{m} &= \hat{\Phi}\left(\Phi_b^{-1}m+\Phi_w^{-1}\sum_i x_i\right) \\
\hat{\Phi} &= \left(n\Phi_w^{-1}+\Phi_b^{-1}\right)^{-1}
\end{aligned}
$$
Hence the posterior distribution is:
$$
p(y \mid x_{1,\dots,n}) \sim N(\hat{m},\hat{\Phi})
$$
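A small sketch of these posterior formulas, assuming $\Phi_w$, $\Phi_b$ and $m$ are known and that X is a hypothetical (d, n) matrix holding the $n$ same-class observations:

```matlab
% Sketch: posterior mean and covariance of y given n observations of one class.
n       = size(X, 2);
Phi_hat = pinv(n * pinv(Phi_w) + pinv(Phi_b));                    % posterior covariance
m_hat   = Phi_hat * (pinv(Phi_b) * m + pinv(Phi_w) * sum(X, 2));  % posterior mean
```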
Decision making with PLDA in speaker verification

For an unknown point $u^p$ and a set of points $\mathbf{u}^g_{1,\dots,n}$ from some class, the likelihood that $u^p$ belongs to that class is obtained with the same approach as above:
$$
\begin{aligned}
p\left(\mathbf{u}^{p} \mid \mathbf{u}_{1, \ldots, n}^{g}\right) &= \mathcal{N}\left(\frac{n \Psi}{n \Psi+I} \overline{\mathbf{u}}^{g},\; \frac{\Psi}{n \Psi+I}+I\right) \\
\ln p\left(\mathbf{u}^{p} \mid \mathbf{u}_{1, \ldots, n}^{g}\right) &= C-\frac{1}{2}\left(\mathbf{u}^{p}-\frac{n \Psi}{n \Psi+I} \overline{\mathbf{u}}^{g}\right)^{T}\left(\frac{\Psi}{n \Psi+I}+I\right)^{-1}\left(\mathbf{u}^{p}-\frac{n \Psi}{n \Psi+I} \overline{\mathbf{u}}^{g}\right)-\frac{1}{2} \ln \left|\frac{\Psi}{n \Psi+I}+I\right|
\end{aligned}
$$
where $C = -\frac{1}{2}d\ln 2\pi$ is a data-independent constant and $d$ is the data dimension. In particular, the probability that $u^p$ belongs to no known class is:
$$
\begin{aligned}
p\left(\mathbf{u}^{p} \mid \phi\right) &= \mathcal{N}\left(\mathbf{u}^{p} \mid 0, \Psi+I\right) \\
\ln p\left(\mathbf{u}^{p} \mid \phi\right) &= C-\frac{1}{2} \mathbf{u}^{p T}(\Psi+I)^{-1} \mathbf{u}^{p}-\frac{1}{2} \ln |\Psi+I|
\end{aligned}
$$
Since $\Psi$ is diagonal, every covariance above is diagonal as well, so the likelihoods and log-likelihoods are easy to evaluate.

Recognition with PLDA then works as follows:
$$
i=\underset{i}{\operatorname{argmax}}\; \ln p\left(\mathbf{u}^{p} \mid \mathbf{u}_{1, \ldots, n}^{g_{i}}\right)
$$
For verification, we compute the likelihood ratio:
$$
R = \frac{p(u^p \mid u^g_{1,\dots,n})}{p(u^p \mid \phi)}
$$
or the log-likelihood ratio:
$$
\ln R = \ln p(u^p \mid u^g_{1,\dots,n}) - \ln p(u^p \mid \phi)
$$
Choosing a suitable threshold $T$, we decide that $\mathbf{u}$ belongs to the same class as the known data when $R>T$, and otherwise that it does not.
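Since $\Psi$ is diagonal in the projected space, the log-likelihood ratio can be evaluated element-wise. Below is a sketch under that assumption; psi_d (the (d,1) diagonal of $\Psi$), Ug (a (d,n) matrix of enrolled same-class projected vectors) and up (the (d,1) projected test vector) are hypothetical variable names.

```matlab
% Sketch: verification log-likelihood ratio in the projected space (diagonal Psi).
n     = size(Ug, 2);
ubar  = mean(Ug, 2);
mu_s  = (n * psi_d ./ (n * psi_d + 1)) .* ubar;            % mean of p(u^p | u^g_{1..n})
var_s = psi_d ./ (n * psi_d + 1) + 1;                      % its (diagonal) variance
ll_s  = -0.5 * sum(log(2*pi*var_s) + (up - mu_s).^2 ./ var_s);
var_d = psi_d + 1;                                         % variance of p(u^p | phi)
ll_d  = -0.5 * sum(log(2*pi*var_d) + up.^2 ./ var_d);
lnR   = ll_s - ll_d;   % accept "same speaker" when lnR exceeds the chosen threshold
```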
Expectation-maximization (EM)

In PLDA, the parameters to estimate are $A$, $\Psi$ and $\mathbf{m}$.

Algorithm flow

Input: $K$ classes of $d$-dimensional data, where class $k$ contains $n_k$ samples; $x_{ki}\in\mathbb{R}^d$, $1\le k\le K$, denotes the $i$-th sample of class $k$.

Output: the $d\times d$ symmetric matrix $\Phi_w$, the $d\times d$ symmetric matrix $\Phi_b$, and the $d$-dimensional vector $m$.

Compute the statistics:
$$
\begin{aligned}
N &=\sum_{k=1}^{K} n_{k}, \\
f_{k} &=\sum_{i=1}^{n_{k}} x_{k i},\\
m &=\frac{1}{N} \sum_{k} f_{k},\\
S &=\sum_{k} \sum_{i} x_{k i} x_{k i}^{T}
\end{aligned}
$$
Randomly initialize $\Phi_w$, $\Phi_b$, $m$.

Repeat the following steps until the stopping criterion is met:

For every class, compute $\hat{\Phi}=\left(n \Phi_{w}^{-1}+\Phi_{b}^{-1}\right)^{-1}$, $y=\hat{\Phi}\left(\Phi_{b}^{-1} m+\Phi_{w}^{-1} f\right)$, $yyt=\hat{\Phi}+y y^{T}$.

Intermediate quantities: $R=\sum_{k} n_{k} \cdot yyt_{k}$, $T=\sum_{k} y_{k} f_{k}^{T}$, $P=\sum_{k} \hat{\Phi}_{k}$, $E=\sum_{k}\left(y_{k}-m\right)\left(y_{k}-m\right)^{T}$.

Update: $\Phi_{w}=\frac{1}{N}\left(S+R-\left(T+T^{T}\right)\right)$, $\Phi_{b}=\frac{1}{K}(P+E)$.

From $\Phi_w$ and $\Phi_b$ we can then compute $\Psi$ and $A^{-1}$.
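A compact sketch of one EM pass over the $K$ classes, assuming X is a hypothetical 1-by-K cell array whose k-th entry is the (d, n_k) data matrix of class k, and that Phi_w, Phi_b and m hold the current estimates:

```matlab
% Sketch: one EM iteration of the two-covariance PLDA model described above.
K = numel(X);  d = size(X{1}, 1);
N = 0;  S = zeros(d);  R = zeros(d);  Tacc = zeros(d);  P = zeros(d);  E = zeros(d);
for k = 1 : K
    nk = size(X{k}, 2);   fk = sum(X{k}, 2);
    N  = N + nk;          S  = S + X{k} * X{k}';
    Phi_hat = pinv(nk * pinv(Phi_w) + pinv(Phi_b));            % per-class posterior covariance
    yk      = Phi_hat * (pinv(Phi_b) * m + pinv(Phi_w) * fk);  % per-class posterior mean
    yytk    = Phi_hat + yk * yk';
    R  = R + nk * yytk;    Tacc = Tacc + yk * fk';
    P  = P + Phi_hat;      E    = E + (yk - m) * (yk - m)';
end
Phi_w = (S + R - (Tacc + Tacc')) / N;   % within-class covariance update
Phi_b = (P + E) / K;                    % between-class covariance update
```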
From the LDA result, the optimal projection directions are the eigenvectors $w_{1,\dots,d}$ of $\Phi_w^{-1}\Phi_b$; stacking each eigenvector as a column gives the matrix $W$. Then:
$$
\begin{aligned}
\Lambda_{b}&=W^{T} \Phi_{b} W\\
\Lambda_{w}&=W^{T} \Phi_{w} W
\end{aligned}
$$
To congruently diagonalize $\Phi_w$ to the identity matrix, we clearly have:
$$
I=\Lambda_{w}^{-1 / 2} \Lambda_{w} \Lambda_{w}^{-1 / 2}=\Lambda_{w}^{-1 / 2} W^{T} \Phi_{w} W \Lambda_{w}^{-1 / 2}
$$
which identifies the transformation matrix $V$:
$$
V = W \Lambda_{w}^{-1 / 2}
$$
$$
\begin{aligned}
\Psi &= V^T\Phi_bV=(W \Lambda_{w}^{-1 / 2})^T\Phi_b W \Lambda_{w}^{-1 / 2}\\
&= (\Lambda_{w}^{-1 / 2})^T\Lambda_b \Lambda_{w}^{-1 / 2} \\
&= \Lambda_b \Lambda_w^{-1} \qquad\text{(all matrices here are diagonal, so the factors commute)}
\end{aligned}
$$
As for how to obtain $A$, recall the equivalent PLDA formulation:
$$
x = m + A u
$$
$u$ lives in the projected space and $x$ in the original data space, so $A$ is the transformation connecting the two spaces; from the definition above,

$$
A = (V^T)^{-1}
$$
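A short sketch of recovering $V$, $\Psi$ and $A$ from $\Phi_w$ and $\Phi_b$ along the lines of the derivation above ($\Phi_w$ is assumed positive definite):

```matlab
% Sketch: simultaneous diagonalization of Phi_w and Phi_b.
[W, ~]   = eig(Phi_b, Phi_w);              % eigenvectors of Phi_w^{-1} * Phi_b
Lambda_w = W' * Phi_w * W;                 % (approximately) diagonal
V   = W * diag(diag(Lambda_w) .^ (-1/2));  % scaled so that V' * Phi_w * V = I
Psi = V' * Phi_b * V;                      % diagonal between-class covariance
A   = inv(V');                             % A = (V^T)^{-1}, hence u = V' * (x - m)
```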
PLDA algorithm, version 2 – the MSR toolbox

Data-generation model:

$$
\mathbf{x}_{i j}=\mu+\mathbf{F h}_{i}+\mathbf{G} \mathbf{w}_{i j}+\epsilon_{i j}
$$

The parameters above are:
| Variable | Description |
| --- | --- |
| $x_{ij}$ | the i-vector (after the JFA-style simplification); we now want to remove its within-class variability |
| $\mathbf{F h}_{i}$ | models the variability between different speakers, with $h_i \sim N(\cdot \mid 0, I)$ |
| $\mathbf{G} \mathbf{w}_{i j}$ | models the variability within one speaker, which we want to make small, with $w_{ij} \sim N(\cdot \mid 0, I)$ |
| $\epsilon_{i j}$ | residual term, $\epsilon_{ij} \sim N(\cdot \mid 0, \Sigma)$ |
Reference [1] combines $F$ and $G$ by stacking:
$$
\begin{bmatrix}
\mathbf{x}_{1} \\ \mathbf{x}_{2} \\ \vdots \\ \mathbf{x}_{N}
\end{bmatrix}=
\begin{bmatrix}
\mu \\ \mu \\ \vdots \\ \mu
\end{bmatrix}+
\begin{bmatrix}
\mathbf{F} & \mathbf{G} & \mathbf{0} & \ldots & \mathbf{0} \\
\mathbf{F} & \mathbf{0} & \mathbf{G} & \ldots & \mathbf{0} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\mathbf{F} & \mathbf{0} & \mathbf{0} & \ldots & \mathbf{G}
\end{bmatrix}
\begin{bmatrix}
\mathbf{h} \\ \mathbf{w}_{1} \\ \mathbf{w}_{2} \\ \vdots \\ \mathbf{w}_{N}
\end{bmatrix}+
\begin{bmatrix}
\epsilon_{1} \\ \epsilon_{2} \\ \vdots \\ \epsilon_{N}
\end{bmatrix}
$$
which simplifies to:
$$
\mathbf{x}^{\prime}=\mu^{\prime}+\mathbf{A} \mathbf{y}+\epsilon^{\prime}
$$
E-step formulas:
$$
\begin{aligned}
E\left[\mathbf{y}_{i}\right] &=\left(\mathbf{A}^{T} \Sigma^{\prime-1} \mathbf{A}+\mathbf{I}\right)^{-1} \mathbf{A}^{T} \Sigma^{\prime-1}\left(\mathbf{x}_{i}-\mu^{\prime}\right) \\
E\left[\mathbf{y}_{i} \mathbf{y}_{i}^{T}\right] &=\left(\mathbf{A}^{T} \Sigma^{\prime-1} \mathbf{A}+\mathbf{I}\right)^{-1}+E\left[\mathbf{y}_{i}\right] E\left[\mathbf{y}_{i}\right]^{T}
\end{aligned}
$$
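A small sketch of these E-step formulas for one class $i$; A, Sigma (the residual covariance of the stacked model), mu and xi (the stacked observation vector $x'_i$) are hypothetical variable names assumed to hold the current estimates and data:

```matlab
% Sketch: E-step for one class in the stacked PLDA model.
AtSi = A' * pinv(Sigma);                   % A^T * Sigma'^{-1}
Cov  = pinv(AtSi * A + eye(size(A, 2)));   % (A^T Sigma'^{-1} A + I)^{-1}
Ey   = Cov * AtSi * (xi - mu);             % E[y_i]
Eyy  = Cov + Ey * Ey';                     % E[y_i y_i^T]
```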
M-step formulas:
$$
\begin{aligned}
\mu &=\frac{1}{I J} \sum_{i, j} \mathbf{x}_{i j} \\
\mathbf{B} &=\left(\sum_{i, j}\left(\mathbf{x}_{i j}-\mu\right) E\left[\mathbf{z}_{i}\right]^{T}\right)\left(\sum_{i, j} E\left[\mathbf{z}_{i} \mathbf{z}_{i}^{T}\right]\right)^{-1} \\
\mathbf{\Sigma} &=\frac{1}{I J} \sum_{i, j} \operatorname{Diag}\left[\left(\mathbf{x}_{i j}-\mu\right)\left(\mathbf{x}_{i j}-\mu\right)^{T}-\mathbf{B}\, E\left[\mathbf{z}_{i}\right]\left(\mathbf{x}_{i j}-\mu\right)^{T}\right]
\end{aligned}
$$
Reference [2] describes the recognition principle. For the two-covariance version, both covariances must be estimated jointly; reference [1] trains them using the combined (stacked) formulation above. The model contains two parts:

(i) the between-class (speaker) part: $\mathbf{x}_{i j}=\mu+\mathbf{F h}_{i}$

(ii) the within-class (channel) part: $c_r = \mathbf{G} \mathbf{w}_{i j}+\epsilon_{i j}$

Here, only the speaker part is modelled explicitly, while the channel effect is simplified:
$$
\mathbf{x}_{i j} =\mu+\mathbf{F h}_{i}+\epsilon_{i j}
$$
where $\epsilon_{ij}$ now has a full-covariance $\Sigma$ instead of the original diagonal covariance.
Verification score
Algorithm principle

The scoring formula is:
$$
\text{score} =\log \frac{p\left(\eta_{1}, \eta_{2} \mid \mathscr{H}_{s}\right)}{p\left(\eta_{1} \mid \mathscr{H}_{d}\right) p\left(\eta_{2} \mid \mathscr{H}_{d}\right)}
$$
The numerator is the probability that the two i-vectors come from the same distribution (the same speaker), while the denominator is the probability that they come from different distributions. The larger the numerator and the smaller the denominator, the higher the score, and the more likely the two utterances belong to the same speaker.

With some background in PPCA or factor analysis, the joint likelihood of the two variables is easy to write down: if they come from the same distribution the joint covariance is full (with cross terms), whereas variables from different distributions are independent, so the joint covariance is block-diagonal.
$$
\begin{aligned}
\text{score} &= \log \mathcal{N}\left(\begin{bmatrix} \eta_{1} \\ \eta_{2} \end{bmatrix} ;\, \begin{bmatrix} m \\ m \end{bmatrix},\, \begin{bmatrix} \Sigma_{tot} & \Sigma_{ac} \\ \Sigma_{ac} & \Sigma_{tot} \end{bmatrix}\right) \\
&\quad - \log \mathcal{N}\left(\begin{bmatrix} \eta_{1} \\ \eta_{2} \end{bmatrix} ;\, \begin{bmatrix} m \\ m \end{bmatrix},\, \begin{bmatrix} \Sigma_{tot} & 0 \\ 0 & \Sigma_{tot} \end{bmatrix}\right)
\end{aligned}
$$
Note that reference [2] contains a typo here; the correct definitions are $\Sigma_{tot}=FF^T+\Sigma$ and $\Sigma_{ac} = FF^T$.
Reference [3] collects all the matrix identities needed for these manipulations and is worth keeping at hand; the following is from its Chapter 9 on special matrices:
$$
\begin{aligned}
\left[\begin{array}{c|c}\mathbf{A}_{11} & \mathbf{A}_{12} \\ \hline \mathbf{A}_{21} & \mathbf{A}_{22}\end{array}\right]^{-1} &= \left[\begin{array}{c|c}\mathbf{C}_{1}^{-1} & -\mathbf{A}_{11}^{-1} \mathbf{A}_{12} \mathbf{C}_{2}^{-1} \\ \hline -\mathbf{C}_{2}^{-1} \mathbf{A}_{21} \mathbf{A}_{11}^{-1} & \mathbf{C}_{2}^{-1}\end{array}\right] \\
&= \left[\begin{array}{c|c}\mathbf{A}_{11}^{-1}+\mathbf{A}_{11}^{-1} \mathbf{A}_{12} \mathbf{C}_{2}^{-1} \mathbf{A}_{21} \mathbf{A}_{11}^{-1} & -\mathbf{C}_{1}^{-1} \mathbf{A}_{12} \mathbf{A}_{22}^{-1} \\ \hline -\mathbf{A}_{22}^{-1} \mathbf{A}_{21} \mathbf{C}_{1}^{-1} & \mathbf{A}_{22}^{-1}+\mathbf{A}_{22}^{-1} \mathbf{A}_{21} \mathbf{C}_{1}^{-1} \mathbf{A}_{12} \mathbf{A}_{22}^{-1}\end{array}\right]
\end{aligned}
$$

with

$$
\begin{aligned}
\mathbf{C}_{1}&=\mathbf{A}_{11}-\mathbf{A}_{12} \mathbf{A}_{22}^{-1} \mathbf{A}_{21} \\
\mathbf{C}_{2}&=\mathbf{A}_{22}-\mathbf{A}_{21} \mathbf{A}_{11}^{-1} \mathbf{A}_{12}
\end{aligned}
$$
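A quick numerical sanity check of this partitioned-inverse identity (the block size and the random test matrix below are arbitrary assumptions):

```matlab
% Sketch: verify the block-matrix inverse formula on a random SPD matrix.
d = 3;
M = randn(2*d); M = M * M' + 2*d * eye(2*d);        % SPD, so all blocks are invertible
A11 = M(1:d, 1:d);       A12 = M(1:d, d+1:end);
A21 = M(d+1:end, 1:d);   A22 = M(d+1:end, d+1:end);
C1 = A11 - A12 * (A22 \ A21);
C2 = A22 - A21 * (A11 \ A12);
Minv = [ pinv(C1),                   -(A11 \ (A12 * pinv(C2)));
         -(pinv(C2) * (A21 / A11)),   pinv(C2) ];
disp(norm(Minv - pinv(M)))                           % should be near zero
```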
Using the identities above, the score can be expanded as:
$$
score = \eta_1^T Q\,\eta_1 + \eta_2^T Q\,\eta_2 + 2\,\eta_1^T P\,\eta_2 + const
$$
where
$$
\begin{aligned}
\mathrm{Q}&=\Sigma_{tot}^{-1}-\left(\Sigma_{tot}-\Sigma_{ac} \Sigma_{tot}^{-1} \Sigma_{ac}\right)^{-1} \\
\mathrm{P}&=\Sigma_{tot}^{-1} \Sigma_{ac}\left(\Sigma_{tot}-\Sigma_{ac} \Sigma_{tot}^{-1} \Sigma_{ac}\right)^{-1}
\end{aligned}
$$
A further dimensionality reduction is possible by eigendecomposing the $P$ matrix:
$$
\begin{aligned}
\mathrm{P} &=\left[\mathrm{U}_{K} \mid \mathrm{U}_{D-K}\right] \operatorname{diag}\left(\left[\lambda_{1}, \ldots, \lambda_{K}, 0,0, \ldots\right]\right)\left[\mathrm{U}_{K} \mid \mathrm{U}_{D-K}\right]^{T} \\
&=\mathrm{U}_{K} \operatorname{diag}\left(\left[\lambda_{1}, \ldots, \lambda_{K}\right]\right) \mathrm{U}_{K}^{T}
\end{aligned}
$$
This gives a new score formula in which $\eta_{1}$ is reduced from dimension $(D,1)$ to $\tilde{\eta}_{1}$ of dimension $(K,1)$:
$$
\text{score} =\tilde{\eta}_{1}^{T} \widetilde{\mathrm{Q}}\, \tilde{\eta}_{1}+\tilde{\eta}_{2}^{T} \widetilde{\mathrm{Q}}\, \tilde{\eta}_{2}+2\, \tilde{\eta}_{1}^{T} \Lambda\, \tilde{\eta}_{2}+\text{const}
$$
where $\Lambda = \operatorname{diag}([\lambda_1,\dots,\lambda_K])$.
The purpose of this projection is to make the covariance of the projected $\eta_1$ and $\eta_2$ diagonal, thereby removing the correlation across different speakers (i.e. the between-class variance).
Scoring code in the MSR toolbox (Matlab)
```matlab
function scores = score_gplda_trials(plda, model_iv, test_iv)
Phi   = plda.Phi;
Sigma = plda.Sigma;
W     = plda.W;
M     = plda.M;

% center, length-normalize and whiten the i-vectors
model_iv = bsxfun(@minus, model_iv, M);
model_iv = length_norm(model_iv);
model_iv = W' * model_iv;
test_iv  = bsxfun(@minus, test_iv, M);
test_iv  = length_norm(test_iv);
test_iv  = W' * test_iv;

nphi = size(Phi, 1);
Sigma_ac  = Phi * Phi';
Sigma_tot = Sigma_ac + Sigma;
Sigma_tot_i = pinv(Sigma_tot);
Sigma_i = pinv(Sigma_tot - Sigma_ac * Sigma_tot_i * Sigma_ac);
Q = Sigma_tot_i - Sigma_i;
P = (Sigma_tot_i * Sigma_ac) * Sigma_i;

% rank reduction of P
[U, S] = svd(P);
S = diag(S);
Lambda = diag(S(1 : nphi));
Uk = U(:, 1 : nphi);
Q_hat = Uk' * Q * Uk;
model_iv = Uk' * model_iv;
test_iv  = Uk' * test_iv;

score_h1   = diag(model_iv' * Q_hat * model_iv);
score_h2   = diag(test_iv' * Q_hat * test_iv);
score_h1h2 = 2 * model_iv' * Lambda * test_iv;
scores = bsxfun(@plus, score_h1h2, score_h1);
scores = bsxfun(@plus, scores, score_h2');
```
References:

1. Probabilistic Linear Discriminant Analysis for Inferences About Identity
2. Analysis of I-vector Length Normalization in Speaker Recognition Systems
3. The Matrix Cookbook
4. EM for Probabilistic LDA — a detailed derivation of the EM algorithm for PLDA (somewhat involved)
5. https://sites.google.com/site/fastplda/ — provides the corresponding code