# Reading: Stein method in machine learning

Published:

Reading notes for paper Stein variational gradient descent: A general purpose bayesian inference algorithm. (Liu & Wang, 2016) including introduction to Stein’s identity, Stein discrepancy and finally Stein variational gradient descent.

## Introduction

Stein discrepancy, as a discrepancy statistic inspired by Stein’s identity, provides a new way of measuring two distributions. Coupled with RKHS, namely the reproducing kernel Hilbert space, its probable intractable nature in computation is overcome, allowing it much potential in applications. As a measure between distributions, it’s a natural idea to put Stein’s discrepancy into goodness-of-fit test, comparing the actual data and hypothesis, as well as apply it to obtain data sample approximating the target distribution. Its decent performances have inspire further exploration into diverse applications in machine learning using it as a new tool.

## Stein’s identity

Stein’s method has provided a new proof for Central Limit Theorem, estimating how a set of random variables (not necessarily uncorrelated) approximates a normal variable. Firstly, Stein’s Identity is constructed under Stein’s Characterization.

$\mathbb{E}_p[\mathcal{A}_pf(x)] = 0$

where Ap is Stein operator defined as Apf = ∇·(pf), with a continuous differentiable density p and a function f in the Stein class of p. Stein’s Identity also holds in more general case. To be specific, it still holds when ∇ is defined to be weak derivative, where

$pf \in (W^1)^d,~~ f\in (L_p^2)^d,~~ \mathcal{A}_p\ f \in L_p^2~~\text{and} ~ supp(p) = R^d.$

## Stein’s Discrepancy

Consider Ep[ Apf(x) ] with different distributions p and q. Since for smooth p, q with the same support and f in the Stein class of p, it can be shown that

$\mathbb{E}_p[\mathcal{A}_qf(x)] = \mathbb{E}_p[(s_q(x)-s_p(x))f(x)^\top]$

With proper f, the above equation is an expectation of weighted score function difference and it could be related to evaluate difference between distributions. From this perspective, the definition of Stein discrepancy between two distributions follows is natural

$\mathbb{D}(p \parallel q):= \max\limits_{f \in H} \{ \mathbb{E}_p[\mathcal{A}_qf(x)], s.t. \parallel f \parallel_H \leqslant 1 \}$

Served as a measure between two distributions, Stein discrepancy is supposed to be zero if and only if the two distributions p and q are identical. Actually, it holds when the functional space H is chosen to be Hp+ dense in Lp2, meanwhile functions in Hp+ being bounded and continuous and p positive everywhere.

This property might not hold if density functions p or q have bad behavior or an improper functional space H is chosen. Moreover, it would be more satisfying if H consists of functions with handy representation. It leads to the following discussions on functional space. ​

​ ​

With the help of Kernelized Stein Discrepancy, SVGD works by iteratively transports a set of particles to match the target distribution

$x^{l+1}_i \leftarrow x_i^l + \epsilon_l \phi^*(x_i^l)$

where φ∗ is chosen to be the function in unit ball of RKHS that achieves the maximum KSD. Further analysis reveal the relation between KSD and KL divergence

$\nabla_\epsilon KL(q_{[T]} \parallel p) \|_{\epsilon = 0} = - \mathbb{E}_{x\sim q}[trace(\mathcal{A}_p\phi(x))]$

Then φ∗ is the optimal perturbation direction that gives the steepest descent to KL divergence. As for its convergence rate, with a bounded Lipschitz metric, it could be shown that if BL(μ0n , μ0 ) converges to 0, then for any finite iteration l, BL(μln , μl )also converges to 0. Moreover, when μ0 is chosen to be log-concave, an O(1/√n) rate is attained.

Discussion However, conclusions on convergence rate with respect to time are not so direct when the funcitonal space discussed is too general . Firstly, even when the functional space H is constrained to be RKHS, the choice of kernel function largely affects the performance of SVGD. To be more specific, consider φ∗ as follow

$\phi^*(x) = \frac{1}{n} \sum_j [k(x_j^l,x)\nabla_{x_j^l}log p(x_j^l)+\nabla_{x_j^l} k(x_j^l,x)]$

Intuitively speaking, the second term plays the role of repulsive force preventing particles collapse into local modes. However, when the kernel function is not properly chosen, linear kernel k(x, y) =< x, y > for instance, then the second term would be x itself and it is no longer repulsive, not being able to achieve the potential of SVGD. Also from the analysis above, it’s reasonable to expect kernels that decay faster than polynomial like RBF kernel to have better performance. Some experiments have shown that SVGD dynamics has the potential to converge exponentially as the Langevin Dynamics. So I wonder whether more satisfying conclusions could be drawn if we consider a more specific functional space like RKHS with kernel function with good behaviour.

Also I’ve heard that in high dimensional space, the choice of kernel function is especially thorny due to the low discrimination between points. Is it possible to design a kernel function to ease this situation?

## Future Work

At present, I’m interested in how KSD is applied to specific applications such as importance sampling, reinforcement learning and GAN. Also I’d like to read more papers including about related topics to gain more intuitions on KSD by comparison.