# Reading: Stein's method and its applications (Part 1)

Short notes on five papers I read these days on applications of Stein's method, including wild variational inference, reinforcement learning and sampling. Most papers are from the project Stein's method for practical machine learning, which I am quite interested in.

## Approximate Inference with Amortised MCMC

Amortised MCMC is proposed as a framework for approximating a posterior distribution p of interest. With the following three main ingredients:

• a parametric set $Q = \{q_\phi\}$ of sampler distributions
• a transition kernel $K(z_t \mid z_{t-1})$ of the MCMC dynamics
• a divergence $D(\cdot \parallel \cdot)$ and an update rule for $\phi$

the process of approximating p proceeds as follows, where we update the distribution within $Q$ by minimizing the discrepancy between samples before and after the MCMC transitions. First, with samples $\{z_0^k\}$ drawn from the distribution $q_{\phi_{t-1}}$ of the $(t-1)$-th iteration, apply a $T$-step transition to obtain $\{z_T^k\}$, where $z_T^k \sim K^T(\cdot \mid z_0^k)$. To minimize the discrepancy between $\{z_0^k\}$ and $\{z_T^k\}$, $\phi$ is updated as follows

$\phi_t \leftarrow \phi_{t-1} - \eta \nabla_{\phi} D(\{z_0^k\} \parallel \{z_T^k\}) \big|_{\phi = \phi_{t-1}}$

Alternative update rules include minimizing the KL divergence, adversarially estimated divergences, and energy matching.
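To make the loop concrete, here is a minimal PyTorch sketch. It assumes a reparameterized Gaussian $q_\phi$, an unadjusted Langevin kernel targeting a 1-D Gaussian $p$, and a crude moment-matching surrogate for the divergence $D$; all three choices are illustrative assumptions of mine, not the paper's recipe.

```python
# Sketch of amortised MCMC: sample from q_phi, run T MCMC steps towards p,
# then update phi to reduce the discrepancy between the two sample sets.
# Assumptions: Gaussian q_phi, Langevin kernel, moment-matching surrogate for D.
import torch

torch.manual_seed(0)

def grad_log_p(z):                      # target p = N(2, 1), so grad log p(z) = -(z - 2)
    return -(z - 2.0)

# Parametric sampler q_phi: reparameterized Gaussian with phi = (mu, log_sigma).
mu = torch.zeros(1, requires_grad=True)
log_sigma = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=0.05)

def sample_q(n):
    return mu + log_sigma.exp() * torch.randn(n, 1)

def langevin_step(z, step=0.1):
    return z + step * grad_log_p(z) + (2 * step) ** 0.5 * torch.randn_like(z)

for t in range(500):
    z0 = sample_q(128)                  # {z_0^k} from q_{phi_{t-1}}
    zT = z0.detach()
    for _ in range(5):                  # T-step transition kernel K^T
        zT = langevin_step(zT)
    # Surrogate discrepancy D({z_0^k} || {z_T^k}): match the first two moments.
    D = (z0.mean() - zT.mean()) ** 2 + (z0.var() - zT.var()) ** 2
    opt.zero_grad()
    D.backward()                        # gradient flows to phi only (zT is detached)
    opt.step()

print(mu.item(), log_sigma.exp().item())   # should approach roughly (2, 1)
```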

## Two Methods for Wild Variational Inference

### Amortized SVGD

Train an inference network $f(\eta;\xi)$ by iteratively adjusting $\eta$ so that the outputs of $f(\eta;\xi)$ move towards their SVGD-updated counterparts. This process is similar to the amortised MCMC of the previous section, differing in how the samples produced by $f(\eta;\xi)$ with random $\{\xi_i\}$ are moved in each iteration. Each iteration computes the Stein variational gradient $\Delta z_i$ for $z_i = f(\eta; \xi_i)$ and then updates $\eta$ as follows

$\Delta \eta = \partial_\eta \sum_{i=1}^n \parallel f(\eta;\xi_i) - z_i - \epsilon \Delta z_i \parallel_2^2 = -2\epsilon \sum_{i=1}^n \partial_\eta f(\eta;\xi_i) \Delta z_i$
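As a concrete illustration, here is a minimal sketch of amortized SVGD, assuming a small MLP for $f(\eta;\xi)$, an RBF kernel with fixed bandwidth, and a 1-D Gaussian target $p$; these specifics are my own illustrative choices.

```python
# Sketch of amortized SVGD: particles are outputs of f(eta; xi); each step they are
# nudged along the Stein variational gradient and eta is regressed onto the moved particles.
import torch

torch.manual_seed(0)

def grad_log_p(z):                         # target p = N(2, 1)
    return -(z - 2.0)

f = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
opt = torch.optim.Adam(f.parameters(), lr=1e-3)

def svgd_direction(z, h=0.5):
    # Delta z_i = (1/n) sum_j [ k(z_j, z_i) grad log p(z_j) + grad_{z_j} k(z_j, z_i) ]
    diff = z - z.t()                       # diff[i, j] = z_i - z_j
    k = torch.exp(-diff ** 2 / (2 * h))    # RBF kernel
    grad_k = diff / h * k                  # grad_{z_j} k(z_j, z_i)
    return (k @ grad_log_p(z) + grad_k.sum(1, keepdim=True)) / z.shape[0]

for step in range(2000):
    xi = torch.randn(64, 1)
    z = f(xi)                              # z_i = f(eta; xi_i)
    dz = svgd_direction(z.detach())        # Stein variational gradient (no grad to eta)
    # Move f(eta; xi_i) toward z_i + eps * dz_i; the gradient of this loss w.r.t. eta
    # reduces to -2 * eps * sum_i d f / d eta * dz_i, matching the update above.
    loss = ((z - (z.detach() + 0.1 * dz)) ** 2).sum()
    opt.zero_grad(); loss.backward(); opt.step()

print(f(torch.randn(1000, 1)).mean().item())   # should end up near 2
```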

### KSD Variational Inference

Optimize $\eta$ with standard gradient descent, where the gradient is chosen to minimize the kernelized Stein discrepancy (KSD), approximated by a U-statistic as follows

$\mathbb{D}^2(q_\eta \parallel p) \approx \frac{1}{n(n-1)} \sum_{i\neq j} k_p(f(\eta;\xi_i),f(\eta;\xi_j))$
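A sketch of this estimator and the resulting training loop, assuming a 1-D Gaussian target and the standard closed-form Stein kernel $k_p$ for an RBF base kernel; the network and hyperparameters are again illustrative.

```python
# Sketch of KSD variational inference: the U-statistic estimate of D^2(q_eta || p)
# is differentiable in eta through the samples z_i = f(eta; xi_i).
import torch

torch.manual_seed(0)

def score_p(z):                            # grad_z log p(z) for p = N(2, 1)
    return -(z - 2.0)

def ksd_u_stat(z, h=0.5):
    # Stein kernel k_p(x, y) for a 1-D RBF base kernel k(x, y) = exp(-(x - y)^2 / (2h)).
    d = z - z.t()                          # d[i, j] = z_i - z_j
    k = torch.exp(-d ** 2 / (2 * h))
    s = score_p(z)
    kp = (s @ s.t()) * k \
         + s * (d / h) * k \
         + s.t() * (-d / h) * k \
         + (1.0 / h - d ** 2 / h ** 2) * k
    n = z.shape[0]
    return (kp.sum() - kp.diagonal().sum()) / (n * (n - 1))   # drop i = j terms

f = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
opt = torch.optim.Adam(f.parameters(), lr=1e-3)

for step in range(2000):
    z = f(torch.randn(64, 1))              # z_i = f(eta; xi_i)
    loss = ksd_u_stat(z)                   # approximate D^2(q_eta || p)
    opt.zero_grad(); loss.backward(); opt.step()

print(f(torch.randn(1000, 1)).mean().item())   # should drift towards 2
```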

## Stein Variational Policy Gradient

In the context of reinforcement learning, an agent takes an action a in the environment, and the environment returns an instant scalar reward r to the agent. The agent needs to learn a policy π to maximize its expected return

$J(\pi) = \mathbb{E}\left[\sum_{t=0}^\infty \gamma^t r(s_t,a_t)\right],~\text{where}~ a_t \sim \pi(a_t \mid s_t) ~\text{and}~ s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$

with the transition distribution P determined by the environment.
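As a quick numeric illustration of the discounted sum inside $J(\pi)$, with a hypothetical finite reward sequence:

```python
# Discounted return of one rollout: sum_t gamma^t * r(s_t, a_t).
# The rewards below are made up purely for illustration.
gamma = 0.99
rewards = [1.0, 0.0, 0.5, 1.0]
discounted_return = sum(gamma ** t * r for t, r in enumerate(rewards))
print(discounted_return)   # 1.0 + 0.99 * 0.0 + 0.99**2 * 0.5 + 0.99**3 * 1.0
```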

Taking policies $\theta_i$ to be particles and $q(\theta)$ to be a target distribution that favors policies with high expected return, SVGD can then be applied to draw a sample from $q$. To obtain this target distribution $q(\theta)$, we maximize the regularized expected return

$\max_q \left\{ \mathbb{E}_{\theta \sim q(\theta)} [J(\theta)] - \alpha \mathbb{D}(q \parallel q_0) \right\}$

where the first term drives exploitation and the second term encourages exploration, so that $q$ maximizes the expected return while staying close to a prior distribution $q_0$. Taking $\mathbb{D}$ to be the KL divergence, differentiating the objective w.r.t. $q(\theta)$ and setting the derivative to zero, we have

$0 = \partial_{q(\theta)} \left\{ \mathbb{E}_{\theta \sim q(\theta)} [J(\theta)] - \alpha \mathbb{D}(q \parallel q_0) \right\} = J(\theta) - \alpha \left(1 + \ln q(\theta) - \ln q_0(\theta)\right)$

from which we obtain the following result

$\frac{1}{\alpha} J(\theta) = 1 + \ln q(\theta) - \ln q_0(\theta) \;\Rightarrow\; \ln q(\theta) = \frac{1}{\alpha} J(\theta) + \ln q_0(\theta) - 1 \;\Rightarrow\; q(\theta) \propto \exp\left\{\frac{1}{\alpha} J(\theta)\right\} q_0(\theta)$

and in each iteration of SVGD we update $\{\theta_i\}$ with

$\Delta \theta_i = \frac{1}{n} \sum_{j=1}^n \left\{ \nabla_{\theta_j}\!\left[\frac{1}{\alpha} J(\theta_j) + \ln q_0(\theta_j)\right] k(\theta_j,\theta_i) + \nabla_{\theta_j} k(\theta_j,\theta_i) \right\}$
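Below is a minimal sketch of this particle update. It uses a toy differentiable surrogate for $J(\theta)$ and a Gaussian prior $q_0$; in the actual algorithm $\nabla_\theta J$ would come from a policy-gradient estimator, and every specific choice here is an assumption made for illustration.

```python
# Sketch of the SVGD particle update on policy parameters theta_i, targeting
# q(theta) proportional to exp(J(theta) / alpha) * q_0(theta).
import torch

torch.manual_seed(0)
alpha, h, n = 1.0, 1.0, 16
theta = torch.randn(n, 2)                       # n policy-parameter particles

def grad_J(th):                                  # toy surrogate: J(theta) = -||theta - 1||^2
    return -2.0 * (th - 1.0)

def grad_log_q0(th):                             # prior q_0 = N(0, I)
    return -th

def svpg_step(theta, eps=0.05):
    diff = theta.unsqueeze(1) - theta.unsqueeze(0)       # diff[i, j] = theta_i - theta_j
    k = torch.exp(-(diff ** 2).sum(-1) / (2 * h))        # RBF kernel k(theta_j, theta_i)
    score = grad_J(theta) / alpha + grad_log_q0(theta)   # grad log q(theta_j)
    drift = k @ score                                    # sum_j k(theta_j, theta_i) * score_j
    repulsion = (diff / h * k.unsqueeze(-1)).sum(1)      # sum_j grad_{theta_j} k(theta_j, theta_i)
    return theta + eps * (drift + repulsion) / theta.shape[0]

for _ in range(500):
    theta = svpg_step(theta)

print(theta.mean(0))   # particles settle around the high-return region, spread by the kernel term
```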