
Commit 18666c0

HumphreyYang and mmcky authored
[statistics and information] Update reader's suggestions (#585)
* incorporate reader's feedback
* updates
* updates
* updates

Co-authored-by: Matt McKay <mmcky@users.noreply.github.com>
1 parent a7648c0 commit 18666c0

File tree: 5 files changed, +150 −127 lines changed


lectures/divergence_measures.md

Lines changed: 5 additions & 12 deletions
@@ -4,7 +4,7 @@ jupytext:
     extension: .md
     format_name: myst
     format_version: 0.13
-    jupytext_version: 1.17.1
+    jupytext_version: 1.16.6
 kernelspec:
   display_name: Python 3 (ipykernel)
   language: python
@@ -60,10 +60,6 @@ import pandas as pd
 from IPython.display import display, Math
 ```

-
-
-
-
 ## Primer on entropy, cross-entropy, KL divergence

 Before diving in, we'll introduce some useful concepts in a simple setting.
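For reference, the hunk ends before the lecture's own statements, so here are the standard definitions this primer builds on, in conventional notation rather than quoted from the file: for densities $f$ and $g$,

$$
H(f) = -\int f(x) \log f(x)\, dx, \quad
H(f, g) = -\int f(x) \log g(x)\, dx, \quad
D_{KL}(f \| g) = \int f(x) \log \frac{f(x)}{g(x)}\, dx .
$$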
@@ -163,6 +159,8 @@ f(z; a, b) = \frac{\Gamma(a+b) z^{a-1} (1-z)^{b-1}}{\Gamma(a) \Gamma(b)}
 \Gamma(p) := \int_{0}^{\infty} x^{p-1} e^{-x} dx
 $$

+We introduce two Beta distributions $f(x)$ and $g(x)$, which we will use to illustrate the different divergence measures.
+
 Let's define parameters and density functions in Python

 ```{code-cell} ipython3
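The body of the code cell opened above is not part of this hunk. A minimal sketch of what such parameter and density definitions can look like, built directly from the Gamma-function formula above; the names and values `F_a, F_b, G_a, G_b` are illustrative assumptions, not the lecture's actual choices:

```python
from math import gamma  # Euler Gamma function Γ(p)

import numpy as np

def beta_pdf(z, a, b):
    """Beta(a, b) density: Γ(a+b) z^(a-1) (1-z)^(b-1) / (Γ(a) Γ(b))."""
    return gamma(a + b) * z**(a - 1) * (1 - z)**(b - 1) / (gamma(a) * gamma(b))

# illustrative parameters for the two distributions f and g (assumptions)
F_a, F_b = 1, 1
G_a, G_b = 3, 1.2

f = lambda x: beta_pdf(x, F_a, F_b)
g = lambda x: beta_pdf(x, G_a, G_b)

x = np.linspace(1e-5, 1 - 1e-5, 200)
print(f(x)[:3], g(x)[:3])
```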
@@ -198,8 +196,6 @@ plt.legend()
 plt.show()
 ```

-
-
 (rel_entropy)=
 ## Kullback–Leibler divergence

@@ -457,15 +453,14 @@ plt.show()

 We now generate plots illustrating how overlap visually diminishes as divergence measures increase.

-
 ```{code-cell} ipython3
 param_grid = [
     ((1, 1), (1, 1)),
     ((1, 1), (1.5, 1.2)),
     ((1, 1), (2, 1.5)),
     ((1, 1), (3, 1.2)),
-    ((1, 1), (5, 1)),
-    ((1, 1), (0.3, 0.3))
+    ((1, 1), (0.3, 0.3)),
+    ((1, 1), (5, 1))
 ]
 ```

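The plotting helper `plot_dist_diff` that consumes `param_grid` appears in a later hunk. The divergence numbers behind these Beta pairs can be reproduced with a rough sketch like the one below; the function name `kl_divergence` and the use of `scipy` quadrature are assumptions, not the lecture's own code:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import beta

def kl_divergence(p_params, q_params):
    """KL(p || q) for two Beta distributions, by numerical quadrature."""
    p = lambda x: beta.pdf(x, *p_params)
    q = lambda x: beta.pdf(x, *q_params)
    integrand = lambda x: p(x) * np.log(p(x) / q(x))
    value, _ = quad(integrand, 1e-8, 1 - 1e-8)
    return value

param_grid = [
    ((1, 1), (1, 1)),
    ((1, 1), (1.5, 1.2)),
    ((1, 1), (2, 1.5)),
    ((1, 1), (3, 1.2)),
    ((1, 1), (0.3, 0.3)),
    ((1, 1), (5, 1)),
]

for f_params, g_params in param_grid:
    print(f"KL(Beta{f_params} || Beta{g_params}) ≈ "
          f"{kl_divergence(f_params, g_params):.3f}")
```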
@@ -512,8 +507,6 @@ def plot_dist_diff(para_grid):
 divergence_data = plot_dist_diff(param_grid)
 ```

-
-
 ## KL divergence and maximum-likelihood estimation

lectures/imp_sample.md

Lines changed: 23 additions & 16 deletions
@@ -36,7 +36,7 @@ import matplotlib.pyplot as plt
 from math import gamma
 ```

-## Mathematical Expectation of Likelihood Ratio
+## Mathematical expectation of likelihood ratio

 In {doc}`this lecture <likelihood_ratio_process>`, we studied a likelihood ratio $\ell \left(\omega_t\right)$

@@ -57,11 +57,10 @@ $$
 Our goal is to approximate the mathematical expectation $E \left[ L\left(\omega^t\right) \right]$ well.

 In {doc}`this lecture <likelihood_ratio_process>`, we showed that $E \left[ L\left(\omega^t\right) \right]$ equals $1$ for all $t$.
+
 We want to check how well this holds if we replace $E$ with sample averages from simulations.

-This turns out to be easier said than done because for
-Beta distributions assumed above, $L\left(\omega^t\right)$ has
-a very skewed distribution with a very long tail as $t \rightarrow \infty$.
+This turns out to be easier said than done because for Beta distributions assumed above, $L\left(\omega^t\right)$ has a very skewed distribution with a very long tail as $t \rightarrow \infty$.

 This property makes it difficult to estimate the mean efficiently and accurately by standard Monte Carlo simulation methods.
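To see how far sample averages can drift from $1$, here is a minimal Monte Carlo sketch; the parameter values, $f = Beta(1, 1)$ and $g = Beta(3, 1.2)$ with nature drawing from $g$, are assumptions consistent with the surrounding text rather than quotes from this diff:

```python
import numpy as np
from numpy.random import default_rng
from scipy.stats import beta

# assumed parameters: f = Beta(1, 1) (uniform), g = Beta(3, 1.2); nature draws from g
f_a, f_b = 1.0, 1.0
g_a, g_b = 3.0, 1.2

rng = default_rng(0)
T, N = 10, 10_000

# draw N sequences of length T from g and form L(ω^T) = ∏_t f(ω_t) / g(ω_t)
w = rng.beta(g_a, g_b, size=(N, T))
L = np.prod(beta.pdf(w, f_a, f_b) / beta.pdf(w, g_a, g_b), axis=1)

print("sample mean of L  :", L.mean())      # equals 1 in theory, but is typically biased low
print("sample median of L:", np.median(L))  # most draws are far below 1
print("sample max of L   :", L.max())       # a few huge draws carry most of the mean
```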

@@ -156,7 +155,7 @@ $$
 E^g\left[\ell\left(\omega\right)\right] = \int_\Omega \ell(\omega) g(\omega) d\omega = \int_\Omega \ell(\omega) \frac{g(\omega)}{h(\omega)} h(\omega) d\omega = E^h\left[\ell\left(\omega\right) \frac{g(\omega)}{h(\omega)}\right]
 $$

-## Selecting a Sampling Distribution
+## Selecting a sampling distribution

 Since we must use an $h$ that has larger mass in parts of the distribution to which $g$ puts low mass, we use $h=Beta(0.5, 0.5)$ as our importance distribution.
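The change-of-measure identity above can be checked numerically with a short sketch; the Beta parameters for $f$, $g$, and $h$ below are assumptions consistent with the lecture's setup:

```python
import numpy as np
from numpy.random import default_rng
from scipy.stats import beta

# assumed parameters: f = Beta(1, 1), g = Beta(3, 1.2), h = Beta(0.5, 0.5)
f_a, f_b = 1.0, 1.0
g_a, g_b = 3.0, 1.2
h_a, h_b = 0.5, 0.5

ell = lambda w: beta.pdf(w, f_a, f_b) / beta.pdf(w, g_a, g_b)   # likelihood ratio ℓ(ω)

rng = default_rng(0)
N = 100_000

w_g = rng.beta(g_a, g_b, size=N)                 # sample from g, average ℓ(ω)
lhs = ell(w_g).mean()

w_h = rng.beta(h_a, h_b, size=N)                 # sample from h, average ℓ(ω) g(ω)/h(ω)
rhs = (ell(w_h) * beta.pdf(w_h, g_a, g_b) / beta.pdf(w_h, h_a, h_b)).mean()

print(lhs, rhs)   # both estimate E^g[ℓ(ω)] = 1
```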

@@ -178,7 +177,7 @@ plt.ylim([0., 3.])
 plt.show()
 ```

-## Approximating a Cumulative Likelihood Ratio
+## Approximating a cumulative likelihood ratio

 We now study how to use importance sampling to approximate
 ${E} \left[L(\omega^t)\right] = {E}\left[\prod_{i=1}^T \ell \left(\omega_i\right)\right]$.
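The `estimate` helper invoked in the next hunks is defined in a part of the lecture that this diff does not show. The sketch below is only an assumption about its body, written to be consistent with the call signature `estimate(g_a, g_b, q_a, q_b, T, N)` that appears later:

```python
import numpy as np
from math import gamma
from numpy.random import default_rng

def p(w, a, b):
    """Beta(a, b) density via the Gamma function, matching the lecture's import."""
    return gamma(a + b) / (gamma(a) * gamma(b)) * w**(a - 1) * (1 - w)**(b - 1)

# assumed parameters of the numerator density f (uniform)
f_a, f_b = 1.0, 1.0

rng = default_rng(0)

def estimate(g_a, g_b, q_a, q_b, T=1, N=10000):
    """Average ∏_t [f(ω_t)/g(ω_t)]·[g(ω_t)/q(ω_t)] over N length-T draws from Beta(q_a, q_b)."""
    w = rng.beta(q_a, q_b, size=(N, T))
    ratio = (p(w, f_a, f_b) / p(w, g_a, g_b)) * (p(w, g_a, g_b) / p(w, q_a, q_b))
    return np.mean(np.prod(ratio, axis=1))
```

Called with `q = g` this reduces to the plain Monte Carlo estimate; called with `q = h` it is the importance-sampling estimate used below.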
@@ -233,7 +232,7 @@ For our importance sampling estimate, we set $q = h$.
 estimate(g_a, g_b, h_a, h_b, T=1, N=10000)
 ```

-Evidently, even at T=1, our importance sampling estimate is closer to $1$ than is the Monte Carlo estimate.
+Evidently, even at $T=1$, our importance sampling estimate is closer to $1$ than is the Monte Carlo estimate.

 Bigger differences arise when computing expectations over longer sequences, $E_0\left[L\left(\omega^t\right)\right]$.

@@ -248,7 +247,17 @@ estimate(g_a, g_b, g_a, g_b, T=10, N=10000)
 estimate(g_a, g_b, h_a, h_b, T=10, N=10000)
 ```

-## Distribution of Sample Mean
+The Monte Carlo method underestimates because the likelihood ratio $L(\omega^T) = \prod_{t=1}^T \frac{f(\omega_t)}{g(\omega_t)}$ has a highly skewed distribution under $g$.
+
+Most samples from $g$ produce small likelihood ratios, while the true mean requires occasional very large values that are rarely sampled.
+
+In our case, since $g(\omega) \to 0$ as $\omega \to 0$ while $f(\omega)$ remains constant, the Monte Carlo procedure undersamples precisely where the likelihood ratio $\frac{f(\omega)}{g(\omega)}$ is largest.
+
+As $T$ increases, this problem worsens exponentially, making standard Monte Carlo increasingly unreliable.
+
+Importance sampling with $q = h$ fixes this by sampling more uniformly from regions important to both $f$ and $g$.
+
+## Distribution of sample mean

 We next study the bias and efficiency of the Monte Carlo and importance sampling approaches.
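To make the undersampling mechanism described in the newly added paragraphs concrete, one can tabulate $g(\omega)$ and the ratio $f(\omega)/g(\omega)$ near $\omega = 0$; this is a sketch with the same assumed parameter values as before:

```python
from scipy.stats import beta

f = lambda w: beta.pdf(w, 1, 1)      # assumed: f uniform
g = lambda w: beta.pdf(w, 3, 1.2)    # assumed: g = Beta(3, 1.2)

for w in [0.5, 0.1, 0.01, 0.001]:
    print(f"ω = {w:<6}  g(ω) = {g(w):.2e}   f(ω)/g(ω) = {f(w) / g(w):.2e}")

# g puts vanishing mass near ω = 0, exactly where f/g explodes, so draws from g
# almost never land where the likelihood ratio is large.
```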

@@ -323,17 +332,15 @@ The simulation exercises above show that the importance sampling estimates are u

 Evidently, the bias increases with increases in $T$.

-## Choosing a Sampling Distribution
+## Choosing a sampling distribution

 +++

 Above, we arbitrarily chose $h = Beta(0.5,0.5)$ as the importance distribution.

 Is there an optimal importance distribution?

-In our particular case, since we know in advance that $E_0 \left[ L\left(\omega^t\right) \right] = 1$.
-
-We can use that knowledge to our advantage.
+In our particular case, since we know in advance that $E_0 \left[ L\left(\omega^t\right) \right] = 1$, we can use that knowledge to our advantage.

 Thus, suppose that we simply use $h = f$.

@@ -364,10 +371,10 @@ b_list = [0.5, 1.2, 5.]
 ```{code-cell} ipython3
 w_range = np.linspace(1e-5, 1-1e-5, 1000)

-plt.plot(w_range, g(w_range), label=f'p=Beta({g_a}, {g_b})')
-plt.plot(w_range, p(w_range, a_list[0], b_list[0]), label=f'g=Beta({a_list[0]}, {b_list[0]})')
-plt.plot(w_range, p(w_range, a_list[1], b_list[1]), label=f'g=Beta({a_list[1]}, {b_list[1]})')
-plt.plot(w_range, p(w_range, a_list[2], b_list[2]), label=f'g=Beta({a_list[2]}, {b_list[2]})')
+plt.plot(w_range, g(w_range), label=f'g=Beta({g_a}, {g_b})')
+plt.plot(w_range, p(w_range, a_list[0], b_list[0]), label=f'$h_1$=Beta({a_list[0]},{b_list[0]})')
+plt.plot(w_range, p(w_range, a_list[1], b_list[1]), label=f'$h_2$=Beta({a_list[1]},{b_list[1]})')
+plt.plot(w_range, p(w_range, a_list[2], b_list[2]), label=f'$h_3$=Beta({a_list[2]},{b_list[2]})')
 plt.title('real data generating process $g$ and importance distribution $h$')
 plt.legend()
 plt.ylim([0., 3.])