Unexpected LOO and ELPD output #2448

**Describe the bug**
I am currently using arviz for an undergrad internship project.
When I use compare() to weight different models with Stacking or Pseudo-BMA, arviz calculates different elpds and weights than I would expect. The `stats.py` file containing the compare() code seems to perform two steps differently from how I would expect: the step where the Importance Sampling / Pointwise Log-Likelihood values are calculated, and the step where the Stacking weights are calculated.

I've written some of my own code following the process in this paper: https://arxiv.org/pdf/1704.02030 (Yao et al. 2017). The main difference is in the Importance Sampling step, where I use a more complex LOO process outlined in this paper: https://arxiv.org/abs/2410.03507 (Nguyen et al. 2024). The first paper states that the Importance Sampling method arviz uses and the LOO-based IS method Nguyen et al. use are equivalent, or at least proportional. I'd like to find out whether the differences in the calculated weights between these two methods are due to this difference in the Importance Sampling step, whether they are expected, or whether there is an evident bug in my code.

$$
r_i^s= \frac{1}{p(y_i|\theta^s,M_k)}\propto \frac{p(\theta^s|y_{-i})}{p(\theta^s|y)}
$$
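For reference, here is a minimal sketch of how the ratio above turns pointwise log-likelihood values into LOO elpd estimates using plain self-normalized importance sampling. It is not the code from the linked repository and not arviz's implementation: as far as I know, arviz's loo() also applies Pareto smoothing to the ratios, which this sketch leaves out. The function names and the `(draws, observations)` array layout are assumptions made for illustration.

```python
import numpy as np


def logsumexp(a, axis=0):
    """Numerically stable log(sum(exp(a))) along one axis."""
    a_max = np.max(a, axis=axis, keepdims=True)
    out = np.log(np.sum(np.exp(a - a_max), axis=axis, keepdims=True)) + a_max
    return np.squeeze(out, axis=axis)


def is_loo_pointwise(log_lik):
    """Raw importance-sampling LOO estimate of log p(y_i | y_{-i}).

    log_lik: array of shape (S, N) holding log p(y_i | theta^s, M_k)
             for S posterior draws and N observations (assumed layout).
    Returns an array of N pointwise elpd values.
    """
    log_lik = np.asarray(log_lik, dtype=float)
    n_draws = log_lik.shape[0]
    # log r_i^s = -log p(y_i | theta^s, M_k), as in the ratio above
    log_r = -log_lik
    # Self-normalized IS: sum_s(r * p) / sum_s(r) = S / sum_s(r),
    # because r * p = 1 for every draw.
    return np.log(n_draws) - logsumexp(log_r, axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_log_lik = rng.normal(loc=-1.0, scale=0.3, size=(2000, 5))
    elpd_i = is_loo_pointwise(fake_log_lik)
    print("pointwise elpd:", elpd_i)
    print("total elpd:", elpd_i.sum())
```

Because this estimator is a harmonic mean of the likelihood values, a few draws with very small likelihood can dominate it. Any smoothing or truncation of the ratios therefore changes the pointwise elpds, and through them the weights.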

**To Reproduce**
I have linked a notebook and a .py file. 
https://github.com/antdifo66/antoniobma/tree/main/arvizforum
The code generates noisy data from a more complicated polynomial model and fits three slightly different candidate polynomial models to it. It then uses arviz's compare() function to model-average over the candidate models. I also run the same analysis with code that I've written, in which I have tried to replicate the equations given in the papers above. The resulting elpds (and, more importantly, the weights) differ between the two implementations.
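The weighting step can be sketched as follows, assuming an `(N, K)` matrix of pointwise LOO elpd values for N observations and K models. This is an illustration of the two weighting rules in Yao et al., not the code from the linked repository and not arviz's implementation. The stacking weights here are found with a simple EM-style fixed-point update, chosen only to keep the example numpy-only; the function names are hypothetical.

```python
import numpy as np


def pseudo_bma_weights(elpd_pointwise):
    """Pseudo-BMA weights: w_k proportional to exp(total elpd of model k)."""
    elpd = np.asarray(elpd_pointwise, dtype=float).sum(axis=0)
    w = np.exp(elpd - elpd.max())  # subtract the max for numerical stability
    return w / w.sum()


def stacking_weights(elpd_pointwise, n_iter=5000, tol=1e-12):
    """Stacking weights maximizing sum_i log(sum_k w_k * exp(elpd_ik)).

    Uses the EM update for mixture proportions with fixed components,
    which keeps w on the simplex and increases the objective each step.
    """
    lp = np.asarray(elpd_pointwise, dtype=float)
    # Rescaling each row by a constant does not change the maximizer.
    dens = np.exp(lp - lp.max(axis=1, keepdims=True))
    n_models = dens.shape[1]
    w = np.full(n_models, 1.0 / n_models)
    for _ in range(n_iter):
        mix = dens @ w  # mixture predictive density per observation
        w_new = w * (dens / mix[:, None]).mean(axis=0)
        converged = np.max(np.abs(w_new - w)) < tol
        w = w_new
        if converged:
            break
    return w


if __name__ == "__main__":
    # Made-up pointwise elpds: each model is better on half of the points.
    elpd = np.array([[-1.0, -3.0], [-1.0, -3.0], [-3.0, -1.0], [-3.0, -1.0]])
    print("pseudo-BMA:", pseudo_bma_weights(elpd))
    print("stacking:  ", stacking_weights(elpd))
```

Pseudo-BMA depends only on the total elpd of each model, while stacking depends on the full pointwise matrix, so small differences in individual pointwise values can move the stacking weights much more than the Pseudo-BMA weights.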

**Expected behavior**
The elpds and weights should be the same, or at least quite similar. Sometimes the weights are off by 10%, and sometimes different models are picked out entirely (usually in the case of stacking).

**Additional context**
Other libraries: emcee, pandas, getdist, numpy, matplotlib, tqdm. Windows 11, arviz version 0.20.0.

