|
| 1 | + |
| 2 | +# Formula syntax |
| 3 | + |
| 4 | +MixedModels.jl uses the variant of the Wilkinson-Rogers (1973) notation for models of (co)variance implemented by [StatsModels.jl](https://juliastats.org/StatsModels.jl/stable/formula/#Modeling-tabular-data). |
| 5 | +Additionally, MixedModels.jl extends this syntax to use the pipe `|` as the grouping operator. |
| 6 | +Further extensions are provided by [RegressionFormulae.jl](https://github.com/kleinschmidt/RegressionFormulae.jl?tab=readme-ov-file), in particular the use of the slash `/` as the nesting operator and the use of the caret `^` to indicate main effects and interactions up to a specified order. |
| 7 | +Currently, MixedModels.jl loads RegressionFormulae.jl by default, though this may change in a future release. |
| 8 | +If you require specific functionality from RegressionFormulae.jl, it is best to load it directly so that you can control the version used. |
| 9 | + |
| 10 | +## General rules |
| 11 | + |
| 12 | +- "Addition" (`+`) indicates additive, i.e., main effects: `a + b` indicates main effects of `a` and `b`. |
| 13 | +- "Multiplication" (`*`) indicates crossing: main effects and interactions between two terms: `a * b` indicates main effects of `a` and `b` as well as their interaction. |
| 14 | +- Usual algebraic rules apply (associativity and distributivity): |
| 15 | + - `(a + b) * c` is equivalent to `a * c + b * c` |
| 16 | + - `a * b * c` corresponds to main effects of `a`, `b`, and `c`, as well as all three two-way interactions and the three-way interaction. |
| 17 | +- Categorical terms are expanded into the associated indicators/contrast variables. See the [StatsModels.jl documentation on contrasts](https://juliastats.org/StatsModels.jl/stable/contrasts/) for more information. |
| 18 | +- Interactions are expressed with the ampersand (`&`). (This is contrast to R, which uses the colon `:` for this operation.). `a&b` is the interaction of `a` and `b`. For categorical terms, appropriate combinations of indicators/contrast variables are generated. |
| 19 | +- Tilde (`~`) is used to separate response from predictors. |
| 20 | +- The intercept is indicated by `1`. |
| 21 | +- `y ~ 1 + (a + b) * c` is read as: |
| 22 | + - The response variable is `y`. |
| 23 | + - The model contains an intercept. |
| 24 | + - The model contains main effects of `a`, `b`, and `c`. |
| 25 | + - The model contains interactions between `a` and `c` and between `b` and `c` but not `a` and `b`. |
| 26 | +- An intercept is included by default, i.e. there is an implicit `1 + ` in every formula. The intercept may be suppressed by including a `0 + ` in the formula. (In contrast to R, the use of `-1` is **not** supported.) |
| 27 | + |
| 28 | +### MixedModels.jl provided extensions |
| 29 | + |
| 30 | +- The pipe operator (`|`) indicates grouping or blocking. |
| 31 | +- `(1 + a | subject)` indicates "by-subject random effects for the intercept and main effect `a`". |
| 32 | +- This is in line with the usual statistical reading of `|` as "conditional on". |
| 33 | + |
| 34 | +### RegressionFormulae.jl provided extensions |
| 35 | + |
| 36 | +- "Exponentiation" (`^`) works like repeated multiplication and generates all multiplicative and additive terms up to the given order. |
| 37 | + - `(a + b + c)^2` generates `a + b + c + a&b + a&c + b&c`, but not `a&b&c`. |
| 38 | + - The presence of interaction terms within the base will result in redundant terms and is currently unsupported. |
| 39 | +- `fulldummy(a)` assigns "contrasts" to `a` that include all indicator columns (dummy variables) and an intercept column. The resulting overparameterization is generally useful in the fixed effects only as part of nesting. |
| 40 | +- The slash operator (`/`) indicates nesting: |
| 41 | + - `a / b` is read as "`b` is nested within `a`". |
| 42 | + - `a / b` expands to `a + fulldummy(a) & b`. |
| 43 | +- It is generally not necessary to specify nesting in the blocking variables, when the inner levels are unique across outer levels. In other words, in a study with children (`C1`, `C2`, etc. ) nested within schools (`S1`, `S2`, etc.), |
| 44 | + - it is not necessary to specify the nesting when `C1` identifies a unique child across schools. In other words, intercept-only random effects terms can be written as `(1|C) + `(1|S)`. |
| 45 | + - it is necessary to specify the nesting when chid identifiers are re-used across schools, e.g. `C1` refers to a child in `S1` and a different child in `S2`. In this case, the nested syntax `(1|S/C)` expands to `(1|S) + (1|S&C)`. The interaction term in the second blocking variable generates unique labels for each child across schools. |
| 46 | + |
| 47 | + |
| 48 | + |
| 49 | +## Mixed models in Wilkinson-Rogers and mathematical notation |
| 50 | + |
| 51 | +Models fit with MixedModels.jl are generally linear mixed-effects models with unconstrained random effects covariance matrices and homoskedastic, normally distributed residuals. |
| 52 | +Under these assumptions, the model specification |
| 53 | + |
| 54 | +`response ~ 1 + (age + sex) * education * n_children + (1 | subject)` |
| 55 | + |
| 56 | +corresponds to the statistical model |
| 57 | + |
| 58 | +```math |
| 59 | +\begin{align*} |
| 60 | +\left(Y |\mathcal{B}=b\right) &\sim N\left(X\beta + Zb, \sigma^2 I \right) \\ |
| 61 | +\mathcal{B} &\sim N\left(0, G\right) |
| 62 | +\end{align*} |
| 63 | +``` |
| 64 | + |
| 65 | +for which we wish to obtain the maximum-likelihood estimates for ``G`` and thus the fixed-effects ``\beta``. |
| 66 | + |
| 67 | +- The model contains no restrictions on ``G``, except that it is positive semidefinite. |
| 68 | +- The response ``Y`` is the value of a given response. |
| 69 | +- The fixed-effects design matrix ``X`` consists of columns for |
| 70 | + - the intercept, age, sex, education, and number of children (contrast coded as appropriate) |
| 71 | + - the interaction of all lower order terms, excluding interactions between age and sex |
| 72 | +- The random-effects design matrix ``Z`` includes a column for |
| 73 | + - the intercept for each subject |
0 commit comments