9 Varying-intercept(s)

9.1 Objectives

Introduce multilevel structures
Introduce varying intercept models

library(pacman)

p_load(tidyverse, here, janitor, purrr, viridis, brms, tidybayes, bayesplot,
       modelr, forcats)

theme_set(theme_bw())

9.2 Reading

Chapter 12 “Multilevel Models” from Statistical Rethinking by Richard McElreath.

Motivation

Dummy variables derived multistate categorical variables force our model to consider the different states independently – the logically related parameters are statistically independent.

But we want models that use all of the information given, like that certain states are all from that category. And that inferences about one state can inform inferences about another.

This does not mean treating all states as identical. Instead we want to understand the average state and the population of states.

Simultaneously inference means we can transfer information between states and improve our inferences.

Population of states. States are not identical, but they are from the same population. The distribution for the population of states is the prior for each state. Except this prior isn’t exactly the same as we’re used to. Instead, this prior is actually learned from the data.

A parameter for each state as well as a mean and sd for the population. Every observation updates all of these parameters. If the population is highly variable, then the prior is flat and uninformative, which means that individual states have very little to do with each other. if the population contains little variation, then the prior is narrow and highly informative – an observation about 1 state will have a big impact on the estiamtes of of another state.

Multilevel models.

Models that remember features of each (identified) cluster in the data as they learn about all of the clusters. The amount of variation among the clusters determines how much the model pools information across clusters and improving our overall estimates.

Pragmatic benefits of multilevel modeling

Improved estimates for repeat sampling. when more than one observation arises from the same individual, location, or time, then traditional single-level models either maximally underfit or overfit the data. By identifying and modeling that repeated sampling, we can improve our estimates of the variable being measured and propegate our uncertainty about that parameter.
Improved estimates for imbalance in sampling.
Estimates of variation. If our research questions include variation among individuals or other groups within the data, then multilevel models are a bigh help because they model variation explicity. Dummy coding does not include the variation between the states, which is potentially very useful information.
Avoid averaging, retatin variation. Too often people just summarize repeated sampling by pre-averaging some data to construct their variables. This can be dangerous because averaging removes variation. Also, there are lots of ways to average data. Averaging manufactures false confidence and introduces an arbitrary data transform. Multilevel models allow us to preserve our uncertainty and avoid unnecessary and unprincipled data transforms.

Multilevel regression should be the default approach to regression. In the vast majority of cases, multilevel models are superior single-level models.

Once you grasp the basics of the multilevel modeling strategy it becomes much easier to incorporate related tricks such as allowing for measurement error in the data and even modeling missing data!

Defining, implementing, and understanding a multilevel model is in general more complicated than a single-level model.

We have to define the distribution of our states – something that sounds a lot harder than it is.

Also, multilevel models are a lot harder to fit than most single-level models. This means more time tuning our MCMC algorithm, something we’ve so far been ableo to avoid.

Finally, because multilevel models by their nature make statements at different “levels” of the data, they can be difficult for people to understanding. People rarely think about a population and the individual values within that population simultaneously.

multilevel == hierachical == mixed effects

Simpson’s Paradox

Dummy coding versus varying-intercepts. varying intercepts reframe problem. now the intercept is “universal” and the codes are their divergence from the mean.

In both cases each group gets its “own intercept” but how that value is calculated differs. Also, in varying intercept models we can put a prior on the variance between the groups.

ARM: with grouped data, a regression that includes indicators for groups is called a varying-intercept model because it can be interepreted as a model with a different intercept for each group.

\[ \begin{align} y_{i} &= \text{Normal}(\alpha_{j[i]} + \beta x_{i}, \sigma) \end{align} \]

Clustered data

pantheria <- read_tsv(here('data', 'PanTHERIA_1-0_WR05_Aug2008.txt'), 
                      na = '-999.00') %>%
  clean_names() %>% 
  rename(order = msw05_order,
         family = msw05_family,
         genus = msw05_genus,
         species = msw05_species,
         binomial = msw05_binomial) %>%
  mutate(mass_log = log(x5_1_adult_body_mass_g),
         range_group_log = log(x22_1_home_range_km2),
         range_indiv_log = log(x22_2_home_range_indiv_km2),
         density_log = log(x21_1_population_density_n_km2),
         activity_cycle = case_when(x1_1_activity_cycle == 1 ~ 'nocturnal',
                                    x1_1_activity_cycle == 2 ~ 'mixed',
                                    x1_1_activity_cycle == 3 ~ 'diurnal'),
         trophic_level = case_when(x6_2_trophic_level == 1 ~ 'herbivore',
                                   x6_2_trophic_level == 2 ~ 'omnivore',
                                   x6_2_trophic_level == 3 ~ 'carnivore')) %>%
  drop_na(density_log, mass_log, trophic_level, order, family, genus) %>%
  mutate(mass_log_stan = (mass_log - mean(mass_log)) / (2 * sd(mass_log)),
         trophic_level = fct_infreq(trophic_level))

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   MSW05_Order = col_character(),
##   MSW05_Family = col_character(),
##   MSW05_Genus = col_character(),
##   MSW05_Species = col_character(),
##   MSW05_Binomial = col_character(),
##   References = col_character()
## )

## See spec(...) for full column specifications.

m_1 <- pantheria %>%
  brm(data = .,
      family = gaussian(),
      formula = bf(density_log ~ 1 + mass_log_stan + trophic_level + (1 | order)),
      prior = c(prior(normal(0, 10), class = 'Intercept'),
                prior(normal(0, 5), class = 'b'),
                prior(normal(-1, 5), class = 'b', coef = 'mass_log_stan'),
                prior(cauchy(0, 5), class = 'sigma'),
                prior(cauchy(0, 1), class = 'sd', group = 'order')),
      iter = 2000,
      warmup = 1000,
      chains = 4,
      cores = 4,
      refresh = 0)

Repeated measurements