Day 4 Linear models, ANOVA shells applied to the more basic experimental designs

June 12th, 2025

4.1 Review

  • Mindmap of the course, designed experiments, and the reason behind all these analyses.
  • The golden rules to design experiments:
    • Replication
    • Randomization
    • Local control (blocking)
  • A good set of steps to analyze data that is handed to us:
    • What are the treatment factors?
    • What is the experimental unit? Is it the same as the observational unit?
    • How were the treatments applied? (What is the blueprint of the design/underlying structure?)
  • Building the statistical model:
    • Deterministic component
    • Random component (probability distribution)
    • Estimation Method

4.2 Linear models

4.2.1 The most common model - Assumptions

Common assumptions behind the default in most software:

  • Constant variance
  • Independence: this is what is affected by the design!
  • Normality

We can describe the general linear model as \[\begin{equation} y_{ij} = \mu + \tau_i + \varepsilon_{ij}, \end{equation}\] \[\begin{equation} \varepsilon_{ij} \sim N(0, \sigma^2), \end{equation}\] where \(y_{ij}\) is the \(j\)th observation of the \(i\)th treatment, \(\mu\) is the overall mean, \(\tau_i\) is the treatment effect of the \(i\)th treatment, and \(\varepsilon_{ij}\) is the residual for the \(j\)th observation of the \(i\)th treatment (i.e., the difference between observed and predicted).
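
As a quick illustration, here is a minimal R sketch that simulates data from this model and fits it with lm(). The values of \(\mu\), \(\tau_i\), and \(\sigma\), and the objects sim and fit, are made up for the example.

set.seed(42)
# one-way model: y_ij = mu + tau_i + e_ij, with e_ij ~ N(0, sigma^2)
mu    <- 50                        # overall mean (made up)
tau   <- c(A = -5, B = 0, C = 5)   # treatment effects (made up)
sigma <- 3                         # residual standard deviation (made up)
n_rep <- 4                         # observations per treatment

sim <- data.frame(trt = rep(names(tau), each = n_rep))
sim$y <- mu + tau[sim$trt] + rnorm(nrow(sim), sd = sigma)

fit <- lm(y ~ trt, data = sim)   # deterministic component: mu + tau_i
anova(fit)                       # random component drives the residual SS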

4.2.2 Connection between this and your classical ANOVA table

  • Recall that under Maximum Likelihood Estimation (MLE) and assuming a normal distribution, the estimates for MLE and Least Squares Estimation (LSE) are equivalent: \(\hat\beta_{MLE} =\hat\beta_{LSE}\).
  • Least Squares means that the estimates (\(\hat\beta\)) are the ones that minimize the sum of squared residuals, \(\sum_{i=1}^n (y_i-\hat{y}_i)^2\).
  • The total SS can be partitioned into the part explained by the model and the part left unexplained: \(SS_{total} = SS_{model} + SS_{error}\).
  • The SS can also be split into batches according to their source of variability (e.g., treatments, blocks).
  • The different SS are then used to get an \(F\) value used in the analysis of variance.
  • The SS can also be used to estimate \(\sigma^2\):
    • \(\hat\sigma^2 = \frac{\sum_{i=1}^n (y_i-\hat{y}_i)^2}{df_e} = \frac{SS_{error}}{df_e}\), i.e., the residual mean square (sometimes written \(SSR/df_e\))
  • How does \(\sigma^2\) affect inference?
    • Confidence intervals: \(CI_{95\%} = \hat\beta_j \pm t_{df_e,\, 1-\alpha/2} \cdot \widehat{s.e.}(\hat\beta_j)\) (verified in the sketch after this list)
    • \(\widehat{s.e.}(\hat\beta_j) = \sqrt{\frac{\widehat{\sigma^2}}{s^2_x (n-1)}}\), where \(s^2_x (n-1) = \sum_{i}(x_i-\bar{x})^2\); a smaller \(\widehat{\sigma^2}\) therefore means smaller standard errors and narrower intervals
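
These quantities are easy to verify by hand. A small sketch, reusing the simulated fit object from above (any lm object would work the same way):

df_e       <- fit$df.residual
sigma2_hat <- sum(residuals(fit)^2) / df_e   # SS_error / df_e
sqrt(sigma2_hat)                             # same as summary(fit)$sigma

# 95% CI for the second coefficient, built from the formula above
beta_hat <- coef(fit)[2]
se_hat   <- summary(fit)$coefficients[2, "Std. Error"]
beta_hat + c(-1, 1) * qt(1 - 0.05/2, df_e) * se_hat  # matches confint(fit)[2, ]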

4.2.3 Sorghum example

The data below were generated by an experiment comparing sorghum genotypes (Omer et al., 2015). The data presented here correspond to a randomized complete block design (the design structure) used to compare the genotypes. Remember that blocks are assumed to be approximately homogeneous within.

Back to the code we worked on yesterday.

Check out the code from class here.

4.3 Takehomes

Using the sorghum study, we demonstrated that if the true underlying process was actually shaped by blocks (i.e., disjoint areas in the field), leaving them out of the model sends their portion of the variance into the error term, as shown below.

library(tidyverse)
library(emmeans)
library(agridat)
library(multcomp)

# load data and keep a single environment
data("omer.sorghum")
df <- omer.sorghum %>% filter(env == "E3")

# use sum-to-zero contrasts for factors
options(contrasts = c("contr.sum", "contr.poly"))
# model ignoring the blocks (reps)
m_without <- lm(yield ~ gen, data = df)
# model including the blocks (reps)
m_with <- lm(yield ~ gen + rep, data = df)
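
To see the SS partition from Section 4.2.2 in action, anova() prints the batches of SS by source of variability for each fit (output omitted here):

anova(m_without)  # SS_total = SS_gen + SS_error
anova(m_with)     # SS_total = SS_gen + SS_rep + SS_error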

4.3.1 If you don’t include the design elements (blocks), their portion of the variance goes to the error

# residual df and sigma, model with blocks
m_with$df.residual
## [1] 51
summary(m_with)$sigma
## [1] 160.0855
# residual df and sigma, model without blocks
m_without$df.residual
## [1] 54
summary(m_without)$sigma
## [1] 168.9182
means_without <- emmeans(m_without, ~ gen)  # means from the model ignoring blocks
means_with <- emmeans(m_with, ~ gen)        # means from the model with blocks

cld(means_without, Letters = letters)  # Tukey-adjusted comparisons (emmeans default)
##  gen emmean   SE df lower.CL upper.CL .group
##  G17    285 84.5 54      116      455  a    
##  G06    315 84.5 54      145      484  ab   
##  G02    515 84.5 54      346      685  abc  
##  G12    539 84.5 54      370      708  abc  
##  G01    582 84.5 54      413      751  abc  
##  G14    583 84.5 54      414      752  abc  
##  G16    605 84.5 54      435      774  abcd 
##  G05    643 84.5 54      474      812  abcd 
##  G15    647 84.5 54      477      816  abcd 
##  G11    728 84.5 54      559      897   bcd 
##  G10    739 84.5 54      569      908   bcd 
##  G08    741 84.5 54      572      911   bcd 
##  G04    749 84.5 54      580      919   bcd 
##  G03    805 84.5 54      636      974    cd 
##  G09    813 84.5 54      644      982    cd 
##  G13    825 84.5 54      656      994    cd 
##  G07    937 84.5 54      768     1106    cd 
##  G18   1030 84.5 54      861     1199     d 
## 
## Confidence level used: 0.95 
## P value adjustment: tukey method for comparing a family of 18 estimates 
## significance level used: alpha = 0.05 
## NOTE: If two or more means share the same grouping symbol,
##       then we cannot show them to be different.
##       But we also did not show them to be the same.
cld(means_with, Letters = letters)  # Tukey-adjusted comparisons (emmeans default)
##  gen emmean SE df lower.CL upper.CL .group
##  G17    285 80 51      125      446  a    
##  G06    315 80 51      154      475  ab   
##  G02    515 80 51      355      676  abc  
##  G12    539 80 51      378      700  abcd 
##  G01    582 80 51      421      743  abcd 
##  G14    583 80 51      422      744  abcd 
##  G16    605 80 51      444      765  abcd 
##  G05    643 80 51      482      804  abcde
##  G15    647 80 51      486      807  abcde
##  G11    728 80 51      567      889   bcde
##  G10    739 80 51      578      899    cde
##  G08    741 80 51      581      902    cde
##  G04    749 80 51      589      910    cde
##  G03    805 80 51      644      966    cde
##  G09    813 80 51      652      974    cde
##  G13    825 80 51      664      986    cde
##  G07    937 80 51      776     1098     de
##  G18   1030 80 51      869     1191      e
## 
## Results are averaged over the levels of: rep 
## Confidence level used: 0.95 
## P value adjustment: tukey method for comparing a family of 18 estimates 
## significance level used: alpha = 0.05 
## NOTE: If two or more means share the same grouping symbol,
##       then we cannot show them to be different.
##       But we also did not show them to be the same.
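
As a sanity check, the SEs in the two tables follow directly from each model's \(\hat\sigma\): the 54 error df without blocks imply 72 observations on 18 genotypes, i.e., \(r = 4\) replicates per genotype, and the SE of a genotype mean is \(\hat\sigma/\sqrt{r}\):

summary(m_without)$sigma / sqrt(4)  # ~84.5, the SE in the first table
summary(m_with)$sigma / sqrt(4)     # ~80.0, the SE in the second table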

4.4 Tomorrow

  • Kahoot for attendance