Day 4 Linear models and ANOVA shells applied to the more basic experimental designs
June 12th, 2025
4.1 Review
- Mindmap of the course, designed experiments, and the reason behind all these analyses.
- The golden rules to design experiments:
- Replication
- Randomization
- Local control (blocking)
- A good set of steps to analyze data that is handed to us:
- What are the treatment factors?
- What is the experimental unit? Is it the same as the observational unit?
- How were the treatments applied? (What is the blueprint of the design/underlying structure?)
- Building the statistical model:
- Deterministic component
- Random component (probability distribution)
- Estimation Method
4.2 Linear models
4.2.1 The most common model - Assumptions
Common assumptions behind the default in most software:
- Constant variance
- Independence (this is what is affected by the design!)
- Normality
We can describe the general linear model as \[\begin{equation} y_{ij} = \mu + \tau_i + \varepsilon_{ij}, \end{equation}\] \[\begin{equation} \varepsilon_{ij} \sim N(0, \sigma^2), \end{equation}\] where \(y_{ij}\) is the \(j\)th observation of the \(i\)th treatment, \(\mu\) is the overall mean, \(\tau_i\) is the treatment effect of the \(i\)th treatment, and \(\varepsilon_{ij}\) is the residual for the \(j\)th observation of the \(i\)th treatment (i.e., the difference between observed and predicted).
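To make the pieces of the model above concrete, here is a minimal simulation in base R (all values hypothetical: 3 treatments, 4 observations each, \(\mu = 10\), \(\sigma = 1\)) showing how `lm()` recovers each component:

```r
# A minimal simulation of y_ij = mu + tau_i + eps_ij
# (hypothetical values: 3 treatments, 4 observations each)
set.seed(42)
mu  <- 10
tau <- c(-2, 0, 2)                        # treatment effects
dat <- data.frame(trt = factor(rep(1:3, each = 4)))
dat$y <- mu + tau[as.integer(dat$trt)] +  # deterministic component
         rnorm(nrow(dat), sd = 1)         # random component, N(0, sigma^2)

m <- lm(y ~ trt, data = dat)
coef(m)        # estimates of the model parameters
residuals(m)   # estimates of eps_ij (observed - predicted)
```

Note that how the fitted coefficients map to \(\mu\) and the \(\tau_i\) depends on the contrast coding in effect (see the `options(contrasts = ...)` call used later in these notes).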
4.2.2 Connection between this and your classical ANOVA table
- Recall that under Maximum Likelihood Estimation (MLE) and assuming a normal distribution, the estimates for MLE and Least Squares Estimation (LSE) are equivalent: \(\hat\beta_{MLE} =\hat\beta_{LSE}\).
- Least Squares means that the estimates (\(\hat\beta\)) are the ones that minimize the sum of squared residuals, \(\sum_{i=1}^n (y_i-\hat{y}_i)^2\).
- The SS can be divided into the ones explained by the model and the ones not explained by the model \(SS_{total} = SS_{model} + SS_{error}\).
- The SS can also be partitioned into components according to their source of variability.
- The different SS are then used to get an \(F\) value used in the analysis of variance.
- The SS can also be used to estimate \(\sigma^2\):
- \(\hat{\sigma}^2 = \frac{\sum_{i=1}^n (y_i-\hat{y}_i)^2}{df_e}=\frac{SS_{error}}{df_e}\), where \(df_e\) is the error (residual) degrees of freedom and \(SS_{error}\) (also written \(SSR\), the sum of squared residuals) is the unexplained SS.
- How does \(\sigma^2\) affect inference?
- Confidence intervals: \(CI_{95\%} = \hat{\beta_j}\pm t_{df_e,\,1-\alpha/2} \cdot \widehat{s.e.}(\hat\beta_j)\)
- \(\widehat{s.e.}(\hat\beta_j) = \sqrt{\frac{\hat{\sigma}^2}{s^2_x (n-1)}}\), where \(s^2_x\) is the sample variance of the predictor, so \(s^2_x(n-1)=\sum_i (x_i-\bar{x})^2\).
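As a numerical check on the formulas above, a small sketch with simulated regression data (hypothetical `x` and `y`) confirming that \(SS_{error}/df_e\) reproduces `lm()`'s residual variance and that the hand-built \(t\)-interval matches `confint()`:

```r
# Toy regression (hypothetical data) to verify the formulas above
set.seed(1)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20)
m <- lm(y ~ x)

# sigma^2 estimate: SS_error / df_e
ss_error  <- sum(residuals(m)^2)
df_e      <- df.residual(m)
sigma2hat <- ss_error / df_e
all.equal(sigma2hat, sigma(m)^2)     # TRUE: matches lm's residual variance

# 95% CI for the slope from beta_hat +/- t * se, vs. confint()
se   <- sqrt(sigma2hat / (var(x) * (length(x) - 1)))
beta <- coef(m)[["x"]]
c(beta - qt(0.975, df_e) * se, beta + qt(0.975, df_e) * se)
confint(m, "x")                      # same interval
```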
4.2.3 Sorghum example
The data below were generated by an experiment comparing sorghum genotypes (Omer et al., 2015). The data presented here come from a randomized complete block design (the design structure) used to study the different genotypes. Remember that blocks are assumed to be approximately homogeneous within.
Back to the code we worked on yesterday.
Check out the code from class here.
4.3 Takehomes
Using the sorghum study, we demonstrated that if the true underlying process was actually represented by blocks (i.e., disjoint areas in the field), then leaving the blocks out of the model sends their share of the variability into the error term, inflating \(\hat{\sigma}^2\), widening confidence intervals, and reducing our ability to separate treatment means.
library(tidyverse)
library(emmeans)
library(agridat)
library(multcomp)
# load data and keep a single environment
data("omer.sorghum")
df <- omer.sorghum %>% filter(env == "E3")
# sum-to-zero contrasts for the factor effects
options(contrasts = c("contr.sum", "contr.poly"))
# model without the blocking factor vs. model including it
m_without <- lm(yield ~ gen, data = df)
m_with <- lm(yield ~ gen + rep, data = df)
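A direct way to see where the block variability goes is to compare the ANOVA tables of the two fits above (`m_with` and `m_without`); because the RCBD is balanced (and hence `gen` and `rep` are orthogonal), the residual SS of the no-block model is exactly the `rep` SS plus the residual SS of the blocked model:

```r
# Using the two models fitted above: compare their ANOVA tables
anova(m_with)      # sources: gen, rep, Residuals
anova(m_without)   # sources: gen, Residuals only

# The block SS moves straight into the error when rep is dropped:
a_full <- anova(m_with)
a_red  <- anova(m_without)
a_red["Residuals", "Sum Sq"]
a_full["rep", "Sum Sq"] + a_full["Residuals", "Sum Sq"]   # same number
```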
4.3.1 If you don’t include the design elements (blocks), their portion of the variance goes to the error
df.residual(m_with)
## [1] 51
sigma(m_with)
## [1] 160.0855
df.residual(m_without)
## [1] 54
sigma(m_without)
## [1] 168.9182
means_without <- emmeans(m_without, ~ gen)
means_with <- emmeans(m_with, ~ gen)
cld(means_without, method = "sidak", Letters = letters)
## gen emmean SE df lower.CL upper.CL .group
## G17 285 84.5 54 116 455 a
## G06 315 84.5 54 145 484 ab
## G02 515 84.5 54 346 685 abc
## G12 539 84.5 54 370 708 abc
## G01 582 84.5 54 413 751 abc
## G14 583 84.5 54 414 752 abc
## G16 605 84.5 54 435 774 abcd
## G05 643 84.5 54 474 812 abcd
## G15 647 84.5 54 477 816 abcd
## G11 728 84.5 54 559 897 bcd
## G10 739 84.5 54 569 908 bcd
## G08 741 84.5 54 572 911 bcd
## G04 749 84.5 54 580 919 bcd
## G03 805 84.5 54 636 974 cd
## G09 813 84.5 54 644 982 cd
## G13 825 84.5 54 656 994 cd
## G07 937 84.5 54 768 1106 cd
## G18 1030 84.5 54 861 1199 d
##
## Confidence level used: 0.95
## P value adjustment: tukey method for comparing a family of 18 estimates
## significance level used: alpha = 0.05
## NOTE: If two or more means share the same grouping symbol,
## then we cannot show them to be different.
## But we also did not show them to be the same.
cld(means_with, method = "sidak", Letters = letters)
## gen emmean SE df lower.CL upper.CL .group
## G17 285 80 51 125 446 a
## G06 315 80 51 154 475 ab
## G02 515 80 51 355 676 abc
## G12 539 80 51 378 700 abcd
## G01 582 80 51 421 743 abcd
## G14 583 80 51 422 744 abcd
## G16 605 80 51 444 765 abcd
## G05 643 80 51 482 804 abcde
## G15 647 80 51 486 807 abcde
## G11 728 80 51 567 889 bcde
## G10 739 80 51 578 899 cde
## G08 741 80 51 581 902 cde
## G04 749 80 51 589 910 cde
## G03 805 80 51 644 966 cde
## G09 813 80 51 652 974 cde
## G13 825 80 51 664 986 cde
## G07 937 80 51 776 1098 de
## G18 1030 80 51 869 1191 e
##
## Results are averaged over the levels of: rep
## Confidence level used: 0.95
## P value adjustment: tukey method for comparing a family of 18 estimates
## significance level used: alpha = 0.05
## NOTE: If two or more means share the same grouping symbol,
## then we cannot show them to be different.
## But we also did not show them to be the same.
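A quick sanity check on the two tables: each genotype mean is an average of \(r = 4\) replicates, so its standard error is \(\hat{\sigma}/\sqrt{r}\), which reproduces the 84.5 and 80 shown above from the residual SEs reported earlier:

```r
# SEs in the tables above are sigma_hat / sqrt(r), with r = 4 reps
168.9182 / sqrt(4)   # ~84.5 (model without blocks)
160.0855 / sqrt(4)   # ~80.0 (model with blocks)
```

The gap between the two is the inferential price of ignoring the blocks: the same data, but a noisier error estimate and wider intervals.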