To use the ANOVA test we make the following assumptions:
- Each group sample is drawn from a normally distributed population
- All populations have a common variance
- All samples are drawn independently of each other
- Within each sample, the observations are sampled randomly and independently of each other
- Factor effects are additive
The presence of outliers can also cause problems. In addition, we need to make sure that the F statistic is well behaved. In particular, the F statistic is relatively robust to violations of normality provided:
- The populations are symmetrical and uni-modal
- The sample sizes for the groups are equal and greater than 10
In general, as long as the sample sizes are equal (called a balanced model) and sufficiently large, the normality assumption can be violated provided the samples are symmetrical or at least similar in shape (e.g. all are negatively skewed).
The F statistic is not so robust to violations of homogeneity of variances. A rule of thumb for balanced models is that if the ratio of the largest variance to smallest variance is less than 3 or 4, the F-test will be valid. If the sample sizes are unequal then smaller differences in variances can invalidate the F-test. Much more attention needs to be paid to unequal variances than to non-normality of data.
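The rule of thumb above is easy to check directly. The sketch below computes the ratio of the largest to smallest group variance for a balanced design; the group data are made up for illustration.

```python
# Sketch: checking the homogeneity-of-variance rule of thumb for a
# balanced design. The group data below are made up for illustration.
groups = {
    "g1": [23.1, 24.8, 22.5, 25.0, 23.9],
    "g2": [26.2, 24.1, 25.5, 27.0, 25.8],
    "g3": [22.0, 23.3, 21.8, 24.5, 22.9],
}

def variance(xs):
    """Unbiased sample variance (n - 1 denominator)."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / (n - 1)

variances = {name: variance(xs) for name, xs in groups.items()}
ratio = max(variances.values()) / min(variances.values())

# Rule of thumb: for a balanced model the F-test is usually still valid
# when this ratio is below roughly 3 or 4.
print(f"largest/smallest variance ratio: {ratio:.2f}")
print("rule of thumb satisfied:", ratio < 3)
```

For unbalanced designs, or ratios near the cutoff, a formal test of homogeneity (e.g. Levene's test) is a safer check than this rule of thumb.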
We now look at how to test for violations of these assumptions and how to deal with any violations when they occur.
In statistics, one-way analysis of variance (abbreviated one-way ANOVA) is a technique that can be used to compare means of two or more samples (using the F distribution). This technique can be used only for numerical response data, the "Y", usually one variable, and numerical or (usually) categorical input data, the "X", always one variable, hence "one-way".
The ANOVA tests the null hypothesis that samples in all groups are drawn from populations with the same mean values. To do this, two estimates are made of the population variance. These estimates rely on various assumptions (see below). The ANOVA produces an F-statistic, the ratio of the variance calculated among the means to the variance within the samples. If the group means are drawn from populations with the same mean values, the variance between the group means should be lower than the variance of the samples, following the central limit theorem. A higher ratio therefore implies that the samples were drawn from populations with different mean values.
Typically, however, the one-way ANOVA is used to test for differences among at least three groups, since the two-group case can be covered by a t-test (Gosset, 1908). When there are only two means to compare, the t-test and the F-test are equivalent; the relation between ANOVA and t is given by F = t². An extension of one-way ANOVA is two-way analysis of variance, which examines the influence of two different categorical independent variables on one dependent variable.
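The relation F = t² can be verified numerically. The sketch below computes the pooled two-sample t statistic and the one-way ANOVA F statistic by hand for the same two groups; the data are made up for illustration.

```python
# Sketch verifying F = t^2 for two groups: the pooled two-sample
# t statistic and the one-way ANOVA F statistic, computed by hand.
# The data are made up for illustration.
a = [5.1, 4.9, 6.2, 5.8, 5.5]
b = [6.8, 7.1, 6.4, 7.5, 6.9]

def mean(xs):
    return sum(xs) / len(xs)

def ss(xs):
    """Sum of squared deviations from the group mean."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs)

na, nb = len(a), len(b)
grand = mean(a + b)

# Pooled two-sample t statistic (equal variances assumed).
sp2 = (ss(a) + ss(b)) / (na + nb - 2)
t = (mean(a) - mean(b)) / (sp2 * (1 / na + 1 / nb)) ** 0.5

# One-way ANOVA F statistic for the same two groups.
ss_between = na * (mean(a) - grand) ** 2 + nb * (mean(b) - grand) ** 2
ss_within = ss(a) + ss(b)
F = (ss_between / 1) / (ss_within / (na + nb - 2))

print(f"t^2 = {t**2:.4f}, F = {F:.4f}")  # the two agree
```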
The results of a one-way ANOVA can be considered reliable as long as the assumptions listed above (normality, equal variances, independent random samples) are met.
If data are ordinal, a non-parametric alternative to this test should be used, such as the Kruskal–Wallis one-way analysis of variance. If the variances are not known to be equal, a generalization of the two-sample Welch's t-test can be used.
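As a sketch of the non-parametric route, the Kruskal–Wallis test compares groups by ranks rather than means, so it can be applied to ordinal scores. The ratings below are made-up values on a 1–10 scale.

```python
# Sketch: when the data are ordinal (or normality is doubtful), the
# Kruskal-Wallis test compares groups by ranks instead of means.
# The scores below are made-up ordinal ratings on a 1-10 scale.
from scipy import stats

g1 = [3, 4, 2, 5, 4, 3]
g2 = [6, 7, 5, 8, 6, 7]
g3 = [4, 5, 5, 6, 4, 5]

stat, p = stats.kruskal(g1, g2, g3)
print(f"H = {stat:.3f}, p = {p:.4f}")
```

A small p-value here indicates that at least one group's distribution is shifted relative to the others, without assuming normality.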
Departures from population normality
ANOVA is a relatively robust procedure with respect to violations of the normality assumption.
The one-way ANOVA can be generalized to the factorial and multivariate layouts, as well as to the analysis of covariance.
It is often stated in popular literature that none of these F-tests are robust when there are severe violations of the assumption that each population follows the normal distribution, particularly for small alpha levels and unbalanced layouts. Furthermore, it is also claimed that if the underlying assumption of homoscedasticity is violated, the Type I error properties degenerate much more severely.
However, this is a misconception, based on work done in the 1950s and earlier. The first comprehensive investigation of the issue by Monte Carlo simulation was Donaldson (1966). He showed that under the usual departures (positive skew, unequal variances) "the F-test is conservative" so is less likely than it should be to find that a variable is significant. However, as either the sample size or the number of cells increases, "the power curves seem to converge to that based on the normal distribution". Tiku (1971) found that "the non-normal theory power of F is found to differ from the normal theory power by a correction term which decreases sharply with increasing sample size." The problem of non-normality, especially in large samples, is far less serious than popular articles would suggest.
The current view is that "Monte-Carlo studies were used extensively with normal distribution-based tests to determine how sensitive they are to violations of the assumption of normal distribution of the analyzed variables in the population. The general conclusion from these studies is that the consequences of such violations are less severe than previously thought. Although these conclusions should not entirely discourage anyone from being concerned about the normality assumption, they have increased the overall popularity of the distribution-dependent statistical tests in all areas of research."
For nonparametric alternatives in the factorial layout, see Sawilowsky. For more discussion see ANOVA on ranks.
The case of fixed effects, fully randomized experiment, unbalanced data
The normal linear model describes treatment groups with probability distributions which are identically bell-shaped (normal) curves with different means. Thus fitting the models requires only the means of each treatment group and a variance calculation (an average variance within the treatment groups is used). Calculations of the means and the variance are performed as part of the hypothesis test.
The commonly used normal linear models for a completely randomized experiment are:
- y_ij = μ_j + ε_ij (the means model)
- y_ij = μ + τ_j + ε_ij (the effects model)
where
- i is an index over experimental units
- j is an index over treatment groups
- I_j is the number of experimental units in the jth treatment group
- I = Σ_j I_j is the total number of experimental units
- y_ij are observations
- μ_j is the mean of the observations for the jth treatment group
- μ is the grand mean of the observations
- τ_j is the jth treatment effect, a deviation from the grand mean (so μ_j = μ + τ_j and Σ_j τ_j = 0)
- ε_ij ∼ N(0, σ²) are normally distributed zero-mean random errors
The index i over the experimental units can be interpreted several ways. In some experiments, the same experimental unit is subject to a range of treatments; i may point to a particular unit. In others, each treatment group has a distinct set of experimental units; i may simply be an index into the jth list.
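The effects model above can be simulated directly, which makes the role of each term concrete. The parameter values below (grand mean, treatment effects, error standard deviation) are made up for illustration.

```python
# Sketch: simulating data from the effects model
#   y_ij = mu + tau_j + eps_ij
# with made-up parameter values.
import random

random.seed(0)

mu = 10.0                      # grand mean
tau = [-1.5, 0.5, 1.0]         # treatment effects, summing to zero
sigma = 2.0                    # error standard deviation
n_per_group = 5

# data[j] holds the simulated observations y_ij for treatment group j.
data = {
    j: [mu + tau[j] + random.gauss(0.0, sigma) for _ in range(n_per_group)]
    for j in range(len(tau))
}

for j, ys in data.items():
    print(f"group {j}: mu + tau_j = {mu + tau[j]:.1f}, "
          f"sample mean = {sum(ys) / len(ys):.2f}")
```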
The data and statistical summaries of the data
One form of organizing experimental observations is with groups in columns:
Comparing model to summaries: μ = m and μ_j = m_j. The grand mean and grand variance are computed from the grand sums, not from group means and variances.
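The distinction matters when group sizes differ: averaging group means does not give the grand mean unless the design is balanced. A small numeric sketch, with made-up unequal groups:

```python
# Sketch: with unequal group sizes, the grand mean must come from the
# grand sum, not from averaging the group means. The data are made up.
groups = [[4.0, 6.0], [10.0, 12.0, 14.0, 16.0]]

grand_sum = sum(sum(g) for g in groups)
grand_n = sum(len(g) for g in groups)
grand_mean = grand_sum / grand_n          # correct: 62 / 6

# Averaging the group means weights each group equally, regardless of size.
mean_of_means = sum(sum(g) / len(g) for g in groups) / len(groups)

print(grand_mean)      # 10.333...
print(mean_of_means)   # 9.0 -- not the grand mean here
```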
The hypothesis test
Given the summary statistics, the calculations of the hypothesis test are shown in tabular form. While two columns of SS are shown for their explanatory value, only one column is required to display results.
MS_Error is the estimate of variance corresponding to σ² of the model.
The core ANOVA analysis consists of a series of calculations. The data is collected in tabular form. Then
- Each treatment group is summarized by the number of experimental units, two sums, a mean and a variance. The treatment group summaries are combined to provide totals for the number of units and the sums. The grand mean and grand variance are computed from the grand sums. The treatment and grand means are used in the model.
- The three DFs and SSs are calculated from the summaries. Then the MSs are calculated and a ratio determines F.
- A computer typically determines a p-value from F which determines whether treatments produce significantly different results. If the result is significant, then the model provisionally has validity.
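The steps above can be carried out by hand from the group summaries. The sketch below builds the ANOVA table (SS, DF, MS, F) from made-up data, leaving only the p-value to a library, since that step needs the F distribution.

```python
# Sketch: the ANOVA table computed from group summaries, following the
# steps above. The data are made up; scipy's F survival function
# supplies the p-value from F.
from scipy import stats

groups = [
    [6.0, 8.0, 4.0, 5.0],
    [8.0, 12.0, 9.0, 11.0],
    [13.0, 9.0, 11.0, 8.0],
]

k = len(groups)                      # number of treatment groups
n = sum(len(g) for g in groups)      # total experimental units
grand_mean = sum(sum(g) for g in groups) / n

# Treatment (between-group) and error (within-group) sums of squares.
ss_treatment = sum(
    len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
)
ss_error = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

df_treatment, df_error = k - 1, n - k
ms_treatment = ss_treatment / df_treatment
ms_error = ss_error / df_error
F = ms_treatment / ms_error
p = stats.f.sf(F, df_treatment, df_error)

print(f"SS_trt={ss_treatment:.2f} SS_err={ss_error:.2f} "
      f"F={F:.3f} p={p:.4f}")
```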
If the experiment is balanced, all of the I_j terms are equal so the SS equations simplify.
In a more complex experiment, where the experimental units (or environmental effects) are not homogeneous, row statistics are also used in the analysis. The model includes terms dependent on i. Determining the extra terms reduces the number of degrees of freedom available.
Consider an experiment to study the effect of three different levels of a factor on a response (e.g. three levels of a fertilizer on plant growth). If we had 6 observations for each level, we could write the outcome of the experiment in a table like this, where a1, a2, and a3 are the three levels of the factor being studied.
a1  a2  a3
 6   8  13
 8  12   9
 4   9  11
 5