# Normal Distribution

## Testing distribution of your data

# Shapiro-Wilk normality test

The Shapiro-Wilk test tends to be quite powerful, so with large samples the test can be significant even when the scores deviate only slightly from a normal distribution.

→ Use histograms, Q-Q plots, and values of skewness/kurtosis to double-check.

The Shapiro-Wilk test found that the hygiene scores on Day 1 were significantly non-normal at the 5% level of significance (W=0.99591, p<0.05).

**However,** inspection of Q-Q plots and of the skewness and kurtosis values suggested that the data follow a normal distribution; we continued the analysis under the assumption of normality.

So when testing for normality, you should look at Q-Q plots, histograms, skewness and kurtosis, and the Shapiro-Wilk test.

When the sample is small, the Shapiro-Wilk test works better.
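As a minimal sketch in R (using simulated scores, since the festival data are not loaded here; a real column like `dlf$day1` would take the place of `scores`):

```r
set.seed(42)
scores <- rnorm(810)  # simulated stand-in for a real column of hygiene scores

# Shapiro-Wilk: the null hypothesis is normality, so a small p-value
# suggests a departure from a normal distribution
sw <- shapiro.test(scores)
print(sw)

# Always double-check visually
hist(scores, breaks = 30)
qqnorm(scores)
qqline(scores)
```

With simulated normal data, W should be very close to 1; with real data at this sample size, a significant p-value may still reflect only a trivial deviation, which is why the plots matter.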

# Homogeneity of variance

## Typical Assumptions

There are several common assumptions for parametric models and tests.

- Normally distributed data (this can refer to the model errors or to the sampling distribution)
- Homogeneity of variance (“homoscedasticity”)
- Interval data
- Independence

## Homoscedasticity

**Homogeneity of variance** or **homoscedasticity** means that the variance is the same across different groups. This is a common assumption of regression models and of tests like ANOVA. In regression models (i.e., with continuous predictor variables), we typically inspect graphs. With grouped data (categorical predictors), we use a test called “Levene’s Test”.

Left: the mean changes from group to group, but the variance (and so the standard deviation) is the same for each.

Right: the spread is tight in Brixton and the variance increases further to the right. These data are not homoscedastic, so you must use a different test.

## Testing Homoscedasticity

To test homoscedasticity, we can use Levene’s test from the **car** library.

Whoops, looks like it’s expecting dlf$day2 to be a categorical variable — we need this data to be in the right format.

**leveneTest()** expects data in long format.

One way to do this is via **dplyr**:
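A sketch of the wide-to-long reshape with toy data (assuming `pivot_longer()` from tidyr, the reshaping companion to dplyr, is installed; the column names here just mimic the festival data):

```r
library(tidyr)

# Toy wide data: one row per person, one hygiene score per day
dlf <- data.frame(day1 = c(2.6, 1.8, 3.1),
                  day2 = c(1.3, 0.9, 2.0))

# Reshape so each row is one (day, score) observation
long <- pivot_longer(dlf, cols = c(day1, day2),
                     names_to = "day", values_to = "score")
print(long)
```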

Data manipulation and pivoting are valuable for more than just tidying data!

Not quite: **day** is a **chr** (character), and leveneTest() needs a factor.
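A sketch of the fix with toy data: convert the character column to a factor, then run the test. The course uses `car::leveneTest()`; to keep this sketch dependency-free, the same default computation (the Brown-Forsythe variant: an ANOVA on absolute deviations from each group's median) is shown in base R.

```r
set.seed(1)
# Toy long-format data; day arrives as chr, as in the course example
long <- data.frame(day   = rep(c("day1", "day2"), each = 50),
                   score = c(rnorm(50, 2.0, 0.7), rnorm(50, 1.0, 0.7)),
                   stringsAsFactors = FALSE)

long$day <- as.factor(long$day)  # the fix: leveneTest() needs a factor

# What car::leveneTest(score ~ day, data = long) computes by default:
# an ANOVA on absolute deviations from each group's median
long$abs_dev <- abs(long$score - ave(long$score, long$day, FUN = median))
print(anova(lm(abs_dev ~ day, data = long)))
```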

“Levene’s Test for Homogeneity of Variance was not significant at the 5% level of significance (F = 0.3003, p = 0.7407). We thus proceeded with the assumption of homoscedasticity.”

Like the Shapiro-Wilk test, Levene’s test may find an effect when none really exists. The variance ratio (Hartley’s Fmax), which compares the largest group variance to the smallest, can be used to double-check a significant result (DSUR 5.7.3).
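Hartley's Fmax is simple enough to compute by hand; a sketch with three toy groups (the result is judged against Fmax critical-value tables, not a built-in R test):

```r
set.seed(7)
groups <- list(a = rnorm(50, 0, 1.0),
               b = rnorm(50, 0, 1.2),
               c = rnorm(50, 0, 0.9))

# Variance of each group
group_vars <- sapply(groups, var)

# Hartley's Fmax: largest group variance over the smallest
f_max <- max(group_vars) / min(group_vars)
print(f_max)  # values near 1 suggest homogeneous variances
```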

# Assumptions Summary

The second rule is to **trust your intuition as an analyst and weigh the evidence**. Let me tell you how I would proceed for normality in general.

1. I would look at histograms and Q-Q plots to see if anything is amiss. If I don’t see anything, and the tests do not reveal any violations, then we can continue with the assumption of normality.

2. If the plots don’t look very normal and the tests indicate non-normality, I would proceed as if the data are non-normal.

    a. If my model is robust to that violation in certain conditions, then I can proceed (citing evidence that the model is robust to that violation; it’s important to provide rationale).

    b. If my model is not robust to that assumption being violated, then I need to deal with the problem.

3. If the plots look fine (especially the Q-Q plot) but the Shapiro-Wilk test is significant, then I would look at the sample size and judge the statistical and practical significance. If values for skewness and kurtosis are statistically significantly different from zero, but aren’t practically significant (i.e., the values are close to 0), and the sample size is good, then I would take the Q-Q plot as enough evidence to proceed if it looks quite good.
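Base R has no built-in skewness or kurtosis, so here is a sketch of common moment-based versions (packages such as e1071 or psych provide ready-made ones; both values are near 0 for normal data):

```r
set.seed(3)
x <- rnorm(500)

m <- mean(x)
s <- sqrt(mean((x - m)^2))  # population-style standard deviation

skewness <- mean((x - m)^3) / s^3
kurtosis <- mean((x - m)^4) / s^4 - 3  # "excess" kurtosis: normal = 0

print(c(skewness = skewness, kurtosis = kurtosis))
```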

When you are in doubt, though, deal with the problem to be safe.