Testing for normality
Anthropometric characters tend to be continuous, and many tests are constructed on the assumption that the data approximate a normal distribution. An easy way of seeing whether the distribution is skewed is to compare the values of the mean and median. For a normal distribution the mean and median are identical; as the distribution becomes more skewed, the difference between them increases. There are a number of statistical tests available for testing 'normality' and the researcher may well get different results depending on which test is used. For example, the Kolmogorov-Smirnov test examines the cumulative distribution, which conflates skewness and kurtosis, while the Cox test determines the extent of skewness and kurtosis separately. Since skewness is more constraining than kurtosis, the Cox test is preferable. Nevertheless, with large samples skewness and/or kurtosis may be statistically significant even though the magnitude of the effect(s) is very small.
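The mean/median comparison and the separate assessment of skewness and kurtosis can be sketched in a few lines. This is a minimal illustration, not the Cox test itself: it computes moment-based skewness and excess kurtosis, and uses the common large-sample approximation sqrt(6/n) for the standard error of skewness under normality. The BMI values and the function names are hypothetical and chosen only for illustration.

```python
import math
from statistics import mean, median

def moment_summary(values):
    """Return mean, median, moment skewness and excess kurtosis."""
    n = len(values)
    m = mean(values)
    # Central moments about the mean (population form, dividing by n).
    m2 = sum((x - m) ** 2 for x in values) / n
    m3 = sum((x - m) ** 3 for x in values) / n
    m4 = sum((x - m) ** 4 for x in values) / n
    return {
        "mean": m,
        "median": median(values),
        "skewness": m3 / m2 ** 1.5,
        "excess_kurtosis": m4 / m2 ** 2 - 3.0,
    }

def approx_se_skewness(n):
    """Large-sample approximation to the standard error of skewness."""
    return math.sqrt(6.0 / n)

# Hypothetical BMI values with a long upper tail (illustrative only).
bmi = [18.2, 19.0, 19.5, 20.1, 20.4, 20.8, 21.3, 22.0, 23.5, 26.9, 31.4]
summary = moment_summary(bmi)
print(summary)

# With a sample as large as the survey used later (n = 4150) the standard
# error is small, so even slight skewness can be statistically significant.
print(approx_se_skewness(4150))
```

A mean above the median together with positive skewness points to an extended upper tail. The small standard error at n = 4150 shows why a large sample can yield a significant result for a trivial departure from normality.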
Table 1. Mean BMIs of mothers by birth outcome
Birth outcome | n | Mean | SD
Child died | 345 | 20.36 | 2.66
Child survived | 3805 | 21.25 | 2.68
Total | 4150 | 21.18 | 2.69
F-test = 1.02, not significant.
t-test = 5.88, P < 0.001.
If the distribution of an anthropometric character does show significant skewness, then a simple logarithmic (either log_{10} or log_e) transformation will probably normalize the distribution. For instance, body mass index (BMI: kg/m^{2}) has been found to be positively skewed in some populations because of the extended tail at the upper end of the distribution.
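The effect of a logarithmic transformation on a distribution with an extended upper tail can be checked directly. The sketch below compares moment skewness before and after taking natural logarithms; the BMI values are hypothetical, and the example shows only that skewness is reduced for these values, not that the result is normal.

```python
import math
from statistics import mean

def skewness(values):
    """Moment coefficient of skewness (population form)."""
    n = len(values)
    m = mean(values)
    m2 = sum((x - m) ** 2 for x in values) / n
    m3 = sum((x - m) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

# Hypothetical BMI values with a long upper tail (illustrative only).
bmi = [18.2, 19.0, 19.5, 20.1, 20.4, 20.8, 21.3, 22.0, 23.5, 26.9, 31.4]

# Natural logarithm; log10 differs only by a constant factor, so the
# skewness of the transformed values is the same for either base.
log_bmi = [math.log(x) for x in bmi]

skew_raw = skewness(bmi)
skew_log = skewness(log_bmi)
print(f"skewness before: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

If skewness remains substantial after the transformation, the transformed variable should be re-tested before parametric methods are applied.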
Table 2. One-way analysis of variance and a posteriori test
Educational level and BMI in Bangladesh
Group | 0 | 1 | 2 | 3
Educational level | None | Primary | Secondary | Tertiary
Mean | 20.32 | 20.79 | 21.41 | 22.24
n | 1182 | 698 | 1355 | 915
• Analysis of variance
Source | d.f. | Sum of squares | Mean squares | F ratio | P
Between groups | 3 | 2079.1 | 693.20 | 102.37 | <0.0001
Within groups | 4146 | 28076.4 | 6.77 | |
Total | 4149 | 30155.75 | | |
• Multiple range test: Student-Newman-Keuls procedure
*Denotes pairs of groups significantly different at the 0.050 level
Mean | Group | 0 | 1 | 2 | 3
20.32 | Group 0 | | | |
20.79 | Group 1 | * | | |
21.41 | Group 2 | * | * | |
22.24 | Group 3 | * | * | * |
Cross-sectional statistical analyses
To illustrate the types of tests that can be used, data have been taken from a large Bangladeshi survey of 4150 mother-child pairs in which mothers' anthropometric data were related to birth outcome. The study was conducted in 10 medical centres in Bangladesh and all the women were at full term. Mothers with antepartum haemorrhage, miscarriage or abortion, multiple pregnancy, eclampsia or gross fetal abnormalities were excluded.
Table 3. Analysis of variance of BMI by educational level and gravidity
• Cell means (n in parentheses)
Total population: 21.18
Education level | 0 | 1 | 2 | 3
Mean | 20.32 | 20.79 | 21.41 | 22.24
(n) | (1182) | (698) | (1355) | (915)
Gravidity | 0 | 1 | 2 | 3 | 4 | 5+
Mean | 20.98 | 21.44 | 21.29 | 21.35 | 21.38 | 20.95
(n) | (1882) | (1013) | (604) | (349) | (122) | (180)
Educational level by gravidity
Educational level | Gravidity 0 | 1 | 2 | 3 | 4 | 5+
0 | 20.01 | 20.62 | 20.29 | 20.40 | 21.23 | 20.29
 | (429) | (246) | (189) | (151) | (62) | (105)
1 | 20.52 | 20.66 | 20.74 | 21.57 | 21.19 | 21.90
 | (286) | (163) | (116) | (67) | (29) | (37)
2 | 21.16 | 21.51 | 21.76 | 22.00 | 22.16 | 21.57
 | (681) | (346) | (174) | (101) | (22) | (31)
3 | 21.85 | 22.63 | 22.67 | 23.43 | 21.18 | 23.07
 | (486) | (258) | (125) | (30) | (9) | (7)
• Analysis of variance
Source of variation | Sum of squares | d.f. | Mean square | F | P
Education | 2272.92 | 3 | 757.64 | 113.53 | 0.001
Gravidity | 372.50 | 5 | 74.50 | 11.16 | 0.001
2-way interaction:
Education x Gravidity | 168.80 | 15 | 11.25 | 1.69 | 0.047
Residual | 27534.84 | 4126 | 6.67 | |
Total | 30155.76 | 4149 | 7.27 | |
• Multiple classification analysis
Variable + category | n | Unadjusted dev'n | Eta | Adjusted for independents dev'n | Beta
Education
0 none | 1182 | -0.86 | | -0.94 |
1 primary | 698 | -0.39 | | -0.42 |
2 secondary | 1355 | 0.24 | | 0.27 |
3 tertiary | 915 | 1.06 | | 1.13 |
 | | | 0.26 | | 0.28
Gravidity
0 | 1882 | -0.20 | | -0.31 |
1 | 1013 | 0.26 | | 0.18 |
2 | 604 | 0.11 | | 0.18 |
3 | 349 | 0.17 | | 0.48 |
4 | 122 | 0.21 | | 0.65 |
5+ | 180 | -0.23 | | 0.31 |
 | | | 0.08 | | 0.11
Multiple R^{2} = 0.081
Multiple R = 0.285
Continuous dependent variable and an independent variable with 2 categories (t-test and F-test)
One question of interest is whether there is any significant relationship between mothers' BMI and birth outcome, i.e. whether the infant died. Since there are only two categories (child death or no child death) a simple t-test will suffice. The simple t-test assumes that the sample variances do not differ significantly, so a test for homogeneity of variances (F-test) is usually performed before going on to the t-test. If the F-test shows significant heterogeneity a separate variance t-test is used; most computer-based statistical packages (e.g. SPSS/PC+) provide both the pooled and separate variance t-tests.
The comparison of mean BMIs of mothers by birth outcome is presented in Table 1. Since there was no significant difference in sample variances, a pooled t-test statistic was calculated. The results show that there is a highly significant difference in means; mothers whose child died have, on average, a lower BMI. In these analyses a two-tailed t-test was used because the null hypothesis (H0) was that there was no difference between means and no direction of difference was predicted in advance. If, however, some previous study had shown a significantly reduced BMI in mothers whose child had died, a directional alternative hypothesis (H1) would have been specified and a one-tailed t-test would have been used. The calculations of both one- and two-tailed t-tests are identical; the only difference is in the interpretation of the probability tables.
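The F-test and pooled t-test can be reproduced approximately from the summary statistics in Table 1. The sketch below is illustrative only: the function names are hypothetical, the inputs are the rounded means and SDs from the table (so the results differ slightly from the published values), and the P value uses a normal approximation that is reasonable only because the degrees of freedom are very large.

```python
import math

def variance_ratio(sd_a, sd_b):
    """F ratio for homogeneity of variances: larger variance over smaller."""
    var_a, var_b = sd_a ** 2, sd_b ** 2
    return max(var_a, var_b) / min(var_a, var_b)

def pooled_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Pooled-variance t statistic and its degrees of freedom."""
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / df
    se_diff = math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return (mean_a - mean_b) / se_diff, df

def two_tailed_p_normal(t):
    """Large-sample approximation: treat t as a standard normal deviate."""
    return math.erfc(abs(t) / math.sqrt(2.0))

# Summary statistics from Table 1 (survived vs died).
f_ratio = variance_ratio(2.68, 2.66)
t_stat, df = pooled_t(21.25, 2.68, 3805, 20.36, 2.66, 345)
p_two_tailed = two_tailed_p_normal(t_stat)
# Same statistic; only the tail area read off changes for a one-tailed test.
p_one_tailed = p_two_tailed / 2.0

print(f"F = {f_ratio:.2f}, t = {t_stat:.2f} on {df} d.f.")
print(f"two-tailed P = {p_two_tailed:.2g}, one-tailed P = {p_one_tailed:.2g}")
```

The last two lines of the calculation make the point in the text concrete: the t statistic is computed once, and the one-tailed probability is simply half the two-tailed one when the difference lies in the predicted direction.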
Continuous dependent variable and an independent variable with 3 or more categories (one-way analysis of variance)
It is frequently reported that BMI varies between people with different educational levels, where the educational level is taken as a proxy for a combination of knowledge of health matters and socio-economic status. In Bangladesh it is usual to grade people's educational attainment into four levels: no education (coded as 0 here), primary (1), secondary (2) and tertiary (3). The mean BMIs for the four groups are shown in Table 2 together with the analysis of variance (ANOVA). Many computer packages also include tests of a posteriori differences (i.e. for use when the F-test is significant and the researcher wants to know which means differ significantly). There are a number of a posteriori tests; the one illustrated here is the Student-Newman-Keuls procedure, but other frequently used tests include the Scheffé test and a posteriori t-tests.
The ANOVA shows that there are highly significant differences between the four means. The a posteriori test reveals that every pair of group means differs significantly at the 0.05 level.
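The between-groups part of the one-way ANOVA can be rebuilt from the group means and sample sizes in Table 2. This is a minimal sketch: the within-groups mean square (6.77) is copied from the table because it cannot be derived without the raw data, and the rounded means give a sum of squares close to, but not exactly, the published 2079.1.

```python
# Group means and sample sizes from Table 2 (education codes 0 to 3).
means = [20.32, 20.79, 21.41, 22.24]
sizes = [1182, 698, 1355, 915]

# Within-groups mean square copied from Table 2 (needs raw data to derive).
MS_WITHIN = 6.77

n_total = sum(sizes)
grand_mean = sum(n * m for n, m in zip(sizes, means)) / n_total

# Between-groups sum of squares: weighted squared deviations of each
# group mean from the grand mean.
ss_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means))
df_between = len(means) - 1
ms_between = ss_between / df_between
f_ratio = ms_between / MS_WITHIN

print(f"grand mean = {grand_mean:.2f}")
print(f"SS between = {ss_between:.1f} on {df_between} d.f.")
print(f"F = {f_ratio:.1f}")
```

A significant F ratio says only that the means are not all equal; identifying which pairs differ still requires an a posteriori procedure such as the one in Table 2.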
Continuous dependent variable and two independent variables with 2 or more categories (ANOVA)
A slightly more complex analysis is used when the researcher wants to examine the simultaneous effect of two or more discrete characters on a continuous variable. One example is examining the relationship between BMI and both educational level and gravidity. The same categories for educational level are used as described previously. Gravidity has been coded from 0 (primigravida) to 5 (the last category referring to mothers who have 5 or more children). The results of the ANOVA are presented in Table 3. They show that there are significant additive effects of both educational level and gravidity and a borderline significant interaction effect (P = 0.047). The multiple classification analysis compares each group with the overall (grand) mean. It is clear, for instance, that the initial pattern of means for gravidity, which shows lower means for primigravida and multigravida (5+) women, changes when educational level is taken into account. The multiple R^{2} provides a measure of how much of the variation in BMI is explained by educational level and gravidity. In this example the two independent variables account for 8.1% of the total variation.
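The 8.1% figure can be checked against the sums of squares in Table 3. The sketch below assumes that the multiple R^{2} reported by the multiple classification analysis refers to the additive (main effects) model, so that the explained sum of squares is what remains after subtracting the interaction and residual terms from the total. That assumption is not stated in the source, but it reproduces the published multiple R.

```python
import math

# Sums of squares copied from the Table 3 analysis of variance.
ss_total = 30155.76
ss_residual = 27534.84
ss_interaction = 168.80

# Assumption: the additive (main effects) sum of squares is the total
# minus the interaction and residual sums of squares.
ss_main_effects = ss_total - ss_residual - ss_interaction

r_squared = ss_main_effects / ss_total
multiple_r = math.sqrt(r_squared)

print(f"main effects SS = {ss_main_effects:.2f}")
print(f"multiple R^2 = {r_squared:.3f}, multiple R = {multiple_r:.3f}")
```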
Table 4. Regression analysis of BMI on mother's age
Multiple R | 0.140
R^{2} | 0.0197
Adjusted R^{2} | 0.0194
Standard error | 2.6696
• Analysis of variance
Source | d.f. | Sum of squares | Mean square
Regression | 1 | 593.71 | 593.71
Residual | 4148 | 29562.05 | 7.13
F = 83.31, P < 0.0001
Variable | B | SE B | Beta | t | P
Age | 0.077 | 0.0084 | 0.140 | 9.13 | 0.0001
(Constant) | 19.300 | 0.210 | | 91.94 | 0.001
Table 5. Test for curvilinearity of BMI against mother's age
Step 1. Age entered
Multiple R | 0.140
R^{2} | 0.0197
Adjusted R^{2} | 0.0194
Standard error | 2.6696
• Analysis of variance
Source | d.f. | Sum of squares | Mean square
Regression | 1 | 593.71 | 593.71
Residual | 4148 | 29562.05 | 7.13
F = 83.31, P < 0.0001
Variable | B | SE B | Beta | t | P
Age | 0.077 | 0.0084 | 0.140 | 9.13 | 0.0001
(Constant) | 19.300 | 0.210 | | 91.94 | 0.0001
Step 2. Age^{2} entered
Multiple R | 0.154
R^{2} | 0.0238
Adjusted R^{2} | 0.0233
Standard error | 2.6644
• Analysis of variance
Source | d.f. | Sum of squares | Mean square
Regression | 2 | 716.70 | 358.35
Residual | 4147 | 29439.05 | 7.10
F = 50.48, P < 0.0001
Variable | B | SE B | Beta | t | P
Age | 0.356 | 0.068 | 0.652 | 5.26 | 0.0001
Age^{2} | -0.005 | 0.001 | -0.515 | -4.16 | 0.0001
(Constant) | 15.76 | 0.875 | | 18.00 | 0.0001
Continuous dependent variable and a continuous independent variable (regression analysis)
Regression analysis is used to examine the bivariate relationship between two continuous variables when one is treated as dependent on the other, or when the researcher wants to plot the best-fitting line. Alternatively, correlation analysis can be used if there is no dependent/independent relationship. The results of regressing BMI on age are shown in Table 4. There is a clear positive relationship, with BMI increasing with mother's age, and the regression line suggests that for each additional year of age BMI increases by about 0.08 kg/m^{2}. It is always advisable to examine the residual plot, because if the association is linear the residuals will be scattered symmetrically about zero. In this analysis the examination of the residuals for BMI and age (not shown) revealed a curvilinear pattern, which suggests that a quadratic term should be included in the analysis. The next section details how to test for a curvilinear relationship.
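The quantities in Table 4 are linked by simple arithmetic, which the following sketch reproduces from the published coefficients. The function name and the example ages are illustrative. Because the coefficients are rounded, the t value differs slightly from the tabulated 9.13.

```python
# Coefficients and sums of squares copied from Table 4.
INTERCEPT = 19.300
SLOPE_AGE = 0.077
SE_SLOPE = 0.0084
SS_REGRESSION = 593.71
SS_RESIDUAL = 29562.05

def predict_bmi(age_years):
    """Predicted BMI (kg/m^2) from the fitted straight line."""
    return INTERCEPT + SLOPE_AGE * age_years

# t for the slope is the coefficient divided by its standard error.
t_slope = SLOPE_AGE / SE_SLOPE

# R^2 is the regression sum of squares as a share of the total.
r_squared = SS_REGRESSION / (SS_REGRESSION + SS_RESIDUAL)

for age in (20, 30, 40):
    print(f"age {age}: predicted BMI = {predict_bmi(age):.2f}")
print(f"t = {t_slope:.2f}, R^2 = {r_squared:.4f}")
```

Although the slope is highly significant, R^{2} shows that age alone accounts for only about 2% of the variation in BMI, which again illustrates how large samples make small effects statistically significant.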
Table 6. Analysis of variance of BMI with age and age squared, educational level and gravidity
Source of variation | Sum of squares | d.f. | Mean square | F | P
Covariates
Age | 196.656 | 1 | 196.656 | 29.892 | 0.001
Age^{2} | 122.992 | 1 | 122.992 | 18.695 | 0.001
Main effects
Education | 1909.395 | 3 | 636.465 | 96.743 | 0.001
Gravidity | 69.508 | 5 | 13.902 | 2.113 | 0.061
2-way interactions
Education x Gravidity | 181.932 | 15 | 12.129 | 1.844 | 0.024
Residual | 27131.445 | 4124 | 6.579 | |
Total | 30155.752 | 4149 | 7.268 | |
• Multiple classification analysis
Grand mean = 21.178
Variable + category | n | Unadjusted dev'n | Eta | Adjusted for independents + covariates dev'n | Beta
Education
0 none | 1182 | -0.86 | | -0.92 |
1 primary | 698 | -0.39 | | -0.36 |
2 secondary | 1355 | 0.24 | | 0.31 |
3 tertiary | 915 | 1.06 | | 0.99 |
 | | | 0.26 | | 0.27
Gravidity
0 | 1882 | -0.20 | | 0.08 |
1 | 1013 | 0.26 | | 0.18 |
2 | 604 | 0.11 | | -0.02 |
3 | 349 | 0.17 | | 0.07 |
4 | 122 | 0.21 | | 0.16 |
5+ | 180 | -0.23 | | 0.38 |
 | | | 0.08 | | 0.05
Multiple R^{2} = 0.094
Multiple R = 0.307
Test for curvilinearity for a continuous dependent variable and a continuous independent variable (regression analysis)
With the inclusion of a quadratic term, the generalized regression equation changes from Y = a + bX to Y = a + bX + cX^{2}. The analyses of BMI against mother's age (linear and quadratic) are presented in Table 5. The quadratic term for age is shown as Age^{2} in Table 5 and it is highly significant (t = -4.162, P < 0.0001), indicating significant curvilinearity. The effect of the negative quadratic coefficient (-0.005) is to lower predicted BMIs at higher ages.
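The test for curvilinearity amounts to asking whether adding X^{2} reduces the residual sum of squares by more than chance would allow. The sketch below recovers the size of t for Age^{2} from the change in the regression sum of squares between the two steps of Table 5, and evaluates the fitted quadratic. The function name is illustrative, and because the published coefficients are heavily rounded (especially -0.005), the predicted values and the age at which the curve peaks are approximate.

```python
import math

# Sums of squares and mean square copied from Table 5.
SS_REG_LINEAR = 593.71      # step 1: age only
SS_REG_QUADRATIC = 716.70   # step 2: age and age squared
MS_RESIDUAL_STEP2 = 7.10

# Fitted quadratic Y = a + bX + cX^2 with the rounded step 2 coefficients.
A, B, C = 15.76, 0.356, -0.005

def predict_bmi(age_years):
    """Predicted BMI (kg/m^2) from the fitted quadratic."""
    return A + B * age_years + C * age_years ** 2

# F for the single added term; its square root equals |t| for Age^2.
f_change = (SS_REG_QUADRATIC - SS_REG_LINEAR) / MS_RESIDUAL_STEP2
abs_t_quadratic = math.sqrt(f_change)

# Age at which the fitted curve reaches its maximum: -b / (2c).
peak_age = -B / (2 * C)

print(f"F change = {f_change:.2f}, |t| = {abs_t_quadratic:.3f}")
for age in (20, 30, 40):
    print(f"age {age}: predicted BMI = {predict_bmi(age):.2f}")
print(f"fitted curve peaks near age {peak_age:.1f}")
```

The flattening of predicted BMI between ages 30 and 40 is the practical meaning of the negative quadratic coefficient.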
Continuous dependent variable and a continuous independent variable and a number of discrete independent variables (ANOVA or multiple regression analysis)
The previous analyses have shown that there are relationships between BMI and educational level, gravidity and maternal age. The simultaneous effects of these variables can be examined using analysis of variance.
In this analysis of variance the effects of the continuous characters (age and age^{2}) have been removed first, before determining the effects of educational level and gravidity, but researchers are usually free to choose the order in which terms are removed. The results are presented in Table 6 and show that, after removing the linear and quadratic effects of age, the impact of education remains very much as it was, whereas gravidity is no longer significant (P = 0.061). In addition there is a significant interaction between education and gravidity (P = 0.024). About 9% of the variance of BMI is explained by the three variables.
Multiple regression analysis would give similar results to ANOVA and its use is discussed in the next section.