Validation of food intake data: implications of food composition variation

A particular consideration of the impact of food composition variation on regression and correlation analyses arises in connection with validation trials in which food intake estimated by recall or observation methods is compared with intake during the same period estimated by direct chemical measurement of duplicate meals. This is a common procedure. What may not always be recognized is that variation in food composition will inevitably yield a bias in regression slopes and attenuation of correlation coefficients even if the estimation of food intake is perfect. Further, the impact on regression depends on whether the error term lies in the dependent or independent variable.

Table 8. Estimate of variability (CV) of one-day intake derived from consideration of both food composition variation and random error in the estimation of intake (assuming 15 foods in the diet)ᵃ



  10 15 20 25 30 35 40 45
10 3.7 4.7 5.8 7.0 8.2 9.5 10.7 12.0
15 4.7 5.5 6.5 7.6 8.7 9.9 11.1 12.4
20 5.8 6.5 7.4 8.4 9.5 10.6 11.7 12.9
25 7.0 7.6 8.4 9.3 10.3 11.3 12.4 13.6
30 8.2 8.7 9.5 10.3 11.2 12.2 13.3 14.4
35 9.5 9.9 10.6 11.3 12.2 13.2 14.2 15.3
40 10.7 11.1 11.7 12.4 13.3 14.2 15.2 16.2
45 12.0 12.4 12.9 13.6 14.4 15.3 16.2 17.2

a. Values calculated as value in table 7 divided by square root of 15, the assumed number of food items in the one-day diet.
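
The scaling described in note a can be sketched numerically. The snippet below is an illustration only. It assumes that the table 7 value combines the two CVs as for the product of two independent factors (CV² = a² + b² + a²b²); that rule is not stated in this section, but it is consistent with the tabulated figures. The function name `one_day_cv` is hypothetical, and because the assumed rule is symmetric it does not matter which axis carries which CV.

```python
import math

def one_day_cv(cv_composition, cv_estimation, n_foods=15):
    """CV (%) of a one-day intake built from n_foods independent foods.

    Assumption (not stated in the text): the per-food CV combines the two
    error sources as for a product of independent factors.
    """
    a, b = cv_composition / 100.0, cv_estimation / 100.0
    per_food = math.sqrt(a * a + b * b + a * a * b * b)  # assumed table 7 value
    return 100.0 * per_food / math.sqrt(n_foods)          # note a: divide by sqrt(15)

print(round(one_day_cv(10, 10), 1))  # 3.7
print(round(one_day_cv(45, 45), 1))  # 17.2
```

Changing `n_foods` shows how quickly the one-day CV grows when the diet contains fewer foods.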

This can be illustrated in a simulation analysis. Consider a model in which iron intake is computed from estimated food intake and is chemically determined from duplicate meals. Assume that for 97 individuals the intakes range from 5 to 25 mg per day and that the individuals are randomly distributed across this range. Consider also that the iron composition for one-day intakes has a CV of 15 per cent, a value consistent with the estimates presented in table 4. In the simulation analysis, 97 random values for iron intake lying in the 5 to 25 mg range were selected. For each of these mean intake values, a random value was selected from the population of possible real values described by a random distribution having mean as specified and CV = 15 per cent. The regression across the 97 individuals was then computed. This exercise was then repeated de novo 1,000 times. Finally the regression parameters for the 1,000 estimates were examined. The results are presented below:

A. With
X = calculated intake (no error term in this model)
Y = chemically determined intake (includes the "error" term)

Regression parameters  
Intercept 0.0639 ± 0.5440ᵃ
Slope 0.995 ± 0.0441
Correlation coefficient 0.9229 ± 0.0146

B. With
X = chemical composition (includes the "error" term)
Y = calculated intake (no error term in this model)

Regression parameters  
Intercept 2.1579 ± 0.5403
Slope 0.8573 ± 0.0379
Correlation coefficient 0.9229 ± 0.0146

a. Mean ± SD from 1,000 iterations of model.

That the bias in regression slope is seen only in one variation of the model is a recognized phenomenon. It is the error term in the independent variable that biases the regression.
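
The simulation described above can be approximated with a short script. This is a sketch under stated assumptions rather than the original program: intakes are drawn uniformly from 5 to 25 mg, the "real" value is drawn from a normal distribution with SD equal to 15 per cent of the calculated intake (the text does not name the distribution), and the function names `ols` and `simulate` are hypothetical.

```python
import random
import statistics

def ols(x, y):
    """Return (intercept, slope, r) for least-squares regression of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope, sxy / (sxx * syy) ** 0.5

def simulate(iterations=1000, n=97, cv=0.15, seed=1):
    """Repeat the two regressions and return (mean, SD) per parameter."""
    rng = random.Random(seed)
    runs_a, runs_b = [], []
    for _ in range(iterations):
        calc = [rng.uniform(5.0, 25.0) for _ in range(n)]  # calculated intake, no error
        chem = [rng.gauss(m, cv * m) for m in calc]         # chemical value, assumed normal
        runs_a.append(ols(calc, chem))  # model A: error in Y
        runs_b.append(ols(chem, calc))  # model B: error in X
    def summarise(runs):
        # Columns are intercept, slope, correlation coefficient.
        return [(statistics.fmean(c), statistics.stdev(c)) for c in zip(*runs)]
    return summarise(runs_a), summarise(runs_b)

model_a, model_b = simulate()
for name, stats in (("A", model_a), ("B", model_b)):
    print(name, [f"{mean:.4f} ± {sd:.4f}" for mean, sd in stats])
```

With these assumptions the mean slope in model A should lie close to 1 and the mean slope in model B should fall clearly below 1, while the correlation coefficient is the same in both models because it is symmetric in X and Y. Exact figures depend on the seed and on the assumed error distribution.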

Table 9. The impact of error in the independent and dependent variables on (A) regression slope and (B) correlation coefficientᵃ

A. Impact on regression slope

Rows: dependent variable (Y), ratio of intra/inter variances

Columns: independent variable (X), ratio of intra/inter variances

  0 0.4 0.8 1.2 1.6 2.0 2.4 2.8
0 1.0 0.714 0.556 0.455 0.385 0.333 0.294 0.263
0.4 1.0 0.714 0.556 0.455 0.385 0.333 0.294 0.263
0.8 1.0 0.714 0.556 0.455 0.385 0.333 0.294 0.263
1.2 1.0 0.714 0.556 0.455 0.385 0.333 0.294 0.263
1.6 1.0 0.714 0.556 0.455 0.385 0.333 0.294 0.263
2.0 1.0 0.714 0.556 0.455 0.385 0.333 0.294 0.263
2.4 1.0 0.714 0.556 0.455 0.385 0.333 0.294 0.263
2.8 1.0 0.714 0.556 0.455 0.385 0.333 0.294 0.263

B. Impact on correlation coefficientᵇ

Rows: dependent variable (Y), ratio of intra/inter variances

Columns: independent variable (X), ratio of intra/inter variances

  0 0.4 0.8 1.2 1.6 2.0 2.4 2.8
0 1.0 0.845 0.745 0.674 0.620 0.577 0.542 0.513
0.4 0.845 0.714 0.630 0.570 0.524 0.488 0.458 0.434
0.8 0.745 0.630 0.556 0.503 0.462 0.430 0.404 0.382
1.2 0.674 0.570 0.503 0.455 0.418 0.389 0.366 0.346
1.6 0.620 0.524 0.462 0.418 0.385 0.358 0.336 0.318
2.0 0.577 0.488 0.430 0.389 0.358 0.333 0.313 0.296
2.4 0.542 0.458 0.404 0.366 0.336 0.313 0.294 0.278
2.8 0.513 0.434 0.382 0.346 0.318 0.296 0.278 0.263

a. The reference slope and correlation are each set at 1.0. The tables portray the bias introduced as a multiple of true values.
b. All calculations presented assume that there is no correlation between errors.

A random error in the dependent variable adds scatter but does not bias the slope. In the case of correlation analyses, the attenuation is the same no matter which variable carries the error. Here the correlations are high since the range of observed intakes is quite large in relation to the error term stipulated. More general models of these effects on regression slopes and on correlation coefficients are presented in table 9. The bases of these calculations will be found in Beaton et al. [2] and Snedecor and Cochran [7].
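
The general pattern in table 9 can be sketched with the standard attenuation formulas for uncorrelated errors. These formulas are supplied here as an assumption, since the section gives only the results and refers to [2] and [7] for the derivation; the function names are hypothetical. With λ denoting the ratio of intra- to inter-individual variance, the slope is multiplied by 1/(1 + λx) and the correlation by 1/√((1 + λx)(1 + λy)).

```python
import math

def slope_factor(ratio_x):
    """Multiplier on the true slope; only error in X matters."""
    return 1.0 / (1.0 + ratio_x)

def correlation_factor(ratio_x, ratio_y):
    """Multiplier on the true correlation; error in X and in Y both attenuate."""
    return 1.0 / math.sqrt((1.0 + ratio_x) * (1.0 + ratio_y))

ratios = [0, 0.4, 0.8, 1.2, 1.6, 2.0, 2.4, 2.8]
print([round(slope_factor(rx), 3) for rx in ratios])
for ry in ratios:
    print(ry, [round(correlation_factor(rx, ry), 3) for rx in ratios])
```

The slope factor ignores the variance ratio of Y entirely, which is why every row of part A is identical, while the correlation factor is symmetric in the two ratios.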

Systematic errors in food composition data

The effects described in table 9 and discussed in the foregoing text are to be distinguished from a bias in the estimates of average intake for the group of subjects. If statistically significant differences between estimated and determined intakes are found, this suggests systematic bias either in the food composition table or in the estimation of food intake. (The latter may be dismissed if food composites have been based on reported intake rather than being true duplicate meals. Both approaches to validation of food intake have been used. Building food composites from reported diets can be seen as a test of the validity of food composition tables. In this case the biological variability of individual samples of food should be considered in interpreting results.)

Any source of systematic error in food composition data will, of course, lead to a bias in the estimate of intake.

To consider the import of improving the food composition data base on data analyses involving regression or correlation analyses, it is necessary to consider concurrently the other error sources that may be present in the nutrient-intake estimates. It is necessary to recognize also that no matter how much the data base may be improved, there will always remain a biologic variability of composition among individual samples of foods. This will ultimately prove to be the limiting factor in improving the composition data base.

Relevance to priorities for food composition data

The above considerations have import in considering priorities in improving food composition data bases. Some of these are outlined below.

First, if there is any suspicion of true bias in the food composition data, the error will carry through all calculations. Thus, if there is suspicion that a methodologic error gives consistent under- or overestimation of composition, correction of the error should have high priority.

Second, it should be apparent from the foregoing that the major contribution to the error term in estimated one-day intakes will be associated with the foods that make the greatest contribution to total nutrient intake. That is, greater benefit will accrue from improvement of the composition estimate of the major nutrient contributors than from improvement of minor contributors. Therefore, obtaining more replicates of composition for major contributors will be more cost-effective than addressing minor contributors.

Similarly, and particularly in the connotation of the INFOODS programme, which will have international implications, the effect of improving the reliability of food composition will be greater in diets that include only a few foods than in diets that are marked by great diversity. It follows then that increasing the composition replicates will be more cost-effective in major foods of limited diets than in the case of diverse diets.
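
The argument of the two preceding paragraphs can be illustrated with a small calculation. This is a sketch only: it assumes that composition errors are independent between foods, so that the CV of total intake is the root sum of squares of each food's CV weighted by its share of the nutrient. The shares and CVs below are hypothetical, as is the function name `intake_cv`.

```python
import math

def intake_cv(shares, cvs):
    """CV (%) of total intake; shares[i] is food i's fraction of the nutrient,
    cvs[i] its composition CV (%). Errors are assumed independent."""
    total = sum(shares)
    return math.sqrt(sum((s / total * c) ** 2 for s, c in zip(shares, cvs)))

shares = [0.5, 0.2, 0.1, 0.1, 0.1]        # hypothetical: one dominant food
baseline = intake_cv(shares, [30] * 5)
better_major = intake_cv(shares, [15, 30, 30, 30, 30])  # halve CV of the major food
better_minor = intake_cv(shares, [30, 30, 30, 30, 15])  # halve CV of a minor food
print(round(baseline, 1), round(better_major, 1), round(better_minor, 1))

# Limited versus diverse diets with equal shares and the same per-food CV
print(round(intake_cv([1] * 5, [30] * 5), 1), round(intake_cv([1] * 15, [30] * 15), 1))
```

Under these assumptions, halving the CV of the dominant food reduces the overall CV far more than the same improvement in a minor food, and the overall CV is larger in the five-food diet than in the fifteen-food diet.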

Obviously, if data are missing for certain foods, either imputed values must be used or intake from that food will be taken as 0. Either way there is a potential error. In the latter instance, the error will always be a bias toward underestimation of total intake; in the former, the error could be in either direction, and across foods it might even be random. If the food is a major contributor of total nutrient intake, then the error term could be quite important. It follows that filling in missing data in the food composition table must have a priority. If the food is expected to be a minor contributor to total intake, and if reasonable imputations can be made, imputation of missing data may be quite reasonable. Conversely, if the food is thought to be a major contributor, it will be cost-effective to undertake analyses. Unless there is an a priori reason to believe that varietal differences are great, it may not be cost-effective to undertake composition determinations for each variety in use. The same consideration will apply to compositional differences attributable to soil composition and growing conditions (but note here that if the soil composition and growing conditions of a specific area affect all, or many, of the foods consumed in that area, a bias in the estimate of intake could be present). Perhaps the most cost-effective approach in this case would be research intended to determine whether major effects are likely to be present, and then a reconsideration of analytical priorities.
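
The contrast between treating missing values as zero and imputing them can be sketched as follows. All numbers are hypothetical, and the imputation error is assumed to be a random relative error that is symmetric about zero, which is only one of the possibilities the paragraph above allows.

```python
import random
import statistics

def total_error(contributions, missing, strategy, rel_error=0.3, rng=random):
    """Estimated minus true total intake when foods in `missing` lack data."""
    error = 0.0
    for i in missing:
        if strategy == "zero":
            error -= contributions[i]  # food dropped: always an underestimate
        else:
            # Imputed value is off by an assumed random relative error
            error += contributions[i] * rng.uniform(-rel_error, rel_error)
    return error

rng = random.Random(0)
contributions = [6.0, 3.0, 2.0, 1.0, 0.5, 0.5]  # hypothetical mg/day by food
missing = [3, 4, 5]                             # minor contributors lack data
zero_bias = total_error(contributions, missing, "zero")
imputed = [total_error(contributions, missing, "impute", rng=rng) for _ in range(5000)]
print(zero_bias, round(statistics.fmean(imputed), 3), round(statistics.stdev(imputed), 3))
```

In this sketch the zero strategy gives a fixed negative bias equal to the missing foods' contribution, whereas imputation leaves a mean error near zero with some added scatter. Putting a major contributor in `missing` enlarges both effects.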

In all considerations, across nutrients, a scale of relative priorities must be based upon the perceived importance of examination of the nutrient in question as well as the relative cost of determinations.


If we are to address priorities for improvement of food composition data bases through increased numbers of analyses of food samples, the foregoing may be taken as an example of the merit of the following recommendation: Before major investment in food composition analyses is undertaken, either to increase the number of replicate analyses or to refine the specification of foods for which analyses are presented, it is recommended that the anticipated uses of food composition data be defined and "sensitivity analyses" be undertaken to examine the cost-effectiveness of such investment. From such sensitivity analyses a ranking of relative priorities, based on cost-effectiveness considerations, should be developed.


1. G. H. Beaton, "Nutritional Assessment of Observed Dietary Intake: An Interpretation of Recent Requirement Reports," in H. H. Draper, ed., Recent Advances in Nutrition Research (in press).

2. G. H. Beaton et al., "Sources of Variance in 24-hour Dietary Recall Data: Implications for Nutrition Study Design and Interpretation," A.J.C.N., 32: 2546-2559 (1979).

3. G. H. Beaton et al., "Sources of Variance in 24-hour Dietary Recall Data: Implications for Nutrition Study Design and Interpretation. Carbohydrate Sources, Vitamins and Minerals," A.J.C.N., 37: 986-995 (1983).

4. D. R. Jacobs, Jr., J. T. Anderson, and H. Blackburn, "Diet and Serum Cholesterol. Do Zero Correlations Negate the Relationship?" Am. J. Epidemiol., 110: 77-87 (1979).

5. K. Liu, J. Stamler, A. Dyer, J. McKeever, and P. McKeever, "Statistical Methods to Assess and Minimize the Role of Intra-individual Variability in Obscuring Relationships between Dietary Lipids and Serum Cholesterol," J. Chronic Dis., 31: 399-418 (1978).

6. National Academy of Sciences, National Research Council, Subcommittee on Criteria for Dietary Evaluation, Nutrient Adequacy: Assessment Using Food Consumption Surveys (National Academy Press, Washington, D.C., 1985).

7. G. W. Snedecor and W. G. Cochran, Statistical Methods, 7th ed. (Iowa State University Press, Ames, Iowa, 1980).

8. R. A. Stallones, "Comments on the Assessment of Nutritional Status in Epidemiologic Studies and Surveys of Populations," A.J.C.N., 35: 1290-1291 (1979).

9. US Department of Agriculture, "Composition of Foods: Raw, Processed, Prepared," Agriculture Handbook No. 8 (Science and Education Administration, USDA, Washington, D.C., 1976-). This reference is intended to cover a series of reports that is not yet complete.

10. WHO, Energy and Protein Requirements, report of a Joint FAO/WHO/UNU Expert Consultation, WHO Technical Report Series, no. 724 (WHO, Geneva, 1985).

Dietary assessment methods used by the national health and nutrition examination surveys (NHANES)

Design of NHANES II
Major nutrition-related components of NHANES II
Uses of dietary data
Plans for future NHANES


National Center for Health Statistics, US Department of Health and Human Services,
Hyattsville, Maryland, USA


The National Center for Health Statistics has conducted health examination surveys for over 20 years. The National Health Survey Act of 1956 authorized the secretary of what is now the Department of Health and Human Services, acting through the National Center for Health Statistics, to collect statistics on a wide range of health issues. Among other topics, the centre is authorized to collect statistics on "determinants of health" and "the extent and nature of illness and disability of the population of the United States (or of any groupings of the people included in the population)."

As part of its response to this mandate, the centre fielded the first National Health Examination Survey in 1959. The target population for this survey was adults of 18 to 74 years of age. Two additional surveys were conducted during the 1960s, extending the age groups examined to include children of 6 to 11 years and adolescents of 12 to 17. In 1971, the range of topics included in the survey was extended to include nutritional status. Nutritional status was assessed through a fivefold approach including a medical history, a physician's examination, biochemical tests, body measurements, and a dietary interview. The first National Health and Nutrition Examination Survey (NHANES I), conducted from 1971 to 1974, examined a representative sample of persons between the ages of 1 and 74. An additional sample of adults aged 25 to 74 years, called the NHANES I Augmentation, was examined in 1974-1975. The second National Health and Nutrition Examination Survey (NHANES II) was conducted from 1976 to 1980. The age range was extended to include infants of six months to one year. In December 1984, the centre completed data collection for the Hispanic Health and Nutrition Examination Survey (HHANES) of persons of Mexican American, Puerto Rican, and Cuban ancestry residing in the continental United States. We are beginning to plan the next survey, NHANES III, scheduled to begin in 1988.

Over the years, data generated by the health examination surveys have served a variety of uses. The surveys have provided estimates of the prevalence of characteristics or conditions in the American population. Normative or descriptive data have been published on topics such as weight and stature. Both types of estimates permit the monitoring or measurement of changes in health and nutritional status over time through successive surveys. Problems of public health importance have been identified. The survey data have also been used to study the interrelationships of health and nutrition variables in the general population.

My purpose is to describe the National Health and Nutrition Examination Surveys, paying particular attention to the dietary intake data. Using NHANES II as a reference point, I will discuss design considerations, the major components of the survey related to nutrition, uses of the dietary data, and plans for future surveys.

Design of NHANES II

In approaching the design of NHANES I and II, many factors had to be considered. I will discuss here the design specifications that were considered in planning the most recent national survey, NHANES II.

NHANES II was planned to be a probability sample of the civilian, non-institutionalized population of the United States for persons of 6 months to 74 years. Three subgroups in the population were of special interest for nutritional assessment because it was thought they were at higher risk of malnutrition: pre-school children (6 months to 5 years), the aged (60 to 74 years), and the poor (persons below the poverty level as defined by the US Bureau of the Census). These groups were oversampled to improve the reliability of the statistics generated. Although women of child-bearing age were also considered to be at risk of malnutrition, no oversampling was necessary. The total sample size desired was 21,000 examined persons, and the number of sample persons selected in each primary sampling unit (PSU) was to be between 300 and 600.

The data collection mechanism used in NHANES I was used again in NHANES II with appropriate modifications. An initial interview was conducted in the household, in which sociodemographic information and medical histories were collected. Sample persons were scheduled to visit mobile examination centres in which the physical examination, dietary interview, anthropometry, and other procedures and tests were conducted. At any time during the survey period, two centres were operating in different locations while a third was being serviced or relocated. The mobile examination centres provided a controlled, standardized environment for the examinations and tests. The examinations and tests were conducted by a small, well-trained staff which moved from site to site with the mobile examination centres.

Because of the small number of mobile examination centres, the logistical constraints involved in moving and setting up the centres, the large number of sample persons, and the length of the examination, the total period for data collection was planned to be three to four years. The average length of an individual examination was two to three hours, but it varied depending on the age of the examinee. The examination for pre-school children lasted no more than two hours, while the time for an adult did not exceed three hours.

The survey was designed to produce statistics for the four broad geographic regions of the United States and for the total population by age, sex, race, and income classifications. In the end, a total of 20,322 individuals were interviewed and examined in NHANES II in 64 primary sampling units. Because not all individuals underwent all aspects of the interview and examination, appropriate non-response adjustments were made. These non-response adjustments bring the sum of the final weights into close alignment with the age, sex, and race estimates of the Census Bureau at the mid-point of the survey.
