
Part II Gathering the data

There are five categories of procedures for obtaining the numerical data for a food composition data base. These are discussed in detail in chapters 2 through 6.

Analysing foods. Given ample resources, the preferred method of obtaining data for a food composition data base is to analyse, or have analysed, samples of the particular foods for the desired nutrients. Chapter 2 contains an outline of the considerations involved in analysing food samples, and a listing of the subsequent calculations involved in determining nutrient levels. A much more extensive discussion and guide is provided by Greenfield and Southgate [26] and earlier by Southgate [84].

Calculating representative data. Multiple data points representing a single nutrient in a food must be combined into specific data entries. Chapter 3 describes the considerations and techniques used to calculate representative values.

Data from other sources. If satisfactory data already exist for the foods and nutrients of interest--in the literature, or in the files or data bases of others--these should be used. Chapter 4 discusses the concerns involved in searching for and evaluating these data.

Estimation from data on similar foods. If data cannot be obtained by analysis, or satisfactorily found elsewhere, it may be possible to estimate the needed data on the basis of data for a similar food. This can be a difficult procedure and must be done with great care. Chapter 5 discusses the considerations involved in these situations.

Calculations for multi-ingredient foods. Foods which contain more than a single ingredient (sometimes called recipe foods or mixed dishes) require procedures for estimation of the nutrient content of the dish or final product based on the composition of the ingredients. As in the estimation of data from similar foods, such procedures can be complex and inexact and should be used with care. They are discussed in chapter 6.

These five categories cover the basic techniques used to produce the numbers for food composition data bases. If none of these approaches is appropriate or satisfactory, then the data must be left as "missing" and carefully labelled as such. It is especially important that missing data never be represented by zero.
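The reason for never coding missing data as zero can be illustrated with a minimal sketch. It assumes a hypothetical convention in which a missing value is stored as None; the function name and the sample values are illustrative only and do not come from any particular data base.

```python
from typing import List, Optional


def mean_ignoring_missing(values: List[Optional[float]]) -> Optional[float]:
    """Average the known values; return None if every value is missing."""
    known = [v for v in values if v is not None]
    if not known:
        return None  # the result is itself "missing", not zero
    return sum(known) / len(known)


# Hypothetical nutrient values for three samples; the second is missing.
samples = [12.0, None, 8.0]

# Treating the missing value as missing keeps the average at the level
# of the measured samples.
correct = mean_ignoring_missing(samples)

# Replacing the missing value with zero silently drags the average down.
biased = sum(0.0 if v is None else v for v in samples) / len(samples)
```

A true measured zero and an unknown value then remain distinguishable in every later calculation.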

In general, the preferred method for data acquisition is by analysis, while estimation of data by analogy with similar foods or from the composition of ingredients is not preferred; however, a nutrient level estimated from the contributions of several ingredients may well be better than an analysis that was poorly done. Moreover, the term "quality" is not necessarily equivalent to "accuracy", but must include some aspect of "suitability for purpose". (See page 12.)

2. Analysing foods

Obtaining foods
Adequacy of analytical methodology
Calculations to derive nutrients

Given unlimited resources, direct analysis of foods is the preferred method of obtaining food composition data for specific foods of interest. However, analysis has the major drawbacks of time and expense involved. Anyone contemplating this approach must be aware of the magnitude and complexity of such an effort and the practical experience and assistance required to do it properly. There are many references on this topic; the most relevant are the AOAC manuals [29, 104] and the INFOODS manual Guidelines for the Production, Management, and Use of Food Composition Data [26]. Others are Stewart and Whitaker's Modern Methods of Food Analysis [87], Southgate's Guidelines for the Preparation of Tables of Food Composition [84], Oiso and Yamaguchi's Manual for Food Composition Analysis [57], Aurand et al.'s Food Composition and Analysis [7], Jacob's The Chemical Analysis of Foods and Food Products [40], and Pearson's The Chemical Analysis of Foods [60].

Generating analytic data for use in a food composition data base consists of several logically distinct steps: sampling foods representative of the population of foods of interest, preparing the samples and performing the analyses, and recording and preparing the final data. Each of these steps needs careful planning, documentation, and execution, whether done within the data base compiler's own laboratory or in another. A detailed plan of work must be developed by the compiler with the assistance of experts in the various areas and with the main users of the table. In addition, there must be frequent review and discussion between the compiler and the analytic laboratory to ensure that there is no disparity between the information that is wanted and the information that is provided.

Obtaining foods

A sample is defined (statistically) as one or more items taken from a population in a way to ensure that they are "representative" of that population. In food composition work, the population of interest would be a type of food; sampling describes how examples of that food are collected for analysis. Definition of the food must be complete enough to determine whether the aspects of brand names, cultivars, growing region, preservation, form, etc., should be considered in drawing the sample. (See the discussion in chapter 5 on how foods differ.) After defining the food, it is then necessary to collect a sample of food from across the spectrum of those factors. The number of individual items to be drawn is a function of the precision which is desired. The care with which the sample is taken, in terms of how well it represents the population of interest, is a major factor in the accuracy of the data. Additional discussion of extensive sampling protocols may be found in Greenfield and Southgate [26] and Holden et al. [34]. Some sampling plans are specified by national regulations; for example, the sampling plan required by the US Food and Drug Administration for nutrition labelling is found in reference 56.

Adequacy of analytical methodology

Choice of method is perhaps the most important aspect of analysing foods, although a good method will not compensate for a bad sampling plan. With the latter, one measures the wrong foods very well. The following represents only an outline of the considerations in choosing an analytic method. The INFOODS manual Guidelines for the Production, Management, and Use of Food Composition Data [26] discusses this topic in detail.

The methods chosen must be generally ACCEPTED, RELIABLE, and PRACTICAL. Even some methods that have been generally accepted in the past are now understood to produce inaccurate and imprecise data (e.g., colorimetric analysis for cholesterol). These methods should be avoided despite their past popularity. Reliability includes such attributes as accuracy, precision, specificity, and detectability (sometimes, although incorrectly, called sensitivity).

The accuracy and precision of an analytic method must be addressed separately. (See page 10.) The accuracy of a method should be verified by use of standardized or certified reference materials if these are available. Day-to-day accuracy must be monitored by analysis of quality control material. It is important that variability due to the method be minimized so that food composition data users can assess the variability of the nutrients in the food. Since these two sources of variation cannot be separated by examining a single value, or even a set of values, analysts should publish estimates of analytical error, and those estimates should be available when evaluating data.

The specificity of a method refers to its ability to measure only the specific nutrient for which it is being used, while the detectability, or limit of detection, of a method is how little of the nutrient need be present before it can be measured; both of these aspects should be considered in terms of the uses that will be made of the data.

The practical aspects of analytic methods include costs (space, equipment, consumable supplies, personnel time, and training), dependability, and safety. These tend to be country- and laboratory-specific, and are often very real determinants in the choice of methods.

Calculations to derive nutrients

Because of the difficulty and cost of directly measuring some nutrients, many data base nutrient values are calculated from other analytic measurements. This procedure is used by virtually all compilers for some nutrients (such as energy and protein) and by some compilers for various other nutrients. Some of the calculations most commonly performed are briefly described below. Fuller details are contained in Greenfield and Southgate [26]. One of the complexities of discussing nutrients is that often the same name for different components is used in different food composition data bases (see below). An INFOODS listing of the "nutritionally distinct" compounds appears in Identification of Food Components for INFOODS Data Interchange [44].


Carbohydrate

The term carbohydrate, as used in food composition data bases, includes a number of different carbohydrate species. While for many purposes measurement of total carbohydrate is not useful, it commonly appears in tables and is used to calculate energy values. However, there are several different ways that total carbohydrate is calculated: the two basic methods are "by difference" and "by sum". Additionally, tabulated values are either "available" or "total" carbohydrate, depending on whether dietary fibre has been included. Difficulty arises when these different calculations are indistinguishably labelled as carbohydrate.

In the United States, values for carbohydrate are usually calculated "by difference". For each food, the value for carbohydrate is estimated as follows (all values are per 100 g):

Carbohydrate = 100 g - water - protein - fat - ash - alcohol

This method includes indigestible carbohydrate in the carbohydrate component, as well as any other components of a food item that are not measured as water, protein, fat, ash, or alcohol. This method yields only an approximation of carbohydrate.

Other data bases use carbohydrates "by sum", where individual carbohydrate components (sugars, starches, oligosaccharides, and dietary fibre) are measured and summed. There are various ways of doing this. (See Paul and Southgate [59] for a description of the British procedure for measuring "available" carbohydrate and expressing it in terms of monosaccharides.) Carbohydrate values calculated by these different methods may not be comparable.

EXAMPLE: Carbohydrate content of raw carrots (all values are per 100 g):

Total carbohydrate by difference: 100 g carrots - 87.8 g water - 1.0 g protein - 0.2 g fat - 0.9 g ash = 10.1 g carbohydrate.

Available carbohydrate by difference: 100 g carrots - 87.8 g water - 1.0 g protein - 0.2 g fat - 0.9 g ash - 3.2 g dietary fibre = 6.9 g carbohydrate.

Available carbohydrate by sum: 5.4 g carbohydrate.
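The "by difference" calculation can be sketched in a few lines. This is an illustration only: the function and parameter names are assumptions, and the optional subtraction of dietary fibre is one way of approximating "available" carbohydrate, as in the carrot example.

```python
from typing import Optional


def carbohydrate_by_difference(
    water: float,
    protein: float,
    fat: float,
    ash: float,
    alcohol: float = 0.0,
    dietary_fibre: Optional[float] = None,
) -> float:
    """Carbohydrate in g per 100 g of food, estimated by difference.

    If dietary_fibre is given it is also subtracted, which yields an
    estimate of "available" rather than "total" carbohydrate.
    """
    carbohydrate = 100.0 - water - protein - fat - ash - alcohol
    if dietary_fibre is not None:
        carbohydrate -= dietary_fibre
    return carbohydrate


# Raw carrots, using the values quoted in the example (g per 100 g).
total = carbohydrate_by_difference(water=87.8, protein=1.0, fat=0.2, ash=0.9)
available = carbohydrate_by_difference(
    water=87.8, protein=1.0, fat=0.2, ash=0.9, dietary_fibre=3.2
)
print(round(total, 1), round(available, 1))
```

Because every analytical error in water, protein, fat, and ash accumulates in the result, a data base entry computed this way should be labelled as calculated by difference.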


Protein

Since it is relatively easy to measure the nitrogen content of a food (by the Kjeldahl method) and fairly difficult to measure protein, it is usual to use nitrogen values as a basis for reported protein values, converting to protein either by a general factor of 6.25 g protein per gram of nitrogen or by more specific factors for different food types (e.g., Jones [41] and the German tables [80]). Since the amount of nitrogen varies among amino acids, and amino acid composition varies among foods (there is also nonprotein nitrogen in many foods), the food-specific factors will give a more precise estimate of the protein content; however, this is still an approximation.

EXAMPLE: Protein content of almonds (data from China [16]):

In a daily food pattern, over- and under-estimations due to using a general factor will tend to level out. Even with vegetarian diets, results obtained with a general factor differ from those obtained with specific factors by less than 5% [101].
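The conversion from nitrogen to protein can be sketched as follows. The general factor of 6.25 comes from the text. The function name, the nitrogen value, and the food-specific factor of 5.0 are hypothetical placeholders, not values taken from Jones [41] or any table.

```python
GENERAL_NITROGEN_FACTOR = 6.25  # g protein per g nitrogen (general factor)


def protein_from_nitrogen(
    nitrogen_g: float, factor: float = GENERAL_NITROGEN_FACTOR
) -> float:
    """Estimate protein (g per 100 g) from Kjeldahl nitrogen (g per 100 g)."""
    if nitrogen_g < 0 or factor <= 0:
        raise ValueError("nitrogen must be non-negative and factor positive")
    return nitrogen_g * factor


# Hypothetical food with 2.0 g nitrogen per 100 g.
with_general_factor = protein_from_nitrogen(2.0)
# The same food with a hypothetical food-specific factor of 5.0.
with_specific_factor = protein_from_nitrogen(2.0, factor=5.0)
```

Whichever factor is applied, recording it alongside the protein value lets a later user recover the underlying nitrogen figure.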


Energy

Metabolizable energy values are usually calculated, since the standard method for measuring the energy content of a food (bomb calorimetry) does not reflect the energy actually available to the body. The Atwater system [6, 51] uses factors to estimate available energy from the protein, fat, carbohydrate, and alcohol components of food items. Errors in the analyses of the four components, or in the determination of the factors, will be reflected as errors in the energy content. The general Atwater factors are used by some data base developers:

Energy (kcal) = (4 kcal/g x g protein) + (4 kcal/g x g carbohydrate) + (9 kcal/g x g fat) + (7 kcal/g x g alcohol)

The specific Atwater factors reflect variations among food groups and should provide more accurate estimates of the true available energy. These specific energy factors are now commonly used to calculate energy values. A comparison of the effect of the two methods on the energy value of several foods can be found in Merrill and Watt [51]. In the USDA tables [96], this is documented by including a listing of the factors that have been used for the individual foods in each section (food group).

Different methods of determining any of the energy components can clearly affect the value given for the total energy content of a food item. For example, the British food tables [59] express carbohydrate as available monosaccharides. The developers of these tables chose to use the general factors of 4 kcal/g for protein, 9 kcal/g for fat, and 7 kcal/g for alcohol, but to use 3.75 kcal/g as the factor for carbohydrate since there is less energy in the monosaccharide forms. (See the British tables for further discussion.)
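A minimal sketch of the general-factor calculation follows. The factors of 4, 9, and 7 kcal/g and the alternative carbohydrate factor of 3.75 kcal/g are those given in the text. The function name and the component values are hypothetical.

```python
def energy_kcal(
    protein: float,
    fat: float,
    carbohydrate: float,
    alcohol: float = 0.0,
    carbohydrate_factor: float = 4.0,
) -> float:
    """Energy in kcal per 100 g from components given in g per 100 g.

    carbohydrate_factor defaults to the general Atwater factor; pass 3.75
    when carbohydrate is expressed as available monosaccharides.
    """
    return (
        4.0 * protein
        + 9.0 * fat
        + carbohydrate_factor * carbohydrate
        + 7.0 * alcohol
    )


# Hypothetical food: 2 g protein, 1 g fat, 10 g carbohydrate per 100 g.
general = energy_kcal(2.0, 1.0, 10.0)
# Same amounts, but with carbohydrate expressed as monosaccharides.
monosaccharide_basis = energy_kcal(2.0, 1.0, 10.0, carbohydrate_factor=3.75)
```

The carbohydrate factor has to match the way carbohydrate was determined; applying 4 kcal/g to a monosaccharide-basis value, or 3.75 kcal/g to a by-difference value, mixes two conventions.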

Recently, the Atwater system was reviewed by the Federation of American Societies for Experimental Biology [3], who concluded "the Atwater system provides estimates of metabolizable energy within the limits of accuracy for measuring food intake and also within the predictive limits of food composition tables". A general review of the methods used in different countries to calculate energy values can be found in Perisse [63].

EXAMPLE: Energy content of raw apples:

An additional complexity with energy values is that they are alternatively expressed in either kilojoules or kilocalories. Conversion from one to the other may be performed either on the final value (4.184 kJ per kcal) or on the individual components (17 kJ per g protein, 37 kJ per g fat, 16 kJ per g carbohydrate, and 29 kJ per g alcohol). It should be recognized that, because of rounding, these two methods sometimes give slightly different values.
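The two conversion routes can be compared with a short sketch. The conversion constant and the per-component kilojoule factors are those quoted above. The function names and the component values are hypothetical.

```python
KJ_PER_KCAL = 4.184


def energy_kj_from_kcal(kcal: float) -> float:
    """Convert a final energy value from kilocalories to kilojoules."""
    return kcal * KJ_PER_KCAL


def energy_kj_from_components(
    protein: float, fat: float, carbohydrate: float, alcohol: float = 0.0
) -> float:
    """Energy in kJ per 100 g using the rounded per-component kJ factors."""
    return 17.0 * protein + 37.0 * fat + 16.0 * carbohydrate + 29.0 * alcohol


# Hypothetical food: 2 g protein, 1 g fat, 10 g carbohydrate per 100 g.
kcal = 4.0 * 2.0 + 9.0 * 1.0 + 4.0 * 10.0  # general Atwater factors
route_final_value = energy_kj_from_kcal(kcal)
route_components = energy_kj_from_components(2.0, 1.0, 10.0)
difference = route_final_value - route_components
```

Since the per-component factors are rounded, the two routes need not agree, so the route used should be documented with the data.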

EXAMPLE: Energy value of dark beer in kilojoules, using data from the German tables [80]:

Dietary Fibre

Total dietary fibre can be determined directly or calculated from its various fractions. Often these fractions are also included in the data base (e.g., insoluble fibre or neutral detergent fibre, and soluble fibre [pectins, gums, soluble hemicelluloses]), since these fractions have known physiological importance. Alternatively, one of the fractions may be calculated as the difference between total fibre and the other fraction (e.g., soluble fibre as the difference between total and insoluble fibre). The multiple definitions and methods for determining dietary fibre values analytically make these types of calculations complex, and care must be used to ensure they are done correctly. If dietary fibre is of interest to the compiler of a food composition data base, a food chemist with expertise in this area should be consulted. See Lanza and Butrum [45], Stasse-Wolthuis [85], and Trowell et al. [92] for reviews of the complexities concerning dietary fibre. A recent collection of data from the United States appears in reference 98. Dietary fibre values have been available for many years in the British [59], German [79, 80], and other tables.

There is no accepted method for calculating dietary fibre from crude fibre. The ratios of crude to dietary fibre are highly variable among food items, and even among diets.

Vitamin A

Vitamin A is an example of a nutrient that has multiple chemical forms, each of which has differing biological activity. In such a case it is often necessary to assign different availability factors to these different forms in order to calculate the total nutrient value. Commonly, total vitamin A activity is expressed in retinol equivalents (RE) as a function of retinol, beta-carotene, and other carotenoids. The recommended factors are one for retinol, one-sixth for beta-carotene, and one-twelfth for other carotenoids.

Vitamin A activity (RE) = µg retinol + 1/6 µg beta-carotene + 1/12 µg other carotenoids
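The retinol equivalent formula can be sketched as a small function. It assumes that all three inputs are expressed in micrograms per 100 g; the function name and the sample amounts are hypothetical.

```python
def retinol_equivalents(
    retinol_ug: float,
    beta_carotene_ug: float = 0.0,
    other_carotenoids_ug: float = 0.0,
) -> float:
    """Total vitamin A activity in retinol equivalents (RE).

    Uses the factors given in the text: 1 for retinol, 1/6 for
    beta-carotene, and 1/12 for other carotenoids.
    """
    return retinol_ug + beta_carotene_ug / 6.0 + other_carotenoids_ug / 12.0


# Hypothetical food with 100 µg retinol, 600 µg beta-carotene, and
# 1200 µg other carotenoids per 100 g.
total_re = retinol_equivalents(100.0, 600.0, 1200.0)
```

Storing the three components separately, rather than only the total, allows the total to be recalculated if different factors are later preferred.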

Some data bases contain the individual components which permit calculation of the total value directly, while others contain only the total (with indication of the factors used). The specific values desired must be chosen by the compiler on the basis of the purposes for which the data base will be used.

Some food composition tables continue to report vitamin A in international units (IUs). This system does not adjust for absorption of the different forms of vitamin A and cannot be directly converted to RE for many foods. Since requirements are now expressed in retinol equivalents, use of the older units is no longer appropriate if the data base will be used to estimate dietary adequacy.

Vitamin E

Vitamin E is another example of a nutrient with a number of different forms (the tocopherols and tocotrienols), each with different biological activity. Here the choice is to measure and present the major forms of interest, to calculate and present a total vitamin E activity in "alpha-tocopherol equivalents", or both. Unfortunately, it is not possible to calculate alpha-tocopherol equivalents accurately if the mix of isomers is unknown, and further, the validity of these factors for all types of food is not known. In mixed diets, beta-tocopherol has approximately one-half the activity of alpha-tocopherol, while gamma-tocopherol has only one-tenth the activity. The 1989 edition of the US Recommended Dietary Allowances [54] provides additional details.
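A sketch of the alpha-tocopherol equivalent calculation, restricted to the three forms for which the text gives relative activities, is shown below. The dictionary keys, the function name, and the sample amounts are assumptions. Forms without a stated factor are rejected rather than guessed at.

```python
from typing import Dict

# Relative activities quoted in the text for mixed diets.
ALPHA_TE_FACTORS: Dict[str, float] = {
    "alpha-tocopherol": 1.0,
    "beta-tocopherol": 0.5,
    "gamma-tocopherol": 0.1,
}


def alpha_tocopherol_equivalents(amounts_mg: Dict[str, float]) -> float:
    """Sum vitamin E activity (mg alpha-tocopherol equivalents)."""
    total = 0.0
    for form, amount in amounts_mg.items():
        if form not in ALPHA_TE_FACTORS:
            # No factor is available, so the total cannot be computed reliably.
            raise ValueError(f"no activity factor available for {form!r}")
        total += ALPHA_TE_FACTORS[form] * amount
    return total


# Hypothetical amounts in mg per 100 g.
example = alpha_tocopherol_equivalents(
    {"alpha-tocopherol": 2.0, "beta-tocopherol": 1.0, "gamma-tocopherol": 5.0}
)
```

Raising an error for an unlisted form reflects the point made above: the equivalents cannot be calculated accurately when the mix of isomers is not fully known.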


Niacin

Analytic values for the niacin content of a food item are usually expressed as mg of preformed niacin. However, dietary guidance is usually offered in terms of mg of niacin equivalents, since some tryptophan (an essential amino acid) can be converted to niacin. The amount that can be converted depends on the total protein and energy in the diet, on the mix of amino acids in the diet, and on the nutritional status of the individual. However, a figure of 1 mg niacin per 60 mg tryptophan is considered average, and some data base compilers calculate niacin equivalents as the sum of preformed niacin and one-sixtieth of the tryptophan in a food item.

EXAMPLE: Niacin equivalents in dinner rolls made with baking powder:

Niacin equivalents = 2.7 mg preformed niacin + 1/60 x 90 mg tryptophan = 4.2 mg per 100 g.

If no value for tryptophan is available, it is sometimes assumed that approximately one per cent of protein is tryptophan [58].
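The niacin equivalent calculation, including the fallback of estimating tryptophan as one per cent of protein, can be sketched as follows. The function and parameter names are assumptions. The dinner roll figures are those of the example above.

```python
from typing import Optional


def niacin_equivalents(
    preformed_niacin_mg: float,
    tryptophan_mg: Optional[float] = None,
    protein_g: Optional[float] = None,
) -> float:
    """Niacin equivalents in mg per 100 g.

    If tryptophan is not known, it is estimated as 1% of protein,
    which requires protein_g (g per 100 g) to be supplied.
    """
    if tryptophan_mg is None:
        if protein_g is None:
            raise ValueError("either tryptophan_mg or protein_g is required")
        tryptophan_mg = protein_g * 1000.0 * 0.01  # 1% of protein, in mg
    return preformed_niacin_mg + tryptophan_mg / 60.0


# Dinner rolls from the example: 2.7 mg niacin and 90 mg tryptophan.
rolls = niacin_equivalents(2.7, tryptophan_mg=90.0)
print(round(rolls, 1))
```

A value obtained through the protein fallback is an estimate built on an estimate, and is best flagged as such in the data base.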

3. Calculating representative data

Preliminary examination of the data
Two or more populations
Summary statistics

Often a data compiler has several data points which must be summarized into a single value for the data base: multiple values representing a single nutrient in a single food. This arises when one has analyses for each of a number of different samples of the same food; when one has data from several sources, each describing the same food; and when one must combine data on different food items into an entry for a "generic" food (e.g., one needs an entry for apples and has individual data on each of several different varieties).

Multiple measurements provide valuable information both on the level of a nutrient in a food and on the variability of that nutrient in different samples of the food. While it is important for the results of the individual analyses to be maintained in some reference data base, for many purposes summary statistics of these data will be required. Whatever statistics are presented, it is important to realize that the numbers included in the data base are not directly analytic but are produced from the analytic data by certain manipulations, and that these manipulations must be carefully and completely documented.

Preliminary examination of the data

Before any calculations are made, it is important to examine the data to see whether they can be considered to represent a single biological population. There are two pertinent questions: Are there errors in the data? Could the data have come from two or more distinct foods? In addition, an inspection of the data should give insight as to what sort of a statistical distribution might best describe the sample (could it be symmetric about some mean value or perhaps skewed?). While these questions are very difficult to answer with only a moderate sample size, they are very important.


It is possible that some data points may have resulted from problems of sample collection, laboratory procedure, or transcription. Standard statistical tests exist for single, potentially errant values [27]. While these tests tend to be quite conservative (retaining data when in doubt), it is recommended that the compiler follow them, leaving suspected data points in a set unless they can be statistically rejected or independent evidence can be found to explain their deviation.

Two or more populations

There are two basic situations in which the question of underlying multiple populations arises: when examination of the data suggests that there are two or more populations represented in data that are all identified as being the same food, and when the compiler knows, from external evidence, that there are two or more populations which must be aggregated.

Apparent Multiple Populations

In examining data which all have the same identification, it often appears that there are two (or more) distinct populations. No statistical or graphical procedure that examines the data from a single collection can ever completely assure the compiler that there are, or are not, multiple populations, and sophisticated judgement must ultimately be combined with whatever procedures are used. If there are multiple populations, the decision as to whether or not to split them may depend on such external considerations as project objectives as well as on the data. Separating multiple populations is normally the better choice in reference data bases because it is feasible to combine them later, while separation of aggregated summary values is typically impossible.

Statistical determination of whether multiple populations are present in a data set, and where the boundary between them falls, is an active area of statistical research. In general, the data should be examined visually or by preliminary statistical tests. Suitable graphical procedures are discussed by several authors [15, 17, 94]. It is often helpful to use known parameters such as cooking method, cut of meat, preservation state (frozen vs. fresh), part of plant, and other relevant variables for graphing. If distinct populations are suspected as a result of these examinations, the circumstances of data collection and the possible nutritional significance of distinct subsets should be investigated as part of deciding whether to assume a single population or to begin the often tedious search to identify the subsets and their boundaries. Without clear-cut guidance from the data themselves, the compiler should work with the data as if they were derived from a single population of food.

Multiple Different Populations

A data base compiler is sometimes confronted with two (or more) different sets of data (e.g., from different sources) which may or may not represent the same food. If the variables under consideration have reasonably normal (Gaussian) distributions, this is a straightforward statistical question which can be answered by use of Student's t-test, or Hotelling's T-squared test, or some form of analysis of variance [2, 82]. It is recommended that the results of these tests be evaluated through visual inspection of the data sets, with the final decision resting with food specialists rather than with statisticians.
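As an illustration of the simplest of these tests, the sketch below computes the two-sample Student's t statistic with a pooled variance, using only the Python standard library. It assumes roughly normal data and equal variances in the two sets. The function name and the sample values are hypothetical, and the statistic still has to be compared with a t distribution table for the returned degrees of freedom.

```python
import math
import statistics
from typing import Sequence, Tuple


def pooled_t_statistic(
    sample_a: Sequence[float], sample_b: Sequence[float]
) -> Tuple[float, int]:
    """Return (t statistic, degrees of freedom) for two independent samples."""
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two observations")
    df = n_a + n_b - 2
    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled_var = (
        (n_a - 1) * statistics.variance(sample_a)
        + (n_b - 1) * statistics.variance(sample_b)
    ) / df
    if pooled_var == 0:
        raise ValueError("both samples have zero variance")
    standard_error = math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / standard_error
    return t, df


# Hypothetical nutrient values (per 100 g) from two sources.
t_value, degrees_of_freedom = pooled_t_statistic([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

With samples as small as these the test has little power, which is one reason the text leaves the final decision to food specialists.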

Data Aggregation

Often, it is necessary to aggregate data for several different brands or varieties of a food in order to produce a single entry in a data base, perhaps to produce data for a "generic" food. The standard procedure is to calculate a "weighted" average, a procedure which permits the component data sets to have differing importance in determining the ultimate result. Thus, if one had two kinds of potatoes and wanted an average value that could be used with results from a consumption survey where the respondents did not know which type of potato they consumed, one would want the data to reflect the probability of consuming one type over the other. This can be resolved by "weighting" the data by the consumption pattern or the market share.
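A weighted average of this kind can be sketched briefly. The function name, the nutrient values, and the market shares are hypothetical.

```python
from typing import Sequence


def weighted_mean(values: Sequence[float], weights: Sequence[float]) -> float:
    """Average of values, each weighted by its relative importance."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Two hypothetical potato types with nutrient values of 10 and 20 units
# per 100 g, holding 75% and 25% of the market respectively.
generic_potato = weighted_mean([10.0, 20.0], [0.75, 0.25])
```

The weights need not sum to one, because the function normalizes them; tonnages or survey counts can be passed directly.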

The question of variability is more complex, with either a "pooled" or an "overall" variability to be calculated. The question of exactly what to do depends fundamentally on the ultimate purpose of the data base, and at the present time no general recommendation can be made except to carefully document what has been done.

Shape of the Data Distribution

Given that a single population can be assumed, it is very useful to characterize the shape of the distribution. Here the compiler must decide if the data seem to have come from a standard normal (Gaussian) or bell-shaped curve, or if another shape is underlying it (usually skewness). This is an important consideration in the choice of descriptive statistics. While there are statistical tests for normality, these are not powerful when used with small to moderate data sets. Until further research is done into the general questions of distribution of nutrients, data sets which do not look obviously non-normal should be considered normally distributed, but considerable caution should be exercised.

Summary statistics

The aspects of a data set that are commonly reported in a summary of the data are the number of samples, the "central" value of the data, and the variability of the data. The actual statistics which best summarize these aspects of the data currently represent an open area of food composition data research, an area which is only now beginning to be explored. The recommendations given below represent suggestions which are quite general and should be expected to be revised and expanded as more effort is focused on this area. These statistics, or equivalent distribution information, are most important for reference data bases and ideally should be supplied with any data base from which a user may need to reinterpret, rather than simply use, the data values.

Number of Samples

Every cell of a data base should have an assigned value that represents the number of independent observations which have gone into the calculation of the summary statistics.

This is straightforward if the compiler has data that represent a single population. In cases where the compiler is working with data that have already been summarized (e.g., means for several different samples), assigning values to cells becomes complicated. Because of the variety of different possible situations, each must be handled separately, and the end users must be provided with access to the details. It is recommended that the sample size indicated in the data base should reflect the total number of data points contributing to the statistics, and that information should be included to describe what this number actually represents.

Central Value

If the compiler has only a few data points, or the distribution appears to be "not different" from the normal (Gaussian), the single "representative" number used to denote the level of a nutrient in a single food should be the arithmetic mean (the usual average). If the distribution is skewed, then the median (the value that lies in the middle of the sample) or the geometric mean (assuming a log-normal distribution) should be presented, and this should be explicitly noted. It is rare that there will be sufficient data points to empirically determine the shape of the distribution with any confidence. When the data come from several distinct sources (laboratories or cultivars), the average of these averages should be calculated. This may be done by assuming that each data source is equivalent to the others, or by weighting the component averages. Weights are often assigned on the basis of the sample sizes, thus assuming that sources with more samples are better than those with fewer samples. Alternatively, a relative contribution, such as one based on market availability, can be used to determine weights.
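The three candidate central values can be computed with the Python standard library, as in the sketch below. The function name and the sample data are hypothetical, and the geometric mean is defined only for strictly positive values.

```python
import statistics
from typing import Dict, Sequence


def central_values(data: Sequence[float]) -> Dict[str, float]:
    """Arithmetic mean, median, and geometric mean of a set of values."""
    if not data:
        raise ValueError("at least one data point is required")
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        # Appropriate if a log-normal distribution is assumed.
        "geometric_mean": statistics.geometric_mean(data),
    }


# Hypothetical right-skewed data: the mean is pulled above the median.
summary = central_values([1.0, 2.0, 4.0, 8.0, 64.0])
```

Whichever statistic is stored in the data base, its name should be stored with it, since a user cannot tell a median from a mean by looking at the number.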


Variability

Many food composition data users require information on nutrient variability. This is useful in a general sense to know how variable a nutrient source is and in a specific situation to know how extreme a value could be. For example, one might wish to estimate how much of a food must be provided to ensure that a certain nutrient level would likely be achieved in a hospital feeding programme. In the case of a known normal distribution, variability can be summarized by the standard deviation (together with the mean). However, this is often inappropriate when dealing with nutrients and foods that have not been well studied. Until such research is carried out, consideration should be given to providing estimates of extreme percentiles. If there are too few sample points available to compute a reliable standard deviation, or if a skewed distribution is suspected, mid-spread or fourth-spread, or the closely related interquartile range [32] should be considered, as should the alternative, used in the Chinese tables [16], of simply listing the data points. As estimators of population spread, these alternatives are less sensitive to extreme points than the range of a nutrient, especially when there are a small number of samples.
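The relative insensitivity of the interquartile range to extreme points can be illustrated with a short sketch. It relies on statistics.quantiles from the Python standard library (Python 3.8 or later) with the "inclusive" method; that choice of quartile convention is an assumption, and other conventions, including the fourths of [32], can give slightly different values. The data are hypothetical.

```python
import statistics
from typing import Sequence


def interquartile_range(data: Sequence[float]) -> float:
    """Difference between the third and first quartiles of the data."""
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return q3 - q1


def value_range(data: Sequence[float]) -> float:
    """Difference between the largest and smallest values."""
    return max(data) - min(data)


# Hypothetical data, and the same data with one extreme value.
typical = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_extreme = [1, 2, 3, 4, 5, 6, 7, 8, 90]

# The range reacts strongly to the extreme point; the interquartile
# range of these two data sets is the same.
spreads = {
    "typical": (interquartile_range(typical), value_range(typical)),
    "with_extreme": (interquartile_range(with_extreme), value_range(with_extreme)),
}
```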

The calculation of standard deviation is straightforward if the individual data points are available. If a compiler is working with individual component means and standard deviations, then the appropriate action depends on both what is available and what is wanted. As in the case where all the individual data points are considered to provide equal information (all samples from the same population are equally important), a "pooled" standard error should be calculated. If standard deviations are available for samples which are essentially different (as in the case of differing samples for a generic food), then both a "between" and a "within" standard deviation should be calculated.
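When only component sample sizes, means, and standard deviations are available, the calculations might be sketched as follows. This is one common formulation, not necessarily the one intended by the text: the "within" figure is a pooled standard deviation weighted by degrees of freedom, and the "between" figure is the unweighted standard deviation of the component means. The function names and values are hypothetical.

```python
import math
import statistics
from typing import Sequence


def pooled_sd(sample_sizes: Sequence[int], sds: Sequence[float]) -> float:
    """Within-group (pooled) standard deviation from component SDs."""
    if len(sample_sizes) != len(sds):
        raise ValueError("sample_sizes and sds must have the same length")
    degrees_of_freedom = sum(n - 1 for n in sample_sizes)
    if degrees_of_freedom <= 0:
        raise ValueError("need more than one observation in some component")
    sum_of_squares = sum((n - 1) * s ** 2 for n, s in zip(sample_sizes, sds))
    return math.sqrt(sum_of_squares / degrees_of_freedom)


def between_sd(means: Sequence[float]) -> float:
    """Standard deviation of the component means (unweighted)."""
    return statistics.stdev(means)


# Two hypothetical varieties of a generic food.
within = pooled_sd([3, 3], [1.0, 3.0])
between = between_sd([10.0, 14.0])
```

Reporting both figures shows whether most of the variability lies inside each variety or in the differences between varieties.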

4. Data from other sources

Where to find food composition data
Evaluation of data from various sources

Obtaining data from other sources, sometimes called "borrowing" data, refers to using data originally generated or gathered by someone else. This is the most frequent way of obtaining data for many special-purpose data bases, with the usual sources being the large reference data bases (such as those of the USDA [96], the United Kingdom [59], etc.). One problem with the data of others is that they are often incompletely described. However, borrowing of data is not only justified but essential when analyses are impractical (i.e., allocation of resources is not justified) and "good" data are available elsewhere. See Jacob [39] for an overview of this problem from the point of view of some social scientists.

Given the decision that certain data are needed and cannot be generated de novo, the two basic tasks facing the compiler are finding and evaluating the data.
