

Data base development

The compilation of a food composition data base involves a number of different tasks which require clear, careful planning and integration. To minimize waste and optimize selection, the purpose of the data base must be defined before assembling the data. Its ultimate use dictates both the content (the data) and the form (organization and medium) of the data base.

Once the data needed have been specified, it is necessary to identify the source and location of the data. This is discussed in detail in chapter 4. Parallel with these considerations is the handling of the data, both logically and physically. The merging of the data into the data base requires further decisions and planning, more fully discussed in part II.

For data bases expected to provide long-term use, it is essential to embed the effort in a strong, ongoing, institutional framework. This will provide the machinery for correcting entries and keeping the data base current by adding and deleting foods and modifying those nutrients whose data have changed. These changes include food reformulations by industry (based on consumer demand, regulatory changes including changes in fortification levels, etc.) as well as the appearance of new foods in the marketplace. Additionally, reported nutrient levels may change due to improved analytical methods.

One of the most important activities is documentation of what is done and why. This record, essentially an annotation of the data, is fundamental to most uses of the data base.

Finally, the compiler should make contact with individuals and organizations with previous experience in compiling food composition data bases. The individuals and organizations that produce data bases, as listed in the available directories and reports [28, 37, 55, 99, 103], will often provide advice and assistance to both compilers and users.

Content of the Data Base

The information in a particular data base depends on the expected uses of the data base. Its focus can be determined by answering questions about the data base: What foods should be included? What nutrients should be included? What sort of precision, accuracy, and description of the data will be required? And what information, in addition to composition of foods, will be needed?

The information required must be weighed against the cost of obtaining that information. Thus, early in its planning, any food composition data base effort must come to grips with the economics of the project. Cost estimates, at least approximate, should be calculated for each type of information to be included, and the different classes of information should be ordered in terms of priority so that informed decisions can be made.

Foods for Inclusion

The most important criterion for selecting foods for inclusion is that of relevance to purpose. Data bases compiled for use with a consumption survey would include foods likely to be consumed by the population under study. Extensive site visits may be required to assess agricultural products and market availability as well as to determine food intake patterns at home and in restaurants. Researchers interested in the relationship of diet to heart disease would need a data base that includes foods that contain nutrients or ingredients suspected of having a role in the promotion or prevention of that disease. Those responsible for monitoring contaminants or toxins might be interested in a data base listing foods that are commonly consumed or that contain significant amounts of components thought to be of public health significance.

A principal concern in selecting foods for inclusion in a food composition data base is the contribution of the foods to the diet.

Foods such as bread, rice, or corn consumed in large amounts by the population or subpopulation of interest, as well as those that supply large amounts of specific nutrients, should be among the first considered for inclusion in a data base.

Another concern when deciding on foods for inclusion is the level of aggregation that is needed. For many purposes, very specific data are needed, such as the variety or cultivar name, where the food was grown, or the name of the manufacturer (e.g., when analysing or designing a specific diet). Often, however, data on generic or aggregated foods are required, such as a generic apple or a prepared steak and kidney pie, when analysing a food consumption survey in which specifics were not collected or not recalled. For example, in national tables the entry for apples may be an average of data for several different varieties of apples, weighted by their representation in the market. Market representation, in turn, reflects the frequency of consumers' choices. It is essential that data base compilers decide whether such aggregated foods are necessary and then carefully, and explicitly, define them in terms of individual foods for which data are available. Alternatively, where information on foods that are prepared as mixed dishes (such as steak and kidney pie) is needed, it is often most efficient to prepare the dish by following a "representative" recipe and analyse that dish for the required food components. This has definite economic advantages over going into homes and shops, selecting different preparations, and running multiple analyses; however, it is important that some site visiting be done to ensure that the dish resulting from the chosen recipe is similar to what is actually eaten. This entire area of recipe variability and validation needs much further effort.
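The market-weighted averaging described above can be sketched in a few lines of Python. This is only an illustration: the function name, the variety names, the vitamin C values, and the market shares are all hypothetical and are not taken from any national table.

```python
def market_weighted_mean(values, shares):
    """Average per-variety nutrient values, weighted by market share.

    values: mapping of variety name -> nutrient level (e.g. mg per 100 g)
    shares: mapping of variety name -> market share (any positive scale)
    """
    if set(values) != set(shares):
        raise ValueError("values and shares must cover the same varieties")
    total_share = sum(shares.values())
    if total_share <= 0:
        raise ValueError("market shares must sum to a positive number")
    return sum(values[v] * shares[v] for v in values) / total_share


# Hypothetical vitamin C levels (mg per 100 g) and market shares.
vitamin_c = {"variety_a": 4.0, "variety_b": 6.0, "variety_c": 10.0}
share = {"variety_a": 0.5, "variety_b": 0.3, "variety_c": 0.2}

generic_apple = market_weighted_mean(vitamin_c, share)
```

With these made-up figures the generic entry comes to about 5.8 mg per 100 g. Recording the component varieties and their weights alongside the value keeps the aggregated food explicitly defined.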

Nutrients for Inclusion

A data base should contain information on the food components that potential users of the data base will need. Some data bases, such as those compiled for particular research studies, will have their nutrients completely specified in advance. Other data bases will be intended for more general use, and will therefore contain a wide range of nutrients. However, all data base compilations have economic constraints that preclude listing all food components. Selections must be made considering the specific nutrients for which there are identified needs, information on additional components that might expand the user community and extend the utility of the data base, and the projected cost of including each nutrient. Criteria for inclusion are the importance of individual components and the specificity and adequacy of the analytic methods.

Importance of individual components. The primary need of users of food composition data bases is for data on components that affect human health. This includes the proximate nutrients, as well as specific other components, such as fatty acids and trace minerals, that are related and relevant to some distinct area of concern. Some components are included in food composition data bases because they describe the food (e.g., water), or may be useful in checking the other data (e.g., ash).

Specificity of analytic methods. In recent years, new analytic methods have been developed which permit the separation, identification, and measurement of individual vitamins and compounds which were previously aggregated under single nutrient labels. The compiler must be aware of such significant changes in methods and the research requirements for data on specific forms of nutrients. The number of different definitions and relationships between nutrients can cause confusion unless there is careful definition of exactly what is intended. For example: "vitamin A activity" may include activity from beta-carotene [22], and "niacin" sometimes includes activity from tryptophan, while "fibre" and even "carbohydrate" have a number of alternative definitions. The data compiler must be very careful to determine exactly what to include, and define this in unambiguous terms. The problem is compounded by the fact that many nutrients are reported in several different forms which have different meanings (e.g., it is not possible to unambiguously calculate retinol equivalents for vitamin A from International Units), while others are merely different conventions (e.g., the conversion from kilocalories to joules involves only multiplication). (See the INFOODS guidelines on these topics [26, 93] for further discussion.)
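The difference between a change of convention and a change of meaning can be made concrete with a small Python sketch. The table layout and function name are hypothetical. The only factor used is the standard 4.184 kJ per kilocalorie, and the vitamin A entry is deliberately left without a factor because, as noted above, no single multiplier applies.

```python
KJ_PER_KCAL = 4.184  # thermochemical kilocalorie expressed in kilojoules

# Hypothetical conversion table: a numeric factor where the change is purely
# one of convention, None where more information about the food is needed.
CONVERSIONS = {
    ("kcal", "kJ"): KJ_PER_KCAL,
    ("kJ", "kcal"): 1 / KJ_PER_KCAL,
    ("IU vitamin A", "ug retinol equivalents"): None,
}


def convert(value, from_unit, to_unit):
    """Convert value between units, refusing ambiguous conversions."""
    key = (from_unit, to_unit)
    if key not in CONVERSIONS:
        raise KeyError(f"no conversion defined for {from_unit} -> {to_unit}")
    factor = CONVERSIONS[key]
    if factor is None:
        raise ValueError(
            f"{from_unit} -> {to_unit} cannot be done with a single factor"
        )
    return value * factor


energy_kj = convert(100.0, "kcal", "kJ")  # about 418.4
```

Making the ambiguous case raise an error, rather than return a guess, forces the compiler to record how such values were actually derived.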

Adequacy of analytic methods. Adequate analytic methods are required to obtain accurate and precise data; however, the current state of food analysis is, overall, very uneven. For some nutrient and food combinations, accurate and precise methods exist, while for other combinations, methods are non-specific, inaccurate, tedious, or expensive [8, 83, 86]. Moreover, good methods used carelessly without adequate quality control will generate poor data. Even when appropriate analytic methods are used on the same sample in different laboratories, there may be a wide spread in values [35]. Foods which are main sources of nutrients in a region should be analysed, and the methods used should be subject to quality control. The compiler must often choose to include or exclude certain nutrient values on the basis of whether adequate analytic methods exist or were employed.

Specific Form of the Data for Inclusion

Modes of expression. Nutrient values are frequently expressed per 100 grams of edible portion or per "common household measure". If household measures are used, the gram weight of the measure should be included. For specialized data bases other modes are often used, such as grams of amino acid per gram of protein.
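Moving between these modes of expression is simple arithmetic once the gram weight of the measure is recorded. The following Python sketch shows the two conversions mentioned above. The function names and all numbers are hypothetical and serve only as illustration.

```python
def per_household_measure(value_per_100g, measure_weight_g):
    """Convert a per-100-g value to the amount in one household measure."""
    if measure_weight_g <= 0:
        raise ValueError("measure weight must be positive")
    return value_per_100g * measure_weight_g / 100.0


def per_gram_protein(amino_acid_g_per_100g, protein_g_per_100g):
    """Express an amino acid as grams per gram of protein."""
    if protein_g_per_100g <= 0:
        raise ValueError("protein content must be positive")
    return amino_acid_g_per_100g / protein_g_per_100g


# Hypothetical food: 120 mg calcium per 100 g, and one cup weighing 240 g.
calcium_mg_per_cup = per_household_measure(120.0, 240.0)

# Hypothetical food: 0.8 g lysine and 10 g protein per 100 g.
lysine_g_per_g_protein = per_gram_protein(0.8, 10.0)
```

Without the gram weight of the cup, the first conversion cannot be reversed, which is why that weight should always accompany a household measure.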

Accuracy and precision of the data. The accuracy of nutrient measurements is a function both of the representativeness of the sample to the population of given food items and of the accuracy and precision of the analytic technique employed. Precision is primarily a function of how many samples are taken and how carefully (and which) analyses are conducted. Both accuracy and precision are directly related to cost and effort expended in design and execution of the sampling plan (for accuracy) and in numbers of replicates (for precision). The compiler must make the decision as to how accurate and precise the data should be on the basis of the uses to be made of them. For example, data to support analysis of large consumption surveys intended to estimate mean intakes require less precision than data to support analysis of individual diets where intake extremes are of interest.

Numbers for Inclusion

The numbers (data) included in a data base depend on the needs of the user. At a minimum, for each nutrient and food, there should be some measure of average value, some measure of the variability of the data, and some indication of the number of sample points upon which these statistics were based and how these data points were manipulated to arrive at the given statistics. The amount of a specific nutrient in a specific kind of food is variable, and the task is essentially one of summarizing the potentially variable data, i.e., the statistical distribution of the values. This topic is more fully discussed in chapter 3.
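As a rough illustration of this minimum set of statistics, the Python sketch below summarizes a list of replicate values using the standard library. The function name, the choice of the arithmetic mean and sample standard deviation, and the sample values are assumptions made for the example; a real data base might prefer medians, ranges, or other measures.

```python
import math
import statistics


def summarize(values):
    """Return count, mean, sample SD and standard error for one food/nutrient."""
    n = len(values)
    if n == 0:
        raise ValueError("no data points to summarize")
    mean = statistics.mean(values)
    # The sample standard deviation is undefined for a single observation.
    sd = statistics.stdev(values) if n > 1 else None
    se = sd / math.sqrt(n) if sd is not None else None
    return {
        "n": n,
        "mean": mean,
        "sd": sd,
        "se": se,
        "min": min(values),
        "max": max(values),
    }


# Hypothetical replicate analyses of one nutrient in one food (mg per 100 g).
summary = summarize([1.0, 2.0, 3.0, 4.0, 5.0])
```

Storing the count and the spread next to the mean lets a user judge how much weight a value can bear. A note on how the points were combined still has to be recorded separately.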

Special Conventions

In the planning stage, it is important to choose conventions of data recording that are consistent with the desired data base. Two important conventions are the number of decimal places to be used for each nutrient and the distinctions between and notations used to represent zero levels of a nutrient, trace levels of a nutrient, and nutrient data that are missing or not available.

The number of significant figures reported for the level of a nutrient should not mislead the user into believing that the data are more precise than they actually are.

Thus, the decision on the number of significant figures to be retained and displayed involves the consideration of the variability of the data themselves. In general, a change in the last significant digit of a statistic, such as the average level, should be of the same order of magnitude as the standard error of that statistic.
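One way to apply this rule of thumb is to round the mean at the decimal position of the leading digit of its standard error. The Python sketch below does this. It is one possible reading of the rule, not a prescribed procedure, and the function name and sample numbers are hypothetical.

```python
import math


def round_to_standard_error(mean, standard_error):
    """Round mean so its last digit has the magnitude of the standard error."""
    if standard_error <= 0:
        raise ValueError("standard error must be positive")
    # Position of the leading digit of the standard error:
    # 0.23 -> 1 decimal place, 30 -> tens place (decimals = -1).
    decimals = -math.floor(math.log10(standard_error))
    return round(mean, decimals)


# Hypothetical statistics for two nutrients.
reported_a = round_to_standard_error(12.3456, 0.23)   # rounds to 1 decimal
reported_b = round_to_standard_error(1234.5, 30.0)    # rounds to the tens
```

Here a mean of 12.3456 with a standard error of 0.23 is reported to one decimal place, and a mean of 1234.5 with a standard error of 30 is reported to the nearest ten.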

There are very important differences between data that are missing, values that are very small, and values that are zero. The data base must clearly distinguish between these situations, and the differences must be preserved as the data are collected.

For some purposes, it is possible to effect a further collapse into two categories, ZERO and MISSING, by imputing a value of ZERO to TRACE. This should be done only where the distinction is unambiguously unimportant. MISSING and ZERO, however, must always be kept distinct, with the numeral "0" never used to represent MISSING.

The above distinctions do not suggest specific symbols to be used in food tables and data bases. The usual alternatives are alphabetic and numeric codes. An argument often made against the use of alphabetic characters is that they may make computer processing of the data more complicated. However, with current computer technology, this should not be a major concern. The obvious alternative strategy is the use of negative, or large positive, integers. This has two serious disadvantages: scaling procedures that convert units based on a global default may distort these codes, and a user may accidentally treat the codes as nutrient levels, for example, by including them in an average.
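The risk of numeric codes can be avoided by storing the special states as distinct non-numeric markers. The Python sketch below is one hypothetical way to do this. The class name, the symbols "tr" and "-", and the averaging policy are assumptions for illustration, not an established convention.

```python
from enum import Enum


class Flag(Enum):
    """Non-numeric markers that cannot be mistaken for nutrient levels."""
    TRACE = "tr"
    MISSING = "-"


def mean_level(entries, trace_as_zero=False):
    """Average the numeric entries, never treating MISSING as zero."""
    numeric = []
    for entry in entries:
        if entry is Flag.MISSING:
            continue  # missing values carry no information about the level
        if entry is Flag.TRACE:
            if not trace_as_zero:
                raise ValueError("TRACE present; decide explicitly how to treat it")
            numeric.append(0.0)
        else:
            numeric.append(float(entry))
    if not numeric:
        return Flag.MISSING  # nothing measured, so the mean is missing too
    return sum(numeric) / len(numeric)


# A true zero, a measured value and a missing value: the mean is 1.0.
# Had MISSING been coded as -1, a naive average would give about 0.33.
level = mean_level([0.0, 2.0, Flag.MISSING])
```

Requiring an explicit `trace_as_zero=True` mirrors the caution above: collapsing TRACE into ZERO is a decision the user makes knowingly, while MISSING can never enter the arithmetic.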

Ancillary Information

Food composition data are used in conjunction with a variety of other types of data. To enhance its utility, a food composition table will often contain sets of ancillary information, as separate tables or text. For example:

A data base compiler will have to decide what, if any, ancillary information to include, and then plan the gathering and integration of this information so that it will be most useful to the users. An important aspect is organizing the data files so that there is natural correspondence between the ancillary data and the actual composition data (e.g., the same units, the same names, etc.), with sufficient cross-referencing.

Information on Quality

Information on the sources of data does not necessarily provide information on the quality of those data. It has been proposed that food composition data bases should include a code indicating the "quality" or "confidence" of each data point generated by the compiler. Although data quality is very difficult to define, research has been conducted on the form and utility of quality codes for food composition data (e.g., [21, 81]).

Work in this area is difficult for two fundamental reasons. First, quality is not one dimensional. It should at least include aspects of accuracy, precision, and representativeness. Second, reporting data quality is sometimes dependent on the use to be made of the data. In particular, the question of the representativeness of data can only be evaluated in the context of "representative of what?" and therefore differs from application to application.

Work done thus far suggests that it is possible to derive quality codes in certain situations. Analytic data can be ranked according to method used, sample handling, and quality control. Values derived, directly or indirectly, from analytic values can also be ranked, but additional, less objective, factors must be considered:

It is important that the criteria for evaluation be clearly presented so that the user can assess the data included in the computation of the ranking. For examples of tables of nutrient values which provide confidence codes see Schubert et al. [81] for selenium and Exler [21] for iron. The ultimate utility of these rankings (as translated into confidence or quality codes) is obvious; for example, a user analysing diets could flag nutrient totals that included data with confidence codes below a selected limit. However, much work is needed before such codes can be widely used.
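The flagging of nutrient totals mentioned above might look like the following Python sketch. The numeric ranking (1 to 3, with 3 the most reliable), the function name, and the sample values are hypothetical. Published confidence codes use their own scales and criteria.

```python
def total_with_flag(items, min_confidence):
    """Sum nutrient amounts and report whether any value is below the limit.

    items: list of (amount, confidence_rank) pairs, higher rank = more reliable
    """
    total = sum(amount for amount, _ in items)
    low_ranks = [rank for _, rank in items if rank < min_confidence]
    return total, bool(low_ranks)


# Hypothetical selenium contributions (micrograms) from three foods in one
# day's diet, each with an assumed confidence rank.
day = [(12.0, 3), (5.5, 2), (20.0, 1)]
total, flagged = total_with_flag(day, min_confidence=2)
```

In this example the total of 37.5 micrograms is flagged because one contributing value has a rank below the chosen limit of 2.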


A vital factor in food composition data bases is arrangement of the data for easy access. The presentation of food composition data is necessarily a function of how the data will be used.

In REFERENCE data bases (see page 6), all the data and information are preserved, although they are often not readily accessible. Usually a subset of these data bases is available for distribution, with the additional data and information (replicate observations, exact sources of data, etc.) remaining in the files and available on special request.

In SPECIAL-PURPOSE data bases (see page 7), the data are often presented with only pertinent information attached; the very nature of the application obviates the need for complete information. Such a data base is sometimes embedded in a computer program or system, but often exists only in a printed version, with an abbreviated introduction. Of paramount importance in SPECIAL-PURPOSE data bases are references to the location of more complete information.

Most food composition data bases are presently available as printed documents. This medium permits a variety of organizational plans and a very flexible range of information that can be included. The usual organization includes an extensive introduction that describes and discusses the contents, the major table with varying amounts of annotation, additional tables of nutrients for which limited data exist, and various indexes and supplementary tables.

Food composition data bases which are made available on electronic media (currently tapes or diskettes) tend to have less information than do printed tables, due, in part, to the perceived restriction of having the data arranged linearly, making various forms of annotation (e.g., footnotes) awkward. Partly in response to this, INFOODS developed its interchange scheme [44], which captures all the available information.

Printed tables tend to arrange the composition data in matrix fashion with the rows indicating the foods, the columns indicating the nutrients, and the cells (the intersection of a food and a nutrient) containing the numeric data. This format leads to difficulties when the number of nutrients increases to more than can be easily contained on a single page, or when there are nutrients for which there are few data. Consequently, other tables (e.g., those from Germany [79, 80] and the United States [96]) present one page per food, with nutrients in rows and attributes (e.g., mean value, number of samples, variances) in columns, thus providing easily located data with more analytic detail possible.


Careful and complete documentation of procedure is required throughout the compilation of a food composition data base. This information can inform the potential users of the extent and limitations of a given data base, enabling them to assess its suitability, understand its optimal use, and identify areas for addition or improvement. The utility of any data base will be greatly enhanced by complete documentation.

The information accompanying food composition data should describe the foods and nutrients included, define the precise meaning of the numbers and symbols used, and record how the data were obtained.

Description of Foods

Any food composition data base must contain or reference descriptions of the foods that are included. These descriptions must be sufficient for user identification of the entries. It is important, especially for reference data bases, that there be as much information as possible, to permit potential users to identify foods and decide if they are pertinent. (It is recognized that complete identification is impossible; the descriptions of foods are culturally determined, and pathways for global exchange of this information have not yet been adequately tested.) Ideally, this descriptive information should include common names, other names, scientific name, (reference to) recipe, preservation method, food source, growing location, conditions of growing and use, the condition of the food when encountered, etc. (See Truswell et al. [93].)

Description of Nutrients

Complete description of the reported nutrients is essential so that a user can judge the appropriateness of the data base. This information must include sufficient detail (or references to such detail) to permit duplication of the analyses; thus, documentation of sample preparation, method, etc. is of paramount importance. Klensin et al. [44] provide specific terminology and guidelines in this area.

Description of the Data

The data should be completely described, in terms of origin and manipulation (including the steps in that manipulation). Aggregated data, and data from composite foods, should be described in terms of what they represent. (See chapter 3.) Additionally, it is useful to include a description of the evolution of the data base, including why, how, and with what success the data base was developed.


The compiler should attempt to ensure the validity of the data base. Validity is a complex concept and ideally requires that:

It is important to review the data for consistency and to verify any numbers that appear incorrect. Data can be examined in several different ways to identify and verify numbers which are beyond expected ranges. While these checks will often find inaccurate analyses or non-representative samples of foods, they are particularly useful in identifying transcription errors and mistakes made in the units of expression. These validity checks fall into three classes:

Data base validation is an area that needs further effort. Although the basic procedures are relatively simple and easily performed by computers, there is no consensus on approaches.
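As an illustration of how simple such a check can be, the Python sketch below scans a data base for values outside expected ranges. It shows only one kind of check. The function name, the nutrient names, the limits, and the sample records are all hypothetical, and sensible limits would have to be set by the compiler.

```python
def out_of_range(records, limits):
    """Return (food, nutrient, value) triples falling outside the given limits.

    records: mapping of food name -> {nutrient name: value}
    limits:  mapping of nutrient name -> (lowest, highest) plausible value
    """
    problems = []
    for food, nutrients in records.items():
        for nutrient, value in nutrients.items():
            if nutrient not in limits:
                continue  # no expectation recorded for this nutrient
            low, high = limits[nutrient]
            if not (low <= value <= high):
                problems.append((food, nutrient, value))
    return problems


# Hypothetical limits, in grams per 100 g of edible portion.
limits = {"protein_g": (0.0, 100.0), "water_g": (0.0, 100.0)}

# The second entry mimics a units mistake: milligrams entered as grams.
records = {
    "bread": {"protein_g": 8.5, "water_g": 37.0},
    "rice": {"protein_g": 2500.0, "water_g": 12.0},
}

suspect = out_of_range(records, limits)
```

A value flagged this way is only a candidate for verification against its source. It may turn out to be a transcription error, a units mistake, or a genuine but unusual sample.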


The maintenance of a food composition data base consists, at minimum, of correcting mistakes. Additionally, there may be a need to modify data or add or delete foods. Note that any new data or changes in the data base should be checked for validity as indicated above. An important aspect of maintenance is to note and justify any change to a data base, and to carefully identify the different editions of the data base. Users should identify a particular edition of a data base that is used, so that possible or necessary corrections can be made when a newer edition is available. Several papers are available concerning the maintenance and management of food composition data base systems [11, 14, 30, 43, 65].
