This chapter describes the interchange elements that are used to describe the data values themselves: units of measure, statistical values, and their interpretation, using the same organization found in the previous two chapters.
The <unit/> element is an optional immediate subsidiary of the various <specific component> elements. It specifies non-standard units for the data value being reported.
Both start-tag and end-tag are required. The content consists of a keyword chosen from the list below and subsequent registrations.
The unformatted data item which is the content of the <unit/> element is usually a prescribed keyword, but may be a descriptive word or short phrase. It describes the numerator of the "unit" of the value of a <specific component>, such as "milligrams" or "ounces". Compare with <meas/>, which, in the same sense, expresses the denominator.
The <unit/> element may occur immediately subsidiary to a <specific component> only if it does not occur within the <fddflt> element of the same enclosing <food> and does not occur within the interchange file's <dflt> element. If it occurs nowhere, then the units specified in the registered definition of the containing <specific component> or <specific derived component> are assumed. The <unit/> element should not be used when the values reflect the default units for the particular food component.
The content of the <unit/> element is a keyword value. The initial set of keywords is as follows. This list can be expanded by future registrations.
<ca> 3.0 <unit/> g </unit/> </ca>
The <meas/> element is an optional immediate subsidiary of each <specific component>. It specifies a non-standard measure of food for which a quantity of the component is being reported.
Both start-tag and end-tag are required. The content of a <meas/> element is a keyword, or keyword and element, chosen from the list below, optionally followed by a <cmt/> element.
While this element may be used to identify non-standard data for interchange purposes, data base compilers should be aware that quantities other than the conventional "per 100g edible portion" are likely to be extremely difficult to interpret in international or most comparative contexts. Consequently, this element should, if possible, be used only to identify supplemental data, e.g., values for specific household quantities or portions in addition to the "per 100g" quantities, not instead of them.
The content of the <meas/> element is usually a prescribed keyword, but some keywords require additional description, as shown below. The element describes the denominator of the "unit" of the value of a <specific component>, such as "per 100 grams" or "per one fruit" (but the word "per" is never included). Compare with <unit/>.
The <meas/> element may occur immediately subsidiary to a <specific component> only if it does not occur within the <fddflt> element of the same enclosing <food> and does not occur within the interchange file's <dflt> element. If it occurs nowhere, then the preferred measure "(per) 100 grams edible portion" is assumed. The <meas/> element should not be specified when the measure in use is the default.
Keywords and Structure
The content of the <meas/> element is a keyword value. The initial set of keywords is as follows. This list can be expanded by future registrations.
The absence of a <refuse/> element will be taken by most receiving parties as indicating that the values are supplied without refuse, i.e., as indicating that values supplied with "t100g" or "piece" are completely edible. Consequently, a <refuse/> element should be supplied if this is not the case. If it is not possible to supply a <refuse/> element when the quantity is "as purchased" or otherwise contains an inedible fraction, a <cmt/> element should be included that indicates this.
When the food is expressed in unconventional "household measures" or "as purchased", it is important to remember that people receiving a data file may be unfamiliar with the food. If information is available, the <qty/> and/or <refuse/> elements should be used to provide the basis for an approximate conversion to 100 grams edible portion if that is required, and additional information should be provided, as part of the food description, that will better describe portion sizes, etc.
Unfortunately, there has been little research or standardization in this critical area of food quantity description. It is likely that, at least in the near future, food composition data bases that include measures other than "per 100 grams edible portion" will require extensive textual description, in <cmt/> elements associated with <meas/> and in food description elements, to make the values useful and comparable outside, and possibly inside, the country of origin.
<qty/> Used to provide an estimate of the quantity, in grams, associated with a household or "as purchased" portion as eaten.
<refuse/> Used to provide an estimate of the fraction of a food "as purchased" that will be lost in conversion to "portion as eaten".
These two elements are ideally used together when "as purchased" quantities are involved. The second provides the conversion between the quantity or portion "as purchased" and the amount considered edible, and the first provides the conversion between that quantity and the specified portion size. The specific relationship is that, if the <qty/> and <refuse/> values are respectively Q and R, then the edible portion size in grams is Q and the quantity purchased in grams, P, is
P = Q / (1 - R)
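As an illustration only, the relationship can be sketched in Python. The function name and the sample fruit (150 grams edible, one quarter refuse) are assumptions made for this example and do not come from any registered data.

```python
def purchased_grams(qty_g: float, refuse_fraction: float) -> float:
    """Return P = Q / (1 - R): grams purchased for Q grams of edible portion.

    qty_g is the <qty/> value (Q); refuse_fraction is the <refuse/> value (R).
    """
    if not 0.0 <= refuse_fraction < 1.0:
        raise ValueError("refuse fraction must be at least 0 and less than 1")
    return qty_g / (1.0 - refuse_fraction)


# Hypothetical fruit: 150 g edible portion, 25 percent of the purchased
# weight discarded as pit and peel.
print(purchased_grams(150.0, 0.25))  # 200.0
```

A refuse fraction of 1 is rejected because it would mean that nothing edible remains, which makes the conversion undefined.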
The information is often hard to obtain and quantify, and both of these elements are optional. Additional information may be provided as part of the food description subsidiary to <classif>.
<meas/> piece <cmt/> one
<meas/> piece <cmt/>
half-pound steak </cmt/>
The <qty/> element is an optional immediate subsidiary of the <meas/> element, used with some of the keywords for that element. It specifies an approximate conversion between a household portion or quantity as purchased and grams as prepared or consumed. Its exact semantics depend on the keyword of the <meas/> element with which it is
Both start-tag and end-tag are required. The content consists of a numeral, typically representing the number of grams in the portion, piece, or size of product.
Unless the measure associated with the <qty/> element represents an absolute unit (e.g., "per ounce as eaten") the quantity divisor will represent a value that varies. For example, if the <meas/> element contains "piece" and a <cmt/> element indicating "one fruit", <qty/>, as described here, will represent the weight of an average fruit. As food composition data improve, it will probably be desirable to associate the same types of statistical distribution information about this value as are used to describe the data values themselves. This element will be extended as needed to accommodate that information.
The content of a <qty/> is a numeral, in floating-point notation if needed. It will be used as a unit-free divisor. The content may optionally contain a <cmt/> element, and additional elements may be added as discussed above.
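One way a receiving party might use the <qty/> value as a divisor is sketched below in Python. This is an assumption about typical use, not a rule of the interchange system: the function name and the sample values (12 mg per piece, 150 grams per piece) are hypothetical.

```python
def per_100g_edible(value_per_portion: float, qty_g: float) -> float:
    """Rescale a value reported per portion to the conventional per-100-g basis.

    qty_g is the <qty/> content: grams of edible food in one portion.
    The result keeps whatever unit the reported value carries (e.g., mg).
    """
    if qty_g <= 0:
        raise ValueError("the <qty/> value must be positive")
    return value_per_portion * 100.0 / qty_g


# Hypothetical record: 12 mg of a component per piece, with <qty/> 150.
print(per_100g_edible(12.0, 150.0))  # 8.0
```

Because <qty/> is itself an average for most natural products, the converted figure is an approximation and should be treated as such.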
The <refuse/> element is an optional immediate subsidiary of the <meas/> element, used to supplement "as purchased" quantities. It is expressed as a fraction of the total amount that is refuse, with an optional <cmt/> that describes the part that is discarded.
Both start-tag and end-tag are required. The refuse element is used to describe the amount of refuse, or waste, in converting between the portion of a food as purchased (or otherwise obtained, as in "raw" form) and the edible portion of the food.
For example, if the <meas/> element contains "piece" and a <cmt/> element indicating "one fruit", <refuse/>, as described here, might represent the fraction of the fruit's weight made up by the pit and inedible peel. As discussed under <meas/>, <cmt/> and food description elements that indicate what is considered edible are critical for understanding the data.
For natural products, the actual refuse removed prior to consumption will typically differ from one example to the next. Consequently, a <refuse/> value always represents, at best, an average value with a potential, but rarely well-understood, variance.
As food composition data improve, it will probably be desirable to associate the same types of statistical distribution information about this value as are used to describe the data values themselves. This element will be extended as needed to accommodate that information.
The content of a <refuse/> element consists of a numeral, representing the fraction of the product that will be discarded, and an optional <cmt/> element describing the part that is discarded.
The <srcfri/> and <srcorg/> elements are immediate subsidiaries of the <specific component> or <specific derived component> elements. They may be specified with <fddflt> when the same data records are used for all component values for a given food; in this case, no <srcfri/> or <srcorg/> elements should appear in the food record itself.
When data values are calculated or otherwise derived from values in other tables, <srcfri/> is used to list the international food record identifiers of the data records that contributed to the calculations. <srcorg/>, by contrast, is used to keep track of different unpublished data sources assembled by the compiler for the same food. For example, if two or more laboratories were used for different values, <srcorg/> could be used to identify the laboratories with the food components they supplied.
These elements are part of the overall system of tracking the use and evolution of data values discussed with the <ifri> element. In general, <srcfri/> will be used for published data values that have already entered the INFOODS interchange environment or that of one of the regions, so that a food record identifier has been assigned, and <srcorg/> will be used for other data values.
With one exception, listing more than one <srcfri/>, or both <srcorg/> and <srcfri/> elements, for a single food component will be rare. The exception arises when multiple sources, such as values from several tables, are used to derive a single value to be published.
"Srcfri" may be thought of as an abbreviation for "source food record identifier".
Both start-tag and end-tag are required. The content consists of an international food record identifier. If more than one is required to identify the data source, more than one <srcfri/> element may appear.
The content of <srcfri/> consists of one or more unformatted strings, each of which is an international food record identifier. The strings are separated by the <-> delimiter.
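A receiver might separate the identifiers in a <srcfri/> element along the following lines. This Python sketch assumes the content has already been extracted as a string; the function name and the placeholder identifiers "ID-A" and "ID-B" are hypothetical and do not reflect the real identifier format.

```python
def split_srcfri(content: str) -> list[str]:
    """Split <srcfri/> content into identifiers at the <-> delimiter."""
    return [part.strip() for part in content.split("<->") if part.strip()]


# Placeholder identifiers, used only to show the splitting.
print(split_srcfri("ID-A <-> ID-B"))  # ['ID-A', 'ID-B']
print(split_srcfri("ID-A"))           # ['ID-A']
```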
The <srcorg/> element is an immediate subsidiary of the <specific component> or <specific derived component> elements. It may be, and often will be, specified using <fddflt>. This element is part of the overall system of tracking the use and evolution of data values. Its specific application is discussed with <srcfri/>. "Srcorg" may be thought of as an abbreviation for "source food record
Both start-tag and end-tag are required. The content consists of a set of elements representing an original data source of data not included in the international food record identifier system. Several <srcorg/> elements may appear if needed. The element sets may contain <ref/> elements, typically to identify published articles that contain data, or <cmt/> elements for more local information, e.g., laboratory identification. The list of permitted elements that may appear subsidiary to this one may be expanded as requirements become obvious.
With one exception, listing more than one <srcorg/> element, or both <srcorg/> and <srcfri/> elements, for a single food component will be rare. The exception arises when multiple sources, such as values from several laboratories, are used to derive a single value to be published.
The content of <srcorg/> consists of a set of elements, which represent an internal data source. If there are multiple sources, more than one <srcorg/> element may appear, as discussed above.
The <data description> elements are immediate subsidiaries of the various <specific component> and <specific derived component> elements. They specify various statistical properties of the data value or sampling information relevant to the particular food component.
The term <data description> is shown in italics as a reminder that it is not an actual tag and never appears in an interchange file but, instead, is a placeholder for a series of individual elements.
The content and structure of the various <data description> elements are different.
The content of the specific <data description> elements will be as registered.
Philosophy and Categories
The data description elements are intended to provide methods of supplying and identifying both data about food components and "metadata" (descriptive information about the data and conventional statistics), which are additional values and description about those data. The categories immediately following are influenced by both general data classification theory and the realities of practice in handling and presenting food composition data. We use five major categories, two of data (or statistics derived from the data) and three of metadata. It is possible to think of the <unit/> element as part of this group as well.
The intent of the discussion, and the elements specified here, is simultaneously to provide a framework for extensive description of data, statistical and otherwise, and to provide some justification for doing this. As with other components of the interchange system, none of these elements is required for minimal interchange. Even with more extensively documented data bases, the elements should not be used unless the data associated with them are available. Indeed, some elements are provided to identify descriptive values that appear in food tables that INFOODS does not recommend including. The maintainer of a table or data base that contains only point estimates of values which can be treated as means can ignore these elements entirely.
The element groups are listed below. The first two of these would normally be considered as statistics (or "data") and the other three provide metadata, including information about relationships to, and among, the data. Each of these is described in more detail after the listing.
Consistent with the general interchange model, any of these categories for which information is not available may be omitted. Nonetheless, it is useful to understand that each of the categories, with the possible exception of the first, always exists at some level, however trivial or embedded in the subconscious of the table-preparer. No known food composition table or data base at present specifically contains either of the two last categories. The third normally appears only in the paper archival files that record the progress of data from laboratory to the file of raw data and from there to more refined and "cleaned up" data bases.
These categories are inextricably intertwined with each other and with <unit/>, both in logic and in how they are to be handled in the interchange system. If different location estimates or statistics are associated with, e.g., different units of expression or different data cleaning procedures, or different sets of beliefs, then a separate group of values (statistics and metadata) must appear for the alternative units or cleaning procedures.
Earlier sections of this document have discussed the use of the special delimiter element <-> to divide repeating groups of information. <-> may appear immediately subsidiary to a <specific component> or <specific derived component> element to indicate that it contains two or more groups of data. When "<->" is used in this context, we describe these groups as "statistical data groups". Each of these may have its own values, its own units, and its own statistical description of the data values. No mechanism parallelling <dflt> or <fddflt> is now defined or contemplated to permit implied copying of values from one of these groups to another, or for using one such group to establish defaults for another (which is much the same thing). If two groups of data are needed, and some of the information overlaps, then it must be duplicated.
Because of the nature of the best estimate of location, one group can contain at most one of these. If more than one such estimate is provided, then there must be a corresponding number of <data descriptions>, presumably with different applicable <unit/> or <meas/> elements associated with them or with different data treatments (e.g., methods for cleaning or rejecting outliers).
For ease in processing at the receiver end of an interchange, data producers should be encouraged to place the most representative or most internationally useful values and statistical information first when they have opinions on which is most representative or useful, but this is not a rule of the interchange system. For example, if a food composition table being translated into interchange form contains both data on the basis of 100 grams of edible portion of a food and data on the basis of the food as purchased, we would strongly recommend that the data should be reported in that order for each nutrient.
THE SETS OF VALUES
Two categories of data and three of metadata are listed above. This section explains those categories, and can safely be skipped by any reader who has an adequate grasp of the categories from the brief discussion above.
1) Best estimate of location
This category represents several realities, rather than statistical purity, although there is analogous statistical terminology for it (below). Many food composition tables report only a single value for a food. Most other tables feature one value prominently. In all of these cases, that value represents the value that a table-producer might supply in response to the question, "If I have to use a single value to represent the amount of this particular nutrient in that particular food, what should I use?" That value may be true or false, representative or misleading, but it presumably represents the best estimate available to that food table producer at that time. It is often inappropriate to label it, for example, an "average value", since that term has a very precise meaning and, in some cases, averages may be known to the table producer to be inappropriate. The interchange system makes provision for precisely identifying a value as the mean when that is appropriate.
We would expect that some variety of location estimate (possibly a "trace" indication, rather than a number) would appear for substantially every food component that is reported. Outside this section, examples of fragments of interchange files consequently omit specific location-statistic-identifying information.
2) Statistics about the data
As discussed immediately above, it is often inappropriate to identify many of the values that appear in food tables with precise statistical terms, such as "mean". At the same time, when descriptive statistical values are available, the interchange system must support reporting them, and reporting as much detail about them and their derivation as can be found. This general category subsumes the statistics themselves; the next one begins the description of the derivation of the statistics.
In principle, any [sample] statistic about the data may appear in this category. Examples of such statistics, each of which would be represented as one or more values tagged to indicate the statistic it represents, would include estimates of central tendency such as the mean, the median, and the geometric mean; estimates of spread such as the variance, standard deviation, range (expressed as the difference between maximum and minimum values or as the maximum and minimum values themselves), and hingespread (difference between the upper and lower quartiles); critical values such as the maximum and minimum, 15th or 85th percentiles, or an 80% confidence limit; estimates of accuracy, such as the standard error or non-parametric estimates of standard error such as the jackknife; the sample size; and so forth.
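For readers less familiar with these terms, the following Python sketch computes several of the statistics named above using only the standard library. The function name, the dictionary keys, and the five sample values are assumptions made for illustration. The quartiles come from the default method of `statistics.quantiles`, which is one of several conventions and may differ slightly from hinges computed by hand.

```python
import math
import statistics


def describe(values):
    """Return a dictionary of common descriptive statistics for a sample."""
    n = len(values)
    lower_q, _, upper_q = statistics.quantiles(values, n=4)
    sd = statistics.stdev(values)  # sample standard deviation
    return {
        "n": n,
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "geometric_mean": statistics.geometric_mean(values),
        "variance": statistics.variance(values),
        "sd": sd,
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "hingespread": upper_q - lower_q,
        "serr": sd / math.sqrt(n),  # standard error of the mean
    }


# Hypothetical analytical results for one component in one food.
sample = [4.8, 5.0, 5.1, 5.3, 5.8]
for name, value in describe(sample).items():
    print(f"{name}: {value:.3f}")
```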
This category also includes a special element that identifies the particular statistic represented by the best estimate of location if, in fact, that represents a precisely defined statistic. This provides a compact notation and prevents giving the appearance of more information than is actually present.
While the term is often too vague to be of significant use, this category includes all or most of what are often referred to as "descriptive statistics".
3) Treatment of the data
It is typical with observed or estimated data in general, and it seems particularly true of food composition data, that one rarely takes unevaluated "raw" data, computes, e.g., a mean, and reports the value. Indeed, with most food composition data, such an approach would be irresponsible, as Greenfield and Southgate argue most forcefully.
Instead, one should, and does, evaluate, removing values that are obviously bad, and potentially adjusting others because of what is known about the characteristics of the particular food sample. The evaluation process invariably involves the application of experience with, and theory about, food composition to the data, and may involve the application of formal procedures based on statistical theory.
With most food composition tables in the past, the process of data evaluation and treatment has not been described in detail, partially because of a belief that no one was interested or would be able to make use of the information, and partially because there was no framework in which to describe it. Consistent with the goal of organizing the interchange system so that an interchange file can be self-contained and include all information that is available, we wish to define a framework in which the information about data evaluation and treatment can be reported if that is desired.
The elements in this category describe the treatment itself, and the next two categories are available to describe the assumptions on which the treatment is based if those are relevant and available. Since choices of treatment used will often lead to different statistical values, there is an inexorable relationship between these elements and the statistical ones (categories 1 and 2) discussed above. As a corollary, if treatment elements are omitted, the receiver of the data will usually infer that the treatments applied are "safe" or "transparent" and do not affect the reported data values. Just as the descriptive values can include either formal statistical procedures or subjective or objective decisions based on experience and examination, the treatments can also. The content of this category could then include statements with such meanings as "examined the data on the basis of experience and discarded obviously silly values", or "applied a ten percent trimming rule to the sample data to eliminate erratic behavior in the tails of the distribution", or "discarded measured cholesterol values for this plant product".
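To make the "ten percent trimming rule" mentioned above concrete, here is a minimal Python sketch. It assumes the rule drops ten percent of the observations from each tail before averaging; other readings of the rule exist, and the function name and sample values are hypothetical.

```python
def trimmed_mean(values, proportion=0.10):
    """Mean after dropping `proportion` of the observations from each tail."""
    if not 0.0 <= proportion < 0.5:
        raise ValueError("proportion must be at least 0 and less than 0.5")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # observations dropped per tail
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical sample of ten results with one erratic high value.
sample = [4.8, 4.9, 5.0, 5.0, 5.1, 5.1, 5.2, 5.3, 5.4, 9.9]
print(f"untrimmed mean: {sum(sample) / len(sample):.3f}")
print(f"trimmed mean:   {trimmed_mean(sample):.3f}")
```

A treatment of this kind is exactly what a treatment element would need to record, since the trimmed and untrimmed means differ noticeably for this sample.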
This category will rarely appear without at least some statistics about the data (category 2), but there are exceptions. In particular, a data treatment statement such as the last one above would justify reporting a best estimate of location of zero, even if there were no specific descriptive statistics reported, and even if the measurement showed a trace quantity. Cholesterol values for plant products are a traditional example of this type of situation.
Nothing here is intended to encourage or deprecate any particular procedure, decision, or decision-making process. Instead, as with other aspects of the interchange system, it is important to facilitate the transfer of whatever information is available about what was done so that the data recipient can adequately perform his or her own critical evaluation of procedures and descriptions against the background of the use for which the data are intended. Consequently, whether or not it is reasonable to discard data that seem to indicate cholesterol in plant products is not at issue here; what is at issue is the degree to which the interchange system facilitates the reporting of such a decision, if it was made, and, to some extent, the degree to which it encourages the reporting of such decisions.
On the other hand, when one is going to report information about the properties and distribution of data, some statistics are clearly better than others. Some comments on that subject can be found in the descriptions of the individual elements; others appear as part of the INFOODS recommendations on compiling food composition tables.
One should also avoid the temptation to believe that data evaluation and cleaning methods based on statistical procedures are "better" than those that involve proportionately more of the wisdom and sophistication of an experienced scientist. The opposite may be true: with data as complex as most food composition data become by the time they are reported in a table, there may not only be no reasonable substitute for the judgement of a scientist, but the uncritical application of a statistical procedure may be appropriate only when neither experience nor theory is available.
4) Description of the distribution
It is sometimes possible to describe a sample distribution in ways that move beyond summary statistics. The most obvious of these is a simple listing of data points, or a listing of the frequencies or cumulative frequencies of groups of data points, with the groups determined by either equal-interval categories or some other rule, or a listing of the values associated with certain selected fractions of the data (collections of percentiles and values at the first and second standard deviation points fall into this category). To some degree, this information, if provided, supplies additional empirical information about the degree to which the summary statistics can be believed to be useful (if the best estimate of location is not a particular summary statistic and reported as such, one's confidence in it is simply one's confidence in the ability of the analyst and the table compiler to identify a "best" value; while this sounds dangerous and subjective on first glance, it reflects what is actually done and is a reasonable approach).
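A listing of frequencies and cumulative frequencies by equal-interval categories, as described above, might be produced as in the following Python sketch. The function name, the starting point and width of the classes, and the sample values are all assumptions chosen for illustration.

```python
from collections import Counter
from itertools import accumulate


def equal_interval_frequencies(values, start, width):
    """Count values in classes [start + i*width, start + (i+1)*width)."""
    counts = Counter(int((v - start) // width) for v in values)
    rows = []
    for i in range(min(counts), max(counts) + 1):
        rows.append((start + i * width, start + (i + 1) * width, counts[i]))
    return rows


# Hypothetical sample, grouped into classes 0.5 units wide.
sample = [4.8, 5.0, 5.1, 5.3, 5.8]
rows = equal_interval_frequencies(sample, start=4.5, width=0.5)
cumulative = list(accumulate(count for _, _, count in rows))
for (low, high, count), running in zip(rows, cumulative):
    print(f"{low:.1f} to {high:.1f}: {count} (cumulative {running})")
```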
5) Beliefs about the distribution
Especially when statistical procedures are applied to evaluate or clean data, those procedures are typically based on beliefs about the underlying distribution and what values are, and are not, "possible". Beliefs could include such assertions as "I know the underlying distribution is normal (or exponential, or...)", "I know that there is probably normality in the population distribution, but the instrumentation is unable to detect concentrations below 0.0001 percent, so the sample distribution will be censored in the left tail", "There is reason to believe that the sample distribution is a mixture of two populations, so the data may be multi-modal", "I know that cholesterol does not appear in plants", and so forth. It is useful to document these assumptions and statements about beliefs because they may be controversial. Scientists who would agree should know that their assumptions are shared; those who disagree should be able to evaluate the data accordingly.
This information is typically even more abstract than the description of the distribution, and we would not expect it to be reported very often. At the same time, as discussed above, there is merit in arranging for it to be reported if it is available.
Common Practice and Recommendations
As in its other sections, this chapter is devoted to providing ways in which food composition data can be structured and identified for interchange, and possibly other, purposes. At the same time, this chapter probably contains more unfamiliar terminology, and elements describing unfamiliar concepts, relative to common practices with food composition data, than most others. The most common practice has been to report only a single value for each nutrient for a given food (which a statistician might call a "point estimate of location"). Less common, but still popular, has been to report the point estimate along with some indication of how much the actual value might be expected to vary: typically minimum and maximum values, a confidence interval, or a standard error, together with a sample size.
Since many existing tables and data bases take this approach, the interchange system partially provides for it by including an inexactly described point estimate of location. It also provides for upper and lower bounds (<bounds>), standard errors (<serr>), and sample size (<smsz>). However, when some of these estimates are calculated from the very small samples typically encountered in food composition work (often as small as five or six or fewer), they tend to be very sensitive to extreme values (see Rand et al. or Rand for more discussion of this subject in a food composition context and any of several standard statistical references, e.g., [8, 28, 34], for a more extensive statistical treatment).
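The sensitivity of small-sample estimates to a single extreme value can be seen in a short Python sketch. Both samples are hypothetical; the second differs from the first only in its largest value.

```python
import math
import statistics


def location_summary(values):
    """Return (mean, median, standard error of the mean) for a sample."""
    serr = statistics.stdev(values) / math.sqrt(len(values))
    return statistics.mean(values), statistics.median(values), serr


# Two hypothetical five-value samples that differ in one observation.
typical = [4.8, 5.0, 5.1, 5.3, 5.8]
with_extreme = [4.8, 5.0, 5.1, 5.3, 15.8]

for label, sample in (("typical", typical), ("with extreme", with_extreme)):
    mean, median, serr = location_summary(sample)
    print(f"{label}: mean={mean:.2f} median={median:.2f} serr={serr:.2f}")
```

The mean and the standard error shift substantially while the median does not move, which is the kind of behavior that motivates the more robust estimates discussed in this chapter.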
There are two current trends in statistical data analysis and data description that can be thought of as addressing the issues of better describing and understanding small samples of potentially quite irregular (e.g., non-Gaussian) data. One of these concentrates on the use of more "robust" estimates (those that are less prone to be significantly distorted by a few extreme, or otherwise bad, data points) and sometimes on describing the sample without trying to make population inferences. The other involves the explicit combination of external data or knowledge with the sample data to provide more information about both. The treatment in this chapter is intended to support both of those points of view, either of which is probably preferable to the traditional approaches unless sample sizes are quite large. While we hope that future INFOODS recommendations will address these issues in more detail, perhaps the best explanation of the first approach as it is applied here (although by no means an introductory tutorial) is provided by Hoaglin et al. For the second approach, we recommend the discussion by Efron and Morris for the empirical approach or Howson and Urbach for the broader philosophical issues involved.
STRUCTURE OF THE ACTUAL ELEMENTS
Within a <data description>, the best estimate of location must appear first, if it is to appear at all. It is important to note that it is data in the <comp> element, and is not part of some other element (ignoring the SGML interpretation of the special delimiter <->).
The other four categories of statistics and metadata may appear in any order within the <data description>. These categories are not represented as data within the content of the <comp> or <drvd-comp> elements, but as [tagged] elements, as specified below.
The current availability of data is such that we need not define the third through fifth categories in great detail at this time. We must be prepared to do so when they are needed, and must be reasonably assured that they can be accommodated without doing violence to the interchange system. Consequently, just as we have left "component content" somewhat undefined up to this point, it is now appropriate not to try to define it completely (which could be a never-ending task) but to define appropriate structures and then wait for actual practice to require further definition.
Consequently, having described what we mean by a <data description> we define each one as consisting of the following:
1) The best estimate of location, as described in the documents that define the food component tagnames. This estimate may contain elements in its content, as specified in those documents. It is optional, but we would expect that it will almost always appear.
2a) The descriptive statistics, each identified by a tag that indicates what it is. The elements will typically not require end-tags but will consist entirely of one or more numeric values; consequently the generic identifiers will not end in slashes. There will be no tag or element whose function is to delimit the descriptive statistics from the other content of the <data description>.
2b) As a convenience, to avoid the appearance of more information than exists, and to facilitate the specification of defaults that apply to an entire interchange file, an additional tag is introduced, named <loctype/>. Its purpose is to designate the location statistic actually represented by "best estimate of location", when that value can be responsibly reported as a location statistic. Its content is a keyword from a restricted vocabulary, which should be the list of tagnames for location statistics.
The <loctype> element can be mixed with the elements that describe particular location statistics, with slightly different meaning. "<NA> 5.2 <loctype> MEAN </NA>" indicates that there are 5.2 mg of sodium, that this is the best estimate of location, and that the value is the mean. We would usually take " <NA> 5.2 </NA> " to be identical to this, but the longer form provides slightly more confidence. In a slightly more elaborate form, or with a more devoted data base or interchange file producer, we might encounter
<NA> 5.2 <loctype> mean
<median> 5.1 </NA> or
<NA> 5.2 <median> 5.1 </NA>
with the same significance as the examples above, but indicating that the median was also computed and is provided. There is, however, another case, with slightly different semantics, which should not appear unless the data base developer has determined that there is a need to make a very specific point:
<NA> <mean> 5.2 <median> 5.1 </NA>
reports a mean and median as above, but suggests that the data base developer or maintainer is quite explicitly not willing to make an assertion as to which of these is the best estimate of location. Such an unusual assertion should normally be accompanied by an explanatory comment.
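The distinction among these forms can be illustrated with a small Python sketch that separates the untagged best estimate of location from the tagged statistics. This is not an SGML parser and is not part of the interchange system: it assumes numeric content, ignores <cmt/> and other subsidiary elements, and every name in it is hypothetical.

```python
import re

# A token is either a tag (text between angle brackets) or a run of text.
TOKEN = re.compile(r"<([^<>]+)>|([^<>]+)")


def parse_component(fragment: str) -> dict:
    """Split one component element into best estimate and tagged statistics."""
    result = {"component": None, "best_estimate": None,
              "loctype": None, "statistics": {}}
    current = None  # which item the next piece of text belongs to
    for tag, text in TOKEN.findall(fragment):
        tag, text = tag.strip(), text.strip()
        if tag:
            if tag.startswith("/"):
                break  # end-tag of the component closes the group
            if result["component"] is None:
                result["component"] = tag
                current = "best_estimate"  # untagged content comes first
            else:
                current = tag.rstrip("/").lower()
        elif text and current is not None:
            if current == "best_estimate":
                result["best_estimate"] = float(text)
            elif current == "loctype":
                result["loctype"] = text.lower()
            else:
                result["statistics"][current] = float(text)
    return result


print(parse_component("<NA> 5.2 <loctype> MEAN </NA>"))
print(parse_component("<NA> 5.2 <median> 5.1 </NA>"))
print(parse_component("<NA> <mean> 5.2 <median> 5.1 </NA>"))
```

In the last form the parsed best estimate is empty, which mirrors the point made above: a mean and a median are reported, but neither is asserted to be the best estimate of location.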
An obvious alternative would simply be to require table compilers to specify the statistic associated with the best estimate of location, with a possible value implying "unknown". There are several reasons why this was avoided. First, this estimate may be somewhat subjective or intuitive, rather than a well-defined statistic. Second, especially when recording tables compiled years ago, the precise statistical parameter would often be unknown, at least without further qualification, and the interchange system should not encourage people to guess at information that does not actually exist.
3) The description of the treatment or processing. Future work may be required to supplement whatever free text is used with recognizable and comparable tags or keywords. However, since, as far as we know, information of this type has not yet appeared in any food composition table (although it is becoming prevalent in other fields), free text description should be used for the near future. The <sclean/> tag identifies this information.
4) The empirical description of the distribution. The <edistr/> element identifies this information.
5) The description of more subjective beliefs about the distribution. As with <sclean/>, there is now provision only for free text content, but work should be started on specific elements and keywords as soon as that is practical. This information is identified with the <sdistr/> tag, since it provides a subjective description of the [hypothesized] population distribution.