

3. Introduction to the reference material


This chapter provides information about the conventions used in Part II and about the principles for constructing interchange files that are not specific to any particular element or class of elements.

CHARACTER SETS

The text strings of which an interchange file is composed are, with a few exceptions, restricted to contain only a minimal set of characters. This permits these files to be displayed or printed on a wide range of devices in many countries. The characters are the graphics (plus "space") of the ISO 646 basic character set [40]. For a few specific situations, such as expressing the name of a food in a language that does not use the Roman alphabet, special provisions are made to identify the language, the alphabet, and the way the alphabet is encoded. Those provisions are discussed below.

Character Restrictions within Ordinary Data

The "less-than" sign (<) and the "greater-than" sign (>) are reserved for the construction and recognition of tags and may generally not appear within data. Thus, when reading data normally, any occurrence of "<" indicates the beginning of a tag; similarly, when reading data backward, any occurrence of ">" indicates the end of a preceding tag.

Only a very small number of elements may contain data including "<" and ">", and these are not permitted to have subsidiary elements. The only elements of this type defined at present are <cmt/> elements (comments, which can have almost arbitrary character strings within them) and the <ad/> and <x400/> elements subsidiary to <email/> (electronic mail addresses, which may require the "greater-than" and "less-than" characters as part of the address). The only strings the contents of these elements cannot include are their own end-tags ("</cmt/>", "</ad/>", and "</x400/>" respectively).

The space character ( ) plays a special role within formatted data. Line breaks (which may be system-dependent) and the tab character may also be used. Multiple spaces, tabs, and line breaks in this special role are treated as if only one space appeared; we use the term "whitespace" to refer to any sequence of consecutive spaces, tabs, and line breaks. The special uses of whitespace are discussed below.

Character Restrictions within Tags

A tag (except for <infoods 85>) consists of a generic identifier preceded by "<" or "</" and followed by ">". Thus, a generic identifier may not include the "<" or ">" symbols, and must not start with the slant (/). It must also adhere to a number of other restrictions, listed below, to ensure that tags will have the same appearance, and the same meaning, regardless of printing device. These restrictions also help to prevent confusion between tags and actual data in an element, and identify some start-tags whose end-tag is required.

Generic identifiers must use an even more restricted subset of the ISO 646 basic character set than data. This subset consists of the numerals and letters (alphabetic characters), the hyphen, and the slant (/). All other characters, including the underscore (_), period, and space, are excluded. In addition, the slant may appear only as the last character of a generic identifier, the first character must be a letter, and hyphens must not appear adjacent to each other.

No distinction is made between upper- and lower-case characters in generic identifiers and keywords; i.e., <source> and <SOURCE> have the same interpretation. In unformatted text, there may be distinctions on the basis of case, as specified by the definition of the individual element.
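The restrictions on generic identifiers can be expressed compactly as a pattern. The following is a minimal sketch in Python, not part of the interchange specification: the function names are hypothetical, and the special tags <infoods 85> and <-> are exceptions that it deliberately does not cover.

```python
import re

# An initial letter, then letters, digits, or hyphens (never two hyphens in a
# row), with an optional slant permitted only as the final character.
_GENERIC_ID = re.compile(r"[A-Za-z](?:[A-Za-z0-9]|-(?!-))*/?")


def is_generic_identifier(text: str) -> bool:
    """Return True if `text` satisfies the character rules described above."""
    return _GENERIC_ID.fullmatch(text) is not None


def same_identifier(a: str, b: str) -> bool:
    """Compare two generic identifiers or keywords without regard to case."""
    return a.lower() == b.lower()
```

With this sketch, "header", "unit/", and "drvd-comp" are accepted, while strings containing a space, an underscore, an ampersand, or a slant in any position other than the last are rejected.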

Alternative Character Set Conventions

Exceptions to the very conservative ISO 646 basic character set are permitted for data values in a few elements. For example, an alternative character set may be used to spell out the local name of a food in its appropriate language. In such cases, the character set must be identified by the number of an ISO standard or the ISO registration number for that character set. The syntax for specifying an alternative character set is included in the description of the elements for which such characters are permitted.

CONVENTIONS FOR CONSTRUCTING ELEMENTS

Each element consists of a start-tag, content, and perhaps, depending on the particular element, an end-tag. The content consists of data, one or more subsidiary elements, or data followed by one or more subsidiary elements. Elements with no content are not permitted. For a discussion of the overall structure of an interchange file see the previous chapter.

Tags

A start-tag begins with "<" followed by an alphabetic character, while an end-tag begins with "</". Both end with ">". Between the opening "<" or "</" and the closing ">" is a "word", the generic identifier. A generic identifier is constructed according to the rules under "Character Restrictions within Tags", above.

Some examples of tags are: <header>, </source>, <unit/>, and </unit/>. The corresponding generic identifiers are "header", "source", "unit/", and "unit/" (not "unit" or "/unit/"). The following are character strings that cannot be generic identifiers:

"ONE KIND", "3rd", "this&that", and "ANOTHER/TAG"

The special tag <-> is neither a start-tag nor an end-tag, though in certain cases it may act as both (see the section titled "Repeated and Counted Elements", below).

Formatted and Unformatted Data and Whitespace

Data can be formatted or unformatted. Formatted data consists of one or more numerals and/or keywords separated by whitespace (spaces, tabs, and/or new lines) whereas unformatted data is arbitrary text.

A numeral is a string of digits with an optional sign and/or decimal point, or a numeral in scientific notation as prescribed in the applicable standards [36]. Forms beginning with a decimal point should not be used, e.g., "0.4" should be used rather than ".4". A keyword has the same internal structure as a generic identifier except that it cannot end in a slash: it must start with a letter, and continue with letters, digits, and/or hyphens. For example, "0.128" is a numeral and "USDA" is a keyword. "0.128 USDA" is formatted data consisting of a numeral followed by a keyword separated by required whitespace.

A raw data string consists of either formatted data (one or more data values) or one unformatted data item, or both; if both, the formatted data must come first. In general, one cannot determine whether data are formatted or unformatted by looking at them; the definition of the tag and its content is required. Any formatted data, such as the example "0.128 USDA" above, could also be interpreted as an unformatted data item. On the other hand, "0.128USDA" can only be unformatted data: it is neither a numeral, because it contains letters, nor a keyword, because it starts with a digit.
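The distinction between numerals, keywords, and strings that can only be unformatted data can be illustrated with a short Python sketch. It is an approximation under stated assumptions: the exponent form accepted for scientific notation (an "e" or "E" followed by an optionally signed integer) is a guess at what the applicable standards prescribe, and the function names are hypothetical.

```python
import re

# Optional sign, digits, optional decimal part, optional exponent.
# The exponent syntax is an assumption; forms such as ".4" are rejected.
_NUMERAL = re.compile(r"[+-]?[0-9]+(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?")

# Same shape as a generic identifier, but with no trailing slant.
_KEYWORD = re.compile(r"[A-Za-z](?:[A-Za-z0-9]|-(?!-))*")


def classify_item(item: str) -> str:
    """Classify one whitespace-free item as numeral, keyword, or unformatted."""
    if _NUMERAL.fullmatch(item):
        return "numeral"
    if _KEYWORD.fullmatch(item):
        return "keyword"
    return "unformatted"


def classify_items(data: str) -> list[str]:
    """Split a data string on whitespace and classify each item in turn."""
    return [classify_item(item) for item in data.split()]
```

Note that the classification only reports what an item could be. Whether "0.128 USDA" is actually formatted data still depends on the definition of the element in which it appears.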

Whitespace is required to separate successive formatted data items, and to separate formatted data from immediately following unformatted data. This whitespace is not part of the data item. Data items never begin or end with whitespace, although an unformatted data item may have embedded whitespace. For example, the string " This is a sample unformatted data value. " includes an unformatted data value consisting of 40 characters beginning with "T" and ending with ".". It has both leading and trailing whitespace, which are not part of the data item. However, the spaces between "This" and "is", between "is" and "a", and so forth, are part of the data item.

Whitespace immediately before and after tags is ignored. This means that data always may have whitespace before or after. Optional and extra whitespace in the form of judicious indenting and line breaks can make the structure of an interchange file much easier for a person to read.
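The way tags and data alternate, with surrounding whitespace ignored, can be sketched as a small tokenizer in Python. This is an illustration only, with a hypothetical function name. It assumes that "<" and ">" never occur inside data, so it would not correctly handle the contents of <cmt/>, <ad/>, or <x400/> elements.

```python
import re

# A tag is "<", any run of characters other than "<" or ">", then ">".
# The capturing group makes re.split keep the tags in its result.
_TAG = re.compile(r"(<[^<>]*>)")


def tokenize(text: str) -> list[tuple[str, str]]:
    """Return (kind, value) pairs, where kind is "tag" or "data"."""
    tokens = []
    # With one capturing group, re.split alternates: even positions hold the
    # text between tags, odd positions hold the tags themselves.
    for index, piece in enumerate(_TAG.split(text)):
        piece = piece.strip()  # whitespace next to tags is not significant
        if not piece:
            continue
        tokens.append(("tag" if index % 2 else "data", piece))
    return tokens
```

Applied to "<VITB12> 03 </VITB12>", the sketch yields the start-tag, the data item "03" with its surrounding whitespace removed, and the end-tag.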

Contents

The content of an element consists of all of the characters between the start-tag and the end-tag of the element. The content of an element can be subsidiary elements or a raw data string, or both. If an element includes both raw data and subsidiary elements, the data must come first. Each type of element (as designated by its generic identifier) has a specific list of what data values and/or subsidiary elements are permitted or required within the content of that type of element.

No element has an empty content. If all of the subsidiary elements are optional and none are desired, then the element itself must also be optional and should be omitted; similarly, if it is to contain a data value and that value is non-existent, the element itself should be omitted.

In the following example, the content of the <VITB12> element is data, the numeral "03":

<VITB12> 03 </VITB12>

In the following example, the content of the <comp> element is two subsidiary elements. The first is the same <VITB12> element shown above, whose content is data. This subsidiary element is followed by a second subsidiary element, <VITE>, whose content consists of a data numeral followed by three subsidiary elements (<XBTP>, <XGTP>, and <XATT>), whose content is in each case a (data) numeral:

<comp>
<VITB12> 03 </VITB12>
<VITE> 0.7 <XBTP> 0.4 <XGTP> 0.1 <XATT> 0.26 </VITE>
</comp>

The only elements that do not require an end-tag are those that permit only a small number of formatted data items (numerals or keywords) or a single unformatted data item in their content. They do not permit subsidiary elements. These elements never have an end-tag; end-tags are never optional. Each such element is so identified as part of its registered description. For example, <VITB12> and <VITE> elements require an end-tag, but <XBTP>, <XGTP>, and <XATT> elements do not.

The Trailing Slash and End-tags

Whether or not an end-tag is required can be predicted from the form and type of the generic identifier. Conversely, the form of a generic identifier is determined by the context in which it is used and whether or not it requires an end-tag. The specific rules are given under "Structural Elements" and "Other Elements", below.

These conventions are, admittedly, complex. From a conceptual standpoint, it would have been much easier to simply require end-tags for all elements. However, it became very clear in the early discussions from which the interchange system evolved that there was a critical requirement that small and simple data files should require minimal structural overhead so that, for example, they could be exchanged on low-capacity media (notably diskette) and processed successfully on small computers. Consequently, more complex rules were adopted that tend to keep small files small and impose more of the burdens of structure and precise identification on the files and data bases that would be proportionately larger and more complex in any case.

STRUCTURAL ELEMENTS

The <infoods 85> element and those elements that appear for a few levels of elements and content below it are used primarily to structure, i.e., to organize and order, the interchange file, rather than to carry table-specific or food-specific information. These are called structural elements. Structural elements always have end-tags, their generic identifiers do not end in slashes, and their content consists of elements only. Structural elements (except <infoods 85>) can occur only as subsidiary elements of other structural elements, and occur therein only in a prescribed order (although some are optional). In other words, a structural element may never appear subsidiary to a nonstructural element.

One of the implications of this is that some elements mark the nesting boundary between structural and non-structural elements: no element subsidiary to them is structural, and all elements to which they are subsidiary are structural. Those elements, which are themselves considered structural, are <header>, <classif>, <comp>, and <drvd-comp>.

The order in which the elements subsidiary to a given element must appear, if any, is always specified as part of the definition of the containing element. In general, the subsidiary elements must appear in a specific order. The major exception is for elements immediately subsidiary to the boundary elements listed above: those subsidiary elements may appear in any order.

OTHER ELEMENTS

Specific Food Component and Derived Component Elements

<Specific component> and <specific derived component> elements are the subsidiary elements of <comp> and <drvd-comp>, respectively. Like structural elements, they require an end-tag and their generic identifiers do not end in a slash. However, since they are not structural elements, they may occur in any order. (The terms "<specific component>" and "<specific derived component>" are shown in italics to remind the reader that they are not actual tags or elements but only placeholders for the registered list of identifying generic identifiers and element structures for food components [17].)

Other Non-structural Elements

While there are a few exceptions, other non-structural elements deal directly with data. Unlike structural or <specific component> or <specific derived component> elements, certain of these elements do not use end-tags. To avoid any question as to which do and which do not, each of these elements requires an end-tag if and only if its generic identifier ends in a slash.

For example, in the structure

<comp> <VITB12> 7 </VITB12>
<VITE> 3 <unit/> IU </unit/> <XATP> 1.0 <XBTP> 0.4 </VITE>
</comp>
<drvd-comp> <chemsc> 0.52 </chemsc> </drvd-comp>

an end-tag is required for the <comp> and <drvd-comp> elements because they are structural elements. The <VITB12> and <VITE> elements require end-tags because they represent specific food components and are immediately subsidiary to the structural element <comp>. The <chemsc> element requires an end-tag because it is a specific derived component, subsidiary to the structural element <drvd-comp>. Each <unit/> element requires an end-tag because "unit/" ends in a slash. The <XATP> element, which is subsidiary to the specific food component <VITE>, does not take an end-tag, because it is not a structural element or immediately subsidiary to one and "XATP" does not end in a slash.
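The end-tag conventions described in this chapter can be summarized as a single decision function. The following Python sketch is an interpretation rather than part of the specification: the function and parameter names are hypothetical, and the caller is assumed to know from the element definitions whether an element is structural.

```python
from typing import Optional

# Boundary elements whose immediate subsidiaries are specific components.
_COMPONENT_PARENTS = {"comp", "drvd-comp"}


def requires_end_tag(generic_id: str,
                     parent: Optional[str] = None,
                     structural: bool = False) -> bool:
    """Decide whether an element must be closed with an end-tag.

    generic_id -- the element's generic identifier, e.g. "unit/"
    parent     -- generic identifier of the immediately containing element
    structural -- True if the element is defined as a structural element
    """
    if structural:
        return True  # structural elements always have end-tags
    if parent is not None and parent.lower() in _COMPONENT_PARENTS:
        return True  # specific (derived) component elements
    # Every other element needs an end-tag only if its name ends in a slant.
    return generic_id.endswith("/")
```

For the elements discussed above, the sketch returns True for "comp" (structural), "VITE" (parent "comp"), "chemsc" (parent "drvd-comp"), and "unit/", and False for "XATP" with parent "VITE".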

Element Values of "Zero", "Trace", and "Missing"

If the value for an element is actually "missing", i.e., no value is available, the element is omitted entirely. This is a case of the principle that elements without content do not appear. If a value, however suspect, is available, it should be included: even values of questionable accuracy may be useful to some users under some sets of circumstances. Statistical and data treatment elements should be used to describe and, if possible, quantify the uncertainties. Under no circumstances should a zero (or any other number) be provided for a missing value unless that is the table compiler's best estimate of the actual value, preferably identified as such.

When the food component is measured, a zero value can occur either as the result of there actually not being any of the component present or as the result of limitations of apparatus, instrumentation, or procedures. Especially in the case of an apparent measured zero, data description elements should be used to give the receiver information about the accuracy to which measurement could be achieved.

The presence of a small, but not accurately measurable, amount (the so-called "trace" amount) provides another situation in which the description of the data value provides more information than the value itself. The special data item "TR" may be used as a keyword in any situation in which a data value would otherwise appear, but it should be used only with sufficient data description to identify the circumstances under which the "trace" value occurred, e.g., with an explicit element that identifies the detection level for the method used.

A related but slightly different approach to these problems has been provided by Kent Stewart [30].

REPEATED AND COUNTED ELEMENTS

Most element types can occur at most once as subsidiary within a given element; a few can be repeated. For example, the <infoods 85> element can have only one <header> but may have many <food> elements. However, the various <food> elements are not distinguished by their sequence: they are identified by internal data, not by the order in which they appear. Occasionally it is useful to have a repeatable element whose repetitions are distinguished by sequence. In this case, a very special notation is used. Instead of repeating the entire element, with the first end-tag adjacent to the next start-tag, the repeated contents are separated by a special tag, <->. For example:

<VITA> 13 <-> 7.2 </VITA>

where the first value would normally be "per 100 g edible portion" and the second value would be for some common unit, such as "per piece". That unit would be specified in a previous <fddflt> element. If it were not, this notation would indicate that the food had values of both 13 and 7.2 micrograms per 100 g edible portion, a contradiction. (The choice of "micrograms" is part of the definition of the <VITA> element but could be overridden with a separate subsidiary element, <unit/>.)

All specific food component elements (of which <VITA> is one) are of this type. On the other hand, the <addr/> element contains various lines which must be presented sequentially for the address to make sense:

<addr/> Post Office Box 1234
<-> Anywhere, Maine 00001
<-> USA
</addr/>

Only a very few element types (but including all specific food component elements) are permitted this ordered repetition mechanism. Each element type that is permitted to use it is clearly identified in its registered description.
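Once the content of such an element has been isolated, separating the ordered repetitions is straightforward. The following Python sketch uses a hypothetical function name and assumes that the string passed in is the element content with the enclosing start-tag and end-tag already removed.

```python
def split_repetitions(content: str) -> list[str]:
    """Split element content on the special <-> tag, preserving order.

    Whitespace adjacent to the <-> tag is dropped, as it would be next to
    any other tag.
    """
    return [part.strip() for part in content.split("<->")]
```

For the content of the <VITA> element above, " 13 <-> 7.2 ", the sketch returns the two values "13" and "7.2" in order. For the <addr/> element it returns the three address lines in the order in which they must be presented.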

THE MACRO ELEMENTS <dflt> AND <fddflt>

Two special elements are also defined that can be used to reduce the size of files of data in interchange format or to reduce the complexity of creating such files. They are always optional, and while they may be very convenient for some producers of interchange files, others will find it best to ignore them. They do add complexity to the structure and processing of interchange files, and therefore probably should be omitted (or, as explained below, expanded before the file is sent) if small files are being transferred in interchange format to users with limited computer expertise. INFOODS regional data centres are expected to have the capability of processing these elements.

The two elements are identified by the tags <dflt>, which appears immediately subsidiary to <infoods 85> (at the same level as <food>), and <fddflt>, which appears immediately subsidiary to <food> (at the same level as the <specific component> elements). <Dflt> is used to specify "default values" for all of the foods in the data base, while <fddflt> is used to specify "default values" for the components of a given food. Each has the same structure as the elements into which it substitutes; i.e., <dflt> has the same possible selection of content elements as <food>, and <fddflt> has the <specific component> elements as its content.

These elements are used as crude text "macros", providing for the substitution of values that do not appear directly in the content of <food> or <specific component> elements or their subsidiaries. The asterisk (*) is used to indicate the position of data that must be provided in the actual elements. To minimize processing complexity among these two elements and the elements to which their values are applied, there are no precedence rules: information may appear with <dflt>, with <fddflt>, or in the <food> element or its components, or not at all, but not in more than one per category. The element structure used in <dflt> or <fddflt> indicates where values are applied to the actual data elements. From a programming standpoint, the absence of precedence rules implies that a processing program can be constructed that will convert a file that contains <dflt> or <fddflt> elements into one that is fully expanded and in which they do not appear. With this model, the processing program requires no embedded knowledge of the specific foods or components. <Dflt> and <fddflt> may even appear together if they do not contain overlapping information. Such a program would continue to work with any future extension of the interchange system, including the addition of new elements. It could also operate independently of programs to convert or extract specific data from interchange files.

While other uses are possible (it can have any structure that <food> can have), <dflt> will typically be used to specify characteristics in common for all measurements of specific food components in a data base. For example, if all measurements of energy would normally be specified with the <energc> element with the "KJA" keyword, the following element could be provided:

<dflt> <comp> <energc> * KJA </energc> </comp> </dflt>

This would imply that any time an <energc> element appeared subsidiary to <food> and <comp> elements in the interchange file it would be treated as if "KJA" had appeared. In other words,

<food> ... <comp> ... <energc> 3 </energc> ... </comp> ... </food>

would be treated as if it read

<food> ... <comp> ... <energc> 3 KJA </energc> ... </comp> ... </food>

Because, as mentioned above, there are no precedence rules for substitution, the presence of the construction above would make it impossible to have any <energc> value in the file that contained a keyword specifying a method: if different <energc> methods appear in the file then <dflt> may not be used to specify any of them.
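The substitution performed for a single element can be sketched in a few lines of Python. This is a deliberately simplified illustration with a hypothetical function name. It assumes that the default template contains exactly one asterisk and that the data are formatted, so that runs of whitespace can safely be collapsed. It does not detect the conflicts that the absence of precedence rules forbids.

```python
def apply_default(template: str, actual: str) -> str:
    """Insert the actual data at the asterisk position of a default template.

    template -- data string from the <dflt> element, e.g. "* KJA"
    actual   -- data string from the corresponding element in <food>
    """
    if template.count("*") != 1:
        raise ValueError("template must contain exactly one asterisk")
    expanded = template.replace("*", actual.strip())
    # Normalize the whitespace that separates formatted data items.
    return " ".join(expanded.split())
```

With the <energc> example above, the template "* KJA" and the actual data " 3 " expand to "3 KJA". A complete expansion program would also have to match the element structure inside <dflt> against each <food> element, which this sketch leaves out.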

In this example, the content of <dflt> could also contain elements for other subsidiary elements of <comp>, for <drvd-comp> and its subsidiaries, and, in principle at least, for <classif> and its subsidiaries. At most, one <dflt> element is permitted in an interchange file.

The rules for application of <fddflt> are similar to those for <dflt>. If it appears, it applies to all the elements of <comp>, i.e., to all <specific component> elements. It will most often be used to express the units in which the food is reported, i.e., to provide the <meas/> element and its value for the entire food. Since the structure of <fddflt> parallels that of <specific component>, if the interchange file contains more than one set of measurements for each food component, the special delimiter element "<->" may be used to specify that the value of <fddflt> applies to only one. If <-> does not appear, it will be assumed to apply only to the first. So

<fddflt> * <-> <meas/> piece </meas/> </fddflt>

would imply that, for any food components for which more than one value (or set of values, if full statistical information were provided: see Chapter 6) appeared, the second one would represent values reported "per piece".

No rule of the interchange system prevents using an <fddflt> element as a subsidiary of <dflt>. However, if this is done, the creator of the file must ensure that the food defaults apply to every food component in every food in the data base, and that no conflicts occur with values specified with the individual foods or components. In practice, the combination will be useful, if at all, only with highly specific data bases, e.g., ones reporting many measured values for the same food, as for different locations or seasons. In that situation, it might be sensible to provide <classif> and some of its elements as components of <dflt> as well.

