Part III: Processing data and interchange files


7. Registering elements
8. Conversion of data to interchange format
9. Conversion of data from interchange format


7. Registering elements


INTRODUCTION

All systems which attempt to facilitate communication between different parties must develop a set of rules to which those parties must adhere. In the case of telephone or television transmission, these rules are largely invisible to the average user. The physical devices themselves are built according to agreed-upon standards and usually perform their functions without our having to think about them.

In the case of intra- and international exchange of data, where system independence is a prerequisite, rules must also exist which allow all parties to participate in data interchange as efficiently, and with as few errors and misunderstandings, as possible. A goal of INFOODS is to create a mechanism to make interchange of food composition data as invisible as telephone or television transmission mechanisms. That goal is not yet attainable, both because of differing standards about the data values themselves (analogous to two people trying to talk without speaking or understanding each other's languages) and because some identification issues, such as "When are two foods 'the same'?", require problem-specific scientific determination.

JUSTIFICATION FOR REGISTRATION

The interchange system is, as discussed in earlier chapters, a "tagged architecture" in which the meaning of each data value is specified by the generic identifier (more specifically, the structure of the element) with which it is associated. This obviously implies that the accurate matching of tagging structure to data is critical to the interpretation of those data: if a value for fat were somehow identified as a value for vitamin A, the distortion of values might be serious. Fortunately, a properly organized tagged architecture is less prone to misidentification of values than, say, an approach that depends on the order in which the data values appear. But permitting data to be moved among systems is not the only goal of the interchange system. Other goals include permitting those data to be exported and imported with very high accuracy and no loss of information, and being able to adapt to improving knowledge about nutrients and their analysis over time. Meeting those goals depends, to a significant degree, on agreement about generic identifiers and element structures between those who develop or send data and those who receive them. If an element appears in an imported interchange file, the receiver must be able to determine, efficiently and exactly, what that element, and all of its components, means.

Consequently, it is necessary to define the generic identifiers and elements, and the meaning of the data values, unambiguously and very precisely. That the initial listing of food components and associated tagnames [17] took four review cycles and two years to complete is an indication of the difficulty of the process. Perhaps more indicative: more information and discussion of subtle differences between variants on what is usually thought of as the same nutrient were needed (and added) during each of these cycles.

The initial list represented by these two documents cannot, obviously, be comprehensive for all time. New food components of interest will be identified, and improved methods, yielding different values, will be developed for food components now commonly reported. In addition, new elements will need to be defined for additional statistical and sample description and for non-nutritive components of foods such as additives and contaminants. Consequently, an integral part of the interchange system must be a mechanism for defining those new element structures and, in some cases, modifications to already-defined elements. That mechanism, following the broad concepts of an ISO model, involves the use of a "registration authority" for each type of element that can be defined. The INFOODS secretariat initially holds all registration responsibility and, as in the case of the food component tagnames, has gone beyond simple registration to a leadership role in defining the elements. It is expected that, in the future, other organizations will assume some element registration responsibilities, and that submissions of new element definitions will come from data user or producer organizations or from regional groups.

In any case, however elements originate, it is critical that they be precisely defined and registered, and the definitions made available, before their use in interchange is attempted. Since a major goal of the interchange system approach is to eliminate the need for the prospective receiver of a data file to have separate conversion programs for each organization from which data might be received, the receiver and the programs that support conversion from interchange format to the receiver's local formats must be able to accurately anticipate the structure and organization of any valid incoming interchange-format file. This goal is consistent with being able to expand the list of permissible elements over time only if there are clear rules about which types of elements can be ignored if not recognized; those rules are described in this chapter and elsewhere in this book.

This chapter defines the activities and responsibilities of the registration authorities, and specifies the conventions and requirements for proposing new elements and having them adopted. In principle, the structure of the interchange system can be extended in ways not specified in this chapter, such as by the addition of new "structural" elements. But doing so is not in the same category as the registration of a new data element: new structural elements can alter the rules about what can safely be ignored and would require fairly general consensus to adopt. Consequently, while the interchange model anticipates the possibility of such changes, no specific mechanism for making them is included here.

THE REGISTRATION AUTHORITY

Registration is a secretariat function, with little technical responsibility other than ensuring that materials required as part of a registration request are in order, complete, and consistent with previous registrations. There must also be a procedure for disseminating what has been registered. In the case of food composition data, only elements are registered. Those elements may identify new food components, or new modes of statistical description, or new sampling strategies, or new components of food descriptions, to name a few. In some cases, discussed below, a new submission for registration may also modify an existing registered definition, such as adding to a list of keywords that describe methods or conversion factors.

POLICY AND PROCEDURES FOR ACCEPTANCE

The registration authority must ensure that all elements are unique and that descriptions for each are complete and well-defined. The registration authority must also verify that relevant reference documentation is available. Neither definitions nor the actual elements themselves may be duplicated: one may neither have two definitions for the same name nor have two names for the same definition.
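The two-way uniqueness rule above can be illustrated with a minimal Python sketch. It is not part of the interchange system: the class and method names are hypothetical, and comparing definitions as normalized text is only a stand-in for the judgment a registration authority would actually exercise.

```python
class DuplicateRegistration(Exception):
    """Raised when a proposal would duplicate a name or a definition."""


class ElementRegistry:
    """Toy registry: one definition per name and one name per definition."""

    def __init__(self):
        self._by_name = {}        # generic identifier -> definition text
        self._by_definition = {}  # normalized definition -> generic identifier

    @staticmethod
    def _normalize(definition):
        # Assumption: two definitions count as "the same" when their text
        # matches after collapsing whitespace and ignoring case.
        return " ".join(definition.split()).casefold()

    def register(self, name, definition):
        key = self._normalize(definition)
        if name in self._by_name:
            raise DuplicateRegistration(f"name {name!r} is already registered")
        if key in self._by_definition:
            other = self._by_definition[key]
            raise DuplicateRegistration(
                f"definition is already registered under {other!r}"
            )
        self._by_name[name] = definition
        self._by_definition[key] = name
```

With this sketch, registering a second definition under an existing name, or a second name for an existing definition, raises `DuplicateRegistration`; any other proposal is accepted.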

Details of the requirements and format of an application to register an element follow. Once the registration proposal has been submitted, the registration authority will evaluate the application. As mentioned above, applications to register elements are evaluated solely on uniqueness and completeness. The registration authority is not normally expected to evaluate the relative merits of the analytic methods used or to apply other qualitative criteria.

When an application is accepted, the appropriate registration authority assigns both a generic identifier for the element and whatever registration identification is required to catalogue and retrieve element definitions and cross-references over time.

REGISTRATION REQUIREMENTS AND FORMAT

In general, a registration proposal must, in addition to identifying the applicant and any review procedures already applied to the proposal, completely define the proposed element or modification, its context, area of application, and relationship to other elements that might be confused with it. The subsections that follow list this information in considerable detail, and may safely be omitted by the casual reader.

Required Information

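The following items are required and must be included in a proposal to register a new element: the context within the interchange system, a definition and justification, a proposed generic identifier, a description of the content (including any keyword content values), a list of relevant subsidiary elements, and cross-references to related elements where any exist. They are described in more detail below.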

Each of these items must be specified precisely and completely, according to the definitions and guidelines that follow.

Context within the Interchange System

Every defined element of the interchange system is subsidiary to some other element except for the root element, <infoods 85>. The small number of structural elements identified in this document (<header>, <dflt>, and the immediate subelements of <food>: <classif>, <fddflt>, <comp>, and <drvd-comp>) provide a context for all other elements, including those not yet defined. Elements that are not directly subsidiary to the structural elements, but that lie further "down" in the structure, also draw context from the elements to which they are directly subsidiary. A proposal for a new element must include the context or contexts in which that element may appear.

Definition and Justification

This portion of the proposal must provide a definition or description of the proposed element. This description may be informal, but must be sufficient to permit someone to distinguish between one element and another. For example, for primary nutrient elements, this section must include the name of the nutrient and, if appropriate, the analytic method that distinguishes it. If there is already an element defined for the same or a similar purpose, justification must be provided as to why the existing element is not sufficient and the differences between the older and newer one(s) defined very clearly. For obvious reasons, the registration proposal for a completely new food component will be less complex than a proposal to define a new element where similar elements already exist.

Proposed Generic Identifiers

Generic identifiers provide the identification of an element, and are typically the names by which the element is known, indexed, and referenced. For the sake of readability and use, they should ideally be from three to seven characters in length, and, if possible, in pronounceable or nearly-pronounceable strings (this criterion is often not practical). When possible, names that have mnemonic significance in some language are preferred to those that are completely arbitrary. Most of the initial generic identifiers were derived from Latin, English, or chemistry.

In order to reduce the risk of undetected transcription errors, generic identifiers shorter than three characters are discouraged, except under special circumstances (see the introduction to the reference sections), as when the generic identifiers are names or abbreviations in nearly universal use. On the other hand, generic identifiers over eight characters are discouraged in order to minimize the size of, and amount of processing required for, an interchange file. Just as short names will be permitted when there is strong reason for doing so, longer names may be permitted when more important principles apply. For example, the uniform system used to assign the initial food component tagnames for fatty acids led to several generic identifiers that were more than eight characters long. In this case, consistency was considered more important than brevity.

The character strings (names) used for generic identifiers must start with a simple alphabetic character followed by simple alphanumeric characters and, under restricted circumstances, hyphens. The "simple" alphabetic characters are selected on a common denominator basis as the alphabetic characters common to Latin-based alphabets, without diacritical marks, special symbols, or characters designated as "national use" in the various international character coding standards. Simple alphanumeric characters are the simple alphabetic characters plus the digits. Obviously, generic identifiers may not contain embedded blanks or other "whitespace" or non-graphic characters. A more precise description of the characters permitted, and the associated rules for using them, appears in Chapter 3. In spite of the comments above about mnemonic significance and standard abbreviations, the generic identifiers of the interchange system are ultimately arbitrary character strings: programs may not assume that similar-looking generic identifiers are related, and people should be discouraged from making that assumption.
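As a rough illustration of the naming rules in this subsection, the following Python sketch checks a proposed generic identifier. It is a simplification under stated assumptions: "simple" alphabetic characters are taken to be the ASCII letters, letter case is not examined, and the function name `check_generic_identifier` is hypothetical. The precise rules are those of Chapter 3, not this sketch.

```python
import re

# Assumption: "simple" alphabetic characters are the ASCII letters A-Z and a-z.
_IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9-]*")


def check_generic_identifier(name):
    """Return (is_valid, notes) for a proposed generic identifier.

    is_valid reflects the hard character rules; notes lists the softer
    recommendations (length, hyphens) that the proposal runs up against.
    """
    if not _IDENTIFIER.fullmatch(name):
        return False, [
            "must start with a letter and contain only letters, digits, or hyphens"
        ]
    notes = []
    if "-" in name:
        notes.append("hyphens are allowed only under restricted circumstances")
    if len(name) < 3:
        notes.append("shorter than three characters: discouraged")
    if len(name) > 8:
        notes.append("longer than eight characters: discouraged")
    return True, notes
```

A name such as `enerc` passes with no notes, while `drvd-comp` passes but is flagged for both its hyphen and its length. The check deliberately says nothing about whether two similar-looking names are related, since generic identifiers are ultimately arbitrary strings.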

While the registration authority must accept a complete, consistent, and new definition, the proposed generic identifier is just a recommendation: the registration authority will make final decisions on matters of taste in generic identifier assignment.

Description of the Content

The content entry specifies the values, and characteristics of those values, for the proposed element. This includes, as discussed above, a description of the meaning of the element when optional content components (typically keywords or subsidiary elements) are omitted. The description of the content will frequently include references to the subsidiary elements and cross-references, provided above, for the sake of completeness. Where there are existing applicable international standards (such as ISO, Codex Alimentarius, IUPAC, or AOAC definitions, standards, or units), they are preferred to other alternatives and should be used and referenced.

For nutrient elements, the description of the content should include the number of values that are required and permitted (if different) and how they are to be interpreted. "Interpretation" information includes the units that a numerical value represents (in units/unit form, e.g., "grams per hundred grams edible portion"). If some values are optional, this section must indicate what their omission means.

For values that are expected to be numeric, the plausible range should be given when that is useful. If this range can be modified by subsidiary elements, that fact should also be indicated along with what variations are possible. However, modification of values by subsidiary elements is not desirable. Ideally, a receiver who ignores subsidiary elements below a certain level should not encounter serious problems. With the exception of <unit/>, this principle is followed in all of the initial sets of element definitions, and the implications of <unit/> are restricted, as discussed below.

"Units" which express a scale, e.g., "expressed as an integer, to be divided by 1000", are not permitted since they put an excessive premium on external knowledge. Notation for the values themselves that uses an explicit scale will be used instead (e.g., "5.2E-3"). This should not be taken as precluding the use of common SI multiplier units, such as milligrams: those are explicitly permitted. It is strongly preferred that the default unit for a particular element should be the one in most common use and that is scientifically most acceptable to permit the <unit/> to be omitted in most cases.
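To make the contrast concrete, the sketch below converts a value held with an implicit scale (an integer "to be divided by 1000") into a self-describing string with an explicit exponent, and reads such a string back. The function name and the use of Python's `decimal` module are illustrative choices, not something the interchange system prescribes.

```python
from decimal import Decimal


def to_explicit_notation(scaled_integer, divisor):
    """Turn an implicitly scaled integer into a string with an explicit exponent.

    The receiver of the resulting string needs no external knowledge of the
    divisor, which is the point of forbidding scale-type "units".
    """
    value = Decimal(scaled_integer) / Decimal(divisor)
    return format(value, "E")


def read_value(text):
    """Parse a value written with an explicit scale, such as "5.2E-3"."""
    return Decimal(text)
```

For instance, `read_value("5.2E-3")` yields the same number as `Decimal("0.0052")`, and a value stored as the integer 5 with a divisor of 1000 round-trips through `to_explicit_notation` without loss.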

For classification and other descriptive elements, the description of the content must include either a complete list of the values that may appear, a reference to where such a list may be found, or a very specific "generating rule" that can be used to determine what may or may not appear in the value. A "generating rule", as the term is used here, is a rule about what values are permitted and what they mean without listing the values. For example, "the quality value is a positive integer less than six" is a generating rule, while "the quality value must be one of 1, 2, 3, 4, or 5" is a complete list of values. Neither is a complete description of a content, since the meaning and interpretation of the values is not given.
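The difference between a generating rule and a complete list can be shown in a few lines of Python, using the quality-value example from the text. Both function names are hypothetical, and, as noted above, neither form says what the values mean.

```python
# Complete list: every permitted value is written out.
QUALITY_VALUES = frozenset({1, 2, 3, 4, 5})


def quality_by_list(value):
    """Accept a value only if it appears in the complete list."""
    return value in QUALITY_VALUES


def quality_by_rule(value):
    """Generating rule: the quality value is a positive integer less than six."""
    return isinstance(value, int) and 0 < value < 6
```

Over the integers the two checks accept exactly the same values; the list enumerates them, while the rule describes them.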

If a list or generating rule is incorporated by reference, the reference must be specific and must refer to materials that are readily accessible to the scientific community. For example, this reference form is acceptable: "the Australian Food Composition Tables, Government of Australia, 2 January 1903" since it is a specific reference to readily available material. Conversely, this reference form is unacceptable: "the current version of the 'Factored Food Vocabulary'" since "the current version" is not well-defined. To be acceptable, a specific version and source for obtaining it would need to be provided. Even "the version in use on 1 June 1988" is not sufficiently specific, as there is no reference to a document that is readily accessible to the scientific community.

While having a document filed with the registration authority is not sufficient to make it "readily accessible to the scientific community", deposit of referenced materials is also required unless they are very widely known and accessible.

Keyword Content Values

In specific cases, contents are composed of keywords or keywords and values. A keyword is a member of a controlled list of possible values (usually best thought of as names) for some item of information. The list is always restricted, and is often an alternative to the use of long descriptions (e.g., the keyword used for the name of a language) or a complex list of conversion factors or similar values (e.g., the keyword "CODEX" used with the <enerc> element to indicate the Codex Alimentarius-recommended energy conversion factors). Keywords will usually be registered and maintained by the appropriate registration authority, just as generic identifiers are, but a registration proposal may incorporate an international standard by reference. For example, there is an ISO standard for the representation of names of languages [39] which forms the list of keywords for the <lang> generic identifier.

Example

<lang> AR
<unit/> KCAL </unit>

References for registered keywords are maintained by the registration authority with the keyword registration materials, in lieu of a specific list of keyword values.

List of Relevant Subsidiary Elements

In many cases, an element will permit or require additional elements as part of its content. Typical subsidiary elements might identify units of measure different from the default for the food component, analytic methods that do not alter the expected values for a nutrient (different analytic methods that produce different expected values for the same nutrient call for different elements, pairing the nutrient with the method), statistical or sample description of the values presented, or other qualifying or descriptive information.

The definition for a subsidiary element is, in principle, identical to the definition for an element. This section should identify subsidiary elements by reference to their definitions, and provide information as to whether they are required or optional in this particular context. Any constraints on subsidiary element values should also be explained, using the general style discussed under cross-referencing below. New subsidiary elements themselves may be defined either as part of the registration proposal for the parent element or, especially if their use in other contexts is anticipated, in separate, but concurrently submitted, registration proposals.

Required subsidiary elements are discouraged in order to make simple processors easier to construct, but necessary exceptions may arise and should be justified. For example, the <enerc> element requires, in addition to the nutrient value, either a keyword specifying a calculation method or subelements that list the specific conversion factors used. Without one or the other, the energy value cannot be adequately identified and interpreted, and the use of that particular element is not permitted (<enerc> must be used instead). When required subsidiary elements are needed, as in this case, a justification similar to this example should be given, preferably with a discussion of why a series of separate generic identifiers and associated elements are not a preferable solution.

The reason for avoiding required subsidiary elements when possible is to avoid processing complexity, especially for users who are seeking only a particular type of information with relatively modest software. Elements should be defined in a way that is consistent with the most accepted or common use, to minimize the amount of qualification that is required except in the more obscure or unusual cases. On the other hand, since there may not be general agreement about the most accepted use, the meaning of the element without any optional qualifying information should be clearly specified. Optional qualifying elements also tend to increase the machine size and code complexity needed to deal with the interchange system and should be avoided, when feasible, on those grounds as well.

Cross-references

In those cases in which a proposed element is closely related to one or more other elements (e.g., a new method, producing different expected values, for a nutrient for which elements are already registered), the registration proposal must identify the earlier registrations and what the relationship is between the existing elements and the proposed element. The registration authority will maintain and update this section in the permanent reference copy of the element definitions. It should be noted that registration cross-references are intended for understanding, interpreting, and maintaining the list of elements and the interchange system in general. Although it is not its primary purpose, the information may also be of use when sites or regions develop thesauri to expand or automatically generate data base searches.

In particular, if cross-referencing is needed, the cross-references must not become convoluted. For example, they must refer to an original element definition, not to other cross-references. One should avoid "refers to element '<xxx>' as used subsidiary to element '<yyy>' with the modifications and constraints of '<zzz>'" because this type of reference rapidly leads to confusion and ambiguity.

In addition, while cross-references may constrain or qualify the values or definitions of an original element, they may not expand those values or definitions. For example, one might say "refers to element '<qqq>', except that, in this context, the value 'pounds per cubic meter' accepted in the general definition of that element, is not permitted" because it refers to another element but constrains its use. However, one may not say "refers to element '<m>', except that, in this context, the value 'pounds per cubic meter' accepted in the general definition of that element, is not permitted and instead substitute the value 'pounds per cubic inch'" because it both restricts and expands upon the original element definition, leading to some convolution of definitions.
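The "constrain but never expand" rule amounts to a subset test: the values permitted in the cross-referencing context must all be drawn from the values permitted by the original definition. A minimal Python sketch follows; the function name and the value "grams per litre" are hypothetical, and the other values are taken from the example above.

```python
def illegal_additions(original_values, contextual_values):
    """Return the values a cross-reference adds beyond the original definition.

    An empty result means the cross-reference only constrains the original
    element; a non-empty result means it also expands it, which is not allowed.
    """
    return set(contextual_values) - set(original_values)


# Values accepted by the general definition of the referenced element
# (hypothetical list for illustration).
original = {"grams per litre", "pounds per cubic meter"}

# Acceptable: removes a value, adds nothing.
constrained = {"grams per litre"}

# Not acceptable: removes one value but substitutes a new one.
expanded = {"grams per litre", "pounds per cubic inch"}
```

Here `illegal_additions(original, constrained)` is empty, while `illegal_additions(original, expanded)` contains 'pounds per cubic inch', the value that would force a reader to assemble the definition from more than one text.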

More precisely, a "convoluted" definition is one that a person (or computer) must construct dynamically by referencing several different pieces of text, and the purpose of these restrictions is to avoid such definitions. Dynamically constructed definitions are confusing and annoying for the reader and, especially in extensible systems such as the interchange definition, are error-prone and subject to ambiguities.

REGISTERING A KEYWORD TO BE ADDED TO AN EXISTING LIST

Certain keyword lists may be expanded. An application to register a new keyword, that is, to extend an existing keyword list, must include the proposed word, its meaning, and the context (element) in which the keyword will appear. It is, of course, not possible to expand a keyword list established by reference to an international standard except by revising that standard. The reference sections identify the existing elements for which keyword lists may be extended. New registrations must clearly present the model, if any, for expanding element definitions with new keywords or other syntax.
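The following Python sketch models the requirements for extending a keyword list: a proposal supplies the keyword, its meaning, and the element in which it appears, and lists established by reference to an international standard cannot be extended locally. The class, its attributes, the keyword "MYFACTORS", and the assumption that the <enerc> list is one of the extensible ones are all hypothetical; the reference sections, not this sketch, determine which lists may actually be extended.

```python
class KeywordListError(Exception):
    """Raised when a keyword proposal cannot be accepted."""


class KeywordList:
    """Toy model of the controlled keyword list attached to one element."""

    def __init__(self, element, keywords, fixed_by_standard=None):
        self.element = element                      # context, e.g. "enerc"
        self.keywords = dict(keywords)              # keyword -> meaning
        self.fixed_by_standard = fixed_by_standard  # a standard's name, or None

    def add(self, keyword, meaning):
        if self.fixed_by_standard is not None:
            raise KeywordListError(
                f"<{self.element}> keywords are defined by "
                f"{self.fixed_by_standard}; revise that standard instead"
            )
        if keyword in self.keywords:
            raise KeywordListError(f"{keyword!r} is already registered")
        if not meaning.strip():
            raise KeywordListError("a proposal must state the keyword's meaning")
        self.keywords[keyword] = meaning


# Assumed extensible list (illustration only).
enerc = KeywordList("enerc", {"CODEX": "Codex Alimentarius energy conversion factors"})
# List incorporated by reference to an international standard.
lang = KeywordList("lang", {}, fixed_by_standard="the ISO language-name standard [39]")
```

Calling `enerc.add("MYFACTORS", "locally documented conversion factors")` succeeds in this sketch, whereas any call to `lang.add(...)` raises `KeywordListError`.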

TRACE AND MISSING VALUES

In the interchange system, "missing" data is never represented by a value, but by the omission of something: a value or an element. All situations in which values may be "missing" and what that situation means must be clearly identified. See Chapter 3 and Stewart's comments [24] for discussions of the representation of missing values in interchange files and food composition data more generally.

CONCLUSION

See the appendices for application forms for registering elements.
