This is the old United Nations University website. Visit the new site at http://unu.edu


Contents - Previous - Next


8. Conversion of data to interchange format


INTRODUCTION

Displaying data in interchange format involves printing out each data element in the proper order, and inserting tags and occasional whitespace between them. Additional whitespace (including line and record breaks) can be inserted to keep the line or record length below a system-mandated maximum or to make a listing of the interchange file easier for people to read. This chapter will discuss mechanisms for producing interchange files from "paper copy" and machine-readable data, the latter either directly maintained or accessed through a data base management system.

MANUAL INPUT OF DATA

If food data to be placed in an interchange format file are not in machine-readable form, and are not intended to be made so except for creating the file, then the appropriate way to create an interchange file is to use a word processor and manually enter the data for subsequent conversion into interchange format. If the data are irregular (e.g., different components represented for each food) or include only a few entries, it may be appropriate to enter the interchange tags themselves at the same time.

If the data consist of entries for the same set of food components for each of several or many foods, then it may be desirable to insert the necessary tags semi-automatically. This could, for example, be done within some editors by creating a "macro" that will insert all of the tags for a single food's entries. For example if the editor being used has the ability to capture keystrokes, the sequence of operations (assuming the data values are entered one per line) would be to align the editor's current position at the first datum for the first food, and then capture:

Turn off the keystroke-capturing. Position to the beginning of the next food and trigger a rerun of the captured keystrokes (i.e., execute the macro). This will automatically insert all of the tags for this food. Repeat until all foods are tagged. Then don't forget to manually insert the <infoods 85> tags and the initial <header> and <dflt> elements.

This same addition of tags could be accomplished by a small specially written program. For example, the following subroutine (written in the BASIC language for illustration) will add the necessary tags. (Note that the subroutine must be specially written for a specific sequence of data values for each food.)

100 input #1, datavalue$
write #2, "<food><classif><ifri>"; datavalues;" </ifri>"
input #1 datavalues
write #2 " <bvname>"; datavalues;" </bvname></classif>"
input #1, datavalue$
write #2, " <comp>"
and appropriate tags surounding "datavalue$"

repeat the input/write pair as needed

write #2, "</food>"
return

This subroutine accomplishes much the same thing as the editor macro just described. The only difference is that instead of modifying each line, the datum-only input line is copied onto the output file surrounded by the appropriate tags. The complete program must ensure that the first datum of each food is found, and will have to loop, recalling the subroutine until the last food's data is read. This will presumably be detected by end-of-file on the input file of raw data. If an editor is available, it is probably still easier to place the <header> and <dflt> elements et the beginning of the fife using the editor rather than building them into the program-unless the same program will be used more than just a few times.

Of course, a program could also be written that would prompt for the data to be provided as it is being tagged and written out. This might be the best means of making the file if an editor program is not available; it depends upon the availability of programmer time and skills to build such a program.

MACHINE-READABLE SOURCE DATA

The first step in dealing with non-machine-readable data is to key or scan it into the machine, and then process it into interchange format. If the source data is maintained in machinereadable format, there are two possibilities (other than the data base management system environments discussed in the next section):

On the one hand, the data might be stored as numerals and other character strings; in this case the data can be processed just as described for newly keyed-in non-machine-readable data. Such data are likely, however, to need sorting to get them into the order required by the interchange format: <ifri> and <bvname> before any food component data; <comp> data before <drvd-comp> data; etc. If all of the data for a single food can be read in before any are written out, this should not be difficult since the order of data in the input format and the interchange format is known in both cases.

It is also possible that the component data does not have the same components for every food. This is the case with the USDA data base [35]. In such cases, each data set for a component must necessarily include information identifying the component. Most ordinary text editors do not have the capability to look up generic identifiers for the various components; a specially written program will probably be needed. The program must determine the generic identifier, delete the identifying information, reorder the remaining data and tag them as appropriate, and then surround the result with the correctly generated start-tag/end-tag pair.

On the other hand, the numerical data might be stored in a "binary" representation, i.e., an internal numerical representation other than numerals made up of digits drawn from ISO 646 [40]. Few editors are suitable for this type of data; a special program will probably be needed. Existing subroutines that can read-in such data for one purpose or another should be identified; they are good starting points for creating a local-format-to-SGML conversion program.

Once the data for a given food's components are read, the values are written with appropriate bracketing tags. For example, the following BASIC subroutine will write out a <food> element for one food with a single local name and two nutrients:

100 WRITE #2, "<food>"
WRITE #2, " <classif>"
WRITE #2, " <ifri> ";IFRIS;" </ifri>"
WRITE #2, " <bvname> ";LOCALNAMES;" </bvname>"
WRITE #2, " </classif>"
WRITE #2, " <comp>"
WRITE #2, " <CA> "; CAS;" </CA>"
WRITE #2, " </comp>"
WRITE #2, "</food>"
WRITE #2, " <FE> ", FES;" </FE>"
RETURN

If, for example, the data values for calcium and iron were loaded as numbers rather than numerals (i.e., character strings), this subroutine listing would have "CA$" and "FE$" replaced respectively by "CA" and "FE".

If, in accordance with that example, the values for "LOCALNAME$", "CA$", and "FE$" are set respectively to "Banana", "5.7", and "63", and IFRI$ is set appropriately before the subroutine is called, then the following will appear in the output file:

<food>
<classif>
<ifri> an appropriate IFRI </ifri>
<bvname> Banana </bvname>
</classif>
<comp>
<CA> 5.7 </CA>
<FE> 63 </FE> </comp>
</food>

SOURCE DATA IN A DATA BASE MANAGEMENT SYSTEM

If a food composition data base is maintained within a data base management system, it will usually be most convenient to use that system's report writer to produce results in interchange format. The techniques will be similar to those just illustrated, but the "programming" language used will be that of the system's report writer. The examples below omit the <ifri> element as well as much of the identification and classification information that might be present in a production data base.

Example: The Janus data base management component Of the Consistent System.

A program called "Janus" [16, 20, 29], designed as a data organizing and management front end for the Consistent System [5], is fairly representative of the state of the art in data systems specifically designed for the management and handling of statistical data. It was included in the experiments run at the INFOODS Secretariat because it contains a collection of special operators and operations that make working with nutrient composition data quite simple. Although the system is essentially relational, the Janus design both predates SQL by some years and supports operations that do not mesh well with the relational model [18].

The Janus Command Sequence

Janus uses a non-procedural command language for all operations; while there is a host language interface, it is rarely used and should never be necessary (see [16]). Its formatted output operations occur through a command called "display", and the [single] command used to produce the output shown below is as follows:

display cntel ("<FOOD> <CLASSIF> <FDA-FFV-8807>"), FFV, cntel ("</FDA-FFV-8807> <USDA-NDB>"), ndb,
cntel ("</USDA-NDB> <EUROCODE2>"), eurocode,
cntel ("</EUROCODE2> <USDA-NAME>"), usda name,
cntel ("</USDA-NAME> </CLASSIF>") cntel ("<COMP> <ASH>"), ash,
cntel ("</ASH>"), cntel ("<CHOCDF>"), carbo, cntel ("</CHOCDF>"),
cntel ("<ENERC>"), energy268, cntel ("USDA </ENERC>"),
cntel ("<FIBC>"), fiber, cntel ("</FIBC>"),
cntel ("<PROCNT>"), protein cntel ("JONES </PROCNT>"),
cntel ("<FAT>"), total lipid cntel ("</FAT>"),
cntel ("<WATER>"), water,
cntel ("</WATER> </COMP> </FOOD>"),
in extract foods with no blocking, no enn, lnl=600
save in Mlf ">udd>INFOODS>jck>extract4.list";

In this command, the names "ndb", "eurocode", "usda name", and so forth are names of fields (which Janus calls "attributes", reflecting early relational data base management system terminology). The function "cntel" is used to force its value to be displayed along with the field (attribute) values, rather than being separated as if they were summary fields. The qualifications following "with" are specifications about the display output format, specifying that selected Janus display defaults are to be disabled or overridden. See the Janus Reference Manual [20] if additional information is needed.

The data used in this example are extracted from the USDA Standard Reference Database [35] with factored food vocabulary codings supplied by FDA prior to the workshop discussed in reference 13. Since the records being produced do not contain <ifri> elements, they are not complete interchange records.

The Janus Output

The following output has been reformatted to fit better on a page. The command specification given above produces a single "line" per food record. Additional syntax could have been supplied to the Janus display command to force the indention shown, but the formatting is not required for the interchange format and would make the syntax example considerably more obscure.

<FOOD>
<CLASSIF>
<FDA-FFV-8807> E134 A185 C245 B1201 P24 N01 M001 K03 J001 H101 F14 </FDA-FFV8807>
<USDA-NOB> 1014 </USDA-NOB>
<EUROCODE2> 1.54 </EUROCODE2>
<USDA-NAME> CHEESE: NATURAL, COTTAGE, UNCREAMED, DRY, LARGE OR SMALL CURD
</USDA-NAME>
</CLASSIF>
<COMP>
<ASH> 0.690 </ASH>
<CHOCDF> 1.850 </CHOCDF>
<ENERC> 354.027 USDA </ENERC>
<FIBC> 0.000 </FIBC>
<PROCNT> 17.270 JONES </PROCNT>
<FAT> 0.420 </FAT>
<WATER> 79.770 </WATER>
</COMP>
</FOOD>

<FOOD>
<CLASSIF>
<FDA-FFV-8807> C245 H247 A281 J001 F14 B1201 K03 M001 E125 H107 P24 N01 </FDA-FFV-8807>
<USDA-NOB> 1035 </USDA-NOB>
<EUROCODE2> 1.51 </EUROCODE2>
<USDA-NAME> CHEESE: NATURAL, PROVOLONE </USDA-NAME>
</CLASSIF>
<COMP>
<ASH> 4.710 </ASH>
<CHOCDF> 2.140 </CHOCDF>
<ENERC> 1471.019 USDA </ENERC>
<FIBC> 0.000 </FIBC>
<PROCNT> 25.580 JONES </PROCNT>
<FAT> 26.620 </FAT>
<WATER> 40.950 </WATER>
</COMP>
</FOOD>

<FOOD>
<CLASSIF>
<FDA-FFV-8807> C245 A111 F14 H107 J001 K03 M001 N01 P24 B1201 E001
</FDA-FFV-8807>
<USDA-NOB> 1046 </USDA-NDB>
<EUROCODE2> 1.56 </EUROCODE2>
<USDA-NAME> CHEESE FOOD: PASTEURIZED PROCESSED, AMERICAN, W/O Dl NA PHOS </USDA-NAME>
</CLASSIF>
<COMP>
<ASH> 5.350 </ASH>
<CHOCDF> 7.290 </CHOCDF>
<ENERC> 1373.437 USDA </ENERC>
<FIBC> 0.000 </FIBC>
<PROCNT> 19.610 JONES </PROCNT>
<FAT> 24.600 </FAT>
<WATER> 43.150 </WATER>
</COMP> </FOOD>

<FOOD>
<CLASSIF>
<FDA-FFV-8807> J135 M001 P24 N01 A148 B1201 C113 E123 F18 H001 K03
</FDA-FFV-8807>
<USDA-NOB> 1049 </USDA-NOB>
<EUROCODE2> 1.01 </EUROCODE2>
<USDA-NAME> CREAM: FLUID, HALF & HALF, CREAM AND MILK </USDA-NAME>
</CLASSIF>
<COMP>
<ASH> 0.670 </ASH>
<CHOCDF> 4.300 </CHOCDF>
<ENERC> 545.578 USDA </ENERC>
<FIBC> 0.000 </FIBC>
<PROCNT> 2.960 JONES </PROCNT>
<FAT> 11.500 </FAT>
<WATER> 80.570 </WATER>
</COMP>
</FOOD>

<FOOD>
<CLASSIF>
<FDA-FFV-8807> M001 N01 P24 J135 F18 H221 C235 B1201 A182 K03 E123 H208
</FDA-FFV-8807>
<USDA-NOB> 1059 </USDA-NOB>
<EUROCODE2> 1.03 </EUROCODE2>
<USDA-NAME> MILK: FILLED, FLUID, W/BLEND OF HYDR VEGETABLE OILS </USDA-NAME>
</CLASSIF>
<COMP>
<ASH> 0.800 </ASH>
<CHOCDF> 4.740 </CHOCDF>
<ENERC> 264.280 USDA </ENERC>
<FIBC> 0.000 </FIBC>
<PROCNT> 3.330 JONES </PROCNT>
<FAT> 3.460 </FAT>
<WATER> 87.670 </WATER>
</COMP>
</FOOD>

<FOOD>
<CLASSIF>
<FDA-FFV-8807> P24 N01 M001 K03 J133 H161 F14 B1201 E106 A148 C235
</FDA-FFV-8807>
<USDA-NOB> 1091 </USDA-NOB>
<EUROCODE2> 1.3 </EUROCODE2>
<USDA-NAME> MILK: COW, DRY, SKIM, NON-FAT SOLIDS, REGULAR,
WO/ADDED VIT A </USDA-NAME>
</CLASSIF>
<COMP>
<ASH> 7.930 </ASH>
<CHOCDF> 51.980 </CHOCDF>
<ENERC> 1516.368 USDA </ENERC>
<FIBC> 0.000 </FIBC>
<PROCNT> 36.160 JONES </PROCNT>
<FAT> 0.770 </FAT>
<WATER> 3.160 </WATER>
</COMP> </FOOD>

Example: The Oracle commercial data base management system

Oracle is a fairly typical example of a data base management system. It represents technology which is the subject of US (ANSI) and international (ISO) standards and which has already become quite common and, in some fields, dominant. It uses the operations of the SQL language, supplemented by a specialized report generator. The commands below show the report formatting and definition for Oracle using SQL [26], and are similar to those that would be required for any similar system.

The Oracle Commands

select '<FOOF> <CLASSIF> <FDA-FFV-807> '||FFV||' <FDA-FFV-8807>'
'<USDA-NDB> '||nbd||' </USDA-NDB>'
<'EUROCODE2> '||eurocode2||' </EUROCODE2>'
'<USDA-NAME> '||usda name|| '</USDA-NAME> </CLASSIF>'
'<COMP> <ASH> '||ash||' </ASH>
'<CHOCDF>' ||chocdf|| '<CHOCDF>'
'<ENERC>' ||enerc||' USDA </ENERC> <FIBC> '||fibc||' </FIBC>'
'<PROCNT> ||procnt||' JONES </PROCNT> <WATER> '||water
'</WATER> </COMP> </FOOD>' from extract_foods;

In this command, the names "ffv", "eurocode2", "usda name" and so forth are names of fields (which Oracle calls "columns"). The "||" characters are concatenation symbols, which permit the text and values to be printed consecutively.

The Oracle Output

As in the Janus example above, the following output has been reformatted to fit better on a page. The SQL statement in the Oracle example above produces a single "line" per food record. Oracle provides report formatting capability [21] which could force the indention shown below. As in the Janus example, formatting is not required for the interchange format and, in the Oracle example, would make the syntax example many pages longer and considerably more complex and obscure. The report formatting in Oracle is also not part of the standard SQL language [38, 54].

<FOOD>
<CLASSIF>
<FDA-FFV-8807> E134 A185 C245 B1201 P24 N01 M001 K03 J001 H101 F14
</FDA-FFV-8807>
<USDA-NOB> 1014 </USDA-NOB>
<EUROCODE2> 1.54 </EUROCODE2>
<USDA-NAME> CHEESE: NATURAL, COTTAGE, UNCREAMED, DRY, LARGE OR
SMALL CURD </USDA NAME>
</CLASSIF>
<COMP>
<ASH> 0.690 </ASH>
<CHOCDF> 1.85 </CHOCDF>
<ENERC> 354.027 </ENERC>
<FIBC> 0.000 </FIBC>
<PROCNT> 17.270 </PROCNT>
<WATER> 79.770 </WATER>
</COMP>
</FOOD>

<FOOD>
<CLASSIF>
<FDA-FFV-8807> C245 H247 A281 J001 F14 B1201 K03 M001 E125 H107 P24 N01
</FDA-FFV-8807>
<USDA-NOB> 1035 </USDA-NOB>
<EUROCODE2> 1.51 </EUROCODE2>
<USDA-NAME> CHEESE: NATURAL, PROVOLONE </USDA-NAME>
</CLASSIF>
<COMP>
<ASH> 4.710 </ASH>
<CHOCDF> 2.14 </CHOCDF>
<ENERC> 1471.019 </ENERC>
<FIBC> 0.000 </FIBC>
<PROCNT> 25.580 </PROCNT>
<WATER> 40.950 </WATER>
</COMP>
</FOOD>

<FOOD>
<CLASSIF>
<FDA-FFV-8807> C245 A111 F14 H107 J001 K03 M001 N01 P24 B1201 E001
</FDA-FFV-8807>
<USDA-NOB> 1046 <
/USDA-NOB>
<EUROCODE2> 1.56 </EUROCODE2>
<USDA-NAME> CHEESE FOOD: PASTEURIZED PROCESSED, AMERICAN, W/O Dl NA PHOS
</USDA-NAME>
</CLASSIF>
<COMP>
<ASH> 5.350 </ASH>
<CHOCDF> 7.29 </CHOCDF>
<ENERC> 1373.437 </ENERC>
<FIBC> 0.000 </FIBC>
<PROCNT> 19.610 </PROCNT>
<WATER> 43.150 </WATER>
</COMP>
</FOOD>

<FOOD>
<CLASSIF>
<FDA-FFV-8807> J135 M001 P24 N01 A148 B1201 C113 E123 F18 H001 K03
</FDA-FFV-8807>
<USDA-NOB> 1049 </USOA-NDB>
<EUROCODE2> 1.01 </EUROCODE2>
<USDA-NAME> CREAM: FLUID, HALF & HALF, CREAM AND MILK </USDA-NAME>
</CLASSIF>
<COMP>
<ASH> 0.670 </ASH>
<CHOCDF> 4.30 </CHOCDF>
<ENERC> 545.578 </ENERC>
<FIBC> 0.000 </FIBC>
<PROCNT> 2.960 </PROCNT>
<WATER> 80.570 </WATER>
</COMP>
</FOOD>

<FOOD>
<CLASSIF>
<FDA-FFV-8807> M001 N01 P24 J135 F18 H221 C235 B1201 A182 K03 E123 H208
</FDA-FFV-8807>
<USDA-NOB> 1059 </USDA-NOB>
<EUROCODE2> 1.03 </EUROCODE2>
<USDA-NAME> MILK: FILLED, FLUID, W/BLEND OF HYDR VEGETABLE OILS </USDA-NAME>

</CLASSIF>
<COMP>
<ASH> 0.800 </ASH>
<CHOCDF> 4.74 </CHOCDF>
<ENERC> 264.280 </ENERC>
<FIBC> 0.000 </FIBC>
<PROCNT> 3.330 </PROCNT>
<WATER> 87.670 </WATER>
</COMP> </FOOD>

<FOOD>
<CLASSIF>
<FDA-FFV-8807> P24 N01 M001 K03 J133 H161 F14 B1201 E106 A148 C235
</FDA-FFV-8807>
<USDA-NOB> 1091 </USDA-NOB>
<EUROCoDE2> 1.3 </EUROCODE2>
<USDA-NAME> MILK: COW, DRY, SKIM, NON-FAT SOLIDS, REGULAR, WO/ADDED VIT A
</USDA NAME>
</CLASSIF>
<COMP>
<ASH> 7.930 </ASH>
<CHOCDF> 51.98 </CHOCDF>
<ENERC> 1516.368 </ENERC>
<FIBC> 0.000 </FIBC>
<PROCNT> 36.160 </PROCNT>
<WATER> 3.160 </WATER>
</COMP>
</ FOOD>


Contents - Previous - Next