



9. Conversion of data from interchange format


INTRODUCTION

An interchange format file consists of many data items, generally separated by tags but occasionally by whitespace. This stream of data is usually broken into lines for convenience, but in essence a line break is just more whitespace. In many cases the file will contain data superfluous to one's interests: foods in addition to those desired, or data about unsought food components. This chapter first discusses simple cases in which selected data are found manually with an editor program. It then progresses to lengthier extractions that might require special programs. The discussion does not give full details about building such programs, but it does describe what such a program would have to accomplish.

SOME SIMPLE EXAMPLES

Consider a sample task: Find the calcium and iron content of bananas. (This assumes there is a <food> element which has "banana" as the content of one of its <bvname> elements.) The element to be searched for will include a subordinate element <bvname> banana </bvname>. So, by hand or with an editor, one must first find <bvname> banana </bvname>. The banana <food> element probably looks like

<food>
...
<classif>
<ifri> ... </ifri>
<bvname> banana </bvname>
</classif>
<fddflt> <meas/> ... </meas/> </fddflt>
<comp> ...
<CA> 5.7 </CA>
. . .
<FE> 63 </FE>
. . .
</comp>
<drvd-comp> ... </drvd-comp>
</food>

If the file is positioned at the <food> start-tag preceding <bvname> banana </bvname>, all preceding material can be erased: it is irrelevant to bananas. All material following the matching <food> end-tag ( </food> ) is likewise irrelevant and can be erased. Now one can search for <CA> and <FE> without fear of getting a value for the wrong food.
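
The same narrowing can be sketched in a program. The following is a minimal Python illustration, not part of the original text: the function name find_components, the SAMPLE data, and the apple values are invented for the example, and the sketch assumes that <food> elements are not nested and that tags carry no attributes.

```python
import re

# Invented sample data; only the banana values come from the example above.
SAMPLE = """
<food> <classif> <bvname> apple </bvname> </classif>
<comp> <CA> 1.1 </CA> <FE> 2.2 </FE> </comp> </food>
<food> <classif> <bvname> banana </bvname> </classif>
<comp> <CA> 5.7 </CA>
<FE> 63 </FE> </comp> </food>
"""

def find_components(text, food_name, tags):
    """Return {tag: value} for the first <food> whose <bvname> is food_name."""
    # Each <food> ... </food> span is examined on its own, which is the
    # programmatic equivalent of erasing everything outside the element.
    for food in re.findall(r"<food>(.*?)</food>", text, flags=re.S | re.I):
        names = re.findall(r"<bvname>(.*?)</bvname>", food, flags=re.S | re.I)
        if food_name not in [" ".join(n.split()) for n in names]:
            continue
        found = {}
        for tag in tags:
            t = re.escape(tag)
            m = re.search(rf"<{t}>(.*?)</{t}>", food, flags=re.S | re.I)
            if m:
                found[tag] = m.group(1).strip()
        return found
    return None  # no <food> with that name

print(find_components(SAMPLE, "banana", ["CA", "FE"]))
```

Because the search for <CA> and <FE> is confined to the text of one <food> element, a value belonging to another food cannot be picked up by mistake.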

Next, consider this problem: Find the names of all of the foods described in an interchange file. First, search for the first <bvname> start-tag and erase it and everything before it. Next, find the matching end-tag, </bvname>. Insert a line break immediately before the end-tag, mark the position at the start of the new line (just before the end-tag), and search for the next <bvname>. Erase everything from the marked position up to and including that second start-tag. (Note how this has left the content of the first <bvname> by itself on a line, and the content of the next <bvname> begins the next line.) Now repeat everything from finding the end-tag to erasing up to the next start-tag, over and over, until no further start-tag can be found. At that point there are no more <bvname> elements. Simply erase everything from the last-marked end-tag to the end of the interchange file. What remains is a list of local names, one per line.

How might a program be written to extract local food names automatically? Here is a sample program, written in BASIC.

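100 if eof#1 then goto 200
input #1, dataline$
rem look for the start-tag on this line
ndx = index (dataline$, "<bvname>")
if ndx = 0 then goto 100
rem keep what follows the start-tag ("<bvname>" is 8 characters long)
name$ = sub (dataline$, ndx+8)
ndx = index (name$, "</bvname>")
if ndx > 0 then goto 150
rem the end-tag is on a later line: append lines until it appears
120 input #1, dataline$
name$ = name$ + " " + dataline$
if index (dataline$, "</bvname>") = 0 then goto 120
ndx = index (name$, "</bvname>")
150 write #2, sub (name$, 1, ndx-1)
goto 100
200 end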

In this program, input lines are effectively erased by being read but not copied into the output file. When <bvname> is found, the remainder of the line is copied into "name$". Subsequent lines are tacked onto "name$" until the end-tag </bvname> is found. Then that part of "name$" prior to the end-tag is written out, and the program returns to skipping lines, looking for the next <bvname>.

The program and the editor algorithm respond differently if a <bvname> is split across two or more lines: the editor algorithm as given above does not include making the name fit entirely on one line while the program does.
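
For comparison, here is a hedged Python sketch of the same extraction. It is an illustration rather than a translation of the BASIC program: it reads the whole text at once instead of line by line, the function name food_names and the sample foods are invented, and it treats a line break inside a name as ordinary whitespace, as the introduction suggests.

```python
import re

def food_names(text):
    """Return the content of every <bvname> element, one string per element."""
    names = re.findall(r"<bvname>(.*?)</bvname>", text, flags=re.S | re.I)
    # Collapse runs of whitespace (including line breaks) to single spaces,
    # so a name split across lines comes out on one line.
    return [" ".join(n.split()) for n in names]

# Invented sample: the second name is split across two lines, and the
# second and third <bvname> elements start on the same line.
sample = (
    "<food><bvname> banana </bvname></food>\n"
    "<food><bvname> sweet\npotato </bvname></food> <food><bvname>plum</bvname></food>"
)

for name in food_names(sample):
    print(name)
```

Because the search is not tied to lines, this version also copes with two <bvname> elements that happen to fall on the same line.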

COMPLICATIONS

The preceding examples each had a greatly simplifying aspect. The first involved only one <food>; the second, only one subsidiary element of each <food>. To put it another way, the first looked at several subsidiary elements but selected only a certain <food>; the second looked at every <food>, but at only a single subsidiary element type. In addition, in theory, the <bvname> element could have a different interpretation if it occurred other than immediately subsidiary to <classif>, so that the procedures above would identify some things as food names which were not. Such uses of this particular element are unlikely in practice.

Scanning for Several or Many Components

If more than one subsidiary element is of interest, it is usually necessary to determine the boundaries of each <food> as it is being considered so that the various subsidiary elements are associated with one another and not with the subsidiary elements of another <food>. This suggests that the boundaries of a <food> element must be determined before its content is searched for the subsidiary elements, or at least, when the search is being made linearly top-to-bottom, left-to-right, that (1) the <food> start-tag is found first, and (2) as each line is searched for the start-tags of each subsidiary element, the </food> end-tag is also searched for. Note that care must be taken to cover the possibilities that a subsidiary element may occupy more than one line, that another subsidiary element may begin on the same line on which another ends, and that a subsidiary element of the following <food> could possibly occur later on the same line as the current <food>'s end-tag.

Such a parallel search (for several start-tags and an end-tag) requires care to implement. For example, if the BASIC program of the second problem above were being modified, each line would have to be checked for every tag of interest, and the tag acted upon would have to be the one that occurs first. After it is processed, the remainder of the line must be checked for the other tags. (Incidentally, the program as given above made the highly likely but not guaranteed assumption that no two <bvname> elements will fall on the same line.) If it can be guaranteed that every <food> has all of the subsidiary elements, then the end-tag need not be searched for in parallel; indeed, if it can be guaranteed (perhaps by a prior sort) that all of the subsidiary elements are not only present but in a prescribed order, then the search for start-tags can be made serially, looking for one only after the preceding one has been found and processed. If a truly parallel search is needed, a text-processing programming language or system which succinctly implements complicated text searches should be considered. Examples of such systems include Digital Equipment Corporation's VAX-TPU, the XEDIT system found on IBM's VM/SP system, most versions of the "emacs" editor, and the SNOBOL and ICON languages.
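
One way to sketch a parallel search is with a regular expression whose alternatives are tried together, so that whichever tag occurs first is the one acted upon. The Python below is a minimal illustration under stated assumptions, not a method taken from the text: the names WANTED, SCANNER, and foods, and the sweet potato data, are invented; tag names are matched exactly as spelled; and every <food> is assumed to have its end-tag.

```python
import re

WANTED = ("bvname", "CA", "FE")   # element names of interest (illustrative)

# One pattern that matches whichever comes first: a wanted element in full,
# or the </food> end-tag that closes the current food.
SCANNER = re.compile(
    r"<(?P<tag>" + "|".join(map(re.escape, WANTED)) + r")>"
    r"(?P<value>.*?)</(?P=tag)>"
    r"|(?P<end></food>)",
    flags=re.S,
)

def foods(text):
    """Yield one dict per <food>, holding the wanted elements found in it."""
    current = {}
    for m in SCANNER.finditer(text):
        if m.group("end"):
            yield current          # boundary reached: hand over this food
            current = {}
        else:
            value = " ".join(m.group("value").split())
            current.setdefault(m.group("tag"), value)

# Invented sample: an element spans two lines, and the next <food> begins
# on the same line as the previous </food>.
sample = (
    "<food> <bvname> banana </bvname> <comp> <CA> 5.7 </CA>\n"
    "<FE> 63 </FE> </comp> </food> <food> <bvname> sweet\n"
    "potato </bvname> <comp> <FE> 0.6 </FE> </comp> </food>"
)

for food in foods(sample):
    print(food)
```

Since the scan works on the character stream rather than on lines, the three line-related complications noted above need no special handling.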

If many, most, or all of the <comp> or <drvd-comp> subsidiary elements are of interest, a variation on this system might be considered. Specifically, when one has found the <comp> or <drvd-comp> start-tag, the next tag should be read no matter what it is. The tag is either a subsidiary start-tag or (after a few repetitions) the <comp> or <drvd-comp> end-tag. If it is a start-tag, its generic identifier is checked against the list of those elements of interest. If it is on the list, the data of interest are extracted; otherwise everything up to and including this element's end-tag is erased or ignored. The next tag is then again either a subsidiary start-tag or the <comp> or <drvd-comp> end-tag. The cycle repeats until the end-tag is encountered.
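
The cycle just described might be sketched as follows in Python. This is an illustration only: the names TAG and scan_comp and the <NA> element in the sample are invented, only the first <comp> element is handled, and tags are assumed to carry no attributes.

```python
import re

TAG = re.compile(r"<(/?)([A-Za-z][\w-]*)>")   # a start-tag or an end-tag

def scan_comp(text, wanted):
    """Walk the children of the first <comp> element, keeping those in wanted."""
    pos = text.index("<comp>") + len("<comp>")
    kept = {}
    while True:
        m = TAG.search(text, pos)          # read the next tag, whatever it is
        if m is None:
            raise ValueError("no </comp> end-tag found")
        closing, name = m.group(1), m.group(2)
        if closing:
            if name != "comp":
                raise ValueError("unexpected end-tag </" + name + ">")
            return kept                    # </comp> reached: the scan is done
        end_tag = "</" + name + ">"
        end = text.index(end_tag, m.end())
        if name in wanted:                 # check against the list of interest
            kept[name] = text[m.end():end].strip()
        pos = end + len(end_tag)           # skip this child, wanted or not

# Invented sample; <NA> stands for a component that is not of interest.
sample = "<food> <comp> <CA> 5.7 </CA> <NA> 1.0 </NA>\n<FE> 63 </FE> </comp> </food>"
print(scan_comp(sample, {"CA", "FE"}))
```

The list of interest is consulted once per child element, so adding more components of interest does not add more passes over the text.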

The data selected during this pass can be accumulated in an array, either placed in the proper cell as it is encountered or marked with an identifier indicating which component or derived component it is for and retained for sorting when the <food> is completely read in; in either case, the entire collection of data for that <food> can be written out in the proper order at this time.

Alternatively, the data can be written out as soon as they are encountered, provided that data fields are marked with an identifier that identifies both the food being processed and the component or derived component involved. These output records can then be sorted later at leisure. This is a particularly helpful mechanism for use on very small computers.
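
A small Python sketch of this write-now, sort-later approach follows. The record layout (food name, component, and value separated by tab characters), the function name write_tagged, and the apple value are assumptions made for the illustration; an in-memory buffer stands in for the output file.

```python
import io

def write_tagged(out, food_name, component, value):
    """Write one self-identifying record as soon as the value is met."""
    out.write(f"{food_name}\t{component}\t{value}\n")

out = io.StringIO()                      # stands in for an output file
write_tagged(out, "banana", "FE", "63")  # values arrive in file order
write_tagged(out, "banana", "CA", "5.7")
write_tagged(out, "apple", "CA", "1.1")

# Later, at leisure: sort by food name, then by component identifier.
records = out.getvalue().splitlines()
records.sort(key=lambda line: line.split("\t")[:2])
for line in records:
    print(line)
```

Nothing is held in memory during extraction except the record being written, which is why the approach suits very small computers; the sort can be done afterwards by any general-purpose sorting utility.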

Selecting Certain Foods

The problem in scanning a <food> element to determine whether it is to be processed or ignored is that one cannot determine from the start-tag of the element the identity of the food involved. Instead, the food is identified by one or more of the subsidiary data elements. It is absolutely necessary that any potentially useful data that appears in the interchange file before the food identification data be retained in the processing computer until it is determined whether or not this food is of interest. Then that data can be written out or ignored as appropriate.

Given a computer with memory big enough to store all the data of interest for any one <food> as discussed above, it might be easier (though perhaps slower) to read in all of the data of interest and all of the data needed to decide if the food is of interest for each <food> as it is encountered, and then make the decision whether to ignore or erase, or to write out all of that <food>'s data.
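
The buffering requirement can be sketched in Python as below. The sketch assumes that an earlier step has already reduced the file to a stream of (tag, value) pairs, with a ("/food", None) pair marking the end of each food; that representation, the function name select_foods, and the sample ordering (a component value arriving before the food name) are contrived for the illustration.

```python
def select_foods(events, wanted_names):
    """Return the buffered data of each food whose <bvname> is wanted."""
    selected, buffer, keep = [], [], False
    for tag, value in events:
        if tag == "/food":
            if keep:                      # decision made: write out or drop
                selected.append(buffer)
            buffer, keep = [], False
        else:
            buffer.append((tag, value))   # retain data until the food is known
            if tag == "bvname" and value in wanted_names:
                keep = True
    return selected

# Contrived stream: for the banana, a value precedes the identifying name.
events = [
    ("CA", "5.7"), ("bvname", "banana"), ("FE", "63"), ("/food", None),
    ("bvname", "apple"), ("CA", "1.1"), ("/food", None),
]
print(select_foods(events, {"banana"}))
```

The calcium value is kept even though it arrives before the name that identifies the food, which is exactly the situation the retention rule is meant to cover.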

OUTPUT FORMATS

Formats for Direct Use by People

If the data values of interest are to be perused only once or twice by a user and then abandoned, the easiest output format available is probably appropriate. Two approaches are prime candidates, with their relative convenience depending upon the strategy for selecting the data of interest from the interchange file.

If the data values are selected using a text processor or editor, it is probably easiest to delete the unnecessary data and leave the remainder of the interchange file in its original format, adding line breaks where necessary to get lines of reasonable length (e.g., at most 65 or 80 characters).

If the data values are selected by copying the desired data to an output file, they may easily be printed in a food-versus-component array: Simply use a format that prints each value in a fixed-width field, with all of the values for a single food printed on one line (presumably preceded by the name of the food). Such output formatting is available in virtually all common computer languages. For example, the FORTRAN format

(1X,A20,12(1X,A4))

would handle food names of as many as twenty characters, followed by up to twelve component values of up to four characters each, all printed on an 80-character line.

(The format describes 81 characters in all; the first, a space, is the FORTRAN carriage control character and is not printed.)
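
A comparable fixed-width layout can be sketched in Python. This is an illustration, not an equivalent of the FORTRAN format: there is no carriage control character, values are right-aligned as a matter of choice, and the function name format_row is invented.

```python
def format_row(name, values):
    """Food name in 20 columns, then up to 12 values in 4 columns each."""
    # Each value is preceded by one space; ".4" truncates over-long values.
    cells = [f" {v:>4.4}" for v in values[:12]]
    return f"{name:<20.20}" + "".join(cells)

row = format_row("banana", ["5.7", "63"])
print(row)
```

A row with all twelve values filled is 20 + 12 x 5 = 80 characters wide, matching the 80-character line mentioned above.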

Other, fancier, formats might be used for inclusion in special reports. These might include column titles, a two-page-wide array, or a non-array-oriented format. In particular, if many items of data about each component are being retained, a "paragraph" of data and descriptive material might be printed for each food. Such a "paragraph" might include various names for the food and statistical information about each primary component datum, along with identifying labels.

Formats for Computer Input

Many computerized food component data bases are maintained as a food-versus-component array of primary component data values, and can be printed out using a format similar to the human-oriented array-based format described just above. A variation on the theme, if the programming language supports variable-length fields for output of data, is illustrated by the following BASIC subroutine:

100 FOR I=1 TO 99
WRITE #1, DATA$(I);",";
NEXT I
WRITE #1, DATA$(100)
RETURN

In this example, assume that the variable-length string DATA$(I) is written out with the minimum number of characters (no leading blanks or zeroes) and that (as is usual in most BASICs) a semicolon following a WRITE datum suppresses trailing spaces, tabs, or new-lines (record-breaks) that might otherwise be automatically placed after the item. The result is a single record consisting of 100 data items written out in compact form and separated by commas.
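
For comparison, a compact comma-separated record might be written in Python as sketched below. The function name write_record is invented; the use of the standard csv module, which adds quotation marks around any value that itself contains a comma, goes beyond what the BASIC subroutine does and is a choice made for the illustration.

```python
import csv
import io

def write_record(out, values):
    """Write all values for one food as a single comma-separated record."""
    csv.writer(out, lineterminator="\n").writerow(values)

out = io.StringIO()                       # stands in for the output file
write_record(out, ["5.7", "63", ""])      # an empty string marks a missing value
print(out.getvalue(), end="")
```

Each value occupies only as many characters as it needs, and a missing value leaves two adjacent commas (or a trailing comma) so that the remaining fields stay in position.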

Alternatively, the target data base might store the data for a food in several records; for example, in the USDA standard reference data base [35], the data for one food are stored as follows. First comes a food name record, with a food-identifying numeral, a type-of-record numeral ("000"), and a name for the food. Next come several records, one for each reported component, each with a component-identifying numeral (between 001 and 998), a primary value, and possible secondary values and statistical information. Finally comes a record signifying the end of data for this one food, which contains only a type-of-record numeral ("999"). Within each record, the data fields are of constant length; each record is 80 characters, with padding as needed. Such a data collection can easily be written out by a single WRITE instruction to create the name record, a loop of WRITE instructions to emit the associated component records, and a final WRITE instruction for the end-of-food record.

Special Formats for Data Base Management Systems

Many data base management systems (DBMSs) are able to accept a large collection of data at once if the data are provided in some version of one of the array formats described above.

Alternatively, many DBMSs will accept data included in SQL commands [54]. The conversion program would emit the data, extracted from the interchange file, embedded in SQL commands which would direct the creation and initialization of new DBMS records. For example, in an appropriately defined DBMS table, the following SQL commands might add the banana data used in previous examples to a table named FOOD_TABLE:

INSERT INTO FOOD_TABLE (LOCAL_NAME, CA, FE)
VALUES ('banana', 5.7, 63)

Such a command might be written by a BASIC subroutine such as

100 WRITE #2, "INSERT INTO FOOD_TABLE (LOCAL_NAME, CA, FE)"
WRITE #2, "VALUES ('"+LOCALNAME$+"', "+CA$+", "+FE$+")"
RETURN

where LOCALNAME$, CA$, and FE$ have presumably been given the values "banana", "5.7", and "63" by another subroutine.
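
The same command can be generated in Python, as in the hedged sketch below. The function name insert_statement is invented, and the column types in the CREATE TABLE statement are assumptions, since the text does not define the table. The sketch also doubles any single quote inside the food name and makes a light check that the two values are numeric text; the in-memory SQLite data base is used only to confirm that the generated command is accepted.

```python
import sqlite3

def insert_statement(local_name, ca, fe):
    """Build the INSERT text for one food from string values."""
    for number in (ca, fe):
        float(number)                     # raises ValueError if not numeric text
    # A single quote inside a SQL string literal is written as two quotes.
    quoted = "'" + local_name.replace("'", "''") + "'"
    return ("INSERT INTO FOOD_TABLE (LOCAL_NAME, CA, FE)\n"
            f"VALUES ({quoted}, {ca}, {fe})")

statement = insert_statement("banana", "5.7", "63")
print(statement)

# Try the command against a throwaway table (column types are assumed).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE FOOD_TABLE (LOCAL_NAME TEXT, CA REAL, FE REAL)")
conn.execute(statement)
rows = conn.execute("SELECT LOCAL_NAME, CA, FE FROM FOOD_TABLE").fetchall()
```

When the conversion program talks to the DBMS directly instead of emitting a command file, passing the values as query parameters is generally safer than building the command text by hand.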

