Session 2b: The technological experience: information resources and networks


Databases and data banks
Communication networks
The electronic library
Discussion
Panel discussion 1: Achievements and limitations in international cooperation as seen by the developing countries


Chairperson: Carlos Correa

Databases and data banks


Abstract
1. Introduction
2. Some figures and definitions
3. Typology of world databases and data banks
4. Cooperation among database producers
5. Database production
6. Use of databases
7. Bibliometry applied to STI or scientometry
8. Hypertext
9. Multimedia
10. Economic problems
11. Ownership, legislation, and copyright problems
12. Conclusion
Bibliography


Nathalie Dusoulier

Abstract

After briefly tracing the growth and coverage of the database and data bank field as well as the tradition of cooperation that characterizes their creation and operation, the steps in database production are analysed and the various means of access to databases are discussed. The application of bibliometric methods to the analysis and evaluation of databases is described. The potentials for Hypertext and Multimedia are outlined. Economic aspects of the field are investigated and problems of ownership and copyright are raised.

1. Introduction

Without information, research and industry would decline. To know how to obtain information in the minimum time is an indisputable factor in competitiveness. Databases and data banks (DB), accessible on-line from a microcomputer or a videotext station (Minitel in France), can provide this decisive factor. Nowadays there are thousands of millions of references stored in thousands of databases; this represents an enormous wealth of information, really a supermarket of information, that is still largely underused.

Although the first databases appeared at the beginning of the century, on-line databases only saw the light of day in the 1960s. These early databases were generally in-house, or otherwise not openly available. They included the large limited-access on-line database systems built under contract by Lockheed (e.g. the NASA-RECON system) and by SDC (e.g. the National Library of Medicine's AIM-TWX system). It is fair to say that no sizeable public access systems were available until about 1972. Growth thereafter was rapid.

2. Some figures and definitions

The terms database and data bank are often used interchangeably and in somewhat different senses on different sides of the Atlantic. I will try in this document to standardize these terms, a task becoming more and more difficult with the development of mixed products, and will use the acronym DB as the general term. In the library sense, an "on-line bibliographic database" is generally understood to mean a collection of records held on-line in rapid-access computer store. This concept has become significantly diversified nowadays and DB differ not only in the content of each entity described, but more and more in the media used.

Bibliographic databases can be differentiated by their contents: simple bibliographic citations, citations accompanied by index terms, or complete references containing citation, abstract, and index terms. The abstracts can contain or be accompanied by factual or numeric data, evaluated or not. Full-text DB are now rapidly expanding. DB also differ by the media in which they are available (diskettes, magnetic tapes, etc.); we can also see a dramatic development of CD-ROM and a diversification of access methods (ASCII or videotext). Nomenclature and description have become so complex that increasingly sophisticated DB catalogues have become best sellers.

As for the number of databases available, the latest figures are 5,200 issued by 2,200 producers and available on nearly 800 hosts. If we compare these figures to those of 1980, the growth has been spectacular. In 1979/1980, the Cuadra Directory of Online Databases listed 400 DB, 220 producers, and 59 hosts, while today it lists 6,414 DB. We must also note the recent appearance of a new type of database that mixes textual or numeric data with chemical structure diagrams, photographs, weather maps, trademarks, logos, or illustrations. These graphics DB require special equipment for their use.

In 1989 the Information Market Observatory (IMO) of the European Community recorded 1,048 different DB produced in the Community and 2,214 produced in the United States. These figures include only DB accessible in ASCII up to the end of 1989, and thus exclude DB that are only produced in videotext. According to the IMO, the United States produces about twice as many ASCII DB as Europe, but the European Community has a higher growth rate: 101 new DB, representing 11 per cent growth, in Europe and 151 (7 per cent) in the United States. In Europe, Great Britain dominates with 34 per cent of the production.
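
The growth rates quoted above can be checked with a short calculation. The sketch below assumes that the totals of 1,048 and 2,214 are end-of-1989 counts that already include the new DB, which the text does not state explicitly; the function name is illustrative.

```python
def growth_rate(total_end_of_year: int, new_in_year: int) -> float:
    """Growth in per cent, relative to the stock at the start of the year."""
    stock_at_start = total_end_of_year - new_in_year
    return 100.0 * new_in_year / stock_at_start


# Figures quoted from the IMO for ASCII DB in 1989.
europe = growth_rate(1048, 101)
united_states = growth_rate(2214, 151)

print(f"Europe: {europe:.1f}%  United States: {united_states:.1f}%")
print(f"US/Europe ratio of ASCII DB: {2214 / 1048:.2f}")
```

Under that assumption the rounded rates agree with the 11 per cent and 7 per cent given in the text.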

It is estimated that there are some 25,000 videotext services within the EEC. Half of them are located in France, which has the largest installed base of videotext terminals (about 6 million).

The CD-ROM market is also growing very quickly: the number of titles published doubles each year. It is expected that the number of titles (about 750 in 1989) will increase to more than 6,000 worldwide in 1992.

As for the typology of DB, production of ASCII DB is dominated by bibliographic data in Europe (62 per cent of new DB) and by full text and factual DB in the United States (76 per cent of new DB). There is no significant difference between the United States and Europe as regards the target markets: the sector with the most products is that of services (especially banking, finance, and insurance). In the industrial sector, chemistry is one of the most important subjects.

The total turnover of the DB industry in 1989 rose to 35 billion francs (US$6.5 billion), of which 10 per cent was from videotext DB. For Europe, it is close to 14 billion francs (nearly US$3 billion). In 1993, these figures should reach 72 billion francs (about US$12 billion) and 28 billion francs (US$6 billion) respectively. The United States dominates the market with 56 per cent. English is confirmed as the pre-eminent language for documentation, three-quarters of the DB being accessible in it. Japan imports more electronic information than it exports; this is one of the few areas where its trade balance is negative.

3. Typology of world databases and data banks

This typological study, which is only indicative, has been carried out by analysing information in the Cuadra Directory, mentioned above. The percentages have been calculated from the 6,414 bases in the Cuadra Directory, except for the percentages for the main fields covered, which were obtained by extrapolation from 877 references selected randomly.

The percentages given are only indicative of the overall distribution. They are not exact shares, because the same base can be cited several times under different headings. Thus, for the 877 references selected for fields covered, adding the number of bases in each field gave a total of 1,296. In other words, a base is listed an average of about 1.5 times. The distribution of bases is nevertheless expressed as a percentage of the number of bases studied (877) and not of the number of listings obtained (1,296). This is why the total exceeds 100 per cent, even though a significant number of fields with few references (less than 1 per cent each, e.g. urban planning, biotechnology, contracts and awards) have not been listed. During the search, 275 different fields were encountered, so the principal fields covered have had to be grouped together.
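
The effect of multiple listings on the percentages can be illustrated with a small calculation. In the sketch below the four sample databases and their headings are invented for illustration; only the figures 877 and 1,296 come from the text.

```python
from collections import Counter

# Hypothetical sample: each database is listed under one or more headings.
sample = {
    "DB-A": ["Finance, economics", "Business, industry, trade"],
    "DB-B": ["Finance, economics"],
    "DB-C": ["Chemistry", "Energy, environment"],
    "DB-D": ["Business, industry, trade"],
}


def field_shares(listings):
    """Share of each field in per cent, with the number of databases as the base."""
    counts = Counter(f for fields in listings.values() for f in set(fields))
    n_databases = len(listings)
    return {f: 100.0 * c / n_databases for f, c in counts.items()}


shares = field_shares(sample)
total = sum(shares.values())  # exceeds 100 when databases are listed more than once
mean_listings = sum(len(set(f)) for f in sample.values()) / len(sample)

# Figures from the text: 1,296 listings for 877 sampled databases.
ratio = 1296 / 877

print(f"sample: total {total:.1f}%, {mean_listings:.2f} listings per database")
print(f"text:   total {100 * ratio:.1f}%, {ratio:.2f} listings per database")
```

Because the base is the number of databases rather than the number of listings, the shares add up to the mean number of listings per database times 100.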

Principal fields covered

Finance, economics                                                       29.0%
Business, industry, trade                                                25.0%
Law, justice                                                              9.0%
Energy, environment                                                      10.8%
Government, defence                                                       4.8%
Chemistry                                                                 3.7%
Medicine, toxicology, nutrition, agriculture                              5.6%
Science and technology                                                    3.4%
Social sciences, humanities, administration, human resources management   5.6%
Research, innovation                                                      2.6%
Computer, engineering, communications, telecommunications                 7.3%
Miscellaneous                                                            16.7%

Principal producer countries

United States   62.4%
England          7.6%
Canada           6.4%
France           5.8%
Germany          4.7%
Australia        2.3%
Spain            2.0%
Italy            1.9%
Japan            1.4%
Sweden           1.3%
Netherlands      1.2%

Principal distributing countries

United States   66.8%
England         13.4%
Canada           7.7%
Germany          6.0%
France           6.0%
Italy            3.3%
Australia        2.2%
Spain            2.0%
Japan            1.5%
Sweden           1.4%
Netherlands      0.9%

Principal types of databases

Reference databases
- bibliographic     18.8%
- referral          16.4%
Source databases
- numeric           17.5%
- textual numeric    6.7%
- full text         28.2%
- images             4.2%

Principal languages used

English         72.0%
French           9.3%
German           5.2%
Spanish          3.5%
Italian          2.3%
Swedish          1.4%
Dutch            1.1%

4. Cooperation among database producers

We have to go far back in time if we want to retrace the whole history of cooperation between information services. Indeed, it was in 1896 that the Royal Society organized the first Conference for the Joint Production of the International Catalog of Scientific Literature, a complete abstracting and indexing service that was to last 25 years. In 1948-1949, Unesco organized several conferences on abstracting and indexing in biology and medicine and on scientific document analysis. These conferences led to the creation of the ICSU-AB, which was the forerunner of today's ICSTI. The Conference on Scientific Information organized in 1948 by the Royal Society and the International Conference on Scientific Information held in Washington, D.C., in 1958 are two other examples of this will to cope with the new communication needs that scientific information organizations had to face in the science and technology context of the time.

These issues were formalized for the first time on an international basis when the proposals of a three-year (1968-1970) Unesco-ICSU feasibility study were considered at the UNISIST intergovernmental conference convened by Unesco (1971) to prepare the ground for a worldwide scientific information system. In 1971, the keywords, so to say, were enhancement of scientific information as a basic resource and promotion of international cooperation. It was strongly felt that this resource should become more accessible and easier to use, so that as a global wealth, it could best contribute to the scientific, educational, social, cultural, and economic development of all countries of the world.

During this period, a number of approaches to international cooperation were discussed. Ron Smith, of BIOSIS, gave an overview of the situation in one of the Miles Conrad Memorial Lectures.

The first approach grew out of some of the work of the International Council of Scientific Unions Abstracting Board (ICSU-AB) and was largely based on bilateral agreements that followed discussions by all member services in a particular discipline. This type of international cooperation is still pursued. An example of this concerns an interaction that developed in an experimental way among three abstracting journals in the area of physics: the English-language journal Physics Abstracts (PA), the French-language journal Bulletin Signalétique (BS), and the German-language journal Physikalische Berichte (PB). ICSU-AB conducted a survey of the journal literature of physics as covered by these three abstracting services. A comparison identified that there were three levels of productivity in the journals that were scanned by the various services. Quite arbitrarily, distinctions were drawn between journals of so-called high productivity (producing more than 100 titles per year), medium productivity (11-99 titles per year), and low productivity (10 or fewer titles per year). In the high-productivity group, there were approximately 80 periodicals common to the three abstracting journals; they were thought of as no more than a useful list and provided perhaps yet another definition of core journals.
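
The three productivity bands used in the survey amount to a simple threshold rule. The sketch below is only an illustration of that rule; the function name is invented, and because the text leaves a journal with exactly 100 titles unassigned, placing it in the high band is an assumption.

```python
def productivity_level(titles_per_year: int) -> str:
    """Classify a journal by the number of titles it yields per year."""
    if titles_per_year < 0:
        raise ValueError("titles_per_year must not be negative")
    if titles_per_year <= 10:
        return "low"      # 10 or fewer titles per year
    if titles_per_year <= 99:
        return "medium"   # 11-99 titles per year
    # The text says "more than 100"; treating exactly 100 as high is an assumption.
    return "high"


for n in (3, 10, 11, 99, 250):
    print(n, productivity_level(n))
```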

In the area of low productivity, the situation was much more interesting. PA and PB scanned some 1,600 periodicals, and each of their lists contained more than 350 periodicals that fell into the low-productivity category. The acquisition of and selection from these periodicals may be expensive and difficult to justify in terms of return, but may nevertheless be essential if the coverage is to be satisfactory.

This experiment, although very interesting, was not continued for very long and the cooperation between the services was followed up by the creation of a common classification system in physics.

The second approach to international cooperation was undertaken by the ICSU/Unesco (International Council of Scientific Unions/United Nations Educational, Scientific, and Cultural Organization) Committee (UNISIST), in the framework of assessing the feasibility of the previously mentioned World Science Information System. The study was based on the concept that this World Science Information System would be a flexible network based on the voluntary cooperation of existing and future information services. It was headed by a central committee chaired by Professor Harrison Brown and was supported by a number of working groups and an advisory panel that comprised representatives of some of the large existing operating systems. One working group was concerned with questions of standardization of bibliographic descriptions that could serve for classification, indexing, and abstracting; a second working group was concerned with the identification of research problems that had to be studied to achieve an efficient worldwide system; a third group studied the problems of natural and machine languages, especially from the point of view of transferability and mechanized processing; and a fourth group worked on the problems of developing countries and their contribution and access to a worldwide system.

Certain of these actions were pursued later in the framework of Unesco's UNISIST programme and continue under its General Information Programme, but at the overall level the spirit of cooperation no longer exists.

The third approach to international cooperation, which is not attributable to any one organization but which has been fairly widely discussed in Europe, was that of establishing one information service in each of the disciplines into which the existing services would all feed; it was first assumed that the information services established would be in English and that services in other languages would agree to feed material into the system. It was then proposed that instead of having the same information processed a number of times in different languages, it would be processed only once, in one language, and subsequently made available by translation, if necessary. Some of the thinking of this nature was based on the concept that if cooperation of the kind referred to in the ICSU-AB activities were to develop, we might end up with a number of different services dealing with identical material, each in its own language. There may be much merit in the proposal of a one-language system, but there are obvious difficulties too, and not only the cultural or political ones. The current cooperation between Medlars centres is still based on this philosophy.

The fourth approach - the mission-oriented concept - is well known and a few international systems of this kind are operating. One is INIS, of the International Atomic Energy Agency, another is AGRIS, of the FAO; still others are less well known but probably equally effective.

Nowadays, the problems of international cooperation are viewed in a rather different way, since networks or other technologies that allow efficient exchanges of information have rendered obsolete the exchange methods envisaged 20 years ago. Collections of databases are created above all at the level of the hosts, who provide the necessary interfaces with the users and eliminate duplication where necessary.

Database producers get together at national levels to discuss technical problems, the market, and competition not only among themselves but also with other contenders entering the field that the database producers had for a long time considered their own. The most important organization of this type is the NFAIS (National Federation of Abstracting and Information Services) in the United States, which also contains some foreign members. In France, the GFFIL represents the principal producers. Others exist in several countries.

At the international level, the International Council for Scientific and Technical Information (ICSTI) offers great potential. The ICSTI is an international, not-for-profit organization whose purpose is to increase accessibility to and awareness of scientific and technical information. Established in 1952 as the ICSU-AB (the International Council of Scientific Unions Abstracting Board), it has evolved over the past four decades from an organization that was initially concerned with the development of abstracting and indexing services to a wide-ranging forum devoted to the improvement of scientific and technical information transfer.

Scientific and technical information, and specialized information at large, by adding value to the use of science and technology, provide researchers, administrators, and firms with the information necessary for their professional activities. Technological advances have brought many changes to the storage, processing, delivery, and use of information. At the same time, user needs have also evolved and led to the creation of more sophisticated and specialized information products.

The role of the ICSTI is to enable its members to exchange views and ideas on all the aspects of this evolution, to make progress in their comprehension, and to contribute to the development of appropriate tools better able to meet the information requirements of the world community of scientists and technologists.

In the database field, cooperation, whatever its objective or the level at which it is established, is now irreversible.

5. Database production

Database production has evolved dramatically in the last 30 years. From the totally manual production in the 1960s of references on paper worksheets, of which sections were cut off and sorted manually (slips the size of a postage stamp for authors and slips a little bigger for index terms), to the quasi-automatic production of data distributed on-line, several generations of systems have come and gone.

I will not dwell on the past and will try in this chapter to look at the stages of modern database production, although the processes used are still very diverse. I will aim at a description of the traditional functions and the way in which they are most often handled.

To produce a database, several preliminary operations must be carefully carried out. First is the definition of the subject or subjects covered as a function of the target market or the users to be served. Once the subject has been defined, the scope and depth of coverage are also fundamental points. Determining the number and types of sources (document types, document languages) will also have an influence on the database content, on the costs of the acquisition of these sources and of the data production, and on the number of staff required to process them.

The format definition for each entry is also an important part of the preliminary work: databases can contain simple bibliographic references, indexed or not, complete references with abstracts and index entries, full text, or any combination of these.

The choice of format has a major impact on the budget and the type of staff required, but also on user satisfaction (costs, processing times, etc.). Any derivative products must also be considered here. Once all these options have been decided, database production can begin.

5.1 Bibliographic Description

Bibliographic description is the first operation in the processing of each document. Very precise rules exist for compiling bibliographic references. National and international standards allow producers to process this information in consistent ways, thus allowing exchange. The elements treated are the title and its variants, author names, their affiliations, the titles of the publications from which the references have been taken, their type, editors' names, publication language, etc.

The role of standards organizations such as the ISO and its Technical Committee 46 on documentation, ANSI in the United States, the BSI in the United Kingdom, and AFNOR in France has been crucial. However, because of the slowness of international procedures, other organizations, including the IFLA and Unesco, have taken the initiative, and quasi-standards such as the ISBD(S), ISBD(M), the UNISIST Reference Manual, the CCF, and the ISDS, approved or not by the ISO, have played and still play a major role in these fields.

5.2 Conceptual Processing of Documents

5.2.1 Document Abstracts

The addition of abstracts to bibliographic references enormously enhances a database. For many years, these abstracts have been written by information scientists according to rules learnt in schools of librarianship or information science. Several types can be distinguished: analytic or indicative, objective, or oriented towards a particular type of user. The relative qualities of one or the other have been the subject of many discussions, even polemics. Nowadays most texts are accompanied by author abstracts, and the use of these is becoming more and more common. However, there are two types of use: author abstracts used unmodified and in full, or author abstracts edited and/or shortened. Where the database producer decides to create its own abstracts according to the editorial policy of the DB, this work is often subcontracted to outside workers.

Instead of abstracts, some databases, called full-text databases, enter the complete text of the document, whether textual or numeric. The data-capture methods may be different, the current trend being to digitize the text and then, depending on subsequent processing, to store it either in ASCII or image form.

5.2.2 Indexing

Indexing is essential to enable retrieval of the stored documents. The operation varies greatly from one database to another. The quality of retrieval depends on the specificity of the indexing. Current systems vary from simple assignment of a limited number of unstructured terms to multi-level hierarchical indexing from sophisticated controlled vocabularies. Indexing consistency is an important factor for DB quality. Indexing is an expensive operation because it requires a large number of qualified staff. This is why most organizations are looking towards indexing methods, aided or not by expert systems, that will allow the introduction of completely automatic indexing.

5.2.3 Indexing Systems

Before discussing indexing systems, we must define the terms. Indexing is an operation that consists of describing and characterizing a document with the aid of representations of concepts contained in the document. In other words the concepts are converted into documentary language after extracting them from the document by analysis.

Indexing must allow effective searching for information in document sources. It leads to the recording of concepts contained in documents in an organized and easily accessible form, i.e. to the production of documentary search tools (catalogues, indexes, files, etc.).

Indexing is a documentary concept that is still of current interest. However, today a new qualifier more and more frequently accompanies it: "automatic" indexing, or more exactly "computer-assisted" indexing. In fact, the large and ever-growing volume of documents to process in order to make information available to the user as fast as possible makes it necessary to look for ways to speed up the processing.

Other factors encourage automation:

- the existence, in the indexer's work, of repetitive tasks without intellectual added value;
- the search for indexing quality and homogeneity for better access to the information (database reliability);
- the very significant economic realities: "We are arriving at a situation where the costs of human indexing exceed the costs of computer-assisted indexing."

Although indexing is considered an art that requires many qualities on the part of the indexer, the modelling of the thought process, the production of knowledge-based tools (dictionaries, thesauri, knowledge bases, etc.), and the development of computing techniques combine to give efficient help to the indexer. This help is available at different stages of the indexing process, either at the time of concept identification and selection in the document (analysis) or in their conversion into documentary language (coding). The intellectual processes used during indexing cannot be perfectly and completely automated. This is why we must instead speak of "computer-assisted indexing." Research in this field takes several forms. The simplest form is autoposting or generation of co-occurring terms. But it is now more complete and more advanced. Indexing systems that have been implemented fall into several types:

- statistical model: calculation of appearance frequencies of significant terms in a document;
- probabilistic model: co-occurrences method with creation of semantic networks between associated terms (Leximappe or Passat systems);
- procedural model: with a thesaurus and procedural rules allowing conversion of textual terms into thesaurus descriptors (machine-aided indexing or Medindex System);
- linguistic model: with morphological and syntactic analyses (Aleth, Spirit, or Darwin systems).
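
The first two models in the list can be sketched in a few lines. The code below is a minimal illustration, not a description of how Leximappe, Passat, or any other named system works; the stop-word list, the sample texts, and the function names are all assumptions made for the example.

```python
import re
from collections import Counter
from itertools import combinations

# A deliberately tiny stop-word list, chosen only for this example.
STOPWORDS = {"a", "an", "and", "by", "in", "of", "on", "the", "to"}


def significant_terms(text):
    """Lower-case words of the text, with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def frequency_index(text, top_n=3):
    """Statistical model: propose the most frequent significant terms."""
    return Counter(significant_terms(text)).most_common(top_n)


def cooccurrences(documents):
    """Count how often two terms appear together in the same document."""
    pairs = Counter()
    for doc in documents:
        terms = sorted(set(significant_terms(doc)))
        pairs.update(combinations(terms, 2))
    return pairs


abstract = ("Laser cooling of atoms. Laser light slows atoms; "
            "cooling limits depend on the laser.")
titles = [
    "laser cooling of atoms",
    "laser spectroscopy of atoms",
    "cooling of molecules",
]

print(frequency_index(abstract, top_n=1))
print(cooccurrences(titles)[("atoms", "laser")])
```

Pairs with high co-occurrence counts are the raw material from which associations between terms could be built; a real system would add weighting and normalization that this sketch omits.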

It will perhaps be a combination of these approaches and an exploitation of their complementary features that will lead to the automatic representation of document content. It is important to remember that indexing is an operation that is not only relevant at the time of document entry but also at the time of searching.

The possibility of user assistance represents a significant challenge for both producers and distributors of information.

5.2.4 Documentary Languages

Documentary languages are indexing tools that allow the transcription, in a concise and standardized form, of the concepts contained in the documents to be analysed. They provide a bridge between the natural language in the documents forming the documentary resource and that of the users' questions.

Indexing tools fall into two main types:

- languages with a hierarchical structure, called classificatory (classifications, etc.);
- languages with a combinatory structure (lexicons, thesauri, etc.).

Classification is historically the first type of indexing tool to have been used in documentary systems. These classificatory languages are based on the prior coordination of ideas to express a concept and on the interlocking of classes of concepts. They go from the general to the particular, each class being contained in the previous one, and the whole can be represented as a hierarchical tree. Classifications in general use employ codes or indices (numeric, alphabetic, or alphanumeric) to represent concepts. Examples are the Dewey Decimal classification, the Universal Decimal classification, and the Library of Congress classification.

Classifications have the advantage of offering a general logical framework and the possibility of enlarging or restricting the subject at will by using the hierarchy of classes and subclasses. However, they are awkward to update, complex ideas can only be expressed with difficulty, and their rigid structure requires a formalized approach towards the concept searched for.
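
Enlarging or restricting a subject through the hierarchy can be pictured with decimal-style codes, where a longer code denotes a subclass of the shorter code it begins with. The sketch below uses a handful of codes and captions in the style of a decimal classification; the document identifiers and function names are invented for the example.

```python
# Illustrative class codes: each extra digit narrows the subject.
CLASSES = {
    "5": "Natural sciences",
    "53": "Physics",
    "535": "Light",
    "536": "Heat",
    "54": "Chemistry",
}

# Hypothetical documents, each filed under one class code.
DOCUMENTS = {"doc1": "535", "doc2": "536", "doc3": "54", "doc4": "53"}


def search(code):
    """All documents filed under the class or any of its subclasses."""
    return sorted(d for d, c in DOCUMENTS.items() if c.startswith(code))


def broaden(code):
    """Move one level up the hierarchy by dropping the last digit."""
    return code[:-1] if len(code) > 1 else code


print(search("535"))            # restricted to one subclass
print(search(broaden("535")))   # enlarged to the whole parent class
print(search("5"))              # the most general class
```

The same prefix structure also shows the rigidity mentioned above: a document can sit in only one place in the tree unless it is filed several times.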

The problems posed by classes interlocking with others have led to the development of new types of indexing tools: combinatory languages. The thesaurus is a typical example. This is an organized authority list of descriptors and non-descriptors obeying appropriate terminological rules and inter-linked by semantic relationships (hierarchic, associative, or equivalent). This is a vocabulary that is standardized in form (rules for expressing terms: singular, plural, prepositions) and controlled with respect to the sense of terms (synonymous or quasi-synonymous terms represented by a single term). It is organized because it establishes between the terms the hierarchic and associative relationships that constitute the semantic environment. In this it differs from a simple alphabetic list of terms, which is a lexicon.

A thesaurus is very flexible to use because the indexing is carried out by a simple combination of key words. The hierarchic relationships allow indexing at different levels of specificity or generality and the associative relationships allow navigation from one term towards other ideas, thus also leading to a better understanding of the vocabulary. It must be emphasized that updating is simple and allows the evolution of science to be followed step by step, by adding without difficulty new ideas in the form of candidate descriptors.
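
A thesaurus of this kind maps naturally onto a small data structure holding descriptors, non-descriptors, and the three kinds of relationship. The following is a minimal sketch under that reading; the class and method names and the sample vocabulary are invented, and no particular thesaurus standard is implied.

```python
from dataclasses import dataclass, field


@dataclass
class Descriptor:
    term: str
    broader: set = field(default_factory=set)    # hierarchic, upwards
    narrower: set = field(default_factory=set)   # hierarchic, downwards
    related: set = field(default_factory=set)    # associative


class Thesaurus:
    def __init__(self):
        self.descriptors = {}
        self.use = {}  # equivalence: non-descriptor -> preferred descriptor

    def add(self, term, broader=None):
        """Add a descriptor, optionally under a broader descriptor."""
        d = self.descriptors.setdefault(term, Descriptor(term))
        if broader is not None:
            parent = self.descriptors.setdefault(broader, Descriptor(broader))
            d.broader.add(broader)
            parent.narrower.add(term)
        return d

    def add_synonym(self, non_descriptor, descriptor):
        self.use[non_descriptor] = descriptor

    def relate(self, a, b):
        self.descriptors[a].related.add(b)
        self.descriptors[b].related.add(a)

    def preferred(self, term):
        """Replace a non-descriptor by its descriptor."""
        return self.use.get(term, term)

    def expand(self, term):
        """The descriptor together with all its narrower descriptors."""
        seen, stack = set(), [self.preferred(term)]
        while stack:
            t = stack.pop()
            if t in seen or t not in self.descriptors:
                continue
            seen.add(t)
            stack.extend(self.descriptors[t].narrower)
        return seen


t = Thesaurus()
t.add("physics")
t.add("optics", broader="physics")
t.add("lasers", broader="optics")
t.add("holography", broader="optics")   # adding a new descriptor is one call
t.relate("lasers", "holography")
t.add_synonym("optical masers", "lasers")

print(t.preferred("optical masers"))
print(sorted(t.expand("optics")))
```

Indexing at a more general or more specific level then amounts to moving along the broader and narrower links, and the related links give the navigation from one term to neighbouring ideas.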

Documentary searching with combinatory languages is undertaken with the Boolean operators "AND," "OR," and "NOT" and is remarkably flexible, based on intersection, addition, or exclusion of ideas. It can be made more efficient, however, by the addition of syntax to the language. This can be in the form of role indicators, or links indicating the function of a term in the indexing or its relation with another term.
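
Intersection, addition, and exclusion correspond directly to set operations on the lists of records indexed under each term. The sketch below assumes a toy collection of four records; the record numbers, index terms, and function names are invented, and role indicators and links are not modelled.

```python
from collections import defaultdict

# Hypothetical records, each indexed with a few descriptors.
RECORDS = {
    1: {"lasers", "optics"},
    2: {"lasers", "medicine"},
    3: {"optics", "holography"},
    4: {"medicine"},
}


def build_inverted_index(records):
    """Map each descriptor to the set of records indexed with it."""
    index = defaultdict(set)
    for record_id, terms in records.items():
        for term in terms:
            index[term].add(record_id)
    return index


INDEX = build_inverted_index(RECORDS)


def posting(term):
    """Records indexed under a term (empty set if the term is unknown)."""
    return INDEX.get(term, set())


both = posting("lasers") & posting("optics")        # lasers AND optics
either = posting("lasers") | posting("optics")      # lasers OR optics
without = posting("lasers") - posting("medicine")   # lasers NOT medicine

print(sorted(both), sorted(either), sorted(without))
```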

Indexing languages, by the fact that they are a representation of fields of knowledge, have a privileged place in the current development of information systems. Far from being obsolete, the thesaurus retains all its value in the creation of knowledge bases, by the management of the transmitted message, the choice of terms, the richness of their relationships, and their categorization. In addition, linguistics and mathematics today provide indexing languages with firm bases for their description and formalization.

5.3 Data Formats and Capture

Processing formats of databases were the subject of much work and discussion in the 1960s and 1970s; the format most often used is that recommended by the ISO 2709 standard. This offers a general structure, a framework designed specially for communication between systems. Although it was not designed as an internal format for systems, it has strongly influenced them. It provides a structure for data exchange formats such as UNIMARC and its derivatives, the Unesco Common Communication Format (CCF), or other formats less generally accepted but used. Currently there is a move towards the replacement of this type of format by one more suitable for electronic publishing, such as SGML (ISO 8879:1986).
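
As a rough illustration of the kind of framework ISO 2709 provides, a record consists of a fixed-length label of 24 characters, a directory that locates each field, and the variable-length fields themselves. The sketch below builds and reads back a simplified record. The tags 200 and 700, the field contents, and the function names are illustrative choices; indicators and subfields are left out, and lengths are counted in characters on ASCII-only data.

```python
FIELD_END = "\x1e"    # terminates the directory and every field
RECORD_END = "\x1d"   # terminates the record


def build_record(fields):
    """Build a simplified record from (three-character tag, data) pairs."""
    directory, body, start = "", "", 0
    for tag, data in fields:
        chunk = data + FIELD_END
        # Directory entry: tag (3), field length (4), starting position (5).
        directory += f"{tag}{len(chunk):04d}{start:05d}"
        body += chunk
        start += len(chunk)
    directory += FIELD_END
    base = 24 + len(directory)           # where the field data begin
    length = base + len(body) + 1        # +1 for the record terminator
    label = (f"{length:05d}"             # positions 0-4: record length
             + " " * 5                   # positions 5-9: unused in this sketch
             + "00"                      # positions 10-11: no indicators, no subfields
             + f"{base:05d}"             # positions 12-16: base address of data
             + " " * 3                   # positions 17-19: unused in this sketch
             + "4500")                   # positions 20-23: directory entry map
    return label + directory + body + RECORD_END


def parse_record(record):
    """Return (record length, list of (tag, data)) from a simplified record."""
    length = int(record[0:5])
    base = int(record[12:17])
    len_len, start_len, impl_len = (int(c) for c in record[20:23])
    entry_len = 3 + len_len + start_len + impl_len
    directory = record[24:base - 1]      # leave out the directory terminator
    fields = []
    for i in range(0, len(directory), entry_len):
        entry = directory[i:i + entry_len]
        tag = entry[:3]
        field_len = int(entry[3:3 + len_len])
        field_start = int(entry[3 + len_len:3 + len_len + start_len])
        begin = base + field_start
        fields.append((tag, record[begin:begin + field_len - 1]))
    return length, fields


sample_fields = [("200", "Databases and data banks"),
                 ("700", "Dusoulier, Nathalie")]
record = build_record(sample_fields)
print(parse_record(record))
```

Because every field is located through the directory, a receiving system can read the record without knowing in advance which fields it contains, which is what makes the structure suitable for exchange.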

Character by character data capture from worksheets has been the approach used by most systems for many years and it continues to predominate. The variety of equipment used for this data entry is nearly unlimited. There is a wide choice among microcomputers (PC type), either alone or networked, minicomputers with workstations, or terminals connected to a mainframe. The choice is governed by the size of the database, the flexibility required, and the level of sophistication of the indexing techniques. Capture on formatted screens brings significant advantages. Data capture systems increasingly include checking programmes, spelling correction, or several other facilities to improve quality and productivity. There are many difficulties, however, in dealing with different alphabets or chemical or mathematical formulas, which require special treatment.
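
One simple kind of checking programme verifies identifiers that carry a check digit at the moment they are keyed in. The text does not say which checks the capture systems perform, so the example below is only one possible case: validation of an ISSN, whose eighth character is a check digit computed modulo 11 from the first seven digits. The function names are illustrative.

```python
def issn_check_digit(first_seven: str) -> str:
    """Check character for the first seven digits of an ISSN."""
    # Weights run from 8 down to 2 across the seven digits.
    total = sum(int(d) * w for d, w in zip(first_seven, range(8, 1, -1)))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)


def is_valid_issn(issn: str) -> bool:
    """True if the string has the ISSN form and a correct check character."""
    chars = issn.replace("-", "").upper()
    if len(chars) != 8 or not chars[:7].isdigit():
        return False
    return issn_check_digit(chars[:7]) == chars[7]


for candidate in ("0378-5955", "0378-5954", "2434-561X"):
    print(candidate, is_valid_issn(candidate))
```

A check of this kind catches most single-character keying errors at capture time, before the record reaches the database.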

This area of database production seems to be the one in which the most significant changes are appearing and will appear in the future. Firstly, the increasingly widespread use of electronic publishing by primary journal publishers should allow automatic capture of data in machine-readable form. Digitization technologies (scanning) are also increasingly used, at least for the capture of abstracts, which are then processed by an OCR system. More sophisticated expert systems should allow quasi-automatic capture of other elements. The results of these experiments will depend substantially on the standardization of journals.

