



Session 3: New technologies and media for information retrieval and transfer


The potential offered by "extended retrieval"
Information retrieval: Theory, experiment, and operational systems
Computerized front-ends in retrieval systems
Multimedia technology: A design challenge
Discussion


Chairperson: Martha Stone

The potential offered by "extended retrieval"


Abstract
1. Introduction
2. Four information retrieval "architectures"
3. Illustrations of extended retrieval
4. Some technical issues
5. Conclusion
References


Michael K. Buckland

Abstract

The traditional form of information retrieval is composed of a single resource file and a single retrieval mechanism. In the environment created by the new information technology, many resources and many computers are linked by networks. This environment requires an extension of information retrieval techniques to include retrieval from multiple files and the use of multiple retrieval mechanisms. Some benefits and technical consequences of "extended retrieval" are reviewed.

1. Introduction

The traditional form of an information retrieval system is composed of two parts: a resource file and a retrieval mechanism. A bibliographic retrieval system or an on-line library catalogue, for example, is composed of a file of bibliographic records and a retrieval mechanism designed to perform the most commonly desired searches on that file of records, such as a search by author, title, or subject. It is, in effect, a unitary system, a single system composed of one resource file and of one retrieval mechanism.

The new information technology is leading to a new computing environment. The cost-effectiveness of computer hardware is increasing, the cost of electronic storage is decreasing, and connectivity through telecommunications is becoming pervasive and less expensive. Meanwhile, labour costs and building costs continue to rise. These changing conditions are resulting in a new environment in which:

- workstations are becoming widely available;
- very large sets of data can be stored economically;
- many thousands of computers are interconnected over local, national, and international networks; and
- the standards and protocols necessary for effective cooperation are being developed and adopted.

In this situation, we find a rapidly growing number of databases, an increasing use of databases, and a trend for individuals to use a number of heterogeneous databases. The result is increased complexity for the searcher and a greater need for expertise to identify what resources exist and how to use them cost-effectively. (For a convenient general introduction, see Lynch and Preston [7].)

This changed information technology has created a new information retrieval environment in which the potential for information retrieval now extends far beyond the traditional form of a unitary retrieval system composed of one file and one retrieval mechanism. I use the term "extended retrieval" to denote this more general form of information retrieval. In this paper, I describe what I mean by extended retrieval and provide examples. Some technical consequences of the extension of information retrieval from a traditional, unitary form to an extended network environment will be noted.

2. Four information retrieval "architectures"

The generalization of information storage and retrieval beyond the traditional, unitary case of one file and one retrieval system to the more general model of multiple files and multiple retrieval systems can be expressed as four combinations:

1. Traditional, unitary retrieval systems with one file and one retrieval mechanism. An on-line library catalogue would be an example. A search on MELVYL, the on-line catalogue of the nine campuses of the University of California, for example, retrieves 266 records for books using the combination of subject keywords "science" and "Japan."

2. Multi-stage retrieval from a single file. Here the results of a search by one retrieval system on a file are subjected to additional retrieval operations by a second retrieval mechanism as "post-processing." At Berkeley, for example, an experimental system known as OASIS can be used to refine the results of MELVYL searches [5]. If the 266 MELVYL catalogue records for books on "science" and "Japan" from the previous example are downloaded into OASIS, additional processing can identify numerous subsets defined by date, by language, and by the libraries where copies are held (see table 1).

3. Retrieval from multiple files by a single mechanism. Here a single retrieval mechanism can search and derive records from two or more files simultaneously. An example is the "Onesearch" feature of the DIALOG retrieval service.

4. Retrieval using multiple files and multiple retrieval mechanisms. The more general case is when multiple files and multiple retrieval mechanisms are used. This is the logical consequence of the development of the new information technology environment: It is a networked environment in which many different files of resources and many different retrieval systems exist and are, in principle, widely accessible over the network.
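
The post-processing described in case 2 can be sketched in a few lines. This is a minimal illustration only, not the OASIS implementation: the field names (`year`, `language`, `campus`), the date bands, and the sample records are all assumptions made up for the example.

```python
from collections import Counter

# Hypothetical downloaded records; the field names are assumptions,
# not the actual MELVYL record format.
records = [
    {"title": "A", "year": 1991, "language": "Japanese", "campus": "Berkeley"},
    {"title": "B", "year": 1989, "language": "English", "campus": "Berkeley"},
    {"title": "C", "year": 1989, "language": "Japanese", "campus": "UCLA"},
    {"title": "D", "year": 1975, "language": "English", "campus": "Other"},
]

def date_band(year):
    """Assign a year to a coarse date band (the bands are illustrative)."""
    if year >= 1991:
        return "1991"
    if year >= 1987:
        return "1987-1990"
    return "before 1987"

def cross_tabulate(recs):
    """Second-stage processing: count records by (date band, campus, language)."""
    return Counter((date_band(r["year"]), r["campus"], r["language"]) for r in recs)

def subset(recs, **criteria):
    """Select the records whose fields equal all of the given criteria."""
    return [r for r in recs if all(r.get(k) == v for k, v in criteria.items())]

table = cross_tabulate(records)
japanese_at_berkeley = subset(records, language="Japanese", campus="Berkeley")
```

The first retrieval mechanism supplies the record set; the second only regroups and filters it, which is why the two stages can run on different machines.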

Note that we are assuming that these systems are heterogeneous. We are not talking about the relatively simple case of distributed database systems designed for compatible, distributed use. We do not and cannot assume that software, hardware, and data structures are standardized. We are concerned with retrieving resources that are related in their meaning rather than in their form, so the problems are those of information retrieval rather than data retrieval.

These four cases are summarized in table 2.

Table 1 Analysis of 266 records by language, date, and campus. Search request: "Science" and "Japan"

           At Berkeley                  At UCLA                      At other campuses
Date       Japanese  English    Other   Japanese  English    Other   Japanese  English    Other  Total
1991              3        1        0          0        0        0          1        0        0      5
1987-1990        17        9        1          8        2        0          6       11        0     54
1984-1986         3        8        0         10        4        0          7        8        0     40
1978-1983         3       10        2          7        6        1         10        7        0     46
1972-1977         0        8        1         15        5        0          7        5        0     41
1963-1971         0        7        0         14        7        0          6        4        0     38
1962              1        6        0         19       10        0          3        3        0     42
Total            27       49        4         73       34        1         40       38        0    266

Table 2 Unitary and extended retrieval

Retrieval mechanisms   Number of files
                       Single file                              Two or more files
One                    a. Unitary retrieval,                    c. Multiple file searching,
                          e.g. on-line library catalogue           e.g. DIALOG Onesearch
Two or more            b. Single file, multiple processing,     d. Fully extended retrieval
                          e.g. post-processing

3. Illustrations of extended retrieval

To illustrate some of the potential of extended retrieval, let us consider two kinds of data: bibliographic data and scientific data.

3.1 Bibliographic Data

A record in a library catalogue will include author, title, location (call number), and a subject heading (e.g. from the Library of Congress Subject Headings list). A record in a bibliography representing the same document will likewise include author and title, but may also include an abstract and a subject heading probably from a different list, such as Medical Subject Headings. A citation index would include, again, the author and title and the references from the document. These overlapping contents are shown in figure 1. What we have for the same document is three quite different bibliographic descriptions by different publishers, in different formats, and ordinarily searched on different retrieval systems. These three records contain:

- information that is the same, though possibly expressed differently and not necessarily recognizable as being the same, and
- some information not provided by the others, e.g. the catalogue has the location of a copy; the bibliography has an abstract; and the citation index shows references to and from the document.

Figure 1 Related bibliographic files

The relationship between the records retrieved from the different databases is that they all represent the same document. But this is only one of many possible relationships. Figure 1 also shows two further relationships:

- a book review index may include a record for a different but related document, a book review; and
- outside of the bibliography and library catalogue may be some object that the book is about.

In this way, one's knowledge can be significantly increased by extending one's search to two or more heterogeneous databases. However, although the various bibliographies may refer to a single document, there is no assurance that they will do so in a consistent way. The form and contents of records in bibliographies (like references at the end of papers) vary considerably. This is not normally a problem for human beings, who can recognize what is meant, but it is a serious problem for recognition by a computer.

Differences in subject description can be substantial and significant. Consider, for example, a searcher interested in coastal pollution. A search on "coastal pollution" in the Library of Congress Subject Headings in the University of California MELVYL on-line catalogue yielded nothing either as a phrase ("exact subject") or as a pair of subject terms ("subject keyword search"). Nor does either form of search yield anything in the MELVYL file (1988 to date) of the MEDLINE bibliography. Nevertheless, material on coastal pollution does exist in both, and some of it can be found by searching for documents that contain the words "coastal" and "pollution" in their title. Analysis of these records shows that the subject headings actually assigned to these documents include:

LCSH (MELVYL Catalogue)          MeSH (MELVYL MEDLINE)
Marine pollution                 Seawater
Coastal zone management          Water pollution
Water-Pollution                  Bacteria
Petroleum industry and trade     Water microbiology
Waste disposal in oceans         Water pollutants

Not only is the plausible phrase "coastal pollution" not used in either set of subject headings, even as a cross-reference, but there is remarkably little overlap in the terms that are used.
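
One way to picture the missing cross-references is as an "entry vocabulary" that maps a searcher's phrase to the headings each system actually uses. The sketch below is hypothetical: the mapping is hand-built from some of the headings listed above, and neither the `ENTRY_VOCABULARY` table nor the `translate` function is part of LCSH, MeSH, or MELVYL.

```python
# Hypothetical entry vocabulary, hand-built from headings observed above.
ENTRY_VOCABULARY = {
    "coastal pollution": {
        "LCSH": ["Marine pollution", "Coastal zone management", "Waste disposal in oceans"],
        "MeSH": ["Seawater", "Water pollution", "Water pollutants"],
    },
}

def translate(phrase, scheme):
    """Return the headings a given scheme uses for the searcher's phrase.

    Falls back to the phrase itself when no cross-reference is recorded.
    """
    entry = ENTRY_VOCABULARY.get(phrase.strip().lower(), {})
    return entry.get(scheme, [phrase])
```

The same phrase yields a different search in each system, for example `translate("coastal pollution", "LCSH")` against the catalogue and `translate("coastal pollution", "MeSH")` against MEDLINE.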

3.2 Science Data

Consider the range of different data that could be relevant and available for studying a geographical area such as Kyoto Prefecture or the Sacramento delta:

- Topographic: latitude, longitude, altitude
- Political map
- Satellite image
- Land-use map
- Gazetteer: place names
- Weather: temperature, precipitation, humidity, wind
- Textual documents
- Census and socio-economic statistics
- Photographs, etc.

Handling the retrieval of such diverse kinds of data from quite different sources is a major challenge.

4. Some technical issues

We use the phrase "extended retrieval" to refer to the extension of information retrieval to include search and retrieval in multiple files and/or using multiple retrieval mechanisms. In the new environment, a number of interesting problems arise and need to be resolved:

4.1 What Resources Exist?

The fact that many electronic resources exist in many places does not mean it is easy to identify or find them. Files stored on computers are just that: files that are stored. There is, as yet, little or no tradition of cataloguing computer files, so that they can be identified and found, as there is for cataloguing library books and museum objects. The task of developing "directories to the Internet" is not likely to be simple or inexpensive, but it is now receiving increasing attention. The question of identifying which resources one might wish to search is a bibliographic problem, although the describing of electronic resources is still undeveloped. However, there is also a question of which retrieval system to choose for any given search if there is a choice. For example, the catalogue records of the library of the Berkeley School of Library and Information Studies can be searched using four different retrieval systems. Different information retrieval systems have different retrieval capabilities. For a specialized search, one may need to select the retrieval mechanism as much as the resource to be searched. This implies a knowledge and understanding of the differing characteristics of different retrieval mechanisms available, with which resources they can be used, and how to use them, singly or in combination. This knowledge is inadequately developed, though Belkin and Croft [1] provide a useful review.

4.2 Search and Retrieve Protocols

It is now possible to access databases at remote sites as well as databases at one's local computer centre. This ordinarily requires establishing a telecommunications connection, obtaining a personal account and password, and using an unfamiliar command language (as in figure 2a). This is inconvenient and requires expertise. A significant new development is the creation of national (e.g. US NISO Z39.50) and international (ISO 10162/10163) standards for computer-to-computer "search and retrieve." The adoption of these standards will enable one to delegate to one's local retrieval system the extension of a search to some other, different retrieval system. The Search and Retrieve standards translate searches from one retrieval command language to another [3, 4]. This development started among librarians to enable convenient access to each other's catalogues, but it has wider application. The effect is shown in figure 2b.

Figure 2 "Search and retrieve" (Z39.50) protocol (a) A user can connect with various on-line catalogues and must know how to use each. (b) With the "Search and retrieve" (Z39.50) protocol, the user need only know how to use the local catalogue and to instruct it to extend a search to other systems.

Because different retrieval systems have different capabilities, one could do more than simply extend a search to another database. For example, suppose that the on-line catalogue at library A does not support searching for individual words in titles, but that the on-line catalogue of library B does. A title keyword search desired at A could, instead, be performed on the online catalogue at B. The records of any books found at B could then be transferred back to A and, with the benefit of full descriptions, the catalogue at A could be searched to see if they are also held at A. Any such books also found to be held at A would provide the effect of a title word search - admittedly probably incomplete - even though the on-line catalogue at A did not support title word searching. The point is that specialized retrieval capabilities available on a remote machine but not available on a local machine could, within limits, be used to enhance local searching.
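
The library A and library B scenario can be sketched as follows. This is a toy model under stated assumptions: the two catalogues are in-memory lists with invented records, and the function names are hypothetical. A real system would issue these searches over a search and retrieve protocol rather than call local functions.

```python
# Hypothetical catalogues; each record is a dict with "author" and "title".
CATALOGUE_A = [
    {"author": "Sato, K.", "title": "Science policy in Japan"},
    {"author": "Lee, M.", "title": "Coastal ecology"},
]
CATALOGUE_B = [
    {"author": "Sato, K.", "title": "Science policy in Japan"},
    {"author": "Ito, H.", "title": "Japan and modern science"},
]

def title_keyword_search(catalogue, word):
    """Capability assumed to exist only at library B: match a word within titles."""
    w = word.lower()
    return [r for r in catalogue if w in r["title"].lower().split()]

def exact_lookup(catalogue, author, title):
    """Capability assumed at library A: lookup by full author and title."""
    return [r for r in catalogue if r["author"] == author and r["title"] == title]

def title_word_search_via_b(word):
    """Emulate a title word search at A by searching B, then checking A's holdings."""
    held_at_a = []
    for rec in title_keyword_search(CATALOGUE_B, word):
        held_at_a.extend(exact_lookup(CATALOGUE_A, rec["author"], rec["title"]))
    return held_at_a
```

In this toy data a search on "ecology" returns nothing, although A holds "Coastal ecology," because B does not hold that book. This is the incompleteness admitted above.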

The idea of a "knowledge robot" or "knowbot" that could be sent off into the networks searching for and retrieving information on any specified topic has aroused interest. The essence of a "knowbot" is the idea of a conditional search command. A searcher at A might send a search command in the following form: Search in Resource B for data with attribute X. If found, retrieve it and transport it to A; if not found, extend the search to Resource C, and so on.
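
The conditional search command can be written down directly. The sketch below is an assumption-laden illustration, not a description of any actual knowbot software: resources are modelled as in-memory lists, and `conditional_search` is a hypothetical name.

```python
def conditional_search(resources, attribute):
    """Try each resource in order; return (name, matches) for the first with any match.

    `resources` is an ordered list of (name, records) pairs standing in for
    remote systems B, C, and so on. Returns (None, []) if nothing is found.
    """
    for name, records in resources:
        matches = [r for r in records if attribute in r["attributes"]]
        if matches:
            # Found: stop here and "transport" the results back to A.
            return name, matches
    # Not found anywhere: the search has been extended as far as it can go.
    return None, []

resources = [
    ("B", [{"id": 1, "attributes": {"Y"}}]),
    ("C", [{"id": 2, "attributes": {"X", "Y"}}]),
]
source, found = conditional_search(resources, "X")
```

Here Resource B has nothing with attribute X, so the search is extended to Resource C, which supplies the result.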

4.3 Questions of Relatedness

Extended retrieval among heterogeneous resources raises difficult questions of relatedness. If a name found in one database is similar to a name in another database, are they variant forms of the name of the same person? If the names look the same, could they still refer to different people? The same problem arises with records in bibliographic databases that may or may not represent the same document [2]. More generally, in extended retrieval in heterogeneous databases, one is concerned with the retrieval of related material but the nature of the relationship may be difficult to define or to determine.
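
A small sketch shows why such questions are hard for a computer. The matching rule below (surname plus first initial) is an illustrative assumption chosen for its simplicity, not a recommended or published method, and the names in the usage note are examples only.

```python
import re

def name_key(name):
    """Reduce 'Surname, Forenames' to a crude key: surname plus first initial.

    This is an illustrative heuristic only; it both over- and under-matches.
    """
    surname, _, forenames = name.partition(",")
    surname = re.sub(r"[^a-z]", "", surname.lower())
    initials = re.findall(r"[a-z]", forenames.lower())
    first_initial = initials[0] if initials else ""
    return surname + ":" + first_initial

def possibly_same_person(a, b):
    """True if the two name strings share a key, i.e. are candidates for a match."""
    return name_key(a) == name_key(b)
```

Under this rule "Buckland, Michael K." and "Buckland, M.K." are matched, as one would hope, but so are "Buckland, M.K." and "Buckland, Mary," who may be different people. A shared key identifies candidates for relatedness; it does not establish it.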

4.4 Anatomy of the Retrieval Process

Retrieval in a unitary retrieval system is easily viewed as an event rather than a process. When considering information retrieval in a networked environment, one might think in terms of local "client" search machines and remote resource "servers," which implies a distinction between a search and retrieve mechanism and a resource file. This distinction may exist when a retrieval system is built to retrieve from two or more files. But information retrieval is, in practice, a complex process including several different components. The question arises of how the retrieval process could be optimally divided into different stages on different machines. In fact, when one analyses the individual elements of the retrieval process, considerable complexity and choice emerge. For example, with bibliographic searches:

- different bibliographies represent different, more or less overlapping, populations of documents;
- different bibliographies will have more or less different descriptions, even for the same document;
- the access points ("indexes") that can be searched vary between systems;
- there can be more or less cross-referencing between different index terms ("see," "see also," and other kinds of syndetic relationships); and
- different retrieval systems support different types of searching (matching, comparing), even in the same bibliographic files. Some allow searches for keywords and/or composite Boolean search requests and others do not.

So, correspondingly, one can immediately identify five different classes of reasons for extending searches to two or more on-line bibliographies and/or on-line catalogues. Depending on the circumstances, different options might be chosen:

1. Because different bibliographies represent different populations of documents, one may want to extend a search to another bibliography or catalogue because what was desired was not found in what had already been searched and it would be desirable to extend the search to a new population of documents.

2. Because different bibliographies contain different descriptions for the same document, one may want to extend a search to another bibliography or catalogue for a document that has already been found because differing bibliographic descriptions can be used to accumulate additional information. As noted above, a book might be present in a library catalogue, in a subject bibliography, and in a citation index. All three will have more or less differing bibliographic records for the same document: The catalogue will have a standard catalogue record and a note of the location of each copy; the medical bibliography may contribute a different, detailed subject description and an abstract; and a citation index might contribute a list of other works cited in it and another list of works that cite it. Combining these descriptions could improve bibliographic access substantially.

3. For a more complete search, one may want to extend a search to two or more other systems in order to use additional access points. Citations have to be searched in a citation index. The ability to search on other features, such as searching by words within a title or within an abstract, or by the language or date of the document, varies significantly between systems.

4. Because of the complexity and vagueness of language, one may prefer using the system that has the best network of cross-references, the best "vocabulary control," to guide the searcher from the searcher's terms (e.g. "coastal pollution") to the system's terms.

5. It may be worthwhile to extend a search to another system because it has special searching abilities, such as identifying pairs of words that occur close to each other, or to extend the search by downloading results into a personal computer for more detailed analysis ("postprocessing").
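
The second option, accumulating differing descriptions of one document, can be sketched as a simple merge. The sketch assumes that the three records have already been identified as describing the same document, which is itself the difficult problem of relatedness. The field names and sample values are invented for illustration.

```python
def merge_descriptions(sources):
    """Accumulate every distinct value per field, remembering which source gave it.

    `sources` maps a source name to that source's record (a dict of fields).
    """
    merged = {}
    for source_name, record in sources.items():
        for field, value in record.items():
            values = merged.setdefault(field, [])
            # Keep a value only once, attributed to the first source that gave it.
            if all(v != value for v, _ in values):
                values.append((value, source_name))
    return merged

# Hypothetical records for one document from three heterogeneous files.
combined = merge_descriptions({
    "catalogue": {"title": "Example title", "subject": "Marine pollution",
                  "location": "Main library"},
    "bibliography": {"title": "Example title", "subject": "Water pollution",
                     "abstract": "An example abstract."},
    "citation_index": {"title": "Example title", "cites": ["Reference 1"]},
})
```

The merged record keeps both subject descriptions side by side rather than choosing between them, since the differing descriptions are exactly what the extended search was meant to collect.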

There are other possibilities. For example, the extent to which texts are subdivided into fields can affect retrieval performance [6].

We can observe that information retrieval theory remains significantly incomplete, even for unitary retrieval systems, until the effects of changes in any one or more of these variables on retrieval outcomes are properly understood.

5. Conclusion

Automating a catalogue, placing a bibliography on-line, or providing access to any other electronic resource on-line is a substantial technological development. But to think only of an individual on-line catalogue, of an individual on-line bibliography, or of any other individual resource - even being aware that there are several different on-line catalogues, numerous individual bibliographies available on-line, and many other resources on-line - is to think in terms of the card catalogues, published paper bibliographies, and the unitary retrieval systems of the past. Instead of thinking of individual retrieval systems, we should now base our thinking on the awareness that there are large and growing populations of electronic resources and of retrieval mechanisms, increasingly connected by telecommunications networks, and containing data sets that can, in principle, be linked, combined, and rearranged. What could happen if, instead of thinking of information retrieval in traditional, unitary terms, as when using a bibliography on-line, we were to follow the logic of electronic technology one step further and think instead of a collectivity of bibliographies on-line? We need to think in terms of using a whole electronic reference library, even multiple libraries, on-line.

Extended retrieval provides much wider opportunities, but it also increases the difficulties in selecting what is needed from so large and complex a universe.

References

1. Belkin, N.J., and W.B. Croft (1987). "Retrieval Techniques." Annual Review of Information Science and Technology 22: 109-145.

2. Buckland, M.K., A. Hindle, and P.M. Walker (1975). "Methodological Problems in Assessing the Overlap between Bibliographic Files and Library Holdings." Information Processing and Management 11: 89-105.

3. Buckland, M.K., and C.A. Lynch (1987). "The Linked Systems Protocol and the Future of Bibliographic Networks and Systems." Information Technology and Libraries 6: 83-88.

4. Buckland, M.K., and C.A. Lynch (1988). "National and International Implications of the Linked Systems Protocol for Online Bibliographic Systems." Cataloging and Classification Quarterly 8: 15-33.

5. Buckland, M.K., B.A. Norgard, and C. Plaunt (1992). "Design for an Adaptive Library Catalog." In: Networks, Telecommunications, and the Networked Revolution: Proceedings of the ASIS 1992 Mid-Year Meeting May 27-30, 1992. Silver Spring, Md.: American Society for Information Science, pp. 165-171.

6. Lynch, C.A. (1992). "Online Searching on the Internet: The Challenge of Information Semantics for Networked Information." In: Proceedings of the National Online Meeting, New York, May 5-7, 1992. Medford, NJ: Learned Information. Forthcoming.

7. Lynch, C.A., and C.M. Preston (1990). "Internet Access to Information Resources." Annual Review of Information Science and Technology 25: 263-312.

8. Stonebraker, M., and J. Dozier (1991). Large Capacity Object Servers to Support Global Change Research. SEQUOIA 2000 Report 91/1. Berkeley, Calif.: University of California, Electronic Research Laboratory.

Information retrieval: Theory, experiment, and operational systems


Abstract
1. Scientific communication and information retrieval
2. Anomalous states of knowledge
3. Relevance
4. Early experiments in IR
5. Language
6. Boolean logic, search strategy, and intermediaries
7. Associative methods
8. Probabilistic models
9. Information-seeking behaviour
10. Intelligence
References


Stephen E. Robertson

Abstract

The paper examines the process of scientific communication resulting from users' expressed needs for information and in particular the formal mechanisms for the storage and retrieval of information in response to queries or requests. Formal indexing or coding schemes, Boolean systems, facet analysis, and associative methods, as well as probabilistic models, are reviewed and information-seeking behaviour is discussed.

1. Scientific communication and information retrieval

It is a commonplace that science depends on communication. Science is a social activity; scientists' ideas, models, and results have to be scrutinized by their peers, analysed and tested with the possibility of validation or refutation as well as the construction of further science.

In order to investigate the role of information retrieval and the effect of developing technologies on scientific communication, we may start from a simplified view of the process of scientific communication (see the chart below).

Such a diagram is deceptive both in its simplicity and in its circularity. As a publishing scientist, I am clearly not communicating with myself! My potential audience will be not only other scientists (who may indeed feed new publications into the process), but also other users of scientific information, e.g. those who apply the knowledge. The diagram also suggests a system with just one communication channel, and furthermore seems to imply a system that always works! Neither is in general the case.

The process of scientific communication

We in the information world tend to work with systems (fragments or subsystems of this larger process) that are supposed to contribute to the whole by providing certain linking mechanisms. By and large we work with relatively formal mechanisms (publication, libraries, databases, etc.); we like to think that they are vital to the whole. However, we also know that scientists rely extensively on less formal mechanisms (personal contacts, meetings, etc.). Furthermore, from the scientist's-eye view, there are many sources/channels of information that may be selected or rejected at different times, for all sorts of reasons. One of our concerns in developing a science of information must be the scientist's perception of the information environment, and the selection and use made of sources, channels, and modes of both obtaining information and communicating ideas.

The concern in the present paper is mainly with formal mechanisms for the storage and retrieval of information, in response to queries or requests. But the wider aspects of the communication process will be kept in view. I start with a perspective on the situation that gives rise to the request; at the end, I will return to some features of human information-seeking behaviour.

2. Anomalous states of knowledge

We must first of all ask the question, Why does a user (scientist) approach an information retrieval system? The simple answer must be because of a need or wish or imagined need for information. However, the user's perception of this information need deserves some exploration.

In Taylor's classic paper, "Question-Negotiation and Information Seeking in Libraries" [14], four stages are identified:

(1) the visceral need (i.e. the user's gut feeling of a need for information);
(2) the verbalized need (the user's first attempt to put the information requirement into words);
(3) the formalized need (the user's expression of the requirement in terms acceptable to the system);
(4) the compromised need (the user's revised expression of the requirement after negotiation with the system).

The last two stages relate to the user's interaction with the system, which is discussed later.

Belkin has further analysed the origins of the visceral need. A user has a state of knowledge of the world, an internal knowledge structure of great complexity. The perception of an information need arises from a perceived problem with some part of this knowledge structure (which may not be a simple gap but some internal inconsistency, conflict with evidence, or whatever). Belkin has called this the "anomalous state of knowledge," or ASK [3].

The ASK hypothesis potentially has strong consequences for the design of information retrieval systems. Most systems in effect demand that the users specify the piece of information that they require, and aim to provide the items that fit the specification. The ASK hypothesis suggests instead a problem-solving approach, where the system cooperates with the user in an attempt to solve the perceived problem (or resolve the anomaly).

Although some of the ideas discussed below predate the ASK hypothesis and involve a rather more traditional approach to IR system design, the ASK idea will inform my discussion throughout. Something like the problem-solving approach will recur in later sections.

3. Relevance

One of the most important concepts for an understanding of our present ideas of information retrieval is that of relevance. Relevance is an extremely difficult idea to pin down and has been the subject of much work over a large number of years [12]. It originates from well before the ASK hypothesis, and one way to think of it might be in terms of correctness. An item might be regarded as relevant to a request if it is in some sense a "correct" response to that request. There are indeed IR theorists who take a modern version of that view: an item is relevant to a request if the request can be inferred from the item, rather in the fashion of theorem proving, but with an appropriate logic [15].

However, probably the dominant view of relevance (and one that is rather more in sympathy with the ASK hypothesis) is a much more subjective one. An extreme version of the subjective approach would be to say simply that an item is relevant to a user's information need or ASK if the user says it is, or in other words if the user would like the system to retrieve this item. More commonly, in our experiments we rely where we can on end-users making relevance judgements in relation to their perceived needs, according to some descriptive scale that we devise; we also do some laboratory experiments with expert judges judging relevance to stated requests.

Even without agreement as to what exactly relevance is, the idea of relevance is of central importance to theory and experiment in IR, and it is becoming important to IR practice as well. From the experimental point of view, where the concept originated, we need it in order to evaluate different approaches and methods in IR. From there it has fed into theory; many recent theories in IR depend upon it. Finally, some methods of IR ask the user to make relevance judgements on-line and use that information internally as one kind of clue to help formulate a new search.

The major assumption is that users can make relevance judgements or recognize relevant items, even if not necessarily with absolute confidence. It is hard to imagine an information-seeking activity where the user was in principle unable to assess whether an item is appropriate or not, though of course users may suspend judgement or change their minds in particular cases, perhaps until or after they have more information about the item or have read some other item.

4. Early experiments in IR

The idea of the experimental evaluation of IR systems is central to both theory and operational system development. Perhaps surprisingly, this idea is only about 35 years old. (Admittedly, 35 years is a long time in the history of computer-based IR systems; but some kinds of IR system, such as library classification schemes, predate computers by at least two-and-a-half millennia!)

We are not so much concerned here with whether the system works in a technical or physical sense as with something that might now be described as its cognitive functioning. In other words, the question as to whether the system will succeed in locating items with specific characteristics (words, codes) is not generally at issue. The question is, Do those items with the specific characteristics actually serve the information need or resolve the ASK? This will depend, in general, on the means the system offers for specifying characteristics. In the earliest experiments, the question took the form: Does the system retrieve the "correct" answer in response to a query? Setting aside the problem of ASKs and relevance, the implied model of the IR system was what might be described as an input-output model - feed in the query, get out the answer. It was a model that fit well with the early computer-based systems. (Actually, they would very likely be human-assisted; the searcher would send the query off to a library or information centre, where an expert would formulate it in system terms, run the search, and return the results.) In retrospect, however, it seems like a temporary aberration. Both older systems (card catalogues, printed indexes, etc.) and newer ones (highly interactive on-line systems) exhibit characteristics that do not fit too well with the input-output model, particularly if the searcher is the end-user, that is, the person needing the information.

The early experiments told us a little about the design of IR systems, but they also focused attention in certain areas, and it may be argued that their lasting influence lies in this focusing process. One area is the one already mentioned, that of relevance: the necessity to devise an operational definition of a "correct" answer was a major stimulus to the reconsideration of the notion of relevance. A second kind of focus was on the particular aspects of IR system design that seemed important. A number of such aspects that had been endlessly debated in the 1940s and 1950s now seemed to be of relatively minor importance; by contrast, some aspects that had received little consideration now became central. One of the later outcomes of this process, as we shall see, has been the concern with highly interactive systems.

