J.-P. Habicht, John B. Mason, and H. Tabatabai
Evaluators often face restrictive conditions; for example, they may be called on to evaluate the effects of a programme some years after the programme has begun. Often there are negligible baseline data and no control groups. Much of this and the next chapter is therefore written from the perspective of the unfortunate reality of having to evaluate an ongoing programme in the midst of implementation. It is hoped, however, that evaluations may be better planned so as to relax many of the constraints often encountered.
Present practice is to collect data on programme participants, and possibly non-participants, and then to use statistical manipulations to investigate associations between programme delivery and outcome variables. These methods tend to be expensive and may be difficult to apply in developing countries as routine procedures; moreover, consideration of the questions that should be addressed before applying such methods reveals that they often turn out to be unnecessary. We therefore propose procedures that can be widely applied to a range of programmes in operation.
When a nutrition intervention programme is proposed, the following sequence of questions (adapted from ref. 1) needs to be addressed to establish whether the intervention can in principle affect the performance, health, and survival of individuals - that is, whether the intervention is in fact justified.
Before expanding further on proposed procedures for the evaluation of ongoing programmes, it is worthwhile to introduce some fundamental terminology in this context. Specifically, the literature differentiates "(basic) research," which is done to ascertain basic scientific facts independently of their applications to programmes, from "evaluative research," which is done to assign a probability statement of causality to the relationship between an intervention in a community context and the observed nutritional and health impacts, in order to determine the viability and replicability of a given programme design and plan of operation, and from "operational programme evaluation," which ascertains whether an ongoing programme is attaining its objectives (see, e.g., 2, 3). Operational programme evaluation is the subject of this and the next chapter.
The evaluations of ongoing programmes considered here do not attempt to answer questions concerning the scientific basis for interventions (e.g. questions 1 to 4 above). These are best tackled through experimental research. However, unless these have been addressed in designing the intervention, the evaluation may be pointless. Similarly, the logic for performing evaluative research to determine whether a pilot intervention or demonstration project can successfully apply scientific principles in a field setting must be borne in mind before a large-scale project is undertaken and an operational programme considered (e.g., questions 5 to 8 are examples of areas of evaluative research).
In practice there is no clear distinction between research and evaluation in the methods used. In fact, many of the researcher's tools can be used to answer questions relevant to evaluation. There is, however, a major difference in the kind of questions being addressed, and in the appropriate combination of methods to address each question. It is a grievous mistake to call research "more scientific" than evaluation. Good science is fitting the appropriate methods to seek an answer; in this sense, a good evaluation is as scientific as good research is.
Unfortunately, most works on operational programme evaluation in nutrition have been misdirected in emphasizing the researcher's concern to substantiate the probability of causality while neglecting the other important questions that managers, administrators and funders ask. This misdirected research methodology often neglects the fact that operational programme evaluation can be divided into (a) "summative evaluation," which examines the outcome of a programme, and (b) "formative evaluation," which monitors procedures and activities so as to improve the programme design and the delivery of services. We refer to the former as outcome or impact evaluation and to the latter as process evaluation. The document by Sahn and Pestronk (3) provides a careful review of the theoretical literature in health and human services about programme evaluation, and presents abstracts of many evaluations of nutrition intervention field programmes - clearly differentiating process from impact evaluation and evaluative research from operational programme evaluation. Other evaluative research studies of nutritional programmes are reviewed by Habicht and Butz (4).
This chapter discusses a number of basic concepts needed for designing evaluations appropriate to the decisions to be taken based on their findings. These concepts are intended to provide the basis for the practical stages in evaluation given in chapter 2.
Useful evaluations entail activities that apparently are not necessary to manage the logistics of a programme, and hence incur additional costs. Riecken has said, "... my experience with evaluation is that there are few bargains, and usually you get no more than you pay for," and "When an evaluation is cheap and quick, it is often also not very good" (5). Without entering into this debate, we can nevertheless learn something from the actual levels of expenditure on evaluation. On the one hand, the World Bank review of monitoring and evaluation of projects in East Africa in 1979, for example, gives a figure of US $12.8 million for monitoring and evaluation of 28 projects, averaging US $460,000 each, or somewhere around 0.5 to 5 per cent of total project costs (6). The US Government estimates that 1 per cent should be used to evaluate programmes in health. On the other hand, Kielmann et al. consider that "... it should be neither uncommon nor unreasonable to budget 20 per cent to 40 per cent of total project costs for analysis of project-generated data and project evaluation" (7). Evidently, the scale of expenditure of the project itself has some influence on these calculations. Similarly, the purpose of the evaluation of the project is important - while relatively high expenditures may be essential for pilot or experimental projects, it seems unlikely that more than perhaps 5 per cent of project costs would be made available for routine evaluations of large-scale service projects. Such expenditures would often be sufficient to allow useful evaluation, if it is recognized that causality may not be established.
Ultimately, determining the percentage of programme expenditures spent on evaluation is a subjective judgement, as it depends upon which costs are assigned to the delivery of programme services and which to the distinct effort labelled evaluation. In fact, the best evaluation may appear to cost nothing, as it would be an integral component of programme design and implementation.
The question "How do we tell if a programme has an effect?" is incomplete without knowing why one needs to know. Common reasons are: - to decide whether to continue the existing programme or not, - to redesign the programme if necessary, or - to decide whether to do similar programmes elsewhere.
Those involved in the programme often have different expectations about the purpose and results of an evaluation. It is important that the decisions which will be made using evaluation findings be clearly understood and agreed on. The evaluators must then tailor not only the design of the evaluation to the purpose of the evaluation but also the presentation of the results to that purpose. For instance, results presented as if the purpose were to decide on the continuation or termination of a programme are inappropriate if the purpose of the evaluation is to improve the programme. Evaluation cannot be seen in isolation from who asks the question. It is not so much that the principles and practice of evaluation of ongoing programmes are unsatisfactory but that the whole decision-making process in nutrition and food aid programmes needs improvement.
The purposes and issues addressed in evaluation depend on who is asking the questions. The following sequence of basic issues to be addressed is of particular interest to different audiences:
- Is the intervention performing as expected? (Programme managers, administrators, and funders)
- Is the intervention worth continuing? (Administrators and funders)
- Should it be extended? (Administrators and funders)
- Is it causally linked to improved nutrition? (Researchers, scientists, and others concerned with basic mechanisms of cause and effect)
This sequence begins with considering whether the programme is performing adequately, and can progress to seeking to ascribe causality between the intervention and the outcome. The sequence approximates the changing concerns of project management and administrators and the researcher's concern with causality. Causality, if it can be shown, is also important to all aspects of management, programme design, and policy; however, it is difficult and expensive to establish, and the more that certainty about causality can be dispensed with, the easier the evaluation becomes. Project management can often, in fact, get by with the knowledge that the beneficiaries are improving, even if they cannot be sure this is due to the programme.
Part of the information needed to address questions such as those given above can be obtained by evaluating project design and from process data. Moreover, these data can be used to screen out those projects that are unlikely to have any important effect on outcome, and thus are not worth evaluating further. This procedure is set out in subsequent sections. Other decisions required in establishing purposes of an evaluation centre on the degree of certainty required in linking outcome to programme delivery, and these need to be explained in more detail at this stage.
Different purposes of evaluation demand varying degrees of plausibility or certainty for the conclusions reached from the evaluation. The purposes, in the order of increasing need for higher levels of certainty (elaborating from the sequence of questions just given) are:
The methodological and data requirements of responding to the differing needs of these purposes for certainty and plausibility entail, in order of increasing expense and difficulty:
It would be useful to consider how these two lists could be matched. Each item in the first list is taken up individually. (This discussion is summarized in table 1.1, Appropriate Data Collection and Analysis for Different Decisions.)
The confidence with which the conclusions in each of the above cases is reached can be considerably improved with strong theory relying on good scientific evidence from elsewhere.
Process evaluation demands formulating implementation and performance objectives against which the programme can be evaluated. For the manager's questions (e.g. is the programme performing as expected?) and usually for the funder-administrator's questions (e.g. is the programme worth continuing, extending, etc.?), the comparison is between the procedures and activities of the programme and some preset standards, generally set out in the programme work-plan. The first prerequisite is, therefore, that the essential activities be stated in objectively measurable units. This is possible even for such an amorphous exercise as curative primary health care (1). Actual performance relative to these standards is ascertained through process evaluation.
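As a minimal illustration of stating activities in objectively measurable units and comparing them with preset standards, the sketch below (in Python) checks hypothetical achieved activity levels against invented work-plan targets and flags shortfalls; the activities, target levels, and 80 per cent threshold are not taken from any real programme.

```python
# A minimal sketch of process evaluation against preset work-plan targets.
# All activities, target levels, and the 80 per cent threshold below are
# hypothetical, purely for illustration.

# Work-plan standards: activity -> planned level for the reporting period
targets = {
    "growth-monitoring sessions held": 240,
    "children weighed": 6000,
    "food rations distributed (kg)": 18000,
}

# Levels actually achieved, as recorded by the programme's own reporting
achieved = {
    "growth-monitoring sessions held": 190,
    "children weighed": 5100,
    "food rations distributed (kg)": 17500,
}

THRESHOLD = 0.80  # flag any activity below 80 per cent of its target

for activity, planned in targets.items():
    ratio = achieved[activity] / planned
    status = "on track" if ratio >= THRESHOLD else "SHORTFALL"
    print(f"{activity}: {achieved[activity]}/{planned} ({ratio:.0%}) - {status}")
```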
A requirement for outcome evaluation is to establish objectives prior to assessment. These must be explicitly formulated as an acceptable difference from a standard, or as a minimum improvement from some baseline. These quantitative standards of achievement should correspond to the implicit objectives of the programme and should be understood and agreed on by those who must use the results of the evaluation. Experience shows that the exercise of making stated and implicit objectives more explicit will often reveal hidden objectives, some of which are even contradictory. This is why a consensus about the programme objectives is one of the necessary first steps to an evaluation.
Almost inevitably, programme objectives change as a programme evolves. However, changing definition of objectives during evaluation of single projects should be avoided because it is rare that the design of the evaluation can deal with new objectives. For example, a recent review of supplementary feeding programmes discussed whether the more important effects of these programmes were in terms of income distribution since the supposed objectives of improving child nutrition were seldom reached (8). However, no comparison was made with quite different programmes that might be more efficient in changing income distribution. While this may be a reasonable question in general, changing objectives for an individual programme requires more fundamental decisions.
Once the underlying outcome is identified conceptually, the next step is to identify the measurable variables related to the outcome of concern. The major portion of this book discusses that step, relating desired outcome (e.g. improved nutrition) to a measured variable (e.g. anthropometry). Subsequent chapters develop the relationship between the conceptual outcome and the measurements more fully.
Finally, the statistical test used to judge the reality of a measured difference (either between treatment and control groups, or between treatment and a standard) results in a statement that most of the time (usually specified as 95 to 99 per cent of the time) such a measured difference will be found if the true difference is not smaller than some quantity. In designing an evaluation one must further state how often one is willing to miss identifying a true difference of more than a specified magnitude. This statement refers to power analyses (see, for example, [9]). These steps of specifying procedural and impact objectives, translating those objectives into measurable variables, specifying the minimum or maximum acceptable difference in that variable, and doing the power analysis are prerequisites for any quantified evaluation.
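As a sketch of the last two steps, the following Python fragment applies the usual normal-approximation formula for comparing two group means to estimate how many subjects are needed per group; the minimum difference, standard deviation, significance level, and power used here are illustrative assumptions rather than values from the text.

```python
# A minimal sketch of the sample-size/power calculation described above:
# how many subjects per group are needed to detect a minimum true difference
# 'delta' in mean outcome at significance level 'alpha', while missing a
# difference of that size no more often than (1 - power) of the time.
# All numerical values are illustrative assumptions.
from scipy.stats import norm

def sample_size_per_group(delta, sd, alpha=0.05, power=0.80):
    """Normal-approximation sample size for a two-sided comparison of the
    means of two independent groups of equal size."""
    z_alpha = norm.ppf(1 - alpha / 2)  # controls the type I error (false positive)
    z_beta = norm.ppf(power)           # controls the type II error (missed difference)
    return 2 * ((z_alpha + z_beta) * sd / delta) ** 2

# Example: detect a 0.25 z-score difference in mean weight-for-age,
# assuming a standard deviation of 1.1 z-scores, at 95% confidence and 80% power.
n = sample_size_per_group(delta=0.25, sd=1.1)
print(f"About {n:.0f} children are needed in each group")
```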
The sad fact is that the research giving scientific justification to a programme is often so lacking that these steps are impossible. Experiments in the precise setting of the proposed programme may not always be needed (or possible). However, there needs to be a marshalling of the evidence from previous evaluations, experiments, and scientific knowledge, to serve as a basis for designing a relevant evaluation. Unfortunately, this is all too seldom done.
Investigating causality involves exploring whether there is a link between programme activities and outcome. The logical sequence of questions in evaluating whether a measured outcome is plausibly caused by programme activities may be summarized as follows:
- Is the outcome adequate? (Of concern to programme managers, administrators, and funders.)
- Is there a statistical association between intervention and outcome? (Of concern to administrators and funders and to researchers.)
- Is the outcome due to the intervention? (Of concern to administrators and funders and to researchers.)
If there is no statistical association or if the outcome is inadequate:
- Were the statistical methods used correct?
- Was the intervention relevant and adequate?
- Were the measurements of outcome valid and reliable?
- Were the recipients likely to benefit?
- Was the sample size adequate?
- Was there negative confounding?
If there is statistical association and the outcome is adequate:
- Is the association likely to be causal, i.e. can confounding be ruled out? (See table 1.2 on internal validity.)
- What was the direction of the causality?
- What mechanisms linked the intervention to the outcome?
- Can the findings be extrapolated to the population as a whole (i.e. do they have external validity)?
- What was the cost of the effect and of the marginal effect?
TABLE 1.2. Major Causes of Confounding (Threats to Internal Validity)
Cause of confounding | Description | Example
1. Selection | When the assignment of subjects to treatment and comparison groups is not random, the groups may differ systematically in some characteristic(s) associated with the outcome variable, and selection bias is therefore likely to be present. Self-selection is a common source of this type of bias. | Mothers who choose to participate in a programme to reduce the incidence of low birthweight may tend to be more educated, richer, and more motivated than those who do not. These factors influence the outcome and compete with the programme as an explanation for an observed reduction in the incidence of low birthweight.
2. Maturation | Human subjects mature over time, and this process may cause changes in the outcome variable irrespective of programme effects. | The nutritional status of 6- to 24-month-old children is often worse than that of older preschoolers. If the average age of participants in a nutrition programme increases from, say, 18 to 36 months, observed improvement may be due to maturation and would have occurred without the programme.
3. History | When a programme is in effect, many other events may intervene and influence the outcome variable. When these historical events have different impacts on the treatment and comparison groups, they confound the programme effects. | A supplementary feeding programme is introduced in one of two otherwise equivalent areas. If food prices rise at different rates in the two areas, observed differences in nutritional status may not be attributed solely to the feeding programme. The differential price rises may have influenced the outcome.
4. Instrumentation | A threat may arise from changes in the way measurements are made or in what is measured, or from measurement errors due, for example, to instrument decay. | The height-for-age of preschool children is often compared across age groups. Since for infants from birth to two years old it is usually length that is measured, rather than height as for children older than two, comparisons between the two groups may be biased.
5. Regression artifact | If subjects are chosen on the basis of exhibiting an extreme value on some variable (e.g. wasting), there may be improvement over time without any intervention. This tendency is called regression toward the mean. The solution is either to observe the effect on the whole population or to make comparisons within the selected extreme group (e.g. the malnourished). | In a nutrition programme instituted for the malnourished, improvement may be shown in that some of the participants are no longer malnourished at the end of the programme; but part of this improvement may not be due to the programme, since some subjects would have improved anyway.
6. Experimental mortality | Some subjects may drop out of a programme during the course of its implementation. If these subjects have different characteristics than those who remain, any before/after effect shown may be confounded by differences in the populations at the beginning and end of the programme. | A food-for-work programme may not lead to an improvement in the nutritional status of a community even if it has in fact been effective. This could happen if enough of the participants who improve leave the community in search of jobs elsewhere. The observed change here underestimates the impact of the programme.
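The regression artifact (item 5 in table 1.2) can be demonstrated with a simple simulation. The sketch below, in Python with entirely invented numbers, takes two measurements on the same children without any intervention; the subgroup selected for appearing malnourished at baseline nevertheless "improves" on average at follow-up, while the population as a whole does not change.

```python
# A minimal simulation of the regression artifact (table 1.2, item 5).
# No intervention is applied; the apparent improvement among children
# selected for appearing malnourished arises purely from measurement and
# short-term biological variability. All numbers are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_children = 10_000

true_status = rng.normal(-1.0, 1.0, n_children)  # underlying weight-for-age (z-scores)
noise_sd = 0.5                                   # measurement + short-term variation

baseline = true_status + rng.normal(0, noise_sd, n_children)  # first survey
followup = true_status + rng.normal(0, noise_sd, n_children)  # later survey, no programme

selected = baseline < -2.0  # children "enrolled" because they appeared malnourished

print(f"Selected children, mean z-score at baseline:  {baseline[selected].mean():.2f}")
print(f"Selected children, mean z-score at follow-up: {followup[selected].mean():.2f}")
print(f"Whole population, mean change:                {(followup - baseline).mean():+.2f}")
```

This is why, as the table notes, comparisons should be made on the whole population or within the selected extreme group.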
The basic question is whether there is a statistical association between the putative cause (the intervention) and the outcome or effect (e.g. improving nutritional status). Seeking associations is useful only when one needs some level of certainty that the programme is causally related to the outcome. Showing an association requires comparison of measured outcome in at least two groups that receive different intensities of programme intervention. This may mean comparing two groups such as control and treatment, showing correlations between different levels of programme delivery and outcome, estimating regression coefficients between programme activities and outcome, statistically controlling for other influences, and so on. Controlling for influences on the outcome other than the programme which mimic a programme effect is called "controlling for confounding influences."
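A minimal sketch of what such a test of association might look like is given below (Python, on simulated data); the participation rate, ration counts, and effect size are invented solely to show the mechanics of a two-group comparison and a dose-response correlation.

```python
# A minimal sketch of testing for a statistical association between programme
# delivery and outcome, on simulated data. The participation rate, ration
# counts, and effect size are invented; in a real evaluation these variables
# would come from the survey instruments.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 800

participant = rng.random(n) < 0.5                        # roughly half enrolled
rations = np.where(participant, rng.poisson(6, n), 0)    # programme delivery
outcome = -1.5 + 0.06 * rations + rng.normal(0, 1.0, n)  # weight-for-age z-score

# 1. Two-group comparison: participants versus non-participants
t, p = stats.ttest_ind(outcome[participant], outcome[~participant])
print(f"Two-group comparison: t = {t:.2f}, p = {p:.4f}")

# 2. Dose-response: association between level of delivery and outcome
r, p_r = stats.pearsonr(rations, outcome)
print(f"Correlation of rations received with outcome: r = {r:.2f}, p = {p_r:.4f}")
```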
The questions to be asked about a programme and its evaluation are very different depending on the results of the statistical tests of association. Although these tests are performed after data collection, they must be foreseen to ensure that the right data are collected. Elaborating on the summary above, then, the first question after data collection and analysis are completed will be: was there a statistical association between the programme intervention and the outcome?
If no association between the intervention and outcome is found, the next questions concern the reliability of the data:
- Was there measurement imprecision?
- Was there undependability due to random influences on outcome?
Such errors due to unreliability can be reduced by:
- exclusion by design,
- stratification before treatment and exclusion by analysis (e.g. matching),
- measurement and exclusion by analyses (e.g. multivariable analysis); the last two approaches are illustrated in the sketch below.
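The sketch below (Python, on simulated data) illustrates the last two approaches. Maternal education is an invented extraneous influence that both raises participation (through self-selection) and improves the outcome; comparing participants with non-participants within education strata, or adjusting for education in a multivariable regression, gives an estimate much closer to the programme effect built into the simulation than the crude comparison does. All variables and coefficients are hypothetical.

```python
# A minimal sketch of handling an extraneous (confounding) influence on the
# outcome by (a) stratification and (b) multivariable analysis, on simulated
# data. The confounder (maternal education), participation rates, and effect
# sizes are invented; the "true" programme effect built into the simulation
# is +0.2 z-scores.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 2000

educated = rng.random(n) < 0.4                               # extraneous influence: maternal education
participates = rng.random(n) < np.where(educated, 0.7, 0.3)  # self-selection into the programme
outcome = (-1.8 + 0.5 * educated + 0.2 * participates
           + rng.normal(0, 1.0, n))                          # weight-for-age z-score

# Crude comparison mixes the programme effect with the education effect
crude = outcome[participates].mean() - outcome[~participates].mean()
print(f"Crude participant - non-participant difference: {crude:+.2f}")

# (a) Stratification: compare within each education stratum
for label, stratum in [("educated", educated), ("not educated", ~educated)]:
    diff = (outcome[stratum & participates].mean()
            - outcome[stratum & ~participates].mean())
    print(f"Within {label}: difference = {diff:+.2f}")

# (b) Multivariable analysis: regression adjusting for the confounder
X = sm.add_constant(np.column_stack([participates, educated]).astype(float))
fit = sm.OLS(outcome, X).fit()
print(f"Adjusted programme coefficient: {fit.params[1]:+.2f}")
```

Both approaches remove most of the bias visible in the crude comparison, but only to the extent that the extraneous influence has been identified and measured.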
If an association between the intervention and outcome is found, the next questions are whether it could be a statistical artifact, due to:
- fishing (multiple significance tests), or
- wrong assumptions about the appropriateness of the chosen statistical tests for the data.
This list of questions proceeds by successive elimination. Some of the questions (i.e., 3 and 4) are so difficult to answer perfectly in the context of a realistic programme evaluation that one must examine carefully whether they are really necessary or even useful for the evaluation. The perfect solution to confounding is to design an intervention in such a way that all influences on outcome other than that of the programme are randomly distributed among comparison groups. This then permits one to state that a probability of association is a probability of causation.
Unfortunately, the random distribution of all confounding influences on outcome other than that of the programme requires designs that are impossible in evaluation. Evaluation must therefore deal with confounding influences in a different fashion to establish the validity of the results, and thus be able to make a plausible statement about the programme being the cause of the improvement.
In the following sections, some of these concepts are elaborated to provide background for the procedures discussed in chapter 2.