



1. Basic concepts for the design of evaluation during programme implementation


J.-P. Habicht, John B. Mason, and H. Tabatabai


Introduction
Costs of evaluation
Purposes of evaluation
Setting programme objectives as a basis for evaluation
Investigating causality
Confounding variables and evaluation design
Levels of analysis
Definitions of population groups involved
Effect/cost
Appropriate indicators for different objectives
Note on sample size
References
Bibliography


Introduction


Evaluators often face restrictive conditions; for example, they may be called on to evaluate the effects of a programme some years after the programme has begun. Often there are negligible baseline data and no control groups. Much of this and the next chapter are therefore written from the perspective of the unfortunate reality of having to evaluate an ongoing programme in the midst of implementation. It is hoped, however, that evaluations may be better planned so as to relax many of the constraints often encountered.

Present practice is to collect data on programme participants, and possibly non-participants, then to use statistical manipulations to investigate associations between programme delivery and outcome variables. These methods tend to be expensive and may be difficult to apply in developing countries as routine procedures; moreover, consideration of the questions that should be addressed before applying such methods reveals that they often turn out to be unnecessary. We, therefore, propose procedures that can be widely applicable to a range of programmes in operation.

When a nutrition intervention programme is proposed, the following sequence of questions (adapted from ref. 1) needs to be addressed to establish whether the intervention can in principle affect the performance, health, and survival of individuals - that is, whether the intervention is in fact justified.

  1. Is a deficit of food or specific nutrients causing disease, decreased performance, or untimely death in individuals?
  2. How detrimental is this deficit to individual performance, health, or survival? In other words, what is the dose-response relationship?
  3. Is it possible to decrease or eliminate the deficit (or its effects) in individuals?
  4. How prevalent is the deficit (and its effects) in the population? Is the problem increasing or decreasing?
  5. What proportion of ill health, decreased performance, and untimely death in a population may be ascribed to this deficit now and in the future?
  6. Is it possible to decrease or eliminate the deficit (or its effects) in the population?
  7. What are the expected benefits, costs, and side-effects of the proposed intervention on the deficit (or its effects) in a large population, given the results of intervention trial studies in small populations? How long would the intervention need to run?
  8. What are the actual benefits, costs, and side-effects of the intervention undertaken in the large population on the deficit (or its effects)? Is the actual benefit worth the actual cost and the actual side-effects?

Before expanding further on proposed procedures for the evaluation of ongoing programmes, it is worthwhile to introduce some fundamental terminology. Specifically, the literature differentiates "(basic) research," which is done to ascertain basic scientific facts independently of their applications to programmes, from "evaluative research," which is done to assign a probability statement of causality to the relationship between an intervention in a community context and the observed nutritional and health impacts, in order to determine the viability and replicability of a given programme design and plan of operation, and from "operational programme evaluation," which ascertains whether an ongoing programme is attaining its objectives (see, e.g., 2, 3). Operational programme evaluation is the subject of this and the next chapter.

The evaluations of ongoing programmes considered here do not attempt to answer questions concerning the scientific basis for interventions (e.g. questions 1 to 4 above). These are best tackled through experimental research. However, unless these have been addressed in designing the intervention, the evaluation may be pointless. Similarly, the logic for performing evaluative research to determine whether a pilot intervention or demonstration project can successfully apply scientific principles in a field setting must be borne in mind before a large-scale project is undertaken and an operational programme considered (e.g., questions 5 to 8 are examples of areas of evaluative research).

In practice there is no clear distinction between research and evaluation in the methods used. In fact, many of the researcher's tools can be used to answer questions relevant to evaluation. There is, however, a major difference in the kind of questions being addressed, and in the appropriate combination of methods to address each question. It is a grievous mistake to call research "more scientific" than evaluation. Good science is fitting the appropriate methods to seek an answer; in this sense, a good evaluation is as scientific as good research is.

Unfortunately, most works on operational programme evaluation in nutrition have been misdirected in emphasizing the researcher's concern to substantiate the probability of causality while neglecting the other important questions that managers, administrators and funders ask. This misdirected research methodology often neglects the fact that operational programme evaluation can be divided into (a) "summative evaluation," which examines the outcome of a programme, and (b) "formative evaluation," which monitors procedures and activities so as to improve the programme design and the delivery of services. We refer to the former as outcome or impact evaluation and to the latter as process evaluation. The document by Sahn and Pestronk (3) provides a careful review of the theoretical literature in health and human services about programme evaluation, and presents abstracts of many evaluations of nutrition intervention field programmes - clearly differentiating process from impact evaluation and evaluative research from operational programme evaluation. Other evaluative research studies of nutritional programmes are reviewed by Habicht and Butz (4).

This chapter discusses a number of basic concepts needed for designing evaluations appropriate to the decisions to be taken based on their findings. These concepts are intended to provide the basis for the practical stages in evaluation given in chapter 2.


Costs of evaluation


Useful evaluations entail activities that apparently are not necessary to manage the logistics of a programme, and hence incur additional costs. Riecken has said, " ... my experience with evaluation is that there are few bargains, and usually you get no more than you pay for," and "When an evaluation is cheap and quick, it is often also not very good" (5). Without entering into this debate, we can nevertheless learn something from actual levels of expenditure on evaluation. On the one hand, the World Bank review of monitoring and evaluation of projects in East Africa in 1979, for example, gives a figure of US $12.8 million for monitoring and evaluation of 28 projects, averaging US $460,000 each, or somewhere around 0.5 to 5 per cent of total project costs (6). The US Government estimates that 1 per cent should be used to evaluate programmes in health. On the other hand, Kielmann et al. consider that "... it should be neither uncommon nor unreasonable to budget 20 per cent to 40 per cent of total project costs for analysis of project-generated data and project evaluation" (7). Evidently, the scale of expenditure of the project itself has some influence on these calculations. Similarly, the purpose of the evaluation is important: while relatively high expenditures may be essential for pilot or experimental projects, it seems unlikely that more than perhaps 5 per cent of project costs would be made available for routine evaluations of large-scale service projects. Such expenditures would often be sufficient to allow useful evaluation, if it is recognized that causality may not be established.
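The arithmetic behind the World Bank figures can be checked directly. The following is a minimal sketch in Python using only the numbers quoted above; the per-project total costs used to recover the 0.5 to 5 per cent range are not given in the text and are illustrative assumptions.

```python
# Figures quoted from the World Bank review (6); per-project total costs are
# NOT given in the text, so the values used below to recover the 0.5-5 per
# cent range are assumptions for illustration only.
total_me_cost = 12.8e6   # US$ spent on monitoring and evaluation of 28 projects
n_projects = 28

avg_me_cost = total_me_cost / n_projects
print(f"Average M&E cost per project: US ${avg_me_cost:,.0f}")  # about US $460,000

# If total costs per project ranged from roughly US $9 million to US $90 million,
# the M&E share would span roughly 5 down to 0.5 per cent.
for total_project_cost in (9e6, 90e6):
    share = avg_me_cost / total_project_cost
    print(f"{share:.1%} of a US ${total_project_cost / 1e6:.0f} million project")
```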

Ultimately, determining the percentage of programme expenditures spent on evaluation is a subjective judgement, as it depends upon which costs are assigned to the delivery of programme services and which to the distinct effort labelled evaluation. In fact, the best evaluation may appear to cost nothing, as it would be an integral component of programme design and implementation.


Purposes of evaluation


The question "How do we tell if a programme has an effect?" is incomplete without knowing why one needs to know. Common reasons are:

- to decide whether to continue the existing programme or not,
- to redesign the programme if necessary, or
- to decide whether to do similar programmes elsewhere.

Those involved in the programme often have different expectations about the purpose and results of an evaluation. It is important that the decisions which will be made using evaluation findings be clearly understood and agreed on. The evaluators must then tailor not only the design of the evaluation to the purpose of the evaluation but also the presentation of the results to that purpose. For instance, results presented as if the purpose were to decide on the continuation or termination of a programme are inappropriate if the purpose of the evaluation is to improve the programme. Evaluation cannot be seen in isolation from who asks the question. It is not so much that the principles and practice of evaluation of ongoing programmes are unsatisfactory but that the whole decision-making process in nutrition and food aid programmes needs improvement.

The purposes and issues addressed in evaluation depend on who is asking the questions. The following sequence of basic issues is of particular interest to different audiences:

- Is the intervention performing as expected? (Programme managers, administrators, and funders)
- Is the intervention worth continuing? (Administrators and funders)
- Should it be extended? (Administrators and funders)
- Is it causally linked to improved nutrition? (Researchers, scientists, and others concerned with basic mechanisms of cause and effect)

This sequence begins with considering whether the programme is performing adequately, and can progress to seeking to ascribe causality between the intervention and the outcome. The sequence approximates the changing concerns of project management and administrators and the researchers' concern with causality. Causality, if it can be shown, is also important to all aspects of management, programme design, and policy; however, it is difficult and expensive to establish, and the more that certainty on causality can be dispensed with, the easier the evaluation becomes. Project management can often, in fact, get by with the knowledge that the beneficiaries are improving, even if they cannot be sure this is due to the programme.

Part of the information needed to address questions such as those given above can be obtained by evaluating project design and from process data. Moreover, these data can be used to screen out those projects that are unlikely to have any important effect on outcome, and thus are not worth evaluating further. This procedure is set out in subsequent sections. Other decisions required in establishing purposes of an evaluation centre on the degree of certainty required in linking outcome to programme delivery, and these need to be explained in more detail at this stage.

Different purposes of evaluation demand varying degrees of plausibility or certainty in the conclusions reached from the evaluation. The purposes, in order of increasing need for certainty (elaborating on the sequence of questions just given), are:

  1. improvement in programme management,
  2. continuation of funding,
  3. replication of the programme in similar conditions,
  4. replication of the programme in dissimilar conditions,
  5. finding basic research results about cause-effect relationships.

The methodological and data requirements of responding to the differing needs of these purposes for certainty and plausibility entail, in order of increasing expense and difficulty:

  1. collection of data on process and outcome for participants only (programme data),
  2. collection of data through ad hoc surveys,
  3. advanced statistical analysis,
  4. control group(s) of some kind,
  5. collection of before-after data,
  6. highly-standardized measurements,
  7. randomized intervention,
  8. double-blind research designs (blind intervention and blind assessment).

It is useful to consider how these two lists can be matched. Each item in the first list is taken up individually below, and the discussion is summarized in table 1.1 (Appropriate Data Collection and Analysis for Different Decisions).

  1. Evaluation for programme management seeks to determine whether programme services are being delivered as planned to the intended target groups and whether the (gross) outcome is acceptable. The objective of this "adequacy evaluation" is to reveal the possibilities for improvement in programme management. It does this by relying on programme data relating to process and participants (method a). It may, on occasion, require survey data as well (method b).
  2. The decision whether to continue funding of a particular programme often requires adding advanced statistical analysis (method c) to the requirements for adequacy evaluation (method a) and, possibly, survey work (method b).
  3. Replication of the programme in similar conditions usually requires data on some form of control groups (method d), and/or surveys (b), in addition to the requirements for purpose 2.
  4. Replication in dissimilar conditions entails, at the minimum, both control groups (d) and survey data (b), as well as advanced analysis (c). Sometimes it may be preferable to use a quasi-experimental design (see section on Design below), e.g. to add before-after data (method e), to standardize measurements carefully (method f), and/or to use a randomized design (method g).
  5. Basic research involves most of these requirements, and sometimes it may even be possible to employ the ideal research design, the double-blind randomized trial (method h).

The confidence with which the conclusions in each of the above cases are reached can be considerably improved by strong theory relying on good scientific evidence from elsewhere.
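The matching of purposes to data requirements just described (and summarized in table 1.1) can be expressed as a simple lookup. The following is a minimal sketch in Python; the labels a to h follow the list of methods above, and the exact assignments should be read as an approximation of table 1.1 rather than a definitive rule.

```python
# Methods labelled a-h as in the list above.
METHODS = {
    "a": "programme data on process and outcome, participants only",
    "b": "ad hoc surveys",
    "c": "advanced statistical analysis",
    "d": "control group(s) of some kind",
    "e": "before-after data",
    "f": "highly standardized measurements",
    "g": "randomized intervention",
    "h": "double-blind design (blind intervention and blind assessment)",
}

# Minimum requirements for each purpose, approximating table 1.1; items noted
# in comments are needed only sometimes.
REQUIREMENTS = {
    "programme management": ["a"],                                # plus b on occasion
    "continuation of funding": ["a", "c"],                        # plus b possibly
    "replication in similar conditions": ["a", "b", "c", "d"],
    "replication in dissimilar conditions": ["a", "b", "c", "d"],  # plus e, f and/or g where feasible
    "basic research on cause and effect": ["a", "b", "c", "d", "e", "f", "g"],  # sometimes h
}

def methods_for(purpose):
    """Return the descriptions of the minimum methods needed for a purpose."""
    return [METHODS[code] for code in REQUIREMENTS[purpose]]

print(methods_for("continuation of funding"))
```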


Setting programme objectives as a basis for evaluation


Process evaluation demands formulating implementation and performance objectives against which the programme can be evaluated. For the manager's questions (e.g. is the programme performing as expected?) and usually for the funder-administrator's questions (e.g. is the programme worth continuing, extending, etc.?), the comparison is between the procedures and activities of the programme and some preset standards, generally set out in the programme work-plan. The first prerequisite is, therefore, that the essential activities be stated in objectively measurable units. This is possible even for such an amorphous exercise as curative primary health care (1). Actual performance relative to these standards is ascertained through process evaluation.

A requirement for outcome evaluation is to establish objectives prior to assessment. These must be explicitly formulated as an acceptable difference from a standard, or as a minimum improvement from some baseline. These quantitative standards of achievement should correspond to the implicit objectives of the programme and should be understood and agreed on by those who must use the results of the evaluation. Experience shows that the exercise of making stated and implicit objectives more explicit will often reveal hidden objectives, some of which are even contradictory. This is why a consensus about the programme objectives is one of the necessary first steps to an evaluation.

Almost inevitably, programme objectives change as a programme evolves. However, changing the definition of objectives during the evaluation of a single project should be avoided, because it is rare that the design of the evaluation can deal with new objectives. For example, a recent review of supplementary feeding programmes discussed whether the more important effects of these programmes were in terms of income distribution, since the supposed objectives of improving child nutrition were seldom reached (8). However, no comparison was made with quite different programmes that might be more efficient in changing income distribution. While this may be a reasonable question in general, changing the objectives of an individual programme requires more fundamental decisions.

Once the underlying outcome is identified conceptually, the next step is to identify the measurable variables related to the outcome of concern. The major portion of this book discusses that step, relating the desired outcome (e.g., improved nutrition) to a measured variable (e.g., anthropometry). Subsequent chapters develop the relationship between the conceptual outcome and the measurements more fully.

Finally, the statistical test used to judge the reality of a measured difference (either between treatment and control groups, or between treatment and a standard) results in a statement that most of the time (usually specified as 95 to 99 per cent of the time) such a measured difference will be found if the true difference is not smaller than some quantity. In designing an evaluation one must further state how often one is willing to miss identifying a true difference of more than a specified magnitude. This statement refers to power analysis (see, for example, [9]). These steps of specifying procedural and impact objectives, translating those objectives into measurable variables, specifying the minimum or maximum acceptable difference in that variable, and doing the power analysis are prerequisites for any quantified evaluation.
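As an illustration of the power analysis referred to above, the following minimal sketch in Python uses the normal approximation for comparing two group means to compute the sample size per group needed to detect a specified minimum difference with given type I and type II error rates; the numerical values are illustrative assumptions, not figures from the text.

```python
import math
from scipy.stats import norm

def sample_size_per_group(min_difference, sd, alpha=0.05, power=0.80):
    """Approximate sample size per group to detect `min_difference` between
    two means (two-sided test, normal approximation)."""
    z_alpha = norm.ppf(1 - alpha / 2)  # controls the type I error (false positive)
    z_beta = norm.ppf(power)           # controls the type II error (missed true difference)
    n = 2 * ((z_alpha + z_beta) * sd / min_difference) ** 2
    return math.ceil(n)

# Illustrative assumption: a 0.3 z-score improvement in weight-for-age is the
# minimum meaningful difference, with an outcome standard deviation of 1.0.
print(sample_size_per_group(min_difference=0.3, sd=1.0))  # roughly 175 per group
```

Raising the required power or lowering the minimum meaningful difference increases the required sample size accordingly.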

The sad fact is that the research giving scientific justification to a programme is often so lacking that these steps are impossible. Experiments in the precise setting of the proposed programme may not always be needed (or possible). However, there needs to be a marshalling of the evidence from previous evaluations, experiments, and scientific knowledge, to serve as a basis for designing a relevant evaluation. Unfortunately, this is all too seldom done.


Investigating causality


Investigating causality involves exploring whether there is a link between programme activities and outcome. The logical sequence of questions in evaluating whether a measured outcome is plausibly caused by programme activities may be summarized as follows:

- Is the outcome adequate? (Of concern to programme managers, administrators, and funders.)
- Is there a statistical association between intervention and outcome? (Of concern to administrators and funders and to researchers.)
- Is the outcome due to the intervention? (Of concern to administrators and funders and to researchers.)

If there is no statistical association or if the outcome is inadequate:

- Were the statistical methods used correct?
- Was the intervention relevant and adequate?
- Were the measurements of outcome valid and reliable?
- Were the recipients likely to benefit?
- Was the sample size adequate?
- Was there negative confounding?

If there is statistical association and the outcome is adequate:

- Is the association likely to be causal? (Rule out confounding; see table 1.2 on threats to internal validity.)
- What was the direction of causality?
- What mechanisms linked the intervention to the outcome?
- Can the findings be extrapolated to the population as a whole (i.e. do they have external validity)?
- What was the cost of the effect and of the marginal effect?

TABLE 1.2. Major Causes of Confounding (Threats to Internal Validity)

1. Selection

Cause of confounding: When the assignment of subjects to treatment and comparison groups is not random, the groups may differ systematically in some characteristic(s) associated with the outcome variable. Selection bias is therefore likely to be present. Self-selection is a common source of this type of bias.

Example: Mothers who choose to participate in a programme to reduce the incidence of low birthweight may tend to be more educated, richer, and more motivated than those who do not. These factors influence the outcome and compete with the programme as an explanation for an observed reduction in the incidence of low birthweight.

2. Maturation

Cause of confounding: Human subjects mature over time, and this process may cause changes in the outcome variable irrespective of programme effects.

Example: The nutritional status of 6- to 24-month-old children is often worse than that of older preschoolers. If the average age of participants in a nutrition programme increases from, say, 18 to 36 months, observed improvement may be due to maturation and would have occurred without the programme.

3. History

Cause of confounding: When a programme is in effect, many other events may intervene and influence the outcome variable. When these historical events have different impacts on the treatment and comparison groups, they confound the programme effects.

Example: A supplementary feeding programme is introduced in one of two otherwise equivalent areas. If food prices rise at different rates in the two areas, observed differences in nutritional status may not be attributed solely to the feeding programme. The differential price rises may have influenced the outcome.

4. Instrumentation

Cause of confounding: A threat may arise from changes in the way measurements are made or in what is measured, or from measurement errors due, for example, to instrument decay.

Example: The height for age of preschool children is often compared across age groups. Since for infants from birth to two years old it is usually length that is measured, rather than height as for children older than two, comparisons between the two groups may be biased.

5. Regression artifact

Cause of confounding: If subjects are chosen on the basis of exhibiting an extreme value on some variable (e.g. wasting), there may be improvement over time without any intervention. This tendency is called regression toward the mean. The solution is either to observe the effect on the whole population or to make comparisons within the selected extreme group (e.g. the malnourished).

Example: In a nutrition programme instituted for the malnourished, improvement may be shown in that some of the participants are no longer malnourished at the end of the programme; but part of this improvement may not be due to the programme, since some subjects would have improved anyway.

6. Experimental mortality

Cause of confounding: Some subjects may drop out of a programme during the course of its implementation. If these subjects have different characteristics than those who remain, any before/after effect shown may be confounded by differences in the populations at the beginning and end of the programme.

Example: A food-for-work programme may not lead to an improvement in the nutritional status of a community even if it has in fact been effective. This could happen if enough of the participants who improve leave the community in search of jobs elsewhere. The observed change here underestimates the impact of the programme.

The basic question is whether there is a statistical association between the putative cause (the intervention) and the outcome or effect (e.g. improving nutritional status). Seeking associations is useful only when one needs some level of certainty that the programme is causally related to the outcome. Showing an association requires comparison of measured outcome in at least two groups that receive different intensities of programme intervention. This may mean comparing two groups such as control and treatment, showing correlations between different levels of programme delivery and outcome, estimating regression coefficients between programme activities and outcome, statistically controlling for other influences, and so on. Controlling for influences on the outcome other than the programme which mimic a programme effect is called "controlling for confounding influences."
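The following minimal sketch in Python illustrates, with hypothetical data, what "controlling for confounding influences" means in practice: a regression of the outcome on programme participation with and without a confounding variable (household income) included. All variable names and numerical values are illustrative assumptions.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 400

# Hypothetical data: poorer households are more likely to participate, and
# income also influences the outcome (weight-for-age z-score), so income
# confounds the crude programme-outcome comparison.
income = rng.normal(0.0, 1.0, n)                                  # standardized household income
participates = (rng.uniform(size=n) < 0.4 + 0.2 * (income < 0)).astype(float)
waz = -1.0 + 0.25 * participates + 0.30 * income + rng.normal(0.0, 1.0, n)

# Crude association: ignores the confounder.
crude = sm.OLS(waz, sm.add_constant(participates)).fit()

# Adjusted association: the coefficient on participation now estimates the
# programme-outcome relationship with income held constant.
adjusted = sm.OLS(waz, sm.add_constant(np.column_stack([participates, income]))).fit()

print("crude estimate:   ", round(crude.params[1], 2))
print("adjusted estimate:", round(adjusted.params[1], 2))  # closer to the true 0.25
```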

The questions to be asked about a programme and its evaluation are very different depending on the results of the statistical tests of association. Although these tests are performed after data collection, they must be foreseen to ensure that the right data are collected. Elaborating on the summary above, then, the first question after data collection and analysis are completed will be: was there a statistical association between the programme intervention and the outcome?

If no association between the intervention and outcome is found, the next questions are as follows:

  1. Are the probability statements themselves correct in stating that there is not a high probability of association between outcome and intervention? Were the right statistical tests used, given the objective of the programme?
  2. Was the intervention relevant? Was it adequately applied? Check programme design and process evaluation.
  3. Were the measures of improved nutritional outcome relevant to the programme objectives? There is no point in using indicators that have not been shown to be responsive to the intervention (see [4] and table 1.6 below).
  4. Were the recipients likely to show a benefit? This question is a different way of posing questions 2 and 3 but should be asked explicitly.
  5. Given "yes" to questions 1 to 4, was the sample size adequate to reveal a meaningful minimum association? In whom (needy, targeted, participant population, whole population)? This can be checked as follows:
     (a) What size of effect is expected under these conditions?
     (b) What size are the errors of unreliability due to:
         - measurement imprecision?
         - undependability due to random influences on outcome?
         These errors due to unreliability can be reduced by:
         - exclusion by design,
         - stratification before treatment and exclusion by analysis (e.g. matching),
         - measurement and exclusion by analysis (e.g. multivariable analysis).
     (c) Recalculate the necessary sample size given (a) and (b) and the desired probabilities of type I and II errors (a minimal sketch follows this list).
  6. Was there a countervailing influence that cancelled out the association: e.g. treatment contamination, negative confounding, unintended treatment of controls?
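A minimal sketch of step (c) above, again in Python under illustrative assumptions: measurement imprecision and random influences on the outcome inflate the effective variance, so the sample size computed earlier must be recalculated with a larger standard deviation.

```python
import math
from scipy.stats import norm

def required_n(delta, sd, alpha=0.05, power=0.80):
    """Sample size per group for a two-sided comparison of means
    (normal approximation), as in the earlier sketch."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return math.ceil(2 * (z * sd / delta) ** 2)

# Illustrative assumptions: true between-subject variation, plus extra variance
# from measurement imprecision and from random (undependable) influences.
true_sd, measurement_sd, random_sd = 1.0, 0.5, 0.5
effective_sd = math.sqrt(true_sd**2 + measurement_sd**2 + random_sd**2)

print(required_n(delta=0.3, sd=true_sd))       # if the outcome were measured without error
print(required_n(delta=0.3, sd=effective_sd))  # larger n once unreliability is allowed for
```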

If an association between the intervention and outcome is found, the next questions are as follows:

  1. Are the statistical probability statements themselves incorrect due to:
     - fishing (multiple significance tests),
     - wrong assumptions about the appropriateness of the chosen statistical tests for the data?
  2. What are the boundaries of the magnitude (not the statistical significance) of the association?
  3. Was the association causal? Answering this requires discarding confounding variables inherent in the comparison groups, the treatment, and the outcome measurements.
  4. Once causality is established, what was the direction of causality?
  5. What are the mechanisms of the causal relationship?
  6. What is the cost of the effect? What are the marginal costs?
  7. Can one extrapolate these findings to the future or to other situations?
This list of questions narrows progressively in its logic. Some of the questions (i.e., 3 and 4) are so difficult to answer perfectly in the context of a realistic programme evaluation that one must examine carefully whether they are really necessary or even useful for the evaluation. The perfect solution to confounding is to design an intervention in such a way that all influences on outcome other than that of the programme are randomly distributed among comparison groups. This then permits one to state that a probability of association is a probability of causation.

Unfortunately, the random distribution of all confounding influences on outcome other than that of the programme requires designs that are impossible in evaluation. Evaluation must therefore deal with confounding influences in a different fashion to establish the validity of the results, and thus be able to make a plausible statement about the programme being the cause of the improvement.

In the following sections, some of these concepts are elaborated to provide background for the procedures discussed in chapter 2.

