In general, data processing can be understood as the treatment given to the data after collection. In small evaluations, this treatment is usually manual. In large-scale efforts, the bulk of the data handling requires access to computer facilities, although some parts of the data processing may still be performed manually. In this context, manual and computerized techniques will be reviewed, with special reference to large-scale studies.
Data Input
Recent advances in computer sciences provide a choice of alternatives for data input. These range from the use of the traditional punched card to direct access with automated systems using mark sense devices or direct on-line input from a measuring apparatus.
When the survey or research comprises a small number of cases and each case is evaluated in terms of many characteristics, an interactive data input procedure might be recommended, especially when the original data are generated in a place where facilities for data input (terminals) are available or can be easily installed. Interactive data input procedures provide the opportunity to test for completeness, inconsistencies and errors at the data source, which often permits the implementation of proper procedures of data recovery. Unfortunately, this type of data input will undoubtedly have limited application in field evaluations.
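For illustration, a minimal Python sketch of the kind of check an interactive entry routine could apply at the source is shown below; the field names and plausibility limits are assumptions, not part of any particular system.

```python
# Minimal sketch of an interactive entry check (hypothetical field names and limits).
# The idea is to reject an incomplete or implausible record while the respondent
# is still available, so the data can be corrected at the source.

REQUIRED = ["id", "sex", "age_years", "weight_kg"]
LIMITS = {"age_years": (0, 110), "weight_kg": (1.0, 250.0)}  # assumed plausibility limits

def entry_problems(record: dict) -> list[str]:
    problems = [f"missing item: {item}" for item in REQUIRED if record.get(item) in (None, "")]
    for item, (lo, hi) in LIMITS.items():
        value = record.get(item)
        if value is not None and not (lo <= value <= hi):
            problems.append(f"{item}={value} outside admissible range {lo}-{hi}")
    return problems

if __name__ == "__main__":
    print(entry_problems({"id": "01101", "sex": "M", "age_years": 230, "weight_kg": None}))
    # -> ['missing item: weight_kg', 'age_years=230 outside admissible range 0-110']
```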
When interactive data input procedures are not applicable, some type of key-to-tape data input system must be implemented. In such systems, the speed of data recording can be high; however, immediate checks for completeness or inconsistency are not possible, since the data are unavoidably processed with a delay. Under these conditions, delayed checks for errors, completeness and inconsistency are possible, although the recovery of data is in most instances practically impossible.
Data Quality Control
The control of data quality is a most important aspect of any research process. Once the data have been collected and coded, the control of their quality generally proceeds in two stages: the first relates to completeness and the second to the internal consistency among the various items that comprise the data set.
The preliminary controls for completeness of the data are usually, but not necessarily, performed after the coding of the data is complete. The purpose of this exercise is to control for the inclusion of every required item in each observation vector, both in terms of identification and actual data items (variables).
As indicated earlier, the identification portion of the observation vector generally includes several items or information bits that describe different individual characteristics. These descriptive items, considered in parallel with the evaluation or survey design, provide the reference criteria for the preliminary control of the completeness of the set of observation units. Thus, if there exist three items in the identification portion of the observation vector, identified as a primary unit code (PU), a secondary unit code (SU) and the individual number within the SU (IN), then the identification for each observation vector would be a composite set of characters formed as PU-SU-IN. For example, if there are 25 PUs and the number of INs differs among the SUs within each PU, then a complete inventory of the possible codes for each PU can be constructed. As an illustration of this hypothetical case, a list of the acceptable codes for a preliminary identification control of completeness for the set of observations is presented in table 12.2.
TABLE 12.2. List of Codes (Inventory of Codes) for PU (Primary Units), SU (Secondary Units), and Number of INs (Individuals)

PU | SU | INs
01 | 1 | 10
01 | 2 | 37
01 | 3 | 17
01 | 4 | 28
01 | 5 | 15
01 | 6 | 06
02 | 1 | 13
02 | 2 | 21
02 | 3 | 20
02 | 4 | 11
02 | 5 | 19
03 | 1 | 14
.. | .. | ..
24 | 1 | 08
24 | 2 | 29
24 | 3 | 21
25 | 1 | 13
25 | 2 | 17
25 | 3 | 18
25 | 4 | 12
25 | 5 | 21
25 | 6 | 30
In this example, the detailed identification codes for observations included in the first PU would be as follows:
SU1 | 01101, 01102, 01103, ..., 01110
SU2 | 01201, 01202, 01203, ..., 01237
... | ...
SU6 | 01601, 01602, 01603, ..., 01606
In the same manner, the admissible identification codes can be listed in detail for each of the 25 PU.
The actual control for admissible identification codes can be carried out in different ways. One procedure is the simple visual checking of the existence of each recorded entry for each identification item required, without controlling explicitly for the validity of such entries. Another possibility is a complete and detailed visual checking of the list of valid identification codes, such as that presented in table 12.2., for every individual information vector. In this latter alternative, practical feasibility diminishes as the size of the observation set increases. However, the control of validity and completeness of the identification portion of each information vector can be performed accurately and efficiently on large observation sets through complex but automated systems using computer facilities.
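As an illustration of such an automated check, the following Python sketch builds the set of admissible PU-SU-IN codes from an inventory like that of table 12.2 (only an excerpt of the inventory is shown) and flags recorded identifiers that fall outside it.

```python
# A minimal sketch, assuming an inventory like table 12.2: for each (PU, SU) pair,
# the number of individuals enumerated.  Valid composite codes are PU-SU-IN written
# as a five-character string (two digits PU, one digit SU, two digits IN),
# e.g. "01237" = PU 01, SU 2, individual 37.

INVENTORY = {("01", "1"): 10, ("01", "2"): 37, ("01", "6"): 6}  # excerpt only

def valid_codes(inventory: dict) -> set[str]:
    codes = set()
    for (pu, su), n_individuals in inventory.items():
        for ind in range(1, n_individuals + 1):
            codes.add(f"{pu}{su}{ind:02d}")
    return codes

def check_identifications(recorded: list[str], inventory: dict) -> list[str]:
    admissible = valid_codes(inventory)
    return [code for code in recorded if code not in admissible]

if __name__ == "__main__":
    # "01238" is rejected because SU 2 of PU 01 contains only 37 individuals.
    print(check_identifications(["01101", "01237", "01238"], INVENTORY))
```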
The preliminary quality control procedures also relate to the checking of completeness for the remaining portion of the observation vector, which constitutes the actual data portion (variables) of each vector. In this case, special care is required to identify logical omissions of data which may be the valid result of logical associations among variables. For example, when one observation data item identifies a male subject, the observation vector for this subject cannot, and must not, include data items that refer to the number of pregnancies of the subject.
In the preliminary procedures for the quality control of observation data items, it is often possible to include obvious control items that do not require much effort in the checking process. For example, if the questionnaire is applicable only to adults, for example those 18 years of age or older, it is possible, while checking the completeness of the age items, to identify subjects under 18 years of age.
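A minimal sketch of this kind of item-level check is given below; the variable names (sex, age_years, pregnancies) and the adult cut-off of 18 years follow the examples above but are otherwise illustrative.

```python
# A minimal sketch of the preliminary completeness check for data items, using
# hypothetical variable names.  "Number of pregnancies" is treated as a logical
# omission for male subjects, and respondents under 18 are flagged, since the
# questionnaire in the example applies to adults only.

def item_problems(record: dict) -> list[str]:
    problems = []
    if record.get("age_years") is None:
        problems.append("age missing")
    elif record["age_years"] < 18:
        problems.append("subject under 18: questionnaire not applicable")
    if record.get("sex") == "M":
        if record.get("pregnancies") is not None:
            problems.append("pregnancies recorded for a male subject")
    elif record.get("pregnancies") is None:
        problems.append("pregnancies missing for a female subject")
    return problems

if __name__ == "__main__":
    print(item_problems({"sex": "M", "age_years": 16, "pregnancies": 2}))
    # -> ['subject under 18: questionnaire not applicable',
    #     'pregnancies recorded for a male subject']
```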
Incidentally, when the preliminary procedures for quality control are applied soon after collection and coding, it often may be possible to recover missing bits of data by going back to the field. This possibility should always be considered, and the rules governing such procedures explicitly addressed in the SOP.
After satisfactorily completing this preliminary stage of data quality control checks, the first-stage edited information vectors are ready for entry into appropriate devices for further processing. This is done prior to implementing the computing procedures required under the plan of analyses defined in the SOP.
Obviously, the number and magnitude of errors can be most efficiently reduced by improving, in the planning and testing stages, the procedures for collecting and enumerating data, rather than by increasing the number of a posteriori revisions and internal consistency checks. Independently of the adequacy of collection and enumeration procedures, however, consistency controls are always essential. They will substantially contribute to the "cleanliness" of the data. This type of quality control ranges from simple to fairly complicated checks, designed to detect, at different levels, contradictions contained in the data.
The processing required in the control of consistency generally relates to two types of variables: continuous variables (interval scaled) such as age, weight, height, temperature and blood values; and discrete variables (nominally or ordinally scaled) such as sex, race, marital status and birth order.
The actions to be taken when an error is detected through any checking procedure should be specified in advance in the SOP.
Although there are many possibilities for consistency controls, the procedures to be applied generally relate to the check of admissible ranges, and the examination of arithmetic, logical or special relations among variables.
The check and control of admissible ranges applies to both continuous and discrete variables, since in the latter case the numeric codes assigned to the various classes can be examined in terms of the admissible numerical values of the codes that have been defined for a given variable. Range controls are usually applied to the basic data collected. They may also be applied to indices, ratios or any other forms of data derived from the original observations. The inclusion of derived data in the control of ranges often provides opportunities to detect inconsistencies that may not be apparent in the original data.
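The following sketch illustrates a range control that covers both basic and derived data; the variable names and admissible limits are assumptions chosen for the example.

```python
# A minimal sketch of range control for basic and derived data, with assumed
# variable names and limits.  The derived ratio (weight / height) is checked
# against its own range, which can expose inconsistencies that each original
# variable passes individually.

RANGES = {
    "weight_kg": (1.0, 250.0),
    "height_cm": (40.0, 220.0),
    "weight_per_cm": (0.05, 1.5),   # derived: weight / height
    "marital_status": (1, 5),       # discrete variable coded 1-5
}

def range_violations(record: dict) -> list[str]:
    values = dict(record)
    if record.get("weight_kg") and record.get("height_cm"):
        values["weight_per_cm"] = record["weight_kg"] / record["height_cm"]
    return [
        f"{name}={values[name]:g} outside {lo}-{hi}"
        for name, (lo, hi) in RANGES.items()
        if name in values and not (lo <= values[name] <= hi)
    ]

if __name__ == "__main__":
    # Each recorded value is admissible on its own, but the derived ratio is not.
    print(range_violations({"weight_kg": 240.0, "height_cm": 45.0, "marital_status": 2}))
```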
Since different variables within a case are often related, arithmetic relations among pairs of variables also can be used in the process of internal consistency controls. Consider, for example, a pair of variables, A and B. In the consistency control procedures, it is possible to specifically check conditions such as A greater than B; that is, the numeric value of A is always greater than the value of B, except in the situation of a "not applicable" answer for either A or B. An example of this may be the number of persons in the family (variable A) and the number of children under five years of age in the same family (variable B). Similarly, the condition A greater than or equal to B, that is, A is always equal to or greater than B, can be defined and checked; for example, where variable A is the number of children in the family and variable B is the number of children under five years of age in the same family. A simple numeric equality relation, A = B, generally would indicate duplication in the data, but it sometimes may be an appropriate criterion for consistency checks, as is the case, for example, when data records are reshuffled and variable names are changed.
The types of consistency controls described above also can be applied to derived data. For example, if a new derived variable D is the sum of components A, B, and C, then D may be independently checked against the actual sum of the components (A + B + C). The arithmetic relationship controls are most often used when checking continuous variables (interval scaled), although it is possible to make limited use of such controls in the case of some discrete variables (ordinally scaled).
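A short sketch of such arithmetic consistency checks, using hypothetical household variables in the spirit of the examples above, might look as follows.

```python
# A minimal sketch of arithmetic consistency checks between pairs of variables
# and against a derived sum, using assumed variable names.

def arithmetic_violations(rec: dict) -> list[str]:
    problems = []
    # A >= B: children under five cannot exceed the total number of children.
    if rec["children_under_5"] > rec["children_total"]:
        problems.append("children_under_5 exceeds children_total")
    # A > B: family size must exceed the count of children under five
    # (at least one adult respondent is assumed present).
    if rec["family_size"] <= rec["children_under_5"]:
        problems.append("family_size not greater than children_under_5")
    # Derived variable D checked against the actual sum of its components.
    if rec["household_total"] != rec["adults"] + rec["children_total"] + rec["elderly"]:
        problems.append("household_total differs from adults + children_total + elderly")
    return problems

if __name__ == "__main__":
    print(arithmetic_violations({
        "family_size": 6, "children_total": 3, "children_under_5": 4,
        "adults": 2, "elderly": 1, "household_total": 6,
    }))
    # -> ['children_under_5 exceeds children_total']
```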
The control of logical relations is based on the dependency of one variable on another. For example, the variable C can take on different specific values or a range of values, depending on the values of the variables A and/or B. More specifically, if C is the weight of a child, A the sex of that child, and B his age, and if a criterion of range can be defined for weight depending upon sex (A) and age (B), then the weight (C) of an 18-month-old male child should fall within the admissible range corresponding to his sex (male) and age (18 months).
The control of logical relations is applicable to many conditions, but it must always be based on criteria specifically defined by the researcher. The logical criteria required are generally presented in the form of "if-then" or "if and only if" statements. A general outline of some common logical control criteria is presented in table 12.3.
TABLE 12.3. General Outline and Examples of Different Kinds of Logical Checks

One-way checks                                   | Two-way checks
Prototype: if A = x then B = y                   | Prototype: A = x if and only if B = y

1. Between variable pairs
a. If A = 1 then B = 1, 3, 5, 7                  | e. A = 1 if and only if B = 5
b. If A = 1, 3 then B = 4, 6                     | f. A = 1, 3, 6 if and only if B = 2, 5, 6
c. If A = 1 - 10 then B = 3 - 18                 | g. A = 1 - 10 if and only if B = 3 - 18
d. If A = 1 - 4, 8 - 12 then B = 2 - 12          | h. A = 1 - 4, 8 - 12 if and only if B = 5 - 7, 15 - 30

2. Among several variables
i. If A = 1 and B = 2 then C = 5, 10, 11         | l. A = 1 and B = 2 if and only if C = 5, 10, 11
j. If A = 1 - 5 and B = 10 - 20 then C = 2 - 8   | m. A = 1 - 5 and B = 10 - 20 if and only if C = 2 - 8
k. If A = 1 - 5, B = 1 - 5 and C = 1 then D = 5 - 10 |
The basic difference between one-way and two-way controls relates to the uniqueness of correspondence. For example, when checking the criterion "if A = 1 then B = 1, 3, 5, 7" (example 1.a in table 12.3.), a finding A = 1 implies that 1, 3, 5 or 7 are acceptable answers for B, but it does not imply that if B = 1, 3, 5 or 7, A is necessarily equal to 1. In the case of a two-way control, the "if-then" statement becomes an "if and only if" statement, as in example 1.e in table 12.3. In this case, when checking the criterion "A = 1 if and only if B = 5", just as the finding A = 1 implies B = 5, so, conversely, the finding B = 5 implies A = 1.
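In code, the two kinds of check differ only in the direction of the implication. The sketch below, with purely illustrative value sets, expresses a one-way check as "if A is admissible then B must be admissible" and a two-way check as the equivalence of the two conditions.

```python
# A minimal sketch of one-way and two-way logical checks in the spirit of
# table 12.3.  Admissible values are given as sets; the value sets used in the
# example run are illustrative only.

def one_way_ok(a, b, a_values: set, b_values: set) -> bool:
    """If A is in a_values then B must be in b_values (cf. example 1.a)."""
    return (a not in a_values) or (b in b_values)

def two_way_ok(a, b, a_values: set, b_values: set) -> bool:
    """A in a_values if and only if B in b_values (cf. example 1.e)."""
    return (a in a_values) == (b in b_values)

if __name__ == "__main__":
    # One-way: if A = 1 then B must be 1, 3, 5 or 7; B = 4 is rejected.
    print(one_way_ok(1, 4, {1}, {1, 3, 5, 7}))   # False
    # Two-way: A = 1 if and only if B = 5; B = 5 with A = 2 is rejected.
    print(two_way_ok(2, 5, {1}, {5}))            # False
```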
The description of the preliminary control of data laid emphasis on careful procedures for verifying the completeness of the identification items. In addition to completeness, it is also essential to check for inconsistencies in identification. In this connection, special procedures such as look-up systems using binary search techniques for possible identifiers and self-checking identification number systems (for example, modulus 10 and modulus 11 techniques) can be used effectively for checking inconsistencies in the identification portion of the information vector. As can be expected, however, error identification by the control checks described above is not exhaustive. Special situations may arise for any of the variables of interest. In the data processing required for detecting errors, these cases may be handled by including in the data editing programme one or more appropriate subroutines to check special relations or conditions pertaining to a specific set of data items or observation vectors.
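As an illustration of a self-checking number, the sketch below implements one common modulus-11 variant (weights 2, 3, 4, ... applied to the data digits from right to left); the exact weighting convention should of course follow whatever scheme the study adopts.

```python
# A minimal sketch of a modulus-11 self-checking number, one common variant of
# the technique mentioned above: the check digit makes the weighted sum of the
# digits divisible by 11 (a remainder of 10 is shown here as "X").

def mod11_check_digit(data_digits: str) -> str:
    total = sum(int(d) * w for w, d in enumerate(reversed(data_digits), start=2))
    remainder = (11 - total % 11) % 11
    return "X" if remainder == 10 else str(remainder)

def mod11_valid(identifier: str) -> bool:
    return identifier[-1] == mod11_check_digit(identifier[:-1])

if __name__ == "__main__":
    body = "01237"                                  # PU-SU-IN composite code
    full_id = body + mod11_check_digit(body)
    print(full_id, mod11_valid(full_id))            # the appended digit checks out
    print(mod11_valid(body + "9"))                  # a transcription error is caught
```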
Special cases require individual attention, and in this connection general techniques such as sorting and searching are useful. The choice of specific techniques for sorting or searching in a particular situation will depend on the way the main set of data items relate to each other in the observation vector, and this in turn will determine the type of controls to be applied to the individual data items in the observation vector. For example, when checking a combination of codes or a coded data item, binary search may be the procedure of choice for looking up the acceptability of the recorded set or the coded data items, since generally there is no continuous sequence in the structure of such codes. However, when a continuous data structure is used, a direct searching technique may be the method of choice. Another type of checking which may be useful is the "route of answer check." In this instance, the answer to a specific question determines whether a subset of the data vector is applicable. A tree describing the "route of answers" within the allowed answers to the questions under consideration permits the construction of an ordered path based on the relations among the items contained in the data vector.
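A minimal sketch of a binary-search look-up of acceptable codes is given below, using Python's bisect module on a sorted, purely illustrative list of codes.

```python
# A minimal sketch of a binary-search look-up of acceptable codes, as used when
# the admissible codes do not form a continuous sequence.  Python's bisect
# module performs the search on a sorted list.

import bisect

ACCEPTABLE_CODES = sorted(["01101", "01102", "01237", "01601", "02113", "25630"])

def code_is_acceptable(code: str) -> bool:
    position = bisect.bisect_left(ACCEPTABLE_CODES, code)
    return position < len(ACCEPTABLE_CODES) and ACCEPTABLE_CODES[position] == code

if __name__ == "__main__":
    print(code_is_acceptable("01237"))   # True
    print(code_is_acceptable("01238"))   # False: not in the inventory
```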
Data Bank
The quality control of data will produce clean files for each type of data collected. A properly identified and cross-related set of such files is called a Data Bank.
The master data file will be created from the data bank by merging individual files using proper identification keys: study, data type, form identification, family, individual, date and examiner, for example. It is important to stress the need for complete and full documentation of the structure of the master data file, since this provides the keys and criteria needed for manipulating the information it contains. When a properly and exhaustively documented master file is ready, the stage of data analysis can begin.
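As an illustration of such a merge, the sketch below joins two small, hypothetical files from the data bank on a composite PU-SU-individual key using pandas; the file contents and column names are invented for the example.

```python
# A minimal sketch of merging two clean files from the data bank into a master
# file on a composite identification key, using pandas.  File contents and
# column names are assumptions for illustration.

import pandas as pd

anthropometry = pd.DataFrame({
    "pu": ["01", "01"], "su": [1, 1], "ind": [1, 2],
    "weight_kg": [11.2, 9.8],
})
diet = pd.DataFrame({
    "pu": ["01", "01"], "su": [1, 1], "ind": [1, 2],
    "kcal_day": [1250, 1100],
})

# An outer join keeps individuals present in only one file, so gaps stay visible.
master = anthropometry.merge(diet, on=["pu", "su", "ind"], how="outer")
print(master)
```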
With computer system facilities having capability for Data Base Management, the Data Bank constitutes the original source of data for structuring a useful Data Base (5) for subsequent processing. This feature is particularly useful for executing the statistical analyses required in the testing of specific hypotheses.
It is also important to point out that the data bank stage is not fixed. It is a very dynamic situation requiring continuous action and attention for as long as the interactive processes of data analyses and interpretation continue.
Data Analysis
The analysis of data relates both to the type of data and the hypotheses posed by the investigator. Most of the time, the first stage in the analysis of continuous variables consists of a scan of the data set. By scanning, one can define a set of basic descriptive statistics that will permit a first approximation to the pattern of behaviour of each variable included in the evaluation. This type of analysis, however, also provides information that can be used in assessing the relative effectiveness and success of the data cleaning and consistency controls already executed. Different levels of scans can be used to secure adequate preliminary descriptions of the study variables. In particular, in the case of discrete variables, frequency tables with single or multiple cross-classification criteria may provide a good description of these variables.
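A minimal sketch of such a scan, using pandas on invented data, is shown below: summary statistics for a continuous variable and a cross-classification table for two discrete ones.

```python
# A minimal sketch of a first "scan" of the data: descriptive statistics for a
# continuous variable and a frequency table for two discrete variables.
# Column names and values are illustrative only.

import pandas as pd

data = pd.DataFrame({
    "weight_kg": [11.2, 9.8, 12.5, 10.1, 13.0, 8.9],
    "sex": ["M", "F", "M", "F", "M", "F"],
    "village": ["A", "A", "B", "B", "A", "B"],
})

print(data["weight_kg"].describe())                # count, mean, std, min, quartiles, max
print(pd.crosstab(data["sex"], data["village"]))   # cross-classification: sex by village
```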
Once the quality of the data collected has been documented and the general descriptions for the study variables have been obtained, the investigator may proceed with the statistical testing of the specific hypotheses. Simple comparisons between two classes may be performed using Student's t-tests. Analysis of variance techniques (6, 7) may be used when testing hypotheses that involve more than two classes, provided proper attention is given to satisfying the basic assumptions underlying the use of these procedures (8). Trends and associations among variables may be examined by multiple regression and correlation analyses (9, 10, 11). The classification and identification of groups of observations may be performed using clustering techniques and discriminant analysis (12, 13, 14), while confounded inter-relationships among large sets of variables may be examined using factor analysis (15, 16). Overall relations in sets of variables, regardless of the nature of the variables within the set (continuous, discrete or mixtures), may be tested using canonical correlation analysis (17). Additionally, when interest in a set of several dependent variables relates to more than two classes, the analysis may be performed using multivariate analysis of variance techniques (15).
Frequently, it is not possible to satisfy the requirements and conditions inherent in the use of the parametric techniques listed above. Under such conditions, there is the option of using distribution-free (non-parametric) techniques (18, 20). The probability of rejecting a null hypothesis when in fact the alternative hypothesis is true (the power of the test) is generally smaller for non-parametric than for parametric procedures. However, under a given set of circumstances, they may be the only choice. On the other hand, the power of non-parametric tests, when properly used, is satisfactory under the general conditions prevailing in most practical situations. The level of generalization possible through non-parametric testing often compensates for the apparent, usually small, reduction in the power of the test.
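For illustration, the sketch below applies a parametric test and a distribution-free alternative to the same two illustrative groups using scipy.stats; which result is reported should depend on whether the parametric assumptions are judged tenable.

```python
# A minimal sketch comparing two groups with a parametric and a non-parametric
# test from scipy.stats; the data are illustrative only.

from scipy import stats

group_a = [11.2, 12.5, 13.0, 12.1, 11.8]
group_b = [9.8, 10.1, 8.9, 10.4, 9.5]

t_stat, t_p = stats.ttest_ind(group_a, group_b)       # Student's t-test
u_stat, u_p = stats.mannwhitneyu(group_a, group_b)    # distribution-free alternative

print(f"t-test: t = {t_stat:.2f}, p = {t_p:.4f}")
print(f"Mann-Whitney U: U = {u_stat:.1f}, p = {u_p:.4f}")
```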
A partial listing of some useful analytical procedures is presented in table 12.4. Appropriate descriptions of these methods and examples of their application can be found in the statistical texts cited in presenting this subject matter (6-20).
TABLE 12.4. Common Methods Used in Statistical Analysis

I. Parametric

Univariate                      | Multivariate
Student's t-test                | Multivariate Analysis of Variance (MANOVA)
Analysis of Variance (ANOVA)    | Multivariate Analysis of Covariance (MANCOVA)
Analysis of Covariance (ANCOVA) | Discriminant Function Analysis
Regression: Simple              | Factor Analysis
Regression: Multiple            | Path Analysis
Time Series Analysis            | Cluster Analysis (Numerical Taxonomy)
Correlation Analysis            | Canonical Correlation Analysis
                                | Multidimensional Scaling Analysis

II. Non-parametric

Binomial Test
Lilliefors Test of Normality
Kolmogorov-Smirnov Test
Randomization Test
Kruskal-Wallis Analysis of Variance
Friedman Analysis of Variance
Cochran Q Test
Concordance Tests
Lambda Test
Multicategorical Chi-square
Wilcoxon Tests
Fisher's Exact Probability Test
McNemar Test
Eta, the Correlation Ratio Test
Theil's Slope Coefficient Test
Spearman Rank Correlation
On the basis of the general outline of alternatives for data analysis described above, several steps are required for implementing the appropriate analytical procedures. First, the questions to be answered must be explicitly defined, to permit design of the specific analyses required to satisfy the objectives of the evaluation. The original statement of objectives and the preliminary definition of the analytical plan contained in the SOP provide a basis for the final choice of appropriate analytical procedures for answering the questions posed. This, in turn, establishes a sequence of events that relate to programming, data processing and statistical computation. This sequence, therefore, translates into an operation schedule (pathway) that is defined taking into account the most efficient utilization of available analytical facilities (hardware, software, systems analysts, programmers and operators). In the implementation of the operational schedule, the writing, debugging, testing and documenting of computer programmes may be required in the case of very specialized data.
At present, many well-tested statistical packages (software) such as SAS, SPSS, BMDP, and RUMMAGE, among others, are available for performing most of the statistical analyses mentioned in table 12.4. When these packages are used, the programming chores are minimal and relate primarily to variable specification, procedure definition and output selection. In addition, the use of these extensively tested programmes constitutes good insurance against common programming errors. In some cases, interfacing of standard statistical packages is possible, and this increases both the capability and efficiency of available software for widespread application of statistical analysis techniques.
The general guidelines described by the investigator and data processing personnel must translate into a sequence of events that, as indicated by Helms (21), can be summarized in flow-chart form as illustrated in figure 12.3 and outlined as follows:
1. State the questions to be answered and the general analyses to be performed in the SOP for the evaluation. Write down the scientific objectives of the analysis.
2. Plan the sequence of the steps required in programming, data processing and statistical computations. Draw up an "operational plan."
3. Schedule the performance of each step, including personnel assignment and definition of deadlines. Draw up an "operational schedule."
4. Begin work on the problem.
5. Write, debug, test, and document an "inclusion" computer subprogramme to assess results on the basis of specific criteria for including or excluding a case from the analysis.
6. Develop specifications (control cards) which define the variables to be used in the analysis. These specifications constitute the input required to operate the Master Update Programmes, which will copy the desired variables onto a "raw analysis file" while performing an update run.
7. Incorporate the "inclusion subprogramme" (from step 5) into the Master Update Programme. This subprogramme "tells" the Master Update Programme which cases should be copied into the "raw analysis file" ("inclusions") and which should not ("exclusions ").
8. Execute an update run of the Master Update Programme to produce the "raw analysis file" (steps 6 and 7 are preparatory: this step actually produces the file).
9. Check the raw analysis file for correct format, correct variables, and correct cases (inclusions/exclusions). If not correct, determine the cause of errors, correct the problem, and return to step 6 or 7, as indicated.
10. Duplicate the raw analysis file and store the copy in a secure place as backup.
11. Design, write, test, debug, and document all "transformation programmes" required to perform data transformations and produce a "transformed analysis file." This step may include programmes for linking data from two or more raw analysis files.
12. Set up and execute the transformation programmes (step 11) and produce the transformed analysis file. Check the file; if errors are found, determine their origin and make the required corrections (this could involve any of steps 5-11). If no errors are found, proceed to step 13.
13. Make a backup copy of the transformed analysis file and store in a secure place.
14. Perform computations for preliminary statistical analyses, using the "latest" analysis file. Typical calculations include statistics usually called "descriptive statistics": histograms, percentiles, means, medians, standard deviations, skewness, and other moments, cross-tabulations, scatter diagrams, correlations, regressions, etc.
15. Examine the output from step 14 for "outliers" and other indications of erroneous values. Trace such "outliers" to the original data and determine which can be identified as errors and which are correct (a minimal screening sketch is given after this list).
16. Write a summary of the subject-matter results of the preliminary analysis.
17. Re-examine the scientific objectives document and the operational plan (steps 1,2). If changes are made, return to step 1. Some steps may not need to be repeated; this will be indicated in the new operation plan.
18. Design, write, debug, test, and document statistical computation programmes required for the statistical analyses.
NOTE: This step may be a long, involved process, not just another step in the procedure. Whenever this step is required, other personnel are usually assigned to it and the work proceeds concurrently with steps 4-16.
19. Perform the statistical computations required for the desired analyses.
20. Analyse the output created in step 19 and write the preliminary conclusions.
NOTE: Typically, a number of different analyses will be required in addition to the preliminary analyses performed in steps 14-16. One analysis or set of computations frequently generates ideas for performing other analyses, which is all a part of the art of statistical analysis.
21. Determine whether additional calculations are needed; if so, return to the appropriate earlier step.
This decision usually involves personnel outside the coordinating center: project officer, participating physicians, etc.
22. When no further calculations are needed, write up the results for distribution or publication.
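As a companion to steps 14 and 15, the following minimal sketch flags values lying far from the bulk of an illustrative distribution so that they can be traced back to the original records; the variable and the two-standard-deviation cut-off are assumptions made for the example.

```python
# A minimal sketch of outlier screening: flag values more than two standard
# deviations from the mean (the cut-off is an assumption) so they can be traced
# back to the original data and confirmed or corrected.

import statistics

weights_kg = [11.2, 9.8, 12.5, 10.1, 13.0, 8.9, 45.0]   # 45.0 is a likely recording error

mean = statistics.mean(weights_kg)
sd = statistics.stdev(weights_kg)
outliers = [x for x in weights_kg if abs(x - mean) > 2 * sd]
print(outliers)   # -> [45.0]
```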
As has been suggested previously by Guzman (22), and as stated repeatedly in this chapter, the processing and analysis of data must be a continuous undertaking. Procedures should be carefully defined in the study standard operating protocol (SOP), which defines activities from the day the field operations start and concludes only when all reports have been completed. It is not easy to describe in detail the handling of data without reference to a specific study, the corresponding research design and the particular set of circumstances. Accordingly, in this chapter we have presented the sequence of events and described in general terms some of the basic procedures that, through experience, have been found to be essential components of a successful data management system. With proper adjustment, the illustrative examples might serve as a guideline for effective data recording and processing procedures in a specific study. In a recent book, Casley and Lury (23) describe additional examples of procedures and present a more extensive treatment of this subject.