App 4.1: Modelling enrichment data
App 4.2: The effect of auto-correlated errors on linear regression
Contributor: Mike Franklin
This Appendix expands on some of the points raised in Chapter 5 and may be helpful in providing a different perspective on the issues.
In order to model the enrichment data it is usual to assume that each observed value arose from some 'true' value plus an error component, that the true component can be modelled, and that the error component is random with mean value zero. The methods differ in the assumptions they make about the model and about the errors. The target, however, is the same in all the methods, namely to provide an estimate of:
log d1 - log d2
where d1 = d(t1) and d2 = d(t2).
The two-point method is a useful starting point, for it can be justified in (at least) two ways:
a) make no assumptions about the model or the errors; simply take observations at t1 and t2 and use these as estimates of the true values of d(t1) and d(t2);
b) plot the two values log d(t1), log d(t2) and assume the log enrichment curve is the straight line passing through these points, with no error.
Clearly the strength of the method is that it does not require assumptions about a model. Its weaknesses are its failure to use known information about the model and its failure to recognise the presence of errors.
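As a concrete illustration, the two-point estimate of the rate constant can be computed directly from the two observed enrichments. The sketch below uses natural logarithms and a function name of our own choosing:

```python
import math

def two_point_k(d1, d2, t1, t2):
    """Two-point estimate of the rate constant k.

    k = (log d(t1) - log d(t2)) / (t2 - t1), using natural logarithms
    and assuming a log-linear decline between the two sampling days.
    """
    return (math.log(d1) - math.log(d2)) / (t2 - t1)
```

For example, enrichments of 100 on day 0 and 100 exp(-1.44) on day 12 give k = 0.12 per day.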
In general the decline of the log enrichment curve seen in multi-point data is not perfectly linear, but in some special cases it may be. For example, if all the Lifson and McClintock assumptions hold perfectly then the decline is linear, and in practice there are very many cases where the decline is nearly linear even though all the assumptions are unlikely to have held throughout the period. In these cases it may be desirable to assume a linear model. If the enrichment curve is replaced by a straight line, then the slope of this line is equal to the relative flow rate or rate constant k, the value log d(t1) - log d(t2) is replaced by k(t1 - t2), and the observed values log d(t1), log d(t2) are replaced by the values predicted from the straight line. Indeed, whenever we try to model the enrichment curve one of the objectives is to obtain estimates of log d(t1) and log d(t2), and the quality of these estimates clearly depends on the quality of the model.
When the log enrichment curve departs from linearity, the use of the linear regression slope to estimate the rate constant k is a fairly robust procedure provided the data points are spread symmetrically about the mean sampling time. Under the same conditions, however, the use of the intercept to derive the pool size may lead to substantial errors. The problem then arises of how to use the observations, and the curve drawn through them, to estimate the initial enrichment. The answer will normally be best obtained through knowledge of the methodology used and intelligent study of the data. One solution, considered appropriate for data produced under conditions similar to those used at Cambridge, is presented in Section 5.7; essentially it involves adjusting a reliable observation taken near time zero. This type of trade-off between early observations and the value predicted by the curve is likely to be a widely applicable procedure.
In calculating the flux by either the two-point or the multi-point method it is common to assume that the errors are independent (ie that the errors on one day are unrelated to those on the next). This is rarely the case: a sudden large water intake, say, may influence the measured concentrations for several successive days. Plotting residuals from many data sets shows a tendency for residuals on neighbouring days to be correlated. This type of correlation is known as 'auto-correlation'.
Auto-correlation is important when data points are close together but less so when they are far apart. Thus the multi-point method is more affected than the two-point method, and the effect of auto-correlation is to reduce the advantage of the multi-point method. To illustrate this effect we present, in Table App 4.1, the relative variances that would be obtained from a study lasting twelve days (ie 13 assessments) if measurements were obtained daily, at 2-, 3- or 6-day intervals, or at the beginning and end only. The variances are given for various degrees of auto-correlation. (The level of correlation will be affected by the nature of the investigation and the experimental procedure.)
Perhaps the best illustration of the effect of serious auto-correlation is seen for p = 0.75, where using 13 points is of little advantage over using two and probably inferior to using three. The reason for this apparent reversal is that standard linear regression procedures are inefficient when there is a high degree of auto-correlation. Because auto-correlation between errors may arise naturally or may arise from attempting to fit an incorrect model, we recommend that if there are ten or more equally spaced data points the auto-correlation coefficient of the residuals is obtained. A high absolute value may help to identify wrongly specified models.
Table App 4.1. The relative variance of the estimated slope (k) and the mean value (y) using standard linear regression when the data exhibit selected levels of auto-correlation between successive days
A. Variance (k)

Correlation        Number of sampling days
                 13       7       5       3       2
p = 0.00      .0055   .0089   .0111   .0139   .0139
p = 0.25      .0080   .0096   .0112   .0139   .0139
p = 0.50      .0116   .0116   .0121   .0139   .0139
p = 0.75      .0150   .0140   .0135   .0134   .0134
p = 1.00      .0000   .0000   .0000   .0000   .0000

B. Variance (y)

Correlation        Number of sampling days
                 13       7       5       3       2
p = 0.00       .077    .143    .200    .333    .500
p = 0.25       .123    .159    .205    .333    .500
p = 0.50       .207    .220    .244    .340    .500
p = 0.75       .400    .392    .392    .419    .516
p = 1.00      1.000   1.000   1.000   1.000   1.000
Notes on Table App 4.1
The columns denote the number of (equally spaced) sampling occasions with the first occasion on day 0 and the last on day 12.
The values were generated on the assumption that the errors obey a first order auto-regressive model.
The variance of the intercept a at t = 0 is V(a) = V(y) + 36V(k), and the covariance of a and k is 6V(k). These can be used to calculate the relative variances of the estimated pool size and flux rate for selected values of a and k.
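As a worked instance of the note's formula, the factor 36 is the square of the mean sampling time (6 days for a 12-day study). A sketch of the arithmetic, with values taken from the table:

```python
def intercept_variance(var_y, var_k, t_bar=6.0):
    """Relative variance of the intercept a at t = 0.

    V(a) = V(y) + t_bar**2 * V(k); t_bar = 6 days for a 12-day study,
    giving the factor 36 in the notes to Table App 4.1.
    """
    return var_y + t_bar ** 2 * var_k
```

With daily sampling and no auto-correlation this gives V(a) = .077 + 36 x .0055 = .275, while at p = 1 the slope term vanishes and V(a) equals V(y) = 1.000.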
Observe that at very high levels of auto-correlation (towards p = 1) the slopes are precisely estimated but the intercepts are not. Also note that for p = 0.75 the multi-point method may be inferior to the two-point method (because the wrong model is being fitted).