INTRODUCTION
Any surveillance system is concerned with the quality of the data collected, including the degree of ascertainment of affected individuals [Reference Nanan and White1]. A conventional surveillance system is notification, possibly containing false-positive cases and often incomplete for true-positive cases, as described for Legionnaires' disease (LD) [Reference Infuso, Hubert and Etienne2, Reference Nardone3].
LD is a serious, possibly fatal, pneumonia caused by Legionella spp., occurring in sporadic cases and outbreaks [Reference Den Boer4, Reference Lettinga5]. Under the present legislation regarding infectious diseases in The Netherlands, LD is placed in category B. This group of infectious diseases has to be notified within 24 h to the Municipal Public Health Service by the diagnosing physician. The Municipal Public Health Service forwards this information to the Register of Notifiable Infectious Diseases at the Office of the Health Care Inspectorate where national data are aggregated for analysis, monitoring, public health intervention or policy making. Since 1999 an average of 230 LD patients were notified in The Netherlands annually. The average national annual incidence rate was 1·4 LD patients per 100 000 inhabitants, almost three times higher than the average annual incidence rate in the United States and the United Kingdom [Reference Ricketts and Joseph6, Reference Jajosky7]. However, the incidence rate based on notifications varies considerably per province [Reference Den Boer, Friesema and Hooi8]. Under-diagnosis and under-notification are likely. This can obscure the true burden of LD, hamper the detection of clusters of LD patients and hinder good investigations into the possible source of legionella infections. The Dutch Health Council estimated an annual number of 800 LD patients. This number is based on the annual number of cases of pneumonia in The Netherlands (110 000) of whom 15% require hospital admission (16 000) of which 5% is caused by Legionella spp. (800) [9].
Record-linkage is important for assessing the quality and completeness of infectious disease registers, i.e. comparing patient data across multiple registers [Reference Migliori10]. Completeness of notification can be assessed by comparison with case-ascertainment, i.e. the total number of patients observed in at least one register, or the estimated total number of patients obtained by capture–recapture analysis. The total number of individuals present in one or more registrations does not necessarily reflect a reliable approximation of the true number of cases. The purpose of capture–recapture analysis is to assess the number of cases that are not registered. In an article published in 1972, Stephen Fienberg demonstrated how this number of unobserved cases could be estimated, using log-linear analysis [Reference Fienberg11]. For capture–recapture analysis, according to Fienberg, the availability of data from at least three different, possibly incomplete, partially overlapping and preferably, but not necessarily, independent sources is needed [Reference Bishop, Fienberg and Holland12–Reference Hook and Regal16]. The data can be put in a 2×2×2 contingency table, indicating the absence or presence of a case in each of the registers. This table has one empty cell, corresponding to the number of cases never registered. Based on certain assumptions, which will be discussed later, capture–recapture analysis aims at obtaining an estimate of the unregistered number of patients in the empty cell from the available data in the other cells. This estimate can be found under the best fitting and most parsimonious log-linear model, as explained later. Finally, the total number of individuals is the number of registered cases plus the estimated number of non-registered patients. 
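The empty-cell idea can be illustrated with a minimal sketch. It uses the closed-form estimate that applies under the three-source log-linear model containing all pairwise interactions and no three-way interaction; more parsimonious models need iterative fitting instead. The cell counts and the function name are hypothetical, chosen only to show the mechanics, and are not data from this study.

```python
def saturated_missing_cell(n):
    """Estimate the unobserved cell of a 2x2x2 capture-recapture table.

    `n` maps (a, b, c) presence flags (1 = registered, 0 = not registered)
    to counts for the seven observable cells; (0, 0, 0) is the empty cell.
    Assumes all two-way interactions and no three-way interaction.
    """
    numerator = n[(1, 1, 1)] * n[(1, 0, 0)] * n[(0, 1, 0)] * n[(0, 0, 1)]
    denominator = n[(1, 1, 0)] * n[(1, 0, 1)] * n[(0, 1, 1)]
    if denominator == 0:
        raise ValueError("estimate undefined when a two-register cell is zero")
    return numerator / denominator


# Hypothetical counts, for illustration only (not taken from the study).
cells = {
    (1, 1, 1): 40, (1, 1, 0): 20, (1, 0, 1): 30, (0, 1, 1): 10,
    (1, 0, 0): 15, (0, 1, 0): 5, (0, 0, 1): 60,
}
observed = sum(cells.values())           # cases seen in at least one register
missing = saturated_missing_cell(cells)  # estimated never-registered cases
total = observed + missing               # estimated total number of cases
print(f"observed={observed}, unregistered={missing:.1f}, total={total:.1f}")
```

The estimated total is always the observed count plus the estimated empty cell, so any error in the assumed interaction structure feeds directly into the total.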
Capture–recapture methods have been used to estimate the total number of patients with LD and other infectious diseases [Reference Infuso, Hubert and Etienne2, Reference Nardone3, Reference Van Hest, Smit and Verhave17].
The validity of capture–recapture analysis depends on the extent to which the underlying assumptions are violated, and one focus is to establish which method is most appropriate for specific datasets [Reference Chao15]. Usually, log-linear modelling of data from at least three linked registers is the preferred capture–recapture method because it can reduce bias due to inter-dependencies between two registers [13, Reference Van Hest, Smit and Verhave17]. Stratified capture–recapture analysis according to categorical covariates associated with the probability of capture in a register can further reduce bias [Reference Fienberg11, Reference Bishop, Fienberg and Holland12, 14, Reference Hook and Regal16]. An alternative is to include these covariates, e.g. demographic, diagnostic or prognostic variables, in a log-linear covariate capture–recapture model, but these models have rarely been used to estimate human disease incidence [Reference Tilling and Sterne18, Reference Tilling, Sterne and Wolfe19].
This study aims to estimate incidence and completeness of notification of LD in The Netherlands in 2000 and 2001 using record-linkage of three data sources and capture–recapture analysis.
MATERIALS AND METHODS
Data sources and patient identifiers
Three LD data sources were used:
(1) Notification. Patients notified by their physician to the Health Care Inspectorate. A uniform questionnaire collected additional information from local Public Health Services processing the notifications.
(2) Laboratory. Patients with a specified positive laboratory test result reported by the clinical microbiologists in a survey among all clinical microbiology laboratories after obtaining permission for this survey from the Dutch Society for Microbiology and supported by the Inspector-General for Infectious Diseases of the Health Care Inspectorate. Positive laboratory test results were classified as either confirmed (culture, urine antigen test or a fourfold rise in antibody titre [⩾128 IU] against Legionella spp. in paired acute and convalescent serum samples) or probable [PCR, a high titre (⩾256 IU) against Legionella spp. in one serum sample or direct fluorescent antibody staining], according to the European Working Group for Legionella Infections (EWGLI) definitions. Patients with LD only known to the Hospital register were classified as cases with unknown laboratory verification.
(3) Hospital. Hospitalized patients recorded in the National Morbidity Registration by Prismant, covering all hospitals in The Netherlands with:
(a) an International Classification of Diseases, 9th revision (ICD-9) code for any form of pneumonia (ICD-9 codes 480.0–487.0) for individuals known to Notification and/or Laboratory.
(b) An ICD-9 code 482.8 for individuals only known to Hospital.
ICD-9 has no specific code for LD and, as reported from other countries, in The Netherlands ICD-9 code 482.8 (pneumonia due to other specified bacteria) is used for LD patients [Reference Slobbe20]. Hospital records coded as 482.8 can therefore include false-positive cases, mainly patients with Escherichia coli pneumonia, a rare nosocomial disease, predominantly occurring among intensive-care patients. Data on the annual number of E. coli pneumonia patients in The Netherlands are not available. Based upon an estimated annual number of 60 000 intensive-care admissions and an estimated E. coli pneumonia incidence of 1/1000 intensive-care admissions (derived from a random survey among intensive-care consultants in The Netherlands), the estimated annual number of E. coli pneumonia patients is 60. This number is used to correct the number of patients only known to Hospital. Because only the proxy code 482.8 is available, uniform questionnaires were also sent to all chest physicians, asking them to report hospitalized LD patients in 2000 and 2001, for cross-validation and collection of additional information.
For all patients in each register we attempted to collect date of birth, postal code or town of residence, sex, and the date of notification (and first day of illness), of the first laboratory sample or of hospital admission, as personal identifiers to be used in all record-linkage procedures. Duplicate entries in each register were deleted.
Case-definition and study period
LD patients are defined as all ascertained (notified, laboratory-reported or hospitalized) and un-ascertained LD patients. Notified LD patients with a first day of illness in 2000 or 2001 were included in the study. For inclusion of patients known to Laboratory and/or Hospital, the laboratory sample date, the hospital admission date or the earlier of the two dates was used as a proxy for the first day of illness. By examining the registers 1 month before and after the study period, all registers were corrected for late notification or laboratory results, as described previously [Reference Van Hest, Smit and Verhave17].
Record-linkage and stratification
Record-linkage was performed manually using the patient identifiers, proximity of dates and geographical information found in the three registers. In case of doubt consensus was sought between two investigators. Because of expected geographical differences in incidence of LD, after record-linkage, on the basis of the provinces of The Netherlands, ascertained LD patients were stratified into four regions: North (1 671 534 inhabitants), East (4 467 527 inhabitants), West (5 955 299 inhabitants) and South (3 892 715 inhabitants) (Fig.). Correction for the estimated number of E. coli pneumonia patients in the different regions was proportional to the regional division of the total number of patients only ascertained in Hospital.
Coverage rates and capture–recapture analysis
The ascertained register-specific coverage rate is defined as the number of LD patients in each register divided by the case-ascertainment, expressed as a percentage. The total number of un-ascertained LD patients was estimated on the basis of the distribution of the ascertained cases over the three registers. For internal validity analysis we used two-source capture–recapture analysis, as explained elsewhere [Reference Hook and Regal21]. Briefly, in two-source capture–recapture analysis the estimated total number of cases, $N_{\mathrm{est}}$, equals the number of cases on register A, $N_A$, times the number of cases on register B, $N_B$, divided by the overlap of the two registers, $N_{\mathrm{both}}$ ($N_{\mathrm{est}} = N_A \times N_B / N_{\mathrm{both}}$, also known as the Petersen estimator). Approximately unbiased estimates of $N_{\mathrm{est}}$ are expected when the registers are large. To correct for bias caused by small registers Chapman proposed the Nearly Unbiased Estimator, which can be expressed as $N_{\mathrm{est}} = [(N_A + 1) \times (N_B + 1) / (N_{\mathrm{both}} + 1)] - 1$ [13, Reference Chapman22, Reference Wittes23].
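The two-source estimators described above can be sketched in a few lines. The function names are our own. The example counts are the Notification and Laboratory totals, and their combined total, as reported in the Results of this article. The overlap is derived here by inclusion-exclusion, which is an assumption of this sketch rather than a step described by the authors.

```python
def petersen(n_a, n_b, n_both):
    """Petersen estimator: N_A * N_B / N_both."""
    if n_both == 0:
        raise ValueError("Petersen estimator is undefined without overlap")
    return n_a * n_b / n_both


def chapman(n_a, n_b, n_both):
    """Chapman's Nearly Unbiased Estimator for small registers."""
    return (n_a + 1) * (n_b + 1) / (n_both + 1) - 1


# Counts reported in the Results: 373 notified, 261 laboratory-reported,
# 448 patients known to Notification and/or Laboratory.
n_notification, n_laboratory, n_either = 373, 261, 448
n_both = n_notification + n_laboratory - n_either  # overlap of the two registers

print(n_both)
print(round(petersen(n_notification, n_laboratory, n_both), 1))
print(round(chapman(n_notification, n_laboratory, n_both), 1))
```

Chapman's form also stays finite when the overlap is zero, which is one practical reason to prefer it for small registers.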
The independence of registers and other assumptions underlying capture–recapture analysis were described previously [Reference Van Hest, Smit and Verhave17]. Specific interdependencies between the three registers, causing bias in two-source capture–recapture estimates, are probable. Using SPSS statistical software (version 13.0; SPSS Inc., Chicago, IL, USA), conventional total and stratified three-source log-linear capture–recapture analysis was employed, taking possible interdependencies and heterogeneity into account, as previously described [Reference Van Hest, Smit and Verhave17]. As an alternative to capture–recapture analysis stratified by region, a log-linear covariate capture–recapture model with one covariate, region, was specified [Reference Tilling and Sterne18, Reference Tilling, Sterne and Wolfe19, Reference Hope, Hickman and Tilling24]. Other covariates considered will be discussed later. The best-fitting models were identified using the likelihood ratio test ($G^2$). The null hypothesis in the likelihood-ratio goodness-of-fit test is that the specified model holds and the alternative is that it does not hold. If the null hypothesis is not rejected (e.g. P>0·05), there is no evidence that the specified model is in disagreement with the data. The lower the value of $G^2$, the better the fit of the model. In the log-linear estimation procedure model selection follows model fitting, i.e. the aim is to identify the models that are clearly wrong and to select the most appropriate from a number of acceptable models. For model selection we used Akaike's Information Criterion (AIC), which can be expressed as $\mathrm{AIC} = G^2 - 2\,(\mathrm{d.f.})$, where d.f. denotes the degrees of freedom [Reference Sakamoto, Ishiguru and Kitigawa25]. The first term, $G^2$, is a measure of how well the model fits the data and the second term, $2\,(\mathrm{d.f.})$, is a penalty for the addition of parameters (and hence model complexity).
A second information criterion used was the Bayesian Information Criterion (BIC), which can be expressed as $\mathrm{BIC} = G^2 - (\ln N_{\mathrm{obs}})\,(\mathrm{d.f.})$, where $N_{\mathrm{obs}}$ is the total number of observed individuals [Reference Agresti26]. Relative to the AIC, the BIC penalizes complex models more heavily. In general, in the log-linear capture–recapture estimation procedure the least complex, i.e. the least saturated (in other words the most parsimonious) model, whose fit appears adequate, is preferred [13]. Since the $G^2$ of the saturated model is zero and no degrees of freedom are left, its AIC and BIC are also zero; models with a negative AIC and BIC are therefore preferred, although this does not necessarily mean that the estimate is correct. The estimated register-specific coverage rate is defined as the number of LD patients in each register divided by the estimated total number of LD patients, expressed as a percentage.
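The two information criteria defined above can be written as small helper functions. In this minimal sketch the function names are our own. The goodness-of-fit values are those reported in the Results for the two covariate models. Using 780 (the total number of ascertained patients) as the number of observed individuals is an assumption made for illustration, since region was unknown for a few patients.

```python
import math


def aic(g2, df):
    """Akaike's Information Criterion in the form G^2 - 2 * d.f."""
    return g2 - 2 * df


def bic(g2, df, n_obs):
    """Bayesian Information Criterion in the form G^2 - ln(N_obs) * d.f."""
    return g2 - math.log(n_obs) * df


N_OBS = 780  # assumption: all ascertained patients counted as observed

# (G^2, d.f.) as reported in the Results for the two covariate models.
models = {
    "covariate model": (22.1, 9),
    "covariate model + single-cell parameter": (5.7, 8),
}
for name, (g2, df) in models.items():
    print(f"{name}: AIC={aic(g2, df):.1f}, BIC={bic(g2, df, N_OBS):.1f}")
```

Under both criteria lower values are better, and the saturated model (zero $G^2$, zero degrees of freedom) scores exactly zero, which is why negative values indicate a preferable model.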
RESULTS
Notification system
In the notification register from the Health Care Inspectorate 358 LD patients were recorded. An additional 15 patients were reported through the questionnaires from local public health services processing the notifications, giving a total of 373 notified LD patients.
Laboratory survey
Questionnaires were received from 36 out of the 48 laboratories (response rate 75%). Based on population estimates the cooperating laboratories served 81·2% of the Dutch population. A total of 261 patients with a positive test for Legionella spp. were reported. Of these patients 186 (71·3%) were notified. Additional information on laboratory diagnosis was available for another 127 patients through Public Health Service or chest physician questionnaires, bringing the total number of patients with known laboratory results to 388.
Hospital records
From 385 chest physicians in The Netherlands 179 replies were received (response rate 46%), the majority indicating that the requested information could not be retrieved or no LD patients were admitted. Chest physicians reported 44 LD patients, all of them also known to Notification and/or Laboratory.
Out of 448 LD patients in Notification and/or Laboratory, 331 (73·9%) could be linked to the National Morbidity Registration pneumonia records. Of the linked LD patients 79 (23·9%) were classified as either ‘pneumonia not specified’ (ICD-9 code 486, 63 cases), ‘pneumonia due to other specified organism’ (ICD-9 code 483, nine cases) or ‘pneumococcal pneumonia’ (ICD-9 code 481, seven cases). The remaining 252 linked patients (76·1%) had ICD-9 code 482.8, the assigned code for LD. Another 452 patients, unknown to Notification and/or Laboratory, were identified in Hospital with ICD-9 code 482.8. This number was adjusted to 332 LD patients after deduction of an estimated number of 120 E. coli pneumonia patients in the two years studied, also recorded under ICD-9 code 482.8.
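The correction of the Hospital register can be retraced with simple arithmetic. The sketch below only combines figures stated in the Methods and Results of this article; the variable names are our own.

```python
# Estimated E. coli pneumonia cases coded as 482.8 (see Methods).
icu_admissions_per_year = 60_000
ecoli_per_admission = 1 / 1000
study_years = 2
ecoli_cases = round(icu_admissions_per_year * ecoli_per_admission * study_years)

# Hospital-only records with ICD-9 code 482.8, before and after correction.
hospital_only_coded = 452
hospital_only_ld = hospital_only_coded - ecoli_cases

# Patients known to Notification and/or Laboratory, and those linked to Hospital.
known_not_or_lab = 448
linked_to_hospital = 331

hospital_total = linked_to_hospital + hospital_only_ld
case_ascertainment = known_not_or_lab + hospital_only_ld

print(ecoli_cases, hospital_only_ld, hospital_total, case_ascertainment)
```

The resulting Hospital total and case-ascertainment are the denominators and numerators used for the coverage rates.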
Epidemiological results
Table 1 shows the epidemiological characteristics of 447 LD patients in Notification and/or Laboratory (one patient had insufficient data). The mean age was 54 years (s.d.= 14 years). The recorded case-fatality rate was 5·6%. The mean duration between onset of disease and microbiological diagnosis was 12 days (median 6 days). The mean duration of hospital admission was 19 days (median 13 days).
* From 447 patients sufficient data was available for analysis; sometimes one or two variables are missing.
† Rest of Europe: Austria, Croatia, Cyprus, England, Hungary, Ireland, Luxembourg, Moldavia, Poland, Slovakia, Switzerland, Czech Republic, Yugoslavia; Americas: Netherlands Antilles, Brazil, Canada, Dominican Republic, Mexico, Peru, USA, Venezuela; Asia: China, Indonesia, Japan, Kazakhstan, Malaysia; Africa: Morocco and Tunis.
‡ Confirmed laboratory diagnosis: positive culture, positive urine antigen test or a fourfold rise in antibody titre against Legionella spp. in paired acute and convalescent serum samples, ⩾128 IU. Probable laboratory diagnosis: positive PCR, a high titre in one serum sample against Legionella spp., ⩾256 IU, or direct fluorescent antibody staining of the organism.
Table 2 shows the number and proportion per region of the different laboratory tests for Legionella spp. There are differences between the four Dutch regions in laboratory diagnostic approach. In region North no culture results were reported. In region West a low proportion of fourfold rise in antibody titre and PCR results were reported and more patients had unknown test results, probably the result of non-participation of some larger laboratories. In region South a high proportion of a fourfold rise in antibody titre and PCR results were reported, probably the result of a major reference laboratory in that region.
* For 441 patients information on region was known.
† Direct fluorescent antibody staining.
Case-ascertainment
Table 3 shows the distribution of the 780 ascertained LD patients over the three registrations after record-linkage, in total and stratified by region. The ascertained register-specific coverage rate of Notification, Laboratory and Hospital was 47·8% (373/780), 33·5% (261/780) and 85·0% (663/780) respectively. The ascertained under-notification was 52·2%. Table 4 shows the number of notified and ascertained LD patients, the average annual incidence rate by notification and by case-ascertainment and the proportion of the ascertained patients notified, in total and stratified per region. The average national annual incidence rate by notification was 1·15/100 000 and by case-ascertainment 2·42/100 000. The regional annual incidence rates differ, with a 100% difference between the highest and lowest regional incidence rate based on notification, reducing to 50% difference after record-linkage. Based upon the notification data the low incidence rate in region North partly results from under-notification but the notified and ascertained incidence rates in region South were higher than in the rest of The Netherlands (P<0·0001).
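The ascertained coverage rates follow directly from the register totals. The sketch below recomputes them from the counts given in the text; the helper name is our own.

```python
def coverage_pct(register_count, total_cases):
    """Register-specific coverage rate as a percentage, to one decimal."""
    return round(100 * register_count / total_cases, 1)


ASCERTAINED_TOTAL = 780
registers = {"Notification": 373, "Laboratory": 261, "Hospital": 663}

coverage = {name: coverage_pct(n, ASCERTAINED_TOTAL) for name, n in registers.items()}
under_notification = round(100 - 100 * registers["Notification"] / ASCERTAINED_TOTAL, 1)

for name, pct in coverage.items():
    print(f"{name}: {pct}%")
print(f"Ascertained under-notification: {under_notification}%")
```

The coverage rates sum to more than 100% because many patients appear in more than one register.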
* NOT, Notification register (373 patients).
† LAB, Laboratory register (261 patients).
‡ HOSP, Hospital admission register. The proportional correction for the Escherichia coli pneumonia patients in regions North, East, West and South is 13, 22, 49 and 36 patients respectively (663 patients).
§ For six LD patients the place of residence was unknown.
* The information on region was missing for four LD patients.
Capture–recapture analysis
Internal validity analysis by two-source capture–recapture analysis on Notification and Hospital and on Laboratory and Hospital both estimated 865 LD patients through Chapman's Nearly Unbiased Estimator. The considerably lower capture–recapture estimate obtained with Notification and Laboratory (523 LD patients) indicates a larger positive association between this pair than between the other pairs, resulting in an estimate more biased downwards.
The best-fitting three-source log-linear capture–recapture model was the saturated model, i.e. the model including all two-variable associations and assuming absent three-way interaction, which yielded an estimate of 1253 LD patients [95% confidence interval (CI) 1019–1715]. Estimated under-notification was 70·2%. To acknowledge the geographical differences capture–recapture analysis stratified by region was performed. For all regions apart from region East a more parsimonious model, containing only one two-way interaction (between Notification and Laboratory), was selected as best-fitting model, with totals of 78, 327 and 277 LD patients and incidence rates of 2·33, 2·75 and 3·56/100 000 inhabitants for regions North, West and South respectively. For region East a saturated model was selected that estimated an unexpectedly high number of 650 LD patients with a wide 95% CI of 283–2382 patients.
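The regional incidence rates can be recomputed from the stratified estimates and the regional population sizes given in the Methods. This is a sketch with our own function name. It assumes the rate is the two-year estimate divided by two and by the regional population, which reproduces the figures quoted for regions North, West and South.

```python
POPULATION = {
    "North": 1_671_534,
    "East": 4_467_527,
    "West": 5_955_299,
    "South": 3_892_715,
}

# Stratified capture-recapture estimates for the two study years combined.
ESTIMATED_CASES = {"North": 78, "West": 327, "South": 277}


def annual_incidence_per_100k(cases, population, years=2):
    """Average annual incidence rate per 100 000 inhabitants."""
    return cases / years / population * 100_000


rates = {
    region: round(annual_incidence_per_100k(n, POPULATION[region]), 2)
    for region, n in ESTIMATED_CASES.items()
}
print(rates)
```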
As an alternative to the stratified capture–recapture analysis we specified a log-linear covariate (region) capture–recapture model. The covariate model that served as a starting point contained, apart from the main effects for region and the three registers, the Region-Notification, Region-Laboratory, Region-Hospital, Notification-Laboratory, Notification-Hospital, Laboratory-Hospital two-variable terms. In this model we allow for regional differences in the number of cases in the three registers, but not for interaction with other effects per stratum, as the association between the registers is assumed equal across regions. This model fits the data reasonably well ($G^2$=22·1, d.f.=9, P=0·009) and estimates 932 LD patients with a narrower 95% CI of 851–1106, reducing statistical uncertainty. Inspection of the misfit for individual cells showed a large adjusted residual for LD patients only known to Laboratory in region East. After including a separate parameter for this single cell we obtain a good fitting model ($G^2$=5·7, d.f.=8, P=0·686). The estimated number of LD patients is 886 (95% CI 827–1022), similar to the two internal validity estimates with least assumed interdependence.
The estimated register-specific coverage rate of Notification, Laboratory and Hospital was 42·1% (373/886), 29·5% (261/886) and 74·8% (663/886) respectively. The estimated under-notification was 57·9%. The estimated average annual incidence rate of LD was 2·77/100 000.
A sensitivity analysis, assuming double or half the number of false-positive cases due to E. coli pneumonia only known to Hospital, estimated the number of LD patients to range between 727 (95% CI 689–813) and 966 (95% CI 896–1126).
DISCUSSION
After record-linkage and log-linear covariate capture–recapture analysis of three registers of LD in 2000 and 2001 in The Netherlands we found a notified, ascertained and estimated annual incidence rate of 1·15, 2·42 and 2·77 cases/100 000 inhabitants respectively. Ascertained and estimated under-notification was 52·2% and 57·9% respectively. This indicates the need for more consistent notification, e.g. through treatment of LD by a limited group of clinicians, familiar with notification. The southern part of The Netherlands had a higher notified, ascertained and estimated incidence rate of LD.
Legionella pneumonia might be responsible for 0–14% of all nosocomial pneumonias and for 2–16% of all community-acquired pneumonias [Reference Kool27]. In The Netherlands legionella pneumonia is reportedly responsible for 7% of all nosocomial pneumonias and 2–8% of all community-acquired pneumonias in hospitalized patients [Reference Bohte, Van Furth and Van Den Broek28–Reference Van der Eerden30]. Under-notification of LD is estimated at 67% in France, 90% in England and 95% in the United States [Reference Nardone3, Reference Joseph31–Reference Marston33]. At 57·9% we estimated a lower under-notification in The Netherlands, possibly influenced by increased awareness after a major outbreak or increased use of the urine antigen test (although this use is proportionally still low compared to the average EWGLI data for Europe) [Reference Den Boer4, Reference Joseph31]. Among patients in the laboratory survey with positive legionella results under-notification was 28·7%, much lower than reported in France [Reference Infuso, Hubert and Etienne2]. Parallel to mandatory notification by clinicians, many Dutch laboratories report positive results voluntarily to the public health services, which reduces under-notification of LD and other infectious diseases. The ascertained and estimated register-specific coverage rates for the laboratories would be higher with a better response. Record-linkage improved completeness of information in the linked dataset but, unlike laboratories, clinicians are not a useful source of additional information.
Several assumptions must be met for valid results of three-source log-linear capture–recapture models and limitations of capture–recapture analysis are described by others [13, Reference Hook and Regal16, Reference Desenclos and Hubert34–Reference Tilling39]. Violation of the closed population assumption is assumed limited for LD as opportunities for notification, laboratory verification or hospitalization are largely determined within a short period of time, but could result in overestimation of the number of patients. Due to lack of a unique patient identification number used in all registrations and incomplete information on personal identifiers in some records, imperfect record-linkage cannot be excluded but balanced misclassification can still result in unbiased numbers in each category. Limitations of capture–recapture studies due to lack of a uniform and unambiguous case-definition and variable specificity of registers are described elsewhere [Reference Papoz, Balkau and Lellouch36, Reference Borgdorff, Glynn and Vynnycky40]. The notification criteria in The Netherlands require a clinical diagnosis of pneumonia and a confirmed or probable laboratory diagnosis. However, for 187 (50·1%) notified patients and 463 (69·8%) hospitalized patients no laboratory verification was found, although some of these patients could have been microbiologically diagnosed in a non-participating laboratory or abroad or, due to imperfect record-linkage, could not be linked to Laboratory. Likewise Laboratory may contain cases without pneumonia and cases diagnosed on a single high antibody titre, a test with a low positive predictive value [Reference Nardone3, Reference Braun29]. The 79 linked patients in Hospital with a pneumonia ICD-9 code other than 482.8 are probably miscoded but some could be false-positive cases.
Violation of the assumption of a perfect positive predictive value of hospital episode registers is always a reason for concern in capture–recapture studies on infectious diseases and should be addressed critically, even when specific disease codes are used, e.g. for tuberculosis in ICD-9 [Reference Tocque41–Reference De Greeff44]. We have corrected for imperfect positive predictive value for Hospital. Possible bias as a result of correction for other hospitalized patients with ICD-9 code 482.8 is reflected in the confidence intervals of the sensitivity analysis. Conventional log-linear capture–recapture analysis for The Netherlands and region East selected the saturated model, with an unexpectedly high estimate in region East. When saturated capture–recapture models are selected by any criterion investigators should be particularly cautious about the associated outcomes [Reference Hook and Regal16, Reference De Greeff44–Reference Cormack, Chang and Smith46]. We selected the three-source covariate capture–recapture model with equal two-way interactions across the regions as the best-fitting model. Internal validity analysis and analyses stratified by region indicate dependence between Notification and Laboratory as the dominant interaction. Positive three-way interaction across sources, causing underestimation of the number of LD patients, cannot be incorporated in the selected model but is arguably limited. Regional heterogeneity in the probability of being captured in the different registers was expected and observed [Reference Nardone3, Reference Den Boer, Friesema and Hooi8]. Covariate capture–recapture models have been used only rarely to estimate disease incidence but appear to reduce bias due to heterogeneity and result in plausible estimates of the total number of cases, e.g. in simulations [Reference Tilling and Sterne18, Reference Tilling, Sterne and Wolfe19].
Inclusion of covariates other than region in the model, such as age or method of laboratory diagnosis, could have further reduced bias. In France, apart from region, method of diagnosis was identified as a variable with heterogeneity of capture [Reference Nardone3]. However, proportional correction for E. coli pneumonia patients in Hospital, as performed for the regional stratification, was not feasible. Bias due to exclusion of these and unobserved possibly relevant covariates from the model cannot be excluded.
Different characteristics of diseases, the patients and their registers can introduce various degrees of register interdependence and population heterogeneity into capture–recapture analysis, influencing model preference. This study shows that in The Netherlands for LD there is considerable interdependence between Notification and Laboratory and confirms geographical heterogeneity. Log-linear covariate capture–recapture analysis with region as covariate appears to reduce bias in the estimated number of LD patients. To our knowledge this is the first covariate capture–recapture study performed for infectious disease surveillance. Further research is needed into the causes of the geographical differences of LD incidence rates.
ACKNOWLEDGEMENTS
We thank Dr Carol Joseph for reviewing an earlier version of the manuscript. Permission for this study was obtained from the Medical Ethics Committee of the Erasmus MC, University Medical Centre Rotterdam, Rotterdam, The Netherlands, and the data protection committees of the Legionnaires' disease registrations.
DECLARATION OF INTEREST
None.