J Gen Intern Med. 2012 Jun;27(Suppl 1):S67-75. doi: 10.1007/s11606-012-2031-7.

Chapter 9: options for summarizing medical test performance in the absence of a "gold standard"


Thomas A Trikalinos et al. J Gen Intern Med. 2012 Jun.

Abstract

The classical paradigm for evaluating test performance compares the results of an index test with a reference test. When the reference test does not mirror the "truth" adequately well (e.g., when it is an "imperfect" reference standard), the typical ("naïve") estimates of sensitivity and specificity are biased. One has at least four options when performing a systematic review of test performance when the reference standard is "imperfect": (a) to forgo the classical paradigm and assess the index test's ability to predict patient-relevant outcomes instead of test accuracy (i.e., treat the index test as a predictive instrument); (b) to assess whether the results of the two tests (index and reference) agree or disagree (i.e., treat them as two alternative measurement methods); (c) to calculate "naïve" estimates of the index test's sensitivity and specificity from each study included in the review and discuss the direction in which they are biased; or (d) to mathematically adjust the "naïve" estimates of sensitivity and specificity of the index test to account for the imperfect reference standard. We discuss these options and illustrate some of them through examples.
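
As a sketch of option (d): under the common simplifying assumption that the index and reference tests are independent conditional on disease status, the "naïve" estimates relate to the true quantities through (our notation; the chapter discusses when such adjustments are defensible)

\[
\mathrm{Se}_{\text{naïve}} = \frac{p\,\mathrm{Se}_{\text{index}}\mathrm{Se}_{\text{ref}} + (1-p)(1-\mathrm{Sp}_{\text{index}})(1-\mathrm{Sp}_{\text{ref}})}{p\,\mathrm{Se}_{\text{ref}} + (1-p)(1-\mathrm{Sp}_{\text{ref}})},
\qquad
\mathrm{Sp}_{\text{naïve}} = \frac{(1-p)\,\mathrm{Sp}_{\text{index}}\mathrm{Sp}_{\text{ref}} + p(1-\mathrm{Se}_{\text{index}})(1-\mathrm{Se}_{\text{ref}})}{(1-p)\,\mathrm{Sp}_{\text{ref}} + p(1-\mathrm{Se}_{\text{ref}})},
\]

where p is the disease prevalence. Solving these two equations for Se_index and Sp_index, with Se_ref, Sp_ref, and p supplied from external knowledge, yields the adjusted estimates.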


Figures

Figure 1.
Correspondence of test results and true proportions in the 2 × 2 table. The cells of the 2 × 2 table, α, β, γ, δ, are the true population proportions corresponding to the combinations of test results. The diagram depicts how these proportions break down according to the (unknown) true status of the condition of interest. For example, the proportion in which both the index test and the reference standard are positive is α = α1 + α2, i.e., the sum of the proportion of positive index and reference test results when the condition is present (α1) and when it is absent (α2); the other cells decompose similarly. A white box and the subscript 1 are used when the reference standard result matches the true status of the condition of interest; a grey box and the subscript 2 are used when it does not.
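
To make the caption's bookkeeping concrete, a worked reading (assuming the conventional cell layout, which the caption does not spell out: β = index-positive/reference-negative, γ = index-negative/reference-positive, δ = both negative): the diseased are exactly those with a correct positive reference result or an incorrect negative one, so

\[
p = \alpha_1 + \beta_2 + \gamma_1 + \delta_2,
\qquad
\mathrm{Se}^{\text{true}}_{\text{index}} = \frac{\alpha_1 + \beta_2}{p},
\qquad
\mathrm{Se}^{\text{naïve}}_{\text{index}} = \frac{\alpha}{\alpha + \gamma} = \frac{\alpha_1 + \alpha_2}{\alpha_1 + \alpha_2 + \gamma_1 + \gamma_2},
\]

and the two sensitivities coincide only when all subscript-2 proportions vanish, i.e., when the reference standard never misclassifies; specificity decomposes analogously.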
Figure 2.
“Naïve” estimates versus true values for the performance of the index test with an imperfect reference standard. Seindex and Spindex: sensitivity and specificity of the index test, respectively; Seref and Spref: sensitivity and specificity of the reference test, respectively; p: disease prevalence. If the results of the index and reference tests are independent conditional on disease status, the “naïve” estimates of the index test's performance are underestimates. The thin reference lines are the true sensitivity (solid) and specificity (dashed) of the index test. Note that the “naïve” estimates of the sensitivity and specificity of the index test approach the true values as the sensitivity and specificity of the reference test approach 100%. In the left panel, the “naïve” estimate of sensitivity does not reach 70% (the true value) even when the sensitivity of the reference test, Seref, is 100%, because the specificity of the reference test is not perfect (Spref = 90%). Similarly, in the right panel, the specificity of the index test does not reach the true value of 80% even when the specificity of the reference test, Spref, is 100%, because the sensitivity of the reference test is not perfect (Seref = 80%). The “naïve” estimates equal the true values only if both the sensitivity and the specificity of the reference test are 100%.
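
The curves in Figure 2 follow directly from this conditional-independence algebra. A minimal Python sketch (our own code, not the authors'; the prevalence p = 0.30 is an assumed value, since the caption does not state one) reproduces the caption's two endpoint observations:

def naive_estimates(se_i, sp_i, se_r, sp_r, p):
    """ "Naive" (Se, Sp) of an index test cross-classified against an
    imperfect reference, assuming the two tests are independent
    conditional on disease status."""
    pos_pos = p * se_i * se_r + (1 - p) * (1 - sp_i) * (1 - sp_r)  # P(index+, ref+)
    ref_pos = p * se_r + (1 - p) * (1 - sp_r)                      # P(ref+)
    neg_neg = (1 - p) * sp_i * sp_r + p * (1 - se_i) * (1 - se_r)  # P(index-, ref-)
    ref_neg = (1 - p) * sp_r + p * (1 - se_r)                      # P(ref-)
    return pos_pos / ref_pos, neg_neg / ref_neg

# True index-test performance from the caption: Se = 70%, Sp = 80%.
# Left panel endpoint: Se_ref = 100% but Sp_ref = 90% -> naive Se stays below 70%.
print(naive_estimates(0.70, 0.80, 1.00, 0.90, p=0.30))   # ~ (0.605, 0.800)
# Right panel endpoint: Sp_ref = 100% but Se_ref = 80% -> naive Sp stays below 80%.
print(naive_estimates(0.70, 0.80, 0.80, 1.00, p=0.30))   # ~ (0.700, 0.761)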
Figure 3.
“Naïve” estimates of the ability of portable monitors versus laboratory-based polysomnography to detect AHI > 15 events/hour. These data are from a subset of the studies in the systematic review used in the illustration (studies that used manual scoring, or combined manual and automated scoring, for a type III portable monitor). “Naïve” sensitivity/specificity pairs from the same study (obtained with different cut-offs for the portable monitor) are connected with lines. Studies lying in the left lightly shaded area have a positive likelihood ratio of 10 or more. Studies lying in the top lightly shaded area have a negative likelihood ratio of 0.1 or less. Studies lying in the intersection of the two shaded areas (darker grey polygon) have both a positive likelihood ratio of 10 or more and a negative likelihood ratio of 0.1 or less.
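
The shaded regions in Figure 3 correspond to the standard likelihood-ratio formulas. A small sketch (our own helper, applied to a hypothetical study point) shows the classification rule:

def likelihood_ratios(se, sp):
    """Positive and negative likelihood ratios from sensitivity and
    specificity; sp must be below 1 for LR+ to be finite."""
    return se / (1.0 - sp), (1.0 - se) / sp

lr_pos, lr_neg = likelihood_ratios(0.95, 0.92)     # hypothetical study
in_left_area = lr_pos >= 10      # left shaded area: LR+ of 10 or more
in_top_area = lr_neg <= 0.1      # top shaded area: LR- of 0.1 or less
in_dark_polygon = in_left_area and in_top_area     # darker grey polygon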
Figure 4.
Illustrative example of a difference versus average analysis of measurements with facility-based polysomnography and portable monitors. Digitized data from an actual study in which portable monitors (Pro-Tech PTAF2 and Compumedics P2) were compared with facility-based polysomnography (PSG). The dashed line at zero difference is the line of perfect agreement. The mean bias is the average systematic difference between the two measurements. The 95% limits of agreement are the boundaries within which 95% of the differences lie. If these are very wide and encompass clinically important differences, one may conclude that the agreement between the measurements is suboptimal. Note that the spread of the differences increases for higher measurement values. This indicates that the mean bias and 95% limits of agreement do not adequately describe the differences between the two measurements: differences are smaller for smaller AHI values and larger for larger AHI values. In this example, the mean bias = −11 events/hour (95% limits of agreement: −38 to 17), with a statistically significant dependence of the difference on the average (Bradley-Blackwood F test, p < 0.01).
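
The quantities in this caption come from a standard difference-versus-average (Bland-Altman) computation. A minimal sketch with made-up AHI values (not the study's digitized data):

import numpy as np

pm = np.array([12.0, 18.0, 25.0, 41.0, 55.0, 63.0])   # portable-monitor AHI (hypothetical)
psg = np.array([20.0, 27.0, 38.0, 50.0, 70.0, 75.0])  # facility-based PSG AHI (hypothetical)

diff = pm - psg            # per-subject difference between the two methods
avg = (pm + psg) / 2.0     # per-subject average (x-axis of the Bland-Altman plot)
bias = diff.mean()         # mean bias: average systematic difference
sd = diff.std(ddof=1)      # sample standard deviation of the differences
lower, upper = bias - 1.96 * sd, bias + 1.96 * sd     # 95% limits of agreement
print(f"mean bias = {bias:.1f} events/hour; 95% limits of agreement: ({lower:.1f}, {upper:.1f})")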
Figure 5.
Schematic representation of the mean bias and limits of agreement across several studies. The agreement between portable monitors and facility-based polysomnography, as conveyed by difference versus average analyses, is summarized across seven studies (the study of Fig. 4 is not included). The study author and the make of the monitor are shown in the upper and lower parts of the graph, respectively. The difference versus average analysis from each study is represented by three horizontal lines: a thicker middle line (the mean bias) and two thinner lines, symmetrically positioned above and below it, which represent the 95% limits of agreement. The figure facilitates comparison of the mean bias and the 95% limits of agreement across studies by means of shaded horizontal zones. The middle light grey zone shows the range of the mean bias across the seven studies, from +6 events per hour of sleep in the study by Dingli et al. (Embletta monitor) to −8 events per hour of sleep in the study by Whittle et al. (Edentrace monitor). The uppermost and lowermost shaded zones show the corresponding ranges of the upper and lower 95% limits of agreement, respectively, across the seven studies.
