J Gen Intern Med. 2012 Jun;27 Suppl 1(Suppl 1):S56-66. doi: 10.1007/s11606-012-2029-1.

Chapter 8: meta-analysis of test performance when there is a "gold standard"


Thomas A Trikalinos et al. J Gen Intern Med. 2012 Jun.

Abstract

Synthesizing information on test performance metrics such as sensitivity, specificity, predictive values and likelihood ratios is often an important part of a systematic review of a medical test. Because many metrics of test performance are of interest, the meta-analysis of medical tests is more complex than the meta-analysis of interventions or associations. Sometimes, a helpful way to summarize medical test studies is to provide a "summary point": a summary sensitivity and a summary specificity. Other times, when the sensitivity or specificity estimates vary widely or when the test threshold varies, it is more helpful to synthesize data using a "summary line" that describes how the average sensitivity changes with the average specificity. Choosing the most helpful summary is subjective, and in some cases both summaries provide meaningful and complementary information. Because sensitivity and specificity are not independent across studies, the meta-analysis of medical tests is fundamentally a multivariate problem and should be addressed with multivariate methods. More complex analyses are needed when studies report results at multiple thresholds for a positive test. At the same time, quantitative analyses are used to explore and explain any observed dissimilarity (heterogeneity) in the results of the examined studies. This can be performed in the context of proper (multivariate) meta-regressions.


Figures

Figure 1.
Typical data on the performance of a medical test (D-dimers for venous thromboembolism). Eleven studies on ELISA-based D-dimer assays for the diagnosis of venous thromboembolism. The top panel (a) depicts studies as markers, labeled by author names and thresholds for a positive test (in ng/mL). Studies in the left lightly shaded area have a positive likelihood ratio of at least 10. Studies in the top lightly shaded area have a negative likelihood ratio of at most 0.1. Studies in the intersection of the gray areas (darker gray polygon) have both a positive likelihood ratio of at least 10 and a negative likelihood ratio of 0.1 or less. The second panel (b) shows ‘paired’ forest plots in ascending order of sensitivity (left) along with the corresponding specificity (right). Note how sensitivity increases with decreasing specificity, which could be explained by a “threshold effect”. The third panel (c) shows the respective negative and positive likelihood ratios.
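The shaded regions of panel (a) correspond to fixed likelihood-ratio cut-offs (PLR ≥ 10, NLR ≤ 0.1). A small sketch of that classification, using hypothetical sensitivity/specificity pairs rather than the actual D-dimer studies:

```python
def region(se, sp):
    """Classify a study point by the likelihood-ratio regions of Fig. 1a."""
    plr = se / (1 - sp) if sp < 1 else float("inf")
    nlr = (1 - se) / sp
    informative_pos = plr >= 10     # strong rule-in test
    informative_neg = nlr <= 0.1    # strong rule-out test
    if informative_pos and informative_neg:
        return "both"
    if informative_pos:
        return "PLR>=10"
    if informative_neg:
        return "NLR<=0.1"
    return "neither"

# Hypothetical study points
print(region(0.96, 0.97))  # high Se and Sp: falls in both regions
print(region(0.70, 0.99))  # specific but insensitive: rule-in only
```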
Figure 2.
Obtaining summary (overall) metrics for medical test performance. PLR/NLR = positive (negative) likelihood ratio; PPV/NPV = positive (negative) predictive value; Prev = prevalence; Se = Sensitivity; Sp = specificity. The herein recommended approach is to perform a meta-analysis for sensitivity and specificity across the K studies, and then use the summary sensitivity and specificity (Se+ and Sp+; a row of two boxes after the horizontal black line) to back-calculate “overall” values for the other metrics (second row of boxes after the horizontal black line). In most cases it is not meaningful to synthesize prevalences (see text).
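The back-calculation recommended in Figure 2 follows from the definitions of the likelihood ratios and from Bayes' theorem. A sketch with hypothetical summary estimates (Se+ = 0.85, Sp+ = 0.90) and an assumed prevalence of 20 %:

```python
def back_calculate(se, sp, prev):
    """Back-calculate 'overall' metrics from summary Se, Sp and an
    externally chosen prevalence, as in the last row of Fig. 2."""
    plr = se / (1 - sp)          # positive likelihood ratio
    nlr = (1 - se) / sp          # negative likelihood ratio
    # Predictive values via Bayes' theorem
    ppv = se * prev / (se * prev + (1 - sp) * (1 - prev))
    npv = sp * (1 - prev) / (sp * (1 - prev) + (1 - se) * prev)
    return plr, nlr, ppv, npv

# Hypothetical summary estimates and prevalence
plr, nlr, ppv, npv = back_calculate(se=0.85, sp=0.90, prev=0.20)
# ppv = 0.68, npv = 0.96
```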
Figure 3.
Graphical presentation of studies reporting data at multiple thresholds. Ability of early total serum bilirubin measurements to identify postdischarge total serum bilirubin above the 95th hour-specific percentile. Pairs of sensitivity and (100 % − specificity) from the same study (obtained with different cut-offs for the early total serum bilirubin measurement) are connected with lines. These lines are reconstructed from the reported cut-offs and are not perfect representations of the actual ROC curves in each study (they show only the few thresholds that could be extracted from the study). Studies in the left lightly shaded area have a positive likelihood ratio of at least 10. Studies in the top lightly shaded area have a negative likelihood ratio of at most 0.1. Studies in the intersection of the gray areas (darker gray polygon) have both a positive likelihood ratio of at least 10 and a negative likelihood ratio of 0.1 or less.
Figure 4.
HSROC for the ELISA-based D-dimer tests. (a) Hierarchical summary receiver operating characteristic (HSROC) curve for the studies plotted in Fig. 1a. (b) Calculated negative predictive value for the ELISA-based D-dimer test if the sensitivity and specificity are fixed at 80 % and 97 %, respectively, and the prevalence of venous thromboembolism varies from 5 to 50 %.
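The calculation behind panel (b) can be reproduced directly: with sensitivity and specificity held fixed at 80 % and 97 % (the values stated in the caption), the negative predictive value depends on prevalence alone. A sketch (the function name is illustrative):

```python
def npv_at(prev, se=0.80, sp=0.97):
    """NPV via Bayes' theorem; Se and Sp fixed as in Fig. 4b."""
    return sp * (1 - prev) / (sp * (1 - prev) + (1 - se) * prev)

# Sweep prevalence from 5 % to 50 %, as in the figure;
# NPV falls as prevalence rises
for pct in range(5, 55, 5):
    print(f"prevalence {pct:2d}%  NPV = {npv_at(pct / 100):.3f}")
```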
Figure 5.
Sensitivity versus (1 − specificity) plot for studies of serial CK-MB measurements. The left panel shows the sensitivity and specificity of 14 studies according to the timing of the last serial CK-MB measurement for diagnosis of acute cardiac ischemia. The numbers next to each study point are the actual lengths of the time interval from symptom onset to the last serial CK-MB measurement. Filled circles: at most 3 hours; “x” marks: longer than 3 hours. The right panel plots the summary points and the 95 % confidence regions for the aforementioned subgroups of studies (at most 3 hours: filled circles; longer than 3 hours: “x” marks). Estimates are based on a bivariate meta-regression using the time interval as a predictor. The predictor has distinct effects for sensitivity and specificity. This is the same analysis as in Table 2.
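For intuition only, pooling on the logit scale can be sketched as below with a univariate fixed-effect model for sensitivity. This is a deliberate simplification: the chapter recommends joint (bivariate) random-effects models that respect the correlation between sensitivity and specificity. All counts are hypothetical:

```python
import math

def pool_logit(events, totals):
    """Fixed-effect inverse-variance pooling of proportions on the
    logit scale. Simplified illustration only; the recommended
    approach is a bivariate random-effects model for Se and Sp."""
    num = den = 0.0
    for k, n in zip(events, totals):
        # 0.5 continuity correction guards against zero cells
        p = (k + 0.5) / (n + 1.0)
        logit = math.log(p / (1 - p))
        var = 1.0 / (k + 0.5) + 1.0 / (n - k + 0.5)
        w = 1.0 / var                    # inverse-variance weight
        num += w * logit
        den += w
    pooled = num / den
    return 1.0 / (1.0 + math.exp(-pooled))   # back-transform

# Hypothetical true-positive counts and diseased totals from 3 studies
summary_se = pool_logit(events=[45, 38, 52], totals=[50, 40, 60])
```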

