J Am Med Inform Assoc. 2012 Sep-Oct;19(5):758-64.
doi: 10.1136/amiajnl-2012-000862. Epub 2012 Apr 17.

Grid Binary LOgistic REgression (GLORE): building shared models without sharing data

Yuan Wu et al. J Am Med Inform Assoc. 2012 Sep-Oct.

Abstract

Objective: The classification of complex or rare patterns in clinical and genomic data requires the availability of a large, labeled patient set. While methods that operate on large, centralized data sources have been extensively used, little attention has been paid to understanding whether models such as binary logistic regression (LR) can be developed in a distributed manner, allowing researchers to share models without necessarily sharing patient data.

Material and methods: Instead of bringing data to a central repository for computation, we bring computation to the data. The Grid Binary LOgistic REgression (GLORE) model integrates decomposable partial elements or non-privacy sensitive prediction values to obtain model coefficients, the variance-covariance matrix, the goodness-of-fit test statistic, and the area under the receiver operating characteristic (ROC) curve.
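As a rough illustration of bringing computation to the data, the sketch below assumes that the decomposable partial elements are each site's contributions to the gradient and information matrix of the log-likelihood, which a central site sums before taking a Newton-Raphson step. The abstract does not spell out the update rule, so that choice, the function names (`local_summaries`, `solve`, `fit`), and the toy data are assumptions made for illustration, not the authors' implementation.

```python
import math

def local_summaries(X, y, beta):
    """Run at one site: return that site's gradient and information-matrix sums."""
    d = len(beta)
    grad = [0.0] * d
    info = [[0.0] * d for _ in range(d)]
    for row, label in zip(X, y):
        z = sum(b * x for b, x in zip(beta, row))
        p = 1.0 / (1.0 + math.exp(-z))      # predicted probability
        w = p * (1.0 - p)
        for j in range(d):
            grad[j] += (label - p) * row[j]
            for k in range(d):
                info[j][k] += w * row[j] * row[k]
    return grad, info

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def fit(sites, d, iters=25, tol=1e-10):
    """Central site: sum the per-site summaries and take Newton-Raphson steps."""
    beta = [0.0] * d
    for _ in range(iters):
        grad = [0.0] * d
        info = [[0.0] * d for _ in range(d)]
        for X, y in sites:  # in a real deployment each call would run remotely
            g, h = local_summaries(X, y, beta)
            grad = [a + b for a, b in zip(grad, g)]
            info = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(info, h)]
        step = solve(info, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta

# Hypothetical toy data: each row is [1 (intercept), x]; labels are 0 or 1.
site_a = ([[1, 0.5], [1, 1.5], [1, 2.0], [1, 3.0]], [0, 0, 1, 1])
site_b = ([[1, 1.0], [1, 2.5], [1, 3.5], [1, 1.8]], [1, 0, 1, 0])

beta_distributed = fit([site_a, site_b], d=2)
beta_pooled = fit([(site_a[0] + site_b[0], site_a[1] + site_b[1])], d=2)
```

Only the summed gradient and information matrix cross site boundaries here; the patient-level rows stay local. Under the same assumption, the variance-covariance matrix would be the inverse of the summed information matrix at convergence.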

Results: We conducted experiments on both simulated and clinically relevant data, and compared the computational costs of GLORE with those of a traditional LR model estimated using the combined data. GLORE's results matched those of LR to within 10^-15. In addition, GLORE is computationally efficient.

Limitation: In GLORE, the calculation of coefficient gradients must be synchronized at different sites, which involves some effort to ensure the integrity of communication. Ensuring that the predictors have the same format and meaning across the data sets is necessary.

Conclusion: The results suggest that GLORE performs as well as LR and allows data to remain protected at their original sites.

Conflict of interest statement

Competing interests: None.

Figures

Figure 1
Illustration of a logistic regression model using one-dimensional data (for example, X = body temperature). π(X,β) is a sigmoid function relating temperature to the probability that a particular record contains the word ‘fever’. Dots on the upper and lower horizontal lines correspond to positive (‘fever’) and negative (absence of ‘fever’) observations, respectively. β is the estimated parameter.
Figure 2
Pipeline for the Grid Binary LOgistic REgression (GLORE) model. Data sets hosted in different institutions (ie, A, B, and C) are processed locally through the same virtual engine (ie, GLORE code) to compute non-sensitive intermediary results, which are exchanged and combined to obtain the final global model parameters at the central site. A similar distributed process is executed for evaluation of the model.
Figure 3
Calculating the area under the curve (AUC) using GLORE. (A) Exchange numbers of one-labeled and zero-labeled records between site A and site B. (B) Compute rank sums for records in site A. 1: Calculate the rank of each probability in A among zero-labeled records in B. 2: Calculate the rank of each one-labeled probability in A among zero-labeled records in A. 3: Find the one-labeled records from step 1 (ie, bounded in red boxes). 4: Combine ranks for one-labeled records from steps 2 and 3 to get the rank sums for A. (C) Compute rank sums for records in site B. 5: Calculate the rank of each probability in B among zero-labeled records in A. 6: Calculate the rank of each one-labeled probability in B among zero-labeled records in B. 7: Find the one-labeled records from step 5 (ie, bounded in red boxes). 8: Combine ranks for one-labeled records from steps 6 and 7 to get the rank sums for B.
Figure 4
The convergence paths of the two-site GLORE estimations for intercept, X1, and X2. The estimation difference between GLORE and classic LR is smaller than 10^-15 for all iterations, as shown in table 1.
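The rank-based AUC procedure described for Figure 3 can be approximated with a simplified sketch. It assumes that each site may share the sorted predicted probabilities of its zero-labeled records (the non-privacy sensitive prediction values mentioned in the abstract), and it counts, for every one-labeled record, how many zero-labeled records at any site score lower. This pair-counting form is equivalent to a rank-sum AUC but does not reproduce the figure's exact exchange steps; the function names and data are hypothetical.

```python
from bisect import bisect_left, bisect_right

def split_by_label(probs, labels):
    """Run at one site: separate predictions by label; sort the zero-labeled ones."""
    ones = [p for p, l in zip(probs, labels) if l == 1]
    zeros = sorted(p for p, l in zip(probs, labels) if l == 0)
    return ones, zeros

def pair_count(ones, zeros_sorted):
    """Count (one, zero) pairs where the one-labeled record scores higher.

    A tie between a one-labeled and a zero-labeled score counts as half a pair.
    """
    total = 0.0
    for p in ones:
        below = bisect_left(zeros_sorted, p)
        ties = bisect_right(zeros_sorted, p) - below
        total += below + 0.5 * ties
    return total

def distributed_auc(sites):
    """Combine per-site counts: AUC = concordant pairs / (n_ones * n_zeros)."""
    parts = [split_by_label(probs, labels) for probs, labels in sites]
    n_ones = sum(len(ones) for ones, _ in parts)
    n_zeros = sum(len(zeros) for _, zeros in parts)
    # Each site's one-labeled records are compared with every site's zeros.
    wins = sum(pair_count(ones, zeros) for ones, _ in parts for _, zeros in parts)
    return wins / (n_ones * n_zeros)

# Hypothetical predicted probabilities and labels held at two sites.
site_a = ([0.9, 0.4, 0.7, 0.2], [1, 0, 1, 0])
site_b = ([0.8, 0.6, 0.3, 0.5], [1, 0, 1, 0])

auc_two_sites = distributed_auc([site_a, site_b])
auc_pooled = distributed_auc([(site_a[0] + site_b[0], site_a[1] + site_b[1])])
```

In this toy case 13 of the 16 (one-labeled, zero-labeled) pairs are ordered correctly, so both calls give an AUC of 0.8125.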
