BMC Bioinformatics. 2015;16 Suppl 18(Suppl 18):S6.
doi: 10.1186/1471-2105-16-S18-S6. Epub 2015 Dec 9.

Privacy-preserving Search for Chemical Compound Databases

Free PMC article

Kana Shimizu et al. BMC Bioinformatics. 2015.

Abstract

Background: Searching for similar compounds in a database is the most important process for in-silico drug screening. Since a query compound is an important starting point for a new drug, a query holder who fears that the database server is monitoring the query usually downloads all the records in the database and uses them in a closed network. However, a serious dilemma arises when the database holder also wants to output no information except for the search results, and this dilemma prevents the use of many important data resources.

Results: To overcome this dilemma, we developed a novel cryptographic protocol that enables database searching while preserving both the query holder's privacy and the database holder's privacy. Generally, applying cryptographic techniques to practical problems is difficult because versatile techniques are computationally expensive, while computationally inexpensive techniques can perform only trivial computation tasks. In this study, our protocol is built solely from an additive-homomorphic cryptosystem, which allows only addition to be performed on encrypted values but is computationally efficient compared with versatile techniques such as general purpose multi-party computation. In an experiment searching ChEMBL, which consists of more than 1,200,000 compounds, the proposed method was 36,900 times faster in CPU time and 12,000 times as efficient in communication size compared with general purpose multi-party computation.
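As a rough illustration of what "addition performed on encrypted values" means, the following is a minimal sketch of one well-known additive-homomorphic scheme, a textbook Paillier cryptosystem. It is an assumption that this scheme resembles the one used in the paper; the abstract does not name it. The tiny primes, the function names, and the bit-vector inner product at the end are all hypothetical choices made for illustration, and the parameters are far too small to be secure.

```python
import math
import secrets

def keygen(p=1789, q=1867):
    """Toy key generation. The default primes are insecure demo values (assumption)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to g = n + 1
    return n, (lam, mu)

def encrypt(n, m):
    """Encrypt an integer 0 <= m < n under public key n."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # random r in [1, n-1]
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, priv, c):
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n

def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts (mod n)."""
    return (c1 * c2) % (n * n)

def mul_plain(n, c, k):
    """Raising a ciphertext to a known integer k multiplies the plaintext by k."""
    return pow(c, k, n * n)

def encrypted_inner_product(n, enc_query_bits, server_bits):
    """Server side: combine encrypted query bits with its own plaintext bits.

    The server never decrypts anything; it only sees ciphertexts.
    """
    acc = encrypt(n, 0)
    for c, bit in zip(enc_query_bits, server_bits):
        acc = add_encrypted(n, acc, mul_plain(n, c, bit))
    return acc

if __name__ == "__main__":
    n, priv = keygen()
    query = [1, 0, 1, 1, 0]    # hypothetical fingerprint bits held by the user
    record = [1, 1, 1, 0, 0]   # hypothetical fingerprint bits held by the server
    enc_query = [encrypt(n, b) for b in query]
    enc_result = encrypted_inner_product(n, enc_query, record)
    print(decrypt(n, priv, enc_result))  # number of shared "on" bits
```

Only additions and multiplications by known constants are available in this setting, which is why a protocol built on such a scheme has to express its similarity computation in those terms.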

Conclusion: We proposed a novel privacy-preserving protocol for searching chemical compound databases. The proposed method, which scales easily to large databases, may help to accelerate drug discovery research by making full use of unused but valuable data that includes sensitive information.

Figures

Figure 1
Schematic view showing a large difference in tolerance against the regression attack between two cases: (a) the server's reply is the distance between the attacker's query and the server's data; (b) the server's reply is a binary sign that shows whether or not the distance between the attacker's query and the server's data is larger than the given threshold. The red point represents the server's data and × represents the attacker's query. Prior to the query, the search spaces (white areas) in (a-1) and (b-1) are equal. After the first query has been sent, the search space in (a-2) is limited to the circle whose radius is the distance between the attacker's query and the server's data. In (b-2), on the other hand, only the small area of the dashed circle whose radius is the given threshold (gray area) is excluded from the search space. By sending the second query, the attacker learns that one of the two intersections of the two circles in (a-3) is equal to the server's data, while the search space remains large in (b-3). Finally, the server's data is detected by sending the third query in (a-4). In (b-4), however, the search space is still large, even though the third query is within the given threshold.
Figure 2
Schematic view of the protection of (a) user privacy and (b) database privacy while keeping user privacy. For user privacy, the user's query and the search result, which includes information about the query, must be invisible to the database side during the search task. For database privacy, the server minimizes output information to prevent regression attacks (b-1), and also detects and rejects illegal queries that might cause unexpected information leakage (b-2). These server tasks must be carried out on the encrypted queries in order to keep user privacy.
Figure 3
Upper bounds of the probabilities that the user has at least one hit query out of making $1, 10, \ldots, 10^6$ queries. Note that the hit query becomes the critical hint for revealing database information. Each line shows the results with one of the four different thresholds.
Figure 4
Comparison of the experimental success ratios of the user's guess based on the server's return as well as the prior distribution of the true value when the user sends many queries ($\delta = 0, \ldots, 0.2$), and the success probability based only on a guess from the prior distribution (ideal value). $\overline{TI}_{1,1,0.8}$ ($k = 831$) is assumed, and results are calculated for three different numbers of dummies ($n = 831 \times 10$, $831 \times 50$, $831 \times 10^2$) when the user sends $L = 1, 10, \ldots, 10^5$ queries, and for three different distributions: $w_{\mathrm{ChEMBL-177159}}$ and $w_{\mathrm{ChEMBL-265935}}$ are actual distributions of $\overline{TI}_{1,1,0.8}$ on ChEMBL obtained by querying two randomly selected fingerprints from ChEMBL, and $w_{\mathrm{rand}}$ is obtained by randomly selecting a value from $1, \ldots, k$ for $m = 5 \times 831$ times and dividing each observed frequency by $m$.
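The contrast drawn in Figure 1 can be sketched in a few lines. The example below is a hedged illustration in a two-dimensional plane, not the paper's compound space: `locate_from_distances` shows how three exact distance replies pin down a hidden point (case a), while `threshold_reply` models a server that returns only a binary answer (case b). The function names, coordinates, and threshold are hypothetical.

```python
import math

def locate_from_distances(queries, dists):
    """Recover a hidden 2D point from exact distances to three query points.

    Subtracting the first circle equation from the other two leaves a
    2x2 linear system, solved here with Cramer's rule.
    """
    (x1, y1), (x2, y2), (x3, y3) = queries
    d1, d2, d3 = dists
    a11, a12 = 2 * (x2 - x1), 2 * (y2 - y1)
    a21, a22 = 2 * (x3 - x1), 2 * (y3 - y1)
    b1 = d1**2 - d2**2 + x2**2 - x1**2 + y2**2 - y1**2
    b2 = d1**2 - d3**2 + x3**2 - x1**2 + y3**2 - y1**2
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("query points must not be collinear")
    return ((b1 * a22 - a12 * b2) / det, (a11 * b2 - a21 * b1) / det)

def threshold_reply(secret, query, threshold):
    """Binary reply: is the hidden point within `threshold` of the query?"""
    return math.dist(secret, query) <= threshold

if __name__ == "__main__":
    secret = (3.0, 4.0)                      # hypothetical server data
    queries = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]

    # Case (a): distance replies let the attacker solve for the point.
    dists = [math.dist(secret, q) for q in queries]
    print(locate_from_distances(queries, dists))

    # Case (b): binary replies only rule out a small disc around each miss.
    print([threshold_reply(secret, q, 1.0) for q in queries])
```

With distance replies, three queries suffice in this toy setting; with binary replies, each miss removes only a disc of radius equal to the threshold from the attacker's search space.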
