A Scalable Pseudonymization Tool for Rapid Deployment in Large Biomedical Research Networks: Development and Evaluation Study

Hammam Abu Attieh; Diogo Telmo Neves; Mariana Guedes; Massimo Mirandola; Chiara Dellacasa; Elisa Rossi; Fabian Prasser

doi:10.2196/49646

JMIR Med Inform. 2024 Apr 23:12:e49646. doi: 10.2196/49646.

Authors

Hammam Abu Attieh¹, Diogo Telmo Neves¹, Mariana Guedes²,³,⁴, Massimo Mirandola⁵, Chiara Dellacasa⁶, Elisa Rossi⁶, Fabian Prasser¹

Affiliations

¹ Medical Informatics Group, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany.
² Infection and Antimicrobial Resistance Control and Prevention Unit, Centro Hospitalar Universitário São João, Porto, Portugal.
³ Infectious Diseases and Microbiology Division, Hospital Universitario Virgen Macarena, Sevilla, Spain.
⁴ Department of Medicine, University of Sevilla/Instituto de Biomedicina de Sevilla (IBiS)/Consejo Superior de Investigaciones Científicas (CSIC), Sevilla, Spain.
⁵ Infectious Diseases Division, Diagnostic and Public Health Department, University of Verona, Verona, Italy.
⁶ High Performance Computing (HPC) Department, CINECA - Consorzio Interuniversitario, Bologna, Italy.

PMID: 38654577
PMCID: PMC11063579
DOI: 10.2196/49646

Abstract

Background: The SARS-CoV-2 pandemic has demonstrated once again that rapid collaborative research is essential for the future of biomedicine. Large research networks are needed to collect, share, and reuse data and biosamples to generate collaborative evidence. However, setting up such networks is often complex and time-consuming, as common tools and policies are needed to ensure interoperability and the required flows of data and samples, especially for handling personal data and the associated data protection issues. In biomedical research, pseudonymization detaches directly identifying details from biomedical data and biosamples and connects them using secure identifiers, the so-called pseudonyms. This protects privacy by design but allows the necessary linkage and reidentification.

Objective: Although pseudonymization is used in almost every biomedical study, there are currently no pseudonymization tools that can be rapidly deployed across many institutions. Moreover, using centralized services is often not possible, for example, when data are reused and consent for this type of data processing is lacking. We present the ORCHESTRA Pseudonymization Tool (OPT), developed under the umbrella of the ORCHESTRA consortium, which faced exactly these challenges when it had to rapidly establish a large-scale research network as part of the pandemic response in Europe.

Methods: To overcome challenges caused by the heterogeneity of IT infrastructures across institutions, the OPT was developed based on programmable runtime environments available at practically every institution: office suites. The software is highly configurable and provides many features, from subject and biosample registration to record linkage and the printing of machine-readable codes for labeling biosample tubes. Special care has been taken to ensure that the algorithms implemented are efficient so that the OPT can be used to pseudonymize large data sets, which we demonstrate through a comprehensive evaluation.

Results: The OPT is available for Microsoft Office and LibreOffice, so it can be deployed on Windows, Linux, and macOS. It provides multiuser support and is configurable to meet the needs of different types of research projects. Within the ORCHESTRA research network, the OPT has been successfully deployed at 13 institutions in 11 countries in Europe and beyond. As of June 2023, the software manages data about more than 30,000 subjects and 15,000 biosamples. Over 10,000 labels have been printed. The results of our experimental evaluation show that the OPT offers practical response times for all major functionalities, pseudonymizing 100,000 subjects in 10 seconds using Microsoft Excel and in 54 seconds using LibreOffice.

Conclusions: Innovative solutions are needed to make the process of establishing large research networks more efficient. The OPT, which leverages the runtime environment of common office suites, can be used to rapidly deploy pseudonymization and biosample management capabilities across research networks. The tool is highly configurable and available as open-source software.

Keywords: biomedical research; data protection; data sharing; privacy; pseudonymization; research network.
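
Illustrative sketch: The pseudonymization concept described in the Background (detaching directly identifying details from research data and connecting the two through pseudonyms, while still allowing linkage and reidentification) can be sketched in a few lines of Python. This is not the OPT's implementation, which runs inside office suites. The class `PseudonymRegistry`, the function `pseudonymize_record`, the `SUBJ-` prefix, the code alphabet, and the exact-match linkage on normalized name and birth date are all assumptions made for illustration.

```python
import secrets
from dataclasses import dataclass, field

# Hypothetical code alphabet: omits characters that are easy to confuse on labels.
ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789"
IDENTIFYING_FIELDS = ("first_name", "last_name", "birth_date")


@dataclass
class PseudonymRegistry:
    """Holds the mapping between identities and pseudonyms (kept at the site)."""
    prefix: str = "SUBJ"   # assumed pseudonym prefix
    length: int = 8        # assumed number of random characters
    _by_identity: dict = field(default_factory=dict, init=False, repr=False)
    _by_pseudonym: dict = field(default_factory=dict, init=False, repr=False)

    @staticmethod
    def _key(first_name, last_name, birth_date):
        # Very simple exact-match linkage key; real record linkage is more involved.
        return (first_name.strip().casefold(),
                last_name.strip().casefold(),
                birth_date.strip())

    def register(self, first_name, last_name, birth_date):
        """Return the existing pseudonym for a known subject, else create one."""
        key = self._key(first_name, last_name, birth_date)
        if key in self._by_identity:
            return self._by_identity[key]
        while True:  # retry in the unlikely case of a collision
            code = "".join(secrets.choice(ALPHABET) for _ in range(self.length))
            pseudonym = f"{self.prefix}-{code}"
            if pseudonym not in self._by_pseudonym:
                break
        self._by_identity[key] = pseudonym
        self._by_pseudonym[pseudonym] = {
            "first_name": first_name,
            "last_name": last_name,
            "birth_date": birth_date,
        }
        return pseudonym

    def reidentify(self, pseudonym):
        """Look up the identifying details behind a pseudonym."""
        return self._by_pseudonym[pseudonym]


def pseudonymize_record(registry, record):
    """Strip identifying fields from a record and attach the pseudonym instead."""
    pseudonym = registry.register(
        record["first_name"], record["last_name"], record["birth_date"]
    )
    research = {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}
    research["pseudonym"] = pseudonym
    return research


if __name__ == "__main__":
    registry = PseudonymRegistry()
    record = {"first_name": "Ada", "last_name": "Lovelace",
              "birth_date": "1815-12-10", "crp_mg_l": 4.2}
    print(pseudonymize_record(registry, record))
```

In this sketch only the output of `pseudonymize_record` would leave the institution, while the registry stays local, so reidentification remains possible only where the mapping is held. Random pseudonyms are used here rather than hashes of the identifying data, so a pseudonym by itself reveals nothing about the subject.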