Objective: This study aimed to develop a semi-automated process to convert legacy data into Clinical Data Interchange Standards Consortium (CDISC) Study Data Tabulation Model (SDTM) format by combining human verification with three methods: data normalization; feature extraction via distributed representations of dataset names, variable names, and variable labels; and supervised machine learning.
Materials and methods: Variable labels, dataset names, variable names, and values of legacy data were used as machine learning features. Because most of these data are strings, they were converted to distributed representations to make them usable as machine learning features. For this purpose, we used the following methods: Gestalt pattern matching, cosine similarity after vectorization by Doc2vec, and vectorization by Doc2vec. We examined five algorithms (decision tree, random forest, gradient boosting, neural network, and an ensemble combining the four) to identify the one that generated the best prediction model.
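The string-similarity features described above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: Python's difflib implements Gestalt pattern matching, while the Doc2vec embeddings are replaced here by fixed toy vectors (training a Doc2vec model, e.g. with gensim, is outside this sketch). All variable labels and function names are hypothetical.

```python
# Sketch of the two string-similarity feature types: Gestalt pattern
# matching and cosine similarity of (here, stand-in) Doc2vec vectors.
import difflib
import math


def gestalt_similarity(a: str, b: str) -> float:
    """Gestalt pattern matching via difflib's SequenceMatcher ratio."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cosine_similarity(u, v) -> float:
    """Cosine similarity between two dense vectors (e.g., Doc2vec output)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical example: a legacy variable label vs. an SDTM candidate label.
legacy_label = "Systolic Blood Pressure"
sdtm_label = "Systolic blood pressure (SYSBP)"
print(round(gestalt_similarity(legacy_label, sdtm_label), 3))

# Toy vectors standing in for Doc2vec embeddings of the two labels.
print(round(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]), 3))
```

Scores like these, computed per candidate mapping, would then serve as numeric features for the supervised learners compared in the study.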
Results: The accuracy rate was highest for the neural network, and the distributions of its prediction probabilities for correct and incorrect predictions were clearly separated. By combining human verification with the three methods, we were able to semi-automatically convert legacy data into the CDISC SDTM format.
Conclusion: By combining human verification with the three methods, we successfully developed a semi-automated process to convert legacy data into the CDISC SDTM format; this process is more efficient than the conventional fully manual process.
Thieme. All rights reserved.