Identifying and extracting patient smoking status information from clinical narrative texts in Spanish

Rosa L Figueroa; Diego A Soto; Esteban J Pino

doi:10.1109/EMBC.2014.6944182

Annu Int Conf IEEE Eng Med Biol Soc. 2014:2014:2710-3. doi: 10.1109/EMBC.2014.6944182.

PMID: 25570550

Abstract

In this work we present a system that identifies and extracts patients' smoking status from clinical narrative text in Spanish. The clinical narrative text was processed using natural language processing techniques and annotated by four people with a biomedical background. The dataset used for classification contained 2,465 documents, each annotated with one of the four smoking status categories. We used two feature representations: single-word tokens and bigrams. The classification problem was divided into two levels: the first distinguishes smokers (S) from non-smokers (NS), and the second distinguishes current smokers (CS) from past smokers (PS). For each feature representation and classification level, we used two classifiers: Support Vector Machines (SVM) and Bayesian Networks (BN). We split the dataset into a training set containing 66% of the available documents, used to build the classifiers, and a test set containing the remaining 34%, used to evaluate the models. Our results show that SVM combined with the bigram representation performed best at both classification levels. For the S vs NS level, the performance measures were accuracy (ACC) = 85%, precision = 85%, and recall = 90%. For the CS vs PS level, they were ACC = 87%, precision = 91%, and recall = 94%.
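
The pieces named in the abstract (single-word and bigram features, the 66%/34% split, the two classification levels, and the reported measures) can be illustrated with a minimal Python sketch. It is not the authors' implementation: all function names are hypothetical, whitespace tokenization stands in for the unspecified preprocessing, the ordered (non-random) split is an assumption, and the cascade in which the second level is applied only to documents labeled S is an assumed reading of the two-level design. The SVM and BN classifiers are left as caller-supplied functions.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real preprocessing)."""
    return text.lower().split()


def unigram_features(tokens):
    """Count single-word tokens."""
    return Counter(tokens)


def bigram_features(tokens):
    """Count pairs of adjacent tokens."""
    return Counter(zip(tokens, tokens[1:]))


def split_train_test(documents, train_fraction=0.66):
    """Assumed ordered split: first 66% for training, the rest for testing."""
    cut = round(len(documents) * train_fraction)
    return documents[:cut], documents[cut:]


def two_level_predict(document, level1, level2):
    """Assumed cascade: level 1 decides S vs NS; level 2 refines S into CS or PS."""
    if level1(document) == "NS":
        return "NS"
    return level2(document)


def evaluate(gold, predicted, positive):
    """Accuracy, precision, and recall with `positive` as the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    correct = sum(1 for g, p in pairs if g == p)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


if __name__ == "__main__":
    tokens = tokenize("Paciente fumador activo")  # hypothetical example sentence
    print(unigram_features(tokens))
    print(bigram_features(tokens))
    train, test = split_train_test(list(range(2465)))  # placeholder documents
    print(len(train), len(test))
```

In practice the feature counts would be converted to vectors and passed to a trained classifier; `level1` and `level2` here are placeholders for whichever trained model is used at each level.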

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Bayes Theorem
Chile
Databases, Factual*
Electronic Health Records / classification*
Humans
Narration
Natural Language Processing*
Smoking*
Support Vector Machine