AMIA Annu Symp Proc. 2008 Nov 6:1239-40.

Extracting coded information from large databases

PMID: 18998869

Patricia B Cerrito et al. AMIA Annu Symp Proc. 2008.

Abstract

The purpose of this workshop is to examine the preprocessing required to analyze information in nationally available databases, including the National Inpatient Sample and the SEER-Medicare database. Healthcare researchers cannot extract meaningful information without first processing the data into a format that allows for statistical analysis. The databases are widely used: a quick search of Medline via PubMed for the keywords "National Inpatient Sample" returns 458 records, a search for "MEPS" returns 1363 records, and "ambulatory care survey" returns 9803. Societies such as the International Society for Pharmacoeconomics and Outcomes Research (ISPOR) and the American Pharmacists Association (APhA) have numerous presentations that utilize these databases as well. None of these papers discuss the necessary extraction techniques. In addition, publications that concentrate on the preprocessing required to work with these databases are virtually nonexistent. These databases can have over 100 variables and millions of patient records. Traditional statistical methods cannot work with such complexity, so the dataset is typically reduced to a handful of variables, and a filter is applied to restrict it to a much smaller set of patients. Moreover, the primary patient outcome studied is cost, where the different patient claims can be combined into a total cost of treatment. One of the most difficult problems is how to handle nominal data. In these databases, nominal data can have thousands of possible levels, too many to use in a regression model, so there has to be a way to compress these values. There are many different coding schemes used to record patient conditions, including DRG codes, ICD9 codes, CPT codes, and HCPCS codes. Because of this complexity, information needs to be provided on how the variables and categorical levels are reduced and extracted.
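One common way to compress a nominal variable with many levels is to keep the frequent levels and lump the rare ones into a single catch-all level. The sketch below illustrates that generic idea only; it is not taken from the workshop. The function name `compress_levels`, the `min_count` threshold, the `OTHER` label, and the sample codes are all assumptions made up for illustration.

```python
from collections import Counter


def compress_levels(codes, min_count=2, other_label="OTHER"):
    """Keep levels seen at least min_count times; lump the rest together."""
    counts = Counter(codes)
    return [c if counts[c] >= min_count else other_label for c in codes]


# Hypothetical diagnosis codes, invented for illustration (not real patient data).
codes = ["250.00", "401.9", "250.00", "428.0", "401.9", "250.00", "V45.81"]
compressed = compress_levels(codes, min_count=2)
print(compressed)
```

The threshold controls how many levels survive. In a real database it would presumably be chosen from the level frequencies and the sample size a regression model needs per level.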
In addition, information is often spread across different datasets, requiring files to be merged based upon a patient identifier. Workshop topics include compressing, filtering, and merging of datafiles; transformation of variables to satisfy model assumptions; and partitioning data to validate results. The workshop will also cover the required merging of different datasets (for example, the MEPS) by patient identifier so that the information extracted is accurate. The focus will be on how datafiles with thousands, if not millions, of patient entries must be developed to extract meaningful information.
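The merging and partitioning steps named above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the workshop's procedure: the field names (`patient_id`, `age`, `cost`), the sample records, the helper functions, and the choice to give patients without claims a total cost of 0.0 are all hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical records; field names and values are illustrative only.
patients = [
    {"patient_id": 1, "age": 67},
    {"patient_id": 2, "age": 54},
    {"patient_id": 3, "age": 71},
    {"patient_id": 4, "age": 49},
]
claims = [
    {"patient_id": 1, "cost": 1200.0},
    {"patient_id": 1, "cost": 300.0},
    {"patient_id": 2, "cost": 850.0},
    {"patient_id": 4, "cost": 40.0},
    {"patient_id": 4, "cost": 60.0},
]


def total_cost_by_patient(claims):
    """Combine each patient's claims into one total cost."""
    totals = defaultdict(float)
    for claim in claims:
        totals[claim["patient_id"]] += claim["cost"]
    return dict(totals)


def merge_on_patient_id(patients, totals):
    """Left join: keep every patient; assume 0.0 cost when no claims exist."""
    return [{**p, "total_cost": totals.get(p["patient_id"], 0.0)} for p in patients]


def partition(rows, train_fraction=0.5, seed=0):
    """Shuffle with a fixed seed, then split into training and holdout sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


totals = total_cost_by_patient(claims)
merged = merge_on_patient_id(patients, totals)
train, holdout = partition(merged, train_fraction=0.5, seed=0)
```

Aggregating claims before the join keeps one row per patient, so the merge cannot duplicate patient records. Whether a patient with no claims should count as zero cost or as missing is a judgment that depends on the database.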
