Extracting coded information from large databases
PMID: 18998869
Abstract
The purpose of this workshop is to examine the preprocessing required to analyze information in nationally available databases, including the National Inpatient Sample and the SEER-Medicare database. Healthcare researchers cannot extract meaningful information without first processing the data into a format that allows for statistical analysis. For example, a quick search of Medline via PubMed using the keyword phrase "National Inpatient Sample" returns 458 records; a search for "MEPS" returns 1363 records; "ambulatory care survey" returns 9803. Societies such as the International Society for Pharmacoeconomics and Outcomes Research (ISPOR) and the American Pharmacists Association (APhA) also host numerous presentations that utilize these databases. None of these papers discuss the necessary extraction techniques, and publications that concentrate on the preprocessing required to work with these databases are virtually nonexistent. These databases can have over 100 variables and millions of patient records. Traditional statistical methods cannot handle such complexity; typically, the dataset is reduced to a handful of variables, and a filter is applied to restrict it to a much smaller set of patients. Moreover, the primary patient outcome studied is cost, where the different patient claims can be combined into a total cost of treatment. One of the most difficult problems is how to handle nominal data. In these databases, nominal variables can have thousands of possible levels, too many to use in a regression model, so there must be a way to compress these values. Many different coding schemes are used to record patient conditions, including DRG, ICD-9, CPT, and HCPCS codes. Because of this complexity, information must be provided on how the variables and categorical levels are reduced and extracted.
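One common way to compress a nominal variable with thousands of levels is to keep only its most frequent values and collapse everything else into a single "other" level. The sketch below illustrates this idea in Python; the diagnosis codes and the cutoff are purely hypothetical, and the abstract does not prescribe this particular scheme.

```python
from collections import Counter

def compress_levels(codes, top_n=20, other="OTHER"):
    """Collapse a high-cardinality nominal variable to its top_n most
    frequent levels; all remaining values become a single 'other' level."""
    keep = {code for code, _ in Counter(codes).most_common(top_n)}
    return [c if c in keep else other for c in codes]

# Hypothetical ICD-9 diagnosis codes drawn from claims records.
diagnoses = ["250.00", "401.9", "250.00", "414.01", "401.9", "786.50",
             "250.00", "401.9", "496", "038.9"]
print(compress_levels(diagnoses, top_n=3))
```

After compression the variable has at most `top_n + 1` levels, a size that a regression model can accommodate as dummy variables.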
In addition, information is often spread across different datasets, requiring files to be merged based upon a patient identifier. Topics include compressing, filtering, and merging data files; transformation of variables to satisfy model assumptions; and partitioning data to validate results. The workshop will also cover the merging of different datasets (for example, the MEPS files) by patient identifier so that the extracted information is accurate, and will focus on how data files with thousands, if not millions, of patient entries must be processed to extract meaningful information.
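The merge-and-aggregate step described above can be sketched as follows: claim records from separate files are linked on the patient identifier and summed into a single total cost of treatment per patient. The field names and values are illustrative only, not the actual layout of MEPS or any other database.

```python
from collections import defaultdict

# Hypothetical rows from two separate claim files, linked by patient ID
# (field names are illustrative, not an actual database layout).
inpatient = [
    {"patient_id": "P1", "cost": 12000.0},
    {"patient_id": "P2", "cost": 8500.0},
    {"patient_id": "P1", "cost": 3000.0},
]
outpatient = [
    {"patient_id": "P1", "cost": 450.0},
    {"patient_id": "P3", "cost": 120.0},
]

def total_cost_by_patient(*claim_files):
    """Merge claim records on the patient identifier and sum each
    patient's claims into a single total treatment cost."""
    totals = defaultdict(float)
    for rows in claim_files:
        for row in rows:
            totals[row["patient_id"]] += row["cost"]
    return dict(totals)

print(total_cost_by_patient(inpatient, outpatient))
# P1 combines claims from both files: 12000 + 3000 + 450 = 15450
```

The resulting per-patient total is the kind of outcome variable that is then transformed (for example, log-transformed, since costs are typically right-skewed) and partitioned into training and validation sets before modeling.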
Similar articles
- A method for cohort selection of cardiovascular disease records from an electronic health record system. Int J Med Inform. 2017 Jun;102:138-149. doi: 10.1016/j.ijmedinf.2017.03.015. PMID: 28495342.
- The BioPrompt-box: an ontology-based clustering tool for searching in biological databases. BMC Bioinformatics. 2007 Mar 8;8 Suppl 1(Suppl 1):S8. doi: 10.1186/1471-2105-8-S1-S8. PMID: 17430575.
- [Routine data from general practitioner's software systems - Export, analysis and preparation for research]. Gesundheitswesen. 2010 Jun;72(6):323-31. doi: 10.1055/s-0030-1249689. PMID: 20491004. German.
- Biodiversity informatics: the challenge of linking data and the role of shared identifiers. Brief Bioinform. 2008 Sep;9(5):345-54. doi: 10.1093/bib/bbn022. PMID: 18445641. Review.
- Big data in medical science - a biostatistical view. Dtsch Arztebl Int. 2015 Feb 27;112(9):137-42. doi: 10.3238/arztebl.2015.0137. PMID: 25797506. Review.