Background: Metadata is data that describes other data or resources. It has a defined number of named elements that convey meaning. Medical data are complex to process. For example, in the Primary Care Data Quality (PCDQ) renal programme, we need to collect over 300 variables because there are so many possible causes of renal disease. These variables are not just single columns of data: all are extracted as at least a code plus date, and some as code-date-value. Metadata has the potential to improve the reliability of processing large datasets.
Objective: To define unique and unambiguous metadata headings for clinical data and derived variables.
Method: We defined the look-up tables we would use as a controlled vocabulary to name the core clinical concepts within the metadata. We added six other elements to describe the data: (1) the study or audit name; (2) the query used to extract the data; (3) the data collection number; (4) the type of data, including its units; (5) the repeat number (if the variable was extracted more than once); and (6) a processing suffix that defines how the data have been processed.
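As an illustration only, the seven-part heading described in the Method (a core clinical concept plus six descriptive elements) could be sketched as below. The abstract does not specify how the elements are ordered or joined, so the underscore separator, the element order, the class and function names, the vocabulary entries and every example value are hypothetical assumptions, not the PCDQ format itself.

```python
from dataclasses import dataclass

# Assumption: elements are joined with "_"; the abstract names no separator.
SEPARATOR = "_"

# Hypothetical stand-in for the look-up tables used as a controlled vocabulary.
CONTROLLED_VOCABULARY = {"CREATININE", "SBP", "HBA1C"}


@dataclass(frozen=True)
class MetadataHeading:
    study: str       # (1) study or audit name
    query: str       # (2) query used to extract the data
    collection: int  # (3) data collection number
    concept: str     # core clinical concept from the controlled vocabulary
    data_type: str   # (4) type of data, including units
    repeat: int      # (5) repeat number
    suffix: str      # (6) processing suffix

    def __post_init__(self):
        # Rejecting unknown concept names keeps synonyms out of the headings.
        if self.concept not in CONTROLLED_VOCABULARY:
            raise ValueError(f"unknown concept: {self.concept!r}")
        # A separator inside an element would make the heading ambiguous.
        for part in (self.study, self.query, self.data_type, self.suffix):
            if SEPARATOR in part:
                raise ValueError(f"element contains separator: {part!r}")

    def label(self) -> str:
        """Join the seven elements into a single column heading."""
        parts = [self.study, self.query, str(self.collection), self.concept,
                 self.data_type, str(self.repeat), self.suffix]
        return SEPARATOR.join(parts)


def parse_heading(label: str) -> MetadataHeading:
    """Split a heading back into its elements (the inverse of label())."""
    parts = label.split(SEPARATOR)
    if len(parts) != 7:
        raise ValueError(f"expected 7 elements, got {len(parts)}")
    study, query, collection, concept, data_type, repeat, suffix = parts
    return MetadataHeading(study, query, int(collection), concept,
                           data_type, int(repeat), suffix)


# Hypothetical example values.
heading = MetadataHeading("CKD", "Q1", 2, "CREATININE", "VALUE-UMOLL", 1, "RAW")
print(heading.label())  # CKD_Q1_2_CREATININE_VALUE-UMOLL_1_RAW
```

Because a heading built this way can be parsed back into its elements, analysis syntax could select columns by study, concept or processing stage rather than by position, which is one way a fixed naming scheme might support reusable query and syntax libraries.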
Results: The metadata system has enabled the development of a query library and an analysis syntax library that make data processing and analysis more efficient. Its stability means that more effort can go into complex data processing and into semiautomating some processes. However, implementation has not been trouble-free: it has been particularly hard to stop clinicians using multiple synonyms for the same variable.
Conclusions: The PCDQ metadata system provides an auditable method of data processing. It is a method that should improve the reliability, validity and efficiency of processing routinely collected clinical data. This paper sets out to demystify our data processing method and makes the PCDQ metadata system available to clinicians and data processors who might wish to adopt it.