We investigate options for grouping templates for the purpose of template identification and extraction from electronic medical records. We sampled a corpus of 1000 documents originating from the Veterans Health Administration (VA) electronic medical record. We grouped documents by hashing and binning tokens (Hashed) and by the top 5% of tokens ranked as important by the term frequency-inverse document frequency (TF-IDF) metric. We then compared the approaches on the number of groups containing 3 or more documents and on the resulting longest common subsequences (LCSs) shared by all documents in each group. We found that the Hashed method had a higher success rate for finding LCSs, and found longer LCSs, than the TF-IDF method. The TF-IDF approach, however, found more groups than the Hashed method and consequently more long sequences, although its LCSs were shorter on average. In conclusion, each algorithm appears to be superior in some respects.
Keywords: Boilerplates; Natural Language Processing; Templates.
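
As an illustration of the LCS step described in the abstract, the following minimal sketch extracts a token sequence common to every document in a group. It is not the authors' implementation: the function names `lcs_tokens` and `group_common_subsequence` are hypothetical, tokens are assumed to be whitespace-separated, and the group-level result is obtained by folding pairwise LCSs, which yields a subsequence shared by all documents but is not guaranteed to be the longest one.

```python
from functools import reduce


def lcs_tokens(a, b):
    """Return one longest common subsequence of two token lists (dynamic programming)."""
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the start to recover the tokens of one LCS.
    out, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


def group_common_subsequence(docs):
    """Fold pairwise LCS over a group of documents (assumed non-empty).

    The result is a subsequence of every document, but it is an
    approximation: the true multi-document LCS may be longer.
    """
    token_lists = [doc.split() for doc in docs]  # assumption: whitespace tokenization
    return reduce(lcs_tokens, token_lists)
```

Applied to a group of three or more notes, a non-empty result would correspond to a candidate template for that group; an empty result would count as a failure to find an LCS.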