Sharing models and tools for processing German clinical texts

Stud Health Technol Inform. 2015:210:734-8.


The automatic processing of non-English clinical documents is severely hampered by the lack of publicly available medical language resources for training, testing, and evaluating NLP components. We suggest sharing statistical models derived from access-protected clinical documents as a reasonable substitute, and we provide solutions for sentence splitting, tokenization, and POS tagging of German clinical texts. These three components were trained on the confidential FRAMED corpus, a non-shareable collection of various German-language clinical document types. The resulting models outperform alternative components from OpenNLP and the Stanford POS tagger that were also trained on FRAMED.

Publication types

  • Comparative Study
  • Evaluation Study

MeSH terms

  • Germany
  • Machine Learning
  • Models, Theoretical*
  • Natural Language Processing*
  • Pattern Recognition, Automated / methods*
  • Periodicals as Topic*
  • Software*
  • Terminology as Topic
  • Translating*
  • Vocabulary, Controlled