Spell checker for consumer language (CSpell)

J Am Med Inform Assoc. 2019 Mar 1;26(3):211-218. doi: 10.1093/jamia/ocy171.


Objective: Automated understanding of consumer health inquiries might be hindered by misspellings. To detect and correct various types of spelling errors in consumer health questions, we developed a distributable spell-checking tool, CSpell, that handles nonword errors, real-word errors, word boundary infractions, punctuation errors, and combinations of the above.

Methods: We developed a novel approach of using dual embedding within Word2vec for context-dependent corrections. This technique was used in combination with dictionary-based corrections in a 2-stage ranking system. We also developed various splitters and handlers to correct word boundary infractions. All correction approaches are integrated to handle errors in consumer health questions.
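The following is a minimal sketch of how a dual-embedding context score and a 2-stage ranking could fit together. It is not the CSpell implementation. It assumes that "dual embedding" means pairing the Word2vec input vectors of the context words with the output vector of each candidate, and that the two stages are a dictionary-based shortlist followed by a context-based rerank. The abstract does not specify how the stages are ordered or combined. All function names, parameters, and the shortlist size are hypothetical.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise average of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def context_score(
    candidate: str,
    context: Sequence[str],
    input_vecs: Dict[str, Vector],
    output_vecs: Dict[str, Vector],
) -> float:
    """Assumed dual-embedding score: average the INPUT vectors of the
    context words, then take the dot product with the candidate's
    OUTPUT vector. Returns 0.0 when nothing can be looked up."""
    known = [input_vecs[w] for w in context if w in input_vecs]
    if not known or candidate not in output_vecs:
        return 0.0
    return dot(mean_vector(known), output_vecs[candidate])


def rank_two_stage(
    candidates: Sequence[str],
    context: Sequence[str],
    dictionary_scores: Dict[str, float],
    input_vecs: Dict[str, Vector],
    output_vecs: Dict[str, Vector],
    shortlist_size: int = 2,  # hypothetical cutoff, not from the paper
) -> List[str]:
    """Stage 1 (assumed): keep the top candidates by a dictionary-based
    score. Stage 2 (assumed): rerank that shortlist by context score."""
    shortlist = sorted(
        candidates,
        key=lambda c: dictionary_scores.get(c, 0.0),
        reverse=True,
    )[:shortlist_size]
    return sorted(
        shortlist,
        key=lambda c: context_score(c, context, input_vecs, output_vecs),
        reverse=True,
    )
```

In this sketch the dictionary-based stage limits the candidate set cheaply, and the context stage decides among the survivors. Other arrangements, such as using one stage only as a fallback for the other, are equally consistent with the abstract.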

Results: Our approach achieves an F1 score of 80.93% and 69.17% for spelling error detection and correction, respectively.
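For readers unfamiliar with the metric, the sketch below computes F1 as the harmonic mean of precision and recall. It assumes the standard definitions. The abstract does not state how true positives, false positives, and false negatives were counted for detection versus correction, so the function name and its arguments are illustrative only.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Standard F1: harmonic mean of precision and recall.

    Returns 0.0 when precision or recall is undefined or zero,
    which avoids a division by zero.
    """
    predicted = true_positives + false_positives  # items the system flagged
    actual = true_positives + false_negatives     # items that were truly errors
    if predicted == 0 or actual == 0 or true_positives == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    return 2 * precision * recall / (precision + recall)
```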

Discussion: The dual-embedding model shows a significant improvement (9.13%) in F1 score compared with the general practice of using cosine similarity with word vectors in Word2vec for context ranking. Our 2-stage ranking system shows a 4.94% improvement in F1 score compared with the best 1-stage ranking system.
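As a point of comparison, the cosine-similarity baseline mentioned above might look like the sketch below. It assumes the baseline scores a candidate by the cosine between its word vector and the centroid of the context word vectors, all drawn from a single embedding table. The abstract does not describe the baseline in this detail, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def cosine_context_score(
    candidate: str,
    context: Sequence[str],
    word_vecs: Dict[str, Vector],
) -> float:
    """Assumed baseline: compare the candidate's vector with the centroid
    of the context vectors, using one embedding table for both sides."""
    known = [word_vecs[w] for w in context if w in word_vecs]
    if not known or candidate not in word_vecs:
        return 0.0
    centroid = [sum(column) / len(known) for column in zip(*known)]
    return cosine_similarity(word_vecs[candidate], centroid)
```

The difference from a dual-embedding score, under these assumptions, is that this baseline uses the same vector table for the candidate and the context and normalizes by vector length.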

Conclusion: CSpell improves over the state of the art and provides near real-time automatic misspelling detection and correction in consumer health questions. The software and the CSpell test set are available at https://umlslex.nlm.nih.gov/cSpell.

Publication types

  • Research Support, N.I.H., Intramural

MeSH terms

  • Algorithms*
  • Consumer Health Informatics
  • Consumer Health Information*
  • Humans
  • Information Seeking Behavior*
  • Language*
  • Natural Language Processing*