Individual Gestalt Is Unreliable for the Evaluation of Quality in Medical Education Blogs: A METRIQ Study

Ann Emerg Med. 2017 Sep;70(3):394-401. doi: 10.1016/j.annemergmed.2016.12.025. Epub 2017 Mar 2.


Study objective: Open educational resources such as blogs are increasingly used for medical education. Gestalt is generally the evaluation method used for these resources; however, little information has been published on it. We aim to evaluate the reliability of gestalt in the assessment of emergency medicine blogs.

Methods: We identified 60 English-language emergency medicine Web sites that posted clinically oriented blogs between January 1, 2016, and February 24, 2016. Ten Web sites were selected with a random-number generator. Medical students, emergency medicine residents, and emergency medicine attending physicians evaluated the 2 most recent clinical blog posts from each site for quality, using a 7-point Likert scale. The mean gestalt scores of each blog post were compared between groups with Pearson's correlations. Single-measure and average-measure intraclass correlation coefficients were calculated within groups. A generalizability study evaluated variance in gestalt ratings, and a decision study calculated the number of raters required to reliably estimate quality (>0.8).
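To illustrate the difference between single-measure and average-measure intraclass correlation coefficients, the following is a minimal sketch that computes both from a posts-by-raters matrix. It assumes a two-way random-effects, absolute-agreement model (the Shrout and Fleiss ICC(2,1) and ICC(2,k) forms); the abstract does not state which ICC model was used. The function name `icc_two_way_random` and the toy ratings are hypothetical and are not study data.

```python
from statistics import mean


def icc_two_way_random(ratings):
    """Return (single-measure, average-measure) ICC for a complete matrix.

    ratings: list of rows, one row per blog post, one column per rater.
    Assumes a two-way random-effects, absolute-agreement model; this is an
    illustrative choice, not necessarily the model used in the study.
    """
    n = len(ratings)        # number of rated posts
    k = len(ratings[0])     # number of raters
    grand = mean(v for row in ratings for v in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)                # between-post mean square
    msc = ss_cols / (k - 1)                # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))     # residual mean square

    single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    average = (msr - mse) / (msr + (msc - mse) / n)
    return single, average


# Hypothetical 7-point Likert ratings: 5 posts (rows) by 4 raters (columns).
toy_ratings = [
    [5, 4, 6, 5],
    [3, 2, 4, 4],
    [6, 5, 6, 7],
    [4, 4, 3, 5],
    [2, 3, 3, 2],
]
single_icc, average_icc = icc_two_way_random(toy_ratings)
print(f"single-measure ICC:  {single_icc:.3f}")
print(f"average-measure ICC: {average_icc:.3f}")
```

The single-measure value describes how well one rater's score stands in for the group, whereas the average-measure value describes the reliability of the mean across all raters in the matrix, which is why the latter is typically higher.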

Results: One hundred twenty-one medical students, 88 residents, and 100 attending physicians (93.6% of enrolled participants) evaluated all 20 blog posts. Single-measure intraclass correlation coefficients within groups were fair to poor (0.36 to 0.40). Average-measure intraclass correlation coefficients were more reliable (0.811 to 0.840). Mean gestalt ratings by attending physicians correlated strongly with those by medical students (r=0.92) and residents (r=0.99). The generalizability coefficient was 0.91 for the complete data set. The decision study found that 42 gestalt ratings were required to reliably evaluate quality (>0.8).
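The logic of a decision study (how many ratings must be averaged before reliability exceeds a threshold) can be sketched in its simplest single-facet form with the Spearman-Brown prediction formula. This is an assumption-laden simplification: the abstract does not report the variance components or the design of the generalizability study, so the input value below is hypothetical and the sketch is not expected to reproduce the reported figure of 42 ratings. The function names are illustrative only.

```python
def projected_reliability(single_rater_reliability, n_raters):
    """Spearman-Brown projection of the reliability of a mean of n_raters ratings."""
    r = single_rater_reliability
    return n_raters * r / (1 + (n_raters - 1) * r)


def raters_needed(single_rater_reliability, threshold=0.8, max_raters=1000):
    """Smallest number of raters whose mean rating exceeds the threshold.

    Returns None if the threshold is not exceeded within max_raters.
    """
    if not 0 < single_rater_reliability < 1:
        raise ValueError("single_rater_reliability must be between 0 and 1")
    for n in range(1, max_raters + 1):
        if projected_reliability(single_rater_reliability, n) > threshold:
            return n
    return None


# Hypothetical single-rater reliability, chosen only for illustration.
hypothetical_reliability = 0.3
print(raters_needed(hypothetical_reliability, threshold=0.8))
```

A full decision study would instead recompute the generalizability coefficient from the estimated variance components for each candidate number of raters, which can yield a different answer from this single-facet projection.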

Conclusion: The mean gestalt quality ratings that medical students, residents, and attending physicians give to blog posts correlate strongly, but individual ratings are unreliable. With sufficient raters, mean gestalt ratings provide a community standard for assessment.

MeSH terms

  • Adult
  • Blogging / standards*
  • Blogging / trends
  • Clinical Competence
  • Education, Medical / methods
  • Education, Medical / standards*
  • Educational Measurement / methods*
  • Emergency Medicine / education*
  • Female
  • Gestalt Theory*
  • Humans
  • Internship and Residency
  • Male
  • Reproducibility of Results
  • Social Media / statistics & numerical data
  • Students, Medical