Development and validation of a computerized South Asian Names and Group Recognition Algorithm (SANGRA) for use in British health-related studies

J Public Health Med. 2001 Dec;23(4):278-85. doi: 10.1093/pubmed/23.4.278.


Background: Studies on ethnic variations in health have played an important role in aetiological and health services research. Most routine datasets, however, do not include information on ethnicity. South Asians, one of the largest minority ethnic groups in Britain, have distinctive names that also allow differentiation of the main sub-groups with their important differences in health-related exposures and disease risks.

Methods: A computerized name recognition algorithm (SANGRA) was developed incorporating directories of South Asian first names and surnames together with their religious and linguistic origin. SANGRA was validated using health-related data with self-ascribed information on ethnicity.
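
The directory-based approach described in Methods can be illustrated with a minimal sketch. It is not SANGRA's actual implementation: the directory entries, the `Origin` type, the `classify` function and the rule that a surname match takes priority over a first-name match are all assumptions made for illustration, and the few names listed are toy placeholders rather than entries from the SANGRA directories.

```python
from typing import NamedTuple, Optional, Tuple


class Origin(NamedTuple):
    """Religious and linguistic origin attached to a directory entry."""
    religion: Optional[str]
    language: Optional[str]


# Toy directories (hypothetical entries, NOT taken from SANGRA).
SURNAMES = {
    "patel": Origin("Hindu", "Gujerati"),
    "kaur": Origin("Sikh", "Punjabi"),
    "khan": Origin("Muslim", None),  # language left unassigned
}
FIRST_NAMES = {
    "gurpreet": Origin("Sikh", "Punjabi"),
    "mohammed": Origin("Muslim", None),
}

NO_MATCH = Origin(None, None)


def classify(first_name: str, surname: str) -> Tuple[bool, Origin]:
    """Return (is_south_asian, origin) for one person.

    Assumption: a surname match wins; the first name is only
    consulted when the surname is not in the directory.
    """
    by_surname = SURNAMES.get(surname.strip().lower())
    by_first = FIRST_NAMES.get(first_name.strip().lower())
    hit = by_surname if by_surname is not None else by_first
    if hit is None:
        return False, NO_MATCH
    return True, hit


if __name__ == "__main__":
    for first, last in [("Jane", "Patel"), ("Mohammed", "Smith"), ("John", "Smith")]:
        print(first, last, classify(first, last))
```

A real directory would need rules for names shared across groups, which is presumably why some names could not be allocated to a single-language category.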

Results: SANGRA recognized South Asian origin in the reference datasets with sensitivity of 89-96 per cent, specificity of 94-98 per cent, positive predictive value (PPV) of 80-89 per cent and negative predictive value (NPV) of 98-99 per cent. Religious origin was correctly assigned in the majority of cases: sensitivity, specificity and PPV were 94 per cent, 91 per cent and 90 per cent for Hindus; 90 per cent, 99 per cent and 98 per cent for Muslims; and 76 per cent, 99 per cent and 94 per cent for Sikhs. SANGRA correctly identified 76 per cent of Gujerati and 70 per cent of Punjabi names, although only 62 per cent of Gujerati names were sufficiently distinct to be allocated to the Gujerati-only category and only 53 per cent of Punjabi names were allocated to the Punjabi-only category. Specificity and PPV were nevertheless high for both languages (97 per cent and 93 per cent, respectively, for Gujerati, and 99 per cent and 97 per cent for Punjabi).
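
For readers less familiar with the four validation measures reported above, the sketch below shows how they are derived from a 2x2 table comparing the algorithm's classification against self-ascribed ethnicity. The function name `validation_metrics` and the counts in the usage example are hypothetical and are not taken from the study's data.

```python
def validation_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute standard validation measures from a 2x2 table.

    tp: classified South Asian, self-ascribed South Asian
    fp: classified South Asian, self-ascribed other
    fn: classified other, self-ascribed South Asian
    tn: classified other, self-ascribed other
    """
    return {
        "sensitivity": tp / (tp + fn),  # true cases that were recognized
        "specificity": tn / (tn + fp),  # non-cases correctly excluded
        "ppv": tp / (tp + fp),          # positive classifications that are correct
        "npv": tn / (tn + fn),          # negative classifications that are correct
    }


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    metrics = validation_metrics(tp=90, fp=20, fn=10, tn=880)
    for name, value in metrics.items():
        print(f"{name}: {value:.1%}")
```

Unlike sensitivity and specificity, PPV and NPV depend on the proportion of South Asians in the dataset being classified, so values obtained in one reference dataset need not carry over to a population with a different ethnic mix.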

Conclusions: SANGRA provides a practical and valid method of ascertaining South Asian origin by name and, to a lesser degree of accuracy, of differentiating between the main religious and linguistic subgroups living in Britain. This algorithm will be useful in health-related studies where information on self-ascribed ethnicity is not available or is of a limited nature.

Publication types

  • Research Support, Non-U.S. Gov't
  • Validation Study

MeSH terms

  • Algorithms*
  • Asia, Southeastern / ethnology
  • Database Management Systems*
  • Directories as Topic
  • Ethnicity / classification*
  • Ethnicity / statistics & numerical data
  • Health Status*
  • Humans
  • Language
  • Names*
  • Patient Admission
  • Patient Identification Systems
  • Religion
  • Software
  • United Kingdom / epidemiology