Facilitating research on racial and ethnic disparities and inequities in transportation: Application and evaluation of the Bayesian Improved Surname Geocoding (BISG) algorithm

Traffic Inj Prev. 2021;22(sup1):S32-S37. doi: 10.1080/15389588.2021.1955109. Epub 2021 Aug 17.


Objective: Racial and ethnic disparities and/or inequities have been documented in traffic safety research. However, race/ethnicity data are often not captured in population-level traffic safety databases, limiting the field's ability to comprehensively study racial/ethnic differences in transportation outcomes, as well as our ability to mitigate them. To overcome this limitation, we explored the utility of estimating race and ethnicity for drivers in the New Jersey Safety and Health Outcomes (NJ-SHO) data warehouse using the Bayesian Improved Surname Geocoding (BISG) algorithm. In addition, we summarize important recommendations established to guide researchers developing and implementing racial and ethnic disparity research.

Methods: We applied BISG to estimate population-level race/ethnicity for New Jersey drivers in 2017 and evaluated the concordance between reported values available in integrated administrative sources (e.g., hospital records) and BISG probability distributions using the area under the receiver operating characteristic curve (AUC) within each race/ethnicity category. Overall AUC was calculated by weighting each AUC value by the population count in each reported category. In an exemplar analysis, we calculated average monthly police-reported crash rates in 2017 by race/ethnicity using both the NJ-SHO reported values and the BISG probabilities, and compared the results.
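
The concordance evaluation described above can be illustrated with a minimal sketch. It assumes a one-vs-rest AUC per category (the reported category as the positive class, the BISG probability for that category as the score), computed with the rank-based Mann-Whitney formulation, and an overall AUC weighted by the count of drivers in each reported category. The function names and input layout are hypothetical and are not taken from the study.

```python
from collections import Counter


def auc_one_vs_rest(labels, scores):
    """AUC via the Mann-Whitney rank statistic.

    labels: list of booleans (True = driver reported in this category).
    scores: list of BISG probabilities for this category.
    """
    pairs = sorted(zip(scores, labels), key=lambda p: p[0])
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over any run of tied scores.
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank for the tied run
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(1 for _, y in pairs if y)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs both positive and negative cases")
    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def weighted_overall_auc(reported, probs):
    """Per-category AUCs and their count-weighted average.

    reported: list of reported category labels, one per driver.
    probs: list of dicts mapping category -> BISG probability.
    """
    counts = Counter(reported)
    per_category = {}
    for cat in counts:
        labels = [r == cat for r in reported]
        scores = [p.get(cat, 0.0) for p in probs]
        per_category[cat] = auc_one_vs_rest(labels, scores)
    overall = sum(per_category[c] * counts[c] for c in counts) / len(reported)
    return per_category, overall
```

In practice a library routine such as scikit-learn's `roc_auc_score` would likely replace the hand-written AUC; the pure-Python version is used here only to keep the sketch self-contained.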

Results: We found excellent or outstanding concordance (AUC ≥ 0.86) between reported race/ethnicity and BISG probabilities for White, Hispanic, Black, and Asian/Pacific Islander drivers. We found poor concordance for American Indian/Alaskan Native drivers (AUC = 0.65), and concordance was no better than random assignment for Multiracial drivers (AUC = 0.52). Among White, Hispanic, Asian/Pacific Islander, and American Indian/Alaskan Native drivers, monthly crash rates calculated using NJ-SHO reported race/ethnicity values were similar to those calculated using BISG probabilities. Monthly crash rates differed by 11% for Black drivers, and by more than 200% for Multiracial drivers.
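
The abstract does not state how BISG probabilities entered the crash rate calculation. One plausible approach, sketched below purely as an assumption, is to let each driver contribute fractionally to every category in proportion to that driver's BISG probability, in both the numerator (drivers who crashed) and the denominator (all drivers). The function name, input layout, and the `per` scaling parameter are hypothetical, and the sketch covers a single period rather than monthly averages.

```python
def bisg_weighted_rates(drivers, categories, per=10_000):
    """Probability-weighted crash rate per category for one period.

    drivers: list of (probs, crashed) tuples, where probs maps
             category -> BISG probability and crashed is a boolean.
    categories: iterable of category names to report.
    per: rate denominator (crashes per `per` drivers); an assumed scale.
    """
    rates = {}
    for cat in categories:
        # Expected number of drivers in this category.
        denom = sum(p.get(cat, 0.0) for p, _ in drivers)
        # Expected number of crash-involved drivers in this category.
        numer = sum(p.get(cat, 0.0) for p, crashed in drivers if crashed)
        rates[cat] = per * numer / denom if denom > 0 else float("nan")
    return rates
```

Comparing these rates with rates computed from reported race/ethnicity, category by category, gives the kind of percent difference summarized in the results above.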

Conclusion: The excellent or outstanding concordance between reported race/ethnicity and BISG probabilities, and the mostly similar crash rates derived from each, for White, Hispanic, Black, and Asian/Pacific Islander drivers (98.9% of all drivers in this sample) demonstrate the potential utility of BISG in enabling research on transportation disparities and inequities. Concordance between race/ethnicity values was not acceptable for American Indian/Alaskan Native and Multiracial drivers, which is consistent with previous applications and evaluations of BISG. Future work is needed to determine the extent to which BISG may be applied to traffic safety contexts.

Keywords: Traffic crashes; data integration; data warehousing; epidemiology; minority health; public health informatics.

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Accidents, Traffic
  • Algorithms
  • Bayes Theorem
  • Ethnicity*
  • Geographic Mapping*
  • Humans
  • United States