We compare mixed effects logistic regression models for binary response data with two nested levels of clustering. The comparison is made in the context of developmental toxicity data sets, in which multiple types of outcomes (first level) are measured on each rat pup (second level) nested within a litter (third level). Because the nested structure of such data is sometimes handled by ignoring one level of clustering, we consider three models: (i) a three-level model adjusting for clustering due to both pup and litter (M1); (ii) a two-level model adjusting for pup only (M2); and (iii) a two-level model adjusting for litter only (M3). The three types of effects of interest are: (i) differences among malformation types (first-level effects); (ii) differences among groups of pups (for example, by sex of pup; second-level effects); and (iii) differences among groups of litters (for example, by dose; third-level effects). Simulations and data analyses suggest that the M3 model leads to more bias than the M1 or M2 models for all three types of effects. In terms of coverage of confidence intervals for the fixed effects log odds ratio parameters, the M1 model achieves nominal coverage, whereas the M2 model has reduced coverage for the third-level effects and the M3 model has poor coverage for both first- and second-level effects. These reductions in coverage for certain model-parameter combinations worsen as baseline risk increases. The data analyses support these simulation-based conclusions to some extent.
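
As a rough illustration of how the three models differ, they can be sketched as random-intercept logistic models. The notation here is an assumption made for this sketch and is not taken from the text above: $Y_{ijk}$ is the binary indicator for outcome type $i$ on pup $j$ in litter $k$, $x_{ijk}$ is a covariate vector (outcome type, pup-level, and litter-level covariates), $\beta$ is the vector of fixed effects log odds ratios, and $u_{jk}$ and $v_k$ are hypothetical pup-level and litter-level random intercepts, assumed normal and mutually independent. The random-effects structure actually used may differ.

```latex
% Sketch only: random intercepts and normality are assumptions.
\begin{align*}
\text{M1:}\quad & \operatorname{logit} \Pr(Y_{ijk}=1 \mid u_{jk}, v_k)
    = x_{ijk}^{\top}\beta + u_{jk} + v_k, \\
\text{M2:}\quad & \operatorname{logit} \Pr(Y_{ijk}=1 \mid u_{jk})
    = x_{ijk}^{\top}\beta + u_{jk}, \\
\text{M3:}\quad & \operatorname{logit} \Pr(Y_{ijk}=1 \mid v_k)
    = x_{ijk}^{\top}\beta + v_k, \\
& u_{jk} \sim N(0, \sigma_u^2), \qquad v_k \sim N(0, \sigma_v^2).
\end{align*}
```

Under this sketch, M2 drops the litter-level term $v_k$ and M3 drops the pup-level term $u_{jk}$, which matches the description of each two-level model ignoring one level of clustering.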