Significant improvements in automated image analysis have been achieved in recent years, and such tools are now increasingly used in computer-assisted syndromology. However, the ability to recognize a syndromic facial gestalt may depend on the syndrome itself and can also be confounded by phenotype severity, the size of the available training set, ethnicity, age, and sex. Benchmarking and comparing the performance of deep-learned classifiers is therefore inherently difficult. For a systematic analysis of these influencing factors, we chose the lysosomal storage diseases mucolipidosis and mucopolysaccharidosis types I and II, which are known for their wide and overlapping phenotypic spectra. For dysmorphic comparison we used Smith-Lemli-Opitz syndrome, another inborn error of metabolism, and Nicolaides-Baraitser syndrome, another disorder characterized by coarse facies. A classifier trained on these five cohorts, comprising 289 patients in total, achieved a mean accuracy of 62%. We also developed a simulation framework to analyze the effect of potential confounders, such as cohort size, age, sex, and ethnic background, on the distinguishability of phenotypes. We found that the true positive rate increases with growing cohort size (n = 10 to 40) for all analyzed disorders, whereas ethnicity and sex have no significant influence. The dynamics of the accuracies strongly suggest that the maximum distinguishability is a phenotype-specific value that has not yet been reached for any of the studied disorders. This should also further motivate data sharing efforts, as computer-assisted syndrome classification can still be improved by enlarging the available training sets.
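The core idea of the simulation framework, estimating how the true positive rate responds to cohort size by repeatedly training and evaluating on subsampled cohorts, can be illustrated with a minimal sketch. Everything below is a toy illustration under stated assumptions, not the study's actual pipeline: the "cohorts" are synthetic Gaussian feature vectors, the classifier is a simple nearest-centroid rule, and all names and parameters (`effect`, `n_trials`, feature dimension) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_tpr(n_per_cohort, n_features=20, effect=0.3, n_trials=500):
    """Toy estimate of the true positive rate of a nearest-centroid
    classifier for two synthetic 'syndrome' cohorts of a given size.
    Purely illustrative; not the classifier used in the study."""
    hits = 0
    for _ in range(n_trials):
        # two cohorts drawn from Gaussians whose means differ by `effect`
        a = rng.normal(0.0, 1.0, (n_per_cohort, n_features))
        b = rng.normal(effect, 1.0, (n_per_cohort, n_features))
        # hold out one 'patient' from cohort b, estimate centroids on the rest
        test = b[0]
        mu_a = a.mean(axis=0)
        mu_b = b[1:].mean(axis=0)
        # a true positive: the held-out patient is closer to its own cohort
        if np.linalg.norm(test - mu_b) < np.linalg.norm(test - mu_a):
            hits += 1
    return hits / n_trials

# sweep cohort sizes in the range analyzed in the study (n = 10..40)
for n in (10, 20, 40):
    print(n, round(simulate_tpr(n), 2))
```

In this toy setting the TPR rises with cohort size because larger cohorts yield less noisy centroid estimates, mirroring the qualitative trend reported above; the ceiling it approaches is set by the separation of the underlying distributions, analogous to a phenotype-specific maximum distinguishability.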