Land use regression (LUR) models have become a popular way to explain the spatial variation of air pollution concentrations. Independent evaluation of these models is important. We developed LUR models for nitrogen dioxide (NO₂) using measurements conducted at 144 sampling sites in the Netherlands. Sites were randomly divided into training data sets of 24, 36, 48, 72, 96, 108, and 120 sites. LUR models were evaluated using (1) internal leave-one-out cross-validation (LOOCV) within the training data sets and (2) external hold-out validation (HV) against independent test data sets. In addition, we calculated mean square error (MSE) based validation R² values. With an increasing number of training sites, the mean adjusted model R² decreased slightly from 0.87 to 0.82 and the mean LOOCV R² from 0.83 to 0.79. In contrast, the mean HV R² was lowest (0.60) with the smallest training sets and increased to 0.74 with the largest training sets. At sites with out-of-range values for predictor variables, predicted concentrations were more accurate after these values were changed to the minimum or maximum of the range observed in the corresponding training data set. LUR models for NO₂ perform less well when evaluated against independent measurements if they are based on relatively small training sets. In our specific application, however, models based on as few as 24 training sites achieved acceptable hold-out validation R² values of 0.60 on average.
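
The evaluation steps described above (LOOCV inside a training set, hold-out validation on independent sites, an MSE-based R², and truncating out-of-range predictor values to the training range) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not the study's actual procedure: the function names are hypothetical, the data are synthetic, the model is plain ordinary least squares on fixed predictors (no variable selection), the correlation-based R² is taken as the squared Pearson correlation, and the MSE-based R² is taken as 1 − MSE / variance of the observations.

```python
import numpy as np

def fit_ols(X, y):
    """Fit y = b0 + X @ b by ordinary least squares; returns [b0, b1, ...]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    """Predict from coefficients returned by fit_ols."""
    return coef[0] + X @ coef[1:]

def loocv_predictions(X, y):
    """Leave-one-out predictions: refit without site i, then predict site i."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        coef = fit_ols(X[keep], y[keep])
        preds[i] = predict(coef, X[i:i + 1])[0]
    return preds

def corr_r2(observed, predicted):
    """Squared Pearson correlation (assumed definition of regression-based R2)."""
    return np.corrcoef(observed, predicted)[0, 1] ** 2

def mse_r2(observed, predicted):
    """MSE-based R2 (assumed definition): 1 - MSE / variance of observations."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    mse = np.mean((observed - predicted) ** 2)
    return 1.0 - mse / np.var(observed)

def clip_to_training_range(X_train, X_test):
    """Set out-of-range test predictors to the training minimum or maximum."""
    return np.clip(X_test, X_train.min(axis=0), X_train.max(axis=0))

if __name__ == "__main__":
    # Synthetic stand-in for 144 sites with 3 predictors (not real data).
    rng = np.random.default_rng(0)
    X = rng.uniform(0.0, 1.0, size=(144, 3))
    y = 20.0 + X @ np.array([10.0, 5.0, -3.0]) + rng.normal(0.0, 2.0, size=144)

    # Random split: 24 training sites, remaining 120 sites held out.
    order = rng.permutation(144)
    train, test = order[:24], order[24:]

    coef = fit_ols(X[train], y[train])
    loocv = loocv_predictions(X[train], y[train])
    hv_raw = predict(coef, X[test])
    hv_clipped = predict(coef, clip_to_training_range(X[train], X[test]))

    print("LOOCV R2:          ", corr_r2(y[train], loocv))
    print("HV R2:             ", corr_r2(y[test], hv_raw))
    print("HV MSE-R2:         ", mse_r2(y[test], hv_raw))
    print("HV MSE-R2, clipped:", mse_r2(y[test], hv_clipped))
```

Unlike the squared correlation, the MSE-based R² penalizes systematic bias in the predictions and can become negative when a model predicts worse than the mean of the observations, which is one reason to report both measures side by side.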