There is currently a great deal of interest in creating computational tools for predicting the pharmacological properties of drug development candidates, ranging from physicochemical properties such as pKa and solubility to more complex biological properties such as oral bioavailability and toxicity. In many cases the limiting factor is a shortage of good data from which to construct training sets. In other cases, large amounts of data are available, but they rely on surrogate end-points or cover compounds very different from those usually encountered in drug discovery and development. In such cases, large training sets and global models are not necessarily better than local models based on smaller data sets. Such considerations make it as important to examine the available data carefully, so as to avoid over-interpreting the resulting models, as it is to minimise prediction errors per se. The kinds of complications likely to be encountered in in vitro hepatotoxicity modelling are discussed in general terms and illustrated by SIMCA analysis of data from cultured hepatocyte assays for two data sets: a large, structurally diverse one and a smaller, much more focussed one.