A major problem with nonrigid image registration techniques is their high computational cost. As a result, these methods have found limited application in clinical situations where fast execution is required, e.g., intraoperative imaging. This paper presents a parallel implementation of a nonrigid image registration algorithm. It takes advantage of shared-memory multiprocessor computer architectures through multithreaded programming, partitioning either data or tasks depending on the computational subproblem. The scaling behavior of the algorithm is quantitatively analyzed for three different biomedical applications (intraoperative brain deformation, contrast-enhanced MR mammography, and intersubject brain registration). The method is demonstrated to compute intraoperative brain deformation in less than a minute using 64 CPUs on a 128-CPU shared-memory supercomputer (SGI Origin 3800). Its serial component is shown to be no more than 2% of the total computation time, allowing a theoretical speedup of at least a factor of 50. In most cases, the theoretical limit of the speedup is substantially higher (up to 132-fold in the application examples presented in this paper). The parallel implementation of our algorithm is therefore capable of solving nonrigid registration problems with short execution time requirements and may be considered an important step toward applying such techniques to clinically important problems such as the computation of brain deformation during cranial image-guided surgery.
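
The relation between the serial component and the speedup limits quoted above can be illustrated with a short calculation. The sketch below assumes the standard Amdahl's-law model for a fixed problem size, which the text above does not name explicitly; the function names are hypothetical and chosen only for illustration. Under that assumption, a serial fraction of 2% gives a limiting speedup of 1/0.02 = 50, and a limit of 132 corresponds to a serial fraction of roughly 0.76%.

```python
import math


def amdahl_speedup(serial_fraction: float, n_cpus: int) -> float:
    """Speedup predicted by Amdahl's law (an assumed model, not taken from the text).

    serial_fraction: share of single-CPU run time that cannot be parallelized.
    n_cpus: number of processors working on the parallel part.
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must lie in [0, 1]")
    if n_cpus < 1:
        raise ValueError("n_cpus must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cpus)


def speedup_limit(serial_fraction: float) -> float:
    """Limit of the speedup as the number of CPUs grows without bound."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must lie in [0, 1]")
    return math.inf if serial_fraction == 0.0 else 1.0 / serial_fraction


def serial_fraction_for_limit(limit: float) -> float:
    """Serial fraction that would produce a given limiting speedup."""
    if limit < 1.0:
        raise ValueError("limit must be at least 1")
    return 1.0 / limit


if __name__ == "__main__":
    # Serial component of 2%, the upper bound stated in the abstract.
    print(f"limit at 2% serial: {speedup_limit(0.02):.1f}")
    # Serial fraction implied by the 132-fold limit quoted in the abstract.
    print(f"serial fraction for 132-fold limit: {serial_fraction_for_limit(132):.4%}")
    # Model prediction for 64 CPUs at the 2% bound (not a measured result).
    print(f"predicted speedup on 64 CPUs: {amdahl_speedup(0.02, 64):.1f}")
```

The values printed for 64 CPUs are predictions of the assumed model at the worst-case serial fraction, not measurements reported by the paper; the measured scaling behavior is what the quantitative analysis in the paper addresses.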