This report describes a systematic effort to test all functions of a large 3-D radiation therapy planning program, including graphics and user interaction. Previous studies in quality assurance for radiation therapy programs do not adequately address the problem of programming errors. They compare dose estimates calculated by planning programs to actual doses measured in phantoms, so they cannot distinguish programming errors from measurement errors or physical unsoundness of the beam model. Moreover, they fail to exercise graphics and user interaction functions. This report describes a different methodology: test cases are derived from the program specification, expected results are calculated by an independent technique, and these are compared to program output. Derivation of test cases is described in detail. Testing effectiveness is assessed by reporting the number of errors revealed by testing and comparing it with the number of errors discovered during routine use, across five successive program versions. The size of the test set is related to the total program size, and the effort devoted to deriving and performing tests is compared to the total program development effort. We conclude that systematic testing can reveal errors that are not found by informal testing, routine program use, or comparison with measurements. However, additional errors remain that are discovered only during use. This study suggests that a typical large planning system may contain more than 100 errors when it is released for clinical use. Methods for increasing testing effectiveness are recommended.
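The oracle-comparison step described above can be sketched as follows. This is a minimal illustration, not code from the paper: it assumes each test case carries an expected dose computed by an independent technique (e.g. a hand calculation), which is then compared to the planning program's output within a stated relative tolerance. All names (within_tolerance, run_test_case, the 2% tolerance) are hypothetical.

```python
def within_tolerance(expected: float, actual: float, rel_tol: float = 0.02) -> bool:
    """Pass if the program's dose agrees with the independently
    calculated expected dose to within rel_tol (here, an assumed 2%)."""
    if expected == 0.0:
        # Avoid division by zero for zero-dose reference points.
        return abs(actual) < 1e-9
    return abs(actual - expected) / abs(expected) <= rel_tol

def run_test_case(expected_dose: float, program_dose: float) -> str:
    """Compare the planning program's dose against the independent
    calculation and report a pass/fail verdict for the test case."""
    return "PASS" if within_tolerance(expected_dose, program_dose) else "FAIL"
```

For example, a program output of 101.5 against an expected 100.0 would pass under the assumed 2% tolerance, while 105.0 would fail and be logged as a discrepancy for investigation.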