When treatment effect in chronic low back pain is measured with multi-item outcome instruments, the clinical importance of the outcome scores must be understood, both for clinical decision-making and for research. The present study had three aims: first, to estimate the minimal clinically important difference of three multi-item outcome instruments (the Oswestry Disability Index, the General Function Score and the Zung Depression Scale) and of the visual analogue scale (VAS) of back pain; second, to estimate the error of measurement of these instruments; and third, to describe the clinical meaning of score change. The study population consisted of 289 patients treated surgically or non-surgically in a randomised controlled trial. The minimal clinically important difference was estimated with patient global assessment as the external criterion and was compared with the standard error of measurement of the instruments. The individual items of the instruments were compared for score changes related to improvement and deterioration. The standard error of measurement of the Oswestry Disability Index, the General Function Score and the Zung Depression Scale was 4, 6 and 3 units, respectively. The 95% tolerance interval was 10, 16 and 8 units, respectively. The minimal clinically important difference was 10, 12 and 8-9 units, respectively, thus not significantly exceeding the tolerance interval. The minimal clinically important difference of VAS back pain was 18-19 units, well exceeding the 95% tolerance interval, which was 15 units. Improvement after treatment for chronic low back pain tends to occur to a greater extent in sleep disturbance, ability to do usual things and psychological irritability, but to a lesser extent in the ability to sit, stand and lift.
We conclude that the VAS of back pain is responsive enough to detect the minimal clinically important difference, whereas the smallest acceptable score changes of the Oswestry Disability Index, the General Function Score and the Zung Depression Scale may need to be increased to exceed the 95% tolerance interval when used for clinical decision-making and for power calculation. Despite improvement after treatment, the ability to sit, stand and lift remains a notable problem.
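The comparison behind this conclusion can be illustrated with a minimal Python sketch that checks, for each instrument, whether the reported minimal clinically important difference (MCID) lies strictly above the reported 95% tolerance interval. The numbers are those given above; the dictionary layout, the function names and the choice to use the lower bound of a reported MCID range are assumptions made for illustration only, not the study's statistical procedure.

```python
# Values as reported in the abstract, in units of each instrument's scale.
# MCID is stored as a (low, high) range; a single value is repeated.
REPORTED = {
    "Oswestry Disability Index": {"tolerance_95": 10, "mcid": (10, 10)},
    "General Function Score": {"tolerance_95": 16, "mcid": (12, 12)},
    "Zung Depression Scale": {"tolerance_95": 8, "mcid": (8, 9)},
    "VAS back pain": {"tolerance_95": 15, "mcid": (18, 19)},
}


def mcid_exceeds_tolerance(entry):
    """Return True if the lower bound of the MCID is strictly above
    the 95% tolerance interval (an illustrative, conservative rule)."""
    low, _high = entry["mcid"]
    return low > entry["tolerance_95"]


def summarise(data):
    """Map each instrument name to the result of the comparison."""
    return {name: mcid_exceeds_tolerance(entry) for name, entry in data.items()}


if __name__ == "__main__":
    for name, exceeds in summarise(REPORTED).items():
        verdict = "exceeds" if exceeds else "does not exceed"
        print(f"{name}: MCID {verdict} the 95% tolerance interval")
```

Under this rule only the VAS of back pain passes, which matches the conclusion stated above. The rule is a plain numerical comparison and does not reproduce any significance testing.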