High reliability and validity of clinical rating schemes are crucial for their use as outcome measures in the treatment of hip and knee osteoarthritis. In this paper, we review the empirical evidence on the reliability and validity of commonly used clinical scores. Clinical scores and related reliability and validity studies were identified by a systematic literature search. Scores were classified according to type and joint. Reliability and validity studies were characterized according to design, population, number and qualification of observers, number of measurements, time interval between repeat measurements, and results. Reliability and validity studies were reported for only 6 and 15 of the 45 identified clinical scores, respectively. Although comparisons are difficult due to differences in study design, relatively high reliability was reported for most measurements of pain, stiffness, and physical function, while results were less conclusive for clinical signs. Most validity studies focused on the correlation between various scores. Correlation was generally found to be high for overall numerical ratings, but scores often differed with respect to the interpretation of these ratings. Validity has been studied more comprehensively for Lequesne's scores, WOMAC, and ILAS, and these scores have shown satisfactory responsiveness to different treatment effects. Overall, knowledge of the reliability and validity of clinical scores for hip and knee osteoarthritis is limited, underlining the need for further properly designed and conducted studies.