Human T-lymphotropic virus type 1 (HTLV-1) causes leukaemia or chronic inflammatory disease in approximately 5% of infected hosts. The level of proviral expression of HTLV-1 differs significantly among infected people, even at the same proviral load (proportion of infected mononuclear cells in the circulation). A high level of expression of the HTLV-1 provirus is associated with a high proviral load and a high risk of the inflammatory disease of the central nervous system known as HTLV-1-associated myelopathy/tropical spastic paraparesis (HAM/TSP). But the factors that control the rate of HTLV-1 proviral expression remain unknown. Here we show that proviral integration sites of HTLV-1 in vivo are not randomly distributed within the human genome but are associated with transcriptionally active regions. Comparison of proviral integration sites between individuals with high and low levels of proviral expression, and between provirus-expressing and provirus non-expressing cells from within an individual, demonstrated that frequent integration into transcription units was associated with an increased rate of proviral expression. An increased frequency of integration sites in transcription units in individuals with high proviral expression was also associated with the inflammatory disease HAM/TSP. By comparing the distribution of integration sites in human lymphocytes infected in short-term cell culture with those from persistent infection in vivo, we infer the action of two selective forces that shape the distribution of integration sites in vivo: positive selection for cells containing proviral integration sites in transcriptionally active regions of the genome, and negative selection against cells with proviral integration sites within transcription units.
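The group comparisons summarized above amount to asking whether the proportion of integration sites lying within transcription units differs between two sets of sites. The following is a minimal sketch of one way such a comparison could be made, assuming a 2x2 contingency table and a two-sided Fisher's exact test; the abstract does not state which statistical test was used, and the function names and the counts in the usage lines are hypothetical placeholders, not data from the study.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Odds ratio for the 2x2 table [[a, b], [c, d]].

    Rows: two groups (e.g. high vs low proviral expression).
    Columns: sites inside vs outside transcription units.
    Raises ZeroDivisionError if b * c == 0.
    """
    return (a * d) / (b * c)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins whose probability does not exceed that of the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# Hypothetical counts for illustration only (not taken from the study):
# group 1: 60 sites inside transcription units, 40 outside
# group 2: 40 sites inside transcription units, 60 outside
example_or = odds_ratio(60, 40, 40, 60)
example_p = fisher_exact_two_sided(60, 40, 40, 60)
```

An exact test is used in this sketch because counts of mapped integration sites per group can be small; with larger counts a chi-squared test would be a reasonable alternative.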