Cities are home to an increasing majority of the world's population. Currently, it is difficult to track social, economic, environmental and health outcomes in cities with the high spatial and temporal resolution needed to evaluate policies regarding urban inequalities. We applied a deep learning approach to street images to measure the spatial distributions of income, education, unemployment, housing, living environment, health and crime. Our model predicts different outcomes directly from raw images without extracting intermediate user-defined features. To evaluate the performance of the approach, we first trained neural networks on a subset of images from London using ground truth data at high spatial resolution from official statistics. We then compared how well the trained networks separated the best-off from the worst-off deciles for different outcomes in images not used in training. The best performance was achieved for quality of the living environment and mean income. Allocation was least successful for crime and self-reported health (but not objectively measured health). We also evaluated how networks trained in London predict outcomes in three other major cities in the UK: Birmingham, Manchester and Leeds. The transferability analysis showed that networks trained in London and fine-tuned with only 1% of images from the other cities achieved performance similar to that of networks trained on data from the target cities themselves. Our findings demonstrate that street imagery has the potential to complement traditional survey-based and administrative data sources for high-resolution urban surveillance to measure inequalities and monitor the impacts of policies that aim to address them.