Abductive reasoning seeks the most likely explanation for a set of partial observations. Although frequently employed in everyday human reasoning, abduction is rarely explored in the computer vision literature. In this article, we propose a new task, Visual Abductive Reasoning (VAR), which underpins the study of machine abductive reasoning in everyday visual situations. Given an incomplete set of visual events, AI systems are required to not only describe what is observed, but also infer the hypothesis that best explains the observed premise. We create the first large-scale VAR dataset, which contains a total of 9K examples. We further devise a transformer-based VAR model, REASONERv2, for knowledge-driven, causal-and-cascaded reasoning. REASONERv2 first adopts a contextualized directional position embedding strategy in the encoder to capture the causality-related temporal structure of the observations and to yield discriminative representations for the premises and hypotheses. Then, REASONERv2 extracts condensed causal knowledge from external knowledge bases for reasoning beyond the observations. Finally, REASONERv2 cascades multiple decoders to generate and progressively refine the premise and hypothesis sentences, with the sentences' prediction scores guiding cross-sentence information flow during the cascaded reasoning procedure. Our VAR benchmarking results show that REASONERv2 surpasses many well-known video-language models, yet still falls far behind human performance. Code and dataset are available at: https://github.com/leonnnop/VAR.
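As a rough illustration of the components named above, the sketch below shows one way a directional position embedding in the encoder and confidence-guided cascaded decoders could be wired together in PyTorch. The abstract does not specify these details, so every module name, shape, and the gating scheme here (DirectionalPositionEmbedding, CascadedDecoders, the confidence head) is an assumption for illustration, not the authors' implementation.

```python
# Hypothetical sketch loosely following the abstract's description of REASONERv2;
# all module names, shapes, and the gating scheme are illustrative assumptions.
import torch
import torch.nn as nn


class DirectionalPositionEmbedding(nn.Module):
    """Adds embeddings encoding whether each video segment lies before, inside,
    or after the hidden hypothesis region, so the encoder can reflect the
    causal ordering of premise and hypothesis segments."""

    def __init__(self, dim: int, max_len: int = 128):
        super().__init__()
        self.abs_pos = nn.Embedding(max_len, dim)   # absolute temporal slot
        self.direction = nn.Embedding(3, dim)       # 0: before, 1: hypothesis, 2: after

    def forward(self, feats: torch.Tensor, direction_ids: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_segments, dim); direction_ids: (batch, num_segments)
        pos_ids = torch.arange(feats.size(1), device=feats.device)
        return feats + self.abs_pos(pos_ids) + self.direction(direction_ids)


class CascadedDecoders(nn.Module):
    """Stacks several decoders; each stage refines the previous stage's sentence
    states, and a per-sentence confidence score controls how much of the
    refinement is carried forward to the next stage."""

    def __init__(self, dim: int, num_stages: int = 3, vocab: int = 10000):
        super().__init__()
        make_layer = lambda: nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.stages = nn.ModuleList(
            nn.TransformerDecoder(make_layer(), num_layers=2) for _ in range(num_stages)
        )
        self.scorer = nn.Linear(dim, 1)   # sentence-level confidence head
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, tgt: torch.Tensor, memory: torch.Tensor):
        for stage in self.stages:
            refined = stage(tgt, memory)
            # higher-confidence refinements overwrite more of the previous states
            conf = torch.sigmoid(self.scorer(refined.mean(dim=1, keepdim=True)))
            tgt = conf * refined + (1 - conf) * tgt
        return self.lm_head(tgt), conf


if __name__ == "__main__":
    dim, batch, segs, words = 256, 2, 10, 12
    enc_feats = torch.randn(batch, segs, dim)
    direction_ids = torch.randint(0, 3, (batch, segs))
    memory = DirectionalPositionEmbedding(dim)(enc_feats, direction_ids)
    logits, conf = CascadedDecoders(dim)(torch.randn(batch, words, dim), memory)
    print(logits.shape, conf.shape)   # (2, 12, 10000), (2, 1, 1)
```

The confidence-weighted blend between stages is only one plausible reading of "prediction scores guide cross-sentence information flow"; the paper itself should be consulted for the actual mechanism.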