Detecting objects in cluttered scenes and estimating articulated human body parts from 2D images are two challenging problems in computer vision. The difficulty is particularly pronounced in activities involving human-object interactions (e.g., playing tennis), where the relevant objects tend to be small or only partially visible and the human body parts are often self-occluded. We observe, however, that objects and human poses can serve as mutual context for each other: recognizing one facilitates the recognition of the other. In this paper, we propose a mutual context model to jointly model objects and human poses in human-object interaction activities. In our approach, object detection provides a strong prior for better human pose estimation, while human pose estimation improves the accuracy of detecting the objects that interact with the human. On a six-class sports data set and a 24-class data set of people interacting with musical instruments, we show that our mutual context model outperforms the state of the art in detecting very difficult objects and estimating human poses, as well as in classifying human-object interaction activities.