In this paper we introduce a method for multi-class, monocular 3D object detection from a single RGB image, which exploits a novel disentangling transformation and a novel, self-supervised confidence estimation method for predicted 3D bounding boxes. The proposed disentangling transformation isolates the contribution made by different groups of parameters to a given loss, without changing its nature. This brings two advantages: i) it simplifies the training dynamics in the presence of losses with complex interactions of parameters; and ii) it allows us to avoid the issue of balancing independent regression terms. We further apply this disentangling transformation to another novel, signed Intersection-over-Union criterion-driven loss for improving 2D detection results. We also critically review the AP metric used in KITTI3D and resolve a flaw which affected and biased all previously published results on monocular 3D detection. Our improved metric is now used as official KITTI3D metric. We provide extensive experimental evaluations and ablation studies on the KITTI3D and nuScenes datasets, setting new state-of-the-art results. We provide additional results on all the classes of KITTI3D as well as nuScenes datasets to further validate the robustness of our method, demonstrating its ability to generalize for different types of objects.