Learning metric volume estimation of fruits and vegetables from short monocular video sequences

Heliyon. 2023 Mar 21;9(4):e14722. doi: 10.1016/j.heliyon.2023.e14722. eCollection 2023 Apr.

Abstract

We present a novel approach for extracting metric volume information of fruits and vegetables from short monocular video sequences and the associated inertial data recorded with a hand-held smartphone. Segmentation masks estimated by a pre-trained object detector are fused with the change in relative pose predicted from the inertial data to infer the class and volume of the objects of interest. Our approach works with ordinary RGB video frames and inertial data, both readily available from modern smartphones. It does not require reference objects of known size in the video frames. Using a balanced validation dataset, we achieve a classification accuracy of 95% and a mean absolute percentage error of 16% for the volume prediction on untrained objects, which is comparable to state-of-the-art results that require more elaborate data recording setups. A highly accurate estimate of the model uncertainty is obtained through ensembling and the use of a Gaussian negative log-likelihood loss. The dataset used in our experiments, including ground-truth volume information, is available at https://sst.aau.at/cns/datasets.
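
The abstract attributes the uncertainty estimate to ensembling combined with a Gaussian negative log-likelihood loss. The sketch below is a minimal PyTorch illustration of that general recipe, not the authors' implementation: the module and function names (VolumeHead, ensemble_predict), the feature dimensionality, and the use of the standard deep-ensemble mixture rule for combining members are all assumptions made here for clarity.

```python
# Illustrative sketch only; hypothetical names, not the paper's code.
import torch
import torch.nn as nn


class VolumeHead(nn.Module):
    """Toy regression head predicting a volume mean and variance per sample."""

    def __init__(self, in_features: int = 128):
        super().__init__()
        self.mean = nn.Linear(in_features, 1)
        # Predict the log-variance so the variance stays positive.
        self.log_var = nn.Linear(in_features, 1)

    def forward(self, features: torch.Tensor):
        return self.mean(features), self.log_var(features).exp()


def train_step(model, optimizer, features, target_volume):
    """One optimization step with the Gaussian negative log-likelihood loss."""
    criterion = nn.GaussianNLLLoss()
    mean, var = model(features)
    loss = criterion(mean, target_volume, var)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


@torch.no_grad()
def ensemble_predict(models, features):
    """Combine M ensemble members into one predictive mean and variance,
    treating the ensemble as a uniform mixture of Gaussians."""
    means, variances = zip(*(m(features) for m in models))
    means = torch.stack(means)          # (M, batch, 1)
    variances = torch.stack(variances)  # (M, batch, 1)
    mixture_mean = means.mean(dim=0)
    mixture_var = (variances + means.pow(2)).mean(dim=0) - mixture_mean.pow(2)
    return mixture_mean, mixture_var
```

Under these assumptions, each ensemble member is trained independently with train_step, and ensemble_predict returns both a volume estimate and a per-sample variance that can serve as the uncertainty measure reported above.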

Keywords: Deep learning; Food datasets; Fusion; Image recognition; Sensor; Volume estimation.