Abstract
With the increasing use of depth information in computer vision, monocular depth estimation has become an active field of study. It is a challenging task for which many deep convolutional neural network-based methods have been proposed. The problem with most of these approaches is that they rely on repeated max-pooling and strided convolutions in the encoder, which reduces spatial resolution. In addition, they pass information from all encoder channels directly to the decoder, which makes them prone to noise. To address these issues, we present a multigrid attention-based DenseNet-161 model. It consists of a multigrid DenseNet-161 encoder that maintains higher spatial resolution and an attention-based decoder that selects the important information from low-level features. We achieve an absolute relative error (AbsRel) of 0.109 on the NYU Depth v2 dataset and 0.0724 on the KITTI dataset. Our proposed method outperforms the state of the art on most evaluation measures on these standard benchmarks while using fewer parameters. It produces a dense depth map from a single RGB image, which can be used to create a dense point cloud. The predicted depth map is accurate and smooth, making it suitable for several applications.
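The channel-selection idea behind the attention-based decoder can be illustrated with a squeeze-and-excitation-style gate: each low-level feature channel is summarized by global average pooling, a small bottleneck network produces a per-channel weight in (0, 1), and the skip features are rescaled before fusion. The sketch below is a minimal NumPy illustration of this mechanism, not the paper's implementation; the weights `w1` and `w2` stand in for learned parameters.

```python
import numpy as np

def channel_attention(features, w1, w2):
    """Squeeze-and-excitation-style gating on a (C, H, W) feature map."""
    # Squeeze: global average pooling gives one descriptor per channel.
    z = features.mean(axis=(1, 2))                  # shape (C,)
    # Excitation: bottleneck FC -> ReLU -> FC -> sigmoid.
    h = np.maximum(w1 @ z, 0.0)                     # shape (C // r,)
    gate = 1.0 / (1.0 + np.exp(-(w2 @ h)))          # shape (C,), values in (0, 1)
    # Reweight channels so informative ones dominate the skip connection.
    return features * gate[:, None, None]

# Toy example with C=8 channels and a reduction ratio of 4.
rng = np.random.default_rng(0)
C, H, W = 8, 4, 4
feats = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // 4, C)) * 0.1   # placeholder for learned weights
w2 = rng.standard_normal((C, C // 4)) * 0.1   # placeholder for learned weights
out = channel_attention(feats, w1, w2)
print(out.shape)  # (8, 4, 4): same spatial size, channels rescaled
```

Because the gate multiplies each channel by a scalar in (0, 1), spatial resolution is untouched; only the relative contribution of each channel to the decoder changes.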
