Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs

International Conference on Learning Representations (ICLR), 2015.

Keywords:
high level vision, deep convolutional neural networks, vision tasks, image segmentation

Abstract:

Deep Convolutional Neural Networks (DCNNs) have recently shown state of the art performance in high level vision tasks, such as image classification and object detection. This work brings together methods from DCNNs and probabilistic graphical models for addressing the task of pixel-level classification (also called "semantic image segmentation").

Introduction
  • Deep Convolutional Neural Networks (DCNNs) had been the method of choice for document recognition since LeCun et al (1998), but have only recently become the mainstream of high-level vision research.
  • A common theme in these works is that DCNNs trained in an end-to-end manner deliver strikingly better results than systems relying on carefully engineered representations, such as SIFT or HOG features
  • This success can be partially attributed to the built-in invariance of DCNNs to local image transformations, which underpins their ability to learn hierarchical abstractions of data (Zeiler & Fergus, 2014).
  • Applying the 'hole' (atrous) algorithm allows efficient dense computation of DCNN responses in a scheme substantially simpler than earlier solutions to this problem (Giusti et al, 2013; Sermanet et al, 2013); see the sketch after this list
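  To make the 'hole' algorithm concrete, here is a minimal 1-D NumPy sketch (an illustration under simplifying assumptions, not the authors' Caffe implementation): inserting zeros between the filter taps enlarges the filter's span while keeping the output at the full input resolution.

    import numpy as np

    def atrous_conv1d(signal, kernel, rate=1):
        # Correlate with `rate - 1` zeros ("holes") between filter taps;
        # the output stays at the input resolution (odd-length kernel
        # assumed, for symmetric zero padding).
        k = len(kernel)
        pad = (k - 1) * rate // 2  # effective kernel size: k + (k-1)*(rate-1)
        x = np.pad(np.asarray(signal, dtype=float), pad)
        out = np.zeros(len(signal))
        for i in range(len(signal)):
            for j in range(k):
                out[i] += kernel[j] * x[i + j * rate]
        return out

    # rate=1 is an ordinary sliding-window filter; rate=2 doubles the tap spacing.
    print(atrous_conv1d(np.arange(10.0), np.array([1.0, 0.0, -1.0]), rate=2))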
Highlights
  • Deep Convolutional Neural Networks (DCNNs) had been the method of choice for document recognition since LeCun et al (1998), but have only recently become the mainstream of high-level vision research
  • A common theme in these works is that Deep Convolutional Neural Networks trained in an end-to-end manner deliver strikingly better results than systems relying on carefully engineered representations, such as SIFT or HOG features
  • That model was shown in Krahenbuhl & Koltun (2011) to largely improve the performance of a boosting-based pixel-level classifier, and in our work we demonstrate that it leads to state-of-the-art results when coupled with a DCNN-based pixel-level classifier (the CRF energy is recalled after this list)
  • We provide visual comparisons between DeepLab and DeepLab-CRF in Fig. 7
  • Our experimental results show that the proposed method significantly advances the state of the art in the challenging PASCAL VOC 2012 semantic image segmentation task
  • Our work lies at the intersection of convolutional neural networks and probabilistic graphical models
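  For reference, the fully connected CRF of Krahenbuhl & Koltun (2011) adopted here minimizes an energy over pixel labels x (with DCNN label probabilities P(x_i), pixel positions p_i, and RGB values I_i):

    E(\mathbf{x}) = \sum_i \theta_i(x_i) + \sum_{ij} \theta_{ij}(x_i, x_j),
    \qquad \theta_i(x_i) = -\log P(x_i),

    \theta_{ij}(x_i, x_j) = \mu(x_i, x_j)
    \left[ w_1 \exp\!\left( -\frac{\|p_i - p_j\|^2}{2\sigma_\alpha^2}
                            -\frac{\|I_i - I_j\|^2}{2\sigma_\beta^2} \right)
         + w_2 \exp\!\left( -\frac{\|p_i - p_j\|^2}{2\sigma_\gamma^2} \right) \right]

  where \mu(x_i, x_j) = 1 if x_i \neq x_j and 0 otherwise (Potts model); the first (bilateral) kernel encourages nearby pixels with similar color to take the same label, and the second enforces spatial smoothness.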
Methods
  • Dataset: The authors test the DeepLab model on the PASCAL VOC 2012 segmentation benchmark (Everingham et al, 2014), consisting of 20 foreground object classes and one background class.
  • The original dataset contains 1,464, 1,449, and 1,456 images for training, validation, and testing, respectively.
  • The dataset is augmented by the extra annotations provided by Hariharan et al (2011), resulting in 10,582 training images.
  • The performance is measured in terms of pixel intersection-over-union (IOU) averaged across the 21 classes (a minimal computation sketch follows this list).
  • The authors use momentum of 0.9 and a weight decay of 0.0005
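  As a concrete illustration of the metric, a minimal sketch of per-class IOU averaging over hypothetical label maps (note that the official benchmark accumulates intersection and union counts over the whole evaluation set rather than per image):

    import numpy as np

    def mean_iou(pred, gt, num_classes=21):
        # pred, gt: integer label maps of identical shape.
        ious = []
        for c in range(num_classes):
            inter = np.logical_and(pred == c, gt == c).sum()
            union = np.logical_or(pred == c, gt == c).sum()
            if union > 0:            # skip classes absent from both maps
                ious.append(inter / union)
        return float(np.mean(ious))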
Results
  • Evaluation on Validation set

    The authors conduct the majority of the evaluations on the PASCAL ‘val’ set, training the model on the augmented PASCAL ‘train’ set.
  • As shown in Tab. 1 (a), incorporating the fully connected CRF into the model yields a substantial performance boost, about 4% improvement over DeepLab (a post-processing sketch follows this list). The authors note that the work of Krahenbuhl & Koltun (2011) improved the 27.6% result of TextonBoost (Shotton et al, 2009) to 29.1%, which makes the improvement reported here all the more impressive.
  • As shown in Tab. 1 (a), adding the multi-scale features to the model (DeepLab-MSc) further improves performance, with the best results obtained by combining multi-scale features, large field-of-view, and the CRF (DeepLab-MSc-CRF-LargeFOV).
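  A sketch of this CRF post-processing step using the open-source pydensecrf package (a third-party Python wrapper around Krahenbuhl & Koltun's inference code; the kernel parameters below are illustrative, not the cross-validated values from the paper):

    import numpy as np
    import pydensecrf.densecrf as dcrf
    from pydensecrf.utils import unary_from_softmax

    def crf_refine(image, probs, num_iters=10):
        # image: (H, W, 3) uint8 RGB; probs: (num_classes, H, W) float32
        # DCNN softmax output, upsampled to the image resolution.
        h, w = image.shape[:2]
        d = dcrf.DenseCRF2D(w, h, probs.shape[0])
        d.setUnaryEnergy(unary_from_softmax(probs))   # unary = -log P(x_i)
        d.addPairwiseGaussian(sxy=3, compat=3)        # smoothness kernel
        d.addPairwiseBilateral(sxy=80, srgb=13,       # appearance kernel
                               rgbim=np.ascontiguousarray(image), compat=10)
        q = np.array(d.inference(num_iters))          # mean-field iterations
        return q.argmax(axis=0).reshape(h, w)         # per-pixel MAP label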
Conclusion
  • The authors' work combines ideas from deep convolutional neural networks and fully-connected conditional random fields, yielding a novel method able to produce semantically accurate predictions and detailed segmentation maps, while being computationally efficient.
  • The authors' experimental results show that the proposed method significantly advances the state of the art in the challenging PASCAL VOC 2012 semantic image segmentation task.
  • The authors' work lies at the intersection of convolutional neural networks and probabilistic graphical models.
  • The authors plan to further investigate the interplay of these two powerful classes of methods and explore their synergistic potential for solving challenging computer vision tasks
Tables
  • Table 1: (a) Performance of our proposed models on the PASCAL VOC 2012 ‘val’ set (trained on the augmented ‘train’ set). The best performance is achieved by exploiting both multi-scale features and large field-of-view. (b) Performance of our proposed models (trained on the augmented ‘trainval’ set) compared to other state-of-the-art methods on the PASCAL VOC 2012 ‘test’ set.
  • Table 2: Effect of Field-of-View. We show the performance (after CRF) and training speed on the PASCAL VOC 2012 ‘val’ set as a function of (1) the kernel size of the first fully connected layer and (2) the input stride value employed in the atrous algorithm (see the note after these captions).
  • Table 3: Labeling IOU (%) on the PASCAL VOC 2012 ‘test’ set, using the ‘trainval’ set for training.
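  A note on Table 2: the trade-off between kernel size and input stride reflects a standard property of the atrous algorithm, namely that a k x k kernel applied with input stride r has effective size

    k_{\text{eff}} = k + (k - 1)(r - 1)

  so, for example, the LargeFOV setting of a 3x3 kernel with input stride 12 spans the same extent as a 25x25 kernel while using far fewer parameters.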
Related work
  • Our system works directly on the pixel representation, similarly to Long et al (2014). This is in contrast to the two-stage approaches that are now most common in semantic segmentation with DCNNs: such techniques typically use a cascade of bottom-up image segmentation and DCNN-based region classification, which makes the system commit to potential errors of the front-end segmentation system. For instance, the bounding box proposals and masked regions delivered by (Arbelaez et al, 2014; Uijlings et al, 2013) are used in Girshick et al (2014) and (Hariharan et al, 2014b) as inputs to a DCNN to introduce shape information into the classification process. Similarly, the authors of Mostajabi et al (2014) rely on a superpixel representation. A celebrated non-DCNN precursor to these works is the second order pooling method of (Carreira et al, 2012), which also assigns labels to the region proposals delivered by (Carreira & Sminchisescu, 2012). Understanding the perils of committing to a single segmentation, the authors of Cogswell et al (2014) build on (Yadollahpour et al, 2013) to explore a diverse set of CRF-based segmentation proposals, also computed by (Carreira & Sminchisescu, 2012). These segmentation proposals are then re-ranked according to a DCNN trained in particular for this re-ranking task. Even though this approach explicitly tries to handle the temperamental nature of a front-end segmentation algorithm, there is still no explicit exploitation of the DCNN scores in the CRF-based segmentation algorithm: the DCNN is only applied post-hoc, while it would make sense to directly use its results during segmentation.
Funding
  • This work was partly supported by ARO 62250-CS, NIH Grant 5R01EY022247-03, EU Project RECONFIG FP7-ICT-600825 and EU Project MOBOT FP7-ICT-2011-600796
  • We also gratefully acknowledge the support of NVIDIA Corporation with the donation of GPUs used for this research
References
  • Adams, A., Baek, J., and Davis, M. A. Fast high-dimensional filtering using the permutohedral lattice. In Computer Graphics Forum, 2010.
  • Arbelaez, P., Pont-Tuset, J., Barron, J. T., Marques, F., and Malik, J. Multiscale combinatorial grouping. In CVPR, 2014.
  • Bell, S., Upchurch, P., Snavely, N., and Bala, K. Material recognition in the wild with the materials in context database. arXiv:1412.0623, 2014.
  • Carreira, J. and Sminchisescu, C. CPMC: Automatic object segmentation using constrained parametric min-cuts. PAMI, 2012.
  • Carreira, J., Caseiro, R., Batista, J., and Sminchisescu, C. Semantic segmentation with second-order pooling. In ECCV, 2012.
  • Chen, L.-C., Papandreou, G., and Yuille, A. Learning a dictionary of shape epitomes with applications to image labeling. In ICCV, 2013.
  • Chen, L.-C., Schwing, A., Yuille, A., and Urtasun, R. Learning deep structured models. arXiv:1407.2538, 2014.
  • Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and Yuille, A. L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv:1606.00915, 2016.
  • Chen, X. and Yuille, A. L. Articulated pose estimation by a graphical model with image dependent pairwise relations. In NIPS, 2014.
  • Cogswell, M., Lin, X., Purushwalkam, S., and Batra, D. Combining the best of graphical models and convnets for semantic segmentation. arXiv:1412.4313, 2014.
  • Dai, J., He, K., and Sun, J. Convolutional feature masking for joint object and stuff segmentation. arXiv:1412.1283, 2014.
  • Delong, A., Osokin, A., Isack, H. N., and Boykov, Y. Fast approximate energy minimization with label costs. IJCV, 2012.
  • Eigen, D. and Fergus, R. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. arXiv:1411.4734, 2014.
  • Everingham, M., Eslami, S. M. A., Gool, L. V., Williams, C. K. I., Winn, J., and Zisserman, A. The PASCAL Visual Object Classes challenge: A retrospective. IJCV, 2014.
  • Farabet, C., Couprie, C., Najman, L., and LeCun, Y. Learning hierarchical features for scene labeling. PAMI, 2013.
  • Geiger, D. and Girosi, F. Parallel and deterministic algorithms from MRFs: Surface reconstruction. PAMI, 13(5):401–412, 1991.
  • Geiger, D. and Yuille, A. A common framework for image segmentation. IJCV, 6(3):227–243, 1991.
  • Girshick, R., Donahue, J., Darrell, T., and Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
  • Giusti, A., Ciresan, D., Masci, J., Gambardella, L., and Schmidhuber, J. Fast image scanning with deep max-pooling convolutional neural networks. In ICIP, 2013.
  • Gonfaus, J. M., Boix, X., Van de Weijer, J., Bagdanov, A. D., Serrat, J., and Gonzalez, J. Harmony potentials for joint classification and segmentation. In CVPR, 2010.
  • Hariharan, B., Arbelaez, P., Bourdev, L., Maji, S., and Malik, J. Semantic contours from inverse detectors. In ICCV, 2011.
  • Hariharan, B., Arbelaez, P., Girshick, R., and Malik, J. Hypercolumns for object segmentation and fine-grained localization. arXiv:1411.5752, 2014a.
  • Hariharan, B., Arbelaez, P., Girshick, R., and Malik, J. Simultaneous detection and segmentation. In ECCV, 2014b.
  • He, X., Zemel, R. S., and Carreira-Perpiñán, M. Multiscale conditional random fields for image labeling. In CVPR, 2004.
  • Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093, 2014.
  • Kohli, P., Ladicky, L., and Torr, P. H. Robust higher order potentials for enforcing label consistency. IJCV, 2009.
  • Kokkinos, I., Deriche, R., Faugeras, O., and Maragos, P. Computational analysis and learning for a biologically motivated model of boundary detection. Neurocomputing, 71(10):1798–1812, 2008.
  • Krahenbuhl, P. and Koltun, V. Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS, 2011.
  • Krahenbuhl, P. and Koltun, V. Parameter learning and convergent inference for dense random fields. In ICML, 2013.
  • Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
  • Ladicky, L., Russell, C., Kohli, P., and Torr, P. H. Associative hierarchical CRFs for object class image segmentation. In ICCV, 2009.
  • LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE, 1998.
  • Lempitsky, V., Vedaldi, A., and Zisserman, A. Pylon model for semantic segmentation. In NIPS, 2011.
  • Long, J., Shelhamer, E., and Darrell, T. Fully convolutional networks for semantic segmentation. arXiv:1411.4038, 2014.
  • Lucchi, A., Li, Y., Boix, X., Smith, K., and Fua, P. Are spatial and global constraints really necessary for segmentation? In ICCV, 2011.
  • Mallat, S. A Wavelet Tour of Signal Processing. Academic Press, 2nd edition, 1999.
  • Mostajabi, M., Yadollahpour, P., and Shakhnarovich, G. Feedforward semantic segmentation with zoom-out features. arXiv:1412.0774, 2014.
  • Papandreou, G., Kokkinos, I., and Savalle, P.-A. Untangling local and global deformations in deep convolutional networks for image classification and sliding window detection. arXiv:1412.0296, 2014.
  • Papandreou, G., Chen, L.-C., Murphy, K., and Yuille, A. L. Weakly- and semi-supervised learning of a DCNN for semantic image segmentation. arXiv:1502.02734, 2015.
  • Rother, C., Kolmogorov, V., and Blake, A. GrabCut: Interactive foreground extraction using iterated graph cuts. In SIGGRAPH, 2004.
  • Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., and LeCun, Y. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv:1312.6229, 2013.
  • Shotton, J., Winn, J., Rother, C., and Criminisi, A. TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. IJCV, 2009.
  • Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
  • Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. arXiv:1409.4842, 2014.
  • Tompson, J., Jain, A., LeCun, Y., and Bregler, C. Joint training of a convolutional network and a graphical model for human pose estimation. In NIPS, 2014.
  • Wang, P., Shen, X., Lin, Z., Cohen, S., Price, B., and Yuille, A. Towards unified depth and semantic prediction from a single image. In CVPR, 2015.
  • Yadollahpour, P., Batra, D., and Shakhnarovich, G. Discriminative re-ranking of diverse segmentations. In CVPR, 2013.
  • Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In ECCV, 2014.
  • Zhang, N., Donahue, J., Girshick, R., and Darrell, T. Part-based R-CNNs for fine-grained category detection. In ECCV, 2014.
  • Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., Huang, C., and Torr, P. H. S. Conditional random fields as recurrent neural networks. arXiv:1502.03240, 2015.