Matrix capsules with EM routing

Geoffrey E. Hinton, Sara Sabour, Nicholas Frosst

International Conference on Learning Representations (ICLR), 2018.

Cited by: 273
Keywords: pose matrix, capsule layer, transformation matrix, white box adversarial attack, viewpoint invariant

Abstract:

A capsule is a group of neurons whose outputs represent different properties of the same entity. Each layer in a capsule network contains many capsules [a group of capsules forms a capsule layer and can be used in place of a traditional layer in a neural net]. We describe a version of capsules in which each capsule has a logistic unit to represent the presence of an entity and a 4x4 pose matrix to represent the pose of that entity.

Introduction
  • Convolutional neural nets are based on the simple fact that a vision system needs to use the same knowledge at all locations in the image.
  • Capsules use high-dimensional coincidence filtering: a familiar object can be detected by looking for agreement between votes for its pose matrix.
  • These votes come from parts that have already been detected.
  • The pose matrices of the parts and the whole change in a coordinated way as the viewpoint changes, so any agreement between votes from different parts persists.
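The vote-agreement idea in the bullets above can be sketched numerically. The poses, transformation matrices, and viewpoint change below are all invented for illustration; the point is only that if each part's pose is consistent with the whole, then every vote (part pose times learned transform) lands on the same pose matrix, and a viewpoint change moves all votes together so the agreement persists.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a 4x4 pose for the whole object and, for each of three
# parts, a "learned" part-whole transformation matrix W_i (invented here).
whole_pose = np.eye(4) + 0.1 * rng.standard_normal((4, 4))
W = [np.eye(4) + 0.5 * rng.standard_normal((4, 4)) for _ in range(3)]

# If the object is present, each part's pose is consistent with the whole:
# part_pose_i @ W_i == whole_pose.
part_poses = [whole_pose @ np.linalg.inv(w) for w in W]

# Each detected part votes for the whole's pose by multiplying its own pose
# matrix by its transformation matrix.
votes = [p @ w for p, w in zip(part_poses, W)]
spread = max(np.abs(v - votes[0]).max() for v in votes)

# A viewpoint change T multiplies every part pose in a coordinated way, so
# the votes still coincide after the change (they all become T @ whole_pose).
T = np.eye(4)
T[:3, 3] = [1.0, 2.0, 0.5]  # e.g. a translation of the viewpoint
new_votes = [(T @ p) @ w for p, w in zip(part_poses, W)]
new_spread = max(np.abs(v - new_votes[0]).max() for v in new_votes)

print(spread, new_spread)  # both near zero: the votes coincide
```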
Highlights
  • Convolutional neural nets are based on the simple fact that a vision system needs to use the same knowledge at all locations in the image
  • A part produces a vote by multiplying its own pose matrix by a learned transformation matrix that represents the viewpoint invariant relationship between the part and the whole
  • Building on the work of Sabour et al (2017), we have proposed a new type of capsule system in which each capsule has a logistic unit to represent the presence of an entity and a 4x4 pose matrix to represent the pose of that entity
  • We introduced a new iterative routing procedure between capsule layers, based on the EM algorithm, which allows the output of each lower-level capsule to be routed to a capsule in the layer above in such a way that active capsules receive a cluster of similar pose votes.
  • We have shown it to be significantly more robust to white-box adversarial attacks than a baseline CNN.
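The routing procedure described above can be illustrated with a heavily simplified sketch. This is not the paper's exact algorithm (the real cost function includes learned per-capsule terms and an inverse-temperature schedule); it only shows the alternation the highlights describe: an M-step that fits a Gaussian to each higher capsule's incoming votes and activates capsules whose votes cluster tightly, and an E-step that reassigns lower capsules in proportion to vote likelihood.

```python
import numpy as np

def em_routing(votes, a_lower, n_iters=3, beta=1.0):
    """Simplified sketch of EM routing (not the paper's exact cost terms).

    votes:   (n_lower, n_higher, d) pose votes from lower to higher capsules
    a_lower: (n_lower,) activations of the lower-level capsules
    Returns the higher-capsule vote means and activation probabilities.
    """
    n_lower, n_higher, d = votes.shape
    # Start with each lower capsule assigned uniformly to the higher capsules.
    R = np.full((n_lower, n_higher), 1.0 / n_higher)
    for _ in range(n_iters):
        # M-step: fit a diagonal Gaussian to each higher capsule's votes,
        # weighting each vote by its assignment and its capsule's activation.
        w = R * a_lower[:, None]
        w_sum = w.sum(axis=0) + 1e-9
        mu = (w[:, :, None] * votes).sum(axis=0) / w_sum[:, None]
        var = (w[:, :, None] * (votes - mu) ** 2).sum(axis=0) / w_sum[:, None] + 1e-9
        # Tight clusters (small variance) give low cost and high activation.
        cost = np.log(var).sum(axis=1) * w_sum
        a_higher = 1.0 / (1.0 + np.exp(-(beta - cost)))
        # E-step: reassign lower capsules in proportion to how well their
        # votes fit each active higher capsule's Gaussian.
        log_p = -0.5 * (((votes - mu) ** 2) / var + np.log(2 * np.pi * var)).sum(axis=2)
        R = a_higher[None, :] * np.exp(log_p)
        R = R / (R.sum(axis=1, keepdims=True) + 1e-9)
    return mu, a_higher

# Toy usage: capsule 0 receives six agreeing votes, capsule 1 scattered ones.
rng = np.random.default_rng(1)
votes = np.empty((6, 2, 3))
votes[:, 0, :] = 1.0 + 0.01 * rng.standard_normal((6, 3))  # tight cluster
votes[:, 1, :] = 5.0 * rng.standard_normal((6, 3))         # no agreement
mu, act = em_routing(votes, a_lower=np.ones(6))
print(act)  # capsule 0, whose votes agree, ends up far more active
```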
Methods
  • The smallNORB dataset (LeCun et al (2004)) has gray-level stereo images of 5 classes of toys: airplanes, cars, trucks, humans and animals.
  • 5 physical instances of a class are selected for the training data and the other 5 for the test data.
  • Every individual toy is pictured at 18 different azimuths (0-340), 9 elevations and 6 lighting conditions, so the training and test sets each contain 24,300 stereo pairs of 96x96 images.
  • The authors selected smallNORB as a benchmark for developing the capsules system because it is carefully designed to be a pure shape recognition task that is not confounded by context and color, but it is much closer to natural images than MNIST.
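The split size quoted above follows directly from the factors listed in the bullets:

```python
# Per split: 5 classes x 5 toy instances x 18 azimuths x 9 elevations
# x 6 lighting conditions.
n_pairs = 5 * 5 * 18 * 9 * 6
print(n_pairs)  # 24300 stereo pairs each for training and test
```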
Results
  • The authors also tested the model on NORB, a jittered version of smallNORB with added background, and achieved a 2.6% error rate, on par with the state-of-the-art 2.7% (Ciresan et al (2012)).
Conclusion
  • Building on the work of Sabour et al (2017), the authors have proposed a new type of capsule system in which each capsule has a logistic unit to represent the presence of an entity and a 4x4 pose matrix to represent the pose of that entity.
  • The authors introduced a new iterative routing procedure between capsule layers, based on the EM algorithm, which allows the output of each lower-level capsule to be routed to a capsule in the layer above in such a way that active capsules receive a cluster of similar pose votes.
  • This new system achieves significantly better accuracy on the smallNORB data set than the state-of-the-art CNN, reducing the number of errors by 45%.
  • Given that the capsules model works well on NORB, the authors plan to implement an efficient version in order to test much larger models on much larger datasets such as ImageNet.
Tables
  • Table1: The effect of varying different components of our capsules architecture on smallNORB
  • Table2: A comparison of the smallNORB test error rate of the baseline CNN and the capsules model on novel viewpoints when both models are matched on error rate for familiar viewpoints
Related work
  • Among the multiple recent attempts at improving the ability of neural networks to deal with viewpoint variations, there are two main streams. One stream attempts to achieve viewpoint invariance and the other aims for viewpoint equivariance. The work presented by Jaderberg et al (2015), Spatial Transformer Networks, seeks viewpoint invariance by changing the sampling of CNNs according to a selection of affine transformations. De Brabandere et al (2016) extend spatial transformer networks so that the filters are adapted during inference depending on the input. They generate different filters for each locality in the feature map rather than applying the same transformation to all filters. Their approach is a step away from traditional pattern-matching frameworks like standard CNNs (LeCun et al (1990)) toward input-covariant detection. Dai et al (2017) improve upon spatial transformer networks by generalizing the sampling method of filters. Our work differs substantially in that a unit is not activated based on the matching score with a filter (either fixed or dynamically changing during inference). In our case, a capsule is activated only if the transformed poses coming from the layer below match each other. This is a more effective way to capture covariance and leads to models with many fewer parameters that generalize better. The success of CNNs has motivated many researchers to extend the translational equivariance built into CNNs to include rotational equivariance (Cohen & Welling (2016), Dieleman et al (2016), Oyallon & Mallat (2015)). The recent approach in Harmonic Networks (Worrall et al (2017)) achieves rotation-equivariant feature maps by using circular harmonic filters and returning both the maximal response and orientation using complex numbers. This shares the basic representational idea of capsules: by assuming that there is only one instance of the entity at a location, we can use several different numbers to represent its properties.
They use a fixed number of streams of rotation orders. By enforcing the equality of the sum of rotation orders along any path, they achieve patch-wise rotation equivariance. This approach is more parameter-efficient than data augmentation approaches, duplicating feature maps, or duplicating filters (Fasel & Gatica-Perez (2006), Laptev et al (2016)). Our approach encodes general viewpoint equivariance rather than only affine 2D rotations. Symmetry networks (Gens & Domingos (2014)) use iterative Lucas-Kanade optimization to find poses that are supported by the most low-level features. Their key weakness is that the iterative algorithm always starts at the same pose, rather than the mean of the bottom-up votes.
Reference
  • Wieland Brendel and Matthias Bethge. Comment on "Biologically inspired protection of deep networks from adversarial attacks". arXiv preprint arXiv:1704.01547, 2017.
  • Dan Ciresan, Ueli Meier, and Jurgen Schmidhuber. Multi-column deep neural networks for image classification. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 3642–3649. IEEE, 2012.
  • Dan C Ciresan, Ueli Meier, Jonathan Masci, Luca M Gambardella, and Jurgen Schmidhuber. High-performance neural networks for visual object classification. arXiv preprint arXiv:1102.0183, 2011.
  • Taco Cohen and Max Welling. Group equivariant convolutional networks. In International Conference on Machine Learning, pp. 2990–2999, 2016.
  • Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. arXiv preprint arXiv:1703.06211, 2017.
  • Bert De Brabandere, Xu Jia, Tinne Tuytelaars, and Luc Van Gool. Dynamic filter networks. In Neural Information Processing Systems (NIPS), 2016.
  • Sander Dieleman, Jeffrey De Fauw, and Koray Kavukcuoglu. Exploiting cyclic symmetry in convolutional neural networks. In Proceedings of the 33rd International Conference on Machine Learning, ICML'16, pp. 1889–1898. JMLR.org, 2016. URL http://dl.acm.org/citation.cfm?id=3045390.3045590.
  • Beat Fasel and Daniel Gatica-Perez. Rotation-invariant neoperceptron. In Pattern Recognition, 2006. ICPR 2006. 18th International Conference on, volume 3, pp. 336–339. IEEE, 2006.
  • Robert Gens and Pedro M Domingos. Deep symmetry networks. In Advances in Neural Information Processing Systems, pp. 2537–2545, 2014.
  • Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
  • Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Rezende, and Daan Wierstra. DRAW: A recurrent neural network for image generation. In International Conference on Machine Learning, pp. 1462–1471, 2015.
  • Yann Guermeur and Emmanuel Monfrini. A quadratic loss multi-class SVM for which a radius-margin bound applies. Informatica, 22(1):73–96, 2011.
  • Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial transformer networks. In Advances in Neural Information Processing Systems, pp. 2017–2025, 2015.
  • Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
  • Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
  • Dmitry Laptev, Nikolay Savinov, Joachim M Buhmann, and Marc Pollefeys. TI-pooling: Transformation-invariant pooling for feature learning in convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 289–297, 2016.
  • Yann LeCun, Bernhard E Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne E Hubbard, and Lawrence D Jackel. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems, pp. 396–404, 1990.
  • Yann LeCun, Fu Jie Huang, and Leon Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pp. II–104. IEEE, 2004.
  • Karel Lenc and Andrea Vedaldi. Learning covariant feature detectors. In Computer Vision–ECCV 2016 Workshops, pp. 100–117.
  • Edouard Oyallon and Stephane Mallat. Deep roto-translation scattering for object classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2865–2873, 2015.
  • Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. Dynamic routing between capsules. In Neural Information Processing Systems (NIPS), 2017.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Neural Information Processing Systems (NIPS), 2017.
  • Daniel E. Worrall, Stephan J. Garbin, Daniyar Turmukhambetov, and Gabriel J. Brostow. Harmonic networks: Deep translation and rotation equivariance. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.