# Variational Lossy Autoencoder

ICLR, Volume abs/1611.02731, 2017.

Keywords:

generative modeling, observed datum, learning representation, density estimation, global representation

Abstract:

Representation learning seeks to expose certain aspects of observed data in a learned representation that's amenable to downstream tasks like classification. For instance, a good representation for 2D images might be one that describes only global structure and discards information about detailed texture. In this paper, we present a sim...

Introduction

- A key goal of representation learning is to identify and disentangle the underlying causal factors of the data, so that it becomes easier to understand the data, to classify it, or to perform other tasks (Bengio et al, 2013).
- The objective that the authors optimize is often completely disconnected from the goal of learning a good representation: An autoregressive model of the data may achieve the same log-likelihood as a variational autoencoder (VAE) (Kingma & Welling, 2013), but the structure learned by the two models is completely different: the latter typically has a clear hierarchy of latent variables, while the autoregressive model has no stochastic latent variables at all
- For this reason, autoregressive models have so far not been popular for the purpose of learning representations, even though they are extremely powerful as generative models.
- The model the authors propose performs well as a density estimator, as evidenced by state-of-the-art log-likelihood results on MNIST, OMNIGLOT and Caltech-101, and has a structure that is uniquely suited for learning interesting global representations of data

Highlights

- A key goal of representation learning is to identify and disentangle the underlying causal factors of the data, so that it becomes easier to understand the data, to classify it, or to perform other tasks (Bengio et al, 2013)
- The objective that we optimize is often completely disconnected from the goal of learning a good representation: An autoregressive model of the data may achieve the same log-likelihood as a variational autoencoder (VAE) (Kingma & Welling, 2013), but the structure learned by the two models is completely different: the latter typically has a clear hierarchy of latent variables, while the autoregressive model has no stochastic latent variables at all
- Even though the information preference property of the variational autoencoder might suggest that one should always use full autoregressive models to achieve a better code length/log-likelihood, especially when slow data generation is not a concern, we argue that this information preference property can be exploited to turn the variational autoencoder into a powerful representation learning method that gives us fine-grained control over the kind of information that gets included in the learned representation
- We propose to parametrize the prior distribution p(z; θ) with an autoregressive model and show that a type of autoregressive latent code can in theory reduce inefficiency in Bits-Back coding
- We present a complementary perspective on when and how the latent code should be used by appealing to a Bits-Back interpretation of the variational autoencoder
- VLAE has the appealing properties of controllable representation learning and improved density estimation performance, but these properties come at a cost: compared with variational autoencoder models that have a simple prior and decoder, VLAE is slower at generation due to the sequential nature of the autoregressive model
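
The Bits-Back interpretation referred to in these highlights can be summarized by a standard identity relating the expected code length to the marginal likelihood. The following is a sketch in the usual VAE notation (q(z|x) for the approximate posterior, p(z) for the prior, p(x|z) for the decoder); the symbol for the code length is chosen here for illustration.

```latex
\begin{aligned}
\mathcal{C}_{\text{BitsBack}}(x)
  &= \mathbb{E}_{z \sim q(z|x)}\left[\log q(z|x) - \log p(z) - \log p(x|z)\right] \\
  &= -\log p(x) + D_{KL}\left(q(z|x)\,\|\,p(z|x)\right)
\end{aligned}
```

The second term is the extra cost paid whenever q(z|x) differs from the true posterior p(z|x). It can be driven to zero when the decoder ignores z, since the true posterior then equals the prior and q(z|x) can simply be set to p(z).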

Methods

- The authors evaluate VLAE on 2D images and leave extensions to other forms of data to future work.
- The authors use binary image datasets that are commonly used for density estimation tasks: MNIST (LeCun et al, 1998) (both the statically binarized and the dynamically binarized versions (Burda et al, 2015a)), OMNIGLOT (Lake et al, 2013; Burda et al, 2015a) and Caltech-101 Silhouettes (Marlin et al, 2010).

Results

**Table 5 (CIFAR10): methods compared**

- Results with tractable likelihood models: Uniform distribution [1], Multivariate Gaussian [1], NICE [2], Deep GMMs [3], Real NVP [4], PixelCNN [1], Gated PixelCNN [5], PixelRNN [1], PixelCNN++ [6]
- Results with variationally trained latent-variable models: Deep Diffusion [7], Convolutional DRAW [8], ResNet VAE with IAF [9], ResNet VLAE, DenseNet VLAE

- The authors investigate learning lossy codes on CIFAR10 images.
- To illustrate how the receptive field size of the PixelCNN decoder influences the properties of the learned latent codes, the authors show visualizations of similar VLAE models with receptive fields of different sizes.
- From (a)-(c) in Figure 3, one can see that larger receptive fields progressively make autoregressive decoders capture more structural information.
- In (a), a smaller receptive field tends to preserve rather detailed shape information in the lossy code, whereas in (c), with a larger receptive field, the latent code retains only rough shape.
- To demonstrate how the authors can encode color information in the lossy code, the authors can choose to make

Conclusion

- The authors analyze the condition under which the latent code in VAE should be used, i.e. when does VAE autoencode, and use this observation to design a VAE model that’s a lossy compressor of observed data.
- Moving forward, the authors believe it’s exciting to extend this principle of learning lossy codes to other forms of data, in particular those that have a temporal aspect like audio and video.
- Another promising direction is to design representations that contain only information for downstream tasks and utilize those representations to improve semi-supervised learning

Summary

- A key goal of representation learning is to identify and disentangle the underlying causal factors of the data, so that it becomes easier to understand the data, to classify it, or to perform other tasks (Bengio et al, 2013).
- The model we propose performs well as a density estimator, as evidenced by state-of-the-art log-likelihood results on MNIST, OMNIGLOT and Caltech-101, and has a structure that is uniquely suited for learning interesting global representations of data.
- Once we understand the inefficiency of the Bits-Back Coding mechanism, it’s simple to realize why sometimes the latent code z is not used: if the p(x|z) could model pdata(x) without using information from z, it will not use z, in which case the true posterior p(z|x) is the prior p(z) and it’s usually easy to set q(z|x) to be p(z) to avoid incurring an extra cost DKL(q(z|x)||p(z|x)).
- If we are interested in learning a global representation for 2D images that doesn’t encode information about detailed texture, we can construct a specific factorization of the autoregressive distribution such that it has a small local receptive field as decoding distribution, e.g., p_local(x|z) = ∏_i p(x_i | z, x_WindowAround(i)), where the window x_WindowAround(i) is smaller than the full history x_<i.
- The conditional of an autoregressive distribution might depend on a heavily down-sampled receptive field so that it can only model long-range patterns, whereas local high-frequency statistics need to be encoded into the latent code.
- An identical VAE model with a factorized decoding distribution uses on average 37.3 bits in the latent code, and this indicates that VLAE can learn a lossier compression than a VAE with a regular factorized conditional distribution.
- We evaluate whether using autoregressive decoding distribution can improve performance and we show in Table 1 that a VLAE model, with AF prior and PixelCNN conditional, is able to outperform a VAE with just AF prior and achieves new state-of-the-art results on statically binarized MNIST.
- We hypothesize that the separation of different types of information, with global structure modeled in the latent code and local statistics in PixelCNN, likely provides a good inductive bias for 2D images.
- Kingma et al (2016), Kaae Sønderby et al (2016), Gregor et al (2016) and Salimans (2016) explored VAE architecture with an explicitly deep autoregressive prior for continuous latent variables, but the autoregressive data likelihood is intractable in those architectures and needs to be inferred variationally.
- VLAE has the appealing properties of controllable representation learning and improved density estimation performance but these properties come at a cost: compared with VAE models that have simple prior and decoder, VLAE is slower at generation due to the sequential nature of autoregressive model.
- Another promising direction is to design representations that contain only information for downstream tasks and utilize those representations to improve semi-supervised learning
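
The idea of a decoder with a small local receptive field, mentioned in the summary above, can be illustrated with a minimal sketch. It assumes raster-scan pixel ordering and a square window; the function names `window_context` and `full_context` and the `radius` parameter are hypothetical and not taken from the paper. The sketch only enumerates which earlier pixels a conditional is allowed to see, not the network itself.

```python
def window_context(row, col, width, radius):
    """Raster-order predecessors of (row, col) that lie inside a local window.

    The window spans `radius` rows above and `radius` columns to either side.
    """
    ctx = []
    for r in range(max(0, row - radius), row + 1):
        for c in range(max(0, col - radius), min(width, col + radius + 1)):
            if r == row and c >= col:
                break  # on the current row, only pixels left of (row, col) are visible
            ctx.append((r, c))
    return ctx


def full_context(row, col, width):
    """All raster-order predecessors of (row, col), i.e. the full history x_{<i}."""
    return [(r, c) for r in range(row + 1) for c in range(width)
            if r < row or c < col]


# Pixel (3, 3) in an 8-pixel-wide image: the local window sees far fewer pixels
# than a full autoregressive conditional would.
local = window_context(3, 3, width=8, radius=1)
full = full_context(3, 3, width=8)
```

Because the local context is a strict subset of the full history, a decoder restricted to it cannot capture long-range structure on its own, so that information has to be carried by the latent code.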

- Table 1: Statically Binarized MNIST. Ablation on Dynamically binarized MNIST
- Table 2: Dynamically binarized MNIST
- Table 3: OMNIGLOT. [1] (Burda et al, 2015a), [2] (Burda et al, 2015b), [3] (Gregor et al, 2015), [4] (Gregor et al, 2016)
- Table 4: Caltech-101 Silhouettes. [1] (Bornschein & Bengio, 2014), [2] (Cho et al, 2011), [3] (Du et al, 2015), [4] (Rolfe, 2016), [5] (Goessling & Amit, 2015)
- Table 5: CIFAR10. Likelihood for VLAE is approximated with 512 importance samples. [1] (van den Oord et al, 2016a), [2] (Dinh et al, 2014), [3] (van den Oord & Schrauwen, 2014), [4] (Dinh et al, 2016), [5] (van den Oord et al, 2016b), [6] (Salimans et al, 2017), [7] (Sohl-Dickstein et al, 2015), [8] (Gregor et al, 2016), [9] (Kingma et al, 2016)

Related work

- We investigate a fusion between variational autoencoders with continuous latent variables (Kingma & Welling, 2013; Rezende et al, 2014) and neural autoregressive models. For autoregression, we specifically apply a novel type of architecture where autoregression is realised through a carefully constructed deep convolutional network, introduced in the PixelCNN model for images (van den Oord et al, 2016a,b). This family of convolutional autoregressive models was further explored, and extended, for audio in WaveNet (Oord et al, 2016), video in Video Pixel Networks (Kalchbrenner et al, 2016b) and language in ByteNet (Kalchbrenner et al, 2016a). The combination of latent variables with an expressive decoder was previously explored using recurrent networks, mainly in the context of language modeling (Chung et al, 2015; Bowman et al, 2015; Serban et al, 2016; Fraccaro et al, 2016; Xu & Sun, 2016). Bowman et al (2015) have also proposed to weaken an otherwise too expressive decoder by dropout to force some information into latent codes. Concurrent with our work, PixelVAE (Gulrajani et al, 2016) also explored using a conditional PixelCNN as a VAE’s decoder and obtained impressive density modeling results through the use of multiple levels of stochastic units. Using an autoregressive model on the latent code was explored in the context of discrete latent variables in DARN (Gregor et al, 2013). Kingma et al (2016), Kaae Sønderby et al (2016), Gregor et al (2016) and Salimans (2016) explored VAE architecture with an explicitly deep autoregressive prior for continuous latent variables, but the autoregressive data likelihood is intractable in those architectures and needs to be inferred variationally. In contrast, we use multiple steps of autoregressive flows that have exact likelihood and analyze the effect of using an expressive latent code.
Optimization challenges for using (all levels of) continuous latent code were discussed before and practical solutions were proposed (Bowman et al, 2015; Kaae Sønderby et al, 2016; Kingma et al, 2016). In this paper, we present a complementary perspective on when and how the latent code should be used by appealing to a Bits-Back interpretation of VAE. Learning a lossy compressor with a latent variable model has been investigated with ConvDRAW (Gregor et al, 2016). It learns a hierarchy of latent variables, and using just the high-level latent variables results in a lossy compression that performs similarly to JPEG. Our model similarly learns a lossy compressor, but it uses an autoregressive model to explicitly control what kind of information should be lost in compression.
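
To make the phrase "autoregressive flows that have exact likelihood" concrete, here is a minimal sketch of a single autoregressive flow step over a latent vector with a standard normal base distribution. The `conditioner` function is a toy, hand-written stand-in for a learned network and is purely an assumption for illustration; the paper's actual architecture is not reproduced here.

```python
import math


def conditioner(prev):
    """Toy stand-in for a learned network: maps z_{<i} to (mu_i, log_sigma_i)."""
    if not prev:
        return 0.0, 0.0
    return 0.5 * prev[-1], 0.1 * math.tanh(prev[-1])


def af_log_prob(z):
    """Exact log p(z): change of variables from z to standard normal noise."""
    total = 0.0
    for i, z_i in enumerate(z):
        mu, log_sigma = conditioner(z[:i])
        eps = (z_i - mu) * math.exp(-log_sigma)
        # log N(eps; 0, 1) minus the log-determinant contribution log_sigma
        total += -0.5 * (math.log(2 * math.pi) + eps * eps) - log_sigma
    return total


def af_sample(eps):
    """Map base noise to z one dimension at a time (the sequential direction)."""
    z = []
    for e in eps:
        mu, log_sigma = conditioner(z)
        z.append(mu + math.exp(log_sigma) * e)
    return z


z = af_sample([1.0, 0.0])
log_p = af_log_prob(z)
```

Evaluating the density needs no sampling or variational bound, since each dimension's noise can be recovered from z directly. Sampling, by contrast, has to proceed dimension by dimension, because each conditional depends on the values already generated.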

Study subjects and analysis

samples: 4096

Experimental setup and hyperparameters are detailed in the appendix. Reported marginal NLL is estimated using Importance Sampling with 4096 samples. We designed experiments to answer the following questions:

importance samples: 512

Table 5: CIFAR10. Likelihood for VLAE is approximated with 512 importance samples. [1] (van den Oord et al, 2016a), [2] (Dinh et al, 2014), [3] (van den Oord & Schrauwen, 2014), [4] (Dinh et al, 2016), [5] (van den Oord et al, 2016b), [6] (Salimans et al, 2017), [7] (Sohl-Dickstein et al, 2015), [8] (Gregor et al, 2016), [9] (Kingma et al, 2016).
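
The importance-sampled likelihood estimates quoted above (4096 samples for the reported marginal NLL, 512 for CIFAR10) follow a generic recipe that can be sketched as below. The toy Gaussian model, the function names, and the choice of proposal are all assumptions made for illustration; they stand in for the trained decoder, prior and encoder of an actual VAE.

```python
import math
import random


def log_normal(v, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (v - mean) ** 2 / var)


def logsumexp(values):
    """Numerically stable log(sum(exp(values)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def is_nll(x, sample_q, log_q, log_joint, k, rng):
    """Estimate -log p(x) as -log((1/k) * sum_j p(x, z_j) / q(z_j | x))."""
    log_w = []
    for _ in range(k):
        z = sample_q(x, rng)
        log_w.append(log_joint(x, z) - log_q(z, x))
    return -(logsumexp(log_w) - math.log(k))


# Toy model (assumption): z ~ N(0, 1), x | z ~ N(z, 1), hence p(x) = N(0, 2).
def log_joint(x, z):
    return log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)


# Proposal: the exact posterior N(x/2, 1/2), so every importance weight equals p(x).
def sample_q(x, rng):
    return rng.gauss(x / 2.0, math.sqrt(0.5))


def log_q(z, x):
    return log_normal(z, x / 2.0, 0.5)


rng = random.Random(0)
estimate = is_nll(1.3, sample_q, log_q, log_joint, k=4096, rng=rng)
exact = -log_normal(1.3, 0.0, 2.0)
```

In this toy case the proposal matches the true posterior, so the estimate agrees with the exact value regardless of the sample count. With an imperfect proposal, as in a trained VAE, the estimate is in expectation an upper bound on the true NLL and tightens as the number of samples grows.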

Reference

- Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
- Jorg Bornschein and Yoshua Bengio. Reweighted wake-sleep. arXiv preprint arXiv:1406.2751, 2014.
- Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.
- Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015a.
- Yuri Burda, Roger B Grosse, and Ruslan Salakhutdinov. Accurate and conservative estimates of mrf log-likelihood using reverse annealing. In AISTATS, 2015b.
- KyungHyun Cho, Tapani Raiko, and Alexander T Ihler. Enhanced gradient and adaptive learning rate for training restricted boltzmann machines. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 105–112, 2011.
- Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In Advances in neural information processing systems, pp. 2980–2988, 2015.
- Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
- Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.
- Chao Du, Jun Zhu, and Bo Zhang. Learning deep generative models with doubly stochastic mcmc. arXiv preprint arXiv:1506.04557, 2015.
- Otto Fabius and Joost R van Amersfoort. Variational recurrent auto-encoders. arXiv preprint arXiv:1412.6581, 2014.
- Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural models with stochastic layers. arXiv preprint arXiv:1605.07571, 2016.
- Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. Made: Masked autoencoder for distribution estimation. arXiv preprint arXiv:1502.03509, 2015.
- Marc Goessling and Yali Amit. Sparse autoregressive networks. arXiv preprint arXiv:1511.04776, 2015.
- Karol Gregor, Andriy Mnih, and Daan Wierstra. Deep AutoRegressive Networks. arXiv preprint arXiv:1310.8499, 2013.
- Karol Gregor, Ivo Danihelka, Alex Graves, and Daan Wierstra. DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.
- Karol Gregor, Frederic Besse, Danilo Jimenez Rezende, Ivo Danihelka, and Daan Wierstra. Towards conceptual compression. arXiv preprint arXiv:1604.08772, 2016.
- Ishaan Gulrajani, Kundan Kumar, Faruk Ahmed, Adrien Ali Taiga, Francesco Visin, David Vazquez, and Aaron Courville. Pixelvae: A latent variable model for natural images. arXiv preprint arXiv:1611.05013, 2016.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. arXiv preprint arXiv:1603.05027, 2016.
- Geoffrey E Hinton and Drew Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the sixth annual conference on Computational learning theory, pp. 5–13. ACM, 1993.
- Geoffrey E Hinton and Richard S Zemel. Autoencoders, minimum description length, and Helmholtz free energy. Advances in neural information processing systems, pp. 3–3, 1994.
- Antti Honkela and Harri Valpola. Variational learning and bits-back coding: an information-theoretic view to bayesian learning. IEEE Transactions on Neural Networks, 15(4):800–810, 2004.
- Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016.
- Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. How to train deep variational autoencoders and probabilistic ladder networks. arXiv preprint arXiv:1602.02282, 2016.
- Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099, 2016a.
- Nal Kalchbrenner, Aaron van den Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Video pixel networks. arXiv preprint arXiv:1610.00527, 2016b.
- Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. Proceedings of the 2nd International Conference on Learning Representations, 2013.
- Diederik P Kingma, Tim Salimans, and Max Welling. Improving variational inference with inverse autoregressive flow. arXiv preprint arXiv:1606.04934, 2016.
- Brenden M Lake, Ruslan R Salakhutdinov, and Josh Tenenbaum. One-shot learning by inverting a compositional causal process. In Advances in neural information processing systems, pp. 2526–2534, 2013.
- Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Benjamin M Marlin, Kevin Swersky, Bo Chen, and Nando de Freitas. Inductive principles for restricted boltzmann machine learning. In AISTATS, pp. 509–516, 2010.
- Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030, 2014.
- Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
- Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In Proceedings of The 32nd International Conference on Machine Learning, pp. 1530–1538, 2015.
- Danilo J Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1278–1286, 2014.
- Jason Tyler Rolfe. Discrete variational autoencoders. arXiv preprint arXiv:1609.02200, 2016.
- Tim Salimans. A structured variational auto-encoder for learning deep hierarchies of sparse features. arXiv preprint arXiv:1602.08734, 2016.
- Tim Salimans, Diederik P. Kingma, and Max Welling. Markov chain Monte Carlo and variational inference: Bridging the gap. arXiv preprint arXiv:1410.6460, 2014.
- Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P Kingma. Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.
- Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. A hierarchical latent variable encoder-decoder model for generating dialogues. arXiv preprint arXiv:1605.06069, 2016.
- Jascha Sohl-Dickstein, Eric A Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. arXiv preprint arXiv:1503.03585, 2015.
- Dustin Tran, Rajesh Ranganath, and David M Blei. Variational gaussian process. arXiv preprint arXiv:1511.06499, 2015.
