Multinomial Variational Autoencoders can recover Principal Components
Covariance estimation on high-dimensional data is a central challenge across multiple scientific disciplines. Sparse high-dimensional count data, frequently encountered in biological applications such as DNA sequencing and proteomics, are often well modeled using multinomial logistic-normal models. In many cases these datasets are also compositional, presented item-wise as fractions of a normalized total, as necessitated by measurement and instrument constraints. Yet three key challenges limit covariance estimation with these models: (1) the computational complexity of inverting high-dimensional covariance matrices, (2) the non-exchangeability introduced by the summation constraint on multinomial parameters, and (3) the irreducibility of the component multinomial logistic-normal distribution, which necessitates the use of parameter augmentation, or similar techniques, during inference. We show that a variational autoencoder augmented with a fast Isometric Log-ratio (ILR) transform can address these issues and accurately estimate principal components from multinomially logistic-normal distributed data. This model can be optimized on GPUs and modified to handle mini-batching, with the ability to scale across thousands of dimensions and thousands of samples.
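As an illustration of the ILR transform mentioned above, the following is a minimal NumPy sketch and not the paper's implementation. It assumes a Helmert-type orthonormal basis, whereas the abstract does not say which basis the fast ILR transform uses. The function names `helmert_basis`, `ilr`, and `ilr_inv` are hypothetical. The sketch shows how the transform maps compositions with d parts, which are constrained to sum to one, into unconstrained coordinates in d-1 dimensions.

```python
import numpy as np


def helmert_basis(d):
    """Return a (d-1, d) matrix whose rows are orthonormal and sum to zero.

    Assumption: a Helmert-type contrast basis, chosen only for illustration.
    """
    psi = np.zeros((d - 1, d))
    for i in range(1, d):
        psi[i - 1, :i] = 1.0 / i          # average of the first i parts
        psi[i - 1, i] = -1.0              # contrasted against part i
        psi[i - 1] *= np.sqrt(i / (i + 1.0))  # scale the row to unit norm
    return psi


def ilr(x, psi):
    """Map strictly positive compositions (rows of x) to R^(d-1)."""
    log_x = np.log(np.asarray(x, dtype=float))
    # Centered log-ratio: subtract the mean log so each row sums to zero.
    clr = log_x - log_x.mean(axis=-1, keepdims=True)
    return clr @ psi.T


def ilr_inv(y, psi):
    """Map ILR coordinates back to compositions that sum to one."""
    z = np.asarray(y, dtype=float) @ psi
    z = z - z.max(axis=-1, keepdims=True)  # stabilize the exponential
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```

Because the basis rows sum to zero, the coordinates are unchanged when a composition is rescaled, so raw counts and normalized fractions give the same result as long as no entry is zero. Sparse count data would need separate handling of zeros before taking logarithms.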