Learning DNA folding patterns with Recurrent Neural Networks

25 Sep 2019 · Michal Rozenwald, Aleksandra Galitsyna, Ekaterina Khrameeva, Grigory Sapunov, Mikhail S. Gelfand ·

The recent expansion of machine learning applications to molecular biology proved to have a significant contribution to our understanding of biological systems, and genome functioning in particular. Technological advances enabled the collection of large epigenetic datasets, including information about various DNA binding factors (ChIP-Seq) and DNA spatial structure (Hi-C). Several studies have confirmed the correlation between DNA binding factors and Topologically Associating Domains (TADs) in DNA structure. However, the information about physical proximity represented by genomic coordinate was not yet used for the improvement of the prediction models. In this research, we focus on Machine Learning methods for prediction of folding patterns of DNA in a classical model organism Drosophila melanogaster. The paper considers linear models with four types of regularization, Gradient Boosting and Recurrent Neural Networks for the prediction of chromatin folding patterns from epigenetic marks. The bidirectional LSTM RNN model outperformed all the models and gained the best prediction scores. This demonstrates the utilization of complex models and the importance of memory of sequential DNA states for the chromatin folding. We identify informative epigenetic features that lead to the further conclusion of their biological significance.

PDF Abstract