SEQUENCE-LEVEL FEATURES: HOW GRU AND LSTM CELLS CAPTURE N-GRAMS
Modern recurrent neural networks (RNNs) such as Gated Recurrent Units (GRUs) and Long Short-Term Memory (LSTM) networks have demonstrated impressive results on tasks involving sequential data. Despite continuous efforts to interpret their behavior, the exact mechanism underlying their success in capturing sequence-level information has not been thoroughly understood. In this work, we study the essential features captured by GRU/LSTM cells by mathematically expanding and unrolling their hidden states. Based on the resulting closed-form approximations of the hidden states, we argue that the effectiveness of the cells may be attributed to a type of sequence-level representation introduced by the gating mechanism, which enables the cells to encode sequence-level features alongside token-level features. Specifically, we show that under certain mild assumptions, the essential components of the hidden states consist of sequence-level features similar to those of N-grams. Building on this finding, we further observe that replacing the standard cells with these approximate hidden-state representations does not necessarily degrade performance on sentiment analysis and language modeling tasks, indicating that such features may play a significant role in GRU/LSTM cells. We hope that our work can inspire new neural architectures for capturing contextual information within sequences.
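To make the unrolling concrete, the following is a minimal numpy sketch of the kind of simplified setting the abstract alludes to, not the paper's actual derivation: assuming the update gate and the candidate state depend only on the current token (and ignoring the reset gate entirely), the recurrence h_t = (1 - z_t) * h_{t-1} + z_t * tanh(W_h x_t) admits an exact closed form in which each token's feature is weighted by a product of gates over the span that follows it. The dimensions and weight matrices below are toy placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d_in, d_hid, T = 8, 16, 5                          # toy sizes (placeholders)
W_z = rng.normal(scale=0.5, size=(d_hid, d_in))    # update-gate weights (hypothetical)
W_h = rng.normal(scale=0.5, size=(d_hid, d_in))    # candidate weights (hypothetical)
xs = rng.normal(size=(T, d_in))                    # a toy input sequence

# Simplifying assumption: drop the recurrent terms inside the gate and the
# candidate, so both depend on the current token only.
z = sigmoid(xs @ W_z.T)                            # (T, d_hid) input-driven update gates
cand = np.tanh(xs @ W_h.T)                         # (T, d_hid) token-level features

# Unrolled closed form: h_T = sum_t [ prod_{s>t} (1 - z_s) ] * z_t * cand_t.
# Each summand is a token feature scaled by a product of gates over the
# suffix t+1..T; products over contiguous spans act like soft N-gram weights.
h = np.zeros(d_hid)
for t in range(T):
    suffix = np.prod(1.0 - z[t + 1:], axis=0)      # empty product -> ones
    h += suffix * z[t] * cand[t]

# Sanity check: the same h via the step-by-step recurrence.
h_rec = np.zeros(d_hid)
for t in range(T):
    h_rec = (1.0 - z[t]) * h_rec + z[t] * cand[t]
assert np.allclose(h, h_rec)
```

In this simplified view, a product of (1 - z_s) terms over a contiguous span acts as a soft indicator that the span has been "kept", so each component of the hidden state mixes token features over recent windows, which is the sense in which the components resemble N-gram features.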