no code implementations • 12 Mar 2024 • Yingcong Li, Yixiao Huang, M. Emrullah Ildiz, Ankit Singh Rawat, Samet Oymak
We show that training self-attention with gradient descent learns an automaton which generates the next token in two distinct steps: (1) Hard retrieval: given the input sequence, self-attention precisely selects the high-priority input tokens associated with the last input token; (2) Soft composition: it then creates a convex combination of the high-priority tokens, from which the next token can be sampled.
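To make the two-step picture concrete, here is a minimal numerical sketch (not the paper's construction; the matrix `W` and the `temp` knob are illustrative assumptions): as the attention logits sharpen, the softmax over the last token's scores saturates toward a near one-hot selection of the high-priority tokens (hard retrieval), while the output always remains a convex combination of the input tokens (soft composition).

```python
import numpy as np

def last_token_attention(X, W, temp=1.0):
    """Single-head attention of the last token over the sequence.

    X: (T, d) token embeddings; W: (d, d) illustrative combined key-query matrix.
    Returns the attention weights and the composed output (a convex combination).
    """
    q = X[-1] @ W                      # query formed from the last input token
    logits = X @ q / temp              # similarity of every token to that query
    a = np.exp(logits - logits.max())  # numerically stable softmax
    a /= a.sum()
    return a, a @ X                    # convex combination of the input tokens

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
W = np.eye(4)

# As the logits grow (temp -> 0), the softmax saturates and the weights
# concentrate on the highest-scoring token(s): "hard retrieval".
for temp in (1.0, 0.1, 0.01):
    a, _ = last_token_attention(X, W, temp)
    print(temp, np.round(a, 3))
```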
no code implementations • 21 Feb 2024 • M. Emrullah Ildiz, Yixiao Huang, Yingcong Li, Ankit Singh Rawat, Samet Oymak
Modern language models rely on the transformer architecture and attention mechanism to perform language understanding and text generation.
2 code implementations • 17 Jan 2023 • Yingcong Li, M. Emrullah Ildiz, Dimitris Papailiopoulos, Samet Oymak
We first explore the statistical aspects of this abstraction through the lens of multitask learning: we obtain generalization bounds for ICL when the input prompt is (1) a sequence of i.i.d. (input, label) pairs, or (2) a trajectory arising from a dynamical system.
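For intuition about setting (1), here is a minimal sketch of how such an in-context prompt can be assembled under a linear-regression task model; the task class, function names, and dimensions are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def make_icl_prompt(task_w, n_pairs, d, rng):
    """Build an in-context prompt of i.i.d. (input, label) pairs for one
    hidden linear task, plus a query input whose label is held out.
    Illustrative setup only; the paper treats more general task classes.
    """
    X = rng.normal(size=(n_pairs + 1, d))       # i.i.d. inputs
    y = X @ task_w                              # labels from the hidden task
    prompt = np.concatenate([X[:-1], y[:-1, None]], axis=1)  # (x_i, y_i) rows
    return prompt, X[-1], y[-1]                 # demonstrations, query, target

rng = np.random.default_rng(0)
d = 5
w = rng.normal(size=d)                          # hidden task vector
prompt, x_query, y_query = make_icl_prompt(w, n_pairs=8, d=d, rng=rng)
print(prompt.shape, x_query.shape, y_query)     # (8, 6) (5,) and a scalar target
```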