no code implementations • 13 Mar 2024 • Heejune Sheen, Siyu Chen, Tianhao Wang, Harrison H. Zhou
Under a separability assumption on the data, we show that when gradient flow achieves the minimal loss value, it further implicitly minimizes the nuclear norm of the product of the key and query weight matrices.
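The quantity being implicitly minimized can be made concrete with a small sketch. A minimal illustration, assuming random weights of hypothetical dimensions (the matrices and sizes below are not from the paper): the attention logits depend on the product of the key and query weight matrices, and its nuclear norm is the sum of its singular values.

```python
import numpy as np

# Hypothetical dimensions: d-dimensional embeddings, rank-r key/query maps.
rng = np.random.default_rng(0)
d, r = 8, 4
W_K = rng.standard_normal((r, d))  # key weight matrix (illustrative values)
W_Q = rng.standard_normal((r, d))  # query weight matrix

# The attention logits depend on W_K^T W_Q, so the implicit-regularization
# statement concerns the nuclear norm of this product: the sum of its
# singular values.
product = W_K.T @ W_Q
nuclear_norm = np.linalg.norm(product, ord="nuc")
print(nuclear_norm)
```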
no code implementations • 29 Feb 2024 • Siyu Chen, Heejune Sheen, Tianhao Wang, Zhuoran Yang
In addition, we prove that an interesting "task allocation" phenomenon emerges during the gradient flow dynamics, where each attention head focuses on solving a single task of the multi-task model.
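The "task allocation" phenomenon can be illustrated with a toy diagnostic, not the paper's construction: given softmax attention weights from several heads over tokens grouped by task, one can measure how much attention mass each head places on each task. The boosted logits below are synthetic, chosen so that each head favors one task.

```python
import numpy as np

# Toy diagnostic (illustrative, not the paper's setup): H heads attend over
# tokens belonging to T tasks; "task allocation" means each head's attention
# mass concentrates on a single task.
rng = np.random.default_rng(1)
H, T, n = 3, 3, 12                   # heads, tasks, tokens per task
labels = np.repeat(np.arange(T), n)  # task label of each token

# Synthetic logits in which head h happens to favor task h's tokens.
logits = rng.standard_normal((H, T * n))
for h in range(H):
    logits[h, labels == h] += 5.0
attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Per-head attention mass on each task; rows sum to 1.
mass = np.stack([attn[:, labels == t].sum(axis=1) for t in range(T)], axis=1)
print(mass.argmax(axis=1))  # which task each head concentrates on
```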
no code implementations • 24 Nov 2020 • Heejune Sheen, Xiaonan Zhu, Yao Xie
We estimate general influence functions for spatio-temporal Hawkes processes via a tensor recovery approach, formulating the location-dependent influence function, which captures the influence of historical events, as a tensor kernel.
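The role of a location-dependent influence function can be sketched as follows. This is a minimal illustration under assumed simplifications (a discrete spatial grid, an exponential temporal decay, and a random nonnegative influence matrix as one slice of the tensor kernel), not the paper's estimator: the conditional intensity at a location is the baseline plus the decayed influence of each past event.

```python
import numpy as np

# Assumed discretization: space is a grid of cells, and the spatial part of
# the influence kernel is a nonnegative matrix K (one slice of the tensor
# kernel); the temporal part decays exponentially at rate beta.
rng = np.random.default_rng(2)
n_cells, beta, mu = 5, 1.0, 0.1
K = np.abs(rng.standard_normal((n_cells, n_cells))) * 0.1  # spatial influence

def intensity(t, cell, events):
    """Conditional intensity lambda(t, s): baseline mu plus the decaying
    influence K[s_i, s] * exp(-beta * (t - t_i)) of each past event."""
    lam = mu
    for t_i, s_i in events:
        if t_i < t:
            lam += K[s_i, cell] * np.exp(-beta * (t - t_i))
    return lam

events = [(0.2, 1), (0.5, 3)]  # (time, grid cell) of past events
print(intensity(1.0, 2, events))
```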