no code implementations • 23 Mar 2023 • Kai Liu, Hailiang Xiong, Gangqiang Yang, Zhengfeng Du, Yewen Cao, Danyal Shah
On the other hand, we need to reduce the dimension of each subspace to keep the size of the overall feature space unchanged when we increase the number of heads, which will significantly weaken the ability to represent the feature of each subspace.