Abstract
Variants of dropout have been designed for the fully-connected, convolutional, and recurrent layers in neural networks, and have been shown to be effective in avoiding overfitting. As an appealing alternative to recurrent and convolutional layers, the fully-connected self-attention layer surprisingly lacks a specific dropout method. This paper explores the possibility of regularizing the attention weights in Transformers to prevent different contextualized feature vectors from co-adaptation. Experiments on a wide range of tasks show that DropAttention can improve performance and reduce overfitting.
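To make the idea concrete, below is a minimal NumPy sketch of dropout applied directly to the attention weight matrix, rather than to hidden activations. It uses standard inverted-dropout rescaling; the paper's specific variants (e.g. its normalized-rescaling scheme) are not reproduced here, so treat this as an illustrative assumption, not the authors' exact method.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_dropout(Q, K, V, p=0.1, train=True):
    """Scaled dot-product attention with dropout on the attention weights.

    Zeroing entries of the softmax-normalized weight matrix encourages the
    output not to rely on any single attended position, discouraging
    co-adaptation among contextualized feature vectors.
    """
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))    # (n_queries, n_keys)
    if train:
        mask = rng.random(weights.shape) >= p  # drop each weight with prob p
        weights = weights * mask / (1.0 - p)   # inverted-dropout rescaling
    return weights @ V
```

At evaluation time (`train=False`) the function reduces to plain scaled dot-product attention, mirroring how standard dropout is disabled at inference.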