Introducing Evolutionary Sparsity in the Transformer Model Architecture
Recent research has shown that the dense (fully connected) layers in Artificial Neural Networks (ANNs) are largely redundant: all of them can be replaced with sparse layers before training using the sparse evolutionary training (SET) procedure, reducing the number of parameters quadratically with no decrease in accuracy. This research applied the SET algorithm to a recent innovation in NLP: the Transformer model architecture.
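For intuition, the following is a minimal NumPy sketch of the SET prune-and-regrow cycle on a single layer: the layer starts from a sparse Erdős–Rényi topology, and after each training epoch the weakest connections are dropped and replaced by randomly placed new ones. The function names and hyper-parameter values (`epsilon` controlling the initial density, `zeta` the fraction of connections rewired) are illustrative only, not the exact configuration used in our Transformer experiments.

```python
import numpy as np

def init_sparse_mask(n_in, n_out, epsilon=20, rng=None):
    """Erdos-Renyi sparse connectivity: keep on the order of
    epsilon * (n_in + n_out) weights instead of the dense
    n_in * n_out (quadratic) count."""
    if rng is None:
        rng = np.random.default_rng(0)
    density = min(1.0, epsilon * (n_in + n_out) / (n_in * n_out))
    return rng.random((n_in, n_out)) < density

def prune_and_regrow(weights, mask, zeta=0.3, rng=None):
    """One SET evolution step: remove the zeta fraction of active
    connections with the smallest magnitude, then regrow the same
    number of connections at random empty positions."""
    if rng is None:
        rng = np.random.default_rng(0)
    active = np.flatnonzero(mask)
    n_drop = int(zeta * active.size)
    # prune: flat indices of the weakest active connections
    weakest = active[np.argsort(np.abs(weights.flat[active]))[:n_drop]]
    mask.flat[weakest] = False
    weights.flat[weakest] = 0.0
    # regrow: the same number of fresh connections, initialised near zero
    empty = np.flatnonzero(~mask)
    regrown = rng.choice(empty, size=n_drop, replace=False)
    mask.flat[regrown] = True
    weights.flat[regrown] = rng.normal(0.0, 0.01, size=n_drop)
    return weights, mask

# Usage: evolve the topology of a single 512x512 sparse layer.
rng = np.random.default_rng(42)
n_in, n_out = 512, 512
mask = init_sparse_mask(n_in, n_out, rng=rng)
weights = np.where(mask, rng.normal(0.0, 0.1, (n_in, n_out)), 0.0)
for epoch in range(3):
    # ... gradient updates on the masked weights would go here ...
    weights, mask = prune_and_regrow(weights, mask, rng=rng)
    print(f"epoch {epoch}: {mask.sum()} of {mask.size} connections active")
```

The sketch keeps the total number of connections constant across epochs; only their positions evolve, which is what keeps the parameter count far below that of a dense layer throughout training.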
We showed that our adapted sparse Transformer outperforms the original Transformer while having two orders of magnitude fewer parameters. Moreover, the sparse Transformer attains a higher accuracy from the start of training, a lead that neither of the original variants ever closed.