These are transformer models that exploit the fact that attention matrices are often nearly low-rank (that is, the full attention pattern can be approximated well using only a few dimensions, which lets the quadratic attention cost be reduced). 27.07.2023 17:54
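A minimal sketch of one way this is exploited, in the style of Linformer: the keys and values are projected along the sequence axis with small matrices `E` and `F` (names, shapes, and the random projections here are illustrative, not any specific library's API), so the score matrix is `n × k` instead of `n × n`:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def lowrank_attention(Q, K, V, E, F):
    """Q, K, V: (n, d); E, F: (k, n) projections along the sequence axis."""
    n, d = Q.shape
    K_proj = E @ K                          # (k, d): compressed keys
    V_proj = F @ V                          # (k, d): compressed values
    scores = Q @ K_proj.T / np.sqrt(d)      # (n, k) instead of (n, n)
    return softmax(scores, axis=-1) @ V_proj  # (n, d)

# Illustrative shapes: n = sequence length, d = head dim, k << n
rng = np.random.default_rng(0)
n, d, k = 16, 8, 4
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
E, F = (rng.standard_normal((k, n)) for _ in range(2))
out = lowrank_attention(Q, K, V, E, F)
```

The attention computation drops from O(n²·d) to O(n·k·d); in Linformer the projections are learned, whereas here they are random placeholders.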