These are transformer models that exploit the fact that attention matrices are often nearly low-rank (that is, the full attention pattern can be approximated well using only a few dimensions, which lets the quadratic attention cost be reduced). 27.07.2023 17:54
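A minimal sketch of one way this is exploited, in the style of Linformer: the keys and values are projected along the sequence axis with small matrices `E` and `F` (names, shapes, and the random projections here are illustrative, not any specific library's API), so the score matrix is `n × k` instead of `n × n`:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def lowrank_attention(Q, K, V, E, F):
    """Q, K, V: (n, d); E, F: (k, n) projections along the sequence axis."""
    n, d = Q.shape
    K_proj = E @ K                          # (k, d): compressed keys
    V_proj = F @ V                          # (k, d): compressed values
    scores = Q @ K_proj.T / np.sqrt(d)      # (n, k) instead of (n, n)
    return softmax(scores, axis=-1) @ V_proj  # (n, d)

# Illustrative shapes: n = sequence length, d = head dim, k << n
rng = np.random.default_rng(0)
n, d, k = 16, 8, 4
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
E, F = (rng.standard_normal((k, n)) for _ in range(2))
out = lowrank_attention(Q, K, V, E, F)
```

The attention computation drops from O(n²·d) to O(n·k·d); in Linformer the projections are learned, whereas here they are random placeholders.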