Pay more attention: Recap of the last week

justynasty@lemmy.kya.moe · 1 year ago

Self-attention struggles with long sequences because its cost grows quadratically with the sequence length: every query attends to all keys and values, which becomes computationally inefficient. Sparse attention alleviates this issue by restricting each query's access to a subset of the keys and values.
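To make the difference concrete, here is a minimal NumPy sketch. It uses a fixed sliding window as the "subset of keys and values", which is only one possible sparsity pattern and not necessarily the one the quoted work uses. The function names and the `window` parameter are made up for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def full_attention(q, k, v):
    # Every query scores every key: an (n, n) score matrix, so O(n^2) work.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def local_attention(q, k, v, window):
    # Hypothetical sparse variant: query i only sees keys i-window .. i+window,
    # so at most n * (2 * window + 1) scores are computed instead of n * n.
    n, d = q.shape
    out = np.zeros((n, v.shape[1]))
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        scores = q[i] @ k[lo:hi].T / np.sqrt(d)
        out[i] = softmax(scores) @ v[lo:hi]
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(6, 4)) for _ in range(3))
dense = full_attention(q, k, v)
sparse = local_attention(q, k, v, window=1)
```

With a window that covers the whole sequence, the two functions should agree. Shrinking the window trades context for fewer score computations, which is where the savings come from.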

I think this reduces the training time (it improves on the quadratic time complexity), but it still does not settle how attention should be spread out over an infinitely long text, which is the question that future models need to answer.

🕳️ Attention Sinks in LLMs for endless fluency