Flex Attention

Foreword: The title is clickbait. I don't actually know how to scale attention to serve a billion users; it's a really complicated problem with a lot of moving parts and optimizations to keep in mind. In this blog, I'm going to explore one approach that I find really interesting. I got the idea to write this blog after watching Horace He's [talk](https://youtu.be/139UPjoq7Kw?si=8hoc2s7FZ7SRj4B1) with Jane Street, and I hope I was able to do it justice. I've also linked the resources I referred to while piecing this together. Get a cup of coffee, sit somewhere nice, and enjoy.

How To Scale Attention To A Billion Users?

Why isn't vanilla self-attention used much in practice?

"[Attention is all you need](https://arxiv.org/pdf/1706.03762)" was a pivotal paper that sparked the revolution in the AI industry. Nearly all of the breakthroughs we see today in the AI space can be traced back to that famous paper. The authors of that paper are really influential too, but that's a story for another blog.

The key idea the paper introduced, in developing the transformer architecture, was scaled dot-product attention and self-attention. For each token in the input sequence, three vectors are computed dynamically: a query (Q), a key (K), and a value (V), which let the model focus on different parts of the input. One set of Q/K/V projections makes up one "head" of attention. The scores are calculated as:

![image](https://github.com/user-attachments/assets/caf420cf-1925-4007-ad25-40b12cfeb91e)
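To make the formula above concrete, here is a minimal pure-Python sketch of scaled dot-product attention, softmax(QKᵀ/√d_k)·V. This is not the blog's own code; the tiny matrices and dimensions are made up purely for illustration.

```python
import math

def softmax(row):
    # Subtract the row max before exponentiating so exp() never overflows.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on plain nested lists.

    Q, K are n x d_k, V is n x d_v; returns an n x d_v output.
    """
    d_k = len(K[0])
    # scores[i][j] = (q_i . k_j) / sqrt(d_k) -- this n x n matrix is the
    # quadratic-in-sequence-length step.
    scores = [
        [sum(qi * kj for qi, kj in zip(q, k)) / math.sqrt(d_k) for k in K]
        for q in Q
    ]
    weights = [softmax(row) for row in scores]
    # Each output row is an attention-weighted sum of the value vectors.
    return [
        [sum(w * v[j] for w, v in zip(wrow, V)) for j in range(len(V[0]))]
        for wrow in weights
    ]

# Tiny made-up example: two tokens, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = attention(Q, K, V)
```

Each output row is a convex combination of the rows of V, weighted by how strongly that token's query matches each key.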

Performance has always been a bottleneck for using these models in downstream applications. The QKᵀ step in the attention score calculation is quadratic in the sequence length, in both compute and memory. Another drawback is numerical instability: with long sequences, the self-attention score calculation can suffer from an "avalanche effect", where small perturbations in the input magnify into large errors during computation, and exponentiating large raw logits in the softmax can overflow floating point entirely.
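The numerical-stability point can be demonstrated directly. The sketch below (my own illustration, not from the blog; the logit values are made up) shows that naively exponentiating large attention logits overflows float64, while the standard max-subtraction trick used by practical softmax implementations stays finite:

```python
import math

def naive_softmax(row):
    # Exponentiating large logits directly overflows float64.
    exps = [math.exp(x) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(row):
    # Subtracting the row max keeps every exponent <= 0, so exp() is finite;
    # the subtraction cancels in the ratio, leaving the result unchanged.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical large logits, as might arise from long unscaled dot products.
logits = [1000.0, 999.0, 998.0]

try:
    naive_softmax(logits)  # math.exp(1000.0) raises OverflowError
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

probs = stable_softmax(logits)
```

This is exactly why the 1/√d_k scaling (and, inside fused kernels, running-max tracking) matters: it keeps the logits in a range where the softmax is well behaved.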

How do we optimize the attention mechanism?

<blockquote>

"Any optimization that is not about the bottleneck is an illusion of improvement"

</blockquote>
