Obsidian Source: How to Build Attention Models for Billions Users
Summary
Pending synthesis from local Obsidian source.
Original source title: How To Build Attention Models For Billions Users
Extracted Preview
Foreword: The title is clickbait. I don't actually know how to scale attention to serve a billion users; I feel it is a really complicated problem with a lot of moving parts and optimizations to keep in mind. In this blog I'm going to explore one of the approaches that I find really interesting. I got the idea to write this blog after watching Horace He's [talk](https://youtu.be/139UPjoq7Kw?si=8hoc2s7FZ7SRj4B1) with Jane Street, and I hope I was able to do it justice. I've also linked the resources I referred to while piecing this blog together. Get a cup of coffee, sit in a nice place, and enjoy this blog.
Why isn't vanilla self-attention used much in practice?
"[Attention is all you need](https://arxiv.org/pdf/1706.03762)" was a pivotal paper that marked the revolution in the AI industry. All of the breakthroughs that we see today in the AI space can be traced back to that infamous paper. The authors of that paper are really influential too, but that's a story for another blog.
The key idea introduced in the paper, and the heart of the transformer architecture, is scaled dot-product attention and self-attention. For each input sequence, three projections are generated dynamically, namely queries ($Q$), keys ($K$), and values ($V$), which allow the model to focus on different parts of the input. These three projections make up one "head" of attention. The attention scores are calculated as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
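To make the formula concrete, here is a minimal single-head sketch in PyTorch (no masking, batching, or multiple heads; the function name is mine, not from the paper). Notice that `scores` materializes the full $n \times n$ matrix, which is exactly the cost discussed next.

```python
import torch

def naive_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) tensors for a single head.
    d_k = Q.shape[-1]
    scores = Q @ K.T / d_k ** 0.5           # (seq_len, seq_len): the O(n^2) matrix
    weights = torch.softmax(scores, dim=-1)  # row-wise softmax over the keys
    return weights @ V                       # (seq_len, d_k)
```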
Performance has always been a bottleneck for using these models in downstream applications. The dot-product step in the attention score calculation is quadratic in sequence length ($O(n^2)$) in memory. Another drawback that limits their application is numerical instability: when working with long sequences, the self-attention score calculation can suffer from an "avalanche effect," where small perturbations in the input magnify into large errors during the computation.
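The blog doesn't spell out a fix at this point, but the standard trick for taming the softmax numerically is to subtract each row's maximum before exponentiating. Softmax is shift-invariant, so the result is unchanged, while the exponentials stay bounded. A minimal sketch:

```python
import torch

def stable_softmax(x, dim=-1):
    # softmax(x) == softmax(x - c) for any constant c; choosing c = max(x)
    # keeps every exp() argument <= 0, so nothing overflows even for large scores.
    shifted = x - x.max(dim=dim, keepdim=True).values
    exps = shifted.exp()
    return exps / exps.sum(dim=dim, keepdim=True)
```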
How do we optimize the attention mechanism?
> "Any optimization that is not about the bottleneck is an illusion of improvement."
The core idea behind performance engineering is simple in theory but difficult in practice: find the bottleneck and fix it. In our case, optimizing the attention mechanism means understanding its bottlenecks and building patches to improve performance. We established that memory requirements and numerical instability are the bottlenecks for attention, so what should we do next to get performance gains?
One approach was the introduction of "fused" attention. For applications where memory is a constraint, materializing the query-key product ($QK^T$) can itself be the bottleneck. With a sequence length of $4096$ (standard in practice), the score matrix has shape $4096 \times 4096$; in bfloat16 that takes about $4096 \times 4096 \times 2 \approx 32MB$ of space. To avoid exhausting memory on the full product, we can "fuse" the softmax computation with the second matrix multiplication. We make use of the fact (which is by no means trivial and is really clever) that in order to produce one block of the final result, we only need one block of the query matrix. This implies that instead of computing all of $QK^T$ at once, we can process one block at a time to produce one block of the output. For a block size of, say, $128$, the partial product $qK^T$ has shape $128 \times 4096$, which takes (for the same bfloat16 datatype) only about $128 \times 4096 \times 2 \approx 1MB$ of space at once. Now, to get the final result, just loop over all the blocks!! How cool is that!
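Here is a minimal PyTorch sketch of that blockwise idea (single head, no masking; `blockwise_attention` and `block_size` are illustrative names, not the author's implementation). Because softmax is applied row-wise, each query block's output can be computed exactly with only its own slice of the score matrix, so the full $n \times n$ matrix never exists in memory.

```python
import torch

def blockwise_attention(Q, K, V, block_size=128):
    # Q, K, V: (seq_len, d_k) tensors for a single head.
    n, d_k = Q.shape
    out = torch.empty_like(Q)
    scale = d_k ** -0.5
    for start in range(0, n, block_size):
        q_blk = Q[start:start + block_size]          # (b, d_k): one block of queries
        scores = (q_blk @ K.T) * scale               # (b, n): ~1 MB for b=128, n=4096, bf16
        weights = torch.softmax(scores, dim=-1)      # row-wise, so a query block suffices
        out[start:start + block_size] = weights @ V  # (b, d_k): one block of the output
    return out
```

For tensors of shape $(4096, 128)$ this matches the naive version exactly, while the live score buffer shrinks from $4096 \times 4096$ to $128 \times 4096$.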
Integration Notes
- Source folder: /home/yashs/Documents/Docs/Obsidian/Research-Notes
- Local source: /home/yashs/Documents/Docs/Obsidian/Research-Notes/How to Build Attention Models for Billions Users.md
- Raw copy: raw/obsidian/research-notes/How to Build Attention Models for Billions Users.md