Note: Leviathan is my attempt at making a drop-in replacement for attention-based models with similar performance. The whole idea occurred to me while I was reading across a lot of domains and doing a ton of math.
Leviathan - Let's modify and improve the Transformer model (from scratch).
Transformers have been the go-to architecture of the modern deep learning stack from the moment they were introduced. They are quite powerful, and many successful architectures built on top of them have been SOTA in their respective domains.
Tackling a change to the Transformer architecture is no easy problem, and resources such as “[Illustrated Transformers](https://jalammar.github.io/illustrated-transformer/)” have helped a lot in deeply understanding its intricacies. It is one thing to read the paper and do the math, and another to debug like a child. Fortunately, this was a problem I had been thinking about a lot in the past; the implementation was a different beast. All that was left was running some experiments and comparing against the Transformer baseline.
I’ve tried to implement this model, called “Leviathan”, which is a modified version of the Transformer that uses a correlation score (an analogy taken from signal processing). The implementation can be found [here](https://github.com/yash-srivastava19/attention-free-revolution), and here’s my reasoning on why I think it performs similarly to Transformers.
Why Correlation?
I read somewhere that self-attention (simple scaled dot-product attention) can be seen as a Graph Neural Network, where each token in the input sequence is a node and edges denote the relationships between tokens, and that the attention layer forms a directed acyclic graph - which makes sense, as different contexts give different meanings to how different tokens are connected.
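To make the graph picture concrete, here is a small sketch of my own (not from the original post): with a causal mask, the attention weight matrix is lower-triangular, i.e. a weighted adjacency matrix in which every edge connects a token to one at the same or an earlier position, so no cycles are possible.

```python
# Illustration only: causal scaled dot-product attention weights viewed as a
# weighted adjacency matrix. Shapes and values are made up for the example.
import numpy as np

def causal_attention_weights(Q, K):
    """Scaled dot-product attention weights with a causal mask."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # pairwise token similarities
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)  # block attention to future tokens
    scores[mask] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
Q = rng.normal(size=(seq_len, d_model))
K = rng.normal(size=(seq_len, d_model))

A = causal_attention_weights(Q, K)
# A is lower-triangular: token i only "sees" tokens j <= i,
# so the attention graph has no cycles.
print(np.round(A, 2))
```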
If we think of tokens as signals, attention can be used to capture long-range dependencies between them, and correlation is great when there is a delay between the signals. Correlation, or more generally cross-correlation, is a broader way to measure similarity between signals: the dot product is basically just cross-correlation at zero lag.
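A quick numerical check of that last claim (again my own sketch, not from the original post): the zero-lag entry of the full cross-correlation of two equal-length vectors is exactly their dot product.

```python
# Sketch: the dot product equals cross-correlation evaluated at zero lag.
import numpy as np
from scipy.signal import correlate

x = np.array([1.0, 2.0, -1.0, 0.5])
y = np.array([0.5, -1.0, 3.0, 2.0])

dot = np.dot(x, y)

# 'full' mode returns the correlation at every lag from -(len(y)-1) to len(x)-1;
# for equal-length inputs the zero-lag entry sits at index len(x) - 1.
xcorr = correlate(x, y, mode="full")
zero_lag = xcorr[len(x) - 1]

print(dot, zero_lag)  # both print the same value
```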
Suppose that instead of a vanilla dot product we use cross-correlation, which is ultimately a sliding-window product. Because of the “lag” between tokens, there are effectively more nodes in the graph, which allows more ways in which the signals (tokens) can be connected. Having more nodes means we can learn richer features, as there are now more ways in which a bunch of tokens can interact.
Architectural Design
I wanted the correlation metric to be a drop-in replacement for scaled dot-product attention. I tried to implement the algorithm using the scipy signal module, which looks something like this:
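What follows is only an illustrative sketch of that idea, not the actual implementation from the linked repository: the function names, the max-over-lags reduction, and the scaling are my own guesses about how a scipy.signal based correlation score could slot into the attention interface.

```python
# Minimal sketch, NOT the repository's actual implementation.
# Replaces the Q·K^T similarity of scaled dot-product attention with a
# cross-correlation-based score computed via scipy.signal.
import numpy as np
from scipy import signal

def correlation_scores(Q, K):
    """Pairwise similarity between query and key vectors, using the peak of
    their full cross-correlation instead of a plain dot product."""
    seq_len, d_k = Q.shape
    scores = np.zeros((seq_len, seq_len))
    for i in range(seq_len):
        for j in range(seq_len):
            # Cross-correlate over all lags; take the maximum so that
            # delayed (shifted) matches still score highly.
            xcorr = signal.correlate(Q[i], K[j], mode="full")
            scores[i, j] = xcorr.max()
    return scores / np.sqrt(d_k)  # same scaling as dot-product attention

def correlation_attention(Q, K, V):
    """Drop-in analogue of scaled dot-product attention."""
    scores = correlation_scores(Q, K)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 16))
K = rng.normal(size=(6, 16))
V = rng.normal(size=(6, 16))
out = correlation_attention(Q, K, V)
print(out.shape)  # (6, 16), same interface as standard attention
```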