Obsidian Source: Notes / Language Models are Unsupervised Multitask Learners
Summary
Pending synthesis from local Obsidian source.
Original source title: Language Models Are Unsupervised Multitask Learners
Extracted Preview
PyTorch port of the 124M version.
- 12 transformer layers, embed_dim 768. Takes about ~1hr with a cloud GPU service.
- Getting the state_dict -> these are basically the raw weight tensors.
- In the OG AIAYN paper, the positional embeddings (PE) are initialized with sines and cosines; in GPT-2 they are trained from scratch.
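The two PE styles side by side, as a minimal sketch (the 1024/768 sizes are the GPT-2 124M config; variable names are my own):

```python
import math

import torch
import torch.nn as nn

block_size, n_embd = 1024, 768  # GPT-2 124M context length and embed_dim

# Fixed sinusoidal encodings, as in "Attention Is All You Need".
pe = torch.zeros(block_size, n_embd)
pos = torch.arange(block_size, dtype=torch.float).unsqueeze(1)
div = torch.exp(torch.arange(0, n_embd, 2).float() * (-math.log(10000.0) / n_embd))
pe[:, 0::2] = torch.sin(pos * div)  # even dims get sine
pe[:, 1::2] = torch.cos(pos * div)  # odd dims get cosine

# GPT-2 instead uses a learned embedding table, trained from scratch.
wpe = nn.Embedding(block_size, n_embd)

print(pe.shape, wpe.weight.shape)  # both are (1024, 768)
```

Both produce one vector per position; the difference is only whether those vectors are fixed at init or updated by the optimizer.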
- We know GPT-2 is a decoder-only model. The norm is moved to before the attention and MLP layers, and a final LayerNorm is added before the lm_head, which has no bias. nn.ModuleDict -> allows dict-style indexing of submodules.
- The residual stream should not be normalized, so avoid putting a norm on it. We want a clean residual stream.
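A minimal pre-norm block sketch showing the "clean residual stream" point: LayerNorm sits on each branch's input, while `x` itself flows through untouched (`attn` and `mlp` are placeholders for the real submodules):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Pre-norm transformer block (illustrative; attn/mlp are stand-ins)."""
    def __init__(self, n_embd, attn, mlp):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = attn
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = mlp

    def forward(self, x):
        # Only the branch input is normalized; the residual stream is never
        # passed through a norm, so gradients flow straight through the adds.
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x

blk = Block(8, nn.Identity(), nn.Identity())
out = blk(torch.randn(2, 4, 8))  # shape preserved: (2, 4, 8)
```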
- The MLP acts token-wise, while attention mixes across tokens - GPT is just map-reduce? GeLU is preferred over ReLU (more smooth; GPT-2 uses the tanh approximation).
- Attention heads run in parallel (MHA). A causal mask restricts tokens to attend only to the tokens before them.
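The causal mask in a few lines, a sketch assuming a single head with raw score matrix `scores`:

```python
import torch
import torch.nn.functional as F

T = 5
# Lower-triangular mask: position t may attend only to positions <= t.
mask = torch.tril(torch.ones(T, T))

scores = torch.randn(T, T)                        # raw attention scores
scores = scores.masked_fill(mask == 0, float('-inf'))
weights = F.softmax(scores, dim=-1)               # each row sums to 1

# exp(-inf) = 0, so future positions get exactly zero attention weight.
print(weights[0])  # only the first entry of row 0 is nonzero
```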
- Views and other PyTorch internals. Load B*T + 1 tokens, then offset by one to align the logits and labels.
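The B*T + 1 trick as a sketch (the `arange` stream stands in for real token ids):

```python
import torch

B, T = 4, 8
buf = torch.arange(B * T + 1)   # B*T + 1 tokens from the stream

x = buf[:-1].view(B, T)         # inputs
y = buf[1:].view(B, T)          # labels, offset by one

# y[b, t] is exactly the token that follows x[b, t] in the stream,
# so the logits at position t are scored against the next token.
print(x[0], y[0])
```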
- With an nn.Module we can do model.to(device) in place, but we cannot do this with tensors; ten = ten.to(device) is the correct way.
- The input token embedding layer and the lm_head share their weights, because it saves a lot of space!
- Clipping the gradient norm prevents gradient shocks (too much variance), so clipping is nice.
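Weight tying and gradient clipping together, as a minimal sketch (sizes are the GPT-2 124M vocab and embed_dim; the toy forward pass is only there to produce gradients):

```python
import torch
import torch.nn as nn

vocab_size, n_embd = 50257, 768

wte = nn.Embedding(vocab_size, n_embd)            # input token embeddings
lm_head = nn.Linear(n_embd, vocab_size, bias=False)

# Weight tying: both layers point at the same parameter tensor,
# saving vocab_size * n_embd (~38M) parameters.
lm_head.weight = wte.weight
assert lm_head.weight.data_ptr() == wte.weight.data_ptr()

# Gradient clipping goes after backward(), before optimizer.step().
loss = lm_head(wte(torch.tensor([[1, 2, 3]]))).sum()
loss.backward()
norm = torch.nn.utils.clip_grad_norm_(wte.parameters(), max_norm=1.0)
```

`clip_grad_norm_` returns the pre-clip global norm, which is handy to log for spotting the gradient shocks mentioned above.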
- Gradually increasing the batch size (a hyperparameter that basically makes or breaks the system) makes sense for bigger models, because they can fit larger batches on the GPU, but for most cases it doesn't make a lot of sense.
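A batch-size ramp could be sketched as a simple linear schedule; the function name, bounds, and warmup length here are all made up for illustration:

```python
def batch_size_at(step, start=32, final=512, warmup_steps=1000):
    """Linearly ramp the batch size from `start` to `final` over
    `warmup_steps` optimizer steps, then hold it constant.
    (Illustrative schedule; all values are hypothetical.)"""
    if step >= warmup_steps:
        return final
    return start + (final - start) * step // warmup_steps

print(batch_size_at(0), batch_size_at(500), batch_size_at(2000))
```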
Takeaways
Integration Notes
- Source folder: /home/yashs/Documents/Docs/Obsidian/Research-Notes
- Local source: /home/yashs/Documents/Docs/Obsidian/Research-Notes/Notes/Language Models are Unsupervised Multitask Learners.md
- Raw copy: raw/obsidian/research-notes/Notes/Language Models are Unsupervised Multitask Learners.md