Article

Obsidian Source: Notes / Language Models are Unsupervised Multitask Learners

Summary

Pending synthesis from local Obsidian source.

Original source title: Language Models Are Unsupervised Multitask Learners

Extracted Preview

PyTorch port of the 124M-parameter GPT-2.

  • 12 transformer layers with an embedding dimension of 768. Training takes roughly an hour on a cloud GPU service.
  • Getting the state_dict, which is basically the raw weight tensors of the checkpoint.
  • In the original Attention Is All You Need paper, the positional encodings are fixed sine/cosine functions; in GPT-2 the positional embeddings are learned from scratch.
  • GPT-2 is a decoder-only model. LayerNorm is moved to sit before the attention and MLP sub-layers (pre-norm), and the final lm_head carries no bias. See the skeleton after this list.
  • nn.ModuleDict allows dict-style indexing of submodules (used in the skeleton below).
  • The residual stream itself should not be normalized; normalization belongs on the branch inputs so the residual path stays a clean identity connection.
  • The MLP acts token-wise while attention mixes across positions, so GPT resembles map-reduce: the MLP "maps" each token independently and attention "reduces" across the sequence. GELU is preferred over ReLU because it is smoother; GPT-2 uses the tanh approximation of GELU (see the MLP sketch below).
  • Attention runs all heads in parallel (multi-head attention), with a causal mask so each token attends only to the tokens before it (see the attention sketch below).
  • Views and other PyTorch internals: take a buffer of B×T + 1 tokens, then use one-token offsets to align the logits with the next-token labels (see the batching sketch below).
  • With an nn.Module, model.to(device) moves the module in place, but tensors do not move in place: .to(device) returns a new tensor, so ten = ten.to(device) is the correct pattern.
  • The input token embedding and the lm_head share their weights (weight tying), because it saves a lot of parameters: vocab_size × n_embd ≈ 38.6M at this size.
  • Clipping gradients, not the loss: clipping the global gradient norm prevents gradient shocks from occasional high-variance batches (see the training-step sketch below).
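
Where the list mentions the token-wise MLP and the tanh-approximated GELU, a minimal PyTorch sketch; the c_fc/c_proj names follow the GPT-2 checkpoint convention and the 4× hidden expansion is the standard GPT-2 choice:

```python
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, n_embd=768):
        super().__init__()
        self.c_fc = nn.Linear(n_embd, 4 * n_embd)    # expand 4x
        self.gelu = nn.GELU(approximate="tanh")      # smoother than ReLU; GPT-2's variant
        self.c_proj = nn.Linear(4 * n_embd, n_embd)  # project back down

    def forward(self, x):
        # applied independently at every position: the "map" step
        return self.c_proj(self.gelu(self.c_fc(x)))
```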
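A hedged sketch of the parallel, causally masked multi-head attention; the fused c_attn projection mirrors the GPT-2 checkpoint layout, and F.scaled_dot_product_attention with is_causal=True stands in for an explicit mask buffer:

```python
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, n_embd=768, n_head=12):
        super().__init__()
        self.n_head = n_head
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)  # q, k, v in one matmul
        self.c_proj = nn.Linear(n_embd, n_embd)      # output projection

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(C, dim=2)
        # reshape to (B, n_head, T, head_dim): all heads run in parallel
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # is_causal=True masks future positions, so each token only
        # attends to tokens at or before its own position
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)
```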
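Putting the pieces together, a minimal sketch of the decoder-only 124M layout, reusing the MLP and CausalSelfAttention classes above. nn.ModuleDict gives dict-style access and makes parameter names line up with the released state_dict; the last line of the constructor is the weight tying from the list:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, n_embd=768):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd)
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = MLP(n_embd)

    def forward(self, x):
        # pre-norm: LayerNorm sits on each branch, never on the residual
        # stream itself, so the residual path stays a clean identity
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x

class GPT(nn.Module):
    def __init__(self, vocab_size=50257, block_size=1024, n_layer=12, n_embd=768):
        super().__init__()
        # nn.ModuleDict allows dict-style indexing of submodules,
        # e.g. self.transformer["wte"]
        self.transformer = nn.ModuleDict(dict(
            wte=nn.Embedding(vocab_size, n_embd),  # token embeddings
            wpe=nn.Embedding(block_size, n_embd),  # learned positional embeddings
            h=nn.ModuleList(Block(n_embd) for _ in range(n_layer)),
            ln_f=nn.LayerNorm(n_embd),             # final layer norm
        ))
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)  # no bias
        # weight tying: input embedding and lm_head share one matrix,
        # saving vocab_size * n_embd ≈ 38.6M parameters at this size
        self.lm_head.weight = self.transformer.wte.weight

    def forward(self, idx):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.transformer.wte(idx) + self.transformer.wpe(pos)
        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)
        return self.lm_head(x)  # (B, T, vocab_size) logits
```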
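A small sketch of the B×T + 1 batching trick; `data` is assumed to be a 1-D tensor of token ids, and the view works because a 1-D slice stays contiguous:

```python
def get_batch(data, pos, B=4, T=32, device="cpu"):
    buf = data[pos : pos + B * T + 1]  # B*T tokens plus one extra
    x = buf[:-1].view(B, T)            # inputs
    y = buf[1:].view(B, T)             # labels, offset by one token
    # .to(device) on a tensor returns a new tensor, so reassign;
    # only nn.Module's model.to(device) moves things in place
    return x.to(device), y.to(device)
```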
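And a hedged sketch of one training step with gradient-norm clipping; `model`, `optimizer`, `x`, and `y` are assumed to come from the surrounding loop, and the clip value of 1.0 is a common GPT-2/GPT-3 choice rather than something stated in the note:

```python
import torch
import torch.nn.functional as F

optimizer.zero_grad(set_to_none=True)
logits = model(x)  # (B, T, vocab_size)
loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
loss.backward()
# clip the global gradient norm to 1.0 so a single high-variance batch
# cannot deliver a gradient "shock" to the weights
norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
```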

Gradually increasing the batch size is a hyperparameter schedule that can make or break training. It makes sense for bigger models, which can eventually fit larger batches on the GPU, but for most smaller runs it does not buy much. A sketch of such a ramp follows.
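
A hedged sketch of what a linear batch-size ramp could look like; all numbers here are illustrative, not from the source:

```python
def batch_size_at(step, start=32, final=512, warmup_steps=10_000):
    # linear ramp from `start` to `final` over the warmup window,
    # then hold at `final` for the rest of training
    if step >= warmup_steps:
        return final
    return start + (final - start) * step // warmup_steps
```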

Takeaways

Integration Notes

  • Source folder: /home/yashs/Documents/Docs/Obsidian/Research-Notes
  • Local source: /home/yashs/Documents/Docs/Obsidian/Research-Notes/Notes/Language Models are Unsupervised Multitask Learners.md
  • Raw copy: raw/obsidian/research-notes/Notes/Language Models are Unsupervised Multitask Learners.md

Links Created Or Updated

Open Questions