Article

Obsidian Source: Notes / Scaling Laws for Neural Language Models.

Summary

Pending synthesis from local Obsidian source.

Original source title: Scaling Laws For Neural Language Models.

Extracted Preview

OpenAI + Jared Kaplan (Anthropic co-founder; Johns Hopkins, physics)

  • Finds empirical relationships between cross-entropy loss and model size ($N$, excluding embedding parameters), dataset size ($D$), and amount of compute ($C$).
  • This matters because large models are more sample-efficient, so compute budgets should be allocated toward them. The study spans several orders of magnitude, and the loss appears to follow precise power-law scaling throughout.
  • Key findings from the paper (numerical sketches follow the list):

- For models with a limited number of parameters, trained to convergence on sufficiently large datasets:

- $L(N) = (\frac{N_c}{N})^{\alpha_N}$, with $\alpha_N \approx 0.076$ and $N_c \approx 8.8 \times 10^{13}$ (non-embedding parameters)

- For large models trained on a limited dataset, with early stopping:

- $L(D) = (\frac{D_c}{D})^{\alpha_D}$, with $\alpha_D \approx 0.095$ and $D_c \approx 5.4 \times 10^{13}$ (tokens)

- When training with a fixed amount of compute, a sufficiently large dataset, an optimally sized model, and a sufficiently small batch size:

- $L(C_{min}) = (\frac{C_c^{min}}{C_{min}})^{\alpha_C^{min}}$, with $\alpha_C^{min} \approx 0.050$ and $C_c^{min} \approx 3.1 \times 10^{8}$ (PF-days)

- When training a given model for a finite number of steps ($S$) in the infinite-data limit:

- $L(N, S) = (\frac{N_c}{N})^{\alpha_N} + (\frac{S_c}{S})^{\alpha_S}$, with $\alpha_S \approx 0.76$ and $S_c \approx 2.1 \times 10^{3}$ (steps)

- For compute-optimal training, the budget splits as $N \propto C^{\frac{\alpha_C^{min}}{\alpha_N}}$; $B \propto C^{\frac{\alpha_C^{min}}{\alpha_B}}$; $S \propto C^{\frac{\alpha_C^{min}}{\alpha_S}}$, where $\alpha_C^{min} = \frac{1}{\frac{1}{\alpha_S} + \frac{1}{\alpha_B} + \frac{1}{\alpha_N}}$
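
A minimal numerical sketch of the loss laws above, plugging the fitted constants quoted in this note into the three loss formulas. This is an illustration, not the paper's code; the example model size of $1.5 \times 10^9$ non-embedding parameters is a hypothetical GPT-2-scale choice.

```python
# Sketch: evaluate the fitted scaling laws from Kaplan et al. (2020)
# using the constants quoted in this note. Illustrative only.

ALPHA_N, N_C = 0.076, 8.8e13   # model-size law (non-embedding parameters)
ALPHA_D, D_C = 0.095, 5.4e13   # dataset-size law (tokens)
ALPHA_S, S_C = 0.76, 2.1e3     # step law (infinite-data limit)

def loss_from_params(n: float) -> float:
    """L(N): loss for a size-N model trained to convergence on ample data."""
    return (N_C / n) ** ALPHA_N

def loss_from_tokens(d: float) -> float:
    """L(D): loss for a large model early-stopped on a D-token dataset."""
    return (D_C / d) ** ALPHA_D

def loss_from_params_and_steps(n: float, s: float) -> float:
    """L(N, S): loss after S optimization steps in the infinite-data limit."""
    return (N_C / n) ** ALPHA_N + (S_C / s) ** ALPHA_S

print(f"L(N=1.5e9)        ~ {loss_from_params(1.5e9):.2f} nats")                 # ~2.30
print(f"L(D=1e10)         ~ {loss_from_tokens(1e10):.2f} nats")                  # ~2.26
print(f"L(N=1.5e9, S=1e5) ~ {loss_from_params_and_steps(1.5e9, 1e5):.2f} nats")  # ~2.35
```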
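
Likewise, a sketch of the compute-optimal budget split. Since this note quotes $\alpha_B$ without a value, the paper's fitted critical-batch-size exponent $\alpha_B \approx 0.21$ is assumed here; the printed exponents are derived from the formula, not quoted from the paper.

```python
# Sketch: derive the compute-optimal allocation exponents from the fitted
# power-law exponents. ALPHA_B ~ 0.21 is assumed from the paper's
# critical-batch-size fit.

ALPHA_N, ALPHA_B, ALPHA_S = 0.076, 0.21, 0.76

# alpha_C^min = 1 / (1/alpha_S + 1/alpha_B + 1/alpha_N)
alpha_c_min = 1.0 / (1.0 / ALPHA_S + 1.0 / ALPHA_B + 1.0 / ALPHA_N)
print(f"alpha_C^min ~ {alpha_c_min:.3f}")   # ~0.052, near the fitted 0.050

# As the compute budget C grows, each resource scales as C^(alpha_C^min / alpha_x):
print(f"N ~ C^{alpha_c_min / ALPHA_N:.2f}")  # ~C^0.68: most compute -> bigger models
print(f"B ~ C^{alpha_c_min / ALPHA_B:.2f}")  # ~C^0.25
print(f"S ~ C^{alpha_c_min / ALPHA_S:.2f}")  # ~C^0.07: serial steps barely grow
```

The derived exponents match the note's budget-allocation point: as compute grows, nearly all of it should go into model size rather than into more training steps.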

Integration Notes

  • Source folder: /home/yashs/Documents/Docs/Obsidian/Research-Notes
  • Local source: /home/yashs/Documents/Docs/Obsidian/Research-Notes/Notes/Scaling Laws for Neural Language Models..md
  • Raw copy: raw/obsidian/research-notes/Notes/Scaling Laws for Neural Language Models..md

Links Created Or Updated

Open Questions