SAE Macaronic Languages
Do language models trained only on English develop internal representations of Hindi words dropped into English sentences? This project uses Sparse Autoencoders to probe whether GPT-2's features treat "song" and "gaana" similarly in the same sentence context.
The Question
Growing up code-switching mid-sentence ("She likes to dance on Bollywood *gaane*", where *gaane* means "songs") raises a genuine question: does GPT-2, trained entirely on English, have any internal representation that captures the semantic equivalence? The project is personal: a bilingual upbringing drove the research question.
Method
Eight matched English/Hinglish sentence pairs, identical except for one swapped word. Each pair is run through GPT-2-small; residual-stream and MLP activations are captured at each layer, passed through SAELens pre-trained SAEs, and the sets of features that activate for the English vs. Hindi token versions are compared.
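The comparison step can be sketched as follows. This is a minimal, self-contained illustration, not the project's code: a toy ReLU sparse autoencoder with random weights stands in for the SAELens pre-trained SAEs, the dimensions and threshold are illustrative, and `jaccard_overlap` is one reasonable way to score shared active features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ReLU SAE standing in for a SAELens pre-trained one:
# features = ReLU(W_enc @ activation + b_enc)
D_MODEL, D_SAE = 768, 4096  # GPT-2-small residual width; SAE width is illustrative
W_enc = rng.normal(0, 0.02, (D_SAE, D_MODEL))
b_enc = rng.normal(0, 0.02, D_SAE)

def sae_features(activation):
    """Encode a residual-stream (or MLP) activation into sparse features."""
    return np.maximum(W_enc @ activation + b_enc, 0.0)

def active_set(features, threshold=0.0):
    """Indices of features firing above the threshold."""
    return set(np.flatnonzero(features > threshold))

def jaccard_overlap(a, b):
    """|A ∩ B| / |A ∪ B| over two tokens' active-feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Stand-ins for the activations of "song" vs "gaana" in the same context:
# similar but not identical, mimicking a partial semantic match.
act_english = rng.normal(size=D_MODEL)
act_hindi = act_english + rng.normal(0, 0.5, D_MODEL)

feats_en = active_set(sae_features(act_english))
feats_hi = active_set(sae_features(act_hindi))
print(f"active-feature overlap: {jaccard_overlap(feats_en, feats_hi):.3f}")
```

Running this metric per layer on both residual-stream and MLP SAEs is what lets overlap be compared across sites, as in the findings below.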
Findings (Honest)
Partial, inconclusive — which is itself the finding:
- Residual stream shows more feature overlap than MLP layers. Makes sense: the residual stream carries a broad semantic summary, while MLP features behave more like token-specific lookup tables.
- Some words generalize better (shaadi "wedding", khana "food") — likely because they appear as loanwords in GPT-2's English training data or are inferable from context (Bollywood → wedding, food).
- Overlap is never complete. The model isn't understanding Hindi — it's picking up contextual semantic signal that bleeds through.
The careful claim: not "GPT-2 understands Hindi" but "English contextual cues partially carry semantic signal across the language boundary."
Sources
- GitHub Repo: sae-macaronic-analysis - wiki/sources/github/sae-macaronic-analysis.md
- Website Source: blog / sae_macaronic_blog - wiki/sources/website/blog-sae-macaronic-blog.md