
SAE Macaronic Languages

Do language models trained only on English develop internal representations of Hindi words dropped into English sentences? This project uses sparse autoencoders (SAEs) to probe whether GPT-2's features treat "song" and "gaana" similarly in the same sentence context.

The Question

Growing up code-switching mid-sentence ("She likes to dance on Bollywood *gaane*") raises a genuine question: does GPT-2, trained entirely on English, have any internal representation that captures the semantic equivalence between a Hindi word and its English counterpart? The project is personal: a bilingual upbringing drove the research question.

Method

Eight matched English/Hinglish sentence pairs: one word swapped, everything else identical. Each pair is run through GPT-2-small; residual stream and MLP activations are captured at each layer, passed through SAELens pre-trained SAEs, and the sets of features that activate for the English token are compared against those for the Hindi token.
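The per-token comparison step can be sketched as follows. The real project uses SAELens pre-trained SAEs on captured GPT-2 activations; here, to keep the sketch self-contained, a small random-weight SAE stands in for the pre-trained one and random vectors stand in for the captured activations. The encoder form, ReLU(W_enc(x − b_dec) + b_enc), matches the standard SAE architecture, but all dimensions, thresholds, and inputs below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL = 768    # GPT-2-small hidden width
D_SAE = 4096     # toy feature count; real pre-trained SAEs are much wider

# Toy random-weight SAE standing in for a SAELens pre-trained SAE.
W_enc = rng.normal(0.0, 0.02, (D_SAE, D_MODEL))
b_enc = np.zeros(D_SAE)
b_dec = np.zeros(D_MODEL)

def sae_encode(x: np.ndarray) -> np.ndarray:
    """Standard SAE encoder: f(x) = ReLU(W_enc @ (x - b_dec) + b_enc)."""
    return np.maximum(0.0, W_enc @ (x - b_dec) + b_enc)

def active_features(x: np.ndarray, threshold: float = 1.0) -> set[int]:
    """Indices of SAE features firing above threshold for one token's activation.

    A real pre-trained SAE is sparse by training; this toy one needs a
    threshold to keep only strongly-firing features.
    """
    return set(np.flatnonzero(sae_encode(x) > threshold))

# Stand-ins for residual-stream activations captured at the swapped token
# position ("song" vs "gaana") in one matched sentence pair: the Hindi
# version is modeled as a partially shifted copy of the English one.
act_song = rng.normal(0.0, 1.0, D_MODEL)
act_gaana = act_song + rng.normal(0.0, 0.5, D_MODEL)

feats_song = active_features(act_song)
feats_gaana = active_features(act_gaana)
shared = feats_song & feats_gaana
print(len(feats_song), len(feats_gaana), len(shared))
```

In the actual pipeline, `act_song` and `act_gaana` would come from hooking GPT-2's residual stream (or MLP output) at the swapped token's position, and the SAE weights would be loaded from a SAELens release rather than sampled.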

Findings (Honest)

Partial and inconclusive, which is itself the finding:

  • Residual stream shows more feature overlap than MLP layers. This makes sense: the residual stream carries a broad semantic summary, while MLP layers behave more like token-specific lookup tables.
  • Some words generalize better (shaadi, khana), likely because they appear as loanwords in GPT-2's English training data or are inferable from context (Bollywood → wedding, food).
  • Overlap is never complete. The model isn't understanding Hindi; it's picking up contextual semantic signal that bleeds through from the surrounding English.
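One natural way to quantify "more feature overlap" is Jaccard similarity between the sets of SAE features active for the English token and the Hindi token. The sketch below assumes that metric; the feature indices and the resulting numbers are made up for illustration and are not the project's measurements.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a ∩ b| / |a ∪ b|: 1.0 means identical sets, 0.0 disjoint."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical active-feature sets at one layer for the swapped token
# ("song" vs "gaana"); values chosen to mirror the qualitative finding.
resid_song, resid_gaana = {3, 17, 42, 96, 210}, {3, 17, 42, 210, 305}
mlp_song, mlp_gaana = {7, 88, 131, 402}, {7, 555, 610, 777}

print(f"residual overlap: {jaccard(resid_song, resid_gaana):.2f}")  # → 0.67
print(f"MLP overlap:      {jaccard(mlp_song, mlp_gaana):.2f}")      # → 0.14
```

Averaging this score over the eight sentence pairs, per layer and per site (residual vs MLP), gives the overlap comparison described above.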

The careful claim: not "GPT-2 understands Hindi" but "English contextual cues partially carry semantic signal across the language boundary."

Sources

  • GitHub repo: sae-macaronic-analysis
  • Website: blog / sae_macaronic_blog