
SAE Macaronic Languages

Do language models trained only on English develop internal representations of Hindi words dropped into English sentences? This project uses sparse autoencoders (SAEs) to probe whether GPT-2's features treat "song" and "gaana" similarly in the same sentence context.

The Question

Growing up code-switching mid-sentence ("She likes to dance on Bollywood *gaane*") raises a genuine question: does GPT-2, trained entirely on English, have any internal representation that captures the semantic equivalence between a Hindi word and its English counterpart? The project is personal: a bilingual upbringing drove the research question.

Method

Eight matched English/Hinglish sentence pairs: one word swapped, everything else identical. Each pair is run through GPT-2-small; residual stream and MLP activations are captured at each layer, passed through SAELens pre-trained SAEs, and the sets of features that activate for the English token are compared against those for the Hindi token.
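The per-token comparison step can be sketched as follows. The real project uses SAELens pre-trained SAEs on captured GPT-2 activations; here, to keep the sketch self-contained, a small random-weight SAE stands in for the pre-trained one and random vectors stand in for the captured activations. The encoder form, ReLU(W_enc(x − b_dec) + b_enc), matches the standard SAE architecture, but all dimensions, thresholds, and inputs below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL = 768    # GPT-2-small hidden width
D_SAE = 4096     # toy feature count; real pre-trained SAEs are much wider

# Toy random-weight SAE standing in for a SAELens pre-trained SAE.
W_enc = rng.normal(0.0, 0.02, (D_SAE, D_MODEL))
b_enc = np.zeros(D_SAE)
b_dec = np.zeros(D_MODEL)

def sae_encode(x: np.ndarray) -> np.ndarray:
    """Standard SAE encoder: f(x) = ReLU(W_enc @ (x - b_dec) + b_enc)."""
    return np.maximum(0.0, W_enc @ (x - b_dec) + b_enc)

def active_features(x: np.ndarray, threshold: float = 1.0) -> set[int]:
    """Indices of SAE features firing above threshold for one token's activation.

    A real pre-trained SAE is sparse by training; this toy one needs a
    threshold to keep only strongly-firing features.
    """
    return set(np.flatnonzero(sae_encode(x) > threshold))

# Stand-ins for residual-stream activations captured at the swapped token
# position ("song" vs "gaana") in one matched sentence pair: the Hindi
# version is modeled as a partially shifted copy of the English one.
act_song = rng.normal(0.0, 1.0, D_MODEL)
act_gaana = act_song + rng.normal(0.0, 0.5, D_MODEL)

feats_song = active_features(act_song)
feats_gaana = active_features(act_gaana)
shared = feats_song & feats_gaana
print(len(feats_song), len(feats_gaana), len(shared))
```

In the actual pipeline, `act_song` and `act_gaana` would come from hooking GPT-2's residual stream (or MLP output) at the swapped token's position, and the SAE weights would be loaded from a SAELens release rather than sampled.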

Findings (Honest)

Partial and inconclusive, which is itself the finding:

  • Residual stream shows more feature overlap than MLP layers. This makes sense: the residual stream carries a broad semantic summary, while MLP layers behave more like token-specific lookup tables.
  • Some words generalize better (shaadi, khana), likely because they appear as loanwords in GPT-2's English training data or are inferable from context (Bollywood → wedding, food).
  • Overlap is never complete. The model isn't understanding Hindi; it's picking up contextual semantic signal that bleeds through from the surrounding English.
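One natural way to quantify "more feature overlap" is Jaccard similarity between the sets of SAE features active for the English token and the Hindi token. The sketch below assumes that metric; the feature indices and the resulting numbers are made up for illustration and are not the project's measurements.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a ∩ b| / |a ∪ b|: 1.0 means identical sets, 0.0 disjoint."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical active-feature sets at one layer for the swapped token
# ("song" vs "gaana"); values chosen to mirror the qualitative finding.
resid_song, resid_gaana = {3, 17, 42, 96, 210}, {3, 17, 42, 210, 305}
mlp_song, mlp_gaana = {7, 88, 131, 402}, {7, 555, 610, 777}

print(f"residual overlap: {jaccard(resid_song, resid_gaana):.2f}")  # → 0.67
print(f"MLP overlap:      {jaccard(mlp_song, mlp_gaana):.2f}")      # → 0.14
```

Averaging this score over the eight sentence pairs, per layer and per site (residual vs MLP), gives the overlap comparison described above.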

The careful claim: not "GPT-2 understands Hindi" but "English contextual cues partially carry semantic signal across the language boundary."

Sources

  • GitHub repo: sae-macaronic-analysis
  • Website: blog / sae_macaronic_blog