Article

Website Source: blog / sec_bpe_blog

Summary

Pending synthesis from local website source.

Original source title: What Happens When You Run BPE on Encrypted Text?

Extracted Preview

Note: I was watching Karpathy's minBPE video for the third time, late at night. The question hit me mid-rewatch: what if the corpus you feed BPE isn't plaintext? I thought it'd be a quick experiment. It became a PyPI package. Karpathy is responsible for a lot of my late nights.

What Happens When You Run BPE on Encrypted Text?

Repo: [yash-srivastava19/sec_bpe](https://github.com/yash-srivastava19/sec_bpe)

The Origin Story

I was going through Karpathy's [minBPE video](https://www.youtube.com/watch?v=zduSFxRajkE) for probably the third time - because that's the kind of person I am - and something clicked differently this time. Not about BPE itself, but about the *corpus* being fed into it. Everyone just... takes the plaintext corpus for granted. You have text, you run BPE, you get a vocabulary. But what if the text wasn't text?

Around the same time, I was reading the [Random BPE paper](https://arxiv.org/abs/2311.01480), which argues that BPE's greedy merge order isn't the sacred, irreplaceable thing we treat it as: you can randomize the merge decisions and still get vocabularies that perform comparably. That planted a seed: if the *order* of merges doesn't matter that much, what about the *distribution* of the corpus itself? BPE is fundamentally a frequency analysis over adjacent byte pairs: count every pair, merge the most frequent one, repeat. What if we messed with the byte distribution before handing it off?
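To make the "frequency analysis over byte pairs" framing concrete, here is a minimal sketch of the greedy BPE training loop in the spirit of minBPE. The function names (`pair_counts`, `merge`, `train_bpe`) are illustrative, not the API of minBPE or sec_bpe, and the sketch assumes raw bytes as the starting alphabet, with new token ids allocated from 256 upward.

```python
from collections import Counter


def pair_counts(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(data: bytes, num_merges: int):
    """Greedy BPE: repeatedly merge the most frequent adjacent pair."""
    ids = list(data)          # start from raw byte values (0..255)
    merges = {}               # (left_id, right_id) -> new token id
    for step in range(num_merges):
        counts = pair_counts(ids)
        if not counts:
            break             # fewer than two tokens left, nothing to merge
        pair, _ = counts.most_common(1)[0]
        new_id = 256 + step   # assumption: new ids follow the byte range
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


ids, merges = train_bpe(b"aaabdaaabac", 3)
print(merges)  # the first merge is (97, 97), i.e. the byte pair b"aa"
```

Nothing in the loop looks at what the bytes mean. It only sees pair counts, which is why changing the byte distribution of the corpus changes the vocabulary that comes out.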

Then I was reading about classical ciphers for no particular reason (as one does), and I landed on the Playfair cipher. Playfair operates on *digraphs*: it encrypts letters two at a time, as a pair, rather than one by one. And I thought: BPE also operates on pairs. It's literally called *Byte Pair* Encoding. It merges the most frequent *pairs* of bytes. What if those pairs were ciphertext pairs?
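For readers who haven't met Playfair, here is a rough sketch of the textbook cipher, followed by a count of the ciphertext digraphs it produces. It assumes one common convention (I and J share a cell, X separates doubled letters and pads an odd tail) and does not special-case a doubled X. The helper names are hypothetical and are not taken from the sec_bpe package.

```python
from collections import Counter
import string

ALPHABET = string.ascii_uppercase.replace("J", "")  # 25 letters, I/J merged


def build_square(key: str):
    """Return (letter -> (row, col), flat list of 25 cells) for a 5x5 key square."""
    cells = []
    for ch in key.upper().replace("J", "I") + ALPHABET:
        if ch in ALPHABET and ch not in cells:
            cells.append(ch)
    return {ch: divmod(i, 5) for i, ch in enumerate(cells)}, cells


def to_digraphs(text: str):
    """Split text into letter pairs, inserting X between doubled letters."""
    letters = [c for c in text.upper().replace("J", "I") if c in ALPHABET]
    pairs, i = [], 0
    while i < len(letters):
        a = letters[i]
        b = letters[i + 1] if i + 1 < len(letters) else "X"  # pad odd tail
        if a == b:
            pairs.append((a, "X"))  # break up the doubled letter
            i += 1
        else:
            pairs.append((a, b))
            i += 2
    return pairs


def playfair_encrypt(text: str, key: str) -> str:
    pos, cells = build_square(key)
    out = []
    for a, b in to_digraphs(text):
        (ra, ca), (rb, cb) = pos[a], pos[b]
        if ra == rb:      # same row: take the letter to the right
            out += [cells[ra * 5 + (ca + 1) % 5], cells[rb * 5 + (cb + 1) % 5]]
        elif ca == cb:    # same column: take the letter below
            out += [cells[((ra + 1) % 5) * 5 + ca], cells[((rb + 1) % 5) * 5 + cb]]
        else:             # rectangle: keep the row, swap the columns
            out += [cells[ra * 5 + cb], cells[rb * 5 + ca]]
    return "".join(out)


ct = playfair_encrypt("hide the gold in the tree stump", "playfair example")
print(ct)  # BMODZBXDNABEKUDMUIXMMOUVIF

# Ciphertext digraph frequencies: the kind of pair statistics BPE would see.
print(Counter(ct[i:i + 2] for i in range(0, len(ct), 2)).most_common(3))
```

Because each plaintext digraph maps to exactly one ciphertext digraph under a fixed key, pair frequencies at aligned positions survive encryption. Only the labels on the pairs change.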

That's the whole idea. Encrypt first, then BPE.

BPE in 30 Seconds

Integration Notes

  • Source section: blog
  • Local source: /home/yashs/Desktop/Programming/yash_blog/yash-srivastava19.github.io/blog/sec_bpe_blog.md
  • Raw copy: raw/website/yash-srivastava19-github-io/blog/sec_bpe_blog.md

Links Created Or Updated

Open Questions