Article

Obsidian Source: Notes / The Tokenization Race

Summary

Pending synthesis from local Obsidian source.

Original source title: The Tokenization Race

Extracted Preview

Think of tokenization as compressing natural language into tokens that an LLM can understand.

Byte Pair Encoding

  • Can combine tokens that encode single characters with tokens that encode whole words.

Basically, we are creating a lookup table of merges and a vocabulary. The most frequent bigram (adjacent pair of tokens) is merged, and the merged pair is added to the vocabulary as a new token. (We are dealing with bytes, so ids 0-255 form the base dictionary, and from 256 onwards we have our merges.)
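Below is a minimal sketch of this training loop in Python. The toy corpus, the `num_merges` budget, and the helper names are assumptions made for illustration; they are not from the note.

```python
# Sketch of byte-pair-encoding training: repeatedly merge the most frequent
# adjacent pair of ids and record the merge in a lookup table.
from collections import Counter

def get_pair_counts(ids):
    """Count adjacent pairs (bigrams) of token ids."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "low low lower lowest"            # toy corpus (illustrative)
ids = list(text.encode("utf-8"))         # base vocabulary: bytes 0-255
merges = {}                              # (id, id) -> new id, the lookup table
num_merges = 10
for step in range(num_merges):
    counts = get_pair_counts(ids)
    if not counts:
        break
    pair = counts.most_common(1)[0][0]   # most frequent adjacent pair
    new_id = 256 + step                  # new ids start after the byte range
    merges[pair] = new_id
    ids = merge(ids, pair, new_id)

print(merges)  # the learned merge table
print(ids)     # the compressed token sequence
```

Encoding new text replays the same merges in order; decoding walks the table in reverse to expand each merged id back into bytes.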

Merits:

  • Does not require large computational overhead.
  • Consistent and reliable.

Before moving forward, let's look at some data compression techniques to get an idea of what exactly we are dealing with.

Compression is basically the process of encoding information using fewer bits than the original representation.

There are lossy (some information is lost) as well as lossless (no information is lost) compression techniques.
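A quick, hedged illustration of lossless compression in Python: the compressed form uses fewer bytes, and the original text is recovered exactly. The sample string is made up for the example.

```python
import zlib

text = ("the quick brown fox jumps over the lazy dog " * 20).encode("utf-8")
compressed = zlib.compress(text)

print(len(text), len(compressed))           # 900 original bytes vs. far fewer
assert zlib.decompress(compressed) == text  # lossless: exact round trip
```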

  • Most forms of lossy compression (e.g. for images) are based on transform coding, such as the DCT.
  • Compression has its basis in information theory (Shannon's source coding theorem).
  • Compression algorithms implicitly map strings into feature-space vectors (see the sketch after this list).
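One way to make the last point concrete is the normalized compression distance, where compressed lengths act as an implicit similarity measure over strings. The choice of gzip and the sample strings below are assumptions for illustration, not from the note.

```python
import gzip

def clen(b: bytes) -> int:
    """Length of the gzip-compressed representation."""
    return len(gzip.compress(b))

def ncd(x: str, y: str) -> float:
    """Normalized compression distance: smaller means more similar."""
    cx, cy = clen(x.encode()), clen(y.encode())
    cxy = clen((x + y).encode())
    return (cxy - min(cx, cy)) / max(cx, cy)

print(ncd("byte pair encoding merges frequent pairs",
          "byte pair encoding merges common pairs"))   # relatively small
print(ncd("byte pair encoding merges frequent pairs",
          "discrete cosine transform for images"))     # relatively large
```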

Integration Notes

  • Source folder: /home/yashs/Documents/Docs/Obsidian/Research-Notes
  • Local source: /home/yashs/Documents/Docs/Obsidian/Research-Notes/Notes/The Tokenization Race.md
  • Raw copy: raw/obsidian/research-notes/Notes/The Tokenization Race.md

Links Created Or Updated

Open Questions