Getting Vocabulary
For this task, we will make use of a sample training dataset from OSCAR. Since the original OSCAR dataset is over 1TB in size, we will use OSCAR 10K, a small subset with 10K samples. The dataset can be downloaded from Hugging Face:
hf download stas/oscar-en-10k --repo-type dataset
# This dataset ships with a loading script which, when run, downloads the data
# Do note that datasets 4.0.0 and later no longer support executing remote code
# To get around this you can pip install "datasets<4.0.0"
Once the dataset is downloaded, run the Python script below to create the corpus:
from datasets import load_dataset

ds = load_dataset("stas/oscar-en-10k")

with open("oscar_10k.txt", "w", encoding="utf-8") as f:
    for row in ds["train"]:
        f.write(row["text"].replace("\n", " ") + "\n")
To see how big the corpus is, you can run the du command.
❯ du -h ./oscar_10k.txt
144M ./oscar_10k.txt # 144MB in size
The next step is to split the text in the created corpus into smaller chunks (words, numbers, punctuation) that BPE will later break into subwords. We can make use of a regex for this step. Looking through the original GPT-2 tokenizer implementation from Hugging Face, I found a very nasty-looking regex expression.
self.pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
Breaking it down, all it's doing is:
's|'t|'re|'ve|'m|'ll|'d - splits a word from its common contractions
 ?\p{L}+ - an optional space followed by one or more letters
 ?\p{N}+ - an optional space followed by one or more digits
 ?[^\s\p{L}\p{N}]+ - an optional space followed by one or more characters that are neither whitespace, letters, nor digits (special characters)
\s+(?!\S) - one or more whitespace characters, with a negative lookahead so the last space before a word is not consumed here and stays attached to the next chunk
\s+ - any remaining runs of whitespace
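To see what this pre-tokenization actually produces, here is a minimal sketch that applies the same pattern to a sample sentence. Note that Python's standard re module does not understand \p{L} or \p{N}, which is why the Hugging Face implementation imports the third-party regex package as re; we do the same here.

import regex as re

# Same pattern as the GPT-2 tokenizer; \p{L} = any letter, \p{N} = any digit
pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

print(pat.findall("I'm splitting 2 sentences, aren't I?"))
# ['I', "'m", ' splitting', ' 2', ' sentences', ',', ' aren', "'t", ' I', '?']

Notice how spaces are kept attached to the front of the word that follows them, which is exactly what the optional leading space in the pattern is for.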
Input Preprocessing
The encoding happens at the byte level instead of the character level. Given a UTF-8 encoded input text, each byte, with a value from 0 to 255, becomes an initial symbol. Even punctuation, emojis and spaces are simply split into bytes, so you don't have to worry about removing them. This is great because you always have something to start with, the byte, even if the character is unknown. Some characters and emojis also take up multiple bytes, and this is handled well too.
"你" → [228, 189, 160] (3 bytes in UTF-8)
"🙂" → [240, 159, 153, 130]