The Art of Tokenization: Breaking Down Text for AI | by Murilo Gustineli | Sep, 2024

Demystifying NLP: From Text to Embeddings

Tokenization example generated by Llama-3-8B. Each colored subword represents a distinct token.

In computer science, we refer to human languages, like English and Mandarin, as “natural” languages. In contrast, languages designed to interact with computers, like Assembly and LISP, are called “machine” languages, which follow strict syntactic rules that leave little room for interpretation. While computers excel at processing their own highly structured languages, they struggle with the messiness of human language.

Language, especially text, makes up most of our communication and knowledge storage. For example, the internet is mostly text. Large language models like ChatGPT, Claude, and Llama are trained on enormous amounts of text (essentially all the text available online) using sophisticated computational techniques. However, computers operate on numbers, not words or sentences. So, how do we bridge the gap between human language and machine understanding?

This is where Natural Language Processing (NLP) comes into play. NLP is a field that combines linguistics, computer science, and artificial intelligence to enable computers to understand, interpret, and generate human language. Whether translating text from English to French, summarizing articles, or engaging in conversation, NLP allows machines to produce meaningful outputs from textual inputs.

The first essential step in NLP is transforming raw text into a format that computers can work with effectively. This process is known as tokenization. Tokenization involves breaking down text into smaller, manageable pieces called tokens, which can be words, subwords, or even individual characters. Here’s how the process typically works:

  • Standardization: Before tokenizing, the text is standardized to ensure consistency. This may include converting all letters to lowercase, removing punctuation, and applying other normalization techniques.
  • Tokenization: The standardized text is then split into tokens. For example, the sentence “The quick brown fox jumps over the lazy dog” can be tokenized into words:
["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
  • Numerical representation: Since computers operate on numerical data, each token is converted into a numerical representation. This can be as simple as assigning a unique identifier to each token or as complex as creating multi-dimensional vectors that capture the token’s meaning and context (a short sketch of this pipeline follows the figure below).
Illustration inspired by “Figure 11.1 From text to vectors” from Deep Learning with Python by François Chollet
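
To make the three steps concrete, here is a minimal sketch on a toy sentence. It standardizes by lowercasing, tokenizes by splitting on whitespace, and assigns each distinct token an integer ID; the vocabulary is built on the fly purely for illustration.

# A toy end-to-end pipeline: standardize -> tokenize -> map tokens to integer IDs
sentence = "The quick brown fox jumps over the lazy dog"

# Standardization: lowercase (punctuation removal is omitted for this toy example)
tokens = sentence.lower().split()

# Numerical representation: assign a unique ID to each distinct token
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
ids = [vocab[token] for token in tokens]

print(tokens)  # ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
print(ids)     # [7, 6, 0, 2, 3, 5, 7, 4, 1]

Real models replace this ad hoc vocabulary with one learned from a large corpus, and the IDs are then mapped to embedding vectors.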

Tokenization is more than just splitting text; it’s about preparing language data in a way that preserves meaning and context for computational models. Different tokenization methods can significantly impact how well a model understands and processes language.

In this article, we focus on text standardization and tokenization, exploring a few techniques and implementations. We’ll lay the groundwork for converting text into numerical forms that machines can process, a crucial step toward advanced topics like word embeddings and language modeling that we’ll tackle in future articles.

Consider these two sentences:

1. “dusk fell, i was gazing at the Sao Paulo skyline. Isnt urban life vibrant??”

2. “Dusk fell; I gazed at the São Paulo skyline. Isn’t urban life vibrant?”

At first glance, these sentences convey a similar meaning. However, when processed by a computer, especially during tasks like tokenization or encoding, they can appear vastly different due to subtle variations:

  • Capitalization: “dusk” vs. “Dusk”
  • Punctuation: Comma vs. semicolon; presence of question marks
  • Contractions: “Isnt” vs. “Isn’t”
  • Spelling and Special Characters: “Sao Paulo” vs. “São Paulo”

These variations can significantly impact how algorithms interpret the text. For example, “Isnt” without an apostrophe may not be recognized as the contraction of “is not”, and special characters like “ã” in “São” may be misinterpreted or cause encoding issues.

Text standardization is a crucial preprocessing step in NLP that addresses these issues. By standardizing text, we reduce irrelevant variability and ensure that the data fed into models is consistent. This process is a form of feature engineering where we eliminate differences that aren’t meaningful for the task at hand.

A simple method for text standardization includes:

  • Converting to lowercase: Reduces discrepancies caused by capitalization.
  • Removing punctuation: Simplifies the text by eliminating punctuation marks.
  • Normalizing special characters: Converts characters like “ã” to their standard forms (“a”).

Applying these steps to our sentences, we get:

1. “dusk fell i was gazing at the sao paulo skyline isnt urban life vibrant”

2. “dusk fell i gazed at the sao paulo skyline isnt urban life vibrant”

Now the sentences are more uniform, highlighting only the meaningful differences in word choice (e.g., “was gazing at” vs. “gazed at”).

While there are more advanced standardization techniques like stemming (reducing words to their root forms) and lemmatization (reducing words to their dictionary form), this basic approach effectively minimizes superficial differences.
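
For reference, here is a minimal sketch of stemming and lemmatization using NLTK. This assumes NLTK is installed and its WordNet data has been downloaded; it is separate from the standardization function used in this article.

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet")  # one-time download required by the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["running", "studies", "better"]
print([stemmer.stem(w) for w in words])                   # e.g. ['run', 'studi', 'better']
print([lemmatizer.lemmatize(w, pos="v") for w in words])  # e.g. ['run', 'study', 'better']

Stemming chops suffixes heuristically (note “studi”), while lemmatization maps words to their dictionary forms.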

Python implementation of text standardization

Here’s how you can implement basic text standardization in Python:

import re
import unicodedata

def standardize_text(text: str) -> str:
    # Convert text to lowercase
    text = text.lower()
    # Normalize unicode characters to ASCII
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8')
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Example sentences
sentence1 = "dusk fell, i was gazing at the Sao Paulo skyline. Isnt urban life vibrant??"
sentence2 = "Dusk fell; I gazed at the São Paulo skyline. Isn’t urban life vibrant?"

# Standardize sentences
std_sentence1 = standardize_text(sentence1)
std_sentence2 = standardize_text(sentence2)
print(std_sentence1)
print(std_sentence2)

Output:

dusk fell i was gazing at the sao paulo skyline isnt urban life vibrant
dusk fell i gazed at the sao paulo skyline isnt urban life vibrant

By standardizing the text, we’ve minimized differences that could confuse a computational model. The model can now focus on the variations between the sentences, such as the difference between “was gazing at” and “gazed at”, rather than discrepancies like punctuation or capitalization.

After text standardization, the next crucial step in natural language processing is tokenization. Tokenization involves breaking down the standardized text into smaller units called tokens. These tokens are the building blocks that models use to understand and generate human language. Tokenization prepares the text for vectorization, where each token is converted into numerical representations that machines can process.

We aim to convert sentences into a form that computers can handle efficiently and effectively. There are three common methods for tokenization:

1. Word-level tokenization

Splits text into individual words based on spaces and punctuation. It’s the most intuitive way to break down text.

textual content = "nightfall fell i gazed on the sao paulo skyline isnt city life vibrant"
tokens = textual content.cut up()
print(tokens)

Output:

['dusk', 'fell', 'i', 'gazed', 'at', 'the', 'sao', 'paulo', 'skyline', 'isnt', 'urban', 'life', 'vibrant']

2. Character-level tokenization

Breaks text into individual characters, including letters and sometimes punctuation.

textual content = "Nightfall fell"
tokens = checklist(textual content)
print(tokens)

Output:

['D', 'u', 's', 'k', ' ', 'f', 'e', 'l', 'l']

3. Subword tokenization

Splits words into smaller, meaningful subword units. This method balances the granularity of character-level tokenization with the semantic richness of word-level tokenization. Algorithms like Byte-Pair Encoding (BPE) and WordPiece fall under this category. For instance, the BertTokenizer tokenizes “I have a new GPU!” as follows:

from transformers import BertTokenizer

text = "I have a new GPU!"
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize(text)
print(tokens)

Output:

['i', 'have', 'a', 'new', 'gp', '##u', '!']

Here, “GPU” is split into “gp” and “##u”, where “##” indicates that “u” is a continuation of the previous subword.

Subword tokenization offers a balance between vocabulary size and semantic representation. By decomposing rare words into common subwords, it maintains a manageable vocabulary size without sacrificing meaning. Subwords carry semantic information that helps models understand context more effectively. This means models can process new or rare words by breaking them down into familiar subwords, increasing their ability to handle a wider range of language inputs.

For example, consider the word “annoyingly”, which might be rare in a training corpus. It can be decomposed into the subwords “annoying” and “ly”. Both “annoying” and “ly” appear more frequently on their own, and their combined meanings retain the essence of “annoyingly”. This approach is especially useful in agglutinative languages like Turkish, where words can become exceedingly long by stringing together subwords to convey complex meanings.
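
As a quick check, we can run the same BertTokenizer from the earlier example on such a word. The exact pieces depend on the model’s learned vocabulary, so the expected output here is an illustration rather than a guarantee.

from transformers import BertTokenizer

# Tokenize a word that may be rare in the vocabulary; the exact split depends
# on the learned vocabulary of bert-base-uncased.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("annoyingly"))  # typically splits into familiar subwords, e.g. ['annoying', '##ly']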

Notice that the standardization step is often integrated into the tokenizer itself. Large language models use tokens as both inputs and outputs when processing text. Here’s a visual illustration of tokens generated by Llama-3-8B on Tiktokenizer:

Tiktokenizer example using Llama-3-8B. Each token is represented by a different color.

Additionally, Hugging Face provides an excellent summary of tokenizers in its guide, and I use some of its examples in this article.

Let’s now explore how different subword tokenization algorithms work. Note that all of these tokenization algorithms rely on some form of training, which is usually done on the corpus the corresponding model will be trained on.

Byte-Pair Encoding is a subword tokenization method introduced in Neural Machine Translation of Rare Words with Subword Units by Sennrich et al. in 2015. BPE starts with a base vocabulary consisting of all unique characters in the training data and iteratively merges the most frequent pairs of symbols (which can be characters or sequences of characters) to form new subwords. This process continues until the vocabulary reaches a predefined size, which is a hyperparameter you choose before training.

Suppose we have the following words with their frequencies:

  • “hug” (10 occurrences)
  • “pug” (5 occurrences)
  • “pun” (12 occurrences)
  • “bun” (4 occurrences)
  • “hugs” (5 occurrences)

Our initial base vocabulary consists of the following characters: [“h”, “u”, “g”, “p”, “n”, “b”, “s”].

We split the words into individual characters:

  • “h” “u” “g” (hug)
  • “p” “u” “g” (pug)
  • “p” “u” “n” (pun)
  • “b” “u” “n” (bun)
  • “h” “u” “g” “s” (hugs)

Next, we count the frequency of each symbol pair:

  • “h u”: 15 times (from “hug” and “hugs”)
  • “u g”: 20 times (from “hug”, “pug”, “hugs”)
  • “p u”: 17 times (from “pug”, “pun”)
  • “u n”: 16 times (from “pun”, “bun”)

The most frequent pair is “u g” (20 times), so we merge “u” and “g” to form “ug” and update our words:

  • “h” “ug” (hug)
  • “p” “ug” (pug)
  • “p” “u” “n” (pun)
  • “b” “u” “n” (bun)
  • “h” “ug” “s” (hugs)

We continue this process, merging the next most frequent pairs, such as “u n” into “un”, until we reach our desired vocabulary size.
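
The pair counting and a single merge step above can be sketched directly in Python. This is a minimal illustration of the idea, not the implementation used by any particular library; the corpus mirrors the toy example.

from collections import Counter

# Toy corpus: each word is a tuple of symbols, mapped to its frequency
corpus = {
    ("h", "u", "g"): 10,
    ("p", "u", "g"): 5,
    ("p", "u", "n"): 12,
    ("b", "u", "n"): 4,
    ("h", "u", "g", "s"): 5,
}

def count_pairs(corpus):
    # Count adjacent symbol pairs, weighted by word frequency
    pairs = Counter()
    for symbols, freq in corpus.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(corpus, pair):
    # Replace every occurrence of `pair` with a single merged symbol
    merged = {}
    for symbols, freq in corpus.items():
        new_symbols, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                new_symbols.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                new_symbols.append(symbols[i])
                i += 1
        merged[tuple(new_symbols)] = merged.get(tuple(new_symbols), 0) + freq
    return merged

pairs = count_pairs(corpus)
best = max(pairs, key=pairs.get)
print(best, pairs[best])      # ('u', 'g') 20
corpus = merge_pair(corpus, best)
print(list(corpus.keys()))    # words now contain the merged symbol 'ug'

Repeating these two steps until the vocabulary reaches the target size is the essence of BPE training.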

BPE controls the vocabulary size by specifying the number of merge operations. Frequent words remain intact, reducing the need for extensive memorization, and rare or unseen words can be represented by combinations of known subwords. It’s used in models like GPT and RoBERTa.
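
For instance, GPT-2 ships with a pretrained byte-level BPE tokenizer that can be loaded through the transformers library; the exact token pieces depend on GPT-2’s learned merges, so this is only an illustration.

from transformers import AutoTokenizer

# Load GPT-2's pretrained byte-level BPE tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokens = tokenizer.tokenize("I have a new GPU!")
print(tokens)  # subword pieces; 'Ġ' marks a token that starts with a space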

The Hugging Face tokenizers library provides a fast and flexible way to train and use tokenizers, including BPE.

Training a BPE Tokenizer

Here’s how to train a BPE tokenizer on a sample dataset:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Initialize a tokenizer
tokenizer = Tokenizer(BPE())

# Set the pre-tokenizer to split on whitespace
tokenizer.pre_tokenizer = Whitespace()

# Initialize a trainer with the desired vocabulary size
trainer = BpeTrainer(vocab_size=1000, min_frequency=2, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

# Files to train on
files = ["path/to/your/dataset.txt"]

# Train the tokenizer
tokenizer.train(files, trainer)

# Save the tokenizer
tokenizer.save("bpe-tokenizer.json")

Using the trained BPE tokenizer:

from tokenizers import Tokenizer

# Load the tokenizer
tokenizer = Tokenizer.from_file("bpe-tokenizer.json")

# Encode a text input
encoded = tokenizer.encode("I have a new GPU!")
print("Tokens:", encoded.tokens)
print("IDs:", encoded.ids)

Example output (the exact tokens and IDs depend on the training data):

Tokens: ['I', 'have', 'a', 'new', 'GP', 'U', '!']
IDs: [12, 45, 7, 89, 342, 210, 5]

WordPiece is another subword tokenization algorithm, introduced by Schuster and Nakajima in 2012 and popularized by models like BERT. Similar to BPE, WordPiece starts with all unique characters but differs in how it selects which symbol pairs to merge.

Here’s how WordPiece works:

  1. Initialization: Start with a vocabulary of all unique characters.
  2. Pre-tokenization: Split the training text into words.
  3. Building the Vocabulary: Iteratively add new symbols (subwords) to the vocabulary.
  4. Selection Criterion: Instead of choosing the most frequent symbol pair, WordPiece selects the pair that maximizes the likelihood of the training data when added to the vocabulary.

Using the same word frequencies as before, WordPiece evaluates which symbol pair, when merged, would most increase the likelihood of the training data. This involves a more probabilistic approach compared to BPE’s frequency-based method.
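
One common way to express this criterion, described in Hugging Face’s tokenizer documentation, is to score each pair as freq(pair) / (freq(first) × freq(second)), so a pair is favored when its parts rarely occur apart. Here is a minimal sketch on the same toy corpus (an illustration, not any library’s actual implementation):

from collections import Counter

# Same toy corpus as in the BPE sketch above
corpus = {
    ("h", "u", "g"): 10,
    ("p", "u", "g"): 5,
    ("p", "u", "n"): 12,
    ("b", "u", "n"): 4,
    ("h", "u", "g", "s"): 5,
}

pair_freq, symbol_freq = Counter(), Counter()
for symbols, freq in corpus.items():
    for s in symbols:
        symbol_freq[s] += freq
    for pair in zip(symbols, symbols[1:]):
        pair_freq[pair] += freq

# WordPiece-style score: freq(pair) / (freq(first) * freq(second))
scores = {p: f / (symbol_freq[p[0]] * symbol_freq[p[1]]) for p, f in pair_freq.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 4))  # ('g', 's') 0.05, not ('u', 'g'), even though that pair is the most frequent

With this scoring, the rarer symbols “g” and “s” are merged before the very frequent “u g” pair, which is the key behavioral difference from BPE.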

Similar to BPE, we can train a WordPiece tokenizer using the tokenizers library.

Training a WordPiece Tokenizer

from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace

# Initialize a tokenizer
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

# Set the pre-tokenizer
tokenizer.pre_tokenizer = Whitespace()

# Initialize a trainer
trainer = WordPieceTrainer(vocab_size=1000, min_frequency=2, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

# Files to train on
files = ["path/to/your/dataset.txt"]

# Train the tokenizer
tokenizer.train(files, trainer)

# Save the tokenizer
tokenizer.save("wordpiece-tokenizer.json")

Using the trained WordPiece tokenizer:

from tokenizers import Tokenizer

# Load the tokenizer
tokenizer = Tokenizer.from_file("wordpiece-tokenizer.json")

# Encode a text input
encoded = tokenizer.encode("I have a new GPU!")
print("Tokens:", encoded.tokens)
print("IDs:", encoded.ids)

Example output (the exact tokens and IDs depend on the training data):

Tokens: ['I', 'have', 'a', 'new', 'G', '##PU', '!']
IDs: [10, 34, 5, 78, 301, 502, 8]

Tokenization is a foundational step in NLP that prepares text data for computational models. By understanding and implementing appropriate tokenization techniques, we enable models to process and generate human language more effectively, setting the stage for advanced topics like word embeddings and language modeling.

All the code in this article is also available on my GitHub repo: github.com/murilogustineli/nlp-medium

Other Resources

Unless otherwise noted, all images are created by the author.