How to Use the Hugging Face Tokenizers Library to Preprocess Text Data

Image by Author

 

If you have studied NLP, you might have heard of the term "tokenization." It is a crucial step in text preprocessing, where we transform our textual data into something that machines can understand. It does so by breaking a sentence down into smaller chunks, known as tokens. These tokens can be words, subwords, or even characters, depending on the tokenization algorithm being used. In this article, we will see how to use the Hugging Face Tokenizers library to preprocess our text data.

 

Setting Up the Hugging Face Tokenizers Library

 

To start using the Hugging Face Tokenizers library, you'll need to install it first. You can do this using pip:
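pip install tokenizers transformers

(The examples that follow also import from the Transformers library, so both packages are installed here.)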

 

The Hugging Face library supports various tokenization algorithms, but the three main types are the following (a minimal training example comes right after the list):

  • Byte-Pair Encoding (BPE): Iteratively merges the most frequent pairs of characters or subwords, creating a compact vocabulary. It is used by models like GPT-2.
  • WordPiece: Similar to BPE, but focuses on probabilistic merges (it does not choose the most frequent pair, but the one that maximizes the likelihood of the corpus once merged). It is commonly used by models like BERT.
  • SentencePiece: A more flexible tokenizer that can handle different languages and scripts, often used with models like ALBERT, XLNet, or the Marian framework. It treats spaces as characters rather than word separators.
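
To make these algorithms a bit more concrete, here is a minimal sketch that trains a tiny BPE tokenizer from scratch using the Tokenizers training API; the in-memory corpus and the vocab_size value are purely illustrative:

# A minimal sketch: train a small BPE tokenizer on a toy in-memory corpus.
# The corpus and vocab_size here are illustrative, not realistic settings.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

bpe_tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
bpe_tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=1000, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]"])
bpe_tokenizer.train_from_iterator(
    ["Tokenization is a crucial step in NLP.", "Tokenizers build vocabularies from raw text."],
    trainer=trainer,
)
print(bpe_tokenizer.encode("Tokenization is a crucial step in NLP.").tokens)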

The Hugging Face Transformers library provides an AutoTokenizer class that can automatically select the best tokenizer for a given pre-trained model. It is a convenient way to use the correct tokenizer for a specific model and can be imported from the transformers library. However, for the sake of our discussion of the Tokenizers library, we will not follow this approach.
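
For reference, the AutoTokenizer route would look roughly like this (we will not use it in the rest of the article):

# For reference only: AutoTokenizer picks the tokenizer class that matches
# the given checkpoint name.
from transformers import AutoTokenizer

auto_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(auto_tokenizer("This is sample text to test tokenization."))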

We will use the pre-trained bert-base-uncased tokenizer. This tokenizer was trained on the same data and with the same techniques as the bert-base-uncased model, which means it can be used to preprocess text data compatible with BERT models:

# Import the required components
from tokenizers import Tokenizer
from transformers import BertTokenizer

# Load the pre-trained bert-base-uncased tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
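
As a quick sanity check, you can inspect a couple of the loaded tokenizer's properties (attribute names come from the Transformers tokenizer API):

# Quick sanity check on the loaded tokenizer
print(tokenizer.vocab_size)        # 30522 for bert-base-uncased
print(tokenizer.model_max_length)  # 512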

 

Single Sentence Tokenization

 

Now, let's encode a simple sentence using this tokenizer:

# Tokenize a single sentence
encoded_input = tokenizer.encode_plus("This is sample text to test tokenization.")
print(encoded_input)

 

Output:

{'input_ids': [101, 2023, 2003, 7099, 3793, 2000, 3231, 19204, 3989, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

 

To verify correctness, let's decode the tokenized input:

tokenizer.decode(encoded_input["input_ids"])

 

Output:

[CLS] this is sample text to test tokenization. [SEP]

 

In this output, you can see two special tokens. [CLS] marks the start of the input sequence, and [SEP] marks its end, indicating a single sequence of text.
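
You can confirm which IDs these special tokens map to by inspecting the tokenizer directly:

# Map the special tokens to their IDs and view the tokens behind each ID
print(tokenizer.cls_token, tokenizer.cls_token_id)  # [CLS] 101
print(tokenizer.sep_token, tokenizer.sep_token_id)  # [SEP] 102
print(tokenizer.convert_ids_to_tokens(encoded_input["input_ids"]))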

 

Batch Tokenization

 

Now, let's tokenize a corpus of text instead of a single sentence using batch_encode_plus:

corpus = [
    "Hello, how are you?",
    "I am learning how to use the Hugging Face Tokenizers library.",
    "Tokenization is a crucial step in NLP."
]
encoded_corpus = tokenizer.batch_encode_plus(corpus)
print(encoded_corpus)

 

Output:

{'input_ids': [[101, 7592, 1010, 2129, 2024, 2017, 1029, 102], [101, 1045, 2572, 4083, 2129, 2000, 2224, 1996, 17662, 2227, 19204, 17629, 2015, 3075, 1012, 102], [101, 19204, 3989, 2003, 1037, 10232, 3357, 1999, 17953, 2361, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

 

For a better understanding, let's decode the batch-encoded corpus as we did with the single sentence. This will show the original sentences, tokenized appropriately:

tokenizer.batch_decode(encoded_corpus["input_ids"])

 

Output:

['[CLS] hello, how are you? [SEP]',
 '[CLS] i am learning how to use the hugging face tokenizers library. [SEP]',
 '[CLS] tokenization is a crucial step in nlp. [SEP]']

 

Padding and Truncation

 

When preparing data for machine learning models, it is often necessary to ensure that all input sequences have the same length. Two techniques to accomplish this are:

 

1. Padding

Padding works by adding the special token [PAD] at the end of shorter sequences to match the length of the longest sequence in the batch, or the maximum length supported by the model if max_length is defined. You can do this as follows:

encoded_corpus_padded = tokenizer.batch_encode_plus(corpus, padding=True)
print(encoded_corpus_padded)

 

Output:

{'input_ids': [[101, 7592, 1010, 2129, 2024, 2017, 1029, 102, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1045, 2572, 4083, 2129, 2000, 2224, 1996, 17662, 2227, 19204, 17629, 2015, 3075, 1012, 102], [101, 19204, 3989, 2003, 1037, 10232, 3357, 1999, 17953, 2361, 1012, 102, 0, 0, 0, 0]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]]}

 

Now you can see that extra 0s have been added, but for a better understanding, let's decode the result to see where the tokenizer has placed the [PAD] tokens:

tokenizer.batch_decode(encoded_corpus_padded["input_ids"], skip_special_tokens=False)

 

Output:

['[CLS] hello, how are you? [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]',
 '[CLS] i am learning how to use the hugging face tokenizers library. [SEP]',
 '[CLS] tokenization is a crucial step in nlp. [SEP] [PAD] [PAD] [PAD] [PAD]']

 

2. Truncation

Many NLP models have a maximum input sequence length, and truncation works by cutting off the end of longer sequences to meet this maximum. It reduces memory usage and prevents the model from being overwhelmed by very long input sequences.

encoded_corpus_truncated = tokenizer.batch_encode_plus(corpus, truncation=True, max_length=5)
print(encoded_corpus_truncated)

 

Output:

{'input_ids': [[101, 7592, 1010, 2129, 102], [101, 1045, 2572, 4083, 102], [101, 19204, 3989, 2003, 102]], 'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1, 1]]}

 

You can also use the batch_decode method here, but for a better understanding, let's print this information differently:

for i, sentence in enumerate(corpus):
    print(f"Unique sentence: {sentence}")
    print(f"Token IDs: {encoded_corpus_truncated['input_ids'][i]}")
    print(f"Tokens: {tokenizer.convert_ids_to_tokens(encoded_corpus_truncated['input_ids'][i])}")
    print()

 

Output:

Original sentence: Hello, how are you?
Token IDs: [101, 7592, 1010, 2129, 102]
Tokens: ['[CLS]', 'hello', ',', 'how', '[SEP]']

Original sentence: I am learning how to use the Hugging Face Tokenizers library.
Token IDs: [101, 1045, 2572, 4083, 102]
Tokens: ['[CLS]', 'i', 'am', 'learning', '[SEP]']

Original sentence: Tokenization is a crucial step in NLP.
Token IDs: [101, 19204, 3989, 2003, 102]
Tokens: ['[CLS]', 'token', '##ization', 'is', '[SEP]']
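
In practice, you will often combine padding and truncation in a single call and ask the tokenizer to return framework tensors ready to feed into a model. A minimal sketch, assuming PyTorch is installed:

# Pad and truncate in one call and return PyTorch tensors (requires torch)
batch = tokenizer(corpus, padding=True, truncation=True, max_length=16, return_tensors="pt")
print(batch["input_ids"].shape)       # one row per sentence, padded to a common length
print(batch["attention_mask"].shape)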

 

This article is part of our series on Hugging Face. If you want to explore more about this topic, here are some references to help you out:

 
 

Kanwal Mehreen is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the ebook "Maximizing Productivity with ChatGPT". As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She's also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.
