Walker Walked → Walk, er, Walk, ed
Text → Tokens (Tokenizer) → Sequence of IDs (ID per Token) → Embed Each Token (Sequence of Token Embeddings)
Sequence of Token Embeddings → Add Order through Positional Encoding → Process through Stacked NN Modules.
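A minimal sketch of this pipeline, assuming a toy vocabulary and a random embedding table (every name and value below is made up for illustration):

```python
import numpy as np

# Toy vocabulary: one ID per token (invented for illustration)
toy_vocab = {"Walk": 0, "er": 1, "ed": 2}

tokens = ["Walk", "er", "Walk", "ed"]          # "Walker Walked" -> tokens
token_ids = [toy_vocab[t] for t in tokens]     # sequence of IDs: [0, 1, 0, 2]

# Embed each token by looking up a row in a (vocab_size x dim) table
embedding_dim = 4
embedding_table = np.random.rand(len(toy_vocab), embedding_dim)
token_embeddings = embedding_table[token_ids]  # sequence of token embeddings

# Add order with a simplified sinusoidal positional signal
# (not the exact Transformer formula, just the idea)
positions = np.arange(len(token_ids))[:, None]
dims = np.arange(embedding_dim)[None, :]
positional_encoding = np.sin(positions / 10_000 ** (dims / embedding_dim))

model_input = token_embeddings + positional_encoding  # fed to the stacked NN modules
```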
Tokens that continue a word are marked with the ## prefix.
Real Madrid should be 1 token, not 2.
Vocabulary size
<aside> ❗
The vocab_size is a parameter you specify when training your tokenizer model.
When training tokenizers for LLMs, its value is typically in the tens of thousands.
</aside>
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

training_data = ["walker walked a long walk"]

bpe_tokenizer = Tokenizer(BPE())
bpe_tokenizer.pre_tokenizer = Whitespace()
bpe_trainer = BpeTrainer(vocab_size=14)

# Training process
bpe_tokenizer.train_from_iterator(training_data, bpe_trainer)
<aside> ❗
tokenizers
is a library by Hugging Face that collects many tokenization techniques and makes them ready for training tokenizer models.
</aside>
Using the walker walked a long walk
as the training example.
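As a quick sanity check (a sketch: the exact merges learned can vary with vocab_size), you can encode the training sentence with the trained BPE tokenizer and inspect the output:

```python
encoding = bpe_tokenizer.encode("walker walked a long walk")
print(encoding.tokens)  # subword tokens produced by the learned BPE merges
print(encoding.ids)     # one ID per token
```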
Differentiate letters at the start of a word from letters in the middle of a word (the latter get the ## prefix).
Merge the pair of tokens that maximizes the score (a toy numeric example follows this list):
$\text{score}(u, v) = \frac{\text{frequency}(uv)}{\text{frequency}(u) \times \text{frequency}(v)}$
Iterate the process until we reach the vocabulary size
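Below is a toy illustration of this score with made-up frequencies (the counts are invented, not taken from the training sentence):

```python
def merge_score(freq_uv: int, freq_u: int, freq_v: int) -> float:
    # score(u, v) = frequency(uv) / (frequency(u) * frequency(v))
    return freq_uv / (freq_u * freq_v)

# Hypothetical counts: the pair "al" appears 3 times, "a" 5 times, "l" 4 times
print(merge_score(3, 5, 4))  # 0.15 -> compared against the other candidate pairs
```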
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer

unk_token = "[UNK]"  # For specifying unknown tokens/letters
wordpiece_model = WordPiece(unk_token=unk_token)
wordpiece_tokenizer = Tokenizer(wordpiece_model)
wordpiece_tokenizer.pre_tokenizer = Whitespace()
wordpiece_trainer = WordPieceTrainer(
    vocab_size=28,
    special_tokens=[unk_token],
)

# Training process
wordpiece_tokenizer.train_from_iterator(training_data, wordpiece_trainer)
Using the walker walked a long walk
as the training set
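Similarly, a sketch of inspecting the trained WordPiece tokenizer; the exact subwords depend on the training run, but continuation tokens will carry the ## prefix:

```python
encoding = wordpiece_tokenizer.encode("walker walked a long walk")
print(encoding.tokens)  # subwords that continue a word appear with the "##" prefix
```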
from tokenizers.trainers import UnigramTrainer
from tokenizers.models import Unigram
unigram_tokenizer = Tokenizer(Unigram())
unigram_tokenizer.pre_tokenizer = Whitespace()
unigram_trainer = UnigramTrainer(
vocab_size=14,
special_tokens=[unk_token],
unk_token=unk_token,
)
unigram_tokenizer.train_from_iterator(training_data, unigram_trainer)
unigram_tokenizer.get_vocab()
<aside> ❗
Unigram training is resource-intensive, since it has to consider many possible segmentations of each word.
</aside>
See the Notebook to follow the implementation.
To improve the retrieval quality, you should measure the quality first:
You can’t Improve what you don’t Measure
$\text{precision@k} = \frac{\lvert \text{relevant documents in the top } k \text{ results} \rvert}{k}$
$\text{recall@k} = \frac{\lvert \text{relevant documents in the top } k \text{ results} \rvert}{\lvert \text{all relevant documents} \rvert}$
Mean Reciprocal Rank (MRR) evaluates the effectiveness of a system by calculating the average of the reciprocal ranks of the first relevant result across all queries
$\text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}$
Discounted Cumulative Gain (DCG) assigns higher importance to documents appearing earlier in the list by incorporating a logarithmic discount factor to reflect diminishing relevance with position
$\text{DCG@k} = \sum_{i=1}^{k} \frac{\text{rel}_i}{\log_2(i + 1)}$
Normalized Discounted Cumulative Gain (nDCG) calculates the ratio of the DCG to the ideal DCG (IDCG), normalizing the score to evaluate the relevance of results against the perfect ranking
$\text{nDCG@k} = \frac{\text{DCG@k}}{\text{IDCG@k}}$
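A small from-scratch sketch of these formulas on a made-up ranking (the relevance labels and counts below are invented for illustration):

```python
import math

# Hypothetical binary relevance of the top-5 retrieved documents (1 = relevant)
retrieved = [1, 0, 1, 0, 0]
total_relevant = 3   # relevant documents that exist for this query
k = 5

precision_at_k = sum(retrieved[:k]) / k            # 2 / 5 = 0.4
recall_at_k = sum(retrieved[:k]) / total_relevant  # 2 / 3 ≈ 0.67

# Reciprocal rank of the first relevant result (MRR averages this over all queries)
reciprocal_rank = next((1 / (i + 1) for i, rel in enumerate(retrieved) if rel), 0.0)

# DCG@k with the log2 position discount, and nDCG@k against the ideal
# ordering of these same relevance labels
dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(retrieved[:k]))
idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(sorted(retrieved, reverse=True)[:k]))
ndcg = dcg / idcg

print(precision_at_k, recall_at_k, reciprocal_rank, round(ndcg, 3))
```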
All these metrics can be calculated with the ranx
Python package.
ranx
has 2 main components for comparing: Qrels (the ground-truth relevance judgments) and Run (the ranked results produced by a retriever).
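A minimal sketch of evaluating a single run with ranx (the query ID, document IDs, scores, and relevance labels below are invented):

```python
from ranx import Qrels, Run, evaluate

# Ground truth: which documents are relevant for each query
qrels = Qrels({"q_1": {"doc_a": 1, "doc_c": 1}})

# Retriever output: ranked documents with their scores
run = Run({"q_1": {"doc_a": 0.9, "doc_b": 0.7, "doc_c": 0.4}})

print(evaluate(qrels, run, ["precision@2", "recall@2", "mrr", "ndcg@2"]))
```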
HNSW (Hierarchical Navigable Small Worlds) is the most commonly used algorithm for approximate nearest neighbor search.
It is a multi-layer graph of vectors in which connections are created between the closest points.
m
parameter: defines how many edges each node should have.
Increasing its value → better search precision, BUT it impacts latency.
ef
parameter: