
Lecture 1: Embedding Models


Tokenization Techniques:

  1. Character or Byte Based Tokenization
  2. Word Based Tokenization
  3. Sub-word Tokenization (e.g. Walker → Walk + er, Walked → Walk + ed)

Input of Embedding Model:

Text → Tokens (Tokenizer) → Sequence of IDs (ID per Token) → Embed Each Token (Sequence of Token Embeddings)
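
For intuition, a minimal sketch of this pipeline using the Hugging Face transformers library (bert-base-uncased is an assumed model choice, not one named in the lecture):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed model
text = "walker walked a long walk"
print(tokenizer.tokenize(text))  # sub-word tokens
print(tokenizer.encode(text))    # sequence of IDs (with [CLS]/[SEP] added)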

Output of Embedding Model:

Sequence of Token Embeddings → Add Order through Positional Encoding → Process through Stacked NN Modules.
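
A minimal sketch of this stage in PyTorch (the sizes below are assumed, BERT-like values, not from the lecture):

import torch
import torch.nn as nn

vocab_size, d_model, max_len = 30522, 768, 512  # assumed BERT-like sizes
token_emb = nn.Embedding(vocab_size, d_model)   # lookup: token ID -> vector
pos_emb = nn.Embedding(max_len, d_model)        # learned positional encoding

ids = torch.tensor([[101, 5598, 2063, 102]])    # hypothetical ID sequence
positions = torch.arange(ids.size(1)).unsqueeze(0)
x = token_emb(ids) + pos_emb(positions)         # order-aware token embeddings
print(x.shape)  # torch.Size([1, 4, 768])
# x then flows through the stacked Transformer blocks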

Groups of Tokens in our Vocabulary:

  1. 1st group: Technical tokens specific to the model (e.g. [CLS], [SEP], etc.)
  2. 2nd group: Sub-word tokens (with the ## prefix)
  3. 3rd group: Word-initial tokens and whole words (starting with anything except ##)
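
These three groups can be seen by inspecting a real vocabulary (a sketch assuming bert-base-uncased):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed model
vocab = tokenizer.get_vocab()
print([t for t in vocab if t.startswith("[")][:5])              # technical tokens, e.g. [CLS], [SEP]
print([t for t in vocab if t.startswith("##")][:5])             # sub-word tokens with ## prefix
print([t for t in vocab if not t.startswith(("[", "##"))][:5])  # everything else: word-initial tokens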

Lecture 2: Role of Tokenizers


Tokenizer Encoding Techniques:

  1. BPE (Byte Pair Encoding)
  2. WordPiece
  3. Unigram
  4. SentencePiece (can tokenize multiple words together, e.g. Real Madrid can be 1 token, not 2)

Common Embedding Models Uses:

(figure: common embedding model uses)

1. Byte Pair Encoding:

Steps:

  1. Split words on whitespace, then divide each word into characters or bytes → character-level tokens
  2. Merge the most frequent pair of adjacent tokens into a single new token
  3. Iterate the merging process until the specified vocabulary size is reached

<aside> ❗

The vocab_size is a parameter you specify when training your tokenizer model.

When training LLMs, its value is typically in the tens of thousands.

</aside>

Code:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

training_data = ["walker walked a long walk"]

# BPE model wrapped in a Tokenizer, with whitespace pre-tokenization
bpe_tokenizer = Tokenizer(BPE())
bpe_tokenizer.pre_tokenizer = Whitespace()

# The trainer controls the merge process; vocab_size caps the vocabulary
bpe_trainer = BpeTrainer(vocab_size=14)

# Training process: learn merges from the corpus
bpe_tokenizer.train_from_iterator(training_data, bpe_trainer)
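
A quick usage sketch to inspect the result (the exact tokens depend on the learned merges):

print(bpe_tokenizer.get_vocab())                     # the 14 learned tokens and their IDs
print(bpe_tokenizer.encode("walker walked").tokens)  # e.g. ['walk', 'er', 'walk', 'ed']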

<aside> ❗

tokenizers is a library by Hugging Face that implements many tokenization techniques and makes it easy to train tokenizer models.

</aside>

Example:

Using “walker walked a long walk” as the training example:

(figure: BPE merge steps on the training example)

2. WordPiece:

Steps:

  1. Differentiate word-initial characters from word-internal ones (word-internal tokens get the ## prefix)

  2. Merge the pair of tokens that maximizes the score:

    $\text{score}(u, v) = \frac{\text{frequency}(uv)}{\text{frequency}(u) \times \text{frequency}(v)}$

    (pairs whose merged form is frequent relative to their individual parts are merged first)

  3. Iterate the process until we reach the vocabulary size

Code:

from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer

unk_token = "[UNK]"  # for representing unknown tokens/characters

wordpiece_model = WordPiece(unk_token=unk_token)
wordpiece_tokenizer = Tokenizer(wordpiece_model)
wordpiece_tokenizer.pre_tokenizer = Whitespace()
wordpiece_trainer = WordPieceTrainer(
    vocab_size=28,
    special_tokens=[unk_token]
)

# Training process
wordpiece_tokenizer.train_from_iterator(training_data, wordpiece_trainer)
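
A quick usage sketch on the trained tokenizer (the exact sub-words depend on the merges learned from such a tiny corpus):

encoded = wordpiece_tokenizer.encode("walker walked")
print(encoded.tokens)  # e.g. ['walk', '##er', 'walk', '##ed']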

Example:

Using “walker walked a long walk” as the training set:

(figure: WordPiece merges on the training example)

3. Unigram:

Steps:

  1. Unigram starts with a huge candidate vocabulary and trims it down: tokens whose removal least reduces the likelihood of the training corpus are dropped, iterating until the target vocabulary size is reached.

Code:

from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer

unigram_tokenizer = Tokenizer(Unigram())
unigram_tokenizer.pre_tokenizer = Whitespace()
unigram_trainer = UnigramTrainer(
    vocab_size=14,
    special_tokens=[unk_token],
    unk_token=unk_token,
)

# Training process, then inspect the learned vocabulary
unigram_tokenizer.train_from_iterator(training_data, unigram_trainer)
unigram_tokenizer.get_vocab()

Example of one word:

(figure: Unigram segmentation candidates for one word)

<aside> ❗

Unigram training is resource-intensive, since it has to consider the many possible segmentations of each word.

</aside>

Lesson 3: Practical Implications of Tokenization


See the Notebook to follow the implementation.

https://github.com/baselhusam/DeepLearning.AI-Short-Courses/blob/main/Qdrant Retrieval Optimization Tokenization to Vector Quantization/Lesson_3.ipynb

Lesson 4: Measuring Search Relevance


To improve the retrieval quality, you should measure it first.

You can’t Improve what you don’t Measure

Building Ground Truth:

Kinds of Metrics:

  1. Relevancy Based Metrics
  2. Ranking Related Metrics
  3. Score Related Metrics

Relevancy Based Metrics:

1. Precision@k

$\text{precision@k} = \frac{\lvert \text{relevant documents in the top } k \text{ results} \rvert}{k}$

2. Recall@k

$\text{recall@k} = \frac{\lvert \text{relevant documents in the top } k \text{ results} \rvert}{\lvert \text{all relevant documents} \rvert}$
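
A minimal sketch of both metrics (the document IDs and relevance judgments below are hypothetical):

def precision_at_k(retrieved, relevant, k):
    # fraction of the top-k results that are relevant
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    # fraction of all relevant documents found in the top-k results
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

retrieved = ["d1", "d7", "d3", "d9"]   # hypothetical ranked results
relevant = {"d1", "d3", "d5"}          # hypothetical ground-truth set
print(precision_at_k(retrieved, relevant, k=3))  # 2/3 ≈ 0.67
print(recall_at_k(retrieved, relevant, k=3))     # 2/3 ≈ 0.67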

Ranking Related Metrics:

1. Mean Reciprocal Rank (MRR).

Mean Reciprocal Rank (MRR) evaluates the effectiveness of a system by calculating the average of the reciprocal ranks of the first relevant result across all queries

$\text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}$
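
A minimal sketch (the queries and relevance sets are hypothetical):

def mrr(rankings, relevant_sets):
    # average of 1/rank of the first relevant result per query
    total = 0.0
    for ranked, relevant in zip(rankings, relevant_sets):
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(rankings)

rankings = [["d2", "d1"], ["d5", "d4", "d3"]]  # hypothetical results per query
relevant_sets = [{"d1"}, {"d3"}]               # hypothetical ground truth
print(mrr(rankings, relevant_sets))            # (1/2 + 1/3) / 2 ≈ 0.417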

Score Related Metrics:

1. Discounted Cumulative Gain (DCG)

Discounted Cumulative Gain (DCG) assigns higher importance to documents appearing earlier in the list by incorporating a logarithmic discount factor to reflect diminishing relevance with position
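
The standard formulation, where $\text{rel}_i$ is the graded relevance of the result at position $i$:

$\text{DCG@k} = \sum_{i=1}^{k} \frac{\text{rel}_i}{\log_2(i+1)}$

A minimal sketch (the relevance scores are hypothetical):

import math

def dcg_at_k(relevance_scores, k):
    # position 1 gets full weight; later positions are discounted by log2(i + 1)
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevance_scores[:k], start=1))

print(dcg_at_k([3, 2, 3, 0, 1], k=5))  # ≈ 6.15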