
Lecture 1: Embedding Models


Tokenization Techniques:

  1. Character or Byte Based Tokenization
  2. Word Based Tokenization
  3. Sub-word Tokenization (e.g. Walker → Walk + er, Walked → Walk + ed)

Input of Embedding Model:

Text → Tokens (Tokenizer) → Sequence of IDs (ID per Token) → Embed Each Token (Sequence of Token Embeddings)

Output of Embedding Model:

Sequence of Token Embeddings → Add Order through Positional Encoding → Process through Stacked NN Modules.
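
A minimal sketch of this pipeline, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (both are illustrative choices, not specified in the lecture):

from transformers import AutoTokenizer, AutoModel

# Assumed checkpoint; any BERT-like encoder illustrates the same pipeline
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

text = "walker walked a long walk"

# Text -> tokens -> sequence of IDs
tokens = tokenizer.tokenize(text)              # e.g. ['walker', 'walked', 'a', 'long', 'walk']
inputs = tokenizer(text, return_tensors="pt")  # token IDs plus attention mask

# IDs -> token embeddings -> contextual embeddings (positional encoding and
# the stacked NN modules are applied inside the model)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)         # (batch, sequence_length, hidden_size)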

Groups of Tokens in our Vocabulary:

  1. 1st group: Technical Tokens Specific for the Model (e.g. [CLS], [SEP], etc.)
  2. 2nd group: Sub-word Tokens (with ## prefix)
  3. 3rd group: Prefixes and words starting with anything except ##
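
A quick way to see these groups, assuming the bert-base-uncased WordPiece vocabulary loaded through transformers (an illustrative choice, not from the lecture):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
vocab = tokenizer.get_vocab()

special  = [t for t in vocab if t.startswith("[") and t.endswith("]")]        # [CLS], [SEP], [PAD], ...
subwords = [t for t in vocab if t.startswith("##")]                           # continuation sub-words
words    = [t for t in vocab if not t.startswith("##") and t not in special]  # word-initial pieces / whole words

print(special[:5], subwords[:5], words[:5])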

Lecture 2: Role of Tokenizers


Tokenizer Encoding Techniques:

  1. BPE (Byte Pair Encoding)
  2. WordPiece
  3. Unigram
  4. SentencePiece (can tokenize multiple words together, e.g. "Real Madrid" should be 1 token, not 2)

Common Embedding Models Uses:

[image]

1. Byte Pair Encoding:

Steps:

  1. Split the text on whitespace, then divide each word into characters or bytes → character tokens
  2. Find the most frequent pair of adjacent tokens and merge it into a single new token.
  3. Iterate the process until we reach the specified vocabulary size

<aside> ❗

The vocab_size is a parameter you specify when training your tokenizer model.

When training LLMs, its value is typically in the tens of thousands.

</aside>
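
To make the merge step concrete, here is a rough pure-Python sketch of a single BPE merge; the helper names (most_frequent_pair, merge_pair) are illustrative, not from the course, and the library-based version used in the lesson follows below:

from collections import Counter

corpus = [list("walker"), list("walked"), list("a"), list("long"), list("walk")]

def most_frequent_pair(words):
    # Count every adjacent pair of tokens across the corpus
    pairs = Counter()
    for word in words:
        for left, right in zip(word, word[1:]):
            pairs[(left, right)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    # Replace every occurrence of the pair with a single merged token
    merged = []
    for word in words:
        new_word, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged.append(new_word)
    return merged

pair = most_frequent_pair(corpus)   # e.g. ('w', 'a')
corpus = merge_pair(corpus, pair)   # 'w' + 'a' -> 'wa' everywhere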

Code:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

training_data = ["walker walked a long walk"]

# BPE model wrapped in a Tokenizer; the pre-tokenizer splits on whitespace first
bpe_tokenizer = Tokenizer(BPE())
bpe_tokenizer.pre_tokenizer = Whitespace()

# Trainer that stops merging once the vocabulary reaches 14 tokens
bpe_trainer = BpeTrainer(vocab_size=14)

# Training process
bpe_tokenizer.train_from_iterator(training_data, bpe_trainer)
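
A small usage sketch after training (the exact tokens and IDs depend on the learned merges):

encoding = bpe_tokenizer.encode("walker walked a long walk")
print(encoding.tokens)             # learned sub-word tokens
print(encoding.ids)                # their IDs in the trained vocabulary
print(bpe_tokenizer.get_vocab())   # full vocabulary (at most 14 entries)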

<aside> ❗

tokenizers is a library by Hugging Face that implements many tokenization algorithms and makes it easy to train tokenizer models.

</aside>

Example:

Using "walker walked a long walk" as the training example.

[image: BPE merges learned from the training example]

2. WordPiece:

Steps:

  1. Differentiate the first letter of a word from the letters inside it (inner pieces get the ## prefix)

  2. Merge tokens that maximize the score:

    $\text{score}(u, v) = \frac{\text{frequency}(uv)}{\text{frequency}(u) \times \text{frequency}(v)}$

  3. Iterate the process until we reach the vocabulary size
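
For instance, with hypothetical counts $\text{frequency}(uv) = 3$, $\text{frequency}(u) = 3$, and $\text{frequency}(v) = 4$, the score is $3 / (3 \times 4) = 0.25$; pairs whose parts rarely appear outside the pair score higher and are merged first.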

Code:

from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer

unk_token = "[UNK]"  # -> for specifying unknown tokens/letters

wordpiece_model = WordPiece(unk_token=unk_token)
wordpiece_tokenizer = Tokenizer(wordpiece_model)
wordpiece_tokenizer.pre_tokenizer = Whitespace()
wordpiece_trainer = WordPieceTrainer(
    vocab_size=28,
    special_tokens=[unk_token]
)

# Training process
wordpiece_tokenizer.train_from_iterator(training_data, wordpiece_trainer)
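
A usage sketch after training (the exact splits depend on the learned vocabulary):

encoding = wordpiece_tokenizer.encode("walker walked a long walk")
print(encoding.tokens)  # continuation pieces carry the ## prefix, e.g. '##er', '##ed'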

Example:

Using "walker walked a long walk" as the training set.

[image: WordPiece merges learned from the training set]

3. Unigram:

Steps:

  1. Unigram starts with a huge vocabulary that is iteratively trimmed down until the target vocabulary size is reached; each word keeps its most probable segmentation.

Code:

from tokenizers.trainers import UnigramTrainer
from tokenizers.models import Unigram

unigram_tokenizer = Tokenizer(Unigram())
unigram_tokenizer.pre_tokenizer = Whitespace()
unigram_trainer = UnigramTrainer(
    vocab_size=14, 
    special_tokens=[unk_token],
    unk_token=unk_token,
)

unigram_tokenizer.train_from_iterator(training_data, unigram_trainer)
unigram_tokenizer.get_vocab()
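
A usage sketch (the resulting segmentation depends on the probabilities the trainer learned):

encoding = unigram_tokenizer.encode("walker walked a long walk")
print(encoding.tokens)  # the highest-probability segmentation of each word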

Example of one word:

[image: Unigram segmentations considered for one word]

<aside> ❗

Unigram is resource-intensive, since there are many possible segmentations for each word.

</aside>

Lesson 3: Practical Implications of Tokenization


See the Notebook to follow the implementation.

https://github.com/baselhusam/DeepLearning.AI-Short-Courses/blob/main/Qdrant Retrieval Optimization Tokenization to Vector Quantization/Lesson_3.ipynb

Lesson 4: Measuring Search Relevance


To improve the retrieval quality, you should first measure it.

You can’t Improve what you don’t Measure

Building Ground Truth:

Kinds of Metrics:

  1. Relevancy Based Metrics
  2. Ranking Related Metrics
  3. Score Related Metrics

Relevancy Based Metrics:

1. Precision@k

$\text{precision@k} = \frac{\lvert \text{relevant documents in the top } k \text{ results} \rvert}{k}$

2. Recall@k

$\text{recall@k} = \frac{\lvert \text{relevant documents in the top } k \text{ results} \rvert}{\lvert \text{all relevant documents} \rvert}$
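
A minimal sketch of both metrics in plain Python (function names and data are illustrative):

def precision_at_k(retrieved, relevant, k):
    # Fraction of the top-k results that are relevant
    top_k = retrieved[:k]
    return len([d for d in top_k if d in relevant]) / k

def recall_at_k(retrieved, relevant, k):
    # Fraction of all relevant documents that appear in the top-k results
    top_k = retrieved[:k]
    return len([d for d in top_k if d in relevant]) / len(relevant)

retrieved = ["d1", "d7", "d3", "d9", "d2"]
relevant = {"d1", "d2", "d4"}
print(precision_at_k(retrieved, relevant, 5))  # 2 relevant in top 5 -> 0.4
print(recall_at_k(retrieved, relevant, 5))     # 2 of 3 relevant found -> 0.67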

Ranking Related Metrics:

1. Mean Reciprocal Rank (MRR).

Mean Reciprocal Rank (MRR) evaluates the effectiveness of a system by calculating the average of the reciprocal ranks of the first relevant result across all queries

$\text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}$
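
A sketch of MRR in plain Python (the data is made up for illustration):

def mrr(ranked_results, relevant_per_query):
    # Average of 1/rank of the first relevant document for each query
    reciprocal_ranks = []
    for query, results in ranked_results.items():
        rr = 0.0
        for rank, doc in enumerate(results, start=1):
            if doc in relevant_per_query[query]:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

ranked = {"q1": ["d3", "d1", "d2"], "q2": ["d5", "d6"]}
relevant = {"q1": {"d1"}, "q2": {"d5"}}
print(mrr(ranked, relevant))  # (1/2 + 1/1) / 2 = 0.75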

Score Related Metrics:

1. Discounted Cumulative Gain (DCG)

Discounted Cumulative Gain (DCG) assigns higher importance to documents appearing earlier in the list by incorporating a logarithmic discount factor to reflect diminishing relevance with position

$\text{DCG@k} = \sum_{i=1}^{k} \frac{\text{rel}_i}{\log_2(i + 1)}$

2. Normalized Discounted Cumulative Gain (nDCG)

Normalized Discounted Cumulative Gain (nDCG) calculates the ratio of the DCG to the ideal DCG (IDCG), normalizing the score to evaluate the relevance of results against the perfect ranking

$\text{nDCG@k} = \frac{\text{DCG@k}}{\text{IDCG@k}}$
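
A sketch of DCG@k and nDCG@k following the formulas above (the relevance scores are made up):

import math

def dcg_at_k(relevances, k):
    # Sum of rel_i / log2(i + 1) over the top-k positions (i starts at 1)
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    # DCG normalized by the DCG of the ideal (descending) ordering
    ideal = sorted(relevances, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0

rels = [3, 2, 0, 1]        # graded relevance of the returned documents, in rank order
print(dcg_at_k(rels, 4))   # 3/log2(2) + 2/log2(3) + 0 + 1/log2(5) ≈ 4.69
print(ndcg_at_k(rels, 4))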

Ranx

All these metrics can be calculated with the ranx Python package.

ranx has 2 main components for comparison:

  1. Qrels: the ground-truth relevance judgments for each query
  2. Runs: the retrievals from the vector DB
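
A small usage sketch of ranx (the queries, documents, and scores below are made up; metric names follow the ranx convention):

from ranx import Qrels, Run, evaluate

# Ground truth: which documents are relevant for each query
qrels = Qrels({
    "q_1": {"doc_12": 1, "doc_25": 1},
    "q_2": {"doc_11": 1},
})

# Retrievals from the vector DB, with their similarity scores
run = Run({
    "q_1": {"doc_12": 0.91, "doc_33": 0.84, "doc_25": 0.76},
    "q_2": {"doc_45": 0.88, "doc_11": 0.80},
})

print(evaluate(qrels, run, ["precision@3", "recall@3", "mrr", "ndcg@3"]))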

Lecture 5: Optimizing HNSW Search


HNSW (Hierarchical Navigable Small World) is the most commonly used algorithm for approximate nearest neighbor search.

It is a multi-layer graph of vectors in which edges are created between the closest points.

[image: HNSW multi-layer graph]

Two Main Parameters in the HNSW:

1. The m parameter:

Defines how many edges each node should have.

Increasing its value → better search precision, but it impacts latency.
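
For example, assuming Qdrant (the vector DB used in this course), m can be set per collection through hnsw_config; the specific values below are illustrative:

from qdrant_client import QdrantClient, models

client = QdrantClient(":memory:")  # illustrative in-memory instance

client.create_collection(
    collection_name="documents",
    vectors_config=models.VectorParams(size=384, distance=models.Distance.COSINE),
    # Higher m -> more edges per node -> better precision, but more memory and latency
    hnsw_config=models.HnswConfigDiff(m=16),
)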

2. The ef parameter: