Walker Walked → Walk, er, Walk, ed
Text → Tokens (Tokenizer) → Sequence of IDs (ID per Token) → Embed Each Token (Sequence of Token Embeddings)
Sequence of Token Embeddings → Add Order through Positional Encoding → Process through Stacked NN Modules.
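A minimal sketch of this pipeline, assuming a toy vocabulary and a random embedding table (every name and value below is made up for illustration):

```python
import numpy as np

# Toy vocabulary: one ID per token (invented for illustration)
toy_vocab = {"Walk": 0, "er": 1, "ed": 2}

tokens = ["Walk", "er", "Walk", "ed"]          # "Walker Walked" -> tokens
token_ids = [toy_vocab[t] for t in tokens]     # sequence of IDs: [0, 1, 0, 2]

# Embed each token by looking up a row in a (vocab_size x dim) table
embedding_dim = 4
embedding_table = np.random.rand(len(toy_vocab), embedding_dim)
token_embeddings = embedding_table[token_ids]  # sequence of token embeddings

# Add order with a simplified sinusoidal positional signal
# (not the exact Transformer formula, just the idea)
positions = np.arange(len(token_ids))[:, None]
dims = np.arange(embedding_dim)[None, :]
positional_encoding = np.sin(positions / 10_000 ** (dims / embedding_dim))

model_input = token_embeddings + positional_encoding  # fed to the stacked NN modules
```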
Tokens that continue a word are marked with the ## prefix.
Real Madrid should be 1 token, not 2.
Vocabulary size
<aside> ❗
The vocab_size is a parameter you specify when training your tokenizer model.
When training tokenizers for LLMs, its value is typically in the tens of thousands.
</aside>
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

training_data = ["walker walked a long walk"]

bpe_tokenizer = Tokenizer(BPE())
bpe_tokenizer.pre_tokenizer = Whitespace()
bpe_trainer = BpeTrainer(vocab_size=14)

# Training process
bpe_tokenizer.train_from_iterator(training_data, bpe_trainer)
<aside> ❗
tokenizers
is a library by Hugging Face that collects many tokenization techniques and makes them ready for training tokenizer models.
</aside>
Using the walker walked a long walk
as the training example.
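As a quick sanity check (a sketch: the exact merges learned can vary with vocab_size), you can encode the training sentence with the trained BPE tokenizer and inspect the output:

```python
encoding = bpe_tokenizer.encode("walker walked a long walk")
print(encoding.tokens)  # subword tokens produced by the learned BPE merges
print(encoding.ids)     # one ID per token
```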
Differentiate letters at the start of a word from letters in the middle of a word (the latter get the ## prefix).
Merge the pair of tokens that maximizes the score (a toy numeric example follows this list):
$\text{score}(u, v) = \frac{\text{frequency}(uv)}{\text{frequency}(u) \times \text{frequency}(v)}$
Iterate the process until we reach the vocabulary size
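Below is a toy illustration of this score with made-up frequencies (the counts are invented, not taken from the training sentence):

```python
def merge_score(freq_uv: int, freq_u: int, freq_v: int) -> float:
    # score(u, v) = frequency(uv) / (frequency(u) * frequency(v))
    return freq_uv / (freq_u * freq_v)

# Hypothetical counts: the pair "al" appears 3 times, "a" 5 times, "l" 4 times
print(merge_score(3, 5, 4))  # 0.15 -> compared against the other candidate pairs
```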
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer

unk_token = "[UNK]"  # For specifying unknown tokens/letters
wordpiece_model = WordPiece(unk_token=unk_token)
wordpiece_tokenizer = Tokenizer(wordpiece_model)
wordpiece_tokenizer.pre_tokenizer = Whitespace()
wordpiece_trainer = WordPieceTrainer(
    vocab_size=28,
    special_tokens=[unk_token],
)

# Training process
wordpiece_tokenizer.train_from_iterator(training_data, wordpiece_trainer)
Using the walker walked a long walk
as the training set
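Similarly, a sketch of inspecting the trained WordPiece tokenizer; the exact subwords depend on the training run, but continuation tokens will carry the ## prefix:

```python
encoding = wordpiece_tokenizer.encode("walker walked a long walk")
print(encoding.tokens)  # subwords that continue a word appear with the "##" prefix
```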
from tokenizers.trainers import UnigramTrainer
from tokenizers.models import Unigram
unigram_tokenizer = Tokenizer(Unigram())
unigram_tokenizer.pre_tokenizer = Whitespace()
unigram_trainer = UnigramTrainer(
vocab_size=14,
special_tokens=[unk_token],
unk_token=unk_token,
)
unigram_tokenizer.train_from_iterator(training_data, unigram_trainer)
unigram_tokenizer.get_vocab()
<aside> ❗
Unigram training is resource-intensive, since it has to consider many possible segmentations of each word.
</aside>
See the Notebook to follow the implementation.
To improve the retrieval quality, you should measure the quality first:
You can’t Improve what you don’t Measure
$\text{precision@k} = \frac{\lvert \text{relevant documents in the top } k \text{ results} \rvert}{k}$
$\text{recall@k} = \frac{\lvert \text{relevant documents in the top } k \text{ results} \rvert}{\lvert \text{all relevant documents} \rvert}$
Mean Reciprocal Rank (MRR) evaluates the effectiveness of a system by calculating the average of the reciprocal ranks of the first relevant result across all queries
$\text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}$
Discounted Cumulative Gain (DCG) assigns higher importance to documents appearing earlier in the list by incorporating a logarithmic discount factor to reflect diminishing relevance with position
$\text{DCG@k} = \sum_{i=1}^{k} \frac{\text{rel}_i}{\log_2(i + 1)}$
Normalized Discounted Cumulative Gain (nDCG) calculates the ratio of the DCG to the ideal DCG (IDCG), normalizing the score to evaluate the relevance of results against the perfect ranking
$\text{nDCG@k} = \frac{\text{DCG@k}}{\text{IDCG@k}}$
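A small from-scratch sketch of these formulas on a made-up ranking (the relevance labels and counts below are invented for illustration):

```python
import math

# Hypothetical binary relevance of the top-5 retrieved documents (1 = relevant)
retrieved = [1, 0, 1, 0, 0]
total_relevant = 3   # relevant documents that exist for this query
k = 5

precision_at_k = sum(retrieved[:k]) / k            # 2 / 5 = 0.4
recall_at_k = sum(retrieved[:k]) / total_relevant  # 2 / 3 ≈ 0.67

# Reciprocal rank of the first relevant result (MRR averages this over all queries)
reciprocal_rank = next((1 / (i + 1) for i, rel in enumerate(retrieved) if rel), 0.0)

# DCG@k with the log2 position discount, and nDCG@k against the ideal
# ordering of these same relevance labels
dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(retrieved[:k]))
idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(sorted(retrieved, reverse=True)[:k]))
ndcg = dcg / idcg

print(precision_at_k, recall_at_k, reciprocal_rank, round(ndcg, 3))
```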
All these metrics can be calculated with the ranx
Python package.
ranx
has 2 main components for comparing: Qrels (the ground-truth relevance judgments) and Run (the ranked results produced by a retriever).
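A minimal sketch of evaluating a single run with ranx (the query ID, document IDs, scores, and relevance labels below are invented):

```python
from ranx import Qrels, Run, evaluate

# Ground truth: which documents are relevant for each query
qrels = Qrels({"q_1": {"doc_a": 1, "doc_c": 1}})

# Retriever output: ranked documents with their scores
run = Run({"q_1": {"doc_a": 0.9, "doc_b": 0.7, "doc_c": 0.4}})

print(evaluate(qrels, run, ["precision@2", "recall@2", "mrr", "ndcg@2"]))
```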
HNSW (Hierarchical Navigable Small Worlds) is the most commonly used algorithm for approximate nearest neighbor search.
It is a multi-layer graph of vectors in which connections are created between the closest points.
m
parameter: defines how many edges each node should have.
Increasing its value → better search precision, BUT it impacts latency.
ef
parameter: