Brief Explanation of the Transformer Model’s Query, Key, and Value Concept
Very loosely speaking, the Query, Key, and Value (QKV) concept in the Transformer model is like trying to identify what a puzzle (for example, a horse, or a building) represents.
Let's say you only have some of the puzzle pieces in your hand. You pick up one piece, which is orange.
The other orange pieces in your hand probably carry more weight in helping you identify that the whole puzzle is an image of a horse.
The QKV mechanism assigns a weight to each word in a sentence or paragraph, describing how strongly that word relates to every other word.
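To make the analogy slightly more concrete, here is a minimal sketch of scaled dot-product attention in PyTorch. The four "words", the embedding size, and the random projection matrices are all made up for illustration; the point is only that each word's Query is compared with every word's Key, and the resulting weights decide how much of each word's Value flows into that word's new representation.

import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy setup: 4 "words", each represented by an 8-dimensional embedding (made-up numbers).
seq_len, d_model = 4, 8
x = torch.randn(seq_len, d_model)

# Q, K, V are just three different linear projections of the same embeddings.
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Each word's Query is compared against every word's Key ...
scores = Q @ K.T / d_model ** 0.5      # (4, 4) matrix of similarities
weights = F.softmax(scores, dim=-1)    # each row sums to 1: how much word i attends to word j

# ... and those weights mix the Values into a new representation for each word.
out = weights @ V                      # (4, 8)

print(weights)  # the "puzzle piece" weights from the analogy above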
Transformer Token Inputs vs Character Inputs
https://platform.openai.com/tokenizer
Word Tokenizer
For starters, the word tokenizer functions like a semantic surgeon, dissecting text into intuitively distinct words. It’s the go-to method for earlier NLP applications, but when we’re animating the vast neural networks of LLMs, there’s more to the story.
Pros:
- Semantic Intuition: Word tokenizers maintain the original word boundaries, preserving the full form and unadulterated meaning of the text.
- Simplicity: The straightforward nature of splitting text into words makes the data preprocessing pipeline less complex.
Cons:
- Vocabulary Size: These tokenizers produce vast vocabularies, often inflating the model size and memory footprint.
- Out-of-Vocabulary (OOV) Words: New or niche terms not present in the training data can leave the model dumbfounded, unable to process unseen words.
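As a rough illustration (the training sentence and the vocabulary below are invented), a word tokenizer can be little more than a split plus a lookup table, and the OOV problem appears the moment a word is missing from that table:

import re

# Build a tiny vocabulary from some training text (invented example).
train_text = "the horse jumps over the fence"
vocab = {word: idx for idx, word in enumerate(sorted(set(train_text.split())))}
UNK = len(vocab)  # id reserved for out-of-vocabulary words

def word_tokenize(text):
    # Split into lowercase words, dropping basic punctuation.
    words = re.findall(r"[a-zA-Z']+", text.lower())
    return [vocab.get(w, UNK) for w in words]

print(word_tokenize("the horse jumps"))    # every word has a known id
print(word_tokenize("the unicorn jumps"))  # "unicorn" falls back to the UNK id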
Sub-Word Tokenizer
In the middle ground lies the sub-word tokenizer, a hybrid maestro that parses text into frequently occurring substrings. Tools like Byte Pair Encoding (BPE) or WordPiece are star players here, elegantly balancing vocabulary size and semantic coherence.
Pros:
- Efficient Vocabulary: Sub-word tokenizers broker a peace deal between exorbitant vocabulary sizes and the need for comprehensiveness.
- OOV Handling: By breaking down words into common subunits, the model gains the dexterity to handle unfamiliar terms with grace.
Cons:
- Complex Preprocessing: The alchemy of creating sub-word units adds a layer of complexity to the data preparation process.
- Potential Ambiguity: Sub-word units can sometimes lead to ambiguous token sequences that can confuse the model without proper context.
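To get a feel for sub-word behaviour, the tokenizer behind the OpenAI page linked earlier is available as the tiktoken package. The snippet below, assuming tiktoken is installed, uses its GPT-2 BPE encoding to show rarer words being split into familiar sub-word pieces instead of being rejected as out-of-vocabulary:

import tiktoken

# GPT-2's byte-pair-encoding tokenizer.
enc = tiktoken.get_encoding("gpt2")

for word in ["horse", "tokenization", "antidisestablishmentarianism"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(word, "->", pieces)

# Common words map to very few tokens, while rarer words are broken
# down into several frequently occurring sub-word units.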
Character Tokenizer
Finally, the character tokenizer, the molecular level of text decomposition, slices everything down to individual characters. It’s like breaking down the text into its DNA—quite meticulous indeed.
Pros:
- Minimum Vocabulary: Boasting the smallest possible vocabulary, this method ensures a lean and efficient linguistic genome for our LLM.
- Universal Application: Rare words or even different languages are no sweat for character tokenizers, as they operate at the most granular level.
Cons:
- Longer Sequences: More tokens per piece of text can lead to longer processing times and require the model to learn from longer dependencies.
- Contextual Learning: The model must work harder to learn and understand the context since the semantic signal at the character level is weak.
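A character tokenizer needs almost no machinery at all. Sketched here with a made-up string, this is essentially the kind of scheme that fits the limited-character artificial language in the next section:

text = "32 124143123 4 22422 111"

# The whole vocabulary is just the set of characters that occur in the data.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> integer id
itos = {i: ch for ch, i in stoi.items()}       # integer id -> char

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return ''.join(itos[i] for i in ids)

print(len(chars))               # tiny vocabulary
print(encode("124 3"))
print(decode(encode("124 3")))  # round-trips back to the original string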
Artificial Language with Limited Characters
I created a mini language, trained a miniGPT on it (2 layers, 1 head, ~10,000 parameters/neurons), and wanted to see what each neuron is doing.
Example of the language:
32___124143123 4 22422 111 14424411143444442 432443144442 4444 223 2 31 11 33421 111444 433241341 34 22 2434 4443234 4 2422143114442434 24432
Rules
Given a set of rules, generate a simulated language based on those rules.
The simulated language has to have causality.
Meaning: given 2 chars at any position, you can determine which char must appear x positions away.
One note: more than one rule can apply at a given position, so one of them is selected at random.
The algorithm is as follows:
Build the string in reverse.
Start with an unknown '_' char.
Randomly select a rule whose 2-char key sits x positions to the right (in the reversed string) of the current '_' char, and fill the '_' with the char that rule dictates.
Reverse the string again at the end, to get a left-causal string (chars on the left determine the char x positions to their right).
Example rules:
rules = {
    # Key: 2 chars, value: (offset x, char at the xth position)
    '00': (5, 2),
    '01': (1, 0),
    '02': (4, 4),
    ...
}
Code
# %%
import random, time
from concurrent.futures import ThreadPoolExecutor
import torch
import numpy as np
rules = {
    '00': (5, 2),
    '01': (1, 0),
    '02': (4, 4),
    '03': (3, 1),
    '04': (2, 3),
    '10': (5, 4),
    '11': (3, 3),
    '12': (1, 4),
    '13': (4, 0),
    '14': (2, 3),
    '20': (3, 2),
    '21': (2, 0),
    '22': (1, 4),
    '23': (4, 1),
    '24': (5, 1),
    '30': (2, 3),
    '31': (1, 1),
    '32': (4, 0),
    '33': (3, 2),
    '34': (5, 4),
    '40': (2, 3),
    '41': (3, 2),
    '42': (4, 1),
    '43': (5, 0),
    '44': (1, 2),
}
def artificial_language(rules, max_len):
    '''
    Given a set of rules, generate a simulated language based on those rules.
    The simulated language has to have causality.
    Meaning: given 2 chars at some position, you can determine which char must appear x positions away.
    Note: more than one rule can apply at a given position, so one of them is selected at random.
    The algorithm is as follows:
        Build the string in reverse.
        Start with an unknown '_' char.
        Randomly select a rule whose 2-char key sits x positions to the right (in the reversed string)
        of the current '_' char, and fill the '_' with the char that rule dictates.
        Reverse the string again at the end, to get a left-causal string
        (chars on the left determine the char x positions to their right).
    Example rules:
        rules = {
            # Key: 2 chars, value: (offset x, char at the xth position)
            '00': (5, 2),
            '01': (1, 0),
            '02': (4, 4),
            ...
        }
    Example output:
        32___124143123 4 22422 111 14424411143444442 432443144442 4444 223 2 31 11 33421 111444 433241341 34 22 2434 4443234 4 2422143114442434 24432
    '''
    random.seed(42)
    np.random.seed(42)
    # current_time starts below -60 so the first progress line prints immediately
    start_time, current_time = time.time(), -61.
    s, current_underscore_pos = [], []
    # Group the rules by their offset: offset -> [(key, char), ...]
    bucket = {}
    for k, v in rules.items():
        bucket.setdefault(v[0], []).append((k, v[1]))
    # Pre-draw a random list of offsets (bucket keys), one to be used per '_'.
    # np.random.randint excludes the upper bound, hence the +1 to include the largest offset.
    random_choice_of_bucket = np.random.randint(min(bucket.keys()), max(bucket.keys()) + 1, size=max_len).tolist()
    current_choice_of_bucket_idx = 0
    # random_keys from a bucket, kept outside the loop to save time
    random_keys = bucket[min(bucket.keys())]
    while len(s) < max_len:
        if time.time() - current_time > 60:
            current_time = time.time()
            print(f'len(s): {len(s)}, time: {current_time - start_time}')
        if '_' not in s:
            s.append('_')
            current_underscore_pos.append(len(s) - 1)
            continue
        # Find the next position holding a '_'
        pos = current_underscore_pos.pop(0)
        # Try offsets from the pre-drawn list while this position is still '_'
        while s[pos] == '_':
            forward_pos_to_key = random_choice_of_bucket[current_choice_of_bucket_idx]
            current_choice_of_bucket_idx += 1
            current_choice_of_bucket_idx %= len(random_choice_of_bucket)
            # Position where the 2-char key should sit
            induced_pos = pos + forward_pos_to_key
            # Candidate (key, char) pairs for this offset
            random_keys = bucket[forward_pos_to_key]
            if pos % 10 == 0:
                random.shuffle(random_keys)
            # Case 1: the induced position already holds a complete 2-char key
            if len(s) > induced_pos + 1 and s[induced_pos] != '_' and s[induced_pos + 1] != '_':
                key = f'{s[induced_pos]}{s[induced_pos + 1]}'
                # Fill this position with the char the rule dictates
                s[pos] = rules[key][1]
            # Case 2: only the second char of the induced key has a value
            elif len(s) > induced_pos + 1 and s[induced_pos + 1] != '_':
                # Apply the first key whose second char matches the induced position
                for key, value in random_keys:
                    if key[1] == str(s[induced_pos + 1]):
                        # Update this position from '_' to the rule's char
                        s[pos] = value
                        # Also fill in the first char of the key
                        s[induced_pos] = int(key[0])
                        break
            # Case 3: only the first char of the induced key has a value
            elif len(s) >= induced_pos + 1 and s[induced_pos] != '_':
                # Apply the first key whose first char matches the induced position
                for key, value in random_keys:
                    if key[0] == str(s[induced_pos]):
                        # Update this position from '_' to the rule's char
                        s[pos] = value
                        # If the induced position is the last position of the string, append the 2nd key char
                        if len(s) == induced_pos + 1:
                            s.append(int(key[1]))
                        # Otherwise write the second key char in place
                        else:
                            s[induced_pos + 1] = int(key[1])
                        break
            # Case 4: the induced position is out of range
            elif len(s) <= induced_pos:
                # Pad the string with '_' up to the induced position, leaving room for the key
                while len(s) < induced_pos:
                    s.append('_')
                    current_underscore_pos.append(len(s) - 1)
                # Change this position from '_' to the rule's char
                s[pos] = random_keys[0][1]
                # Also append the 2-char key at the induced position
                s.append(int(random_keys[0][0][0]))
                s.append(int(random_keys[0][0][1]))
            # No rule could be applied to this position
            if s[pos] == '_':
                s[pos] = '?'
    # Replace all 0s with spaces so the output looks like words
    s = [i if i != 0 else ' ' for i in s]
    print(f'len(s): {len(s)}, time: {time.time() - start_time}')
    return ''.join([str(i) for i in s[::-1]])
def artificial_language_threaded(rules, n):
    # Generate roughly n characters using 16 workers, each producing n // 16 characters.
    with ThreadPoolExecutor(max_workers=16) as executor:
        futures = [executor.submit(artificial_language, rules, n // 16) for _ in range(16)]
        return "".join(f.result() for f in futures)
# %%
if __name__ == '__main__':
    s = artificial_language(rules, 1000000)
    torch.save(s, 'artificial_language_7.pt')
    # print(s)
# %%
https://github.com/jljacoblo/jacAI/blob/master/my/artificial_language.py
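To tie the tokenizer discussion back to this experiment, here is a minimal sketch of loading the saved string and turning it into integer tensors for a character-level model. The file name comes from the script above; the encode step and the 90/10 train/validation split are illustrative choices, not taken from the original repository.

import torch

# Load the generated string saved by the script above.
s = torch.load('artificial_language_7.pt')

# Character-level vocabulary: '1'-'4', space, and possibly '_' / '?' leftovers.
chars = sorted(set(s))
stoi = {ch: i for i, ch in enumerate(chars)}

# Encode the whole corpus as one long tensor of character ids.
data = torch.tensor([stoi[c] for c in s], dtype=torch.long)

# Illustrative 90/10 split into training and validation data.
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]

print(f'vocab size: {len(chars)}, corpus length: {len(data)}')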