Brief Explanation of the Transformer Model’s Query, Key, and Value Concept
Very loosely speaking, the Query, Key, and Value (QKV) concept in the Transformer model is like trying to identify what a puzzle (for example, a horse, or a building) represents.
Let's say you only have some of the puzzle pieces in your hand. You pick up one piece, which is orange.
The other orange pieces in your hand probably carry more weight in helping you identify that the whole puzzle is an image of a horse.
The QKV mechanism assigns a weight to each word in a sentence or paragraph, describing how strongly that word relates to every other word.
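To make the analogy slightly more concrete, here is a minimal sketch of scaled dot-product attention in PyTorch. The four "words", the embedding size, and the random projection matrices are all made up for illustration; the point is only that each word's Query is compared with every word's Key, and the resulting weights decide how much of each word's Value flows into that word's new representation.

import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy setup: 4 "words", each represented by an 8-dimensional embedding (made-up numbers).
seq_len, d_model = 4, 8
x = torch.randn(seq_len, d_model)

# Q, K, V are just three different linear projections of the same embeddings.
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Each word's Query is compared against every word's Key ...
scores = Q @ K.T / d_model ** 0.5      # (4, 4) matrix of similarities
weights = F.softmax(scores, dim=-1)    # each row sums to 1: how much word i attends to word j

# ... and those weights mix the Values into a new representation for each word.
out = weights @ V                      # (4, 8)

print(weights)  # the "puzzle piece" weights from the analogy above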
Transformer Token Inputs vs Character Inputs
https://platform.openai.com/tokenizer
Word Tokenizer
For starters, the word tokenizer functions like a semantic surgeon, dissecting text into intuitively distinct words. It’s the go-to method for earlier NLP applications, but when we’re animating the vast neural networks of LLMs, there’s more to the story.
Pros:
- Semantic Intuition: Word tokenizers maintain the original word boundaries, preserving the full form and unadulterated meaning of the text.
- Simplicity: The straightforward nature of splitting text into words makes the data preprocessing pipeline less complex.
Cons:
- Vocabulary Size: These tokenizers produce vast vocabularies, often inflating the model size and memory footprint.
- Out-of-Vocabulary (OOV) Words: New or niche terms not present in the training data can leave the model dumbfounded, unable to process unseen words.
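As a rough illustration (the training sentence and the vocabulary below are invented), a word tokenizer can be little more than a split plus a lookup table, and the OOV problem appears the moment a word is missing from that table:

import re

# Build a tiny vocabulary from some training text (invented example).
train_text = "the horse jumps over the fence"
vocab = {word: idx for idx, word in enumerate(sorted(set(train_text.split())))}
UNK = len(vocab)  # id reserved for out-of-vocabulary words

def word_tokenize(text):
    # Split into lowercase words, dropping basic punctuation.
    words = re.findall(r"[a-zA-Z']+", text.lower())
    return [vocab.get(w, UNK) for w in words]

print(word_tokenize("the horse jumps"))    # every word has a known id
print(word_tokenize("the unicorn jumps"))  # "unicorn" falls back to the UNK id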
Sub-Word Tokenizer
In the middle ground lies the sub-word tokenizer, a hybrid maestro that parses text into frequently occurring substrings. Tools like Byte Pair Encoding (BPE) or WordPiece are star players here, elegantly balancing vocabulary size and semantic coherence.
Pros:
- Efficient Vocabulary: Sub-word tokenizers broker a peace deal between exorbitant vocabulary sizes and the need for comprehensiveness.
- OOV Handling: By breaking down words into common subunits, the model gains the dexterity to handle unfamiliar terms with grace.
Cons:
- Complex Preprocessing: The alchemy of creating sub-word units adds a layer of complexity to the data preparation process.
- Potential Ambiguity: Sub-word units can sometimes lead to ambiguous token sequences that can confuse the model without proper context.
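To get a feel for sub-word behaviour, the tokenizer behind the OpenAI page linked earlier is available as the tiktoken package. The snippet below, assuming tiktoken is installed, uses its GPT-2 BPE encoding to show rarer words being split into familiar sub-word pieces instead of being rejected as out-of-vocabulary:

import tiktoken

# GPT-2's byte-pair-encoding tokenizer.
enc = tiktoken.get_encoding("gpt2")

for word in ["horse", "tokenization", "antidisestablishmentarianism"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(word, "->", pieces)

# Common words map to very few tokens, while rarer words are broken
# down into several frequently occurring sub-word units.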
Character Tokenizer
Finally, the character tokenizer, the molecular level of text decomposition, slices everything down to individual characters. It’s like breaking down the text into its DNA—quite meticulous indeed.
Pros:
- Minimum Vocabulary: Boasting the smallest possible vocabulary, this method ensures a lean and efficient linguistic genome for our LLM.
- Universal Application: Rare words or even different languages are no sweat for character tokenizers, as they operate at the most granular level.
Cons:
- Longer Sequences: More tokens per piece of text can lead to longer processing times and require the model to learn from longer dependencies.
- Contextual Learning: The model must work harder to learn and understand the context since the semantic signal at the character level is weak.
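A character tokenizer needs almost no machinery at all. Sketched here with a made-up string, this is essentially the kind of scheme that fits the limited-character artificial language in the next section:

text = "32 124143123 4 22422 111"

# The whole vocabulary is just the set of characters that occur in the data.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> integer id
itos = {i: ch for ch, i in stoi.items()}       # integer id -> char

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return ''.join(itos[i] for i in ids)

print(len(chars))               # tiny vocabulary
print(encode("124 3"))
print(decode(encode("124 3")))  # round-trips back to the original string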
Artificial Language with Limited Characters
I created a mini language, trained a miniGPT on it (2 layers, 1 head, ~10,000 parameters/neurons), and wanted to see what each neuron is doing.
Example of the language:
32___124143123 4 22422 111 14424411143444442 432443144442 4444 223 2 31 11 33421 111444 433241341 34 22 2434 4443234 4 2422143114442434 24432
Rules
Given a set of rules, generate a simulated language based on those rules.
The simulated language has to have causality.
Meaning: given 2 chars at any position, you can determine which char must appear x positions away.
One note: more than one rule can apply at a given position, so one of them is selected at random.
The algorithm is as follows:
Build the string in reverse.
Start with an unknown '_' char.
Randomly select a rule whose 2-char key sits x positions to the right (in the reversed string) of the current '_' char, and fill the '_' with the char that rule dictates.
Reverse the string again at the end, to get a left-causal string (chars on the left determine the char x positions to their right).
Example rules:
rules = {
    # Key: 2 chars, value: (offset x, char at the xth position)
    '00': (5, 2),
    '01': (1, 0),
    '02': (4, 4),
    ...
}
Code
# %%
import random, time
from concurrent.futures import ThreadPoolExecutor
import torch
import numpy as np
rules = {
    '00': (5, 2),
    '01': (1, 0),
    '02': (4, 4),
    '03': (3, 1),
    '04': (2, 3),
    '10': (5, 4),
    '11': (3, 3),
    '12': (1, 4),
    '13': (4, 0),
    '14': (2, 3),
    '20': (3, 2),
    '21': (2, 0),
    '22': (1, 4),
    '23': (4, 1),
    '24': (5, 1),
    '30': (2, 3),
    '31': (1, 1),
    '32': (4, 0),
    '33': (3, 2),
    '34': (5, 4),
    '40': (2, 3),
    '41': (3, 2),
    '42': (4, 1),
    '43': (5, 0),
    '44': (1, 2),
}
def artificial_language(rules, max_len):
    '''
    Given a set of rules, generate a simulated language based on those rules.
    The simulated language has to have causality.
    Meaning: given 2 chars at some position, you can determine which char must appear x positions away.
    Note: more than one rule can apply at a given position, so one of them is selected at random.
    The algorithm is as follows:
        Build the string in reverse.
        Start with an unknown '_' char.
        Randomly select a rule whose 2-char key sits x positions to the right (in the reversed string)
        of the current '_' char, and fill the '_' with the char that rule dictates.
        Reverse the string again at the end, to get a left-causal string
        (chars on the left determine the char x positions to their right).
    Example rules:
        rules = {
            # Key: 2 chars, value: (offset x, char at the xth position)
            '00': (5, 2),
            '01': (1, 0),
            '02': (4, 4),
            ...
        }
    Example output:
        32___124143123 4 22422 111 14424411143444442 432443144442 4444 223 2 31 11 33421 111444 433241341 34 22 2434 4443234 4 2422143114442434 24432
    '''
    random.seed(42)
    np.random.seed(42)
    # current_time starts below -60 so the first progress line prints immediately
    start_time, current_time = time.time(), -61.
    s, current_underscore_pos = [], []
    # Group the rules by their offset: offset -> [(key, char), ...]
    bucket = {}
    for k, v in rules.items():
        bucket.setdefault(v[0], []).append((k, v[1]))
    # Pre-draw a random list of offsets (bucket keys), one to be used per '_'.
    # np.random.randint excludes the upper bound, hence the +1 to include the largest offset.
    random_choice_of_bucket = np.random.randint(min(bucket.keys()), max(bucket.keys()) + 1, size=max_len).tolist()
    current_choice_of_bucket_idx = 0
    # random_keys from a bucket, kept outside the loop to save time
    random_keys = bucket[min(bucket.keys())]
    while len(s) < max_len:
        if time.time() - current_time > 60:
            current_time = time.time()
            print(f'len(s): {len(s)}, time: {current_time - start_time}')
        if '_' not in s:
            s.append('_')
            current_underscore_pos.append(len(s) - 1)
            continue
        # Find the next position holding a '_'
        pos = current_underscore_pos.pop(0)
        # Try offsets from the pre-drawn list while this position is still '_'
        while s[pos] == '_':
            forward_pos_to_key = random_choice_of_bucket[current_choice_of_bucket_idx]
            current_choice_of_bucket_idx += 1
            current_choice_of_bucket_idx %= len(random_choice_of_bucket)
            # Position where the 2-char key should sit
            induced_pos = pos + forward_pos_to_key
            # Candidate (key, char) pairs for this offset
            random_keys = bucket[forward_pos_to_key]
            if pos % 10 == 0:
                random.shuffle(random_keys)
            # Case 1: the induced position already holds a complete 2-char key
            if len(s) > induced_pos + 1 and s[induced_pos] != '_' and s[induced_pos + 1] != '_':
                key = f'{s[induced_pos]}{s[induced_pos + 1]}'
                # Fill this position with the char the rule dictates
                s[pos] = rules[key][1]
            # Case 2: only the second char of the induced key has a value
            elif len(s) > induced_pos + 1 and s[induced_pos + 1] != '_':
                # Apply the first key whose second char matches the induced position
                for key, value in random_keys:
                    if key[1] == str(s[induced_pos + 1]):
                        # Update this position from '_' to the rule's char
                        s[pos] = value
                        # Also fill in the first char of the key
                        s[induced_pos] = int(key[0])
                        break
            # Case 3: only the first char of the induced key has a value
            elif len(s) >= induced_pos + 1 and s[induced_pos] != '_':
                # Apply the first key whose first char matches the induced position
                for key, value in random_keys:
                    if key[0] == str(s[induced_pos]):
                        # Update this position from '_' to the rule's char
                        s[pos] = value
                        # If the induced position is the last position of the string, append the 2nd key char
                        if len(s) == induced_pos + 1:
                            s.append(int(key[1]))
                        # Otherwise write the second key char in place
                        else:
                            s[induced_pos + 1] = int(key[1])
                        break
            # Case 4: the induced position is out of range
            elif len(s) <= induced_pos:
                # Pad the string with '_' up to the induced position, leaving room for the key
                while len(s) < induced_pos:
                    s.append('_')
                    current_underscore_pos.append(len(s) - 1)
                # Change this position from '_' to the rule's char
                s[pos] = random_keys[0][1]
                # Also append the 2-char key at the induced position
                s.append(int(random_keys[0][0][0]))
                s.append(int(random_keys[0][0][1]))
            # No rule could be applied to this position
            if s[pos] == '_':
                s[pos] = '?'
    # Replace all 0s with spaces so the output looks like words
    s = [i if i != 0 else ' ' for i in s]
    print(f'len(s): {len(s)}, time: {time.time() - start_time}')
    return ''.join([str(i) for i in s[::-1]])
def artificial_language_threaded(rules, n):
    # Generate roughly n characters using 16 workers, each producing n // 16 characters.
    with ThreadPoolExecutor(max_workers=16) as executor:
        futures = [executor.submit(artificial_language, rules, n // 16) for _ in range(16)]
        return "".join(f.result() for f in futures)
# %%
if __name__ == '__main__':
    s = artificial_language(rules, 1000000)
    torch.save(s, 'artificial_language_7.pt')
    # print(s)
# %%
https://github.com/jljacoblo/jacAI/blob/master/my/artificial_language.py
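To tie the tokenizer discussion back to this experiment, here is a minimal sketch of loading the saved string and turning it into integer tensors for a character-level model. The file name comes from the script above; the encode step and the 90/10 train/validation split are illustrative choices, not taken from the original repository.

import torch

# Load the generated string saved by the script above.
s = torch.load('artificial_language_7.pt')

# Character-level vocabulary: '1'-'4', space, and possibly '_' / '?' leftovers.
chars = sorted(set(s))
stoi = {ch: i for i, ch in enumerate(chars)}

# Encode the whole corpus as one long tensor of character ids.
data = torch.tensor([stoi[c] for c in s], dtype=torch.long)

# Illustrative 90/10 split into training and validation data.
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]

print(f'vocab size: {len(chars)}, corpus length: {len(data)}')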