Long term memory for LLM chatbot app

The limits of LLMs

Every LLM has an input size limit: GPT-4, for example, accepts at most 8,192 tokens in any given prompt. That’s a decent chunk of text, sure, but what if the conversation needs context beyond that? The model simply starts forgetting what was said earlier.

Imagine chatting with someone who forgets everything after scrolling up a bit. Frustrating, isn’t it? That’s essentially what an LLM does without some memory mechanism.
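You can see the ceiling for yourself by counting tokens. A quick sketch using OpenAI’s tiktoken library (the chat log here is fabricated, just to make the point):

import tiktoken

enc = tiktoken.encoding_for_model('gpt-4')
history = 'User: Hi!\nBot: Hello, how can I help?\n' * 1000  # a very long chat log

n_tokens = len(enc.encode(history))
print(n_tokens)          # comfortably past the 8,192-token ceiling
print(n_tokens > 8192)   # True: older messages would have to be dropped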

What are Embeddings in an LLM?

Before we go into long-term memory, let’s talk embeddings. In the simplest terms, an embedding is a way to convert text (words, phrases, entire documents) into a vector of real numbers. Embeddings are like the DNA of language that models use to understand our blabbering.

These number sequences represent the syntactic and semantic essence of the text. Embeddings are what allow LLMs to process language and figure out that when we say “apple,” we might be talking about fruit or a brand, depending on the context.
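To make that concrete, here’s a minimal sketch of turning text into an embedding with the sentence-transformers library. The model name is just one possible choice; any sentence-embedding model works the same way (this one happens to output 768-dimensional vectors, which matches the index we create later):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-mpnet-base-v2')  # outputs 768-dim vectors

vec = model.encode('I bought an apple at the store.')
print(vec.shape)  # (768,)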

What is a Vector Store?

A vector store is a database for vectors: it holds the embeddings we just talked about, and it’s optimized to store and retrieve them super fast.

Why does speed matter here? Because when you’re chatting with a bot, every lookup happens between the user’s message and the bot’s reply; slow retrieval means a laggy conversation.
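At its core, a vector store answers one question: which stored vectors are closest to this query vector? Here’s a toy in-memory version (cosine similarity via normalized dot products), just to show the idea, not something you’d ship:

import numpy as np

store = {}  # id -> normalized embedding

def add(item_id, vec):
    store[item_id] = vec / np.linalg.norm(vec)

def search(query_vec, top_k=3):
    # Cosine similarity reduces to a dot product on normalized vectors
    q = query_vec / np.linalg.norm(query_vec)
    scores = {i: float(v @ q) for i, v in store.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]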

What is the Pinecone API?

Pinecone is a Vector Database as a Service. It’s basically a vector store on steroids: it lets developers store and work with vectors at scale, with the ease of an API call.

Think of it as bolting a large long-term memory onto the LLM. It can remember parts of conversations by embedding and storing them, then quickly recall relevant information when needed, sidestepping the token limit problem!

Setting up the Pinecone API

First, sign up for an account, get your API key, and install their Python client:

!pip install -q pinecone-client

Once you’ve installed the client, import it and set your API key:

import pinecone

pinecone.init(api_key='your-secret-api-key', environment='your-environment')  # both values are shown in the Pinecone console

Then, create a vector index:

index_name = 'chatbot-memory'
pinecone.create_index(index_name, dimension=768)  # assuming 768-dim BERT-style embeddings

And just like that, you’ve got yourself a vector store, ready to remember all the little tidbits of your users’ chats.
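Before you can recall anything, you have to write memories in. With the classic pinecone-client, storing an embedding is a single upsert call. A minimal sketch, reusing the encoder from the embeddings section and a made-up ID scheme:

index = pinecone.Index('chatbot-memory')

embedding = model.encode('User: I prefer oat milk in my latte.')  # encoder from earlier
index.upsert([('user123-session456', embedding.tolist())])  # a list of (id, vector) pairs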

Retrieving the Vectors

You’ve stashed your embeddings nicely. But what’s the fun if you can’t get them back out and have your chatbot remember past conversations? Here’s how you summon those memories from the depths of vector space:

index = pinecone.Index('chatbot-memory')  # handle to the index created above
result = index.fetch(ids=['user123-session456'])
retrieved_embedding = result['vectors']['user123-session456']['values']
# Use 'retrieved_embedding' to continue the convo

Your chatbot can now pick up right where it left off.
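Fetching by ID is exact recall: it works when you know precisely which memory you want. More often, you want the memories most relevant to whatever the user just typed, and that’s a similarity query. A sketch, again assuming the classic client and the encoder from earlier:

query_vec = model.encode('What did we decide about the apple order?')

results = index.query(vector=query_vec.tolist(), top_k=3)
for match in results['matches']:
    print(match['id'], match['score'])  # closest stored memories first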