
Briefing: Similarity Scoring Methods for Contextual Embeddings

The main objective of this post is to build a system that:

  1. Utilises sentence embeddings to build a knowledge graph that replicates human temporal memory in an application.

  2. Scopes out strategies for integrating sentence embeddings into large language models' session caches.

Prior Knowledge

  • Sentence Embeddings:

    • Represent entire sentences as dense, fixed-size vectors in a high-dimensional space. The dimensionality depends on the hidden size of the transformer architecture's layers.

    • Converting token embeddings into a sentence embedding involves applying a pooling strategy, commonly mean pooling, to the token vector embeddings (a minimal mean-pooling sketch follows this list).

    • Common models used to produce sentence embeddings are BERT, RoBERTa, or dedicated models like SBERT. They differ from text-generative models in that no generation head (decoder) is attached; only the encoder is loaded and used.

  • Similarity Scoring: measures how 'close' or 'similar' two sentence embeddings are in vector space; higher scores imply stronger semantic relatedness. This post explores several methods, including cosine similarity, dot product, and Euclidean distance (a comparison sketch also follows this list).
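To make the mean-pooling bullet concrete, here is an illustrative sketch. It uses the transformers and torch packages (installed alongside sentence-transformers) to pull raw token embeddings from all-distilroberta-v1 and then averages them with an attention-mask-weighted mean; the example sentences are placeholders.

```python
# Illustrative sketch: mean pooling token embeddings into a sentence embedding.
# Assumes transformers and torch are available (they ship with sentence-transformers).
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "sentence-transformers/all-distilroberta-v1"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

sentences = ["The cat sat on the mat.", "A feline rested on the rug."]
encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    # Token embeddings from the encoder: shape (batch, seq_len, hidden_dim)
    token_embeddings = model(**encoded).last_hidden_state

# Mean pooling: average the token vectors, ignoring padding via the attention mask.
mask = encoded["attention_mask"].unsqueeze(-1).float()  # (batch, seq_len, 1)
sentence_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)

print(sentence_embeddings.shape)  # (2, 768) for all-distilroberta-v1
```

In practice sentence-transformers performs this pooling internally, so the manual version above is only for showing what happens between token vectors and the final sentence vector.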
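And here is a minimal comparison sketch for the scoring methods themselves, using only the listed dependencies. It encodes two placeholder sentences with all-distilroberta-v1 and computes cosine similarity, dot product, Euclidean distance, and Manhattan distance with numpy; note that for the two distance metrics a smaller value means the embeddings are closer.

```python
# Minimal sketch: comparing similarity scoring methods on sentence embeddings.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-distilroberta-v1")
a, b = model.encode(["The cat sat on the mat.", "A feline rested on the rug."])

cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # in [-1, 1]
dot_product = float(np.dot(a, b))         # magnitude-sensitive similarity
euclidean = float(np.linalg.norm(a - b))  # L2 distance: lower means more similar
manhattan = float(np.sum(np.abs(a - b)))  # L1 distance: also lower-is-closer

print(f"cosine={cosine:.3f} dot={dot_product:.3f} "
      f"euclidean={euclidean:.3f} manhattan={manhattan:.3f}")
```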

Dependencies

sentence-transformers # Used to load pre-trained transformer models that tokenize and encode the text sequences. In this post, all-distilroberta-v1 was used.
numpy

Notebook Overview

  • A brief comparison of various similarity scoring methods for vector embeddings, including Cosine Similarity, Manhattan Distance, and more.

  • Initialising a base class that generates similarity scores in real time during streaming conversational chats, for the case where a vector database or self-organizing map serves as a generative model's temporal memory (a sketch of such a base class follows below).
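The notebook itself is not reproduced here, so the following is only a rough sketch of what such a base class might look like. The class and method names (SimilarityScorer, add_turn, score_stream) are hypothetical, and a plain Python list stands in for the vector database or self-organizing map.

```python
# Hypothetical sketch of a real-time scoring base class; names and structure are
# assumptions, and a plain list stands in for the vector DB / self-organizing map.
from abc import ABC, abstractmethod

import numpy as np
from sentence_transformers import SentenceTransformer


class SimilarityScorer(ABC):
    """Scores incoming chat turns against embeddings held in temporal memory."""

    def __init__(self, model_name: str = "all-distilroberta-v1"):
        self.model = SentenceTransformer(model_name)
        self.memory: list[np.ndarray] = []  # stand-in for a vector database

    @abstractmethod
    def score(self, query: np.ndarray, stored: np.ndarray) -> float:
        """Return a similarity score between two embeddings."""

    def add_turn(self, text: str) -> None:
        """Embed a completed chat turn and append it to memory."""
        self.memory.append(self.model.encode(text))

    def score_stream(self, incoming_text: str) -> list[float]:
        """Score an incoming message against every embedding in memory."""
        query = self.model.encode(incoming_text)
        return [self.score(query, stored) for stored in self.memory]


class CosineScorer(SimilarityScorer):
    def score(self, query: np.ndarray, stored: np.ndarray) -> float:
        return float(np.dot(query, stored) /
                     (np.linalg.norm(query) * np.linalg.norm(stored)))
```

Subclasses override score with any of the metrics compared above, so the streaming loop stays the same while the scoring method changes.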
