# Introduction to Vector Similarity Search

*by*Phil Tadros

Within the earlier tutorials, we took a have a look at unstructured data, vector databases, and Milvus – the world’s hottest open-source vector database. We additionally briefly touched upon the thought of *embeddings*, high-dimensional vectors which function superior semantic representations of unstructured knowledge. One key observe to recollect – embeddings that are “shut” to 1 one other symbolize semantically comparable items of information.

On this introduction to vector search, we’ll outline what vector search is, and reply some basic questions on it. Then, we’ll construct on that information by going over a phrase embedding instance and seeing how semantically comparable items of unstructured knowledge are “close to” each other whereas dissimilar items of unstructured knowledge are “far” from each other. It will lead right into a high-level overview of *nearest neighbor search*, a computing drawback that entails discovering the closest vector(s) to a question vector primarily based on a unified *distance metric*. We’ll go over some well-known strategies for nearest neighbor search (together with my favourite – ANNOY) along with generally used *distance metrics*.

Let’s dive in.

Vector search, also called vector similarity search or nearest neighbor search, is a way utilized in knowledge retrieval and knowledge retrieval techniques to seek out gadgets or knowledge factors which can be comparable or carefully associated to a given question vector. In vector search, we symbolize knowledge factors, similar to photographs, texts, and audio, as vectors in a high-dimensional house. The aim of vector search is to effectively search and retrieve essentially the most related vectors which can be comparable or nearest to a question vector.

Sometimes, distance metrics similar to Euclidean distance or cosine similarity measure the similarity between vectors. The vector’s proximity within the vector house determines how comparable it’s. To effectively set up and search vectors, vector search algorithms use indexing buildings similar to tree-based buildings or hashing methods.

Vector search has varied functions, together with suggestion techniques, picture and video retrieval, pure language processing, anomaly detection, and query and reply chatbots. Utilizing vector search makes it doable to seek out related gadgets, patterns, or relationships inside high-dimensional knowledge, enabling extra correct and environment friendly data retrieval.

Vector search is a robust technique for analyzing and retrieving data from high-dimensional areas. It permits customers to seek out comparable or carefully associated gadgets to a given question, making it essential in varied domains. Listed below are the advantages of vector search:

**Similarity-based Retrieval**— Vector search permits for similarity-based Retrieval, enabling customers to seek out comparable or carefully associated gadgets to a given question. Similarity-based Retrieval is essential in varied domains, similar to suggestion techniques, the place customers anticipate personalised suggestions primarily based on their preferences or similarities to different customers.**Excessive-dimensional Knowledge Evaluation**— With the rising availability of high-dimensional knowledge, similar to photographs, audio, and textual knowledge, conventional search strategies grow to be much less efficient. Vector search gives a robust solution to analyze and retrieve data from high-dimensional areas, permitting for extra correct and environment friendly knowledge exploration.**Nearest-Neighbor Search**— Environment friendly nearest-neighbor search algorithms discover the closest neighbors to a given question vector. Nearest-neighbor search is helpful for important duties similar to a picture or doc similarity search, content-based Retrieval, or anomaly detection that require discovering the closest matches or comparable gadgets.**Improved Person Expertise**— By leveraging vector search, functions can present customers with extra related and personalised outcomes. Whether or not delivering related suggestions, retrieving visually comparable photographs, or discovering paperwork with comparable content material, vector search enhances the general consumer expertise by offering extra focused and significant outcomes.**Scalability**— Vector search algorithms and indexing buildings deal with large-scale datasets and high-dimensional areas effectively. They permit quick search and retrieval operations, making it possible to carry out similarity-based queries in real-time, even on huge datasets.

- Picture, video, audio similarity search
- AI drug discovery
- Semantic search engine
- DNA sequence classification
- Query answering system
- Recommender system
- Anomaly detection

Now that now we have coated the fundamentals of vector search, let’s look into the extra technical particulars by taking a look at a phrase embedding instance and end with a high-level overview of nearest neighbor search.

Let’s undergo a few phrase embedding examples. For the sake of simplicity, we’ll use `word2vec`

, an previous mannequin which makes use of a coaching methodology primarily based on *skipgrams*. BERT and different trendy transformer-based fashions will have the ability to offer you extra contextualized phrase embeddings, however we’ll keep on with `word2vec`

for simplicity. Jay Alammar gives a great tutorial on `word2vec`

, in case you’re enthusiastic about studying a bit extra.

Earlier than starting, we’ll want to put in the `gensim`

library and cargo a `word2vec`

mannequin.

```
% pip set up gensim --disable-pip-version-check
% wget https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
% gunzip GoogleNews-vectors-negative300.bin
```

```
Requirement already glad: gensim in /Customers/fzliu/.pyenv/lib/python3.8/site-packages (4.1.2)
Requirement already glad: smart-open>=1.8.1 in /Customers/fzliu/.pyenv/lib/python3.8/site-packages (from gensim) (5.2.1)
Requirement already glad: numpy>=1.17.0 in /Customers/fzliu/.pyenv/lib/python3.8/site-packages (from gensim) (1.19.5)
Requirement already glad: scipy>=0.18.1 in /Customers/fzliu/.pyenv/lib/python3.8/site-packages (from gensim) (1.7.3)
--2022-02-22 00:30:34-- https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.20.165
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.20.165|:443... related.
HTTP request despatched, awaiting response... 200 OK
Size: 1647046227 (1.5G) [application/x-gzip]
Saving to: GoogleNews-vectors-negative300.bin.gz
GoogleNews-vectors- 100%[===================>] 1.53G 2.66MB/s in 11m 23s
2022-02-22 00:41:57 (2.30 MB/s) - GoogleNews-vectors-negative300.bin.gz saved [1647046227/1647046227]
gunzip: GoogleNews-vectors-negative300.bin: unknown suffix -- ignored
```

Now that we have achieved all of the prep work required to generate word-to-vector embeddings, let’s load the skilled `word2vec`

mannequin.

```
>>> from gensim.fashions import KeyedVectors
>>> mannequin = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
```

### Instance 0: Marlon Brando

Let’s check out how `word2vec`

interprets the well-known actor Marlon Brando.

```
>>> print(mannequin.most_similar(constructive=['Marlon_Brando']))
```

```
[('Brando', 0.757453978061676), ('Humphrey_Bogart', 0.6143958568572998), ('actor_Marlon_Brando', 0.6016287207603455), ('Al_Pacino', 0.5675410032272339), ('Elia_Kazan', 0.5594002604484558), ('Steve_McQueen', 0.5539456605911255), ('Marilyn_Monroe', 0.5512186884880066), ('Jack_Nicholson', 0.5440199375152588), ('Shelley_Winters', 0.5432392954826355), ('Apocalypse_Now', 0.5306933522224426)]
```

Marlon Brando labored with Al Pacino in The Godfather and Elia Kazan in A Streetcar Named Want. He additionally starred in Apocalypse Now.

Vectors will be added and subtracted from one another to demo underlying semantic adjustments.

```
>>> print(mannequin.most_similar(constructive=['king', 'woman'], damaging=['man'], topn=1))
```

```
[('queen', 0.7118193507194519)]
```

Who says engineers cannot take pleasure in a little bit of dance-pop from time to time?

The phrase “apple” can consult with each the corporate in addition to the scrumptious purple fruit. On this instance, we are able to see that Word2Vec retains each meanings.

```
>>> print(mannequin.most_similar(constructive=['samsung', 'iphone'], damaging=['apple'], topn=1))
>>> print(mannequin.most_similar(constructive=['fruit'], topn=10)[9:])
```

```
[('droid_x', 0.6324754953384399)]
[('apple', 0.6410146951675415)]
```

“Droid” refers to Samsung’s first 4G LTE smartphone (“Samsung” + “iPhone” – “Apple” = “Droid”), whereas “apple” is the tenth closest phrase to “fruit”.

Now that we have seen the ability of embeddings, let’s briefly check out a number of the methods we are able to conduct nearest neighbor search. This isn’t a complete record; we’ll simply briefly go over some widespread strategies with a purpose to present a high-level overview of how vector search is carried out at scale. Word that a few of these strategies are usually not unique to one another – it is doable, for instance, to make use of quantization along side house partitioning.

(We’ll even be going over every of those strategies intimately in future tutorials, so keep tuned for extra.)

The only however most naïve nearest neighbor search algorithm is sweet previous linear search: computing the space from a question vector to all different vectors within the vector database.

For apparent causes, naïve search doesn’t work when attempting to scale our vector database to tens or a whole bunch of hundreds of thousands of vectors. However when the overall variety of components within the database is small, this could really be essentially the most environment friendly solution to carry out vector search since a separate knowledge construction for the index will not be required, whereas inserts and deletes will be applied pretty simply.

Because of the lack of house complexity in addition to fixed house overhead related to naïve search, this technique can typically outperform house partitioning even when querying throughout a reasonable variety of vectors.

Area partitioning will not be a single algorithm, however moderately a household of algorithms that each one use the identical idea.

Ok-dimensional bushes (kd-trees) are maybe essentially the most well-known on this household, and work by repeatedly bisecting the search house (splitting the vectors into “left” and “proper” buckets) in a way much like binary search bushes.

Inverted file index (IVF) can be a type of house partitioning, and works by assigning every vector to its nearest centroid – searches are then carried out by first figuring out the question vector’s closest centroid and conducting the search round there, considerably lowering the overall variety of vectors that must be searched. IVF is a reasonably common indexing technique and is usually mixed with different indexing algorithms to enhance efficiency.

Quantization is a way for lowering the overall measurement of the database by lowering the precision of the vectors.

Scalar quantization (SQ), for instance, works by multiplying high-precision floating level vectors with a scalar worth, then casting the weather of the resultant vector to their nearest integers. This not solely reduces the efficient measurement of your complete database (e.g. by an element of eight for conversion from `float64_t`

to `int8_t`

), but additionally has the constructive side-effect of dashing up vector-to-vector distance computations.

Product quantization (PQ) is one other quantization method that works much like dictionary compression. In PQ, all vectors are cut up into equally-sized subvectors, and every subvector is then changed with a centroid.

Hierarchical Navigable Small Worlds is a graph-based indexing and retrieval algorithm.

This works in a different way from product quantization: as a substitute of enhancing the searchability of the database by lowering its efficient measurement, HNSW creates a multi-layer graph from the unique knowledge. Higher layers include solely “lengthy connections” whereas decrease layers include solely “brief connections” between vectors within the database (see the following part for an outline of vector distance metrics). Particular person graph connections are created a-la skip lists.

With this structure in place, looking turns into pretty easy – we greedily traverse the uppermost graph (the one with the longest inter-vector connections) for the vector closest to our question vector. We then do the identical for the second layer, utilizing the results of the primary layer search as the start line. This continues till we full the search on the bottommost layer, the results of which turns into the closest neighbor of the question vector.

HNSW, visualized. Picture supply: https://arxiv.org/abs/1603.09320

That is in all probability my favourite ANN algorithm merely resulting from its playful and unintunitive identify. Approximate Nearest Neighbors Oh Yeah (ANNOY) is a tree-based algorithm popularized by Spotify (it’s used of their music suggestion system). Regardless of the unusual identify, the underlying idea behind ANNOY is definitely pretty easy – binary bushes.

ANNOY works by first randomly deciding on two vectors within the database and bisecting the search house alongside the hyperplane separating these two vectors. That is achieved iteratively till there are fewer than some predefined parameter `NUM_MAX_ELEMS`

per node. Because the ensuing index is basically a binary tree, this permits us to do our search on O(log n) complexity.

ANNOY, visualized. Picture supply: https://github.com/spotify/annoy

The very best vector databases are ineffective with out similarity metrics – strategies for computing the space between two vectors. Quite a few metrics exist, so we’ll focus on solely essentially the most generally used subset right here.

The commonest floating level vector similarity metrics are, in no specific order, *L1 distance*, *L2 distance*, and *cosine similarity*. The primary two values are *distance metrics* (decrease values indicate extra similarity whereas greater values indicate decrease similarity), whereas cosine similarity is a *similarity metric* (greater values indicate extra simlarity).

- $d_{l1}(mathbf{a},mathbf{b})=sum_{i=1}^{N}|mathbf{a}_i-mathbf{b}_i|$
- $d_{l2}(mathbf{a},mathbf{b})=sqrt{sum_{i=1}^{N}(mathbf{a}_i-mathbf{b}_i)^2}$
- $d_{cos}(mathbf{a},mathbf{b})=frac{mathbf{a}cdotmathbf{b}}{|mathbf{a}||mathbf{b}|}$

L1 distance can be generally known as Manhattan distance, aptly named after the truth that getting from level A to level B in Manhattan requires transferring alongside one among two perpendicular instructions. The second equation, L2 distance, is just the space between two vectors in Euclidean house. The third and closing equation is cosine distance, equal to the cosine of the angle between two vectors. Word the equation for cosine similarity works out to be the dot product between normalized variations of enter vectors **a** and **b**.

With a little bit of math, we are able to additionally present that L2 distance and cosine similarity are successfully equal in the case of similarity rating for unit norm vectors:

$d_{l2}(mathbf{a},mathbf{b})=(mathbf{a}-mathbf{b})^T(mathbf{a}-mathbf{b})$

Recall that unit norm vectors have a magnitude of 1:

$mathbf{a}^Tmathbf{a}=1$

With this, we get:

$mathbf{a}^Tmathbf{a}-2mathbf{a}^Tmathbf{b}+mathbf{b}^Tmathbf{b}$

Since now we have unit norm vectors, cosine distance works out to be the dot product between **a** and **b** (the denominator in equation 3 above works out to be 1):

$2-2mathbf{a}^Tmathbf{b}$

Basically, for unit norm vectors, L2 distance and cosine similarity are functionally equal! At all times bear in mind to normalize your embeddings.

Binary vectors, as their identify counsel, don’t have metrics primarily based in arithmetics a-la floating level vectors. Similarity metrics for binary vectors as a substitute depend on both set arithmetic, bit manipulation, or a mix of each (it is okay, I additionally hate discrete math). Listed below are the formulation for 2 generally used binary vector similarity metrics:

- $d_J(mathbf{a},mathbf{b})=1-frac{mathbf{a}cdotmathbf{b}}{|a|^2+|b|^2-mathbf{a}cdotmathbf{b}}$
- $d_J(mathbf{a},mathbf{b})=sum_{i=1}^{N}mathbf{a}_ioplusmathbf{b}_i$

The primary equation is named Tanimoto/Jaccard distance, and is basically a measure of the quantity of overlap between two binary vectors. The second equation is Hamming distance, and is a depend of the variety of vector components in a and b which differ from one another.

You can probably safely ignore these similarity metrics, for the reason that majority of functions use cosine similarity over floating level embeddings.

On this tutorial, we took a have a look at vector search / vector similarity search, together with some widespread vector search algorithms and distance metrics. Listed below are some key takeaways:

- Embedding vectors are highly effective representations, each by way of distance between the vectors and by way of vector arithmetic. By making use of a liberal amount of vector algebra to embeddings, we are able to carry out scalable semantic evaluation utilizing simply primary mathematical operators.
- There are all kinds of approximate nearest neighbor search algorithms and/or index varieties to select from. Probably the most generally one used at the moment is HNSW, however a special indexing algorithm may fit higher to your specific software, relying on the overall variety of embeddings you might have along with the size of every particular person vector.
- The 2 major distance metrics used at the moment are L2/Euclidean distance and cosine distance. These two metrics, when used on normalized embeddings, are functionally equal.

Thanks for becoming a member of us for this tutorial! Vector search is a core a part of Milvus, and it’ll proceed to be. In future tutorials, we’ll be performing some deeper dives into essentially the most generally used ANNS algorithms – HNSW and ScaNN.