# Approximate Nearest Neighbor Oh Yeah (Annoy)

*by* Phil Tadros

Welcome back to Vector Database 101.

In the previous tutorial, we took a deep dive into Hierarchical Navigable Small Worlds (HNSW). HNSW is a graph-based indexing algorithm that is today one of the most popular indexing techniques used in vector databases.

In this tutorial, we’ll switch gears and talk about tree-based vector indexes. Specifically, we’ll talk about **Approximate Nearest Neighbor Oh Yeah (Annoy)**, an algorithm that uses a forest of trees to conduct nearest neighbor search. For those familiar with random forests or gradient-boosted decision trees, Annoy can seem like a natural extension of these algorithms, only for nearest neighbor search rather than machine learning. As with our HNSW tutorial, we’ll first walk through how Annoy works from a high level before developing our own simple Python implementation.

While HNSW is built upon the connected graph and skip list, Annoy uses binary search trees as the core data structure. The key idea behind Annoy (and other tree-based indexes) is to repeatedly partition our vector space and search only a subset of the partitions for nearest neighbors. If this sounds like IVF, you’re right; the idea is the same, but the execution is slightly different.

*Annoy, visualized (from https://github.com/spotify/annoy).*

The best way to understand Annoy is to visualize how a single tree is built. However, keep in mind that high-dimensional hyperspaces are very different from 2D/3D Euclidean spaces from an intuitive perspective, so the images below should be used only as a reference.

Let’s start with indexing. For Annoy, this is a recursive process where the maximum size of the call stack is the depth of the tree. In the first iteration, two random dataset vectors, **a** and **b**, are selected, and the full hyperspace is split along a hyperplane equidistant from both **a** and **b**. Then, vectors in the “left” half of the hyperspace get assigned to the left half of the tree, while vectors in the “right” half of the subspace are assigned to the right half of the tree. Note that this can be done without actually computing the hyperplane itself – for each dataset vector, we just need to determine whether **a** (left) or **b** (right) is closer.
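As a quick illustrative sketch (the function name here is our own, not from the Annoy library), the left/right assignment for a single split boils down to a distance comparison against the two reference vectors:

```python
import numpy as np

def assign_halves(vectors, a, b):
    """Assign each vector to the half whose reference point is closer."""
    left, right = [], []
    for v in vectors:
        if np.linalg.norm(v - a) < np.linalg.norm(v - b):
            left.append(v)
        else:
            right.append(v)
    return left, right

# toy 2D example: two points near `a`, one point near `b`
pts = [np.array([0.0, 0.0]), np.array([1.0, 1.0]), np.array([0.2, 0.1])]
a, b = np.array([0.0, 0.1]), np.array([1.0, 0.9])
left, right = assign_halves(pts, a, b)  # left has 2 points, right has 1
```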

*After the first, second, and Nth iterations, respectively. Source.*

The second iteration repeats this process for both the left and right subtrees generated by the first iteration, resulting in a tree with a depth of two and four leaf nodes. This process continues for the third, fourth, and subsequent iterations until a leaf node has fewer than a pre-defined number of elements `K`. In the original Annoy implementation, `K` is a variable value that the user can set.

With the index fully built, we can now move on to querying. Given some query vector **q**, we can search by traversing the tree. A hyperplane splits each intermediate node, and we can determine which side of the hyperplane the query vector falls on by computing its distance to the left and right reference vectors. We repeat this process until we reach a leaf node containing an array of, at most, `K` vectors. We can then rank and return these vectors to the user.

Now that we know how Annoy works, let’s start with the implementation. As usual, we’ll first create a dataset of (128-dimensional) vectors:

```
>>> import numpy as np
>>> dataset = np.random.normal(size=(1000, 128))
```

Let’s first define a `Node` class containing left and right subtrees:

```
from typing import List, Optional

class Node(object):
    """Initialize with a set of vectors, then call `split()`."""

    def __init__(self, ref: np.ndarray, vecs: List[np.ndarray]):
        self._ref = ref
        self._vecs = vecs
        self._left = None
        self._right = None

    @property
    def ref(self) -> Optional[np.ndarray]:
        """Reference point in n-d hyperspace. Evaluates to `False` if root node."""
        return self._ref

    @property
    def vecs(self) -> List[np.ndarray]:
        """Vectors for this leaf node. Evaluates to `False` if not a leaf."""
        return self._vecs

    @property
    def left(self) -> Optional[object]:
        """Left node."""
        return self._left

    @property
    def right(self) -> Optional[object]:
        """Right node."""
        return self._right
```

The `vecs` variable contains a list of all the vectors within the node. If the length of this list is less than some value `K`, then they will remain as-is; otherwise, these vectors will get propagated to `left` and `right`, with `vecs[0]` and `vecs[1]` remaining as the two randomly selected vectors used to split the hyperplane.

Let’s now move on to indexing. First, recall that every node in the tree is split by a hyperplane orthogonal to the line connecting two randomly selected dataset vectors. Conveniently, we can determine which side of the hyperplane a query vector lies on simply by computing distances. As usual, we’ll use NumPy’s vectorized math for this:

```
def _is_query_in_left_half(q, node):
    # returns `True` if the query vector resides in the left half
    dist_l = np.linalg.norm(q - node.vecs[0])
    dist_r = np.linalg.norm(q - node.vecs[1])
    return dist_l < dist_r
```
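As a sanity check (a sketch of our own, not part of Annoy itself), the distance comparison above is equivalent to an explicit hyperplane test: the splitting hyperplane passes through the midpoint of the two reference vectors and is orthogonal to the line connecting them, so the sign of a single dot product tells us which side the query falls on:

```python
import numpy as np

def closer_to_a(q, a, b):
    """The distance comparison used by `_is_query_in_left_half`."""
    return np.linalg.norm(q - a) < np.linalg.norm(q - b)

def left_of_hyperplane(q, a, b):
    """Explicit hyperplane test: project (q - midpoint) onto the
    direction (a - b); a positive value means q is on a's side."""
    return np.dot(q - (a + b) / 2, a - b) > 0

rng = np.random.default_rng(0)
a, b = rng.normal(size=128), rng.normal(size=128)
q = rng.normal(size=128)
# the two tests agree (ties exactly on the hyperplane aside)
assert closer_to_a(q, a, b) == left_of_hyperplane(q, a, b)
```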

Now let’s move on to building the actual tree.

```
def split_node(node, K: int, imb: float) -> bool:

    # stopping condition: maximum of `K` vectors in a leaf node
    if len(node._vecs) <= K:
        return False

    # continue for a maximum of 5 attempts
    for n in range(5):
        left_vecs = []
        right_vecs = []

        # take two random vectors as the left and right reference points
        left_ref = node._vecs.pop(np.random.randint(len(node._vecs)))
        right_ref = node._vecs.pop(np.random.randint(len(node._vecs)))

        # split the remaining vectors into halves
        for vec in node._vecs:
            dist_l = np.linalg.norm(vec - left_ref)
            dist_r = np.linalg.norm(vec - right_ref)
            if dist_l < dist_r:
                left_vecs.append(vec)
            else:
                right_vecs.append(vec)

        # check to make sure that the tree is mostly balanced
        r = len(left_vecs) / len(node._vecs)
        if r < imb and r > (1 - imb):
            node._left = Node(left_ref, left_vecs)
            node._right = Node(right_ref, right_vecs)
            return True

        # redo the split if the imbalance is too high
        node._vecs.append(left_ref)
        node._vecs.append(right_ref)

    return False


def _build_tree(node, K: int, imb: float):
    """Recurses on left and right halves to build a tree."""
    split_node(node, K=K, imb=imb)
    if node.left:
        _build_tree(node.left, K=K, imb=imb)
    if node.right:
        _build_tree(node.right, K=K, imb=imb)
```
```

This is a denser code block, so let’s walk through it step by step. First, given an already-initialized `Node`, we randomly select two vectors and split the dataset into left and right halves. We then use the function we defined earlier to determine which of the two halves each subvector belongs to. Note that we’ve added an `imb` parameter to maintain tree balance – if one side of the tree contains more than 95% of all the subvectors, we redo the split process.
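The balance check itself is a one-liner. Here is a tiny standalone sketch of it (the helper name `split_is_balanced` is our own, mirroring the ratio test inside `split_node`):

```python
def split_is_balanced(n_left: int, n_right: int, imb: float = 0.95) -> bool:
    """Accept a split only if neither half holds more than `imb` of the vectors."""
    r = n_left / (n_left + n_right)
    return (1 - imb) < r < imb

balanced = split_is_balanced(940, 60)   # r = 0.94 falls inside (0.05, 0.95)
redo = not split_is_balanced(990, 10)   # r = 0.99 fails the check, so we resplit
```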

With node splitting in place, the `_build_tree` function will recursively call itself on all nodes. Leaf nodes are defined as those which contain fewer than `K` subvectors.

Great, so we’ve built a binary tree that lets us significantly reduce the scope of our search. Now let’s implement querying as well. Querying is fairly straightforward; we traverse the tree, repeatedly moving along the left or right branches until we arrive at the leaf we’re interested in:

```
def _query_linear(vecs: List[np.ndarray], q: np.ndarray, k: int) -> List[np.ndarray]:
    # brute-force search, sorted by distance to the query vector
    return sorted(vecs, key=lambda v: np.linalg.norm(q - v))[:k]


def query_tree(root: Node, q: np.ndarray, k: int) -> List[np.ndarray]:
    """Queries a single tree."""
    while root.left and root.right:
        dist_l = np.linalg.norm(q - root.left.ref)
        dist_r = np.linalg.norm(q - root.right.ref)
        root = root.left if dist_l < dist_r else root.right
    # brute-force search the leaf node
    return _query_linear(root.vecs, q, k)
```

This chunk of code greedily traverses the tree, searching a single leaf node (`nq = 1`). However, recall that we’re often interested in finding multiple nearest neighbors. Additionally, some of those nearest neighbors can reside in other leaf nodes as well. So how can we solve these issues?

(Yes, I do realize that the main character’s name is spelled “Forrest” in the American classic.)

In a previous tutorial on IVF, recall that we often expanded our search beyond the Voronoi cell closest to the query vector. The reason is due to *cell edges* – if a query vector is close to a cell edge, it’s very likely that some of its nearest neighbors lie in a neighboring cell. These “edges” are much more common in high-dimensional spaces, so a large-ish value of `nprobe` is often used when high recall is needed.

We face the same problem for tree-based indexes – some of our nearest neighbors may lie outside the nearest leaf node/polygon. Annoy solves this by 1) allowing searches on both sides of a split and 2) creating a *forest* of trees.

Let’s first expand on our implementation from the previous section to search both sides of a split:

```
def _select_nearby(node: Node, q: np.ndarray, thresh: float = 0):
    """Functions identically to `_is_query_in_left_half`, but can return both."""
    if not node.left or not node.right:
        return ()
    dist_l = np.linalg.norm(q - node.left.ref)
    dist_r = np.linalg.norm(q - node.right.ref)
    # if the query is close to the splitting hyperplane, search both sides
    if np.abs(dist_l - dist_r) < thresh:
        return (node.left, node.right)
    if dist_l < dist_r:
        return (node.left,)
    return (node.right,)


def _query_tree(root: Node, q: np.ndarray, k: int) -> List[np.ndarray]:
    """This replaces the `query_tree` function above."""
    pq = [root]
    nns = []
    while pq:
        node = pq.pop(0)
        nearby = _select_nearby(node, q, thresh=0.05)
        # if `_select_nearby` returns neither child, we've reached a leaf
        if nearby:
            pq.extend(nearby)
        else:
            nns.extend(node.vecs)
    # brute-force search the nearest neighbor candidates
    return _query_linear(nns, q, k)


def query_forest(forest: List[Node], q: np.ndarray, k: int = 10):
    # deduplicate candidates across trees by their raw bytes,
    # since `np.ndarray` is not hashable
    nns = {}
    for root in forest:
        for v in _query_tree(root, q, k):
            nns[v.tobytes()] = v
    return _query_linear(list(nns.values()), q, k)
```

Next, we’ll add a function that allows us to build the full index as a forest of trees:

```
def build_forest(vecs: List[np.ndarray], N: int = 32, K: int = 64, imb: float = 0.95) -> List[Node]:
    """Builds a forest of `N` trees."""
    forest = []
    for _ in range(N):
        # copy the dataset so that tree building doesn't mutate the original
        root = Node(None, list(vecs))
        _build_tree(root, K, imb)
        forest.append(root)
    return forest
```

With everything implemented, let’s now put it all together, as we’ve done for IVF, SQ, PQ, and HNSW:

```
from typing import List, Optional

import numpy as np


class Node(object):
    """Initialize with a set of vectors, then call `split()`."""

    def __init__(self, ref: np.ndarray, vecs: List[np.ndarray]):
        self._ref = ref
        self._vecs = vecs
        self._left = None
        self._right = None

    @property
    def ref(self) -> Optional[np.ndarray]:
        """Reference point in n-d hyperspace. Evaluates to `False` if root node."""
        return self._ref

    @property
    def vecs(self) -> List[np.ndarray]:
        """Vectors for this leaf node. Evaluates to `False` if not a leaf."""
        return self._vecs

    @property
    def left(self) -> Optional[object]:
        """Left node."""
        return self._left

    @property
    def right(self) -> Optional[object]:
        """Right node."""
        return self._right

    def split(self, K: int, imb: float) -> bool:

        # stopping condition: maximum of `K` vectors in a leaf node
        if len(self._vecs) <= K:
            return False

        # continue for a maximum of 5 attempts
        for n in range(5):
            left_vecs = []
            right_vecs = []

            # take two random vectors as the left and right reference points
            left_ref = self._vecs.pop(np.random.randint(len(self._vecs)))
            right_ref = self._vecs.pop(np.random.randint(len(self._vecs)))

            # split the remaining vectors into halves
            for vec in self._vecs:
                dist_l = np.linalg.norm(vec - left_ref)
                dist_r = np.linalg.norm(vec - right_ref)
                if dist_l < dist_r:
                    left_vecs.append(vec)
                else:
                    right_vecs.append(vec)

            # check to make sure that the tree is mostly balanced
            r = len(left_vecs) / len(self._vecs)
            if r < imb and r > (1 - imb):
                self._left = Node(left_ref, left_vecs)
                self._right = Node(right_ref, right_vecs)
                return True

            # redo the split if the imbalance is too high
            self._vecs.append(left_ref)
            self._vecs.append(right_ref)

        return False


def _select_nearby(node: Node, q: np.ndarray, thresh: float = 0):
    """Functions identically to `_is_query_in_left_half`, but can return both."""
    if not node.left or not node.right:
        return ()
    dist_l = np.linalg.norm(q - node.left.ref)
    dist_r = np.linalg.norm(q - node.right.ref)
    # if the query is close to the splitting hyperplane, search both sides
    if np.abs(dist_l - dist_r) < thresh:
        return (node.left, node.right)
    if dist_l < dist_r:
        return (node.left,)
    return (node.right,)


def _build_tree(node, K: int, imb: float):
    """Recurses on left and right halves to build a tree."""
    node.split(K=K, imb=imb)
    if node.left:
        _build_tree(node.left, K=K, imb=imb)
    if node.right:
        _build_tree(node.right, K=K, imb=imb)


def build_forest(vecs: List[np.ndarray], N: int = 32, K: int = 64, imb: float = 0.95) -> List[Node]:
    """Builds a forest of `N` trees."""
    forest = []
    for _ in range(N):
        # copy the dataset so that tree building doesn't mutate the original
        root = Node(None, list(vecs))
        _build_tree(root, K, imb)
        forest.append(root)
    return forest


def _query_linear(vecs: List[np.ndarray], q: np.ndarray, k: int) -> List[np.ndarray]:
    # brute-force search, sorted by distance to the query vector
    return sorted(vecs, key=lambda v: np.linalg.norm(q - v))[:k]


def _query_tree(root: Node, q: np.ndarray, k: int) -> List[np.ndarray]:
    """Queries a single tree."""
    pq = [root]
    nns = []
    while pq:
        node = pq.pop(0)
        nearby = _select_nearby(node, q, thresh=0.05)
        # if `_select_nearby` returns neither child, we've reached a leaf
        if nearby:
            pq.extend(nearby)
        else:
            nns.extend(node.vecs)
    # brute-force search the nearest neighbor candidates
    return _query_linear(nns, q, k)


def query_forest(forest: List[Node], q: np.ndarray, k: int = 10):
    # deduplicate candidates across trees by their raw bytes,
    # since `np.ndarray` is not hashable
    nns = {}
    for root in forest:
        for v in _query_tree(root, q, k):
            nns[v.tobytes()] = v
    return _query_linear(list(nns.values()), q, k)
```
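To see the whole pipeline in action, here is a condensed, self-contained end-to-end run of the same ideas (a compact re-implementation of our own, not the exact classes above): build a small forest over random vectors, greedily query each tree, and measure recall against an exact brute-force search.

```python
import numpy as np

K = 32  # maximum number of vectors per leaf node

def build(vecs, rng):
    """Recursively split `vecs` into a binary tree of (ref_l, ref_r, left, right)."""
    if len(vecs) <= K:
        return vecs  # leaf: a plain list of vectors
    idx = rng.choice(len(vecs), size=2, replace=False)
    ref_l, ref_r = vecs[idx[0]], vecs[idx[1]]
    left = [v for v in vecs if np.linalg.norm(v - ref_l) < np.linalg.norm(v - ref_r)]
    right = [v for v in vecs if np.linalg.norm(v - ref_l) >= np.linalg.norm(v - ref_r)]
    return (ref_l, ref_r, build(left, rng), build(right, rng))

def query(tree, q):
    """Greedily descend to a single leaf and return its vectors."""
    while isinstance(tree, tuple):
        ref_l, ref_r, left, right = tree
        tree = left if np.linalg.norm(q - ref_l) < np.linalg.norm(q - ref_r) else right
    return tree

rng = np.random.default_rng(42)
dataset = list(rng.normal(size=(1000, 64)))
forest = [build(dataset, rng) for _ in range(8)]

q = rng.normal(size=64)
# pool candidates from all trees, deduplicating by raw bytes
candidates = {v.tobytes(): v for t in forest for v in query(t, q)}
approx = sorted(candidates.values(), key=lambda v: np.linalg.norm(q - v))[:10]
exact = sorted(dataset, key=lambda v: np.linalg.norm(q - v))[:10]

exact_keys = {v.tobytes() for v in exact}
recall = sum(v.tobytes() in exact_keys for v in approx) / 10
print(f"recall@10 across 8 trees: {recall:.2f}")
```

Note that this sketch skips the imbalance retries and the both-sides-of-a-split search, so its recall is a lower bound on what the full implementation above achieves; adding trees to the forest is the simplest knob for trading query time for recall.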

And that’s it for Annoy!

In this tutorial, we took a deep dive into Annoy, a tree-based indexing technique with a playful name. Python is not the best language for implementing vector search data structures due to interpreter overhead, but we use as much NumPy-based vectorized array math as possible. There are also many optimizations we could make to avoid copying memory back and forth, but I’ll leave those as an exercise for the reader.

In the following tutorial, we’ll continue our deep dive into indexing techniques with a rundown of the Vamana algorithm – also known more commonly as *DiskANN* – a graph-based indexing algorithm that is tailored specifically towards querying directly from solid-state drives.

All code for this tutorial is freely available on my GitHub.

- Introduction to Unstructured Data
- What is a Vector Database?
- Comparing Vector Databases, Vector Search Libraries, and Vector Search Plugins
- Introduction to Milvus
- Milvus Quickstart
- Introduction to Vector Similarity Search
- Vector Index Basics and the Inverted File Index
- Scalar Quantization and Product Quantization
- Hierarchical Navigable Small Worlds (HNSW)
- Approximate Nearest Neighbor Oh Yeah (ANNOY)
- Choosing the Right Vector Index for Your Project
- DiskANN and the Vamana Algorithm