Building vector search in 200 lines of Rust
With the rapid progress in AI/ML adoption, vector databases are everywhere. And while they enable complex AI/ML applications, vector search itself is conceptually not that hard. In this post, we will describe how vector databases work and also build a simple vector search library in under 200 lines of Rust code. All code can be found in this Github repo.
But first, let's establish what vector search is.
Introduction to Vectors (aka Embeddings)
Complex unstructured data like documents, images, and videos are difficult to represent and query in traditional databases – especially if the query intent is to find "similar" items. So how can Youtube choose the best video to play next? Or Spotify customize the music queue according to your current song?
Advances in AI in the early 2010s (starting with Word2Vec and GloVe) enabled us to build semantic representations of these items, in which they are represented as points in cartesian space. Say one video gets mapped to the point [0.1, 5.1, 7.55] and another gets mapped to the point [5.3, 0.1, 2.7]. The magic of these ML algorithms is that the representations are chosen such that they maintain semantic information – the more similar two videos are, the smaller the distance between their vectors.
Note that these vectors (more technically called embeddings) don't have to be 3-dimensional – they could be, and often are, in higher-dimensional spaces (say 128 dimensions or 750 dimensions). And the distance doesn't have to be euclidean distance either – other forms of distance, like dot products, also work. Either way, all that matters is that the distances between them correspond to their similarities.
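As a toy illustration (not part of the library we build below), here is the euclidean distance between the two example video embeddings from the previous paragraph:

```rust
// Euclidean distance between the two example 3-d video embeddings above.
fn euclidean(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y).powi(2))
        .sum::<f32>()
        .sqrt()
}

fn main() {
    let video_a = [0.1, 5.1, 7.55];
    let video_b = [5.3, 0.1, 2.7];
    // ~8.69: in a good embedding space, more similar videos score lower.
    println!("distance = {:.2}", euclidean(&video_a, &video_b));
}
```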
Now imagine that we have access to such vectors for all Youtube videos. How do we find the most similar video to a given starting video? Simple: loop through all the videos, compute the distance to each one, and choose the video with the smallest distance – also known as finding the "nearest neighbours" of the query video. This will actually work. Except, as you can guess, a linear O(N) scan can be too costly. So we need a faster, sublinear way to find the nearest neighbours of any query video. That is generally impossible – something has to give.
As it turns out, in practical situations we don't need to find the exact nearest video – it is usually okay to find near-enough videos. And this is where approximate nearest neighbour (ANN) search algorithms, also known as vector search, come in. The goal is to sublinearly (ideally in logarithmic time) find close-enough nearest neighbours of any point in a space. So how do we solve that?
How to Find Approximate Nearest Neighbours?
The basic idea behind all vector search algorithms is the same – do some preprocessing to identify points that are close enough to each other (somewhat akin to building an index). At query time, use this "index" to rule out large swaths of points, and do a linear scan across the small number of points that don't get ruled out.
However, there are many ways of approaching this simple idea. Several state-of-the-art vector search algorithms exist, like HNSW (a graph that connects close-proximity vertices and also maintains long-distance edges, with a fixed entry point). There are open-source efforts like Facebook's FAISS, and several PaaS offerings for high-availability vector databases like Pinecone and Weaviate.
In this post, we will build a simplified vector search index over the given "N" points as follows:
- Randomly take 2 arbitrary available vectors A and B.
- Calculate the midpoint between these 2 vectors, called C.
- Build a hyperplane (the analog of a "line" in higher dimensions) that passes through C and is perpendicular to the line segment connecting A and B.
- Classify all the vectors as being either "above" or "below" the hyperplane, splitting the available vectors into 2 groups.
- For each of the 2 groups: if the size of the group is higher than a configurable parameter "maximum node size", recursively call this process on that group to build a subtree. Else, build a single leaf node with all the vectors (or their unique ids).
We thus use this randomized process to build a tree where every internal node is a hyperplane definition, with the left subtree being all the vectors "below" the hyperplane and the right subtree being all the vectors "above". The set of vectors is recursively split until leaf nodes contain no more than "maximum node size" vectors. Consider the example in the following figure with 5 points:
We randomly choose vectors A1=(4,2) and B1=(5,7). Their midpoint is (4.5,4.5), and we build a line through the midpoint perpendicular to the line (A1, B1). That line is x + 5y = 27 (drawn in blue), which gives us one group of 2 vectors and one group of 4. Assume the "maximum node size" is configured to be 2. We don't further split the first group, but choose a new (A2, B2) from the latter to build the red hyperplane, and so on. Repeated splits on a large dataset partition the hyperspace into several distinct regions, as shown below.
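The blue split above can be checked numerically. A minimal sketch (the only coordinates taken from the figure are A1 and B1):

```rust
// Verify the worked example: the hyperplane through the midpoint of
// A1=(4,2) and B1=(5,7), perpendicular to segment A1B1, is x + 5y = 27.
// Returns the signed side: positive => "above", negative => "below".
fn side(point: (f32, f32)) -> f32 {
    let (a1, b1) = ((4.0f32, 2.0f32), (5.0f32, 7.0f32));
    let normal = (b1.0 - a1.0, b1.1 - a1.1); // n = B1 - A1 = (1, 5)
    let mid = ((a1.0 + b1.0) / 2.0, (a1.1 + b1.1) / 2.0); // (4.5, 4.5)
    let constant = normal.0 * mid.0 + normal.1 * mid.1; // 1*4.5 + 5*4.5 = 27
    normal.0 * point.0 + normal.1 * point.1 - constant
}

fn main() {
    assert!(side((5.0, 7.0)) > 0.0); // B1 falls above the blue line
    assert!(side((4.0, 2.0)) < 0.0); // A1 falls below it
}
```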
Each region here represents a leaf node, and the intuition is that points that are close enough are likely to end up in the same leaf node. So given a query point, we can traverse down the tree in logarithmic time to locate the leaf it belongs to and run a linear scan against all the (small number of) points in that leaf. This is obviously not foolproof – it is entirely possible that points that are actually close get separated by a hyperplane and end up very far from each other. But this problem can be tackled by building not one but many independent trees – that way, if two points are close enough, they are much more likely to end up in the same leaf node in at least some trees. At query time, we traverse down all the trees to locate the relevant leaf nodes, take a union of all the candidates across all leaves, and do a linear scan on all of them.
Okay, enough theory. Let's start writing some code, and first define some utilities for a Vector type in Rust below, for dot product, averaging, hashing, and squared L2 distance. Thanks to Rust's nice type system, we propagate the generic type parameter N to force all vectors in an index to be of the same dimensionality at compile time.
#[derive(Eq, PartialEq, Hash)]
pub struct HashKey<const N: usize>([u32; N]);
#[derive(Copy, Clone)]
pub struct Vector<const N: usize>(pub [f32; N]);
impl<const N: usize> Vector<N> {
    pub fn subtract_from(&self, vector: &Vector<N>) -> Vector<N> {
        let mapped = self.0.iter().zip(vector.0).map(|(a, b)| b - a);
        let coords: [f32; N] = mapped.collect::<Vec<_>>().try_into().unwrap();
        return Vector(coords);
    }
    pub fn avg(&self, vector: &Vector<N>) -> Vector<N> {
        let mapped = self.0.iter().zip(vector.0).map(|(a, b)| (a + b) / 2.0);
        let coords: [f32; N] = mapped.collect::<Vec<_>>().try_into().unwrap();
        return Vector(coords);
    }
    pub fn dot_product(&self, vector: &Vector<N>) -> f32 {
        return self.0.iter().zip(vector.0).map(|(a, b)| a * b).sum::<f32>();
    }
    pub fn to_hashkey(&self) -> HashKey<N> {
        // f32 does not implement Hash, so we hash the raw bit patterns instead
        let bit_iter = self.0.iter().map(|a| a.to_bits());
        let data: [u32; N] = bit_iter.collect::<Vec<_>>().try_into().unwrap();
        return HashKey::<N>(data);
    }
    pub fn sq_euc_dis(&self, vector: &Vector<N>) -> f32 {
        return self.0.iter().zip(vector.0).map(|(a, b)| (a - b).powi(2)).sum();
    }
}
With these core utilities built out, let's next focus on generating random hyperplanes and building a forest of nearest-neighbour trees. How should we represent points within a tree?
We could directly store D-dimensional vectors inside leaf nodes. But that significantly fragments memory (a major performance hit) for large D, and also duplicates memory in a forest when multiple trees refer to the same vectors. Instead, we store the vectors in one global contiguous location and hold `usize` indices (8 bytes on a 64-bit system, instead of 4D bytes, since an f32 takes 4 bytes) at the leaf nodes. Here are the data types used to represent inner and leaf nodes of the tree.
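The arithmetic is easy to check with `std::mem::size_of` (shown here for D = 300):

```rust
use std::mem::size_of;

fn main() {
    // A leaf storing a usize index costs 8 bytes per vector (on 64-bit)...
    assert_eq!(size_of::<usize>(), 8);
    // ...while inlining a 300-dim f32 vector would cost 4 * 300 = 1200 bytes,
    // duplicated in every tree that references it.
    assert_eq!(size_of::<[f32; 300]>(), 1200);
}
```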
enum Node<const N: usize> {
    Inner(Box<InnerNode<N>>),
    Leaf(Box<LeafNode<N>>),
}
struct LeafNode<const N: usize>(Vec<usize>);
struct InnerNode<const N: usize> {
    hyperplane: HyperPlane<N>,
    left_node: Node<N>,
    right_node: Node<N>,
}
pub struct ANNIndex<const N: usize> {
    trees: Vec<Node<N>>,
    ids: Vec<i32>,
    values: Vec<Vector<N>>,
}
How do we actually find the right hyperplanes?
We sample two unique indexes for vectors A and B, calculate n = A – B, and find the midpoint of A and B (point_on_plane). The hyperplane is efficiently stored as a struct of coefficients (the vector n) and a constant (the dot product of n and point_on_plane), since n(x−x0) = n·x − n·x0. We can perform a dot product between any vector and n and subtract the constant to place the vector "above" or "below" the hyperplane. As internal nodes in our trees hold hyperplane definitions and leaf nodes hold vector IDs, we can type-check our tree with ADTs:
use rand::seq::SliceRandom; // for choose_multiple

impl<const N: usize> ANNIndex<N> {
    fn build_hyperplane(
        indexes: &Vec<usize>,
        all_vecs: &Vec<Vector<N>>,
    ) -> (HyperPlane<N>, Vec<usize>, Vec<usize>) {
        let sample: Vec<_> = indexes
            .choose_multiple(&mut rand::thread_rng(), 2)
            .collect();
        // cartesian eq for hyperplane n * (x - x_0) = 0
        // n (normal vector) is the coefs x_1 to x_n
        let (a, b) = (*sample[0], *sample[1]);
        let coefficients = all_vecs[a].subtract_from(&all_vecs[b]);
        let point_on_plane = all_vecs[a].avg(&all_vecs[b]);
        let constant = coefficients.dot_product(&point_on_plane);
        let hyperplane = HyperPlane::<N> {
            coefficients,
            constant,
        };
        let (mut above, mut below) = (vec![], vec![]);
        for &id in indexes.iter() {
            if hyperplane.point_is_above(&all_vecs[id]) {
                above.push(id)
            } else {
                below.push(id)
            };
        }
        return (hyperplane, above, below);
    }
}
We can thus define our recursive process to build trees based on the index-time "maximum node size":
impl<const N: usize> ANNIndex<N> {
    fn build_a_tree(
        max_size: i32,
        indexes: &Vec<usize>,
        all_vecs: &Vec<Vector<N>>,
    ) -> Node<N> {
        if indexes.len() <= (max_size as usize) {
            return Node::Leaf(Box::new(LeafNode::<N>(indexes.clone())));
        }
        let (plane, above, below) = Self::build_hyperplane(indexes, all_vecs);
        let node_above = Self::build_a_tree(max_size, &above, all_vecs);
        let node_below = Self::build_a_tree(max_size, &below, all_vecs);
        return Node::Inner(Box::new(InnerNode::<N> {
            hyperplane: plane,
            left_node: node_below,
            right_node: node_above,
        }));
    }
}
Notice that building a hyperplane between two points requires those two points to be distinct – i.e. we must deduplicate our vector set before indexing, since the algorithm doesn't allow duplicates.
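This is also why the `HashKey` type above hashes bit patterns rather than floats: `f32` implements neither `Eq` nor `Hash` in Rust (because of NaN semantics). A tiny illustration of the dedup trick:

```rust
use std::collections::HashSet;

fn main() {
    // f32 implements neither Eq nor Hash (NaN != NaN), so we key on raw bits.
    let mut seen: HashSet<[u32; 3]> = HashSet::new();
    let vector = [0.1f32, 5.1, 7.55];
    let key = vector.map(|a| a.to_bits());
    assert!(seen.insert(key)); // first copy is kept...
    assert!(!seen.insert(key)); // ...and the exact duplicate is dropped
}
```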
And thus the entire index (a forest of trees) can be built with:
impl<const N: usize> ANNIndex<N> {
    fn deduplicate(
        vectors: &Vec<Vector<N>>,
        ids: &Vec<i32>,
        dedup_vectors: &mut Vec<Vector<N>>,
        ids_of_dedup_vectors: &mut Vec<i32>,
    ) {
        let mut hashes_seen = HashSet::new();
        for i in 0..vectors.len() {
            let hash_key = vectors[i].to_hashkey();
            if !hashes_seen.contains(&hash_key) {
                hashes_seen.insert(hash_key);
                dedup_vectors.push(vectors[i]);
                ids_of_dedup_vectors.push(ids[i]);
            }
        }
    }

    pub fn build_index(
        num_trees: i32,
        max_size: i32,
        vecs: &Vec<Vector<N>>,
        vec_ids: &Vec<i32>,
    ) -> ANNIndex<N> {
        let (mut unique_vecs, mut ids) = (vec![], vec![]);
        Self::deduplicate(vecs, vec_ids, &mut unique_vecs, &mut ids);
        // Trees hold an index into the [unique_vecs] list, which is not
        // necessarily its id, if duplicates existed
        let all_indexes: Vec<usize> = (0..unique_vecs.len()).collect();
        let trees: Vec<_> = (0..num_trees)
            .map(|_| Self::build_a_tree(max_size, &all_indexes, &unique_vecs))
            .collect();
        return ANNIndex::<N> {
            trees,
            ids,
            values: unique_vecs,
        };
    }
}
Query Time
Once the index is built, how can we use it to search for the K approximate nearest neighbours of an input vector on a single tree? At non-leaf nodes we store hyperplanes, so we can start at the tree's root and ask: "is this vector above or below this hyperplane?" This can be calculated in O(D) with a dot product. Based on the answer, we can recursively search the left or right subtree until we hit a leaf node. Remember that the leaf node stores at most "maximum node size" vectors, which are in the approximate neighbourhood of the input vector (since they fall in the same partition of the hyperspace under all hyperplanes, see Figure 1(b)). If the number of vector indices at this leaf node exceeds K, we can now sort all these vectors by L2 distance to the input vector and return the closest K!
Assuming our indexing leads to a balanced tree, for dimension D, number of vectors N, and maximum node size M << N, search takes O(Dlog(N) + DM + Mlog(M)) – this consists of the average worst-case log(N) hyperplane comparisons to find a leaf node (the height of the tree), where each comparison costs an O(D) dot product; computing the L2 metric for all candidate vectors in a leaf node in O(DM); and finally sorting them to return the top K in O(Mlog(M)).
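Plugging in the numbers used later in the benchmark (D = 300, N ≈ 1M, M = 15) gives a feel for where time goes – a back-of-envelope sketch only, assuming a perfectly balanced tree:

```rust
/// (height, descent ops, leaf-scan ops, sort comparisons) for a balanced tree.
fn query_cost(d: f64, n: f64, m: f64) -> (f64, f64, f64, f64) {
    let height = (n / m).log2().ceil(); // hyperplane tests to reach a leaf
    (height, height * d, d * m, m * m.log2())
}

fn main() {
    // D = 300 dims, N = 1M vectors, M = 15: the O(D) descent (~17 levels,
    // ~5100 mul-adds) and the leaf scan (4500 ops) dominate; the final
    // sort (~59 comparisons) is noise.
    let (height, descent, scan, sort) = query_cost(300.0, 1_000_000.0, 15.0);
    println!("height={height} descent={descent} scan={scan} sort={sort:.0}");
}
```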
However, what happens if the leaf node we found has fewer than K vectors? This is possible if the maximum node size is too low, or if a relatively uneven hyperplane split leaves very few vectors in a subtree. To handle this, we can add some simple backtracking to our tree search. For instance, we could make an additional recursive call at the inner node to visit the other branch if the number of candidates returned so far is not sufficient. This is how it could look:
impl<const N: usize> ANNIndex<N> {
    fn tree_result(
        query: Vector<N>,
        n: i32,
        tree: &Node<N>,
        candidates: &mut HashSet<usize>,
    ) -> i32 {
        // take everything in the node; if still needed, take from the alternate subtree
        match tree {
            Node::Leaf(box_leaf) => {
                let leaf_values = &(box_leaf.0);
                let num_candidates_found = min(n as usize, leaf_values.len());
                for i in 0..num_candidates_found {
                    candidates.insert(leaf_values[i]);
                }
                return num_candidates_found as i32;
            }
            Node::Inner(inner) => {
                let above = (*inner).hyperplane.point_is_above(&query);
                let (main, backup) = match above {
                    true => (&(inner.right_node), &(inner.left_node)),
                    false => (&(inner.left_node), &(inner.right_node)),
                };
                match Self::tree_result(query, n, main, candidates) {
                    k if k < n => {
                        k + Self::tree_result(query, n - k, backup, candidates)
                    }
                    k => k,
                }
            }
        }
    }
}
Note that we could further optimize away recursive calls by also storing, at each inner node, the total number of vectors in its subtree and a list of pointers directly to all its leaf nodes, but we aren't doing that here for simplicity.
Expanding this search to a forest of trees is trivial – simply collect the top K candidates from all trees independently, sort them by distance, and return the overall top K matches. Note that a higher number of trees has a linearly higher memory footprint and linearly-scaled search time, but can yield better, "closer" neighbours since we collect candidates across different trees.
impl<const N: usize> ANNIndex<N> {
    pub fn search_approximate(
        &self,
        query: Vector<N>,
        top_k: i32,
    ) -> Vec<(i32, f32)> {
        let mut candidates = HashSet::new();
        for tree in self.trees.iter() {
            Self::tree_result(query, top_k, tree, &mut candidates);
        }
        candidates
            .into_iter()
            .map(|idx| (idx, self.values[idx].sq_euc_dis(&query)))
            // sorted_by comes from the itertools crate
            .sorted_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
            .take(top_k as usize)
            .map(|(idx, dis)| (self.ids[idx], dis))
            .collect()
    }
}
And that gives us a simple vector search index in 200 lines of Rust!
This implementation is fairly simple, for illustrative purposes – in fact, it is so simple that we suspect it must be much worse than state-of-the-art algorithms (despite being similar in the broader approach). Let's do some benchmarking to confirm our suspicion.
Algorithms can be evaluated on both latency and quality. Quality is typically measured by recall – the percentage of the actual nearest neighbours (as obtained from a linear scan) that are also returned by the approximate nearest neighbour search. Sometimes the returned results aren't technically in the top K, but are so close to the actual top K that it doesn't matter – to quantify that, we can also look at the average euclidean distance of the neighbours and compare it with the average distance under brute-force search.
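Recall is then just set overlap between the exact and approximate result ids – a sketch of the metric (the ids below are made up for illustration):

```rust
use std::collections::HashSet;

/// Fraction of the exact top-K ids that the approximate search also returned.
fn recall(exact: &[i32], approximate: &[i32]) -> f64 {
    let approx: HashSet<_> = approximate.iter().collect();
    let hits = exact.iter().filter(|id| approx.contains(id)).count();
    hits as f64 / exact.len() as f64
}

fn main() {
    let exact = [1, 2, 3, 4, 5];       // from a linear scan
    let approximate = [1, 2, 3, 9, 8]; // from the ANN index
    assert_eq!(recall(&exact, &approximate), 0.6);
}
```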
Measuring latency is straightforward – we can look at the time it takes to execute a query (we are typically less interested in index build latencies).
All benchmarking results were run on a single-device CPU with a 2.3 GHz Quad-Core Intel Core i5 processor, using the 999,994 FastText wiki-news embeddings (https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip) in 300 dimensions. We set the "top K" to 20 for all our queries.
As a point of reference, we compare a FAISS HNSW index (ef_search=16, ef_construction=40, max_node_size=15) against a small version of our Rust index (num_trees=3, max_node_size=15). We implemented exhaustive search in Rust, while the FAISS library has a C++ source for HNSW. Raw latency numbers reinforce the benefit of approximate search:
| Algorithm | Latency | QPS |
|---|---|---|
| Exhaustive Search | 675.25ms | 1.48 |
| FAISS HNSW Index | 355.36μs | 2814.05 |
| Custom Rust Index | 112.02μs | 8926.98 |
Both approximate nearest neighbour approaches are three orders of magnitude faster, which is great. And it looks like our Rust implementation is 3x faster than HNSW in this micro-benchmark. To gauge quality, visually consider the first 10 nearest neighbours returned for the prompt "river".
| Exhaustive Search | | FAISS HNSW Index | | Custom Rust Index | |
| Word | Euclidean Distance | Word | Euclidean Distance | Word | Euclidean Distance |
|---|---|---|---|---|---|
| river | 0 | river | 0 | river | 0 |
| River | 1.39122 | River | 1.39122 | creek | 1.63744 |
| rivers | 1.47646 | river | 1.58342 | river. | 1.73224 |
| river | 1.58342 | swift-flowing | 1.62413 | lake | 1.75655 |
| swift-flowing | 1.62413 | flood-swollen | 1.63798 | sea | 1.87368 |
| creek | 1.63744 | river.The | 1.68156 | upriver | 1.92088 |
| flood-swollen | 1.63798 | riverbed | 1.68510 | shore | 1.92266 |
| river.The | 1.68156 | unfordable | 1.69245 | brook | 2.01973 |
| riverbed | 1.68510 | River | 1.69512 | hill | 2.03419 |
| unfordable | 1.69245 | River.The | 1.69539 | pond | 2.04376 |
Alternatively, consider the prompt "war".
| Exhaustive Search | | FAISS HNSW Index | | Custom Rust Index | |
| Word | Euclidean Distance | Word | Euclidean Distance | Word | Euclidean Distance |
|---|---|---|---|---|---|
| war | 0 | war | 0 | war | 0 |
| war– | 1.38416 | war– | 1.38416 | war– | 1.38416 |
| war–a | 1.44906 | war–a | 1.44906 | wars | 1.45859 |
| wars | 1.45859 | wars | 1.45859 | quasi-war | 1.59712 |
| war–and | 1.45907 | war–and | 1.45907 | war-footing | 1.69175 |
| war.It | 1.46991 | war.It | 1.46991 | checkmate | 1.74982 |
| war.In | 1.49632 | war.In | 1.49632 | ill-begotten | 1.75498 |
| unwinable | 1.51296 | unwinable | 1.51296 | subequent | 1.76617 |
| war.And | 1.51830 | war.And | 1.51830 | humanitary | 1.77464 |
| hostilities | 1.54783 | Iraw | 1.54906 | execution | 1.77992 |
For the entire 999,994-word corpus, we also visualize the distribution of the average Euclidean distance from each word to its top K=20 approximate neighbours under HNSW and our custom Rust index:
The state-of-the-art HNSW index does provide relatively closer neighbours than our example index, with a mean and median distance of 1.31576 and 1.20230 respectively (compared to our example index's 1.47138 and 1.35620). On a randomized 10,000-sized subset of the corpus, HNSW demonstrates 58.2% recall for top K=20, while our example index yields varying recall (from 11.465% to 23.115%) across configurations (as discussed earlier, a higher number of trees provides higher recall):
| num_trees | max_node_size | Average runtime | QPS | Recall |
|---|---|---|---|---|
| 3 | 5 | 129.48μs | 7723 | 0.11465 |
| 3 | 15 | 112.02μs | 8297 | 0.11175 |
| 3 | 30 | 114.48μs | 8735 | 0.09265 |
| 9 | 5 | 16.77ms | 60 | 0.22095 |
| 9 | 15 | 1.54ms | 649 | 0.20985 |
| 9 | 30 | 370.80μs | 2697 | 0.16835 |
| 15 | 5 | 35.45ms | 28 | 0.29825 |
| 15 | 15 | 7.34ms | 136 | 0.28520 |
| 15 | 30 | 2.19ms | 457 | 0.23115 |
Why is FANN so fast?
As you can see in the numbers above, while the FANN algorithm isn't competitive with the state of the art on quality, it is at least pretty fast. Why is that the case?
Honestly, as we were building this, we got carried away and started doing performance optimizations just for fun. Here are a few optimizations that made the most difference:
- Offloading document deduplication to the indexing cold path. Referencing vectors by index instead of by a float array significantly speeds up search, since finding unique candidates across trees only requires hashing 8-byte indices (not 300-dim f32 data).
- Eagerly hashing and finding unique vectors before adding items to the global candidate list, and passing data by mutable reference across recursive search calls, to avoid copies across and within stack frames.
- Passing N as a generic type parameter, which allows 300-dim data to be type-checked as a 300-length f32 array (and not a variable-length vector type) to improve cache locality and reduce our memory footprint (vectors have an additional level of indirection to data on the heap).
- We also suspect that inner operations (e.g. dot products) are getting vectorized by the Rust compiler, but we didn't check.
More Real-World Considerations
There are several things this example skips over that are critical in productionizing vector search:
- Parallelizing search across multiple trees. Instead of collecting candidates sequentially, we can parallelize, since each tree accesses distinct memory – each tree can run on a separate thread, with candidates continuously sent to the main process via messages down a channel. The threads can be spawned at indexing time and warmed up with dummy searches (so sections of the tree live in cache) to reduce overhead at search time. Search will then no longer scale linearly with the number of trees.
- Large trees may not fit in RAM, requiring efficient ways to read from disk – certain subgraphs may need to live on disk, with algorithms designed to allow search while minimizing file I/O.
- Going further, if trees don't fit on one instance's disk, we need to distribute subtrees across instances and have recursive search calls fire RPC requests when the data isn't available locally.
- The tree involves many memory indirections (pointer-based trees aren't L1-cache friendly). Balanced trees can be laid out compactly in an array, but our tree is only close-to-balanced with randomized hyperplanes – can we use a new data structure for the tree?
- Solutions to the above issues should also hold when new data is indexed on the fly (potentially requiring resharding for large trees). Should trees be rebuilt if a certain sequence of indexing leads to a highly unbalanced tree?
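The first point can be sketched with plain std threads and channels. This is a toy sketch, not the fann code: the "trees" here are stubbed as precomputed candidate lists, where the real index would walk one tree of the forest per thread.

```rust
use std::sync::mpsc;
use std::thread;

/// Query several "trees" in parallel, merging candidates over a channel.
/// Each stub tree is a precomputed candidate list standing in for a real
/// tree traversal running on its own thread.
fn parallel_candidates(trees: Vec<Vec<usize>>) -> Vec<usize> {
    let (tx, rx) = mpsc::channel();
    for tree in trees {
        let tx = tx.clone();
        thread::spawn(move || tx.send(tree).unwrap());
    }
    drop(tx); // close the channel so the collecting loop terminates
    let mut all: Vec<usize> = rx.iter().flatten().collect();
    all.sort();
    all.dedup(); // union of candidates across trees
    all
}

fn main() {
    let trees = vec![vec![1, 2], vec![2, 3], vec![4]];
    assert_eq!(parallel_candidates(trees), vec![1, 2, 3, 4]);
}
```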
Conclusion
If you got this far, congrats! You've just seen a simple vector search in ~200 lines of Rust, as well as our ramblings on productionizing vector search for planet-scale applications. We hope you enjoyed the read, and feel free to access the source code at https://github.com/fennelai/fann.
About Fennel
Fennel is a realtime feature engineering platform built in Rust. Check out the Fennel technical docs to learn more, or play around with it on your laptop by following the quickstart tutorial.