Unum · USearch 0.22.3 documentation
Smaller & Faster Single-File
Vector Search Engine
Euclidean • Angular • Jaccard • Hamming • Haversine • User-Defined Metrics
C++11 •
Python •
JavaScript •
Java •
Rust •
C99 •
Objective-C •
Swift •
GoLang •
Wolfram
Linux • MacOS • Windows • Docker • WebAssembly
Comparison with FAISS
FAISS is a widely recognized standard for high-performance vector search engines.
USearch and FAISS both employ the same HNSW algorithm, but they differ significantly in their design principles.
USearch is compact and broadly compatible without sacrificing performance, with a primary focus on user-defined metrics and fewer dependencies.
| | FAISS | USearch |
|---|---|---|
| Implementation | 84 K SLOC in | 3 K SLOC in |
| Supported metrics | 9 fixed metrics | Any User-Defined metrics |
| Supported ID types | | |
| Dependencies | BLAS, OpenMP | None |
| Bindings | SWIG | Native |
| Acceleration | Learned Quantization | Downcasting |
Base functionality is identical to FAISS, and the interface should look familiar if you have ever investigated Approximate Nearest Neighbors search:
$ pip install usearch numpy
import numpy as np
from usearch.index import Index
index = Index(
    ndim=3, # Define the number of dimensions in input vectors
    metric='cos', # Choose 'l2sq', 'haversine' or other metric, default = 'ip'
    dtype='f32', # Quantize to 'f16' or 'f8' if needed, default = 'f32'
    connectivity=16, # Optional: How frequent should the connections in the graph be
    expansion_add=128, # Optional: Control the recall of indexing
    expansion_search=64, # Optional: Control the quality of search
)
vector = np.array([0.2, 0.6, 0.4])
index.add(42, vector)
matches, distances, count = index.search(vector, 10)
assert len(index) == 1
assert count == 1
assert matches[0] == 42
assert distances[0] <= 0.001
assert np.allclose(index[42], vector)
User-Defined Functions
While most vector search packages focus on just a couple of metrics, "Inner Product distance" and "Euclidean distance", USearch extends this list to include any user-defined metrics.
This flexibility allows you to customize your search for a myriad of applications, from computing geo-spatial coordinates with the rare Haversine distance to creating custom metrics for composite embeddings from multiple AI models.
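To make the Haversine case concrete, here is a minimal standalone sketch of the great-circle distance between two latitude/longitude pairs. The function name `haversine_km` and the mean Earth radius of 6371 km are assumptions of this example and are not part of the USearch API.

```python
import math

def haversine_km(a, b, radius_km=6371.0):
    """Great-circle distance between two (latitude, longitude) pairs given in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    # Haversine of the central angle between the two points
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    # Clamp to 1.0 to guard against rounding error near antipodal points
    return 2 * radius_km * math.asin(math.sqrt(min(1.0, h)))

paris = (48.8566, 2.3522)
london = (51.5074, -0.1278)
distance = haversine_km(paris, london)
```

Any function of this shape, taking two vectors and returning a scalar, is the kind of metric the paragraph above refers to.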
Unlike older approaches to indexing high-dimensional spaces, like KD-Trees and Locality Sensitive Hashing, HNSW doesn't require vectors to be identical in length.
They only have to be comparable.
So you can apply it in obscure applications, like searching for similar sets or fuzzy text matching, using GZip as a distance function.
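One plausible way to turn GZip into a distance function is the normalized compression distance. The sketch below is an illustration under that assumption. The text above does not say which formula is intended, and the helper names `compressed_size` and `ncd` are hypothetical.

```python
import gzip

def compressed_size(data: bytes) -> int:
    """Length of the gzip-compressed payload, used as a rough proxy for information content."""
    return len(gzip.compress(data))

def ncd(a: bytes, b: bytes) -> float:
    """Normalized compression distance: small when the two inputs share structure."""
    size_a, size_b = compressed_size(a), compressed_size(b)
    size_ab = compressed_size(a + b)
    return (size_ab - min(size_a, size_b)) / max(size_a, size_b)

# The inputs have different lengths; they only need to be comparable
x = b"the quick brown fox jumps over the lazy dog " * 8
y = b"the quick brown fox jumps over the lazy cat " * 5
z = bytes(range(256)) * 2

near = ncd(x, y)  # similar texts compress well together
far = ncd(x, z)   # unrelated data does not
```

Because the function accepts byte strings of any length, it illustrates why a graph index that only needs pairwise comparisons can handle variable-length entries.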
Memory Efficiency, Downcasting, and Quantization
Training a quantization model and dimension-reduction is a common approach to accelerate vector search.
Those, however, are not always reliable, can significantly affect the statistical properties of your data, and require regular adjustments if your distribution shifts.
Instead, we have focused on high-precision arithmetic over low-precision downcasted vectors.
The same index, and `add` and `search` operations will automatically down-cast or up-cast between `f32_t`, `f16_t`, `f64_t`, and `f8_t` representations, even if the hardware doesn't natively support it.
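The idea of storing low-precision vectors while computing in higher precision can be sketched with plain NumPy. This is an illustration of the principle only, with hypothetical names, and it does not show how USearch implements it internally.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors_f32 = rng.random((1000, 96), dtype=np.float32)
query = rng.random(96, dtype=np.float32)

# Down-cast for storage: 2 bytes per scalar instead of 4
stored_f16 = vectors_f32.astype(np.float16)

def l2sq(matrix: np.ndarray, q: np.ndarray) -> np.ndarray:
    """Squared Euclidean distances, up-casting to f32 before any arithmetic."""
    diff = matrix.astype(np.float32) - q
    return np.einsum("ij,ij->i", diff, diff)

exact = l2sq(vectors_f32, query)
approx = l2sq(stored_f16, query)
max_relative_error = float(np.max(np.abs(approx - exact) / exact))
```

The stored matrix takes half the memory, while the distances stay close to the full-precision ones because the accumulation itself runs in `float32`.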
Continuing the topic of memory-efficiency, we provide a `uint40_t` to allow collections with over 4B+ vectors without allocating 8 bytes for every neighbor reference in the proximity graph.
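A quick back-of-the-envelope calculation shows why a 5-byte identifier matters. The sketch assumes 5 billion vectors and 16 neighbor references per node, and it ignores upper graph levels and per-node headers. These figures are illustrative and are not measurements.

```python
def neighbor_bytes(num_vectors: int, connectivity: int, id_bytes: int) -> int:
    """Rough size of the neighbor lists alone, for a given identifier width."""
    return num_vectors * connectivity * id_bytes

num_vectors = 5_000_000_000            # too many for a 4-byte identifier
fits_in_32_bits = num_vectors <= 2**32
fits_in_40_bits = num_vectors <= 2**40

with_uint64 = neighbor_bytes(num_vectors, 16, 8)
with_uint40 = neighbor_bytes(num_vectors, 16, 5)
saved_gb = (with_uint64 - with_uint40) / 1e9

# A 40-bit identifier round-trips through exactly 5 bytes
packed = (num_vectors - 1).to_bytes(5, "little")
unpacked = int.from_bytes(packed, "little")
```

Under these assumptions the narrower identifier saves 240 GB of neighbor references.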
| | FAISS, | USearch, | USearch, | USearch, |
|---|---|---|---|---|
| Batch Insert | 16 K/s | 73 K/s | 100 K/s | 104 K/s +550% |
| Batch Search | 82 K/s | 103 K/s | 113 K/s | 134 K/s +63% |
| Bulk Insert | 76 K/s | 105 K/s | 115 K/s | 202 K/s +165% |
| Bulk Search | 118 K/s | 174 K/s | 173 K/s | 304 K/s +157% |
| Recall @ 10 | 99% | 99.2% | 99.1% | 99.2% |
Dataset: 1M vectors sample of the Deep1B dataset.
Hardware: `c7g.metal` AWS instance with 64 cores and DDR5 memory.
HNSW was configured with identical hyper-parameters: connectivity `M=16`, expansion @ construction `efConstruction=128`, and expansion @ search `ef=64`.
Batch size is 256.
Both libraries were compiled for the target architecture.
Jump to the Performance Tuning section to read about the effects of those hyper-parameters.
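The benchmark settings map naturally onto the constructor arguments from the first example. The translation table below is a sketch: pairing `efConstruction` with `expansion_add` and `ef` with `expansion_search` is inferred from the parameter descriptions on this page, and `to_usearch_kwargs` is a hypothetical helper.

```python
# Classic HNSW names on the left, constructor argument names from the first example on the right
HNSW_TO_USEARCH = {
    "M": "connectivity",
    "efConstruction": "expansion_add",
    "ef": "expansion_search",
}

def to_usearch_kwargs(hnsw_params: dict) -> dict:
    """Rename HNSW hyper-parameters so they can be passed as keyword arguments."""
    return {HNSW_TO_USEARCH[name]: value for name, value in hnsw_params.items()}

benchmark_params = {"M": 16, "efConstruction": 128, "ef": 64}
kwargs = to_usearch_kwargs(benchmark_params)
# kwargs could then be passed alongside ndim and metric when constructing an Index
```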
Disk-based Indexes
With USearch, you can serve indexes from external memory, enabling you to optimize your server choices for indexing speed and serving costs.
This can result in a 20x cost reduction on AWS and other public clouds.
index.save("index.usearch")
loaded_copy = index.load("index.usearch")
view = Index.restore("index.usearch", view=True)
other_view = Index(ndim=..., metric=CompiledMetric(...))
other_view.view("index.usearch")
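The difference between loading a copy and viewing a file can be pictured with NumPy alone. The sketch assumes that a view behaves like a read-only memory mapping, paged in from disk on access. That is an assumption of this illustration, and the file name is hypothetical.

```python
import os
import tempfile

import numpy as np

vectors = np.arange(12, dtype=np.float32).reshape(4, 3)
path = os.path.join(tempfile.mkdtemp(), "vectors.bin")
vectors.tofile(path)

# "Load": read the whole file into a private in-memory copy
loaded = np.fromfile(path, dtype=np.float32).reshape(4, 3)

# "View": map the file read-only, so pages are fetched from disk only when touched
viewed = np.memmap(path, dtype=np.float32, mode="r", shape=(4, 3))

row = np.array(viewed[2])  # touching one row does not require reading the rest
```

A mapped file lets the serving machine hold far less RAM than the machine that built the index, which is where the cost argument above comes from.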
Joins
One of the big questions these days is how AI will change the world of databases and data management.
Most databases are still struggling to implement high-quality fuzzy search, and the only kind of joins they know are deterministic.
A `join` is different from searching for every entry, as it requires a one-to-one mapping, banning collisions among separate search results.
| Exact Search | Fuzzy Search | Semantic Search ? |
|---|---|---|
| Exact Join | Fuzzy Join ? | Semantic Join ?? |
Using USearch, one can implement sub-quadratic complexity approximate, fuzzy, and semantic joins.
This can come in handy in any fuzzy-matching tasks common to Database Management Software.
men = Index(...)
women = Index(...)
pairs: dict = men.join(women, max_proposals=0, exact=False)
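The one-to-one requirement can be illustrated with a small stable-matching sketch in the spirit of the Gale-Shapley algorithm. Whether the library's `join` works this way is not stated here, so treat this as an assumption. The sketch also ranks candidates with a dense distance matrix for clarity, which is quadratic, whereas an approximate index would replace that step. The name `stable_join` is hypothetical.

```python
import numpy as np

def stable_join(left: np.ndarray, right: np.ndarray) -> dict:
    """Match rows of `left` to rows of `right` one-to-one; returns {left_row: right_row}."""
    dists = ((left[:, None, :] - right[None, :, :]) ** 2).sum(axis=2)
    preferences = np.argsort(dists, axis=1)  # each left row ranks all right rows
    next_choice = [0] * len(left)            # how far each left row got in its ranking
    engaged_to = {}                          # right row -> currently accepted left row
    free = list(range(len(left)))
    while free:
        l = free.pop()
        if next_choice[l] >= len(right):
            continue                         # ran out of candidates, stays unmatched
        r = int(preferences[l, next_choice[l]])
        next_choice[l] += 1
        current = engaged_to.get(r)
        if current is None:
            engaged_to[r] = l                # right row was free
        elif dists[l, r] < dists[current, r]:
            engaged_to[r] = l                # closer proposal wins, old partner is freed
            free.append(current)
        else:
            free.append(l)                   # rejected, will try its next choice
    return {l: r for r, l in engaged_to.items()}

men = np.array([[0.0, 0.0], [1.0, 1.0], [0.1, 0.0]])
women = np.array([[0.0, 0.1], [1.0, 0.9], [5.0, 5.0]])
pairs = stable_join(men, women)
```

Note that rows 0 and 2 of `men` both have row 0 of `women` as their nearest neighbor. A plain search would return that collision, while the join resolves it by sending row 2 to a remaining candidate.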
Functionality
By now, core functionality is supported across all bindings.
Broader functionality is ported per request.
| | C++ | Python | Java | JavaScript | Rust | GoLang | Swift |
|---|---|---|---|---|---|---|---|
| add/search/remove | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| save/load/view | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| join | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| user-defined metrics | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| variable-length vectors | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| 4B+ capacities | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
Application Examples
USearch + AI = Multi-Modal Semantic Search
AI has a growing number of applications, but one of the coolest classic ideas is to use it for Semantic Search.
One can take an encoder model, like the multi-modal UForm, and a web-programming framework, like UCall, and build a text-to-image search platform in just 20 lines of Python.
import ucall
import uform
import numpy as np
from PIL import Image
from usearch.index import Index

server = ucall.Server()
model = uform.get_model('unum-cloud/uform-vl-multilingual')
index = Index(ndim=256)

@server
def add(label: int, photo: Image.Image):
    # Encode the image and store its embedding under the given label
    image = model.preprocess_image(photo)
    vector = model.encode_image(image).detach().numpy()
    index.add(label, vector.flatten(), copy=True)

@server
def search(query: str) -> np.ndarray:
    # Encode the text query and return the labels of the 3 closest images
    tokens = model.preprocess_text(query)
    vector = model.encode_text(tokens).detach().numpy()
    matches = index.search(vector.flatten(), 3)
    return matches.labels

server.run()
We have pre-processed some commonly used datasets, cleaning the images, producing the vectors, and pre-building the index.
USearch + RDKit = Molecular Search
Comparing molecule graphs and searching for similar structures is expensive and slow.
It can be seen as a special case of the NP-Complete Subgraph Isomorphism problem.
Luckily, domain-specific approximate methods exist.
The one commonly used in Chemistry is to generate structures from SMILES, and later hash them into binary fingerprints.
The latter are searchable with bitwise similarity metrics, like the Tanimoto coefficient.
Below is an example using the RDKit package.
from usearch.index import Index, MetricKind
from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np
molecules = [Chem.MolFromSmiles('CCOC'), Chem.MolFromSmiles('CCO')]
encoder = AllChem.GetRDKitFPGenerator()
fingerprints = np.vstack([encoder.GetFingerprint(x) for x in molecules])
fingerprints = np.packbits(fingerprints, axis=1)
index = Index(ndim=2048, metric=MetricKind.Tanimoto)
labels = np.arange(len(molecules))
index.add(labels, fingerprints)
matches = index.search(fingerprints, 10)
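For reference, the Tanimoto coefficient on bit-packed fingerprints can be computed by hand as the ratio of shared set bits to total set bits. The sketch below returns it as a distance, one minus the coefficient, which is an assumption of this example. The toy 16-bit fingerprints stand in for the 2048-bit ones used above.

```python
import numpy as np

def tanimoto_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 - |A and B| / |A or B| for fingerprints packed into uint8 arrays."""
    intersection = int(np.unpackbits(a & b).sum())
    union = int(np.unpackbits(a | b).sum())
    return 1.0 - intersection / union if union else 0.0

bits_a = np.array([1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1], dtype=np.uint8)
bits_b = np.array([1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0], dtype=np.uint8)

# Pack 8 bits per byte, as np.packbits does for the RDKit fingerprints above
packed_a, packed_b = np.packbits(bits_a), np.packbits(bits_b)
distance = tanimoto_distance(packed_a, packed_b)
```

Here the two fingerprints share 5 set bits out of 9 set in either one, so the distance is 4/9.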
TODO
Integrations
Citations
@software{Vardanian_USearch_2022,
doi = {10.5281/zenodo.7949416},
author = {Vardanian, Ash},
title = {{USearch by Unum Cloud}},
url = {https://github.com/unum-cloud/usearch},
version = {0.13.0},
year = {2022},
month = jun,
}