VectorIndex API
Given a query vector, a vector index makes it possible to search for other, similar vectors. This API provides a unified interface to several existing vector index implementations.
Base
- class small_text.vector_indexes.base.VectorIndex
Abstract base class for a structure that indexes vectors and supports searching them.
Added in version 2.0.0.
- abstract property index: INDEX_TYPE | None
Returns the underlying index implementation.
- Returns:
index – The underlying index if an index has been built, otherwise None.
- Return type:
object or None
- abstract build(vectors, ids=None)
Constructs an index from the given vectors. Each vector is identified by an id. If no ids are passed, each vector gets assigned an ascending id starting at zero.
- Parameters:
vectors (np.ndarray[np.float32]) – A 2d matrix of vectors in the shape (num_vectors, num_dimensions).
ids (np.ndarray[int] or None, default=None) – An array of ids where each item corresponds to the respective row in the vectors argument.
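The default id assignment described above can be illustrated with a small sketch. The helper `resolve_ids` is hypothetical and not part of small-text; it only mirrors the documented behaviour that missing ids become ascending integers starting at zero.

```python
import numpy as np


def resolve_ids(vectors, ids=None):
    """Hypothetical helper: return one id per row of `vectors`."""
    if ids is None:
        # No ids passed: ascending ids starting at zero, one per row.
        return np.arange(vectors.shape[0])
    ids = np.asarray(ids)
    if ids.shape[0] != vectors.shape[0]:
        raise ValueError('ids must contain exactly one entry per row of vectors')
    return ids


# A matrix in the documented shape (num_vectors, num_dimensions).
vectors = np.random.rand(4, 3).astype(np.float32)
default_ids = resolve_ids(vectors)
custom_ids = resolve_ids(vectors, ids=np.array([10, 20, 30, 40]))
```

Passing explicit ids is useful when the rows correspond to positions in a larger dataset, so that search results can be mapped back without extra bookkeeping.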
- abstract remove(ids)
Removes the given vectors (identified by their ids) from the vector index.
- abstract search(vectors, k: int = 10, return_distance: bool = False)
For each of the given vectors, retrieve and return the k most similar vectors from the index.
- Parameters:
vectors (np.ndarray[np.float32]) – A 2d matrix of vectors in the shape (num_vectors, num_dimensions).
k (int, default=10) – Specifies the number of similar vectors that are returned for each input vector.
return_distance (bool, default=False) – If True, the distances are returned in addition to the vector indices.
- Returns:
ids (np.ndarray[int]) – A 2d matrix of vectors in the shape (num_vectors, k) which holds k indices per row. The indices refer to the vectors on which the index has been built, i.e.
distances (np.ndarray[np.float32]) – A 2d matrix of shape (num_vectors, k) which holds normalized distances between 0.0 (most similar) and 1.0 (most dissimilar). Distances are only returned if return_distance is True.
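To make the contract of build, remove, and search concrete, here is a minimal sketch of an exhaustive-search index that follows the same method signatures. `BruteForceIndex` is a hypothetical class and not part of small-text, and it deliberately does not subclass VectorIndex so that it runs with NumPy alone. How the library normalizes distances is not stated here, so dividing by the largest returned distance is purely an assumption made for illustration.

```python
import numpy as np


class BruteForceIndex:
    """Hypothetical exhaustive-search index; not part of small-text."""

    def __init__(self):
        self._vectors = None
        self._ids = None

    @property
    def index(self):
        # The "underlying index" is just the stored matrix, or None before build().
        return self._vectors

    def build(self, vectors, ids=None):
        self._vectors = np.asarray(vectors, dtype=np.float32)
        self._ids = np.arange(len(self._vectors)) if ids is None else np.asarray(ids)

    def remove(self, ids):
        # Keep only the rows whose id is not in the removal list.
        keep = ~np.isin(self._ids, ids)
        self._vectors, self._ids = self._vectors[keep], self._ids[keep]

    def search(self, vectors, k=10, return_distance=False):
        queries = np.asarray(vectors, dtype=np.float32)
        # Pairwise Euclidean distances, shape (num_queries, num_indexed).
        dists = np.linalg.norm(queries[:, None, :] - self._vectors[None, :, :], axis=2)
        order = np.argsort(dists, axis=1)[:, :k]
        ids = self._ids[order]
        if not return_distance:
            return ids
        nearest = np.take_along_axis(dists, order, axis=1)
        # Assumption: scale into [0, 1] by the largest distance found.
        scale = nearest.max()
        if scale > 0:
            nearest = nearest / scale
        return ids, nearest.astype(np.float32)


index = BruteForceIndex()
index.build(np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]))
ids, distances = index.search(np.array([[0.1, 0.0]]), k=2, return_distance=True)
index.remove([0])
ids_after_removal = index.search(np.array([[0.1, 0.0]]), k=2)
```

Both return values have one row per query vector and k columns, which matches the documented shape (num_vectors, k). After removing id 0, the same query no longer returns it.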
Implementations
- class small_text.vector_indexes.knn.KNNIndex(algorithm='auto', radius=1.0, leaf_size=30, metric='minkowski', p=2, metric_params=None, n_jobs=None)
A vector index that relies on unsupervised learning of K-Nearest Neighbors.
See also
- Scikit-learn documentation of the underlying implementation.
Added in version 2.0.0.
- __init__(algorithm='auto', radius=1.0, leaf_size=30, metric='minkowski', p=2, metric_params=None, n_jobs=None)
- Parameters:
algorithm ({'auto', 'ball_tree', 'kd_tree', 'brute'}, default='auto') – Algorithm used for the nearest neighbor computation.
radius (float, default=1.0) – Radius of the neighborhood ball. In scikit-learn, this is the default range used for radius-based neighbor queries.
leaf_size (int, default=30) – Leaf size for ‘ball_tree’ and ‘kd_tree’ algorithms.
metric (str or func, default='minkowski') – Metric or metric function for the nearest neighbor distance.
p (float, default=2) – Parameter p for the Minkowski distance. The default 2 is equivalent to the euclidean distance.
metric_params (dict, default=None) – Additional params for the metric function.
n_jobs (int, default=None) – Number of jobs for nearest neighbor search.
See also
- The scikit-learn documentation for more details on the parameters.
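The effect of the parameter p on the default 'minkowski' metric can be shown with a short, dependency-free sketch. The function `minkowski_distance` is written here for illustration only and is not the implementation used by scikit-learn or small-text.

```python
def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError('a and b must have the same length')
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


a, b = [0.0, 0.0], [3.0, 4.0]
euclidean = minkowski_distance(a, b, p=2)  # p=2: Euclidean distance
manhattan = minkowski_distance(a, b, p=1)  # p=1: Manhattan distance
```

With p=2 the result is the straight-line distance between the two points, while p=1 sums the absolute differences per dimension.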
- build(vectors, ids=None)
Constructs an index from the given vectors. Each vector is identified by an id. If no ids are passed, each vector gets assigned an ascending id starting at zero.
- Parameters:
vectors (np.ndarray[np.float32]) – A 2d matrix of vectors in the shape (num_vectors, num_dimensions).
ids (np.ndarray[int] or None, default=None) – An array of ids where each item corresponds to the respective row in the vectors argument.
- remove(ids)
Removes the given vectors (identified by their ids) from the vector index.
- search(vectors, k: int = 10, return_distance: bool = False)
For each of the given vectors, retrieve and return the k most similar vectors from the index.
- Parameters:
vectors (np.ndarray[np.float32]) – A 2d matrix of vectors in the shape (num_vectors, num_dimensions).
k (int, default=10) – Specifies the number of similar vectors that are returned for each input vector.
return_distance (bool, default=False) – If True, the distances are returned in addition to the vector indices.
- Returns:
ids (np.ndarray[int]) – A 2d matrix of vectors in the shape (num_vectors, k) which holds k indices per row. The indices refer to the vectors on which the index has been built, i.e.
distances (np.ndarray[np.float32]) – A 2d matrix of shape (num_vectors, k) which holds normalized distances between 0.0 (most similar) and 1.0 (most dissimilar). Distances are only returned if return_distance is True.
- class small_text.vector_indexes.hnsw.HNSWIndex(space='l2', ef_construction=200, ef=200, m=64, random_seed=100)
A vector index that relies on Hierarchical Navigable Small Worlds (HNSW).
Note
This vector index requires the optional dependency hnswlib.
See also
- GitHub repository of the underlying implementation.
- Details on setting the HNSW parameters.
https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md
Added in version 2.0.0.
- __init__(space='l2', ef_construction=200, ef=200, m=64, random_seed=100)
- Parameters:
space ({'l2', 'ip', 'cosine'}, default='l2') – Type of vector space. The HNSW index uses the respective distance metric for
ef_construction (int, default=200) – Parameter that trades off index accuracy against runtime. Higher values for ef_construction increase the indexing time.
ef (int, default=200) – Parameter that trades off index accuracy against runtime. Higher values for ef increase the search time.
m (int, default=64) – Number of links between vectors during construction. Data with higher intrinsic dimensionality requires higher values of m. Higher values of m increase the memory usage.
random_seed (int, default=100) – Random seed that is used during initialization of the index.
Note
See the hnswlib GitHub repository for details on the parameters.
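The three values of space correspond to different distance functions. The sketch below follows the definitions given in the hnswlib README as far as they are known here ('l2' is the squared Euclidean distance, 'ip' and 'cosine' are one minus the respective similarity); it is worth checking them against the installed hnswlib version. The function names and the `DISTANCES` mapping are illustrative and not part of hnswlib or small-text.

```python
import math


def l2_distance(a, b):
    # 'l2': squared Euclidean distance (no square root is taken).
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ip_distance(a, b):
    # 'ip': one minus the inner product.
    return 1.0 - sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    # 'cosine': one minus the cosine similarity.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return 1.0 - dot / norm


DISTANCES = {'l2': l2_distance, 'ip': ip_distance, 'cosine': cosine_distance}

# Two orthogonal unit vectors.
a, b = [1.0, 0.0], [0.0, 1.0]
results = {space: fn(a, b) for space, fn in DISTANCES.items()}
```

For unit-length vectors, 'ip' and 'cosine' give the same value, which is why normalizing embeddings beforehand makes the two spaces interchangeable.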
- build(vectors, ids=None)
Constructs an index from the given vectors. Each vector is identified by an id. If no ids are passed, each vector gets assigned an ascending id starting at zero.
- Parameters:
vectors (np.ndarray[np.float32]) – A 2d matrix of vectors in the shape (num_vectors, num_dimensions).
ids (np.ndarray[int] or None, default=None) – An array of ids where each item corresponds to the respective row in the vectors argument.
- remove(ids)
Removes the given vectors (identified by their ids) from the vector index.
- search(vectors, k: int = 10, return_distance: bool = False)
For each of the given vectors, retrieve and return the k most similar vectors from the index.
- Parameters:
vectors (np.ndarray[np.float32]) – A 2d matrix of vectors in the shape (num_vectors, num_dimensions).
k (int, default=10) – Specifies the number of similar vectors that are returned for each input vector.
return_distance (bool, default=False) – If True, the distances are returned in addition to the vector indices.
- Returns:
ids (np.ndarray[int]) – A 2d matrix of vectors in the shape (num_vectors, k) which holds k indices per row. The indices refer to the vectors on which the index has been built, i.e.
distances (np.ndarray[np.float32]) – A 2d matrix of shape (num_vectors, k) which holds normalized distances between 0.0 (most similar) and 1.0 (most dissimilar). Distances are only returned if return_distance is True.