VectorIndex API

Given a query vector, a vector index retrieves the most similar vectors it contains. This API provides a unified interface to several existing vector index implementations.


Base

class small_text.vector_indexes.base.VectorIndex

Abstract base class for a structure that indexes vectors and supports searching them.

Added in version 2.0.0.

abstract property index: INDEX_TYPE | None

Returns the underlying index implementation.

Returns:

index – The underlying index if an index has been built, otherwise None.

Return type:

object or None

abstract build(vectors, ids=None)

Constructs an index from the given vectors. Each vector is identified by an id. If no ids are passed, each vector gets assigned an ascending id starting at zero.

Parameters:
  • vectors (np.ndarray[np.float32]) – A 2d matrix of vectors in the shape (num_vectors, num_dimensions).

  • ids (np.ndarray[int] or None, default=None) – An array of ids where each item corresponds to the respective row in the vectors argument.
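The default id assignment described above can be illustrated with a small standalone sketch. The helper `resolve_ids` is hypothetical and not part of small-text; it only mirrors the documented behavior that missing ids default to ascending integers starting at zero.

```python
import numpy as np


def resolve_ids(vectors, ids=None):
    """Hypothetical helper: return one id per row of `vectors`."""
    if vectors.ndim != 2:
        raise ValueError("vectors must have shape (num_vectors, num_dimensions)")
    if ids is None:
        # No ids passed: assign ascending ids starting at zero.
        return np.arange(vectors.shape[0])
    ids = np.asarray(ids)
    if ids.shape[0] != vectors.shape[0]:
        raise ValueError("ids must contain exactly one entry per vector")
    return ids


vectors = np.random.default_rng(0).random((4, 3)).astype(np.float32)
default_ids = resolve_ids(vectors)
custom_ids = resolve_ids(vectors, ids=np.array([10, 20, 30, 40]))
```

Here `default_ids` holds the row positions, while `custom_ids` keeps whatever identifiers the caller supplied, one per row.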

abstract remove(ids)

Removes the given vectors (identified by their ids) from the vector index.

abstract search(vectors, k: int = 10, return_distance: bool = False)

For each of the given vectors, retrieve and return the k most similar vectors from the index.

Parameters:
  • vectors (np.ndarray[np.float32]) – A 2d matrix of vectors in the shape (num_vectors, num_dimensions).

  • k (int, default=10) – Specifies the number of similar vectors that are returned for each input vector.

  • return_distance (bool, default=False) – Whether the distances are returned in addition to the vector indices.

Returns:

  • ids (np.ndarray[int]) – A 2d matrix of vectors in the shape (num_vectors, k) which holds k indices per row. The indices refer to the vectors on which the index has been built, i.e.

  • distances (np.ndarray[np.float32]) – A 2d matrix of vectors in the shape (num_vectors, k) which holds normalized distances between 0.0 (most similar) and 1.0 (most dissimilar). Distances are only returned if return_distance is True.
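To make the build/remove/search contract concrete, here is a minimal brute-force sketch that follows the documented shapes. `BruteForceIndex` is a hypothetical stand-in that does not subclass the real VectorIndex, and the normalization (cosine distance halved so that it falls into [0.0, 1.0]) is an assumption made for illustration; the source does not state how the library normalizes distances.

```python
import numpy as np


class BruteForceIndex:
    """Illustrative stand-in following the documented build/remove/search contract."""

    def __init__(self):
        self._vectors = None
        self._ids = None

    def build(self, vectors, ids=None):
        self._vectors = np.asarray(vectors, dtype=np.float32)
        # Missing ids default to ascending ids starting at zero.
        self._ids = np.arange(len(self._vectors)) if ids is None else np.asarray(ids)

    def remove(self, ids):
        # Keep only the rows whose id is not in the removal list.
        keep = ~np.isin(self._ids, ids)
        self._vectors, self._ids = self._vectors[keep], self._ids[keep]

    def search(self, vectors, k=10, return_distance=False):
        queries = np.asarray(vectors, dtype=np.float32)
        q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
        v = self._vectors / np.linalg.norm(self._vectors, axis=1, keepdims=True)
        # Assumption: cosine distance lies in [0, 2]; halve it to obtain [0, 1].
        distances = np.clip((1.0 - q @ v.T) / 2.0, 0.0, 1.0)
        # Column positions of the k smallest distances per query row.
        order = np.argsort(distances, axis=1)[:, :k]
        ids = self._ids[order]
        if return_distance:
            return ids, np.take_along_axis(distances, order, axis=1)
        return ids


index = BruteForceIndex()
index.build(np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]], dtype=np.float32))
query = np.array([[1.0, 0.0]], dtype=np.float32)
ids, distances = index.search(query, k=2, return_distance=True)
index.remove(np.array([0]))
ids_after_removal = index.search(query, k=2)
```

Both returned arrays have the shape (num_vectors, k). After the vector with id 0 is removed, the same query returns the remaining ids only.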

class small_text.vector_indexes.base.VectorIndexFactory(vector_index_class, index_kwargs={})
__init__(vector_index_class, index_kwargs={})

Implementations

class small_text.vector_indexes.knn.KNNIndex(algorithm='auto', radius=1.0, leaf_size=30, metric='minkowski', p=2, metric_params=None, n_jobs=None)

A vector index that relies on unsupervised learning of K-Nearest Neighbors.

Added in version 2.0.0.

__init__(algorithm='auto', radius=1.0, leaf_size=30, metric='minkowski', p=2, metric_params=None, n_jobs=None)
Parameters:
  • algorithm ({'auto', 'ball_tree', 'kd_tree', 'brute'}, default='auto') – Algorithm used for the nearest neighbor computation.

  • radius (float, default=1.0) – Range of the parameter space that is used by default for radius-based neighbor queries.

  • leaf_size (int, default=30) – Leaf size for ‘ball_tree’ and ‘kd_tree’ algorithms.

  • metric (str or func, default='minkowski') – Metric or metric function for the nearest neighbor distance.

  • p (float, default=2) – Parameter p for the Minkowski distance. The default value of 2 is equivalent to the Euclidean distance.

  • metric_params (dict, default=None) – Additional params for the metric function.

  • n_jobs (int, default=None) – Number of jobs for nearest neighbor search.

See also

See the scikit-learn documentation for more details on the parameters.

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html#sklearn.neighbors.NearestNeighbors
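The relationship between the p parameter and the Euclidean distance can be checked with a short worked example. The function `minkowski_distance` is a hypothetical helper written for this illustration and uses only the Python standard library; it is not how KNNIndex or scikit-learn compute distances internally.

```python
import math


def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two equal-length sequences."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


a, b = [0.0, 0.0], [3.0, 4.0]
euclidean = math.dist(a, b)                   # standard-library Euclidean distance
minkowski_p2 = minkowski_distance(a, b, p=2)  # same value as the Euclidean distance
manhattan = minkowski_distance(a, b, p=1)     # p=1 gives the Manhattan distance
```

For these two points, p=2 reproduces the Euclidean distance of the 3-4-5 triangle, while p=1 sums the absolute coordinate differences.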

build(vectors, ids=None)

Constructs an index from the given vectors. Each vector is identified by an id. If no ids are passed, each vector gets assigned an ascending id starting at zero.

Parameters:
  • vectors (np.ndarray[np.float32]) – A 2d matrix of vectors in the shape (num_vectors, num_dimensions).

  • ids (np.ndarray[int] or None, default=None) – An array of ids where each item corresponds to the respective row in the vectors argument.

remove(ids)

Removes the given vectors (identified by their ids) from the vector index.

search(vectors, k: int = 10, return_distance: bool = False)

For each of the given vectors, retrieve and return the k most similar vectors from the index.

Parameters:
  • vectors (np.ndarray[np.float32]) – A 2d matrix of vectors in the shape (num_vectors, num_dimensions).

  • k (int, default=10) – Specifies the number of similar vectors that are returned for each input vector.

  • return_distance (bool, default=False) – Whether the distances are returned in addition to the vector indices.

Returns:

  • ids (np.ndarray[int]) – A 2d matrix of vectors in the shape (num_vectors, k) which holds k indices per row. The indices refer to the vectors on which the index has been built, i.e.

  • distances (np.ndarray[np.float32]) – A 2d matrix of vectors in the shape (num_vectors, k) which holds normalized distances between 0.0 (most similar) and 1.0 (most dissimilar). Distances are only returned if return_distance is True.

class small_text.vector_indexes.hnsw.HNSWIndex(space='l2', ef_construction=200, ef=200, m=64, random_seed=100)

A vector index that relies on Hierarchical Navigable Small Worlds (HNSW).

Note

This vector index requires the optional dependency hnswlib.

See also

GitHub repository of the underlying implementation.

https://github.com/nmslib/hnswlib

Details on setting the HNSW parameters.

https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md

Added in version 2.0.0.

__init__(space='l2', ef_construction=200, ef=200, m=64, random_seed=100)
Parameters:
  • space ({'l2', 'ip', 'cosine'}, default='l2') – Type of vector space. The HNSW index uses the respective distance metric for

  • ef_construction (int, default=200) – Parameter that trades off index accuracy against runtime. Higher values for ef_construction increase the indexing time.

  • ef (int, default=200) – Parameter that trades off search accuracy against runtime. Higher values for ef increase the search time.

  • m (int, default=64) – Number of links between vectors during construction. Data with higher intrinsic dimensionality requires higher values of m. Higher values of m increase the memory usage.

  • random_seed (int, default=100) – Random seed that is used during initialization of the index.

Note

See the hnswlib GitHub repository for details on the parameters.
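The three values of the space parameter correspond to different distance functions. The sketch below follows the definitions given in the hnswlib README as I understand them (squared L2, one minus the inner product, one minus the cosine similarity). `hnsw_space_distance` is a hypothetical pure-Python helper and does not call hnswlib; any further normalization that HNSWIndex may apply to these values is not covered here.

```python
import math


def hnsw_space_distance(a, b, space="l2"):
    """Distance between two vectors for a given hnswlib-style space name."""
    dot = sum(x * y for x, y in zip(a, b))
    if space == "l2":
        # Squared Euclidean distance (no square root is taken).
        return sum((x - y) ** 2 for x, y in zip(a, b))
    if space == "ip":
        # Inner product distance.
        return 1.0 - dot
    if space == "cosine":
        norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
        return 1.0 - dot / norm
    raise ValueError(f"unknown space: {space!r}")


a, b = [1.0, 0.0], [0.0, 2.0]
distances_by_space = {s: hnsw_space_distance(a, b, s) for s in ("l2", "ip", "cosine")}
```

Because the two example vectors are orthogonal, the inner product and cosine variants coincide here, whereas the squared L2 distance also reflects the differing vector lengths.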

build(vectors, ids=None)

Constructs an index from the given vectors. Each vector is identified by an id. If no ids are passed, each vector gets assigned an ascending id starting at zero.

Parameters:
  • vectors (np.ndarray[np.float32]) – A 2d matrix of vectors in the shape (num_vectors, num_dimensions).

  • ids (np.ndarray[int] or None, default=None) – An array of ids where each item corresponds to the respective row in the vectors argument.

remove(ids)

Removes the given vectors (identified by their ids) from the vector index.

search(vectors, k: int = 10, return_distance: bool = False)

For each of the given vectors, retrieve and return the k most similar vectors from the index.

Parameters:
  • vectors (np.ndarray[np.float32]) – A 2d matrix of vectors in the shape (num_vectors, num_dimensions).

  • k (int, default=10) – Specifies the number of similar vectors that are returned for each input vector.

  • return_distance (bool, default=False) – Whether the distances are returned in addition to the vector indices.

Returns:

  • ids (np.ndarray[int]) – A 2d matrix of vectors in the shape (num_vectors, k) which holds k indices per row. The indices refer to the vectors on which the index has been built, i.e.

  • distances (np.ndarray[np.float32]) – A 2d matrix of vectors in the shape (num_vectors, k) which holds normalized distances between 0.0 (most similar) and 1.0 (most dissimilar). Distances are only returned if return_distance is True.
