Query Strategies

Query strategies select data samples from the set of unlabeled data, i.e. they decide which samples should be labeled next.

Overview

You can use the following pre-implemented query strategies, which are grouped into general strategies and strategies that require the Pytorch integration (see the Classes section below).

Interface

The query strategy interface revolves around the query() method. A query strategy can make use of any of the given positional arguments but is not required to use all of them; a minimal example implementation is shown after the remarks below.

from abc import ABC, abstractmethod

from small_text.query_strategies import EmptyPoolException, PoolExhaustedException


class QueryStrategy(ABC):
    """Abstract base class for Query Strategies."""

    @abstractmethod
    def query(self, clf, dataset, indices_unlabeled, indices_labeled, y, n=10):
        """
        Queries instances from the unlabeled pool.

        A query selects instances from the unlabeled pool.

        Parameters
        ----------
        clf : small_text.classifiers.Classifier
            A text classifier.
        dataset : small_text.data.datasets.Dataset
            A text dataset.
        indices_unlabeled : np.ndarray[int]
            Indices (relative to `dataset`) for the unlabeled data.
        indices_labeled : np.ndarray[int]
            Indices (relative to `dataset`) for the labeled data.
        y : np.ndarray[int] or csr_matrix
            List of labels where each label maps by index position to `indices_labeled`.
        n : int
            Number of samples to query.

        Returns
        -------
        indices : numpy.ndarray
            Indices relative to `dataset` which were selected.
        """
        pass

    @staticmethod
    def _validate_query_input(indices_unlabeled, n):
        if len(indices_unlabeled) == 0:
            raise EmptyPoolException('No unlabeled indices available. Cannot query an empty pool.')

        if n > len(indices_unlabeled):
            raise PoolExhaustedException('Pool exhausted: {} available / {} requested'
                                         .format(len(indices_unlabeled), n))
  • A query strategy can use the classifier clf to make a decision.

  • The full dataset (i.e. all samples regardless of whether they are labeled or not) is given by dataset.

  • The partition into labeled and unlabeled data is handled indirectly via indices (indices_unlabeled and indices_labeled).

    Note

    Taken together, the indices need not be complete, i.e. the union of indices_labeled and indices_unlabeled may leave gaps in the full set of dataset indices. This allows the active learner to ignore samples which should remain part of the dataset but are not suited for active learning.

  • The argument y represents the current labels.

  • The number of samples to query can be controlled with the keyword argument n.
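To make this concrete, the following is a minimal sketch of a custom strategy that samples uniformly from the unlabeled pool (the class name is hypothetical; the pre-implemented RandomSampling strategy below already provides this behaviour):

import numpy as np

from small_text.query_strategies import QueryStrategy


class UniformRandomStrategy(QueryStrategy):
    """A minimal example strategy: samples uniformly from the unlabeled pool."""

    def query(self, clf, dataset, indices_unlabeled, indices_labeled, y, n=10):
        self._validate_query_input(indices_unlabeled, n)
        # clf, indices_labeled and y are ignored here, which the interface allows.
        return np.random.choice(indices_unlabeled, size=n, replace=False)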

Helpers

Some query strategies are formulated so that they are only applicable to either single-label or multi-label data. As a safeguard against using such a strategy on data it does not support, the constraints() decorator intercepts the query() call. If the given labels cannot be handled, a RuntimeError is raised.

Note

For the pre-implemented query strategies, do not take the absence of a constraint as an indicator of capability: we use this decorator sparingly in the main library in order not to restrict users unnecessarily. For your own projects and applications, however, using it is highly recommended.

Constraints

from small_text.query_strategies import constraints, QueryStrategy

# Restricts this strategy to single-label data; calling query() with
# multi-label targets will raise a RuntimeError.
@constraints(classification_type='single-label')
class MyQueryStrategy(QueryStrategy):
    pass

Classes

Base

class small_text.query_strategies.strategies.LeastConfidence[source]

Selects instances with the least prediction confidence (regarding the most likely class) [LG94].

class small_text.query_strategies.strategies.PredictionEntropy[source]

Selects instances with the largest prediction entropy [HOL08].

class small_text.query_strategies.strategies.BreakingTies[source]

Selects instances which have a small margin between their most likely and second most likely predicted class [LUO05].
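These three strategies take no constructor arguments and are queried through the interface described above. A brief usage sketch, assuming clf is a trained small_text classifier and dataset, indices_unlabeled, indices_labeled, and y are set up as described in the interface section:

from small_text.query_strategies import BreakingTies

query_strategy = BreakingTies()
# Selects the 10 instances with the smallest margin between the two most likely classes.
indices_queried = query_strategy.query(clf, dataset, indices_unlabeled,
                                       indices_labeled, y, n=10)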

class small_text.query_strategies.strategies.EmbeddingKMeans(normalize=True)[source]

This is a generalized version of BERT-K-Means [YLB20], which is applicable to any kind of dense embedding, regardless of the classifier.

__init__(normalize=True)
Parameters

normalize (bool, default=True) – Embeddings will be L2 normalized if True, otherwise they remain unchanged.

class small_text.query_strategies.bayesian.BALD(dropout_samples=10)[source]

Selects instances according to the Bayesian Active Learning by Disagreement (BALD) [HHG+11] strategy.

Requires that the predict_proba() method of the given classifier supports dropout sampling [GZ16].

New in version 1.2.0.

__init__(dropout_samples=10)
Parameters

dropout_samples (int) – For every instance in the dataset, dropout_samples-many predictions will be used to obtain uncertainty estimates.

class small_text.query_strategies.coresets.GreedyCoreset(distance_metric='euclidean', normalize=True, batch_size=100)[source]

Selects instances by constructing a greedy coreset [SS17] over document embeddings.

__init__(distance_metric='euclidean', normalize=True, batch_size=100)
Parameters
  • distance_metric ({'cosine', 'euclidean'}) –

    Distance metric to be used.

    New in version 1.2.0.

  • normalize (bool) – Embeddings will be normalized before the coreset construction if True.

  • batch_size (int) – Batch size used for computing document distances.

Note

Prior to v1.2.0, the default distance metric was cosine distance.

See also

Function greedy_coreset()

Docstring of the underlying greedy_coreset() function.

class small_text.query_strategies.coresets.LightweightCoreset(normalize=True)[source]

Selects instances by constructing a lightweight coreset [BLK18] over document embeddings.

__init__(normalize=True)
Parameters

normalize (bool) – Embeddings will be normalized before the coreset construction if True.

class small_text.query_strategies.strategies.ContrastiveActiveLearning(k=10, embed_kwargs=dict(), normalize=True, batch_size=100, pbar='tqdm')[source]

Contrastive Active Learning [MVB+21] selects instances whose k-nearest neighbours exhibit the largest mean Kullback-Leibler divergence.

__init__(k=10, embed_kwargs=dict(), normalize=True, batch_size=100, pbar='tqdm')
Parameters
  • k (int) – Number of nearest neighbours whose KL divergence is considered.

  • embed_kwargs (dict) – Embedding keyword args which are passed to clf.embed().

  • normalize (bool, default=True) – Embeddings will be L2 normalized if True, otherwise they remain unchanged.

  • batch_size (int, default=100) – Batch size which is used to process the embeddings.

  • pbar ('tqdm' or None, default='tqdm') – Displays a progress bar if ‘tqdm’ is passed.

class small_text.query_strategies.strategies.DiscriminativeActiveLearning(classifier_factory, num_iterations, unlabeled_factor=10, pbar='tqdm')[source]

Discriminative Active Learning [GS19] learns to differentiate between the labeled and unlabeled pool and selects the instances that are most likely to belong to the unlabeled pool.

__init__(classifier_factory, num_iterations, unlabeled_factor=10, pbar='tqdm')
Parameters
  • classifier_factory – Classifier factory which is used for the discriminative classifiers.

  • num_iterations (int) – Number of iterations for the discriminative training.

  • unlabeled_factor (int, default=10) – The ratio of “unlabeled pool” instances to “labeled pool” instances in the discriminative training.

  • pbar ('tqdm' or None, default='tqdm') – Displays a progress bar if ‘tqdm’ is passed.
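Because this strategy trains its own discriminative classifiers, it requires a classifier factory. A usage sketch, assuming small_text's SklearnClassifierFactory together with scikit-learn's LogisticRegression as the underlying binary classifier:

from sklearn.linear_model import LogisticRegression

from small_text.classifiers.factories import SklearnClassifierFactory
from small_text.query_strategies import DiscriminativeActiveLearning

# The factory builds the binary "labeled pool vs. unlabeled pool" classifiers
# that are retrained in each of the num_iterations rounds.
classifier_factory = SklearnClassifierFactory(LogisticRegression(), 2)
query_strategy = DiscriminativeActiveLearning(classifier_factory, num_iterations=10)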

class small_text.query_strategies.multi_label.CategoryVectorInconsistencyAndRanking(batch_size=2048, prediction_threshold=0.5, epsilon=1e-8, pbar='tqdm')[source]

Uncertainty Sampling based on Category Vector Inconsistency and Ranking of Scores [RCV18] selects instances based on the inconsistency of predicted labels and per-class label rankings.

__init__(batch_size=2048, prediction_threshold=0.5, epsilon=1e-8, pbar='tqdm')
Parameters
  • batch_size (int) – Batch size in which the computations are performed. Increasing the size increases the amount of memory used.

  • prediction_threshold (float) – Confidence value above which a prediction counts as positive.

  • epsilon (float) – A small value that is added to the argument of the logarithm to avoid taking the logarithm of zero.

  • pbar ('tqdm' or None, default='tqdm') – Displays a progress bar if ‘tqdm’ is passed.

class small_text.query_strategies.strategies.SEALS(base_query_strategy, k=100, hnsw_kwargs=dict(), embed_kwargs=dict(), normalize=True)[source]

Similarity Search for Efficient Active Learning and Search of Rare Concepts (SEALS) improves the computational efficiency of active learning by presenting a reduced subset of the unlabeled pool to a base strategy [CCK+22].

This method is to be applied in conjunction with a base query strategy. SEALS selects a subset of the unlabeled pool by selecting the k nearest neighbours of the current labeled pool.

If the size of the unlabeled pool falls below the given k, this implementation will not select a subset anymore and will just delegate to the base strategy instead.

Note

This strategy requires the optional dependency hnswlib.

__init__(base_query_strategy, k=100, hnsw_kwargs=dict(), embed_kwargs=dict(), normalize=True)
Parameters
  • base_query_strategy (small_text.query_strategies.QueryStrategy) – A base query strategy which operates on the subset that is selected by SEALS.

  • k (int, default=100) – Number of nearest neighbors that will be selected.

  • hnsw_kwargs (dict, default=dict()) – Keyword arguments which will be passed to the underlying hnsw index. See the hnswlib GitHub repository for details on the parameters space, ef_construction, ef, and M.

  • embed_kwargs (dict, default=dict()) – Keyword arguments that will be passed to the embed() method.

  • normalize (bool, default=True) – Embeddings will be L2 normalized if True, otherwise they remain unchanged.
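Since SEALS only restricts the candidate pool, it is always combined with a base strategy at construction time. A brief sketch (requires the optional hnswlib dependency, see the note above):

from small_text.query_strategies import LeastConfidence, SEALS

# Restrict the candidate pool to the 100 nearest neighbours of the labeled
# instances, then let LeastConfidence choose among them.
query_strategy = SEALS(LeastConfidence(), k=100)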

class small_text.query_strategies.subsampling.AnchorSubsampling(base_query_strategy, subsample_size=500, num_anchors=10, k=50, hnsw_kwargs={}, embed_kwargs={}, normalize=True, batch_size=32)[source]

This subsampling strategy is an implementation of AnchorAL [LV24].

AnchorAL performs subsampling with class-specific anchors, which aims to draw a class-balanced subset and to prevent overfitting on the current decision boundary [LV24].

This method is very extensible regarding the choice of base query strategy and anchor selection, but for now the implementation covers the choices described in the original paper.

Note

This strategy requires the optional dependency hnswlib.

New in version 1.4.0.

__init__(base_query_strategy, subsample_size=500, num_anchors=10, k=50, hnsw_kwargs={}, embed_kwargs={}, normalize=True, batch_size=32)
Parameters
  • base_query_strategy (small_text.query_strategies.QueryStrategy) – A base query strategy which operates on the subset that is selected by AnchorAL.

  • subsample_size (int, default=500) – The number of subsamples to be drawn.

  • num_anchors (int, default=10) – Number of anchors.

  • k (int, default=50) – Number of nearest neighbors that will be selected.

  • hnsw_kwargs (dict, default=dict()) – Keyword arguments that will be passed to the underlying hnsw index. See the hnswlib GitHub repository for details on the parameters space, ef_construction, ef, and M.

  • embed_kwargs (dict, default=dict()) – Keyword arguments that will be passed to the embed() method.

  • normalize (bool, default=True) – Embeddings will be L2 normalized if True, otherwise they remain unchanged.

  • batch_size (int, default=32) – Batch size.

class small_text.query_strategies.strategies.RandomSampling[source]

Randomly selects instances.


Pytorch Integration

class small_text.integrations.pytorch.query_strategies.strategies.ExpectedGradientLength(num_classes, batch_size=50, device='cuda', pbar='tqdm')[source]

Selects instances by expected gradient length [Set07].

__init__(num_classes, batch_size=50, device='cuda', pbar='tqdm')
Parameters
  • num_classes (int) – Number of classes.

  • batch_size (int, default=50) – Batch size in which the query strategy scores the instances.

  • device (str or torch.device, default='cuda') – Torch device on which the computation will be performed.

  • pbar ('tqdm' or None, default='tqdm') – Displays a progress bar if ‘tqdm’ is passed.

class small_text.integrations.pytorch.query_strategies.strategies.ExpectedGradientLengthMaxWord(num_classes, layer_name, batch_size=50, device='cuda')[source]

Selects instances using the EGL-word strategy [ZLW17].

The EGL-word strategy works as follows:

  1. For every instance and class the gradient norm is computed per word. The score for each (instance, class) pair is the norm of the word with the highest gradient norm value.

  2. These scores are then summed up over all classes. The result is one score per instance.

Finally, the instances are selected by maximum score.

Notes

  • An embedding layer is required for this strategy.

  • This strategy was designed for the KimCNN model and might not work for other models even if they possess an embedding layer.

__init__(num_classes, layer_name, batch_size=50, device='cuda')
Parameters
  • num_classes (int) – Number of classes.

  • layer_name (str) – Name of the embedding layer.

  • batch_size (int) – Batch size.

  • device (str or torch.device) – Torch device on which the computation will be performed.

class small_text.integrations.pytorch.query_strategies.strategies.ExpectedGradientLengthLayer(num_classes, layer_name, batch_size=50)[source]

An EGL variant that is restricted to the gradients of a single layer.

This is a generalized version of the EGL-sm strategy [ZLW17], but instead of being restricted to the last layer it operates on the layer name passed to the constructor.

__init__(num_classes, layer_name, batch_size=50)
Parameters
  • num_classes (int) – Number of classes.

  • layer_name (str) – Name of the target layer.

  • batch_size (int, default=50) – Batch size in which the query strategy scores the instances.

class small_text.integrations.pytorch.query_strategies.strategies.BADGE(num_classes)[source]

Implements “Batch Active learning by Diverse Gradient Embeddings” (BADGE) [AZK+20].

__init__(num_classes)
Parameters

num_classes (int) – Number of classes.

Functions

small_text.query_strategies.coresets.greedy_coreset(x, indices_unlabeled, indices_labeled, n, distance_metric='cosine', batch_size=100, normalized=False)[source]

Computes a greedy coreset [SS17] over x with size n.

Parameters
  • x (np.ndarray) – A matrix of row-wise vector representations.

  • indices_unlabeled (np.ndarray) – Indices (relative to dataset) for the unlabeled data.

  • indices_labeled (np.ndarray) – Indices (relative to dataset) for the labeled data.

  • n (int) – Size of the coreset (in number of instances).

  • distance_metric ({'cosine', 'euclidean'}) – Distance metric to be used.

  • batch_size (int) – Batch size.

  • normalized (bool) – If True the data x is assumed to be normalized, otherwise it will be normalized where necessary.

Returns

indices – Indices relative to x.

Return type

numpy.ndarray
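A small usage sketch on toy embeddings, following the signature documented above:

import numpy as np

from small_text.query_strategies.coresets import greedy_coreset

# Toy example: 100 unlabeled and 10 labeled embeddings of dimension 32.
x = np.random.rand(110, 32)
indices_unlabeled = np.arange(100)
indices_labeled = np.arange(100, 110)

# Select a coreset of size 10; the returned indices are relative to x.
indices = greedy_coreset(x, indices_unlabeled, indices_labeled, 10,
                         distance_metric='euclidean')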

References

SS17

Ozan Sener and Silvio Savarese. 2017. Active Learning for Convolutional Neural Networks: A Core-Set Approach. In International Conference on Learning Representations 2018 (ICLR 2018).

small_text.query_strategies.coresets.lightweight_coreset(x, x_mean, n, normalized=False, proba=None)[source]

Computes a lightweight coreset [BLK18] of x with size n.

Parameters
  • x (np.ndarray) – 2D array in which each row represents a sample.

  • x_mean (np.ndarray) – Elementwise mean over the columns of x.

  • n (int) – Coreset size.

  • normalized (bool) – If True the data x is assumed to be normalized, otherwise it will be normalized where necessary.

  • proba (np.ndarray or None) – A probability distribution over x, which makes up half of the probability mass of the sampling distribution. If proba is None, a uniform distribution is used.

Returns

indices – Indices relative to x.

Return type

numpy.ndarray
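Analogously, a small sketch using the documented signature; note that the elementwise mean x_mean has to be precomputed by the caller:

import numpy as np

from small_text.query_strategies.coresets import lightweight_coreset

x = np.random.rand(1000, 32)   # toy embeddings, one row per sample
x_mean = np.mean(x, axis=0)    # elementwise mean over the columns of x

# Sample a lightweight coreset of 25 instances; indices are relative to x.
indices = lightweight_coreset(x, x_mean, 25)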