Query Strategies
Query strategies select data samples from the set of unlabeled data, i.e. they decide which samples should be labeled next.
Overview
You can use the pre-implemented query strategies listed below, grouped into general strategies and strategies that require the Pytorch integration (see Classes and Pytorch Integration below).
Interface
The query strategy interface revolves around the query() method.
A query strategy can make use of any of the given positional arguments but does not need to.
from abc import ABC, abstractmethod

from small_text.query_strategies.exceptions import EmptyPoolException, PoolExhaustedException


class QueryStrategy(ABC):
    """Abstract base class for Query Strategies."""

    @abstractmethod
    def query(self, clf, dataset, indices_unlabeled, indices_labeled, y, n=10):
        """Queries instances from the unlabeled pool.

        Parameters
        ----------
        clf : small_text.classifiers.Classifier
            A text classifier.
        dataset : small_text.data.datasets.Dataset
            A text dataset.
        indices_unlabeled : np.ndarray[int]
            Indices (relative to `dataset`) for the unlabeled data.
        indices_labeled : np.ndarray[int]
            Indices (relative to `dataset`) for the labeled data.
        y : np.ndarray[int] or csr_matrix
            List of labels where each label maps by index position to `indices_labeled`.
        n : int
            Number of samples to query.

        Returns
        -------
        indices : numpy.ndarray
            Indices relative to `dataset` which were selected.
        """
        pass

    @staticmethod
    def _validate_query_input(indices_unlabeled, n):
        if len(indices_unlabeled) == 0:
            raise EmptyPoolException('No unlabeled indices available. Cannot query an empty pool.')
        if n > len(indices_unlabeled):
            raise PoolExhaustedException('Pool exhausted: {} available / {} requested'
                                         .format(len(indices_unlabeled), n))
A query strategy can use the classifier clf to make its decision. The full dataset (i.e. all samples, regardless of whether they are labeled or not) is given by dataset. The partition into labeled and unlabeled data is handled indirectly via indices (indices_unlabeled and indices_labeled).
Note
The indices taken together do not need to be complete, i.e. the full set of indices {1, 2, …, len(dataset)} may contain gaps. This allows the active learner to ignore samples which should remain part of the dataset but are not suited for active learning.
The argument y represents the current labels. The number of samples to query can be controlled with the keyword argument n.
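To illustrate the interface, here is a minimal sketch (not part of the library) of a strategy that queries uniformly at random from the unlabeled pool, ignoring the classifier and the labels:

import numpy as np

from small_text.query_strategies import QueryStrategy


class MyRandomSampling(QueryStrategy):
    """Illustrative sketch: queries a random subset of the unlabeled pool."""

    def query(self, clf, dataset, indices_unlabeled, indices_labeled, y, n=10):
        # Guard against an empty or exhausted pool (see _validate_query_input above).
        self._validate_query_input(indices_unlabeled, n)
        # Neither clf, dataset, nor y is needed here; a strategy may ignore
        # any of the given arguments.
        return np.random.choice(indices_unlabeled, size=n, replace=False)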
Helpers
Some query strategies are formulated such that they are only applicable to either single-label or multi-label data. As a safeguard against using such strategies on data which is not supported, the constraints() decorator intercepts the query() call. If the given labels cannot be handled, a RuntimeError is raised.
Note
For the pre-implemented query strategies, do not take the absence of a constraint as an indicator of capability: we use this decorator sparingly in the main library in order not to restrict the user unnecessarily. For your own projects and applications, however, using it is highly recommended.
Constraints
from small_text.query_strategies import constraints, QueryStrategy

@constraints(classification_type='single-label')
class MyQueryStrategy(QueryStrategy):

    def query(self, clf, dataset, indices_unlabeled, indices_labeled, y, n=10):
        pass
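With this constraint in place, a query() call that receives multi-label targets (labels given as a scipy.sparse csr_matrix) is intercepted and fails with a RuntimeError before the strategy's own logic runs. A minimal sketch, assuming the decorator inspects only the type of y (clf and dataset are placeholders here, since the strategy above ignores them):

import numpy as np
from scipy.sparse import csr_matrix

strategy = MyQueryStrategy()

# Multi-label targets are represented as a csr_matrix (one row per labeled
# instance); this violates the 'single-label' constraint declared above.
y_multi = csr_matrix(np.array([[1, 0], [0, 1], [1, 1]]))

try:
    strategy.query(None, None, np.arange(3, 10), np.arange(3), y_multi, n=2)
except RuntimeError as e:
    print(e)  # constraint violation reported by the decorator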
Classes
Base
- class small_text.query_strategies.strategies.LeastConfidence[source]
Selects instances with the least prediction confidence (regarding the most likely class) [LG94].
- class small_text.query_strategies.strategies.PredictionEntropy[source]
Selects instances with the largest prediction entropy [HOL08].
- class small_text.query_strategies.strategies.BreakingTies[source]
Selects instances which have a small margin between their most likely and second most likely predicted class [LUO05].
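For intuition, the margin score underlying breaking ties can be sketched in a few lines of numpy (illustrative only; the predicted probabilities are assumed to be given):

import numpy as np

# Hypothetical predicted class probabilities, one row per unlabeled instance.
proba = np.array([[0.48, 0.47, 0.05],
                  [0.90, 0.05, 0.05],
                  [0.40, 0.35, 0.25]])

proba_sorted = np.sort(proba, axis=1)
# Margin between the most likely and second most likely class.
margins = proba_sorted[:, -1] - proba_sorted[:, -2]
# Instances with the smallest margin are queried first.
print(np.argsort(margins))  # [0 2 1]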
- class small_text.query_strategies.strategies.EmbeddingKMeans(normalize=True)[source]
This is a generalized version of BERT-K-Means [YLB20], which is applicable to any kind of dense embedding, regardless of the classifier.
- __init__(normalize=True)
- Parameters
normalize (bool, default=True) – Embeddings will be L2 normalized if True, otherwise they remain unchanged.
- class small_text.query_strategies.bayesian.BALD(dropout_samples=10)[source]
Selects instances according to the Bayesian Active Learning by Disagreement (BALD) [HHG+11] strategy.
Requires that the predict_proba() method of the given classifier supports dropout sampling [GZ16].
New in version 1.2.0.
- __init__(dropout_samples=10)
- Parameters
dropout_samples (int) – For every instance in the dataset, dropout_samples-many predictions will be used to obtain uncertainty estimates.
- class small_text.query_strategies.coresets.GreedyCoreset(distance_metric='euclidean', normalize=True, batch_size=100)[source]
Selects instances by constructing a greedy coreset [SS17] over document embeddings.
- __init__(distance_metric='euclidean', normalize=True, batch_size=100)
- Parameters
distance_metric ({'cosine', 'euclidean'}) –
Distance metric to be used.
New in version 1.2.0.
normalize (bool) – Embeddings will be normalized before the coreset construction if True.
batch_size (int) – Batch size used for computing document distances.
Note
Prior to v1.2.0, the default distance metric was cosine distance.
See also
The docstring of the underlying greedy_coreset() function.
- class small_text.query_strategies.coresets.LightweightCoreset(normalize=True)[source]
Selects instances by constructing a lightweight coreset [BLK18] over document embeddings.
- __init__(normalize=True)
- Parameters
normalize (bool) – Embeddings will be normalized before the coreset construction if True.
- class small_text.query_strategies.strategies.ContrastiveActiveLearning(k=10, embed_kwargs=dict(), normalize=True, batch_size=100, pbar='tqdm')[source]
Contrastive Active Learning [MVB+21] selects instances whose k-nearest neighbours exhibit the largest mean Kullback-Leibler divergence.
- __init__(k=10, embed_kwargs=dict(), normalize=True, batch_size=100, pbar='tqdm')
- Parameters
k (int) – Number of nearest neighbours whose KL divergence is considered.
embed_kwargs (dict) – Embedding keyword args which are passed to clf.embed().
normalize (bool, default=True) – Embeddings will be L2 normalized if True, otherwise they remain unchanged.
batch_size (int, default=100) – Batch size which is used to process the embeddings.
pbar ('tqdm' or None, default='tqdm') – Displays a progress bar if ‘tqdm’ is passed.
- class small_text.query_strategies.strategies.DiscriminativeActiveLearning(classifier_factory, num_iterations, unlabeled_factor=10, pbar='tqdm')[source]
Discriminative Active Learning [GS19] learns to differentiate between the labeled and unlabeled pool and selects the instances that are most likely to belong to the unlabeled pool.
- __init__(classifier_factory, num_iterations, unlabeled_factor=10, pbar='tqdm')
- Parameters
classifier_factory (small_text classifier factory) – Classifier factory which is used for the discriminative classifiers.
num_iterations (int) – Number of iterations for the discriminative training.
unlabeled_factor (int, default=10) – The ratio of “unlabeled pool” instances to “labeled pool” instances in the discriminative training.
pbar ('tqdm' or None, default='tqdm') – Displays a progress bar if ‘tqdm’ is passed.
- class small_text.query_strategies.multi_label.CategoryVectorInconsistencyAndRanking(batch_size=2048, prediction_threshold=0.5, epsilon=1e-8, pbar='tqdm')[source]
Uncertainty Sampling based on Category Vector Inconsistency and Ranking of Scores [RCV18] selects instances based on the inconsistency of predicted labels and per-class label rankings.
- __init__(batch_size=2048, prediction_threshold=0.5, epsilon=1e-8, pbar='tqdm')
- Parameters
batch_size (int) – Batch size in which the computations are performed. Increasing the size increases the amount of memory used.
prediction_threshold (float) – Confidence value above which a prediction counts as positive.
epsilon (float) – A small value that is added to the argument of the logarithm to avoid taking the logarithm of zero.
pbar ('tqdm' or None, default='tqdm') – Displays a progress bar if ‘tqdm’ is passed.
- class small_text.query_strategies.strategies.SEALS(base_query_strategy, k=100, hnsw_kwargs=dict(), embed_kwargs=dict(), normalize=True)[source]
Similarity Search for Efficient Active Learning and Search of Rare Concepts (SEALS) improves the computational efficiency of active learning by presenting a reduced subset of the unlabeled pool to a base strategy [CCK+22].
This method is to be applied in conjunction with a base query strategy: SEALS restricts the unlabeled pool to the k nearest neighbours of the current labeled pool.
If the size of the unlabeled pool falls below the given k, this implementation no longer selects a subset and simply delegates to the base strategy instead.
Note
This strategy requires the optional dependency hnswlib.
- __init__(base_query_strategy, k=100, hnsw_kwargs=dict(), embed_kwargs=dict(), normalize=True)
- Parameters
base_query_strategy (small_text.query_strategies.QueryStrategy) – A base query strategy which operates on the subset that is selected by SEALS.
k (int, default=100) – Number of nearest neighbors that will be selected.
hnsw_kwargs (dict, default=dict()) – Keyword arguments which will be passed to the underlying hnsw index. Check the hnswlib GitHub repository for details on the parameters space, ef_construction, ef, and M.
embed_kwargs (dict, default=dict()) – Keyword arguments that will be passed to the embed() method.
normalize (bool, default=True) – Embeddings will be L2 normalized if True, otherwise they remain unchanged.
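A usage sketch: SEALS wraps a base strategy, so composing it with, for example, BreakingTies could look as follows (assuming both classes are importable from small_text.query_strategies and the optional dependency hnswlib is installed):

from small_text.query_strategies import BreakingTies, SEALS

# The base strategy only ever sees the k nearest neighbours of the labeled pool.
query_strategy = SEALS(BreakingTies(), k=100)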
- class small_text.query_strategies.subsampling.AnchorSubsampling(base_query_strategy, subsample_size=500, num_anchors=10, k=50, hnsw_kwargs={}, embed_kwargs={}, normalize=True, batch_size=32)[source]
This subsampling strategy is an implementation of AnchorAL [LV24].
AnchorAL performs subsampling with class-specific anchors, which aims to draw a class-balanced subset and to prevent overfitting on the current decision boundary [LV24].
The method is extensible with regard to the choice of base query strategy and anchor selection; for now, the implementation covers the choices described in the original paper.
Note
This strategy requires the optional dependency hnswlib.
New in version 1.4.0.
- __init__(base_query_strategy, subsample_size=500, num_anchors=10, k=50, hnsw_kwargs={}, embed_kwargs={}, normalize=True, batch_size=32)
- Parameters
base_query_strategy (small_text.query_strategies.QueryStrategy) – A base query strategy which operates on the subset that is selected by this strategy.
subsample_size (int, default=500) – The number of subsamples to be drawn.
num_anchors (int, default=10) – Number of anchors to be selected.
k (int, default=50) – Number of nearest neighbors that will be selected.
hnsw_kwargs (dict, default=dict()) – Keyword arguments that will be passed to the underlying hnsw index. Check the hnswlib GitHub repository for details on the parameters space, ef_construction, ef, and M.
embed_kwargs (dict, default=dict()) – Keyword arguments that will be passed to the embed() method.
normalize (bool, default=True) – Embeddings will be L2 normalized if True, otherwise they remain unchanged.
batch_size (int, default=32) – Batch size which is used to process the embeddings.
Pytorch Integration
- class small_text.integrations.pytorch.query_strategies.strategies.ExpectedGradientLength(num_classes, batch_size=50, device='cuda', pbar='tqdm')[source]
Selects instances by expected gradient length [Set07].
- __init__(num_classes, batch_size=50, device='cuda', pbar='tqdm')
- Parameters
num_classes (int) – Number of classes.
batch_size (int, default=50) – Batch size in which the query strategy scores the instances.
device (str or torch.device, default='cuda') – Torch device on which the computation will be performed.
pbar ('tqdm' or None, default='tqdm') – Displays a progress bar if ‘tqdm’ is passed.
- class small_text.integrations.pytorch.query_strategies.strategies.ExpectedGradientLengthMaxWord(num_classes, layer_name, batch_size=50, device='cuda')[source]
Selects instances using the EGL-word strategy [ZLW17].
The EGL-word strategy works as follows:
For every instance and class the gradient norm is computed per word. The score for each (instance, class) pair is the norm of the word with the highest gradient norm value.
These scores are then summed up over all classes. The result is one score per instance.
Finally, the instances are selected by maximum score.
Notes
An embedding layer is required for this strategy.
This strategy was designed for the KimCNN model and might not work for other models, even if they possess an embedding layer.
- __init__(num_classes, layer_name, batch_size=50, device='cuda')
- Parameters
num_classes (int) – Number of classes.
layer_name (str) – Name of the embedding layer.
batch_size (int) – Batch size.
device (str or torch.device) – Torch device on which the computation will be performed.
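The scoring described above (maximum over words, then sum over classes) can be sketched in numpy, assuming the per-word gradient norms have already been computed:

import numpy as np

# Hypothetical per-word gradient norms, shape (num_instances, num_classes, num_words).
grad_norms = np.random.rand(4, 3, 20)

# Score per (instance, class): the highest gradient norm over all words.
scores_per_class = grad_norms.max(axis=2)
# Summing over classes yields one score per instance ...
scores = scores_per_class.sum(axis=1)
# ... and instances are selected by maximum score.
selected = np.argsort(-scores)[:2]
print(selected)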
- class small_text.integrations.pytorch.query_strategies.strategies.ExpectedGradientLengthLayer(num_classes, layer_name, batch_size=50)[source]
An EGL variant that is restricted to the gradients of a single layer.
This is a generalized version of the EGL-sm strategy [ZLW17]: instead of being restricted to the last layer, it operates on the layer whose name is passed to the constructor.
- __init__(num_classes, layer_name, batch_size=50)
- Parameters
num_classes (int) – Number of classes.
layer_name (str) – Name of the target layer.
batch_size (int, default=50) – Batch size in which the query strategy scores the instances.
Functions
- small_text.query_strategies.coresets.greedy_coreset(x, indices_unlabeled, indices_labeled, n, distance_metric='cosine', batch_size=100, normalized=False)[source]
Computes a greedy coreset [SS17] over x with size n.
- Parameters
x (np.ndarray) – A matrix of row-wise vector representations.
indices_unlabeled (np.ndarray) – Indices (relative to dataset) for the unlabeled data.
indices_labeled (np.ndarray) – Indices (relative to dataset) for the labeled data.
n (int) – Size of the coreset (in number of instances).
distance_metric ({'cosine', 'euclidean'}) – Distance metric to be used.
batch_size (int) – Batch size.
normalized (bool) – If True the data x is assumed to be normalized, otherwise it will be normalized where necessary.
- Returns
indices – Indices relative to x.
- Return type
numpy.ndarray
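A usage sketch, with random vectors standing in for document embeddings:

import numpy as np

from small_text.query_strategies.coresets import greedy_coreset

x = np.random.rand(100, 32)             # row-wise vector representations
indices_labeled = np.arange(10)         # first 10 rows are "labeled"
indices_unlabeled = np.arange(10, 100)  # remaining rows form the pool

indices = greedy_coreset(x, indices_unlabeled, indices_labeled, n=5,
                         distance_metric='euclidean')
print(indices)  # 5 indices relative to x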
- small_text.query_strategies.coresets.lightweight_coreset(x, x_mean, n, normalized=False, proba=None)[source]
Computes a lightweight coreset [BLK18] of x with size n.
- Parameters
x (np.ndarray) – 2D array in which each row represents a sample.
x_mean (np.ndarray) – Elementwise mean over the columns of x.
n (int) – Coreset size.
normalized (bool) – If True the data x is assumed to be normalized, otherwise it will be normalized where necessary.
proba (np.ndarray or None) – A probability distribution over x, which makes up half of the probability mass of the sampling distribution. If proba is None, a uniform distribution is used.
- Returns
indices – Indices relative to x.
- Return type
numpy.ndarray
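And a corresponding usage sketch:

import numpy as np

from small_text.query_strategies.coresets import lightweight_coreset

x = np.random.rand(100, 32)  # one sample per row
indices = lightweight_coreset(x, x.mean(axis=0), n=5)
print(indices)  # 5 indices relative to x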