Query Strategies

Query strategies select data samples from the set of unlabeled data, i.e. they decide which samples should be labeled next.

Overview

You can use the following pre-implemented query strategies:

General

LeastConfidence
PredictionEntropy
BreakingTies
EmbeddingKMeans
GreedyCoreset
LightweightCoreset
ContrastiveActiveLearning
DiscriminativeActiveLearning
CategoryVectorInconsistencyAndRanking
SEALS
RandomSampling

Interface

The query strategy interface revolves around the query() method. A query strategy can make use of any of the given positional arguments but does not need to.

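from abc import ABC, abstractmethod

from small_text.query_strategies.exceptions import EmptyPoolException, PoolExhaustedException


class QueryStrategy(ABC):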
    """Abstract base class for Query Strategies."""

    @abstractmethod
    def query(self, clf, dataset, indices_unlabeled, indices_labeled, y, n=10):
        """
        Queries instances from the unlabeled pool.

        Parameters
        ----------
        clf : small_text.classifiers.Classifier
            A text classifier.
        dataset : small_text.data.datasets.Dataset
            A text dataset.
        indices_unlabeled : np.ndarray[int]
            Indices (relative to `dataset`) for the unlabeled data.
        indices_labeled : np.ndarray[int]
            Indices (relative to `dataset`) for the labeled data.
        y : np.ndarray[int] or csr_matrix
            List of labels where each label maps by index position to `indices_labeled`.
        n : int
            Number of samples to query.

        Returns
        -------
        indices : numpy.ndarray
            Indices relative to `dataset` which were selected.
        """
        pass

    @staticmethod
    def _validate_query_input(indices_unlabeled, n):
        if len(indices_unlabeled) == 0:
            raise EmptyPoolException('No unlabeled indices available. Cannot query an empty pool.')

        if n > len(indices_unlabeled):
            raise PoolExhaustedException('Pool exhausted: {} available / {} requested'
                                         .format(len(indices_unlabeled), n))

A query strategy can use the classifier clf to make a decision.
The full dataset (i.e. all samples regardless of whether they are labeled or not) is given by dataset.
The partition into labeled and unlabeled data is handled indirectly via indices (indices_unlabeled and indices_labeled).

Note

The indices taken together need not be complete, i.e. the union of indices_labeled and indices_unlabeled does not have to cover the full set of indices {0, 1, …, len(dataset) - 1}. This allows the active learner to ignore samples which should remain part of the dataset but are not suited for active learning.
The argument y represents the current labels.
The number of samples to query can be controlled with the keyword argument n.
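To make the interface concrete, here is a minimal, self-contained sketch of a strategy with the same query() signature. It is an illustration only: UniformQueryStrategy, the use of plain lists instead of NumPy arrays, and the built-in ValueError in place of the library's exceptions are assumptions made for this example. A real strategy would subclass QueryStrategy. The indices in the usage lines leave out index 3 on purpose to show that gaps are permitted.

```python
import random


class UniformQueryStrategy:
    """Hypothetical stand-in that mirrors the query() signature shown above."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)

    def query(self, clf, dataset, indices_unlabeled, indices_labeled, y, n=10):
        # Same checks as _validate_query_input, but with a built-in exception.
        if len(indices_unlabeled) == 0:
            raise ValueError('Cannot query an empty pool.')
        if n > len(indices_unlabeled):
            raise ValueError('Pool exhausted: {} available / {} requested'
                             .format(len(indices_unlabeled), n))
        # This strategy ignores clf, dataset, indices_labeled, and y.
        return self._rng.sample(list(indices_unlabeled), n)


dataset = ['doc a', 'doc b', 'doc c', 'doc d', 'doc e', 'doc f']
indices_labeled = [0, 1]
indices_unlabeled = [2, 4, 5]   # index 3 belongs to neither partition
y = [0, 1]                      # y[i] is the label of dataset[indices_labeled[i]]

strategy = UniformQueryStrategy(seed=42)
selected = strategy.query(None, dataset, indices_unlabeled, indices_labeled, y, n=2)
```

The returned indices refer to positions in dataset, not to positions within indices_unlabeled.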

Helpers

Some query strategies may be formulated so that they are only applicable to either single-label or multi-label data. As a safeguard against using such strategies on unsupported data, the constraints() decorator intercepts calls to query(). If the given labels cannot be handled, a RuntimeError is raised.

Note

For the pre-implemented query strategies, do not interpret the absence of a constraint as an indicator of capability, since constraints are used sparingly in the main library in order not to restrict the user unnecessarily. For your own projects and applications, however, using them is highly recommended.

Constraints

from small_text.query_strategies import constraints, QueryStrategy

@constraints(classification_type='single-label')
class MyQueryStrategy(QueryStrategy):
    pass
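The general mechanism behind such a safeguard can be pictured with a small stand-alone class decorator. This is a hedged sketch of the idea, not small-text's implementation: single_label_only, DummyStrategy, and the rule that nested lists mark multi-label data are all assumptions for this example.

```python
import functools


def single_label_only(cls):
    """Hypothetical class decorator: reject multi-label y before query() runs."""
    original_query = cls.query

    @functools.wraps(original_query)
    def guarded_query(self, clf, dataset, indices_unlabeled, indices_labeled, y, n=10):
        # Assumption for this sketch: multi-label data is given as a list of lists.
        if any(isinstance(labels, (list, tuple, set)) for labels in y):
            raise RuntimeError('This query strategy only supports single-label data.')
        return original_query(self, clf, dataset, indices_unlabeled, indices_labeled, y, n=n)

    cls.query = guarded_query
    return cls


@single_label_only
class DummyStrategy:
    def query(self, clf, dataset, indices_unlabeled, indices_labeled, y, n=10):
        return list(indices_unlabeled)[:n]


strategy = DummyStrategy()
ok = strategy.query(None, None, [2, 3, 4], [0, 1], [0, 1], n=2)

try:
    strategy.query(None, None, [2, 3, 4], [0, 1], [[0, 1], [1]], n=2)
    rejected = False
except RuntimeError:
    rejected = True
```

Wrapping query() at class-definition time keeps the check out of the strategy's own logic, which is the same separation the decorator above provides.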

Classes

Base

class small_text.query_strategies.strategies.LeastConfidence: Selects instances with the least prediction confidence (regarding the most likely class) [LG94].

class small_text.query_strategies.strategies.PredictionEntropy: Selects instances with the largest prediction entropy [HOL08].

class small_text.query_strategies.strategies.BreakingTies: Selects instances which have a small margin between their most likely and second most likely predicted class [LUO05].
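The three uncertainty criteria above can be compared on a toy example. The following sketch computes each score from a row of predicted class probabilities. The function names and the probability values are made up for illustration, and the library's internal scoring may differ in detail.

```python
import math


def least_confidence(proba):
    """Higher score = less confidence in the most likely class."""
    return 1.0 - max(proba)


def prediction_entropy(proba):
    """Shannon entropy of the predicted distribution (natural logarithm)."""
    return -sum(p * math.log(p) for p in proba if p > 0.0)


def breaking_ties_margin(proba):
    """Margin between the two most likely classes; smaller = more uncertain."""
    top, second = sorted(proba, reverse=True)[:2]
    return top - second


predictions = [
    [0.90, 0.05, 0.05],   # confident prediction
    [0.40, 0.35, 0.25],   # probability mass spread over all classes
    [0.50, 0.49, 0.01],   # near tie between two classes
]

by_least_confidence = max(range(3), key=lambda i: least_confidence(predictions[i]))
by_entropy = max(range(3), key=lambda i: prediction_entropy(predictions[i]))
by_margin = min(range(3), key=lambda i: breaking_ties_margin(predictions[i]))
```

On this data, least confidence and entropy both pick the second row, while the margin criterion picks the third row, whose two top classes are almost tied.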

class small_text.query_strategies.strategies.EmbeddingKMeans(normalize=True)

This is a generalized version of BERT-K-Means [YLB20], which is applicable to any kind of dense embedding, regardless of the classifier.

__init__(normalize=True)

Parameters: normalize (bool, default=True) – Embeddings will be L2 normalized if True, otherwise they remain unchanged.
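As a reminder of what L2 normalization does to the embeddings, the sketch below scales each vector to unit Euclidean length. The helper name l2_normalize and the choice to leave zero vectors unchanged are assumptions for this example, not the library's code.

```python
import math


def l2_normalize(vectors):
    """Scale every vector to unit Euclidean length."""
    normalized = []
    for vec in vectors:
        norm = math.sqrt(sum(x * x for x in vec))
        # Assumption: zero vectors are returned unchanged to avoid dividing by zero.
        normalized.append([x / norm for x in vec] if norm > 0.0 else list(vec))
    return normalized


normalized = l2_normalize([[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]])
```

After normalization, distances between embeddings depend only on their direction, not on their magnitude.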

class small_text.query_strategies.coresets.GreedyCoreset(normalize=True, batch_size=100)

__init__(normalize=True, batch_size=100)

class small_text.query_strategies.coresets.LightweightCoreset(normalize=True)

Selects instances using the lightweight coreset method [BAC18].

Parameters: normalize (bool) – Embeddings are normalized if True, otherwise they are left unchanged.

__init__(normalize=True)

class small_text.query_strategies.strategies.ContrastiveActiveLearning(k=10, embed_kwargs=dict(), normalize=True, batch_size=100, pbar='tqdm')

Contrastive Active Learning [MVB+21] selects instances whose k-nearest neighbours exhibit the largest mean Kullback-Leibler divergence.

__init__(k=10, embed_kwargs=dict(), normalize=True, batch_size=100, pbar='tqdm')

Parameters

k (int) – Number of nearest neighbours whose KL divergence is considered.
embed_kwargs (dict) – Embedding keyword args which are passed to clf.embed().
normalize (bool, default=True) – Embeddings will be L2 normalized if True, otherwise they remain unchanged.
batch_size (int, default=100) – Batch size which is used to process the embeddings.
pbar ('tqdm' or None, default='tqdm') – Displays a progress bar if ‘tqdm’ is passed.
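The scoring idea can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's implementation: neighbours are searched by brute force in the labeled pool, the divergence is taken as KL(neighbour || candidate), and the embeddings and probabilities are made-up values.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0.0)


def contrastive_scores(emb_unlabeled, proba_unlabeled, emb_labeled, proba_labeled, k=2):
    """Mean KL divergence between each candidate and its k nearest labeled neighbours."""
    scores = []
    for emb, proba in zip(emb_unlabeled, proba_unlabeled):
        # Brute-force nearest neighbours by Euclidean distance.
        nearest = sorted(range(len(emb_labeled)),
                         key=lambda j: math.dist(emb, emb_labeled[j]))[:k]
        # Assumed direction for this sketch: KL(neighbour || candidate).
        divergences = [kl_divergence(proba_labeled[j], proba) for j in nearest]
        scores.append(sum(divergences) / len(divergences))
    return scores


emb_labeled = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
proba_labeled = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]

emb_unlabeled = [[0.1, 0.1], [0.2, 0.5]]
proba_unlabeled = [[0.85, 0.15],   # agrees with its neighbours
                   [0.20, 0.80]]   # contradicts its neighbours

scores = contrastive_scores(emb_unlabeled, proba_unlabeled, emb_labeled, proba_labeled, k=2)
selected = max(range(len(scores)), key=lambda i: scores[i])
```

The second candidate lies close to the same labeled instances as the first but is predicted very differently, so it receives the larger score and is selected.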

class small_text.query_strategies.strategies.DiscriminativeActiveLearning(classifier_factory, num_iterations, unlabeled_factor=10, pbar='tqdm')

Discriminative Active Learning [GS19] learns to differentiate between the labeled and unlabeled pool and selects the instances that are most likely to belong to the unlabeled pool.

__init__(classifier_factory, num_iterations, unlabeled_factor=10, pbar='tqdm')

Parameters

classifier_factory (small_text.) – Classifier factory which is used for the discriminative classifiers.
num_iterations (int) – Number of iterations for the discriminative training.
unlabeled_factor (int, default=10) – The ratio of “unlabeled pool” instances to “labeled pool” instances in the discriminative training.
pbar ('tqdm' or None, default='tqdm') – Displays a progress bar if ‘tqdm’ is passed.

class small_text.query_strategies.multi_label.CategoryVectorInconsistencyAndRanking(batch_size=2048, prediction_threshold=0.5, epsilon=1e-08, pbar='tqdm')

Uncertainty Sampling based on Category Vector Inconsistency and Ranking of Scores [RCV18] selects instances based on the inconsistency of predicted labels and per-class label rankings.

__init__(batch_size=2048, prediction_threshold=0.5, epsilon=1e-08, pbar='tqdm')

Parameters

batch_size (int) – Batch size in which the computations are performed. Increasing the size increases the amount of memory used.
prediction_threshold (float) – Confidence value above which a prediction counts as positive.
epsilon (float) – A small value that is added to the argument of the logarithm to avoid taking the logarithm of zero.
pbar ('tqdm' or None, default='tqdm') – Displays a progress bar if ‘tqdm’ is passed.

class small_text.query_strategies.strategies.SEALS(base_query_strategy, k=100, hnsw_kwargs=dict(), embed_kwargs=dict(), normalize=True)

Similarity Search for Efficient Active Learning and Search of Rare Concepts (SEALS) improves the computational efficiency of active learning by presenting a reduced subset of the unlabeled pool to a base strategy [CCK+21].

This method is to be applied in conjunction with a base query strategy. SEALS selects a subset of the unlabeled pool by selecting the k nearest neighbours of the current labeled pool.

If the size of the unlabeled pool falls below the given k, this implementation will not select a subset anymore and will just delegate to the base strategy instead.

Note

This strategy requires the optional dependency hnswlib.

__init__(base_query_strategy, k=100, hnsw_kwargs=dict(), embed_kwargs=dict(), normalize=True)

Parameters

base_query_strategy (small_text.query_strategy.QueryStrategy) – A base query strategy which operates on the subset that is selected by SEALS.
k (int, default=100) – Number of nearest neighbors that will be selected.
hnsw_kwargs (dict, default=dict()) – Kwargs which will be passed to the underlying hnsw index. See the hnswlib GitHub repository for details on the parameters space, ef_construction, ef, and M.
embed_kwargs (dict, default=dict()) – Kwargs that will be passed to the embed() method.
normalize (bool, default=True) – Embeddings will be L2 normalized if True, otherwise they remain unchanged.
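The subset selection described above can be sketched as follows. This illustration swaps the approximate hnswlib index for a brute-force nearest neighbour search, and seals_candidates as well as the toy embeddings are assumptions for this example rather than the library's code.

```python
import math


def seals_candidates(emb_labeled, emb_unlabeled, indices_unlabeled, k):
    """Unlabeled indices that are among the k nearest neighbours of any labeled instance."""
    if len(indices_unlabeled) < k:
        # Pool smaller than k: no subset selection, pass the full pool on.
        return list(indices_unlabeled)

    selected = set()
    for emb in emb_labeled:
        # Brute-force search; an approximate index would replace this step.
        order = sorted(range(len(emb_unlabeled)),
                       key=lambda j: math.dist(emb, emb_unlabeled[j]))
        selected.update(indices_unlabeled[j] for j in order[:k])
    return sorted(selected)


emb_labeled = [[0.0, 0.0], [10.0, 10.0]]
emb_unlabeled = [[0.0, 1.0], [1.0, 0.0], [5.0, 5.0],
                 [9.0, 10.0], [10.0, 9.0], [20.0, 20.0]]
indices_unlabeled = [2, 3, 4, 5, 6, 7]

candidates = seals_candidates(emb_labeled, emb_unlabeled, indices_unlabeled, k=2)
small_pool = seals_candidates(emb_labeled, emb_unlabeled[:1], [2], k=2)
```

The base query strategy would then be called with candidates in place of the full unlabeled pool, which is where the savings in computation come from.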

class small_text.query_strategies.strategies.RandomSampling: Randomly selects instances.

PyTorch Integration

class small_text.integrations.pytorch.query_strategies.strategies.ExpectedGradientLength(num_classes, batch_size=50, device='cuda', pbar='tqdm')

Selects instances by expected gradient length [Set07].

__init__(num_classes, batch_size=50, device='cuda', pbar='tqdm')

Parameters

num_classes (int) – Number of classes.
batch_size (int, default=50) – Batch size in which the query strategy scores the instances.
device (str or torch.device, default='cuda') – Torch device on which the computation will be performed.
pbar ('tqdm' or None, default='tqdm') – Displays a progress bar if ‘tqdm’ is passed.

class small_text.integrations.pytorch.query_strategies.strategies.ExpectedGradientLengthMaxWord(num_classes, layer_name, batch_size=50, device='cuda')

Selects instances using the EGL-word strategy [ZLW17].

The EGL-word strategy works as follows:

For every instance and class the gradient norm is computed per word. The score for each (instance, class) pair is the norm of the word with the highest gradient norm value.
These scores are then summed up over all classes. The result is one score per instance.

Finally, the instances are selected by maximum score.
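The steps above can be sketched on precomputed values. In this illustration the per-word gradient norms are made-up numbers given as nested lists, and egl_word_scores and select_top are hypothetical helpers. Computing the gradients from a model is omitted.

```python
def egl_word_scores(word_grad_norms):
    """word_grad_norms[i][c] holds the per-word gradient norms of instance i for class c."""
    scores = []
    for per_class in word_grad_norms:
        # Per (instance, class): keep the largest word-level gradient norm,
        # then sum these maxima over all classes.
        scores.append(sum(max(word_norms) for word_norms in per_class))
    return scores


def select_top(scores, n):
    """Indices of the n highest-scoring instances."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]


word_grad_norms = [
    [[0.1, 0.4, 0.2], [0.3, 0.1, 0.1]],   # instance 0: maxima 0.4 and 0.3
    [[0.9, 0.2], [0.8, 0.1]],             # instance 1: maxima 0.9 and 0.8
    [[0.05], [0.02]],                     # instance 2: maxima 0.05 and 0.02
]

scores = egl_word_scores(word_grad_norms)
selected = select_top(scores, n=2)
```

Instances may contain different numbers of words, since only the maximum per class enters the score.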

Notes

An embedding layer is required for this strategy.
This strategy was designed for the KimCNN model and might not work for other models even if they possess an embedding layer.

__init__(num_classes, layer_name, batch_size=50, device='cuda')

Parameters

num_classes (int) – Number of classes.
layer_name (str) – Name of the embedding layer.
batch_size (int) – Batch size.
device (str or torch.device) – Torch device on which the computation will be performed.

class small_text.integrations.pytorch.query_strategies.strategies.ExpectedGradientLengthLayer(num_classes, layer_name, batch_size=50)

An EGL variant that is restricted to the gradients of a single layer.

This is a generalized version of the EGL-sm strategy [ZLW17], but instead of being restricted to the last layer it operates on the layer name passed to the constructor.

__init__(num_classes, layer_name, batch_size=50)

Parameters

num_classes (int) – Number of classes.
layer_name (str) – Name of the target layer.
batch_size (int, default=50) – Batch size in which the query strategy scores the instances.

class small_text.integrations.pytorch.query_strategies.strategies.BADGE(num_classes)

Implements “Batch Active learning by Diverse Gradient Embedding” (BADGE) [AZK+20].

__init__(num_classes)

Parameters: num_classes (int) – Number of classes.