Query Strategies

Query strategies select data samples from the set of unlabeled data.
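
A query strategy is typically passed to an active learner, but it can also be invoked directly. A minimal usage sketch, assuming small-text's query() signature of (clf, dataset, indices_unlabeled, indices_labeled, y, n); the classifier, dataset, and index variables are placeholders:

from small_text.query_strategies import LeastConfidence

query_strategy = LeastConfidence()

# clf, train, indices_unlabeled, indices_labeled, and y stand for a trained
# classifier, a dataset, and the current labeling state.
indices_queried = query_strategy.query(clf, train, indices_unlabeled,
                                       indices_labeled, y, n=10)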

Overview

You can use the following pre-implemented query strategies:

General

  • LeastConfidence
  • PredictionEntropy
  • BreakingTies
  • EmbeddingKMeans
  • RandomSampling

Pytorch

  • ExpectedGradientLength
  • ExpectedGradientLengthMaxWord
  • ExpectedGradientLengthLayer
  • BADGE

Helpers

Some query strategies are formulated so that they are applicable only to either single-label or multi-label data. As a safeguard against using such strategies on data they do not support, the constraints() decorator intercepts the query() call. If the given labels cannot be handled, a RuntimeError is raised.

Note

For the pre-implemented query strategies, do not take the absence of a constraint as an indicator of capability: constraints are used sparingly in the main library in order not to restrict the user unnecessarily. For your own projects and applications, however, using them is highly recommended.

Constraints

from small_text.query_strategies import constraints, QueryStrategy

@constraints(classification_type='single-label')
class MyQueryStrategy(QueryStrategy):
    pass
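
When a constrained strategy then receives labels it cannot handle, the decorated query() fails fast. A hypothetical sketch, assuming MyQueryStrategy from above additionally implements query() and that the classifier, dataset, and index variables are placeholders (small-text represents multi-label targets as a scipy.sparse csr_matrix):

strategy = MyQueryStrategy()

# Passing multi-label targets to a strategy constrained to 'single-label'
# data is intercepted before the strategy's own logic runs.
try:
    strategy.query(clf, dataset, indices_unlabeled, indices_labeled,
                   y_multi_label, n=10)
except RuntimeError as e:
    print(e)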

Classes

class small_text.query_strategies.strategies.LeastConfidence[source]

Selects instances with the least prediction confidence (regarding the most likely class) [LG94].

__init__()
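
The criterion can be reproduced in a few lines. A minimal numpy sketch (illustrative, not the library's internal implementation), where proba holds the classifier's predicted class probabilities:

import numpy as np

# Predicted probabilities for 4 instances and 3 classes (rows sum to 1).
proba = np.array([[0.70, 0.20, 0.10],
                  [0.40, 0.35, 0.25],
                  [0.90, 0.05, 0.05],
                  [0.34, 0.33, 0.33]])

# Confidence is the probability of the most likely class; the least
# confident instances are queried first.
confidence = proba.max(axis=1)
n = 2
indices_queried = np.argsort(confidence)[:n]
print(indices_queried)  # [3 1]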
class small_text.query_strategies.strategies.PredictionEntropy[source]

Selects instances with the largest prediction entropy [HOL08].

__init__()
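
Analogously, a minimal numpy sketch of entropy-based selection (illustrative, not the library's internal implementation):

import numpy as np

proba = np.array([[0.70, 0.20, 0.10],
                  [0.40, 0.35, 0.25],
                  [0.90, 0.05, 0.05],
                  [0.34, 0.33, 0.33]])

# Shannon entropy of the predicted distribution; higher means more uncertain.
entropy = -np.sum(proba * np.log(proba), axis=1)
n = 2
indices_queried = np.argsort(-entropy)[:n]  # largest entropy first
print(indices_queried)  # [3 1]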
class small_text.query_strategies.strategies.BreakingTies[source]

Selects instances which have a small margin between their most likely and second most likely prediction [LUO05].

__init__()
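
A minimal numpy sketch of the margin computation (illustrative, not the library's internal implementation):

import numpy as np

proba = np.array([[0.70, 0.20, 0.10],
                  [0.40, 0.35, 0.25],
                  [0.90, 0.05, 0.05],
                  [0.34, 0.33, 0.33]])

# Margin between the most likely and the second most likely class.
proba_sorted = np.sort(proba, axis=1)
margin = proba_sorted[:, -1] - proba_sorted[:, -2]
n = 2
indices_queried = np.argsort(margin)[:n]  # smallest margin first
print(indices_queried)  # [3 1]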
class small_text.query_strategies.strategies.EmbeddingKMeans(normalize=True)[source]

This is a generalized version of BERT-K-Means [YLB20], which is applicable to any kind of dense embedding, regardless of the classifier.

__init__(normalize=True)
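Parameters
  • normalize (bool, default=True) – Embeddings will be L2-normalized if True, otherwise they remain unchanged.

The underlying procedure can be sketched with scikit-learn (illustrative, not the library's internal implementation): cluster the embeddings of the unlabeled pool into n clusters and, for each cluster, query the instance closest to the cluster center.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

rng = np.random.default_rng(42)
embeddings = rng.normal(size=(100, 32))  # dense embeddings of the unlabeled pool

n = 10
embeddings = normalize(embeddings)  # L2 normalization (normalize=True)

km = KMeans(n_clusters=n, n_init=10, random_state=42).fit(embeddings)

# For each cluster, take the instance closest to the cluster center.
indices_queried = np.array([
    np.argmin(np.linalg.norm(embeddings - center, axis=1))
    for center in km.cluster_centers_
])
print(indices_queried)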
class small_text.query_strategies.strategies.RandomSampling[source]

Randomly selects instances.

__init__()
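
For comparison, random selection over the unlabeled pool amounts to (illustrative):

import numpy as np

rng = np.random.default_rng(42)
indices_unlabeled = np.arange(100)

# Draw n distinct instances uniformly at random.
indices_queried = rng.choice(indices_unlabeled, size=10, replace=False)
print(indices_queried)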
class small_text.integrations.pytorch.query_strategies.strategies.ExpectedGradientLength(num_classes, batch_size=50, device='cuda', pbar='tqdm')[source]

Selects instances by expected gradient length [Set07].

__init__(num_classes, batch_size=50, device='cuda', pbar='tqdm')
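Parameters
  • num_classes (int) – Number of classes.

  • batch_size (int, default=50) – Batch size in which the query strategy scores the instances.

  • device (str or torch.device) – Torch device on which the computation will be performed.

  • pbar ('tqdm' or None, default='tqdm') – Displays a progress bar if 'tqdm' is passed.

Conceptually, EGL scores an instance by the expected norm of the gradient that labeling it would induce, where the expectation is taken over the predicted class distribution. A minimal PyTorch sketch for a single instance (illustrative, not the library's internal implementation; model is assumed to map an input tensor to logits):

import torch
import torch.nn.functional as F

def expected_gradient_length(model, x, num_classes):
    # Compute logits once; softmax gives the predicted class distribution.
    logits = model(x.unsqueeze(0))
    proba = F.softmax(logits, dim=1).detach().squeeze(0)

    score = 0.0
    for c in range(num_classes):
        model.zero_grad()
        loss = F.cross_entropy(logits, torch.tensor([c]))
        loss.backward(retain_graph=True)
        grad_norm = torch.sqrt(sum((p.grad ** 2).sum()
                                   for p in model.parameters()
                                   if p.grad is not None))
        # Weight the gradient norm by the predicted probability of class c.
        score += proba[c].item() * grad_norm.item()
    return score

The strategy then scores every unlabeled instance this way (in batches of batch_size) and queries the n instances with the largest scores.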
class small_text.integrations.pytorch.query_strategies.strategies.ExpectedGradientLengthMaxWord(num_classes, layer_name, batch_size=50, device='cuda')[source]

Selects instances using the EGL-word strategy [ZLW17].

The EGL-word strategy works as follows:

  1. For every instance and class the gradient norm is computed per word. The score for each (instance, class) pair is the norm of the word with the highest gradient norm value.

  2. These scores are then summed up over all classes. The result is one score per instance.

Finally, the instances are selected by maximum score.
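
This scoring can be sketched in numpy (illustrative, not the library's internal implementation), assuming the per-word gradient norms with respect to the embedding layer have already been computed into an array of shape (num_instances, num_classes, num_words):

import numpy as np

rng = np.random.default_rng(0)
word_grad_norms = rng.random((100, 5, 40))  # placeholder gradient norms

# Step 1: per (instance, class) pair, keep the norm of the word with the
# highest gradient norm.
scores_per_class = word_grad_norms.max(axis=2)   # (num_instances, num_classes)

# Step 2: sum these scores over all classes.
scores = scores_per_class.sum(axis=1)            # (num_instances,)

# Finally, select the instances with the maximum score.
n = 10
indices_queried = np.argsort(-scores)[:n]
print(indices_queried)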

Notes

  • An embedding layer is required for this strategy.

  • This strategy was designed for the KimCNN model and might not work for other models, even if they possess an embedding layer.

__init__(num_classes, layer_name, batch_size=50, device='cuda')
Parameters
  • num_classes (int) – Number of classes.

  • layer_name (str) – Name of the embedding layer.

  • batch_size (int) – Batch size.

  • device (str or torch.device) – Torch device on which the computation will be performed.

class small_text.integrations.pytorch.query_strategies.strategies.ExpectedGradientLengthLayer(num_classes, layer_name, batch_size=50)[source]

An EGL variant that is restricted to the gradients of a single layer.

This is a generalized version of the EGL-sm strategy [ZLW17]: instead of being restricted to the last layer, it operates on the layer whose name is passed to the constructor.

__init__(num_classes, layer_name, batch_size=50)
Parameters
  • num_classes (int) – The number of classes.

  • layer_name (str) – The name of the target layer.

  • batch_size (int, default=50) – Batch size in which the query strategy scores the instances.
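
The restriction to a single layer can be sketched by filtering the model's named parameters (illustrative; the gradient computation itself proceeds as in the ExpectedGradientLength sketch above):

def layer_parameters(model, layer_name):
    # Parameters belonging to the layer registered under layer_name.
    return [p for name, p in model.named_parameters()
            if name.startswith(layer_name)]

# The gradient norm is then taken over these parameters only:
# grad_norm = torch.sqrt(sum((p.grad ** 2).sum()
#                            for p in layer_parameters(model, layer_name)))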

class small_text.integrations.pytorch.query_strategies.strategies.BADGE(num_classes)[source]

Implements “Batch Active learning by Diverse Gradient Embedding” (BADGE) [AZK+20].

__init__(num_classes)
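Parameters
  • num_classes (int) – Number of classes.

Conceptually, BADGE builds a gradient embedding for every unlabeled instance, using the model's own prediction as a pseudo-label, and then selects a batch via k-means++ seeding over these embeddings; this favors instances that are both uncertain (large gradient magnitude) and mutually diverse. A minimal numpy sketch of the selection step (illustrative, not the library's internal implementation; the gradient embeddings here are random placeholders):

import numpy as np

def kmeans_plusplus_indices(embeddings, n, rng):
    # k-means++ seeding: iteratively sample points proportionally to the
    # squared distance from the points selected so far.
    indices = [int(rng.integers(len(embeddings)))]
    for _ in range(n - 1):
        dists = np.min(((embeddings[:, None, :]
                         - embeddings[indices][None, :, :]) ** 2).sum(-1), axis=1)
        indices.append(int(rng.choice(len(embeddings), p=dists / dists.sum())))
    return np.array(indices)

rng = np.random.default_rng(42)

# Placeholder gradient embeddings: for cross-entropy with a linear output
# layer, the embedding of x is (softmax(x) - onehot(pseudo_label)) outer h(x),
# where h(x) is the penultimate-layer representation.
gradient_embeddings = rng.normal(size=(100, 64))

indices_queried = kmeans_plusplus_indices(gradient_embeddings, n=10, rng=rng)
print(indices_queried)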