Query Strategies
Query strategies select data samples from the set of unlabeled data, i.e. they decide which samples should be labeled next.
Overview
You can use the pre-implemented query strategies listed below, grouped into general strategies and strategies that require the Pytorch integration (see Classes and Pytorch Integration below).
Interface
The query strategy interface revolves around the query() method.
A query strategy can make use of any of the given positional arguments but does not need to.
from abc import ABC, abstractmethod

from small_text.query_strategies.exceptions import EmptyPoolException, PoolExhaustedException


class QueryStrategy(ABC):
    """Abstract base class for Query Strategies."""

    @abstractmethod
    def query(self, clf, dataset, indices_unlabeled, indices_labeled, y, n=10):
        """Queries instances from the unlabeled pool.

        Parameters
        ----------
        clf : small_text.classifiers.Classifier
            A text classifier.
        dataset : small_text.data.datasets.Dataset
            A text dataset.
        indices_unlabeled : np.ndarray[int]
            Indices (relative to `dataset`) for the unlabeled data.
        indices_labeled : np.ndarray[int]
            Indices (relative to `dataset`) for the labeled data.
        y : np.ndarray[int] or csr_matrix
            List of labels where each label maps by index position to `indices_labeled`.
        n : int
            Number of samples to query.

        Returns
        -------
        indices : numpy.ndarray
            Indices relative to `dataset` which were selected.
        """
        pass

    @staticmethod
    def _validate_query_input(indices_unlabeled, n):
        if len(indices_unlabeled) == 0:
            raise EmptyPoolException('No unlabeled indices available. Cannot query an empty pool.')
        if n > len(indices_unlabeled):
            raise PoolExhaustedException('Pool exhausted: {} available / {} requested'
                                         .format(len(indices_unlabeled), n))
A query strategy can use the classifier clf to make its decision. The full dataset (i.e. all samples, regardless of whether they are labeled or not) is given by dataset. The partition into labeled and unlabeled data is handled indirectly via indices (indices_unlabeled and indices_labeled).
Note
The indices taken together do not need to be complete, i.e. the full set of indices {1, 2, …, len(dataset)} may contain gaps. This allows the active learner to ignore samples which should remain part of the dataset but are not suited for active learning.
The argument y represents the current labels. The number of samples to query can be controlled with the keyword argument n.
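To illustrate the interface, here is a minimal sketch (not part of the library) of a strategy that queries uniformly at random from the unlabeled pool, ignoring the classifier and the labels:

import numpy as np

from small_text.query_strategies import QueryStrategy


class MyRandomSampling(QueryStrategy):
    """Illustrative sketch: queries a random subset of the unlabeled pool."""

    def query(self, clf, dataset, indices_unlabeled, indices_labeled, y, n=10):
        # Guard against an empty or exhausted pool (see _validate_query_input above).
        self._validate_query_input(indices_unlabeled, n)
        # Neither clf, dataset, nor y is needed here; a strategy may ignore
        # any of the given arguments.
        return np.random.choice(indices_unlabeled, size=n, replace=False)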
Helpers
Some query strategies are formulated such that they are only applicable to either single-label or multi-label data. As a safeguard against using such strategies on data which is not supported, the constraints() decorator intercepts the query() call. If the given labels cannot be handled, a RuntimeError is raised.
Note
For the pre-implemented query strategies, do not take the absence of a constraint as an indicator of capability: we use this decorator sparingly in the main library in order not to restrict the user unnecessarily. For your own projects and applications, however, using it is highly recommended.
Constraints
from small_text.query_strategies import constraints, QueryStrategy

@constraints(classification_type='single-label')
class MyQueryStrategy(QueryStrategy):

    def query(self, clf, dataset, indices_unlabeled, indices_labeled, y, n=10):
        pass
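With this constraint in place, a query() call that receives multi-label targets (labels given as a scipy.sparse csr_matrix) is intercepted and fails with a RuntimeError before the strategy's own logic runs. A minimal sketch, assuming the decorator inspects only the type of y (clf and dataset are placeholders here, since the strategy above ignores them):

import numpy as np
from scipy.sparse import csr_matrix

strategy = MyQueryStrategy()

# Multi-label targets are represented as a csr_matrix (one row per labeled
# instance); this violates the 'single-label' constraint declared above.
y_multi = csr_matrix(np.array([[1, 0], [0, 1], [1, 1]]))

try:
    strategy.query(None, None, np.arange(3, 10), np.arange(3), y_multi, n=2)
except RuntimeError as e:
    print(e)  # constraint violation reported by the decorator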
Classes
Base
- class small_text.query_strategies.strategies.LeastConfidence[source]
Selects instances with the least prediction confidence (regarding the most likely class) [LG94].
- class small_text.query_strategies.strategies.PredictionEntropy[source]
Selects instances with the largest prediction entropy [HOL08].
- class small_text.query_strategies.strategies.BreakingTies[source]
Selects instances which have a small margin between their most likely and second most likely predicted class [LUO05].
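For intuition, the margin score underlying breaking ties can be sketched in a few lines of numpy (illustrative only; the predicted probabilities are assumed to be given):

import numpy as np

# Hypothetical predicted class probabilities, one row per unlabeled instance.
proba = np.array([[0.48, 0.47, 0.05],
                  [0.90, 0.05, 0.05],
                  [0.40, 0.35, 0.25]])

proba_sorted = np.sort(proba, axis=1)
# Margin between the most likely and second most likely class.
margins = proba_sorted[:, -1] - proba_sorted[:, -2]
# Instances with the smallest margin are queried first.
print(np.argsort(margins))  # [0 2 1]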
- class small_text.query_strategies.strategies.EmbeddingKMeans(normalize=True)[source]
This is a generalized version of BERT-K-Means [YLB20], which is applicable to any kind of dense embedding, regardless of the classifier.
- __init__(normalize=True)
- Parameters
normalize (bool, default=True) – Embeddings will be L2 normalized if True, otherwise they remain unchanged.
- class small_text.query_strategies.bayesian.BALD(dropout_samples=10)[source]
Selects instances according to the Bayesian Active Learning by Disagreement (BALD) [HHG+11] strategy.
Requires that the predict_proba() method of the given classifier supports dropout sampling [GZ16].
New in version 1.2.0.
- __init__(dropout_samples=10)
- Parameters
dropout_samples (int) – For every instance in the dataset, dropout_samples-many predictions will be used to obtain uncertainty estimates.
- class small_text.query_strategies.coresets.GreedyCoreset(distance_metric='euclidean', normalize=True, batch_size=100)[source]
Selects instances by constructing a greedy coreset [SS17] over document embeddings.
- __init__(distance_metric='euclidean', normalize=True, batch_size=100)
- Parameters
distance_metric ({'cosine', 'euclidean'}) –
Distance metric to be used.
New in version 1.2.0.
normalize (bool) – Embeddings will be normalized before the coreset construction if True.
batch_size (int) – Batch size used for computing document distances.
Note
Prior to v1.2.0, the default distance metric was cosine distance.
See also
The docstring of the underlying greedy_coreset() function.
- class small_text.query_strategies.coresets.LightweightCoreset(normalize=True)[source]
Selects instances by constructing a lightweight coreset [BLK18] over document embeddings.
- __init__(normalize=True)
- Parameters
normalize (bool) – Embeddings will be normalized before the coreset construction if True.
- class small_text.query_strategies.strategies.ContrastiveActiveLearning(k=10, embed_kwargs=dict(), normalize=True, batch_size=100, pbar='tqdm')[source]
Contrastive Active Learning [MVB+21] selects instances whose k-nearest neighbours exhibit the largest mean Kullback-Leibler divergence.
- __init__(k=10, embed_kwargs=dict(), normalize=True, batch_size=100, pbar='tqdm')
- Parameters
k (int) – Number of nearest neighbours whose KL divergence is considered.
embed_kwargs (dict) – Embedding keyword args which are passed to clf.embed().
normalize (bool, default=True) – Embeddings will be L2 normalized if True, otherwise they remain unchanged.
batch_size (int, default=100) – Batch size which is used to process the embeddings.
pbar ('tqdm' or None, default='tqdm') – Displays a progress bar if ‘tqdm’ is passed.
- class small_text.query_strategies.strategies.DiscriminativeActiveLearning(classifier_factory, num_iterations, unlabeled_factor=10, pbar='tqdm')[source]
Discriminative Active Learning [GS19] learns to differentiate between the labeled and unlabeled pool and selects the instances that are most likely to belong to the unlabeled pool.
- __init__(classifier_factory, num_iterations, unlabeled_factor=10, pbar='tqdm')
- Parameters
classifier_factory (small_text classifier factory) – Classifier factory which is used for the discriminative classifiers.
num_iterations (int) – Number of iterations for the discriminative training.
unlabeled_factor (int, default=10) – The ratio of “unlabeled pool” instances to “labeled pool” instances in the discriminative training.
pbar ('tqdm' or None, default='tqdm') – Displays a progress bar if ‘tqdm’ is passed.
- class small_text.query_strategies.multi_label.CategoryVectorInconsistencyAndRanking(batch_size=2048, prediction_threshold=0.5, epsilon=1e-8, pbar='tqdm')[source]
Uncertainty Sampling based on Category Vector Inconsistency and Ranking of Scores [RCV18] selects instances based on the inconsistency of predicted labels and per-class label rankings.
- __init__(batch_size=2048, prediction_threshold=0.5, epsilon=1e-8, pbar='tqdm')
- Parameters
batch_size (int) – Batch size in which the computations are performed. Increasing the size increases the amount of memory used.
prediction_threshold (float) – Confidence value above which a prediction counts as positive.
epsilon (float) – A small value that is added to the argument of the logarithm to avoid taking the logarithm of zero.
pbar ('tqdm' or None, default='tqdm') – Displays a progress bar if ‘tqdm’ is passed.
- class small_text.query_strategies.strategies.SEALS(base_query_strategy, k=100, hnsw_kwargs=dict(), embed_kwargs=dict(), normalize=True)[source]
Similarity Search for Efficient Active Learning and Search of Rare Concepts (SEALS) improves the computational efficiency of active learning by presenting a reduced subset of the unlabeled pool to a base strategy [CCK+22].
This method is to be applied in conjunction with a base query strategy: SEALS restricts the unlabeled pool to the k nearest neighbours of the current labeled pool.
If the size of the unlabeled pool falls below the given k, this implementation no longer selects a subset and simply delegates to the base strategy instead.
Note
This strategy requires the optional dependency hnswlib.
- __init__(base_query_strategy, k=100, hnsw_kwargs=dict(), embed_kwargs=dict(), normalize=True)
- Parameters
base_query_strategy (small_text.query_strategies.QueryStrategy) – A base query strategy which operates on the subset that is selected by SEALS.
k (int, default=100) – Number of nearest neighbors that will be selected.
hnsw_kwargs (dict, default=dict()) – Keyword arguments which will be passed to the underlying hnsw index. Check the hnswlib GitHub repository for details on the parameters space, ef_construction, ef, and M.
embed_kwargs (dict, default=dict()) – Keyword arguments that will be passed to the embed() method.
normalize (bool, default=True) – Embeddings will be L2 normalized if True, otherwise they remain unchanged.
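A usage sketch: SEALS wraps a base strategy, so composing it with, for example, BreakingTies could look as follows (assuming both classes are importable from small_text.query_strategies and the optional dependency hnswlib is installed):

from small_text.query_strategies import BreakingTies, SEALS

# The base strategy only ever sees the k nearest neighbours of the labeled pool.
query_strategy = SEALS(BreakingTies(), k=100)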
- class small_text.query_strategies.subsampling.AnchorSubsampling(base_query_strategy, subsample_size=500, num_anchors=10, k=50, hnsw_kwargs={}, embed_kwargs={}, normalize=True, batch_size=32)[source]
This subsampling strategy is an implementation of AnchorAL [LV24].
AnchorAL performs subsampling with class-specific anchors, which aims to draw a class-balanced subset and to prevent overfitting on the current decision boundary [LV24].
The method is extensible with regard to the choice of base query strategy and anchor selection; for now, the implementation covers the choices described in the original paper.
Note
This strategy requires the optional dependency hnswlib.
New in version 1.4.0.
- __init__(base_query_strategy, subsample_size=500, num_anchors=10, k=50, hnsw_kwargs={}, embed_kwargs={}, normalize=True, batch_size=32)
- Parameters
base_query_strategy (small_text.query_strategies.QueryStrategy) – A base query strategy which operates on the subset that is selected by this strategy.
subsample_size (int, default=500) – The number of subsamples to be drawn.
num_anchors (int, default=10) – Number of anchors to be selected.
k (int, default=50) – Number of nearest neighbors that will be selected.
hnsw_kwargs (dict, default=dict()) – Keyword arguments that will be passed to the underlying hnsw index. Check the hnswlib GitHub repository for details on the parameters space, ef_construction, ef, and M.
embed_kwargs (dict, default=dict()) – Keyword arguments that will be passed to the embed() method.
normalize (bool, default=True) – Embeddings will be L2 normalized if True, otherwise they remain unchanged.
batch_size (int, default=32) – Batch size which is used to process the embeddings.
Pytorch Integration
- class small_text.integrations.pytorch.query_strategies.strategies.ExpectedGradientLength(num_classes, batch_size=50, device='cuda', pbar='tqdm')[source]
Selects instances by expected gradient length [Set07].
- __init__(num_classes, batch_size=50, device='cuda', pbar='tqdm')
- Parameters
num_classes (int) – Number of classes.
batch_size (int, default=50) – Batch size in which the query strategy scores the instances.
device (str or torch.device, default='cuda') – Torch device on which the computation will be performed.
pbar ('tqdm' or None, default='tqdm') – Displays a progress bar if ‘tqdm’ is passed.
- class small_text.integrations.pytorch.query_strategies.strategies.ExpectedGradientLengthMaxWord(num_classes, layer_name, batch_size=50, device='cuda')[source]
Selects instances using the EGL-word strategy [ZLW17].
The EGL-word strategy works as follows:
For every instance and class the gradient norm is computed per word. The score for each (instance, class) pair is the norm of the word with the highest gradient norm value.
These scores are then summed up over all classes. The result is one score per instance.
Finally, the instances are selected by maximum score.
Notes
An embedding layer is required for this strategy.
This strategy was designed for the KimCNN model and might not work for other models, even if they possess an embedding layer.
- __init__(num_classes, layer_name, batch_size=50, device='cuda')
- Parameters
num_classes (int) – Number of classes.
layer_name (str) – Name of the embedding layer.
batch_size (int) – Batch size.
device (str or torch.device) – Torch device on which the computation will be performed.
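The scoring described above (maximum over words, then sum over classes) can be sketched in numpy, assuming the per-word gradient norms have already been computed:

import numpy as np

# Hypothetical per-word gradient norms, shape (num_instances, num_classes, num_words).
grad_norms = np.random.rand(4, 3, 20)

# Score per (instance, class): the highest gradient norm over all words.
scores_per_class = grad_norms.max(axis=2)
# Summing over classes yields one score per instance ...
scores = scores_per_class.sum(axis=1)
# ... and instances are selected by maximum score.
selected = np.argsort(-scores)[:2]
print(selected)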
- class small_text.integrations.pytorch.query_strategies.strategies.ExpectedGradientLengthLayer(num_classes, layer_name, batch_size=50)[source]
An EGL variant that is restricted to the gradients of a single layer.
This is a generalized version of the EGL-sm strategy [ZLW17]: instead of being restricted to the last layer, it operates on the layer whose name is passed to the constructor.
- __init__(num_classes, layer_name, batch_size=50)
- Parameters
num_classes (int) – Number of classes.
layer_name (str) – Name of the target layer.
batch_size (int, default=50) – Batch size in which the query strategy scores the instances.
Functions
- small_text.query_strategies.coresets.greedy_coreset(x, indices_unlabeled, indices_labeled, n, distance_metric='cosine', batch_size=100, normalized=False)[source]
Computes a greedy coreset [SS17] over x with size n.
- Parameters
x (np.ndarray) – A matrix of row-wise vector representations.
indices_unlabeled (np.ndarray) – Indices (relative to dataset) for the unlabeled data.
indices_labeled (np.ndarray) – Indices (relative to dataset) for the labeled data.
n (int) – Size of the coreset (in number of instances).
distance_metric ({'cosine', 'euclidean'}) – Distance metric to be used.
batch_size (int) – Batch size.
normalized (bool) – If True the data x is assumed to be normalized, otherwise it will be normalized where necessary.
- Returns
indices – Indices relative to x.
- Return type
numpy.ndarray
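A usage sketch, with random vectors standing in for document embeddings:

import numpy as np

from small_text.query_strategies.coresets import greedy_coreset

x = np.random.rand(100, 32)             # row-wise vector representations
indices_labeled = np.arange(10)         # first 10 rows are "labeled"
indices_unlabeled = np.arange(10, 100)  # remaining rows form the pool

indices = greedy_coreset(x, indices_unlabeled, indices_labeled, n=5,
                         distance_metric='euclidean')
print(indices)  # 5 indices relative to x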
- small_text.query_strategies.coresets.lightweight_coreset(x, x_mean, n, normalized=False, proba=None)[source]
Computes a lightweight coreset [BLK18] of x with size n.
- Parameters
x (np.ndarray) – 2D array in which each row represents a sample.
x_mean (np.ndarray) – Elementwise mean over the columns of x.
n (int) – Coreset size.
normalized (bool) – If True the data x is assumed to be normalized, otherwise it will be normalized where necessary.
proba (np.ndarray or None) – A probability distribution over x, which makes up half of the probability mass of the sampling distribution. If proba is None, a uniform distribution is used.
- Returns
indices – Indices relative to x.
- Return type
numpy.ndarray
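And a corresponding usage sketch:

import numpy as np

from small_text.query_strategies.coresets import lightweight_coreset

x = np.random.rand(100, 32)  # one sample per row
indices = lightweight_coreset(x, x.mean(axis=0), n=5)
print(indices)  # 5 indices relative to x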