Query Strategies
Query strategies select data samples from the set of unlabeled data, i.e. they decide which samples should be labeled next.
Overview
You can use the following pre-implemented query strategies:
General
Pytorch
Interface
The query strategy interface revolves around the query() method.
A query strategy can make use of any of the given positional arguments but does not need to.
class QueryStrategy(ABC):
    """Abstract base class for Query Strategies."""

    @abstractmethod
    def query(self, clf, dataset, indices_unlabeled, indices_labeled, y, n=10):
        """
        Queries instances from the unlabeled pool.

        A query selects instances from the unlabeled pool.

        Parameters
        ----------
        clf : small_text.classifiers.Classifier
            A text classifier.
        dataset : small_text.data.datasets.Dataset
            A text dataset.
        indices_unlabeled : np.ndarray[int]
            Indices (relative to `dataset`) for the unlabeled data.
        indices_labeled : np.ndarray[int]
            Indices (relative to `dataset`) for the labeled data.
        y : np.ndarray[int] or csr_matrix
            List of labels where each label maps by index position to `indices_labeled`.
        n : int
            Number of samples to query.

        Returns
        -------
        indices : numpy.ndarray
            Indices relative to `dataset` which were selected.
        """
        pass

    @staticmethod
    def _validate_query_input(indices_unlabeled, n):
        if len(indices_unlabeled) == 0:
            raise EmptyPoolException('No unlabeled indices available. Cannot query an empty pool.')

        if n > len(indices_unlabeled):
            raise PoolExhaustedException('Pool exhausted: {} available / {} requested'
                                         .format(len(indices_unlabeled), n))
A query strategy can use the classifier clf to make a decision.
The full dataset (i.e. all samples, regardless of whether they are labeled or not) is given by dataset.
The partition into labeled and unlabeled data is handled indirectly via indices (indices_unlabeled and indices_labeled).
Note
The indices taken together are not necessarily complete, i.e. indices_unlabeled and indices_labeled combined need not cover every sample in dataset. This allows the active learner to ignore samples which should remain part of the dataset but are not suited for active learning.
The argument y represents the current labels.
The number of samples to query can be controlled with the keyword argument n.
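To illustrate how these arguments fit together, here is a minimal sketch of a custom strategy that samples uniformly at random from the unlabeled pool. It deliberately does not import small_text: QueryStrategySketch is a hypothetical stand-in for the abstract class shown above, plain Python lists stand in for NumPy arrays, and the built-in ValueError stands in for the library's pool exceptions. It is an illustration under those assumptions, not the library's implementation.

```python
import random
from abc import ABC, abstractmethod


class QueryStrategySketch(ABC):
    """Hypothetical stand-in mirroring the query() interface shown above."""

    @abstractmethod
    def query(self, clf, dataset, indices_unlabeled, indices_labeled, y, n=10):
        pass


class RandomSamplingSketch(QueryStrategySketch):
    """Ignores clf and y; draws n distinct indices from the unlabeled pool."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)

    def query(self, clf, dataset, indices_unlabeled, indices_labeled, y, n=10):
        # Same two checks as _validate_query_input, with a stand-in exception type.
        if len(indices_unlabeled) == 0:
            raise ValueError('Cannot query an empty pool.')
        if n > len(indices_unlabeled):
            raise ValueError('Pool exhausted.')
        # The returned values are indices relative to dataset, not pool positions.
        return self._rng.sample(list(indices_unlabeled), n)


dataset = ['doc a', 'doc b', 'doc c', 'doc d', 'doc e', 'doc f']
indices_labeled = [0, 3]
y = [1, 0]                     # y[i] is the label of dataset[indices_labeled[i]]
indices_unlabeled = [1, 2, 5]  # index 4 is in neither set and is never queried

strategy = RandomSamplingSketch(seed=42)
selected = strategy.query(None, dataset, indices_unlabeled, indices_labeled, y, n=2)
```

Note that index 4 appears in neither index set, so it stays in the dataset but can never be selected.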
Helpers
Some query strategies may be formulated so that they are only applicable to either single-label or multi-label data. As a safeguard against using such strategies on unsupported data, the constraints() decorator intercepts calls to query(). If the given labels cannot be handled, a RuntimeError is raised.
Note
For the pre-implemented query strategies, do not take the absence of a constraint as an indicator of capability: the main library uses constraints sparingly in order to not restrict the user unnecessarily. For your own projects and applications, however, using them is highly recommended.
Constraints
from small_text.query_strategies import constraints, QueryStrategy


@constraints(classification_type='single-label')
class MyQueryStrategy(QueryStrategy):
    pass
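The following sketch shows one way such a guard could work. It is not the small_text implementation: constraints_sketch and FirstN are hypothetical names, and the sketch assumes that multi-label targets arrive as a list of label lists (the library itself documents csr_matrix for y in the multi-label case).

```python
from functools import wraps


def constraints_sketch(classification_type):
    """Hypothetical class decorator; not the small_text implementation."""
    if classification_type not in ('single-label', 'multi-label'):
        raise ValueError('Unknown classification_type: ' + str(classification_type))

    def decorate(cls):
        original_query = cls.query

        @wraps(original_query)
        def guarded_query(self, clf, dataset, indices_unlabeled, indices_labeled, y, n=10):
            # Assumption for this sketch: multi-label y is a list of label lists.
            is_multi_label = any(isinstance(labels, (list, tuple, set)) for labels in y)
            actual = 'multi-label' if is_multi_label else 'single-label'
            if actual != classification_type:
                raise RuntimeError('Strategy supports {} data only, got {} labels.'
                                   .format(classification_type, actual))
            return original_query(self, clf, dataset, indices_unlabeled,
                                  indices_labeled, y, n=n)

        # Replace query() so every call passes through the check first.
        cls.query = guarded_query
        return cls

    return decorate


@constraints_sketch(classification_type='single-label')
class FirstN:
    """Toy strategy that returns the first n unlabeled indices."""

    def query(self, clf, dataset, indices_unlabeled, indices_labeled, y, n=10):
        return list(indices_unlabeled)[:n]


single_label_result = FirstN().query(None, None, [4, 5, 6], [0, 1], [1, 0], n=2)

try:
    FirstN().query(None, None, [4, 5, 6], [0, 1], [[0, 2], [1]], n=2)
    multi_label_rejected = False
except RuntimeError:
    multi_label_rejected = True
```

The check runs before the wrapped query(), so an unsupported label format fails early instead of producing a silently meaningless selection.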
Classes
Base
- class small_text.query_strategies.strategies.LeastConfidence[source]
Selects instances with the least prediction confidence (regarding the most likely class) [LG94].
- class small_text.query_strategies.strategies.PredictionEntropy[source]
Selects instances with the largest prediction entropy [HOL08].
- class small_text.query_strategies.strategies.BreakingTies[source]
Selects instances which have a small margin between their most likely and second most likely predicted class [LUO05].
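The three uncertainty-based strategies above differ only in how they score a predicted class distribution. The sketch below computes each score with the standard library; the function names and the probability values are made up for illustration, and the library's own classes may differ in details such as tie handling.

```python
import math


def least_confidence(proba):
    """1 - probability of the most likely class (higher = less confident)."""
    return [1.0 - max(row) for row in proba]


def prediction_entropy(proba):
    """Shannon entropy (natural log) of each predicted distribution."""
    return [-sum(p * math.log(p) for p in row if p > 0.0) for row in proba]


def breaking_ties_margin(proba):
    """Gap between the two largest class probabilities (smaller = more uncertain)."""
    margins = []
    for row in proba:
        top, second = sorted(row, reverse=True)[:2]
        margins.append(top - second)
    return margins


def top_n(scores, n, largest=True):
    """Positions of the n largest (or smallest) scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=largest)
    return order[:n]


# Hypothetical class probabilities for four unlabeled instances.
proba = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.50, 0.49, 0.01],
    [0.36, 0.33, 0.31],
]

by_least_confidence = top_n(least_confidence(proba), n=2)
by_entropy = top_n(prediction_entropy(proba), n=2)
by_margin = top_n(breaking_ties_margin(proba), n=2, largest=False)
```

The results are positions within proba; in a query strategy they would still have to be mapped back through indices_unlabeled to obtain indices relative to the dataset. The third row shows where the criteria diverge: its margin is the smallest, yet its entropy and least-confidence scores are moderate.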
- class small_text.query_strategies.strategies.EmbeddingKMeans(normalize=True)[source]
This is a generalized version of BERT-K-Means [YLB20], which is applicable to any kind of dense embedding, regardless of the classifier.
- __init__(normalize=True)
- Parameters
normalize (bool, default=True) – Embeddings will be L2 normalized if True, otherwise they remain unchanged.
- class small_text.query_strategies.coresets.GreedyCoreset(normalize=True, batch_size=100)[source]
- __init__(normalize=True, batch_size=100)
- class small_text.query_strategies.coresets.LightweightCoreset(normalize=True)[source]
Selects instances using the lightweight coreset method [BAC18].
- Parameters
normalize (bool) – Embeddings are normalized if True, otherwise they are left unchanged.
- __init__(normalize=True)
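As a rough illustration of the idea behind lightweight coresets, the sketch below computes the sampling distribution usually associated with the method: a 50/50 mixture of a uniform term and a term proportional to the squared distance from the mean embedding. The function name and the data are hypothetical, and how the library turns this distribution into a selection is not covered here.

```python
def lightweight_coreset_distribution(embeddings):
    """Sampling probability per instance: half uniform, half distance-to-mean."""
    n = len(embeddings)
    dim = len(embeddings[0])
    mean = [sum(e[d] for e in embeddings) / n for d in range(dim)]
    sq_dists = [sum((e[d] - mean[d]) ** 2 for d in range(dim)) for e in embeddings]
    total = sum(sq_dists)
    if total == 0.0:
        # All embeddings coincide with the mean: fall back to uniform.
        return [1.0 / n] * n
    return [0.5 / n + 0.5 * sd / total for sd in sq_dists]


# Hypothetical 2-d embeddings; the last point lies far from the others.
embeddings = [[0.0, 0.0], [0.0, 2.0], [2.0, 0.0], [2.0, 2.0], [10.0, 10.0]]
q = lightweight_coreset_distribution(embeddings)
most_likely = max(range(len(q)), key=lambda i: q[i])
```

The uniform half guarantees every instance a minimum probability of 0.5 / n, while the distance half favours instances far from the mean.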
- class small_text.query_strategies.strategies.ContrastiveActiveLearning(k=10, embed_kwargs={}, normalize=True, batch_size=100, pbar='tqdm')[source]
Contrastive Active Learning [MVB+21] selects instances whose k-nearest neighbours exhibit the largest mean Kullback-Leibler divergence.
- __init__(k=10, embed_kwargs={}, normalize=True, batch_size=100, pbar='tqdm')
- Parameters
k (int) – Number of nearest neighbours whose KL divergence is considered.
embed_kwargs (dict) – Embedding keyword args which are passed to clf.embed().
normalize (bool, default=True) – Embeddings will be L2 normalized if True, otherwise they remain unchanged.
batch_size (int, default=100) – Batch size which is used to process the embeddings.
pbar ('tqdm' or None, default='tqdm') – Displays a progress bar if ‘tqdm’ is passed.
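The scoring step can be sketched as follows. This is a simplified illustration, not the library's code: it assumes that neighbours are taken from the labeled pool and that the divergence is computed from each neighbour's predicted distribution to the candidate's, it uses a brute-force neighbour search, and all names and values are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0.0)


def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def contrastive_scores(emb_unlabeled, proba_unlabeled, emb_labeled, proba_labeled, k):
    """Mean KL divergence between each candidate and its k nearest labeled neighbours."""
    scores = []
    for emb, proba in zip(emb_unlabeled, proba_unlabeled):
        # Brute-force nearest neighbour search; fine for a toy example.
        nearest = sorted(range(len(emb_labeled)),
                         key=lambda j: squared_distance(emb, emb_labeled[j]))[:k]
        divergences = [kl_divergence(proba_labeled[j], proba) for j in nearest]
        scores.append(sum(divergences) / len(divergences))
    return scores


# Hypothetical embeddings and predicted class distributions.
emb_labeled = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
proba_labeled = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
emb_unlabeled = [[0.1, 0.1], [0.2, 0.5]]
proba_unlabeled = [[0.85, 0.15], [0.2, 0.8]]

scores = contrastive_scores(emb_unlabeled, proba_unlabeled, emb_labeled, proba_labeled, k=2)
```

Both candidates share the same two neighbours, but the second one is predicted very differently from them and therefore receives the higher score.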
- class small_text.query_strategies.strategies.DiscriminativeActiveLearning(classifier_factory, num_iterations, unlabeled_factor=10, pbar='tqdm')[source]
Discriminative Active Learning [GS19] learns to differentiate between the labeled and unlabeled pool and selects the instances that are most likely to belong to the unlabeled pool.
- __init__(classifier_factory, num_iterations, unlabeled_factor=10, pbar='tqdm')
- Parameters
classifier_factory (small_text.) – Classifier factory which is used for the discriminative classifiers.
num_iterations (int) – Number of iterations for the discriminative training.
unlabeled_factor (int, default=10) – The ratio of “unlabeled pool” instances to “labeled pool” instances in the discriminative training.
pbar ('tqdm' or None, default='tqdm') – Displays a progress bar if ‘tqdm’ is passed.
- class small_text.query_strategies.strategies.SEALS(base_query_strategy, k=100, hnsw_kwargs={}, embed_kwargs={}, normalize=True)[source]
Similarity Search for Efficient Active Learning and Search of Rare Concepts (SEALS) improves the computational efficiency of active learning by presenting a reduced subset of the unlabeled pool to a base strategy [CCK+21].
This method is to be applied in conjunction with a base query strategy. SEALS selects a subset of the unlabeled pool by selecting the k nearest neighbours of the current labeled pool.
If the size of the unlabeled pool falls below the given k, this implementation will not select a subset anymore and will just delegate to the base strategy instead.
Note
This strategy requires the optional dependency hnswlib.
- __init__(base_query_strategy, k=100, hnsw_kwargs={}, embed_kwargs={}, normalize=True)
- Parameters
base_query_strategy (small_text.query_strategies.QueryStrategy) – A base query strategy which operates on the subset that is selected by SEALS.
k (int, default=100) – Number of nearest neighbors that will be selected.
hnsw_kwargs (dict, default=dict()) – Kwargs which will be passed to the underlying hnsw index. See the hnswlib GitHub repository for details on the parameters space, ef_construction, ef, and M.
embed_kwargs (dict, default=dict()) – Kwargs that will be passed to the embed() method.
normalize (bool, default=True) – Embeddings will be L2 normalized if True, otherwise they remain unchanged.
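The subset selection described for SEALS can be sketched without hnswlib by swapping the approximate index for a brute-force neighbour search. The function name and data below are hypothetical; the sketch only illustrates how the candidate pool shrinks before a base strategy sees it, including the fallback when the pool is smaller than k.

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def seals_candidates(embeddings, indices_unlabeled, indices_labeled, k):
    """Union of the k nearest unlabeled neighbours of every labeled instance."""
    if len(indices_unlabeled) < k:
        # Pool smaller than k: no restriction, hand the full pool to the base strategy.
        return sorted(indices_unlabeled)
    candidates = set()
    for i in indices_labeled:
        # Brute-force search stands in for the hnswlib index used by the library.
        nearest = sorted(indices_unlabeled,
                         key=lambda j: squared_distance(embeddings[i], embeddings[j]))[:k]
        candidates.update(nearest)
    return sorted(candidates)


# Hypothetical 1-d embeddings for seven instances.
embeddings = [[0.0], [0.1], [0.2], [5.0], [5.1], [9.0], [9.5]]
indices_labeled = [0, 3]
indices_unlabeled = [1, 2, 4, 5, 6]

candidates = seals_candidates(embeddings, indices_unlabeled, indices_labeled, k=2)
fallback = seals_candidates(embeddings, [1, 2], indices_labeled, k=5)
```

A base strategy would then be queried with candidates in place of indices_unlabeled; here index 6 is never scored because it is not among the two nearest neighbours of any labeled instance.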
Pytorch Integration
- class small_text.integrations.pytorch.query_strategies.strategies.ExpectedGradientLength(num_classes, batch_size=50, device='cuda', pbar='tqdm')[source]
Selects instances by expected gradient length [Set07].
- __init__(num_classes, batch_size=50, device='cuda', pbar='tqdm')
- Parameters
num_classes (int) – Number of classes.
batch_size (int, default=50) – Batch size in which the query strategy scores the instances.
device (str or torch.device, default='cuda') – Torch device on which the computation will be performed.
pbar ('tqdm' or None, default='tqdm') – Displays a progress bar if ‘tqdm’ is passed.
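Expected gradient length weights the gradient norm that each possible label would induce by the model's predicted probability for that label. The library computes these gradients with PyTorch on the actual model; the sketch below instead assumes a linear softmax classifier without bias and cross-entropy loss, for which the gradient norm has the closed form ||p - e_y|| * ||x||. All names and values are hypothetical.

```python
import math


def softmax(logits):
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_gradient_length(x, weights):
    """EGL of input x for a linear softmax classifier with weight rows `weights`."""
    logits = [sum(w * v for w, v in zip(row, x)) for row in weights]
    proba = softmax(logits)
    x_norm = math.sqrt(sum(v * v for v in x))
    egl = 0.0
    for y, p_y in enumerate(proba):
        # Gradient of the loss w.r.t. the weights is outer(p - e_y, x),
        # whose Frobenius norm factorizes into ||p - e_y|| * ||x||.
        residual = [p_c - (1.0 if c == y else 0.0) for c, p_c in enumerate(proba)]
        grad_norm = math.sqrt(sum(r * r for r in residual)) * x_norm
        egl += p_y * grad_norm
    return egl


weights = [[2.0, 0.0], [-2.0, 0.0]]
egl_confident = expected_gradient_length([3.0, 0.0], weights)  # logits far apart
egl_uncertain = expected_gradient_length([0.0, 3.0], weights)  # logits tied
```

Both inputs have the same norm, so the difference in score comes entirely from how uncertain the prediction is.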
- class small_text.integrations.pytorch.query_strategies.strategies.ExpectedGradientLengthMaxWord(num_classes, layer_name, batch_size=50, device='cuda')[source]
Selects instances using the EGL-word strategy [ZLW17].
The EGL-word strategy works as follows:
For every instance and class the gradient norm is computed per word. The score for each (instance, class) pair is the norm of the word with the highest gradient norm value.
These scores are then summed up over all classes. The result is one score per instance.
Finally, the instances are selected by maximum score.
Notes
An embedding layer is required for this strategy.
This strategy was designed for the KimCNN model and might not work for other models even if they possess an embedding layer.
- __init__(num_classes, layer_name, batch_size=50, device='cuda')
- Parameters
num_classes (int) – Number of classes.
layer_name (str) – Name of the embedding layer.
batch_size (int) – Batch size.
device (str or torch.device) – Torch device on which the computation will be performed.
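The aggregation steps described for EGL-word (maximum over words, sum over classes, selection by maximum score) can be sketched as follows. The per-word gradient norms are made-up numbers; in the actual strategy they would come from the gradients of the embedding layer, which this sketch does not compute.

```python
def egl_word_scores(word_grad_norms):
    """word_grad_norms[i][c][w]: gradient norm of word w for instance i and class c."""
    scores = []
    for per_class in word_grad_norms:
        # Max over words for each class, then sum over classes.
        scores.append(sum(max(per_word) for per_word in per_class))
    return scores


def select_top(scores, n):
    """Positions of the n highest-scoring instances."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:n]


# Hypothetical norms: 3 instances, 2 classes, sentences of different length.
word_grad_norms = [
    [[0.1, 0.4, 0.2], [0.3, 0.1, 0.1]],            # 0.4 + 0.3
    [[0.9, 0.2], [0.8, 0.1]],                      # 0.9 + 0.8
    [[0.5, 0.5, 0.6, 0.1], [0.2, 0.7, 0.1, 0.1]],  # 0.6 + 0.7
]

scores = egl_word_scores(word_grad_norms)
selected = select_top(scores, n=2)
```

Because only the single strongest word per class contributes, sentence length by itself does not inflate the score.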
- class small_text.integrations.pytorch.query_strategies.strategies.ExpectedGradientLengthLayer(num_classes, layer_name, batch_size=50)[source]
An EGL variant that is restricted to the gradients of a single layer.
This is a generalized version of the EGL-sm strategy [ZLW17], but instead of being restricted to the last layer it operates on the layer name passed to the constructor.
- __init__(num_classes, layer_name, batch_size=50)
- Parameters
num_classes (int) – Number of classes.
layer_name (str) – Name of the target layer.
batch_size (int, default=50) – Batch size in which the query strategy scores the instances.