Query Strategies

Query strategies select data samples from the set of unlabeled data.

Overview

General

LeastConfidence
PredictionEntropy
BreakingTies
EmbeddingKMeans
RandomSampling

PyTorch

ExpectedGradientLength
ExpectedGradientLengthMaxWord
ExpectedGradientLengthLayer
BADGE

Classes

class small_text.query_strategies.LeastConfidence

Selects instances with the least prediction confidence (regarding the most likely class) [LG94].

References

LG94: David D. Lewis and William A. Gale. A sequential algorithm for training text classifiers. In SIGIR’94, 1994, 3–12.

get_confidence(clf, x, _x_indices_unlabeled, _x_indices_labeled, _y)

Computes a confidence score for each given instance.

Parameters: x (ndarray) – Instances for which the confidence should be computed.
Returns: confidence – A 2D numpy array (of type float) in the shape (n_samples, n_classes).
Return type: ndarray

query(clf, x, x_indices_unlabeled, x_indices_labeled, y, n=10)

A query selects instances from the unlabeled pool.

Parameters

clf (small_text.classifiers.Classifier) – A classifier.
x (small_text.data.datasets.Dataset) – A dataset.
x_indices_unlabeled (list of int) – Indices (relative to x) for the unlabeled data.
x_indices_labeled (list of int) – Indices (relative to x) for the labeled data.
y (list of int) – List of labels where each label maps by index position to x_indices_labeled.
n (int) – Number of samples to query.

Returns

indices – Indices of the selected instances, relative to the dataset x.

Return type

numpy.ndarray
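
As a rough illustration of the least-confidence criterion (not the library's implementation), the following NumPy sketch assumes a precomputed probability matrix proba of shape (n_samples, n_classes) for the unlabeled pool. The function name least_confidence_indices is hypothetical.

```python
import numpy as np


def least_confidence_indices(proba, n=10):
    """Return positions of the n rows whose top-class probability is lowest.

    Hypothetical helper for illustration; `proba` is assumed to hold one
    row of class probabilities per unlabeled instance.
    """
    proba = np.asarray(proba, dtype=float)
    # Confidence of an instance = probability of its most likely class.
    confidence = proba.max(axis=1)
    # Sort ascending (stable) and keep the n least confident instances.
    return np.argsort(confidence, kind="stable")[:n]


proba = np.array([
    [0.90, 0.05, 0.05],  # very confident
    [0.40, 0.35, 0.25],  # least confident
    [0.60, 0.30, 0.10],
])
selected = least_confidence_indices(proba, n=2)
```

The returned positions index the rows of proba. In a pool-based setting they would still have to be mapped through x_indices_unlabeled to obtain indices relative to x.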

class small_text.query_strategies.PredictionEntropy

Selects instances with the largest prediction entropy [HOL08].

References

HOL08: Alex Holub, Pietro Perona, and Michael C. Burl. 2008. Entropy-based active learning for object recognition. In 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, IEEE, 1–8.

get_confidence(clf, x, _x_indices_unlabeled, _x_indices_labeled, _y)

Computes a confidence score for each given instance.

Parameters: x (ndarray) – Instances for which the confidence should be computed.
Returns: confidence – A 2D numpy array (of type float) in the shape (n_samples, n_classes).
Return type: ndarray

query(clf, x, x_indices_unlabeled, x_indices_labeled, y, n=10)

A query selects instances from the unlabeled pool.

Parameters

clf (small_text.classifiers.Classifier) – A classifier.
x (small_text.data.datasets.Dataset) – A dataset.
x_indices_unlabeled (list of int) – Indices (relative to x) for the unlabeled data.
x_indices_labeled (list of int) – Indices (relative to x) for the labeled data.
y (list of int) – List of labels where each label maps by index position to x_indices_labeled.
n (int) – Number of samples to query.

Returns

indices – Indices of the selected instances, relative to the dataset x.

Return type

numpy.ndarray
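
The entropy criterion can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the library's code: it assumes a probability matrix proba of shape (n_samples, n_classes), and the helper names prediction_entropy and largest_entropy_indices are hypothetical.

```python
import numpy as np


def prediction_entropy(proba, eps=1e-12):
    """Shannon entropy (natural log) of each row of class probabilities."""
    proba = np.asarray(proba, dtype=float)
    # Clip inside the logarithm so that zero probabilities do not yield -inf.
    clipped = np.clip(proba, eps, 1.0)
    return -np.sum(proba * np.log(clipped), axis=1)


def largest_entropy_indices(proba, n=10):
    """Return positions of the n rows with the largest prediction entropy."""
    entropy = prediction_entropy(proba)
    # Negate to sort in descending order of entropy.
    return np.argsort(-entropy, kind="stable")[:n]


proba = np.array([
    [1.0, 0.0, 0.0],        # fully certain prediction
    [1 / 3, 1 / 3, 1 / 3],  # uniform prediction, maximal entropy
    [0.7, 0.2, 0.1],
])
selected = largest_entropy_indices(proba, n=2)
```

Unlike least confidence, the entropy takes the whole predicted distribution into account, not only the most likely class.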

class small_text.query_strategies.EmbeddingKMeans(normalize=True)

This is a generalized version of BERT-K-Means [YLB20], which is applicable to any kind of dense embedding, regardless of the classifier.

References

YLB20: Michelle Yuan, Hsuan-Tien Lin, and Jordan Boyd-Graber. 2020. Cold-start Active Learning through Self-supervised Language Modeling. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 7935–7948.

sample(clf, x, x_indices_unlabeled, x_indices_labeled, y, n, embeddings, embeddings_proba=None)

Samples from the given embeddings.

Parameters

clf (small_text.classifiers.Classifier) – A classifier.
x (ndarray) – A dataset containing the instances for which the score should be computed.
x_indices_unlabeled (ndarray) – Indices (relative to x) for the unlabeled data.
x_indices_labeled (ndarray) – Indices (relative to x) for the labeled data.
y (ndarray or list of int) – List of labels where each label maps by index position to x_indices_labeled.
n (int) – Number of samples to query.
embeddings (ndarray) – Embeddings for each sample in x.

Returns

indices – A numpy array of selected indices (relative to x_indices_unlabeled).

Return type

ndarray

query(clf, x, x_indices_unlabeled, x_indices_labeled, y, n=10, pbar='tqdm', embeddings=None, embed_kwargs={})

A query selects instances from the unlabeled pool.

Parameters

clf (small_text.classifiers.Classifier) – A classifier.
x (small_text.data.datasets.Dataset) – A dataset.
x_indices_unlabeled (list of int) – Indices (relative to x) for the unlabeled data.
x_indices_labeled (list of int) – Indices (relative to x) for the labeled data.
y (list of int) – List of labels where each label maps by index position to x_indices_labeled.
n (int) – Number of samples to query.

Returns

indices – Indices of the selected instances, relative to the dataset x.

Return type

numpy.ndarray
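
The general idea behind k-means-based selection can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's implementation: it assumes the embeddings are optionally L2-normalized, clustered into n clusters with a plain Lloyd-style k-means, and that the instance nearest to each cluster center is selected. The function kmeans_select and its n_iter and seed parameters are hypothetical.

```python
import numpy as np


def kmeans_select(embeddings, n, normalize=True, n_iter=20, seed=0):
    """Pick up to n instances, one nearest to each of n k-means centers."""
    emb = np.asarray(embeddings, dtype=float)
    if normalize:
        # L2-normalize each embedding (guarding against zero vectors).
        norms = np.linalg.norm(emb, axis=1, keepdims=True)
        emb = emb / np.maximum(norms, 1e-12)

    rng = np.random.default_rng(seed)
    # Initialize the centers with n distinct data points (copied).
    centers = emb[rng.choice(len(emb), size=n, replace=False)]

    for _ in range(n_iter):
        # Squared Euclidean distances, shape (n_samples, n).
        dist = ((emb[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assignment = dist.argmin(axis=1)
        for k in range(n):
            members = emb[assignment == k]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[k] = members.mean(axis=0)

    dist = ((emb[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # For every center take its nearest instance; drop duplicates.
    return np.unique(dist.argmin(axis=0))


embeddings = np.array([
    [1.0, 0.0], [1.0, 0.1], [1.0, -0.1],   # first group
    [0.0, 1.0], [0.1, 1.0], [-0.1, 1.0],   # second group
])
selected = kmeans_select(embeddings, n=2)
```

Because two centers can share the same nearest instance, this sketch may return fewer than n positions. The positions refer to rows of the embedding matrix passed in.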

class small_text.query_strategies.RandomSampling

Randomly selects instances.

query(clf, x, x_indices_unlabeled, x_indices_labeled, y, n=10)

A query selects instances from the unlabeled pool.

Parameters

clf (small_text.classifiers.Classifier) – A classifier.
x (small_text.data.datasets.Dataset) – A dataset.
x_indices_unlabeled (list of int) – Indices (relative to x) for the unlabeled data.
x_indices_labeled (list of int) – Indices (relative to x) for the labeled data.
y (list of int) – List of labels where each label maps by index position to x_indices_labeled.
n (int) – Number of samples to query.

Returns

indices – Indices of the selected instances, relative to the dataset x.

Return type

numpy.ndarray
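
Random sampling serves as the usual baseline. A minimal sketch, assuming that instances are drawn uniformly without replacement from the unlabeled indices, could look like this. The function random_sample and its seed parameter are hypothetical and not part of the library.

```python
import numpy as np


def random_sample(x_indices_unlabeled, n=10, seed=None):
    """Draw n distinct indices uniformly at random from the unlabeled pool."""
    rng = np.random.default_rng(seed)
    pool = np.asarray(x_indices_unlabeled)
    # replace=False guarantees that no index is selected twice.
    return rng.choice(pool, size=n, replace=False)


pool = [3, 8, 15, 16, 23, 42]
selected = random_sample(pool, n=3, seed=0)
```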

class small_text.integrations.pytorch.query_strategies.ExpectedGradientLength(num_classes, batch_size=50, device='cuda', pbar='tqdm')

Selects instances by expected gradient length [Set07].

References

Set07: Burr Settles, Mark Craven, and Soumya Ray. 2007. Multiple-instance active learning. In Proceedings of the 20th International Conference on Neural Information Processing Systems (NIPS’07). Curran Associates Inc., Red Hook, 1289–1296.

query(clf, x, x_indices_unlabeled, x_indices_labeled, y, n=10, pbar=None)

A query selects instances from the unlabeled pool.

Parameters

clf (small_text.classifiers.Classifier) – A classifier.
x (small_text.data.datasets.Dataset) – A dataset.
x_indices_unlabeled (list of int) – Indices (relative to x) for the unlabeled data.
x_indices_labeled (list of int) – Indices (relative to x) for the labeled data.
y (list of int) – List of labels where each label maps by index position to x_indices_labeled.
n (int) – Number of samples to query.

Returns

indices – Indices of the selected instances, relative to the dataset x.

Return type

numpy.ndarray
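
To make the expected gradient length criterion concrete, the sketch below computes it for a plain softmax (multinomial logistic regression) classifier in NumPy. It is an illustration under stated assumptions and not the PyTorch implementation of this class: the model, the weight matrix W, and the function names are hypothetical. For this model the cross-entropy gradient with respect to W for a hypothesized label y is the outer product of (p - e_y) and x, so its norm is ||p - e_y|| · ||x||. The score is the expectation of that norm under the predicted distribution p.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def expected_gradient_length(X, W):
    """EGL scores for a softmax classifier.

    X: (n_samples, n_features) instances, W: (n_classes, n_features) weights.
    """
    proba = softmax(X @ W.T)             # (n_samples, n_classes)
    x_norm = np.linalg.norm(X, axis=1)   # ||x|| per instance
    scores = np.zeros(len(X))
    for y in range(W.shape[0]):
        # Residual p - e_y for the hypothesized label y.
        residual = proba.copy()
        residual[:, y] -= 1.0
        grad_norm = np.linalg.norm(residual, axis=1) * x_norm
        # Weight each gradient norm by the predicted probability of y.
        scores += proba[:, y] * grad_norm
    return scores


X = np.array([[3.0, 4.0], [0.0, 1.0]])
W = np.zeros((2, 2))  # untrained model: uniform predictions
scores = expected_gradient_length(X, W)
best = int(np.argmax(scores))
```

Instances with the highest scores would change the model the most if labeled, which is why they are the ones selected.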

class small_text.integrations.pytorch.query_strategies.ExpectedGradientLengthMaxWord(num_classes, layer_name, batch_size=50, device='cuda')

Selects instances using the EGL-word model [ZLW17].

The EGL-word model works as follows:

For every instance and class the gradient norm is computed per word. The score for each (instance, class) pair is the norm of the word with the highest gradient norm value.
These scores are then summed up over all classes. The result is one score per instance.

Finally, the instances are selected by maximum score.
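
The scoring steps above can be sketched in NumPy, assuming the per-word gradient norms have already been computed and stored in an array of shape (n_samples, n_classes, n_words). Obtaining these norms from the embedding layer is the model-specific part and is omitted here. The function names are hypothetical and this is not the library's implementation.

```python
import numpy as np


def egl_word_scores(word_grad_norms):
    """One score per instance from per-word gradient norms.

    word_grad_norms: array of shape (n_samples, n_classes, n_words).
    """
    norms = np.asarray(word_grad_norms, dtype=float)
    # Per (instance, class): the norm of the word with the highest value.
    per_class = norms.max(axis=2)
    # Sum over all classes to obtain one score per instance.
    return per_class.sum(axis=1)


def select_by_max_score(scores, n=10):
    """Return positions of the n instances with the highest scores."""
    return np.argsort(-scores, kind="stable")[:n]


word_grad_norms = np.array([
    [[0.1, 0.2], [0.3, 0.1]],  # instance 0: 0.2 + 0.3
    [[0.9, 0.1], [0.2, 0.8]],  # instance 1: 0.9 + 0.8
    [[0.4, 0.4], [0.5, 0.6]],  # instance 2: 0.4 + 0.6
])
scores = egl_word_scores(word_grad_norms)
selected = select_by_max_score(scores, n=2)
```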

Notes

An embedding layer is required for this strategy.
This strategy was designed for the KimCNN model and might not work for other models, even if they possess an embedding layer.

References

ZLW17: Ye Zhang, Matthew Lease, and Byron C. Wallace. 2017. Active discriminative text representation learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI’17). AAAI Press, 3386–3392.

query(clf, x, x_indices_unlabeled, x_indices_labeled, y, n=10, pbar=None)

A query selects instances from the unlabeled pool.

Parameters

clf (small_text.classifiers.Classifier) – A classifier.
x (small_text.data.datasets.Dataset) – A dataset.
x_indices_unlabeled (list of int) – Indices (relative to x) for the unlabeled data.
x_indices_labeled (list of int) – Indices (relative to x) for the labeled data.
y (list of int) – List of labels where each label maps by index position to x_indices_labeled.
n (int) – Number of samples to query.

Returns

indices – Indices of the selected instances, relative to the dataset x.

Return type

numpy.ndarray

class small_text.integrations.pytorch.query_strategies.ExpectedGradientLengthLayer(num_classes, layer_name, batch_size=50)

query(clf, x, x_indices_unlabeled, x_indices_labeled, y, n=10, pbar=None)

A query selects instances from the unlabeled pool.

Parameters

clf (small_text.classifiers.Classifier) – A classifier.
x (small_text.data.datasets.Dataset) – A dataset.
x_indices_unlabeled (list of int) – Indices (relative to x) for the unlabeled data.
x_indices_labeled (list of int) – Indices (relative to x) for the labeled data.
y (list of int) – List of labels where each label maps by index position to x_indices_labeled.
n (int) – Number of samples to query.

Returns

indices – Indices of the selected instances, relative to the dataset x.

Return type

numpy.ndarray