Query Strategies
Query strategies select data samples from the set of unlabeled data.
Overview
General
BreakingTies
PyTorch
ExpectedGradientLength
ExpectedGradientLengthMaxWord
ExpectedGradientLengthLayer
BADGE
Classes
- class small_text.query_strategies.LeastConfidence
Selects instances with the least prediction confidence (regarding the most likely class) [LG94].
References
- LG94
David D. Lewis and William A. Gale. 1994. A sequential algorithm for training text classifiers. In SIGIR’94, 3–12.
- get_confidence(clf, x, _x_indices_unlabeled, _x_indices_labeled, _y)
Computes a confidence score for each given instance.
- Parameters
x (ndarray) – Instances for which the confidence should be computed.
- Returns
confidence – A 2D numpy array (of type float) in the shape (n_samples, n_classes).
- Return type
ndarray
- query(clf, x, x_indices_unlabeled, x_indices_labeled, y, n=10)
A query selects instances from the unlabeled pool.
- Parameters
clf (small_text.classifiers.Classifier) – A classifier.
x (small_text.data.datasets.Dataset) – A dataset.
x_indices_unlabeled (list of int) – Indices (relative to x) for the unlabeled data.
x_indices_labeled (list of int) – Indices (relative to x) for the labeled data.
y (list of int) – List of labels where each label maps by index position to x_indices_labeled.
n (int) – Number of samples to query.
- Returns
indices – Indices relative to dataset.
- Return type
numpy.ndarray
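The selection rule behind least confidence can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the library's code: the helper name `least_confidence_query` and the assumption that class probabilities are already available as an `(n_samples, n_classes)` array are hypothetical.

```python
import numpy as np


def least_confidence_query(proba, n=10):
    """Return the indices of the n rows whose top-class probability is lowest.

    `proba` is assumed (hypothetically) to hold predicted class
    probabilities with shape (n_samples, n_classes).
    """
    proba = np.asarray(proba, dtype=float)
    # Confidence of an instance = probability of its most likely class.
    confidence = proba.max(axis=1)
    n = min(n, confidence.shape[0])
    # Ascending sort puts the least confident instances first.
    return np.argsort(confidence, kind="stable")[:n]


proba = np.array([[0.9, 0.1], [0.55, 0.45], [0.7, 0.3]])
selected = least_confidence_query(proba, n=2)  # array([1, 2])
```

The returned indices are positions within `proba`; mapping them back to dataset indices is left to the caller.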
- class small_text.query_strategies.PredictionEntropy
Selects instances with the largest prediction entropy [HOL08].
References
- HOL08
Alex Holub, Pietro Perona, and Michael C. Burl. 2008. Entropy-based active learning for object recognition. In 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, IEEE, 1–8.
- get_confidence(clf, x, _x_indices_unlabeled, _x_indices_labeled, _y)
Computes a confidence score for each given instance.
- Parameters
x (ndarray) – Instances for which the confidence should be computed.
- Returns
confidence – A 2D numpy array (of type float) in the shape (n_samples, n_classes).
- Return type
ndarray
- query(clf, x, x_indices_unlabeled, x_indices_labeled, y, n=10)
A query selects instances from the unlabeled pool.
- Parameters
clf (small_text.classifiers.Classifier) – A classifier.
x (small_text.data.datasets.Dataset) – A dataset.
x_indices_unlabeled (list of int) – Indices (relative to x) for the unlabeled data.
x_indices_labeled (list of int) – Indices (relative to x) for the labeled data.
y (list of int) – List of labels where each label maps by index position to x_indices_labeled.
n (int) – Number of samples to query.
- Returns
indices – Indices relative to dataset.
- Return type
numpy.ndarray
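As a rough illustration of entropy-based selection, the sketch below ranks instances by the Shannon entropy of their predicted class distribution. The helper name `prediction_entropy_query`, the `eps` smoothing term, and the precomputed probability matrix are assumptions made for this example, not part of the documented API.

```python
import numpy as np


def prediction_entropy_query(proba, n=10, eps=1e-12):
    """Return the indices of the n rows with the largest prediction entropy.

    `proba` is assumed (hypothetically) to hold predicted class
    probabilities with shape (n_samples, n_classes).
    """
    proba = np.asarray(proba, dtype=float)
    # Shannon entropy per row; eps guards against log(0).
    entropy = -np.sum(proba * np.log(proba + eps), axis=1)
    n = min(n, entropy.shape[0])
    # Negate so that the highest-entropy instances come first.
    return np.argsort(-entropy, kind="stable")[:n]


proba = np.array([[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]])
selected = prediction_entropy_query(proba, n=2)  # array([1, 2])
```

A uniform distribution such as `[0.5, 0.5]` has maximal entropy, so that instance is ranked first.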
- class small_text.query_strategies.EmbeddingKMeans(normalize=True)
This is a generalized version of BERT-K-Means [YLB20], which is applicable to any kind of dense embedding, regardless of the classifier.
References
- YLB20
Michelle Yuan, Hsuan-Tien Lin, and Jordan Boyd-Graber. 2020. Cold-start Active Learning through Self-supervised Language Modeling. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 7935–7948.
- sample(clf, x, x_indices_unlabeled, x_indices_labeled, y, n, embeddings, embeddings_proba=None)
Samples from the given embeddings.
- Parameters
clf (small_text.classifiers.Classifier) – A classifier.
x (ndarray) – A dataset; the instances for which the score should be computed.
x_indices_unlabeled (ndarray) – Indices (relative to x) for the unlabeled data.
x_indices_labeled (ndarray) – Indices (relative to x) for the labeled data.
y (ndarray or list of int) – List of labels where each label maps by index position to x_indices_labeled.
n (int) – Number of samples to select.
embeddings (ndarray) – Embeddings for each sample in x.
- Returns
indices – A numpy array of selected indices (relative to x_indices_unlabeled).
- Return type
ndarray
- query(clf, x, x_indices_unlabeled, x_indices_labeled, y, n=10, pbar='tqdm', embeddings=None, embed_kwargs={})
A query selects instances from the unlabeled pool.
- Parameters
clf (small_text.classifiers.Classifier) – A classifier.
x (small_text.data.datasets.Dataset) – A dataset.
x_indices_unlabeled (list of int) – Indices (relative to x) for the unlabeled data.
x_indices_labeled (list of int) – Indices (relative to x) for the labeled data.
y (list of int) – List of labels where each label maps by index position to x_indices_labeled.
n (int) – Number of samples to query.
- Returns
indices – Indices relative to dataset.
- Return type
numpy.ndarray
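The idea of clustering embeddings and labeling one representative per cluster can be sketched as follows. This is a simplified stand-in under stated assumptions: it uses a small hand-written Lloyd's k-means in NumPy and picks the instance nearest to each cluster center. The function name, the `n_iter` and `seed` parameters, and the nearest-to-center rule are choices made for this sketch and may differ from what the class actually does.

```python
import numpy as np


def embedding_kmeans_sample(embeddings, n, normalize=True, n_iter=20, seed=0):
    """Cluster embeddings into n groups and return one index per cluster.

    Hypothetical helper; requires n <= number of rows in `embeddings`.
    """
    emb = np.asarray(embeddings, dtype=float)
    if normalize:
        # Scale each embedding to unit L2 norm (guarding against zero rows).
        norms = np.linalg.norm(emb, axis=1, keepdims=True)
        emb = emb / np.maximum(norms, 1e-12)

    rng = np.random.default_rng(seed)
    # Initialize the centers with n distinct, randomly chosen embeddings.
    centers = emb[rng.choice(emb.shape[0], size=n, replace=False)]

    for _ in range(n_iter):
        # Distance of every embedding to every center: shape (n_samples, n).
        dists = np.linalg.norm(emb[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for k in range(n):
            members = emb[assign == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)

    # For each center, pick the closest instance. Duplicates are dropped,
    # so fewer than n indices can be returned in degenerate cases.
    dists = np.linalg.norm(emb[:, None, :] - centers[None, :, :], axis=2)
    return np.unique(dists.argmin(axis=0))


embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [1.0, 0.1],
                       [0.0, 1.0], [0.1, 0.9], [0.1, 1.0]])
selected = embedding_kmeans_sample(embeddings, n=2)
```

The returned positions index into `embeddings`, which mirrors the documented convention that `sample` returns indices relative to `x_indices_unlabeled`.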
- class small_text.query_strategies.RandomSampling
Randomly selects instances.
- query(clf, x, x_indices_unlabeled, x_indices_labeled, y, n=10)
A query selects instances from the unlabeled pool.
- Parameters
clf (small_text.classifiers.Classifier) – A classifier.
x (small_text.data.datasets.Dataset) – A dataset.
x_indices_unlabeled (list of int) – Indices (relative to x) for the unlabeled data.
x_indices_labeled (list of int) – Indices (relative to x) for the labeled data.
y (list of int) – List of labels where each label maps by index position to x_indices_labeled.
n (int) – Number of samples to query.
- Returns
indices – Indices relative to dataset.
- Return type
numpy.ndarray
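Random sampling is the usual baseline, and it also makes the index convention easy to see: the query draws from the unlabeled indices and returns dataset-relative indices. The sketch below is a hypothetical standalone helper (the name `random_query` and the `seed` parameter are assumptions), not the class's implementation.

```python
import numpy as np


def random_query(x_indices_unlabeled, n=10, seed=None):
    """Draw n distinct indices uniformly at random from the unlabeled pool."""
    rng = np.random.default_rng(seed)
    x_indices_unlabeled = np.asarray(x_indices_unlabeled)
    n = min(n, x_indices_unlabeled.shape[0])
    # replace=False ensures that no instance is selected twice.
    return rng.choice(x_indices_unlabeled, size=n, replace=False)


# Dataset of 8 instances, of which indices 0 and 3 are already labeled.
x_indices_unlabeled = [1, 2, 4, 5, 6, 7]
selected = random_query(x_indices_unlabeled, n=3, seed=42)
```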
- class small_text.integrations.pytorch.query_strategies.ExpectedGradientLength(num_classes, batch_size=50, device='cuda', pbar='tqdm')
Selects instances by expected gradient length [Set07].
References
- Set07
Burr Settles, Mark Craven, and Soumya Ray. 2007. Multiple-instance active learning. In Proceedings of the 20th International Conference on Neural Information Processing Systems (NIPS’07). Curran Associates Inc., Red Hook, 1289–1296.
- query(clf, x, x_indices_unlabeled, x_indices_labeled, y, n=10, pbar=None)
A query selects instances from the unlabeled pool.
- Parameters
clf (small_text.classifiers.Classifier) – A classifier.
x (small_text.data.datasets.Dataset) – A dataset.
x_indices_unlabeled (list of int) – Indices (relative to x) for the unlabeled data.
x_indices_labeled (list of int) – Indices (relative to x) for the labeled data.
y (list of int) – List of labels where each label maps by index position to x_indices_labeled.
n (int) – Number of samples to query.
- Returns
indices – Indices relative to dataset.
- Return type
numpy.ndarray
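To make the expected-gradient-length criterion concrete, here is a minimal sketch that assumes a plain softmax regression model instead of a PyTorch network. For that model the cross-entropy gradient with respect to the weight matrix has a closed form, so no autograd is needed. The function names and the weight matrix `W` are hypothetical and serve only this example.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def expected_gradient_length(X, W):
    """Expected gradient norm per instance for softmax regression.

    X: (n_samples, n_features), W: (n_features, n_classes).
    For a hypothesized label c, the gradient of the cross-entropy loss
    w.r.t. W is outer(x, p - onehot(c)), whose Frobenius norm equals
    ||x|| * ||p - onehot(c)||.
    """
    proba = softmax(X @ W)
    x_norm = np.linalg.norm(X, axis=1)
    egl = np.zeros(X.shape[0])
    for c in range(proba.shape[1]):
        residual = proba.copy()
        residual[:, c] -= 1.0  # p - onehot(c)
        grad_norm = np.linalg.norm(residual, axis=1) * x_norm
        # Weight each hypothesized label by its predicted probability.
        egl += proba[:, c] * grad_norm
    return egl


X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
W = np.zeros((2, 2))
scores = expected_gradient_length(X, W)
selected = np.argsort(-scores)[:1]  # array([2])
```

With all-zero weights every prediction is uniform, so the score grows only with the input norm and the third instance is selected.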
- class small_text.integrations.pytorch.query_strategies.ExpectedGradientLengthMaxWord(num_classes, layer_name, batch_size=50, device='cuda')
Selects instances using the EGL-word model [ZLW17].
The EGL-word model works as follows:
For every instance and class, the gradient norm is computed per word. The score for each (instance, class) pair is the largest of these per-word gradient norms.
These scores are then summed up over all classes. The result is one score per instance.
Finally, the instances are selected by maximum score.
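The three steps above can be sketched with NumPy, assuming the per-word embedding gradients have already been computed elsewhere. The array layout `(n_instances, n_classes, n_words, embedding_dim)` and the helper name `egl_word_scores` are assumptions for illustration; obtaining the gradients from an actual model is outside the scope of this sketch.

```python
import numpy as np


def egl_word_scores(word_grads):
    """Aggregate per-word gradients into one score per instance.

    word_grads: hypothetical array of shape
    (n_instances, n_classes, n_words, embedding_dim).
    """
    # Step 1: gradient norm per word -> (n_instances, n_classes, n_words),
    # then keep the largest norm per (instance, class) pair.
    word_norms = np.linalg.norm(word_grads, axis=3)
    max_word_norm = word_norms.max(axis=2)
    # Step 2: sum over classes -> one score per instance.
    return max_word_norm.sum(axis=1)


# Two instances, two classes, three words, two embedding dimensions.
word_grads = np.zeros((2, 2, 3, 2))
word_grads[0, 0, 1] = [3.0, 4.0]  # norm 5
word_grads[0, 1, 2] = [0.0, 2.0]  # norm 2
word_grads[1, 0, 0] = [1.0, 0.0]  # norm 1

scores = egl_word_scores(word_grads)  # array([7., 1.])
# Step 3: select the instances with the highest scores.
selected = np.argsort(-scores)[:1]    # array([0])
```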
Notes
An embedding layer is required for this strategy.
This strategy was designed for the KimCNN model and might not work for other models even if they possess an embedding layer.
References
- ZLW17
Ye Zhang, Matthew Lease, and Byron C. Wallace. 2017. Active discriminative text representation learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI’17). AAAI Press, 3386–3392.
- query(clf, x, x_indices_unlabeled, x_indices_labeled, y, n=10, pbar=None)
A query selects instances from the unlabeled pool.
- Parameters
clf (small_text.classifiers.Classifier) – A classifier.
x (small_text.data.datasets.Dataset) – A dataset.
x_indices_unlabeled (list of int) – Indices (relative to x) for the unlabeled data.
x_indices_labeled (list of int) – Indices (relative to x) for the labeled data.
y (list of int) – List of labels where each label maps by index position to x_indices_labeled.
n (int) – Number of samples to query.
- Returns
indices – Indices relative to dataset.
- Return type
numpy.ndarray
- class small_text.integrations.pytorch.query_strategies.ExpectedGradientLengthLayer(num_classes, layer_name, batch_size=50)
- query(clf, x, x_indices_unlabeled, x_indices_labeled, y, n=10, pbar=None)
A query selects instances from the unlabeled pool.
- Parameters
clf (small_text.classifiers.Classifier) – A classifier.
x (small_text.data.datasets.Dataset) – A dataset.
x_indices_unlabeled (list of int) – Indices (relative to x) for the unlabeled data.
x_indices_labeled (list of int) – Indices (relative to x) for the labeled data.
y (list of int) – List of labels where each label maps by index position to x_indices_labeled.
n (int) – Number of samples to query.
- Returns
indices – Indices relative to dataset.
- Return type
numpy.ndarray