ActiveLearner API

class small_text.active_learner.PoolBasedActiveLearner(clf_factory, query_strategy, dataset, reuse_model=False)

A pool-based active learner in which a pool holds all available unlabeled data.

It uses a classifier and a query strategy, and manages the partition of the whole training data into two mutually exclusive sets: labeled and unlabeled.

Parameters

clf_factory (small_text.classifiers.factories.AbstractClassifierFactory) – A factory responsible for creating new classifier instances.
query_strategy (small_text.query_strategies.QueryStrategy) – Query strategy which is responsible for selecting instances during a query() call.
dataset (Dataset) – A training dataset that is supported by the underlying classifier.
reuse_model (bool, default=False) – Reuses the previous model during retraining (if a previous model exists), otherwise creates a new model for each retraining.

indices_labeled

Indices of instances (relative to self.x_train) constituting the labeled pool.

Type: numpy.ndarray

indices_ignored

Indices of instances (relative to self.x_train) which have been ignored, i.e. which will never be returned by a query.

Type: numpy.ndarray or scipy.sparse.csr_matrix

y

Labels for the current labeled pool. Each tuple (x_indices_labeled[i], y[i]) represents one labeled sample.

Type: numpy.ndarray or scipy.sparse.csr_matrix

indices_queried

Queried indices returned by the last query() call, or None if no query has been executed yet.

Type: numpy.ndarray or None
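The bookkeeping behind these attributes can be pictured with a small stand-alone sketch. It does not use small_text: the pool size, the index values, and the helper unlabeled_indices are hypothetical and chosen only for illustration, and plain Python lists stand in for the numpy arrays.

```python
# Toy model of the pool partition; plain lists stand in for numpy arrays.
pool_size = 8                  # hypothetical number of training instances
indices_labeled = [0, 3, 5]    # positions in the training data that carry a label
y = [1, 0, 1]                  # y[i] is the label of indices_labeled[i]
indices_ignored = [6]          # never returned by a query


def unlabeled_indices(pool_size, indices_labeled, indices_ignored):
    """Return the indices a query may still select from (hypothetical helper)."""
    excluded = set(indices_labeled) | set(indices_ignored)
    return [i for i in range(pool_size) if i not in excluded]


indices_unlabeled = unlabeled_indices(pool_size, indices_labeled, indices_ignored)
labeled_samples = list(zip(indices_labeled, y))

print(indices_unlabeled)  # [1, 2, 4, 7]
print(labeled_samples)    # [(0, 1), (3, 0), (5, 1)]
```

In this sketch the labeled, ignored, and unlabeled index sets never overlap, which is the property the learner is described as maintaining.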

initialize_data(indices_initial, y_initial, indices_ignored=None, indices_validation=None, retrain=True)

(Re-)Initializes the current labeled pool.

This is required once before the first query() call, and whenever the labeled pool is changed from the outside, i.e. when self.x_train changes.

Parameters

indices_initial (numpy.ndarray) – A list of indices (relative to self.x_train) of initially labeled samples.
y_initial (numpy.ndarray or scipy.sparse.csr_matrix) – Label matrix. One row corresponds to an index in indices_initial. If a numpy.ndarray (dense) is passed, all further label-based operations assume dense labels; if a scipy.sparse.csr_matrix is passed, they assume sparse labels.
indices_ignored (numpy.ndarray, default=None) – Indices of ignored samples, which will be invisible to the query strategy.
indices_validation (numpy.ndarray, default=None) – The given indices (relative to self.x_indices_labeled) define a custom validation set if provided. Otherwise each classifier that uses a validation set will be responsible for creating a validation set. Only used if retrain=True.
retrain (bool, default=True) – Retrains the model after the update if True.
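Because indices_validation is described as relative to the labeled indices rather than to the training data, a two-level lookup is involved. The following stand-alone sketch shows one reading of that indirection. It does not call small_text, all values are hypothetical, and plain lists stand in for numpy arrays.

```python
# Hypothetical inputs, shaped like the arguments of initialize_data().
indices_initial = [12, 40, 7, 33, 21]   # dataset indices of the initially labeled samples
y_initial = [0, 1, 1, 0, 1]             # one label per entry of indices_initial
indices_validation = [1, 4]             # positions within indices_initial, not dataset indices

validation_positions = set(indices_validation)

# Resolve positions to dataset indices and split the labels accordingly.
validation_dataset_indices = [indices_initial[p] for p in indices_validation]
y_validation = [y_initial[p] for p in indices_validation]

train_dataset_indices = [
    idx for p, idx in enumerate(indices_initial) if p not in validation_positions
]
y_train = [
    label for p, label in enumerate(y_initial) if p not in validation_positions
]

print(validation_dataset_indices)  # [40, 21]
print(train_dataset_indices)       # [12, 7, 33]
```

Under this reading, passing dataset indices such as 40 or 21 directly as indices_validation would be a mistake, since they would be interpreted as positions in the labeled pool.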

query(num_samples=10, representation=None, query_strategy_kwargs={})

Performs a query step, which selects a number of samples from the unlabeled pool. A query step must be followed by an update step.

Parameters

num_samples (int, default=10) – Number of samples to query.
representation (numpy.ndarray, default=None) – Alternative representation for the samples in the unlabeled pool. This can be used if you want to rely on pre-computed fixed representations instead of embeddings that change during each active learning iteration.
query_strategy_kwargs (dict, default=dict()) – Additional keyword arguments that are passed on to the query strategy.

Returns

queried_indices – List of queried indices (relative to the current unlabeled pool).

Return type

numpy.ndarray[int]

Raises

LearnerNotInitializedException – Thrown when the active learner was not initialized via initialize_data(…).
ValueError – Raised when passed args or kwargs are not consumed.

update(y, indices_validation=None)

Performs an update step, which passes the label for each of the previously queried indices.

An update step must be preceded by a query step. At the end of the update step the current model is retrained using all available labels.

Parameters

y (numpy.ndarray or scipy.sparse.csr_matrix) – Labels provided in response to the previous query. For single-label data (ndarray), the label at index i corresponds to the sample x[i]; for multi-label data (csr_matrix), the row of labels at index i corresponds to the sample x[i]. Setting a label or row of labels to small_text.base.LABEL_IGNORED will ignore the respective sample.
indices_validation (numpy.ndarray, default=None) – The given indices (relative to self.x_indices_labeled) define a custom validation set if provided. Otherwise each classifier that uses a validation set will be responsible for creating a validation set.
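The alternation of query and update steps, including the handling of ignored samples, can be sketched with a toy stand-in. ToyPoolLearner below is hypothetical and is not the library's implementation: it selects samples at random, trains no model, uses its own LABEL_IGNORED sentinel, and raising RuntimeError on out-of-order calls is an assumption of this sketch rather than documented behavior.

```python
import random

LABEL_IGNORED = float("-inf")  # hypothetical sentinel used by this sketch only


class ToyPoolLearner:
    """Stand-in that mimics the query/update bookkeeping; no model is trained."""

    def __init__(self, pool_size, seed=0):
        self.pool_size = pool_size
        self.indices_labeled = []
        self.indices_ignored = []
        self.y = []
        self.indices_queried = None
        self._rng = random.Random(seed)

    def query(self, num_samples=10):
        if self.indices_queried is not None:
            raise RuntimeError("previous query has not been answered by update()")
        seen = set(self.indices_labeled) | set(self.indices_ignored)
        candidates = [i for i in range(self.pool_size) if i not in seen]
        k = min(num_samples, len(candidates))
        self.indices_queried = self._rng.sample(candidates, k)
        return self.indices_queried

    def update(self, y):
        if self.indices_queried is None:
            raise RuntimeError("update() must be preceded by query()")
        if len(y) != len(self.indices_queried):
            raise ValueError("one label per queried index is required")
        for index, label in zip(self.indices_queried, y):
            if label == LABEL_IGNORED:
                self.indices_ignored.append(index)   # excluded from future queries
            else:
                self.indices_labeled.append(index)
                self.y.append(label)
        self.indices_queried = None  # a real learner would retrain at this point


learner = ToyPoolLearner(pool_size=6)
queried = learner.query(num_samples=3)
learner.update([1, LABEL_IGNORED, 0])  # the second queried sample is ignored
```

After the update, the first and third queried indices appear in indices_labeled with labels 1 and 0, while the second is recorded in indices_ignored and can no longer be selected.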

update_label_at(index, y, retrain=False, indices_validation=None)

Updates the label for the given index (with regard to self.x_train).

Notes

After adding labels the current model might not reflect the labeled data anymore. You should consider if a retraining is necessary when using this operation. Since retraining is often time-consuming, retrain is set to False by default.

Parameters

index (int) – Data index (relative to self.x_train) for which the label should be updated.
y (int or numpy.ndarray) – The new label(s) to be assigned to the sample at the given index.
retrain (bool, default=False) – Retrains the model after the update if True.
indices_validation (numpy.ndarray, default=None) – The given indices (relative to self.x_indices_labeled) define a custom validation set if provided. This is only used if retrain is True.

remove_label_at(x_index, retrain=False, x_indices_validation=None)

Removes the labeling for the given x_index (with regard to self.x_train).

Notes

After removing labels the current model might not reflect the labeled data anymore. You should consider if a retraining is necessary when using this operation. Since retraining is often time-consuming, retrain is set to False by default.

Parameters

x_index (int) – Data index (relative to self.x_train) for which the label should be removed.
retrain (bool, default=False) – Retrains the model after removal if True.
x_indices_validation (numpy.ndarray, default=None) – The given indices (relative to self.x_indices_labeled) define a custom validation set if provided. This is only used if retrain is True.
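The effect of updating and removing single labels on the labeled pool can be illustrated with a stand-alone sketch. The two functions below are hypothetical list-based stand-ins that only share their names with the methods documented above. They do no retraining, and raising ValueError for an index without a label is an assumption of this sketch.

```python
# Toy labeled pool; plain lists stand in for numpy arrays.
indices_labeled = [2, 5, 9]   # hypothetical dataset indices that carry a label
y = [0, 1, 0]                 # y[i] is the label of indices_labeled[i]


def update_label_at(indices_labeled, y, index, new_label):
    """Overwrite the label of the sample with the given dataset index."""
    position = indices_labeled.index(index)  # ValueError if index is not labeled
    y[position] = new_label


def remove_label_at(indices_labeled, y, index):
    """Drop the sample with the given dataset index from the labeled pool."""
    position = indices_labeled.index(index)  # ValueError if index is not labeled
    del indices_labeled[position]
    del y[position]


update_label_at(indices_labeled, y, index=5, new_label=0)
remove_label_at(indices_labeled, y, index=2)

print(indices_labeled)  # [5, 9]
print(y)                # [0, 0]
```

A model trained before these two calls would still reflect the old labels, which is the situation the notes above describe when retrain is left at False.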

save(file)

Serializes the current active learner object into a single file for later re-use.

Parameters: file (str or path or file) – Serialized output file to be written.

classmethod load(file)

Deserializes a serialized active learner.

Parameters: file (str or path or file) – File to be loaded.