ActiveLearner API

class small_text.active_learner.PoolBasedActiveLearner(clf_factory, query_strategy, x_train, incremental_training=False)

A pool-based active learner in which a pool holds all available unlabeled data. It combines a classifier and a query strategy, and manages the partition of the whole training data into mutually exclusive labeled and unlabeled sets.

Parameters

clf_factory (small_text.classifiers.factories.AbstractClassifierFactory) – A factory responsible for creating new classifier instances.
query_strategy (small_text.query_strategies.QueryStrategy) – Query strategy which is responsible for selecting instances during a query() call.
x_train (small_text.data.Dataset) – A training dataset that is supported by the underlying classifier.
incremental_training (bool) – If True, a classifier is created only once, before the first query, and the existing classifier is re-trained on every subsequent retraining. If False, a new classifier is created and trained for each retraining. Incremental training must be supported by the classifier provided by clf_factory.

x_indices_labeled

Indices of instances (relative to self.x_train) constituting the labeled pool.

Type: numpy.ndarray

x_indices_ignored

Indices of instances (relative to self.x_train) which have been ignored, i.e. which will never be returned by a query.

Type: numpy.ndarray

y

Labels for the current labeled pool. Each tuple (x_indices_labeled[i], y[i]) represents one labeled sample.

Type: numpy.ndarray

queried_indices

Queried indices returned by the last query() call, or None if no query has been executed yet.

Type: numpy.ndarray or None
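
The attributes above describe a split of the indices of x_train into labeled, ignored, and remaining unlabeled samples. The following is a minimal sketch of that bookkeeping in plain Python lists; unlabeled_indices is a hypothetical helper for illustration and not part of the small_text API.

```python
def unlabeled_indices(num_samples, x_indices_labeled, x_indices_ignored):
    """Indices of x_train that are neither labeled nor ignored (hypothetical helper)."""
    excluded = set(x_indices_labeled) | set(x_indices_ignored)
    return [i for i in range(num_samples) if i not in excluded]


# Toy state for a training set with 10 samples.
x_indices_labeled = [0, 3, 7]
x_indices_ignored = [5]
y = [1, 0, 1]

# Everything that is neither labeled nor ignored is still available for querying.
pool = unlabeled_indices(10, x_indices_labeled, x_indices_ignored)

# Each pair (x_indices_labeled[i], y[i]) is one labeled sample.
labeled_samples = list(zip(x_indices_labeled, y))
```

Here pool contains the six indices that a query could still return, and labeled_samples pairs every labeled index with its label.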

initialize_data(x_indices_initial, y_initial, x_indices_ignored=None, x_indices_validation=None, retrain=True)

(Re-)Initializes the current labeled pool.

This is required once before the first query() call, and whenever the labeled pool is changed from the outside, i.e. when self.x_train changes.

Parameters

x_indices_initial (numpy.ndarray) – Indices (relative to self.x_train) of the initially labeled samples.
y_initial (numpy.ndarray) – Labels, one for each index in x_indices_initial.
x_indices_ignored (numpy.ndarray) – List of ignored samples which will be invisible to the query strategy.
x_indices_validation (numpy.ndarray) – The given indices (relative to self.x_indices_labeled) define a custom validation set if provided. Otherwise each classifier that uses a validation set will be responsible for creating a validation set. Only used if retrain=True.
retrain (bool) – Retrains the model after the update if True.
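
Note that x_indices_validation is relative to self.x_indices_labeled, not to self.x_train, so there are two levels of indexing. The sketch below spells out that index arithmetic with plain lists. split_train_validation is a hypothetical helper; it only illustrates the description above and is not how small_text necessarily performs the split internally.

```python
def split_train_validation(x_indices_labeled, y, x_indices_validation):
    """Split the labeled pool by positions into the labeled pool (hypothetical helper)."""
    if len(x_indices_labeled) != len(y):
        raise ValueError('one label per labeled index is required')
    validation_positions = set(x_indices_validation)
    train, valid = [], []
    for position, (x_index, label) in enumerate(zip(x_indices_labeled, y)):
        # position indexes the labeled pool, x_index indexes x_train
        target = valid if position in validation_positions else train
        target.append((x_index, label))
    return train, valid


x_indices_initial = [12, 40, 7, 33]   # relative to x_train
y_initial = [0, 1, 1, 0]
x_indices_validation = [1, 3]         # relative to x_indices_initial

train, valid = split_train_validation(x_indices_initial, y_initial, x_indices_validation)
```

In this toy example the validation positions 1 and 3 select the x_train indices 40 and 33.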

query(num_samples=10, x=None, query_strategy_kwargs=None)

Performs a query step, which selects a number of samples from the unlabeled pool. A query step must be followed by an update step.

Parameters

num_samples (int) – Number of samples to query.
x (list-like) – Alternative representation for the samples in the unlabeled pool. This is used by some query strategies.
query_strategy_kwargs (dict) – Additional keyword arguments passed on to the query strategy.

Returns

queried_indices – List of queried indices (relative to the current unlabeled pool).

Return type

numpy.ndarray

Raises

LearnerNotInitializedException – Raised when the active learner was not initialized via initialize_data(…).
ValueError – Raised when args or kwargs are passed but not consumed.
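
To make the query step concrete, here is a minimal sketch in which random sampling stands in for a real query strategy. random_query is a hypothetical helper, and the sketch assumes that the selected values are the pool's own index values; the actual selection is delegated to the configured query_strategy.

```python
import random


def random_query(x_indices_unlabeled, num_samples=10, seed=None):
    """Pick num_samples distinct indices from the unlabeled pool (hypothetical helper)."""
    pool = list(x_indices_unlabeled)
    if num_samples > len(pool):
        raise ValueError('cannot query more samples than the unlabeled pool holds')
    # Random.sample draws without replacement, so no index is returned twice.
    return random.Random(seed).sample(pool, num_samples)


x_indices_unlabeled = [1, 2, 4, 6, 8, 9]
queried_indices = random_query(x_indices_unlabeled, num_samples=3, seed=42)
```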

update(y, x_indices_validation=None)

Performs an update step, which passes the label for each of the previously queried indices. An update step must be preceded by a query step. At the end of the update step the current model is retrained using all available labels.

Parameters

y (list of int or numpy.ndarray) – Labels provided in response to the previous query. The label at index i corresponds to the sample at queried_indices[i]. Setting a label to None will ignore the respective sample.
x_indices_validation (numpy.ndarray) – The given indices (relative to self.x_indices_labeled) define a custom validation set if provided. Otherwise each classifier that uses a validation set will be responsible for creating a validation set.
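
The effect of an update step on the pools can be sketched as follows. apply_update is a hypothetical helper working on plain lists, and it assumes that samples answered with None are moved to the ignored indices; retraining is left out.

```python
def apply_update(queried_indices, y, x_indices_labeled, y_labeled, x_indices_ignored):
    """Merge the answers for the last query into the pools (hypothetical helper)."""
    if len(y) != len(queried_indices):
        raise ValueError('expected one label (or None) per queried index')
    for x_index, label in zip(queried_indices, y):
        if label is None:
            # Assumption: a skipped sample is never queried again.
            x_indices_ignored.append(x_index)
        else:
            x_indices_labeled.append(x_index)
            y_labeled.append(label)


x_indices_labeled = [0, 3, 7]
y_labeled = [1, 0, 1]
x_indices_ignored = [5]

# The annotator labels two of the three queried samples and skips one.
apply_update([4, 8, 1], [1, None, 0], x_indices_labeled, y_labeled, x_indices_ignored)
```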

update_label_at(x_index, y, retrain=False, x_indices_validation=None)

Updates the label for the given x_index (with regard to self.x_train).

Notes

After adding labels the current model might not reflect the labeled data anymore. You should consider if a retraining is necessary when using this operation. Since retraining is often time-consuming, retrain is set to False by default.

Parameters

x_index (int) – Label index for the label to be updated.
y (int) – New label.
retrain (bool) – Retrains the model after the update if True.
x_indices_validation (numpy.ndarray) – The given indices (relative to self.x_indices_labeled) define a custom validation set if provided. Otherwise each classifier that uses a validation set will be responsible for creating a validation set.

remove_label_at(x_index, retrain=False, x_indices_validation=None)

Removes the labeling for the given x_index (with regard to self.x_train).

Notes

After removing labels the current model might not reflect the labeled data anymore. You should consider if a retraining is necessary when using this operation. Since retraining is often time-consuming, retrain is set to False by default.

Parameters

x_index (int) – Label index for the label to be removed.
retrain (bool) – Retrains the model after the removal if True.
x_indices_validation (numpy.ndarray) – The given indices (relative to self.x_indices_labeled) define a custom validation set if provided. Otherwise each classifier that uses a validation set will be responsible for creating a validation set.
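
Both update_label_at and remove_label_at take an x_index with regard to self.x_train, which first has to be located within the labeled pool. The sketch below shows that lookup on plain lists. The two functions are hypothetical stand-ins that share the method names for readability; they omit retraining and are not the small_text implementation.

```python
def update_label_at(x_indices_labeled, y, x_index, new_label):
    """Overwrite the label of an already labeled sample (hypothetical stand-in)."""
    position = x_indices_labeled.index(x_index)  # raises ValueError if x_index is not labeled
    y[position] = new_label


def remove_label_at(x_indices_labeled, y, x_index):
    """Drop a sample and its label from the labeled pool (hypothetical stand-in)."""
    position = x_indices_labeled.index(x_index)  # raises ValueError if x_index is not labeled
    del x_indices_labeled[position]
    del y[position]


x_indices_labeled = [0, 3, 7]
y = [1, 0, 1]

update_label_at(x_indices_labeled, y, x_index=3, new_label=1)
remove_label_at(x_indices_labeled, y, x_index=0)
```

After either operation a model trained earlier no longer matches the labeled pool, which is why the real methods offer the retrain flag.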

save(file)

Serializes the current active learner object into a single file for later re-use.

Parameters: file (str or path or file) – Serialized output file to be written.

classmethod load(file)

Deserializes a serialized active learner.

Parameters: file (str or path or file) – File to be loaded.
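
A save and load round trip can be sketched as follows. ToyLearner is a hypothetical stand-in that keeps only the labeled pool, and the use of pickle is an assumption of this sketch; the serialization format that small_text itself uses is not specified here.

```python
import io
import os
import pickle


class ToyLearner:
    """Hypothetical stand-in that only keeps the labeled pool."""

    def __init__(self, x_indices_labeled, y):
        self.x_indices_labeled = list(x_indices_labeled)
        self.y = list(y)

    def save(self, file):
        """Write the state to a path or to an open binary file object."""
        state = {'x_indices_labeled': self.x_indices_labeled, 'y': self.y}
        if isinstance(file, (str, os.PathLike)):
            with open(file, 'wb') as f:
                pickle.dump(state, f)
        else:
            pickle.dump(state, file)

    @classmethod
    def load(cls, file):
        """Read the state from a path or from an open binary file object."""
        if isinstance(file, (str, os.PathLike)):
            with open(file, 'rb') as f:
                state = pickle.load(f)
        else:
            state = pickle.load(file)
        return cls(**state)


# Round trip through an in-memory buffer instead of a file on disk.
buffer = io.BytesIO()
ToyLearner([0, 3, 7], [1, 0, 1]).save(buffer)
buffer.seek(0)
restored = ToyLearner.load(buffer)
```

As with any pickle-based format, a file should only be loaded if it comes from a trusted source.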