ActiveLearner API

class small_text.active_learner.PoolBasedActiveLearner(clf_factory, query_strategy, x_train, incremental_training=False)

A pool-based active learner in which a pool holds all available unlabeled data. It combines a classifier with a query strategy and maintains a partition of the whole training data into mutually exclusive labeled and unlabeled subsets.

Parameters
  • clf_factory (small_text.classifiers.factories.AbstractClassifierFactory) – A factory responsible for creating new classifier instances.

  • query_strategy (small_text.query_strategies.QueryStrategy) – Query strategy which is responsible for selecting instances during a query() call.

  • x_train (small_text.data.Dataset) – A training dataset that is supported by the underlying classifier.

  • incremental_training (bool) – If True, a new classifier is created and trained only before the first query, and the existing classifier is re-trained afterwards. If False, a new classifier is created for every training run. Incremental training must be supported by the classifier provided by clf_factory.
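The effect of incremental_training can be pictured with a toy model. The sketch below is not small-text code: ToyClassifier, CountingFactory, retrain and simulate are hypothetical names, and it assumes that the learner requests a fresh classifier from the factory for each training run unless incremental_training is True and a classifier already exists.

```python
class ToyClassifier:
    """Hypothetical stand-in for a classifier; it only counts fit() calls."""

    def __init__(self):
        self.num_fits = 0

    def fit(self, labeled_data):
        self.num_fits += 1
        return self


class CountingFactory:
    """Hypothetical factory that records how many classifiers it created."""

    def __init__(self):
        self.num_created = 0

    def new(self):
        self.num_created += 1
        return ToyClassifier()


def retrain(clf, factory, labeled_data, incremental_training):
    # Assumption: a fresh classifier is requested unless an existing one
    # may be reused (incremental_training=True and a classifier exists).
    if clf is None or not incremental_training:
        clf = factory.new()
    return clf.fit(labeled_data)


def simulate(num_rounds, incremental_training):
    """Run num_rounds training runs; return (classifiers created, fits of the last one)."""
    factory, clf = CountingFactory(), None
    for _ in range(num_rounds):
        clf = retrain(clf, factory, [], incremental_training)
    return factory.num_created, clf.num_fits
```

Under these assumptions, three training runs create three classifiers without incremental training and a single classifier, fitted three times, with it.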

x_indices_labeled

Indices of instances (relative to self.x_train) constituting the labeled pool.

Type

numpy.ndarray

x_indices_ignored

Indices of instances (relative to self.x_train) which have been ignored, i.e. which will never be returned by a query.

Type

numpy.ndarray

y

Labels for the current labeled pool. Each tuple (x_indices_labeled[i], y[i]) represents one labeled sample.

Type

numpy.ndarray

queried_indices

Queried indices returned by the last query() call, or None if no query has been executed yet.

Type

numpy.ndarray or None
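How these attributes relate to each other can be sketched with plain Python lists in place of numpy arrays. This is an illustration only: unlabeled_pool is a hypothetical helper, and the sketch assumes that ignored indices are excluded from the unlabeled pool.

```python
def unlabeled_pool(num_samples, x_indices_labeled, x_indices_ignored):
    """Return the indices that are neither labeled nor ignored."""
    labeled, ignored = set(x_indices_labeled), set(x_indices_ignored)
    if labeled & ignored:
        raise ValueError('an index cannot be both labeled and ignored')
    return [i for i in range(num_samples) if i not in labeled and i not in ignored]


x_indices_labeled = [0, 4, 7]   # indices relative to the training set
y = [1, 0, 1]                   # y[i] is the label of x_indices_labeled[i]
x_indices_ignored = [2]

pool = unlabeled_pool(10, x_indices_labeled, x_indices_ignored)
labeled_samples = list(zip(x_indices_labeled, y))
```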

initialize_data(x_indices_initial, y_initial, x_indices_ignored=None, x_indices_validation=None, retrain=True)

(Re-)Initializes the current labeled pool.

This is required once before the first query() call, and whenever the labeled pool is changed from the outside, i.e. when self.x_train changes.

Parameters
  • x_indices_initial (numpy.ndarray) – A list of indices (relative to self.x_train) of initially labeled samples.

  • y_initial (numpy.ndarray) – List of labels. One label corresponds to each index in x_indices_initial.

  • x_indices_ignored (numpy.ndarray) – Indices (relative to self.x_train) of ignored samples, which will be invisible to the query strategy.

  • x_indices_validation (numpy.ndarray) – The given indices (relative to self.x_indices_labeled) define a custom validation set if provided. Otherwise each classifier that uses a validation set will be responsible for creating a validation set. Only used if retrain=True.

  • retrain (bool) – Retrains the model after the update if True.
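Because x_indices_validation is relative to the labeled pool rather than to the training set, the two index spaces are easy to confuse. The following sketch shows one way to read that convention, using plain lists; split_labeled_pool is a hypothetical helper and not part of the library.

```python
def split_labeled_pool(x_indices_labeled, x_indices_validation=None):
    """Split the labeled pool into training and validation indices.

    x_indices_validation holds positions within x_indices_labeled,
    not positions within the full training set.
    """
    if x_indices_validation is None:
        # No custom validation set: the whole labeled pool is passed on.
        return list(x_indices_labeled), []
    validation_positions = set(x_indices_validation)
    train = [index for position, index in enumerate(x_indices_labeled)
             if position not in validation_positions]
    validation = [x_indices_labeled[position] for position in x_indices_validation]
    return train, validation


x_indices_initial = [12, 40, 3, 25]
train, validation = split_labeled_pool(x_indices_initial, x_indices_validation=[1, 3])
```

Here positions 1 and 3 of the labeled pool select the training-set indices 40 and 25 for validation, leaving 12 and 3 for training.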

query(num_samples=10, x=None, query_strategy_kwargs=None)

Performs a query step, which selects a number of samples from the unlabeled pool. A query step must be followed by an update step.

Parameters
  • num_samples (int) – Number of samples to query.

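  • x (list-like) – Alternative representation for the samples in the unlabeled pool. This is used by some query strategies.

  • query_strategy_kwargs (dict) – Additional keyword arguments that are passed on to the query strategy.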

Returns

queried_indices – List of queried indices (relative to the current unlabeled pool).

Return type

numpy.ndarray

Raises
  • LearnerNotInitializedException – Raised when the active learner was not initialized via initialize_data(…).

  • ValueError – Raised when positional or keyword arguments are passed but not consumed.
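A query step can be pictured as drawing from whatever is neither labeled nor ignored. The sketch below uses random selection as a placeholder for a real query strategy. toy_query and LearnerNotInitialized are hypothetical names, and, as a simplification of this sketch, the result is expressed as indices into the full training set.

```python
import random


class LearnerNotInitialized(Exception):
    """Hypothetical stand-in for LearnerNotInitializedException."""


def toy_query(num_total, x_indices_labeled, x_indices_ignored, num_samples=10, seed=0):
    """Pick num_samples distinct indices at random from the unlabeled pool."""
    if x_indices_labeled is None:
        raise LearnerNotInitialized('call initialize_data() before query()')
    excluded = set(x_indices_labeled) | set(x_indices_ignored)
    candidates = [i for i in range(num_total) if i not in excluded]
    if num_samples > len(candidates):
        raise ValueError('not enough unlabeled samples left')
    # Random selection stands in for the query strategy here.
    return random.Random(seed).sample(candidates, num_samples)


queried = toy_query(20, x_indices_labeled=[0, 1, 2], x_indices_ignored=[3], num_samples=5)
```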

update(y, x_indices_validation=None)

Performs an update step, which passes the label for each of the previously queried indices. An update step must be preceded by a query step. At the end of the update step the current model is retrained using all available labels.

Parameters
  • y (list of int or numpy.ndarray) – Labels provided in response to the previous query. The label at index i corresponds to the i-th index returned by that query. Setting a label to None will ignore the respective sample.

  • x_indices_validation (numpy.ndarray) – The given indices (relative to self.x_indices_labeled) define a custom validation set if provided. Otherwise each classifier that uses a validation set will be responsible for creating a validation set.
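The bookkeeping of an update step, without the retraining, might look roughly like this. toy_update is a hypothetical function on plain lists, and the sketch assumes that a sample answered with None is moved to the ignored indices.

```python
def toy_update(x_indices_labeled, y_labeled, x_indices_ignored, queried_indices, y):
    """Merge the answers to the last query into the labeled pool."""
    if len(y) != len(queried_indices):
        raise ValueError('expected exactly one label per queried index')
    labeled = list(x_indices_labeled)
    labels = list(y_labeled)
    ignored = list(x_indices_ignored)
    for index, label in zip(queried_indices, y):
        if label is None:
            # Assumption: a None answer moves the sample to the ignored set.
            ignored.append(index)
        else:
            labeled.append(index)
            labels.append(label)
    return labeled, labels, ignored


labeled, labels, ignored = toy_update(
    x_indices_labeled=[0, 1], y_labeled=[1, 0], x_indices_ignored=[],
    queried_indices=[7, 9, 4], y=[0, None, 1],
)
```

In this example the samples 7 and 4 join the labeled pool, while sample 9 is ignored from then on.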

update_label_at(x_index, y, retrain=False, x_indices_validation=None)

Updates the label for the given x_index (with regard to self.x_train).

Notes

After updating labels the current model might no longer reflect the labeled data. Consider whether retraining is necessary when using this operation. Since retraining is often time-consuming, retrain is set to False by default.

Parameters
  • x_index (int) – Index (relative to self.x_train) of the sample whose label is to be updated.

  • y (int) – New label.

  • retrain (bool) – Retrains the model after the update if True.

  • x_indices_validation (numpy.ndarray) – The given indices (relative to self.x_indices_labeled) define a custom validation set if provided. Otherwise each classifier that uses a validation set will be responsible for creating a validation set.

remove_label_at(x_index, retrain=False, x_indices_validation=None)

Removes the label for the given x_index (with regard to self.x_train).

Notes

After removing labels the current model might no longer reflect the labeled data. Consider whether retraining is necessary when using this operation. Since retraining is often time-consuming, retrain is set to False by default.

Parameters
  • x_index (int) – Index (relative to self.x_train) of the sample whose label is to be removed.

  • retrain (bool) – Retrains the model after the removal if True.

  • x_indices_validation (numpy.ndarray) – The given indices (relative to self.x_indices_labeled) define a custom validation set if provided. Otherwise each classifier that uses a validation set will be responsible for creating a validation set.
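Both label-editing operations address a sample by its index in the training set, which first has to be located within the labeled pool. A minimal sketch with plain lists follows. toy_update_label_at and toy_remove_label_at are hypothetical helpers, and how the library reacts to an x_index that is not labeled is not covered here.

```python
def toy_update_label_at(x_indices_labeled, y, x_index, new_label):
    """Return a copy of y in which the sample x_index carries new_label."""
    position = x_indices_labeled.index(x_index)  # ValueError if x_index is not labeled
    updated = list(y)
    updated[position] = new_label
    return updated


def toy_remove_label_at(x_indices_labeled, y, x_index):
    """Return copies of both lists without the entry for x_index."""
    position = x_indices_labeled.index(x_index)  # ValueError if x_index is not labeled
    remaining_indices = x_indices_labeled[:position] + x_indices_labeled[position + 1:]
    remaining_labels = y[:position] + y[position + 1:]
    return remaining_indices, remaining_labels


x_indices_labeled, y = [5, 11, 2], [0, 1, 1]
y_fixed = toy_update_label_at(x_indices_labeled, y, x_index=11, new_label=0)
indices_after, y_after = toy_remove_label_at(x_indices_labeled, y_fixed, x_index=5)
```

Neither helper retrains anything, which mirrors the default retrain=False: the model stays as it was until a retraining is requested.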

save(file)

Serializes the current active learner object into a single file for later re-use.

Parameters

file (str or path or file) – Serialized output file to be written.

classmethod load(file)

Deserializes a serialized active learner.

Parameters

file (str or path or file) – File to be loaded.
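The save/load pair follows the usual round-trip pattern, in which file may be a path or an already opened file object. The sketch below illustrates that pattern with the standard pickle module and a hypothetical ToyLearner. It makes no claim about which serialization format the library itself uses.

```python
import io
import os
import pickle


class ToyLearner:
    """Hypothetical stand-in holding the state an active learner would keep."""

    def __init__(self, x_indices_labeled, y):
        self.x_indices_labeled = x_indices_labeled
        self.y = y

    def save(self, file):
        state = {'x_indices_labeled': self.x_indices_labeled, 'y': self.y}
        # Accept either a path-like value or an already opened binary file.
        if isinstance(file, (str, os.PathLike)):
            with open(file, 'wb') as f:
                pickle.dump(state, f)
        else:
            pickle.dump(state, file)

    @classmethod
    def load(cls, file):
        if isinstance(file, (str, os.PathLike)):
            with open(file, 'rb') as f:
                state = pickle.load(f)
        else:
            state = pickle.load(file)
        return cls(**state)


buffer = io.BytesIO()
ToyLearner([3, 8], [1, 0]).save(buffer)
buffer.seek(0)  # rewind before reading the serialized state back
restored = ToyLearner.load(buffer)
```

As with any pickle-based format, files should only be loaded from trusted sources.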