ActiveLearner API

Everything in small-text revolves around the active learner.


Pool-Based Active Learner

class small_text.active_learner.PoolBasedActiveLearner(clf_factory, query_strategy, dataset, fit_kwargs={}, reuse_model=False)

A pool-based active learner in which a pool holds all available unlabeled data.

It uses a classifier, a query strategy and manages the mutually exclusive partition over the whole training data into a labeled and an unlabeled set.

Parameters:
  • clf_factory (small_text.classifiers.factories.AbstractClassifierFactory) – A factory responsible for creating new classifier instances.

  • query_strategy (small_text.query_strategies.QueryStrategy) – Query strategy which is responsible for selecting instances during a query() call.

  • dataset (Dataset) – A training dataset that is supported by the underlying classifier.

  • reuse_model (bool, default=False) – Reuses the previous model during retraining (if a previous model exists), otherwise creates a new model for each retraining.

indices_labeled

Indices of instances (relative to self.dataset) constituting the labeled pool.

Type:

numpy.ndarray

indices_ignored

Indices of instances (relative to self.dataset) which have been ignored, i.e. which will never be returned by a query.

Type:

numpy.ndarray or scipy.sparse.csr_matrix

y

Labels for the current labeled pool. Each tuple (indices_labeled[i], y[i]) represents one labeled sample.

Type:

numpy.ndarray or scipy.sparse.csr_matrix

indices_queried

Queried indices returned by the last query() call, or None if no query has been executed yet.

Type:

numpy.ndarray or None

fit_kwargs

Keyword arguments that will be passed to the fit() call during update().

Type:

dict
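
As a rough illustration of how the index attributes above relate to each other, the following sketch derives the unlabeled pool from indices_labeled and indices_ignored. It is a toy example and not the library's implementation: plain Python lists stand in for the numpy arrays, and unlabeled_pool is a hypothetical helper introduced only for this illustration.

```python
# Toy illustration only: plain lists stand in for the numpy arrays that the
# attributes above are documented to hold.
dataset_size = 10
indices_labeled = [0, 3, 7]
indices_ignored = [5]


def unlabeled_pool(dataset_size, indices_labeled, indices_ignored):
    """Return the dataset indices that are neither labeled nor ignored."""
    excluded = set(indices_labeled) | set(indices_ignored)
    return [i for i in range(dataset_size) if i not in excluded]


pool = unlabeled_pool(dataset_size, indices_labeled, indices_ignored)
print(pool)  # [1, 2, 4, 6, 8, 9]
```

Every dataset index falls into exactly one of the three groups (labeled, ignored, unlabeled), which is the mutually exclusive partition described for the class.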

initialize(clf_or_indices=None, indices_validation=None, retrain=True)

Initializes the current active learner.

The learner is initialized either by supplying an existing model or by training a new model on the samples referenced by the given indices. This is required once before the first query() call for query strategies that rely on the current model, which is the majority of them.

Notes

  • You can pass a different type of classifier via clf_or_indices than self.clf_factory would have created, but both classifiers have to support the same dataset type.

  • If indices are passed, the samples in self.dataset referenced by these indices need to be labeled.

Parameters:
  • clf_or_indices (Classifier or numpy.ndarray or None) – By default you will provide indices relative to self.dataset, which are then used to train the initial model. Alternatively, you can provide a classifier to be used as initial model. You can also pass None to initialize the active learning without having an initial model. None will not work with most query strategies and is only intended for cold start active learning.

  • indices_validation (numpy.ndarray, default=None) – The given indices (relative to self.indices_labeled) define a custom validation set if provided. Otherwise, each classifier that uses a validation set will be responsible for creating a validation set. Only used if clf_or_indices is not None and retrain=True.

  • retrain (bool, default=True) – Retrains the model after initialization if True.
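
The phrase "relative to self.indices_labeled" can be easy to misread. The sketch below assumes it means positions within the labeled index array rather than positions within the dataset; this reading is an assumption, and the variable names other than indices_labeled and indices_validation are hypothetical. Plain lists stand in for numpy arrays.

```python
# Dataset indices of the currently labeled samples (toy values).
indices_labeled = [12, 3, 40, 7, 25]

# Assumed interpretation: positions into indices_labeled, not into the dataset.
indices_validation = [1, 4]

validation_positions = set(indices_validation)

# Dataset indices that would end up in the validation set.
validation_dataset_indices = [indices_labeled[p] for p in indices_validation]

# Dataset indices that would remain for training.
train_dataset_indices = [
    index for position, index in enumerate(indices_labeled)
    if position not in validation_positions
]

print(validation_dataset_indices)  # [3, 25]
print(train_dataset_indices)       # [12, 40, 7]
```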

query(num_samples=10, representation=None, query_strategy_kwargs={})

Performs a query step, which selects a number of samples from the unlabeled pool. A query step must be followed by an update step.

Parameters:
  • num_samples (int, default=10) – Number of samples to query.

  • representation (numpy.ndarray, default=None) – Alternative representation for the samples in the unlabeled pool. This can be used if you want to rely on pre-computed fixed representations instead of embeddings that change during each active learning iteration.

  • query_strategy_kwargs (dict, default=dict()) – Additional keyword arguments passed on to the query strategy.

Returns:

queried_indices – List of queried indices (relative to the current unlabeled pool).

Return type:

numpy.ndarray[int]

Raises:
  • LearnerNotInitializedException – Raised when the active learner has not been initialized via initialize(…).

  • ValueError – Raised when positional or keyword arguments are passed but not consumed.

update(y, indices_validation=None)

Performs an update step, which passes the label for each of the previously queried indices.

An update step must be preceded by a query step. At the end of the update step the current model is retrained using all available labels.

Parameters:
  • y (numpy.ndarray or scipy.sparse.csr_matrix) – Labels provided in response to the previous query. Each label at index i corresponds to the sample x[i] for single-label data (ndarray) and each row of labels at index i corresponds to the sample x[i] for multi-label data (csr_matrix). Setting the label / row of labels to small_text.base.LABEL_IGNORED will ignore the respective sample.

  • indices_validation (numpy.ndarray, default=None) – The given indices (relative to self.indices_labeled) define a custom validation set if provided. Otherwise, each classifier that uses a validation set will be responsible for creating a validation set.
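
The alternation of query and update steps can be sketched with a toy stand-in. The class below is not the library's implementation: ToyPoolLearner, LearnerNotInitialized, and the LABEL_IGNORED sentinel are hypothetical names for this illustration, random sampling replaces a real query strategy, no model is trained, and plain lists replace numpy arrays.

```python
import random

LABEL_IGNORED = object()  # hypothetical sentinel; the library defines its own constant


class LearnerNotInitialized(Exception):
    """Toy counterpart of LearnerNotInitializedException."""


class ToyPoolLearner:
    """Toy stand-in that mimics the initialize/query/update protocol."""

    def __init__(self, labels, seed=0):
        self.labels = labels  # ground truth, standing in for the dataset
        self.indices_labeled = []
        self.indices_ignored = []
        self.y = []
        self.indices_queried = None
        self._initialized = False
        self._awaiting_update = False
        self._rng = random.Random(seed)

    def initialize(self, indices):
        # The samples referenced by the initial indices must already be labeled.
        self.indices_labeled = list(indices)
        self.y = [self.labels[i] for i in indices]
        self._initialized = True

    def query(self, num_samples=10):
        if not self._initialized:
            raise LearnerNotInitialized("call initialize() before query()")
        excluded = set(self.indices_labeled) | set(self.indices_ignored)
        pool = [i for i in range(len(self.labels)) if i not in excluded]
        # Random sampling replaces a real query strategy here.
        self.indices_queried = self._rng.sample(pool, min(num_samples, len(pool)))
        self._awaiting_update = True
        return self.indices_queried

    def update(self, y):
        if not self._awaiting_update:
            raise ValueError("update() must be preceded by query()")
        if len(y) != len(self.indices_queried):
            raise ValueError("expected one label per queried index")
        for index, label in zip(self.indices_queried, y):
            if label is LABEL_IGNORED:
                self.indices_ignored.append(index)
            else:
                self.indices_labeled.append(index)
                self.y.append(label)
        self._awaiting_update = False
        # A real learner would retrain its model at this point.


true_labels = [i % 2 for i in range(20)]
learner = ToyPoolLearner(true_labels)
learner.initialize([0, 1, 2, 3])

for _ in range(3):
    queried = learner.query(num_samples=4)
    answers = [true_labels[i] for i in queried]
    answers[0] = LABEL_IGNORED  # pretend the annotator skipped the first sample
    learner.update(answers)

print(len(learner.indices_labeled), len(learner.indices_ignored))  # 13 3
```

In each of the three rounds one queried sample is ignored and three are labeled, so the labeled pool grows from 4 to 13 entries while the ignored samples are excluded from all later queries.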

update_label_at(index, y, retrain=False, indices_validation=None)

Updates the label for the given index (with regard to self.dataset).

Notes

After adding labels the current model might not reflect the labeled data anymore. You should consider if a retraining is necessary when using this operation. Since retraining is often time-consuming, retrain is set to False by default.

Parameters:
  • index (int) – Data index (relative to self.dataset) for which the label should be updated.

  • y (int or numpy.ndarray) – The new label(s) to be assigned to the sample at the given index.

  • retrain (bool, default=False) – Retrains the model after the update if True.

  • indices_validation (numpy.ndarray, default=None) – The given indices (relative to self.indices_labeled) define a custom validation set if provided. This is only used if retrain is True.

remove_label_at(x_index, retrain=False, x_indices_validation=None)

Removes the label for the given x_index (with regard to self.dataset).

Notes

After removing labels the current model might not reflect the labeled data anymore. You should consider if a retraining is necessary when using this operation. Since retraining is often time-consuming, retrain is set to False by default.

Parameters:
  • x_index (int) – Data index (relative to self.dataset) for which the label should be removed.

  • retrain (bool, default=False) – Retrains the model after removal if True.

  • x_indices_validation (numpy.ndarray, default=None) – The given indices (relative to self.indices_labeled) define a custom validation set if provided. This is only used if retrain is True.
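
Both update_label_at() and remove_label_at() take a dataset index, whereas labels are stored in the order of the labeled pool. A minimal sketch of that bookkeeping follows, assuming the sample in question is already part of the labeled pool. The two functions are hypothetical free-standing helpers operating on plain lists, not the library's methods, and they omit retraining.

```python
def update_label_at(indices_labeled, y, index, new_label):
    """Overwrite the label of the sample with dataset index `index`."""
    position = indices_labeled.index(index)  # ValueError if index is not labeled
    y[position] = new_label


def remove_label_at(indices_labeled, y, index):
    """Drop the sample with dataset index `index` from the labeled pool."""
    position = indices_labeled.index(index)  # ValueError if index is not labeled
    del indices_labeled[position]
    del y[position]


indices_labeled = [12, 3, 40]
y = [0, 1, 1]

update_label_at(indices_labeled, y, index=3, new_label=0)
remove_label_at(indices_labeled, y, index=12)

print(indices_labeled, y)  # [3, 40] [0, 1]
```

After either operation the previously trained model no longer matches the labeled pool, which is why the retrain flag exists.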

save(file)

Serializes the current active learner object into a single file for later re-use.

Parameters:

file (str or path or file) – Serialized output file to be written.

classmethod load(file)

Deserializes a serialized active learner.

Parameters:

file (str or path or file) – File to be loaded.