ActiveLearner API
- class small_text.active_learner.PoolBasedActiveLearner(clf_factory, query_strategy, x_train, incremental_training=False)
A pool-based active learner in which a pool holds all available unlabeled data. It uses a classifier and a query strategy, and maintains a partition of the whole training data into two mutually exclusive subsets: labeled and unlabeled.
- Parameters
clf_factory (small_text.classifiers.factories.AbstractClassifierFactory) – A factory responsible for creating new classifier instances.
query_strategy (small_text.query_strategies.QueryStrategy) – Query strategy which is responsible for selecting instances during a query() call.
x_train (small_text.data.Dataset) – A training dataset that is supported by the underlying classifier.
incremental_training (bool) – If True, a new classifier is created and trained only before the first query, and the existing classifier is re-trained on subsequent updates. If False, a new classifier is created for each training run. Incremental training must be supported by the classifier provided by clf_factory.
- x_indices_labeled
Indices of instances (relative to self.x_train) constituting the labeled pool.
- Type
numpy.ndarray
- x_indices_ignored
Indices of instances (relative to self.x_train) which have been ignored, i.e. which will never be returned by a query.
- Type
numpy.ndarray
- y
Labels for the current labeled pool. Each tuple (x_indices_labeled[i], y[i]) represents one labeled sample.
- Type
numpy.ndarray
- queried_indices
Queried indices returned by the last query() call, or None if no query has been executed yet.
- Type
numpy.ndarray or None
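The attributes above describe a partition of the training set by index. The following is a minimal sketch of that bookkeeping in plain Python lists; the helper `unlabeled_indices` and all values are hypothetical and are not part of the library, which stores these attributes as numpy arrays.

```python
def unlabeled_indices(num_samples, x_indices_labeled, x_indices_ignored):
    """Return the indices that are neither labeled nor ignored, in ascending order."""
    excluded = set(x_indices_labeled) | set(x_indices_ignored)
    return [i for i in range(num_samples) if i not in excluded]


# Toy state for a training set of 8 instances (all values are made up).
x_indices_labeled = [5, 0, 3]
y = [1, 0, 1]            # y[i] is the label of instance x_indices_labeled[i]
x_indices_ignored = [6]  # never offered to the query strategy

# Everything that is neither labeled nor ignored remains available to a query.
pool = unlabeled_indices(8, x_indices_labeled, x_indices_ignored)

# Pair each labeled index with its label, as described for the y attribute.
labeled_pairs = list(zip(x_indices_labeled, y))
```

In this sketch the order of `x_indices_labeled` is significant, because `y` is aligned with it position by position.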
- initialize_data(x_indices_initial, y_initial, x_indices_ignored=None, x_indices_validation=None, retrain=True)
(Re-)Initializes the current labeled pool.
This is required once before the first query() call, and whenever the labeled pool is changed from the outside, i.e. when self.x_train changes.
- Parameters
x_indices_initial (numpy.ndarray) – Indices (relative to self.x_train) of the initially labeled samples.
y_initial (numpy.ndarray) – Labels for the initially labeled samples. One label corresponds to each index in x_indices_initial.
x_indices_ignored (numpy.ndarray) – List of ignored samples which will be invisible to the query strategy.
x_indices_validation (numpy.ndarray) – The given indices (relative to self.x_indices_labeled) define a custom validation set if provided. Otherwise each classifier that uses a validation set will be responsible for creating a validation set. Only used if retrain=True.
retrain (bool) – Retrains the model after the update if True.
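Because x_indices_validation is given relative to self.x_indices_labeled rather than self.x_train, it can help to see the two index spaces side by side. The sketch below assumes that "relative to" means positions within the labeled-index array; the helper `resolve_validation_indices` and the sample values are hypothetical and only illustrate that reading.

```python
def resolve_validation_indices(x_indices_labeled, x_indices_validation):
    """Map positions within the labeled pool to indices into the training set."""
    return [x_indices_labeled[pos] for pos in x_indices_validation]


# Made-up initial labeling: four instances of the training set are labeled.
x_indices_initial = [10, 42, 7, 99]
y_initial = [0, 1, 1, 0]

# Positions 1 and 3 of the labeled pool form the custom validation set.
x_indices_validation = [1, 3]

# Training-set indices that end up in the validation set.
validation_in_train = resolve_validation_indices(x_indices_initial, x_indices_validation)

# The remaining labeled instances are left for training.
held_out = set(x_indices_validation)
training_in_train = [
    idx for pos, idx in enumerate(x_indices_initial) if pos not in held_out
]
```

Under this assumption, valid validation positions range from 0 to len(x_indices_initial) - 1, regardless of how large the training set is.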
- query(num_samples=10, x=None, query_strategy_kwargs=None)
Performs a query step, which selects a number of samples from the unlabeled pool. A query step must be followed by an update step.
- Parameters
num_samples (int) – Number of samples to query.
x (list-like) – Alternative representation for the samples in the unlabeled pool. This is used by some query strategies.
- Returns
queried_indices – List of queried indices (relative to the current unlabeled pool).
- Return type
numpy.ndarray
- Raises
LearnerNotInitializedException – Raised when the active learner was not initialized via initialize_data(…).
ValueError – Raised when arguments passed via args or kwargs are not consumed.
- update(y, x_indices_validation=None)
Performs an update step, which passes the label for each of the previously queried indices. An update step must be preceded by a query step. At the end of the update step the current model is retrained using all available labels.
- Parameters
y (list of int or numpy.ndarray) – Labels provided in response to the previous query. The label at index i corresponds to the i-th sample returned by that query. Setting a label to None will ignore the respective sample.
x_indices_validation (numpy.ndarray) – The given indices (relative to self.x_indices_labeled) define a custom validation set if provided. Otherwise each classifier that uses a validation set will be responsible for creating a validation set.
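The bookkeeping of an update step can be sketched as follows. This is an illustration in plain Python, not the library's implementation: the function `apply_update` is hypothetical, and it assumes that samples labeled None are moved to the ignored indices so that later queries skip them. Retraining is left out entirely.

```python
def apply_update(x_indices_labeled, y, x_indices_ignored, queried_indices, new_labels):
    """Return updated (labeled indices, labels, ignored indices) after one update step."""
    if len(new_labels) != len(queried_indices):
        raise ValueError('Expected one label (or None) per queried index.')

    # Work on copies so the caller's state is not modified in place.
    labeled = list(x_indices_labeled)
    labels = list(y)
    ignored = list(x_indices_ignored)

    # Labels are matched to the queried indices by position.
    for index, label in zip(queried_indices, new_labels):
        if label is None:
            ignored.append(index)   # assumption: None marks the sample as ignored
        else:
            labeled.append(index)
            labels.append(label)
    return labeled, labels, ignored


# Made-up example: three samples were queried, the second one is skipped.
labeled, labels, ignored = apply_update(
    x_indices_labeled=[5, 0],
    y=[1, 0],
    x_indices_ignored=[],
    queried_indices=[2, 7, 4],
    new_labels=[1, None, 0],
)
```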
- update_label_at(x_index, y, retrain=False, x_indices_validation=None)
Updates the label for the given x_index (with regard to self.x_train).
Notes
After updating labels, the current model might no longer reflect the labeled data. Consider whether retraining is necessary when using this operation. Since retraining is often time-consuming, retrain is set to False by default.
- Parameters
x_index (int) – Index (relative to self.x_train) of the sample whose label is to be updated.
y (int) – New label.
retrain (bool) – Retrains the model after the update if True.
x_indices_validation (numpy.ndarray) – The given indices (relative to self.x_indices_labeled) define a custom validation set if provided. Otherwise each classifier that uses a validation set will be responsible for creating a validation set.
- remove_label_at(x_index, retrain=False, x_indices_validation=None)
Removes the labeling for the given x_index (with regard to self.x_train).
Notes
After removing labels, the current model might no longer reflect the labeled data. Consider whether retraining is necessary when using this operation. Since retraining is often time-consuming, retrain is set to False by default.
- Parameters
x_index (int) – Index (relative to self.x_train) of the sample whose label is to be removed.
retrain (bool) – Retrains the model after the removal if True.
x_indices_validation (numpy.ndarray) – The given indices (relative to self.x_indices_labeled) define a custom validation set if provided. Otherwise each classifier that uses a validation set will be responsible for creating a validation set.
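Both update_label_at and remove_label_at take an index into self.x_train, which first has to be located within the labeled pool. The sketch below shows that lookup on parallel Python lists; the helpers `relabel` and `unlabel` are hypothetical stand-ins and do not reproduce the library's code. No retraining happens here, which mirrors the default retrain=False and is why the model can become stale.

```python
def relabel(x_indices_labeled, y, x_index, new_label):
    """Return a copy of y in which the label of training instance x_index is replaced."""
    position = x_indices_labeled.index(x_index)  # ValueError if x_index is not labeled
    updated = list(y)
    updated[position] = new_label
    return updated


def unlabel(x_indices_labeled, y, x_index):
    """Return copies of the labeled indices and labels without training instance x_index."""
    position = x_indices_labeled.index(x_index)  # ValueError if x_index is not labeled
    remaining_indices = x_indices_labeled[:position] + x_indices_labeled[position + 1:]
    remaining_labels = y[:position] + y[position + 1:]
    return remaining_indices, remaining_labels


# Made-up labeled pool: instances 5, 0 and 3 of the training set are labeled.
x_indices_labeled = [5, 0, 3]
y = [1, 0, 1]

y_after_update = relabel(x_indices_labeled, y, x_index=0, new_label=1)
indices_after_removal, y_after_removal = unlabel(x_indices_labeled, y, x_index=5)
```

In this sketch, an instance whose label was removed is no longer in the labeled pool and is not ignored either, so it would count as unlabeled again.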
- save(file)
Serializes the current active learner object into a single file for later re-use.
- Parameters
file (str or path or file) – Serialized output file to be written.
- classmethod load(file)
Deserializes a serialized active learner.
- Parameters
file (str or path or file) – File to be loaded.
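The file parameter of save and load accepts a string, a path, or an open file object. How such a parameter can be handled is sketched below with the standard library's pickle module. This is an assumption made for illustration only: the serialization format the library actually uses is not stated here, and `save_object` and `load_object` are hypothetical helpers.

```python
import io
import pickle
from pathlib import Path


def save_object(obj, file):
    """Serialize obj to a path (str or Path) or to an open binary file object."""
    if isinstance(file, (str, Path)):
        with open(file, 'wb') as f:
            pickle.dump(obj, f)
    else:
        pickle.dump(obj, file)


def load_object(file):
    """Deserialize an object from a path (str or Path) or an open binary file object."""
    if isinstance(file, (str, Path)):
        with open(file, 'rb') as f:
            return pickle.load(f)
    return pickle.load(file)


# Round trip through an in-memory binary buffer with made-up state.
state = {'x_indices_labeled': [5, 0, 3], 'y': [1, 0, 1]}
buffer = io.BytesIO()
save_object(state, buffer)
buffer.seek(0)  # rewind before reading
restored = load_object(buffer)
```

As with any pickle-based format, files should only be loaded from trusted sources.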