ActiveLearner API
- class small_text.active_learner.PoolBasedActiveLearner(clf_factory, query_strategy, dataset, fit_kwargs=dict(), reuse_model=False)
A pool-based active learner in which a pool holds all available unlabeled data.
It uses a classifier and a query strategy, and maintains the partition of the whole training data into two mutually exclusive subsets: labeled and unlabeled.
- Parameters
clf_factory (small_text.classifiers.factories.AbstractClassifierFactory) – A factory responsible for creating new classifier instances.
query_strategy (small_text.query_strategies.QueryStrategy) – Query strategy which is responsible for selecting instances during a query() call.
dataset (Dataset) – A training dataset that is supported by the underlying classifier.
reuse_model (bool, default=False) – Reuses the previous model during retraining (if a previous model exists), otherwise creates a new model for each retraining.
- indices_labeled
Indices of instances (relative to self.x_train) constituting the labeled pool.
- Type
- indices_ignored
Indices of instances (relative to self.x_train) which have been ignored, i.e. which will never be returned by a query.
- Type
- y
Labels for the current labeled pool. Each tuple (x_indices_labeled[i], y[i]) represents one labeled sample.
- Type
- indices_queried
Queried indices returned by the last query() call, or None if no query has been executed yet.
- Type
numpy.ndarray or None
- fit_kwargs
Keyword arguments that will be passed to the fit() call during update().
- Type
dict
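The attributes above describe a partition of the pool by index. As a rough illustration of that bookkeeping (not the library's implementation), the following sketch uses a hypothetical PoolPartition class with plain Python containers in place of numpy arrays. The class name and its unlabeled() helper are assumptions made for this example only.

```python
class PoolPartition:
    """Toy index bookkeeping for a pool-based learner (hypothetical, not small_text code)."""

    def __init__(self, pool_size):
        self.pool_size = pool_size
        self.labeled = {}     # dataset index -> label
        self.ignored = set()  # dataset indices that a query must never return

    def unlabeled(self):
        # Anything that is neither labeled nor ignored is still queryable.
        return [i for i in range(self.pool_size)
                if i not in self.labeled and i not in self.ignored]


partition = PoolPartition(pool_size=6)
partition.labeled.update({0: 1, 3: 0})  # two labeled samples
partition.ignored.add(5)                # one ignored sample
queryable = partition.unlabeled()
print(queryable)  # [1, 2, 4]
```

In this sketch the three groups never overlap, which mirrors the "mutually exclusive" wording in the class description.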
- initialize_data(indices_initial, y_initial, indices_ignored=None, indices_validation=None, retrain=True)
(Re-)Initializes the current labeled pool.
This is required once before the first query() call, and whenever the labeled pool is changed from the outside, i.e. when self.x_train changes.
- Parameters
indices_initial (numpy.ndarray) – A list of indices (relative to self.x_train) of initially labeled samples.
y_initial (numpy.ndarray or scipy.sparse.csr_matrix) – Label matrix. One row corresponds to an index in indices_initial. If the passed type is numpy.ndarray (dense), all further label-based operations assume dense labels; if it is scipy.sparse.csr_matrix, they assume sparse labels.
indices_ignored (numpy.ndarray) – List of ignored samples which will be invisible to the query strategy.
indices_validation (numpy.ndarray, default=None) – The given indices (relative to self.x_indices_labeled) define a custom validation set if provided. Otherwise each classifier that uses a validation set will be responsible for creating a validation set. Only used if retrain=True.
retrain (bool, default=True) – Retrains the model after the update if True.
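The two index spaces used here are easy to confuse: indices_initial refers to positions in the full training set, while indices_validation is described as relative to the labeled indices. The sketch below shows one reading of that description, assuming "relative to" means positions within the labeled index array. All values are made up, and plain lists stand in for numpy arrays.

```python
# Hypothetical values; plain lists stand in for numpy arrays.
indices_labeled = [12, 40, 7, 99, 3]  # positions in the full training set
indices_validation = [1, 3]           # positions within indices_labeled

validation_positions = set(indices_validation)

# Resolve the validation positions to positions in the full training set.
validation_in_dataset = [indices_labeled[i] for i in indices_validation]

# The remaining labeled samples would be used for training.
train_in_dataset = [idx for pos, idx in enumerate(indices_labeled)
                    if pos not in validation_positions]

print(validation_in_dataset)  # [40, 99]
print(train_in_dataset)       # [12, 7, 3]
```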
- query(num_samples=10, representation=None, query_strategy_kwargs=dict())
Performs a query step, which selects a number of samples from the unlabeled pool. A query step must be followed by an update step.
- Parameters
num_samples (int, default=10) – Number of samples to query.
representation (numpy.ndarray, default=None) – Alternative representation for the samples in the unlabeled pool. This can be used if you want to rely on pre-computed fixed representations instead of embeddings that change during each active learning iteration.
query_strategy_kwargs (dict, default=dict()) – Keyword arguments passed on to the query strategy.
- Returns
queried_indices – List of queried indices (relative to the current unlabeled pool).
- Return type
numpy.ndarray[int]
- Raises
LearnerNotInitializedException – Thrown when the active learner was not initialized via initialize_data(…).
ValueError – Raised when args or kwargs are not used and consumed.
- update(y, indices_validation=None)
Performs an update step, which passes the label for each of the previously queried indices.
An update step must be preceded by a query step. At the end of the update step the current model is retrained using all available labels.
- Parameters
y (numpy.ndarray or scipy.sparse.csr_matrix) – Labels provided in response to the previous query. For single-label data (ndarray), each label at index i corresponds to the sample x[i]; for multi-label data (csr_matrix), each row of labels at index i corresponds to the sample x[i]. Setting the label / row of labels to small_text.base.LABEL_IGNORED will ignore the respective sample.
indices_validation (numpy.ndarray, default=None) – The given indices (relative to self.x_indices_labeled) define a custom validation set if provided. Otherwise each classifier that uses a validation set will be responsible for creating a validation set.
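The query and update steps alternate: each update() consumes the indices returned by the preceding query(). The following is a minimal sketch of that contract using a hypothetical ToyLearner that samples at random and does no training. The LABEL_IGNORED sentinel defined here is a stand-in for the library constant, and for simplicity the toy returns dataset indices from query().

```python
import random

LABEL_IGNORED = object()  # hypothetical stand-in for the library's constant


class ToyLearner:
    """Toy query/update cycle (hypothetical, random sampling, no classifier)."""

    def __init__(self, pool_size, seed=0):
        self.pool_size = pool_size
        self.labels = {}             # dataset index -> label
        self.ignored = set()         # indices excluded from future queries
        self.indices_queried = None  # result of the last query(), if any
        self._rng = random.Random(seed)

    def query(self, num_samples=10):
        candidates = [i for i in range(self.pool_size)
                      if i not in self.labels and i not in self.ignored]
        count = min(num_samples, len(candidates))
        self.indices_queried = self._rng.sample(candidates, count)
        return self.indices_queried

    def update(self, y):
        if self.indices_queried is None:
            raise RuntimeError("update() must be preceded by query()")
        if len(y) != len(self.indices_queried):
            raise ValueError("expected one label per queried index")
        for index, label in zip(self.indices_queried, y):
            if label is LABEL_IGNORED:
                self.ignored.add(index)     # never offered again
            else:
                self.labels[index] = label  # joins the labeled pool
        self.indices_queried = None
        # A real learner would retrain its classifier at this point.


learner = ToyLearner(pool_size=20)
queried = learner.query(num_samples=4)
learner.update([1, LABEL_IGNORED, 0, 1])  # second sample is skipped
next_batch = learner.query(num_samples=5)
```

In the toy, the skipped sample moves to the ignored set, so the second query cannot return it or any of the three newly labeled samples.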
- update_label_at(index, y, retrain=False, indices_validation=None)
Updates the label for the given index (with regard to self.x_train).
Notes
After adding labels the current model might not reflect the labeled data anymore. You should consider if a retraining is necessary when using this operation. Since retraining is often time-consuming, retrain is set to False by default.
- Parameters
index (int) – Data index (relative to self.x_train) for which the label should be updated.
y (int or numpy.ndarray) – The new label(s) to be assigned for the sample at the given index.
retrain (bool, default=False) – Retrains the model after the update if True.
indices_validation (numpy.ndarray, default=None) – The given indices (relative to self.x_indices_labeled) define a custom validation set if provided. This is only used if retrain is True.
- remove_label_at(x_index, retrain=False, x_indices_validation=None)
Removes the labeling for the given x_index (with regard to self.x_train).
Notes
After removing labels the current model might not reflect the labeled data anymore. You should consider if a retraining is necessary when using this operation. Since retraining is often time-consuming, retrain is set to False by default.
- Parameters
x_index (int) – Data index (relative to self.x_train) for which the label should be removed.
retrain (bool, default=False) – Retrains the model after removal if True.
x_indices_validation (numpy.ndarray, default=None) – The given indices (relative to self.x_indices_labeled) define a custom validation set if provided. This is only used if retrain is True.
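Because retrain defaults to False for both operations, several label corrections can be batched and followed by a single retraining. The sketch below illustrates that pattern with a hypothetical LabelStore class whose _retrain() only counts calls. It is an assumption-laden toy, not the library's code.

```python
class LabelStore:
    """Toy label editing with deferred retraining (hypothetical)."""

    def __init__(self, labels):
        self.labels = dict(labels)  # dataset index -> label
        self.retrain_count = 0

    def _retrain(self):
        # Placeholder for an expensive model fit.
        self.retrain_count += 1

    def update_label_at(self, index, y, retrain=False):
        self.labels[index] = y
        if retrain:
            self._retrain()

    def remove_label_at(self, index, retrain=False):
        del self.labels[index]
        if retrain:
            self._retrain()


store = LabelStore({2: 0, 5: 1, 9: 1})
store.update_label_at(2, 1)               # correction, no retraining yet
store.remove_label_at(5)                  # removal, no retraining yet
store.update_label_at(9, 0, retrain=True) # last edit triggers one retraining
print(store.labels)         # {2: 1, 9: 0}
print(store.retrain_count)  # 1
```

Until the final call, the toy model is stale with respect to the edited labels, which is the trade-off the notes above describe.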
- save(file)
Serializes the current active learner object into a single file for later re-use.
- Parameters
file (str or path or file) – Serialized output file to be written.
- classmethod load(file)
Deserializes a serialized active learner.
- Parameters
file (str or path or file) – File to be loaded.
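The save/load pair is meant to round-trip the learner's state through a single file. The serialization format is not stated here, so the following sketch simply assumes a pickle-style approach on a hypothetical ToyLearner, accepting either a path or an open binary file object as the file argument.

```python
import io
import pickle


class ToyLearner:
    """Toy save/load round trip (hypothetical, pickle-based by assumption)."""

    def __init__(self, labels=None, ignored=None):
        self.labels = dict(labels or {})
        self.ignored = set(ignored or ())

    def save(self, file):
        state = {"labels": self.labels, "ignored": self.ignored}
        if hasattr(file, "write"):       # already an open binary file object
            pickle.dump(state, file)
        else:                            # treat as str or path
            with open(file, "wb") as f:
                pickle.dump(state, f)

    @classmethod
    def load(cls, file):
        if hasattr(file, "read"):
            state = pickle.load(file)
        else:
            with open(file, "rb") as f:
                state = pickle.load(f)
        return cls(labels=state["labels"], ignored=state["ignored"])


buffer = io.BytesIO()
original = ToyLearner(labels={0: 1, 3: 0}, ignored={5})
original.save(buffer)
buffer.seek(0)  # rewind before reading back
restored = ToyLearner.load(buffer)
```

As with any pickle-based format, a file from an untrusted source should not be loaded.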