Classifier API

Classifiers

Core

class small_text.classifiers.classification.Classifier

Abstract base class for classifiers that can be used with the active learning components.

abstract fit(train_set: Dataset, weights: npt.NDArray[np.double] | None = None) → Classifier

Train the model using the given train set.

Parameters:

train_set (Dataset) – The dataset used for training the model.
weights (np.ndarray[np.double] or None, default=None) – Sample weights or None.

abstract predict(data_set: Dataset, return_proba: bool = False, multi_label_threshold: float = 0.5, **kwargs) → npt.NDArray[np.uint] | Tuple[npt.NDArray[np.uint], npt.NDArray[np.double]] | csr_matrix | Tuple[csr_matrix, csr_matrix]

Predicts the labels for each sample in the given dataset.

Parameters:

data_set (Dataset) – A dataset for which the labels are to be predicted.
return_proba (bool, default=False) – If True, also returns a probability-like class distribution.
multi_label_threshold (float, default=0.5) – In multi-label classification, a label is predicted for a sample only if the respective probability value is greater than multi_label_threshold. Must be between 0.0 and 1.0. Ignored when multi_label is False.

abstract predict_proba(data_set: Dataset, multi_label_threshold: float = 0.5, **kwargs) → npt.NDArray[np.double]

Predicts the label distribution for each sample in the given dataset.

Parameters:

data_set (Dataset) – A dataset for which the labels are to be predicted.
multi_label_threshold (float, default=0.5) – In multi-label classification, a label is predicted for a sample only if the respective probability value is greater than multi_label_threshold. Must be between 0.0 and 1.0. Ignored when multi_label is False.
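The interface above can be sketched with a stand-in base class. This is a minimal sketch and not small-text code: ClassifierSketch and MajorityClassifier are hypothetical names, and datasets are represented as plain Python lists. A real implementation would subclass small_text.classifiers.classification.Classifier and accept Dataset objects.

```python
from abc import ABC, abstractmethod
from collections import Counter


class ClassifierSketch(ABC):
    """Hypothetical stand-in that mirrors the documented method names."""

    @abstractmethod
    def fit(self, train_set, weights=None):
        ...

    @abstractmethod
    def predict(self, data_set, return_proba=False):
        ...

    @abstractmethod
    def predict_proba(self, data_set):
        ...


class MajorityClassifier(ClassifierSketch):
    """Toy single-label classifier that always predicts the most frequent training label."""

    def __init__(self, num_classes):
        self.num_classes = num_classes
        self.majority_ = None

    def fit(self, train_set, weights=None):
        # Assumption: train_set is a list of (text, label) pairs.
        counts = Counter(label for _, label in train_set)
        self.majority_ = counts.most_common(1)[0][0]
        return self  # fit returns the classifier itself

    def predict_proba(self, data_set):
        # One row per sample, with all probability mass on the majority class.
        row = [1.0 if c == self.majority_ else 0.0 for c in range(self.num_classes)]
        return [list(row) for _ in data_set]

    def predict(self, data_set, return_proba=False):
        proba = self.predict_proba(data_set)
        predictions = [max(range(self.num_classes), key=row.__getitem__) for row in proba]
        return (predictions, proba) if return_proba else predictions


train = [("good", 1), ("great", 1), ("bad", 0)]
clf = MajorityClassifier(num_classes=2).fit(train)
predictions, proba = clf.predict(["unseen text"], return_proba=True)
```

With return_proba=True the call yields a (predictions, probabilities) pair; without it, only the predictions are returned.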

class small_text.classifiers.classification.SklearnClassifier(model: BaseEstimator, num_classes, multi_label=False)

An adapter for using scikit-learn estimators.

Notes

The multi-label setting currently assumes that the underlying classifier returns a sparse matrix if trained on sparse data.

__init__(model: BaseEstimator, num_classes, multi_label=False)

Parameters:

model (BaseEstimator) – A scikit-learn estimator that implements fit and predict_proba.
num_classes (int) – Number of classes which are to be trained and predicted.
multi_label (bool, default=False) – If False, the classes are mutually exclusive, i.e. the prediction step results in exactly one predicted label per instance.

fit(train_set, weights=None)

Trains the model using the given train set.

Parameters:

train_set (SklearnDataset) – The dataset used for training the model.
weights (np.ndarray[np.double] or None, default=None) – Sample weights or None.

Returns:

clf – Returns the current classifier with a fitted model.

Return type:

SklearnClassifier

predict(data_set: Dataset, return_proba=False, multi_label_threshold: float = 0.5)

Predicts the labels for the given dataset.

Parameters:

data_set (SklearnDataset) – A dataset for which the labels are to be predicted.
return_proba (bool, default=False) – If True, also returns a probability-like class distribution.
multi_label_threshold (float, default=0.5) – In multi-label classification, a label is predicted for a sample only if the respective probability value is greater than multi_label_threshold. Must be between 0.0 and 1.0. Ignored when multi_label is False.

Returns:

predictions (np.ndarray[np.uint] or csr_matrix) – List of predictions if the classifier was fitted on single-label data, otherwise a sparse matrix of predictions.
probas (np.ndarray[np.double]) – List of probabilities (or confidence estimates) if return_proba is True.

predict_proba(data_set: Dataset, multi_label_threshold: float = 0.5)

Predicts the label distribution for each sample in the given dataset.

Parameters:

data_set (SklearnDataset) – A dataset for which the labels are to be predicted.
multi_label_threshold (float, default=0.5) – In multi-label classification, a label is predicted for a sample only if the respective probability value is greater than multi_label_threshold. Must be between 0.0 and 1.0. Ignored when multi_label is False.
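The effect of multi_label_threshold can be illustrated in a few lines. This is a sketch under assumptions: apply_multi_label_threshold is a hypothetical helper, and probabilities and results are plain nested lists rather than the csr_matrix the classifiers return. The comparison is strict, following the "greater than" wording above.

```python
def apply_multi_label_threshold(proba, multi_label_threshold=0.5):
    """Return, per sample, the label indices whose probability exceeds the threshold."""
    if not 0.0 <= multi_label_threshold <= 1.0:
        raise ValueError("multi_label_threshold must be between 0.0 and 1.0")
    return [
        [label for label, p in enumerate(row) if p > multi_label_threshold]
        for row in proba
    ]


proba = [
    [0.9, 0.5, 0.1],  # only label 0 is strictly above 0.5
    [0.6, 0.2, 0.7],  # labels 0 and 2
    [0.3, 0.4, 0.2],  # no label passes, so this sample gets no labels
]
labels = apply_multi_label_threshold(proba)
relaxed = apply_multi_label_threshold(proba, multi_label_threshold=0.25)
```

Lowering the threshold admits more labels per sample; a sample may also end up with no labels at all.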

Pytorch Integration

class small_text.integrations.pytorch.classifiers.base.AMPArguments(use_amp: bool = False, device_type=None, dtype=torch.bfloat16)

Arguments for configuring Automatic Mixed Precision (AMP).

See also

PyTorch Docs: Automatic Mixed Precision Package <https://pytorch.org/docs/stable/amp.html>

PyTorch Docs: Automatic Mixed Precision Recipes <https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html>

Added in version 2.0.0.

__init__(use_amp: bool = False, device_type=None, dtype=torch.bfloat16)

Parameters:

use_amp (bool, default=False) – Enables AMP if True.
device_type (str) – Device type to be used for torch.autocast (‘cuda’ or ‘cpu’).
dtype (torch.dtype, default=torch.bfloat16) – Data type to be used for torch.autocast (torch.float16 or torch.bfloat16).

class small_text.integrations.pytorch.classifiers.KimCNNClassifier(num_classes, multi_label=False, embedding_matrix=None, device=None, num_epochs: int = 10, train_batch_size: int = 25, predict_batch_size: int = 25, lr=0.001, max_seq_len=60, out_channels=100, filter_padding=0, dropout=0.5, validation_set_size=0.1, padding_idx=0, kernel_heights=[3, 4, 5], show_progress_bar=None, class_weight=None, amp_args=None, compile_model=False, verbosity=VERBOSITY_MORE_VERBOSE)

__init__(num_classes, multi_label=False, embedding_matrix=None, device=None, num_epochs: int = 10, train_batch_size: int = 25, predict_batch_size: int = 25, lr=0.001, max_seq_len=60, out_channels=100, filter_padding=0, dropout=0.5, validation_set_size=0.1, padding_idx=0, kernel_heights=[3, 4, 5], show_progress_bar=None, class_weight=None, amp_args=None, compile_model=False, verbosity=VERBOSITY_MORE_VERBOSE)

Parameters:

num_classes (int) – Number of classes.
multi_label (bool, default=False) – If False, the classes are mutually exclusive, i.e. the prediction step results in exactly one predicted label per instance.
embedding_matrix (torch.FloatTensor) – A tensor of embeddings in the shape of (vocab_size, embedding_size).
device (str or torch.device, default=None) – Torch device on which the computation will be performed.
num_epochs (int, default=10) – Epochs to train.
train_batch_size (int, default=25) – Batch size during training.
predict_batch_size (int, default=25) – Batch size during prediction.
lr (float, default=0.001) – Learning rate.
max_seq_len (int, default=60) – Maximum sequence length.
out_channels (int, default=100) – Number of output channels.
filter_padding (int, default=0) – Size of the padding to add before and after the sequence before applying the filters.
dropout (float, default=0.5) – Dropout probability for the final layer in KimCNN.
validation_set_size (float, default=0.1) – The size of the validation set as a fraction of the training set.
padding_idx (int, default=0) – Index of the padding token (as given by the vocab).
kernel_heights (list of int, default=[3, 4, 5]) – Kernel sizes.
class_weight (‘balanced’ or None, default=None) – If ‘balanced’, then the loss function is weighted inversely proportional to the label distribution of the current train set.
show_progress_bar (bool or None, default=None) – Determines whether progress bars are shown. If None, the small-text default is used.
amp_args (AMPArguments, default=None) – Configures the use of Automatic Mixed Precision (AMP).

See also

AMPArguments

Added in version 2.0.0.
compile_model (bool, default=False) – Compiles the model (using torch.compile) if True and the PyTorch version is greater than or equal to 2.0.0.

Added in version 2.0.0.

fit(train_set, validation_set=None, weights=None, early_stopping=None, model_selection=None, optimizer=None, scheduler=None)

Trains the model using the given train set.

Parameters:

train_set (PytorchTextClassificationDataset) – The dataset used for training the model.
validation_set (PytorchTextClassificationDataset) – A validation set used for validation during training, or None. If None, the fit operation will split apart a subset of the train set as a validation set, whose size is set by self.validation_set_size.
weights (np.ndarray[np.float32] or None, default=None) – Sample weights or None.
early_stopping (EarlyStoppingHandler or 'none') – A strategy for early stopping. Passing ‘none’ disables early stopping.
model_selection (ModelSelectionHandler or None, default=None) – A model selection handler. Passing ‘none’ disables model selection.
optimizer (torch.optim.optimizer.Optimizer or None, default=None) – A pytorch optimizer.
scheduler (torch.optim._LRScheduler or None, default=None) – A pytorch scheduler.

Returns:

self – Returns the current classifier with a fitted model.

Return type:

KimCNNClassifier
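When validation_set is None, fit holds out a fraction of the train set given by validation_set_size. The sketch below shows one simple way such a split could work on sample indices. It is an illustration under assumptions: split_train_validation and its seed parameter are hypothetical, and the library's own split may differ (for example, by stratifying over labels).

```python
import random


def split_train_validation(num_samples, validation_set_size=0.1, seed=0):
    """Split the indices 0..num_samples-1 into (train, validation) index lists."""
    if not 0.0 < validation_set_size < 1.0:
        raise ValueError("validation_set_size must be in (0, 1)")
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    # Hold out at least one sample for validation.
    num_validation = max(1, int(num_samples * validation_set_size))
    return indices[num_validation:], indices[:num_validation]


train_idx, valid_idx = split_train_validation(50, validation_set_size=0.1)
```

Passing an explicit validation_set to fit avoids the split entirely and keeps the full train set available for training.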

validate(validation_set)

Obtains validation scores (loss, accuracy) for the given validation set.

Parameters:

validation_set (PytorchTextClassificationDataset) – Validation set.

Returns:

validation_loss (float) – Validation loss.
validation_acc (float) – Validation accuracy.

predict(dataset, return_proba=False, multi_label_threshold: float = 0.5)

Predicts the labels for the given dataset.

Parameters:

dataset (PytorchTextClassificationDataset) – A dataset on whose instances predictions are made.
return_proba (bool) – If True, additionally returns the confidence distribution over all classes.
multi_label_threshold (float, default=0.5) – In multi-label classification, a label is predicted for a sample only if the respective probability value is greater than multi_label_threshold. Must be between 0.0 and 1.0. Ignored when multi_label is False.

Returns:

predictions (np.ndarray[np.int32] or csr_matrix[np.int32]) – List of predictions if the classifier was fitted on single-label data, otherwise a sparse matrix of predictions.
probas (np.ndarray[np.float32] (optional)) – List of probabilities (or confidence estimates) if return_proba is True.

predict_proba(dataset, multi_label_threshold: float = 0.5, dropout_sampling=1)

Predicts the label distributions.

Parameters:

dataset (PytorchTextClassificationDataset) – A dataset whose labels will be predicted.
multi_label_threshold (float, default=0.5) – In multi-label classification, a label is predicted for a sample only if the respective probability value is greater than multi_label_threshold. Must be between 0.0 and 1.0. Ignored when multi_label is False.
dropout_sampling (int) – If dropout_sampling > 1 then all dropout modules will be enabled during prediction and multiple rounds of predictions will be sampled for each instance.

Returns:

scores – Confidence score distribution over all classes of shape (num_samples, num_classes). If dropout_sampling > 1 then the shape is (num_samples, dropout_sampling, num_classes).

Return type:

np.ndarray
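The shape change caused by dropout_sampling can be illustrated without a model. In this sketch, sample_predictions is a hypothetical stand-in for repeated stochastic forward passes and returns nested lists instead of np.ndarray. Averaging over the sampling axis is one common way to aggregate such samples (Monte Carlo dropout); it is not something this API prescribes.

```python
import random


def sample_predictions(num_samples, num_classes, dropout_sampling, seed=0):
    """Stand-in for stochastic forward passes.

    Returns nested lists of shape (num_samples, dropout_sampling, num_classes),
    where every innermost row sums to 1.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(num_samples):
        rounds = []
        for _ in range(dropout_sampling):
            raw = [rng.random() + 1e-9 for _ in range(num_classes)]
            total = sum(raw)
            rounds.append([r / total for r in raw])
        scores.append(rounds)
    return scores


def mean_over_rounds(scores):
    """Average the sampling axis, giving shape (num_samples, num_classes)."""
    return [
        [sum(col) / len(rounds) for col in zip(*rounds)]
        for rounds in scores
    ]


scores = sample_predictions(num_samples=4, num_classes=3, dropout_sampling=5)
mean_scores = mean_over_rounds(scores)
```

Code that consumes predict_proba output should therefore check the number of dimensions before indexing when dropout_sampling > 1 is used.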

Transformers Integration

class small_text.integrations.transformers.classifiers.classification.TransformerBasedClassification(transformer_model_args: TransformerModelArguments, num_classes: int, multi_label: bool = False, num_epochs: int = 10, lr: float = 2e-5, validation_set_size: float = 0.1, validations_per_epoch: int = 1, device=None, memory_fix=1, class_weight=None, amp_args=None, verbosity=VERBOSITY_MORE_VERBOSE, cache_dir='.active_learning_lib_cache/')

__init__(transformer_model_args: TransformerModelArguments, num_classes: int, multi_label: bool = False, num_epochs: int = 10, lr: float = 2e-5, validation_set_size: float = 0.1, validations_per_epoch: int = 1, device=None, memory_fix=1, class_weight=None, amp_args=None, verbosity=VERBOSITY_MORE_VERBOSE, cache_dir='.active_learning_lib_cache/')

Parameters:

transformer_model_args (TransformerModelArguments) – Settings for transformer model, tokenizer and config.
num_classes (int) – Number of classes.
multi_label (bool, default=False) – If False, the classes are mutually exclusive, i.e. the prediction step results in exactly one predicted label per instance.
num_epochs (int, default=10) – Epochs to train.
lr (float, default=2e-5) – Learning rate.
validation_set_size (float, default=0.1) – The size of the validation set as a fraction of the training set.
validations_per_epoch (int, default=1) – Defines how often the validation set is evaluated during the training of a single epoch.
device (str or torch.device, default=None) – Torch device on which the computation will be performed.
memory_fix (int, default=1) – If this value is greater than zero, the CUDA cache is emptied every memory_fix-many epochs to force unused GPU memory to be released.
class_weight ('balanced' or None, default=None) – If ‘balanced’, then the loss function is weighted inversely proportional to the label distribution of the current train set.
amp_args (AMPArguments, default=None) –
Configures the use of Automatic Mixed Precision (AMP).

See also

AMPArguments

Added in version 2.0.0.
verbosity (int) – Controls the verbosity of logging messages. Lower values result in fewer log messages. Set this to VERBOSITY_QUIET or 0 for the minimum amount of logging.

fit(train_set, validation_set=None, weights=None, early_stopping=None, model_selection=None, optimizer=None, scheduler=None)

Trains the model using the given train set.

Parameters:

train_set (TransformersDataset) – Training set.
validation_set (TransformersDataset, default=None) – A validation set used for validation during training, or None. If None, the fit operation will split apart a subset of the train set as a validation set, whose size is set by self.validation_set_size.
weights (np.ndarray[np.float32] or None, default=None) – Sample weights or None.
early_stopping (EarlyStoppingHandler or 'none') – A strategy for early stopping. Passing ‘none’ disables early stopping.
model_selection (ModelSelectionHandler or None, default=None) – A model selection handler. Passing ‘none’ disables model selection.
optimizer (torch.optim.optimizer.Optimizer or None, default=None) – A pytorch optimizer.
scheduler (torch.optim.LRScheduler or None, default=None) – A pytorch scheduler.

Returns:

self – Returns the current classifier with a fitted model.

Return type:

TransformerBasedClassification

predict(dataset, return_proba: bool = False, multi_label_threshold: float = 0.5)

Predicts the labels for the given dataset.

Parameters:

dataset (TransformersDataset) – A dataset on whose instances predictions are made.
return_proba (bool, default=False) – If True, additionally returns the confidence distribution over all classes.
multi_label_threshold (float, default=0.5) – In multi-label classification, a label is predicted for a sample only if the respective probability value is greater than multi_label_threshold. Must be between 0.0 and 1.0. Ignored when multi_label is False.

Returns:

predictions (np.ndarray[np.int32] or csr_matrix[np.int32]) – List of predictions if the classifier was fitted on single-label data, otherwise a sparse matrix of predictions.
probas (np.ndarray[np.float32], optional) – List of probabilities (or confidence estimates) if return_proba is True.

predict_proba(dataset, multi_label_threshold: float = 0.5, dropout_sampling=1)

Predicts the label distributions.

Parameters:

dataset (TransformersDataset) – A dataset whose labels will be predicted.
multi_label_threshold (float, default=0.5) – In multi-label classification, a label is predicted for a sample only if the respective probability value is greater than multi_label_threshold. Must be between 0.0 and 1.0. Ignored when multi_label is False.
dropout_sampling (int, default=1) – If dropout_sampling > 1 then all dropout modules will be enabled during prediction and multiple rounds of predictions will be sampled for each instance.

Returns:

scores – Confidence score distribution over all classes of shape (num_samples, num_classes). If dropout_sampling > 1 then the shape is (num_samples, dropout_sampling, num_classes).

Return type:

np.ndarray

class small_text.integrations.transformers.classifiers.classification.TransformerModelArguments(model: str, tokenizer=None, config=None, model_kwargs: dict = {}, tokenizer_kwargs: dict = {}, config_kwargs: dict = {}, train_batch_size: int = 12, predict_batch_size: int = 12, show_progress_bar: None | bool = None, model_loading_strategy: ModelLoadingStrategy = get_default_model_loading_strategy(), compile_model: bool = False)

Model arguments for TransformerBasedClassification.

__init__(model: str, tokenizer=None, config=None, model_kwargs: dict = {}, tokenizer_kwargs: dict = {}, config_kwargs: dict = {}, train_batch_size: int = 12, predict_batch_size: int = 12, show_progress_bar: None | bool = None, model_loading_strategy: ModelLoadingStrategy = get_default_model_loading_strategy(), compile_model: bool = False)

Parameters:

model (str) – Name of the transformer model. Will be passed into AutoModel.from_pretrained().
tokenizer (str, default=None) – Name of the tokenizer if deviating from the model name. Will be passed into AutoTokenizer.from_pretrained().
config (str, default=None) – Name of the config if deviating from the model name. Will be passed into AutoConfig.from_pretrained().
model_kwargs (dict, default={}) –
Additional kwargs that will be passed into AutoModelForSequenceClassification.from_pretrained(). Arguments that are managed by small-text (such as the model name given by model) are excluded.

See also

AutoModelForSequenceClassification.from_pretrained() in transformers.
tokenizer_kwargs (dict, default={}) –
Additional kwargs that will be passed into AutoTokenizer.from_pretrained(). Arguments that are managed by small-text (such as the tokenizer name given by tokenizer) are excluded.

See also

AutoTokenizer.from_pretrained() in transformers.
config_kwargs (dict, default={}) –
Additional kwargs that will be passed into AutoConfig.from_pretrained(). Arguments that are managed by small-text (such as the config name given by config) are excluded.

See also

AutoConfig.from_pretrained() in transformers.
train_batch_size (int, default=12) – Batch size during training.
predict_batch_size (int, default=12) – Batch size during prediction.
show_progress_bar (None or bool, default=None) – Determines whether progress bars are shown. If None, the small-text default is used.
model_loading_strategy (ModelLoadingStrategy, default=ModelLoadingStrategy.DEFAULT) – Specifies if there should be attempts to download the model or if only local files should be used.
compile_model (bool, default=False) –
Compiles the model (using torch.compile) if True, provided that the PyTorch version is greater than or equal to 2.0.0.

Added in version 2.0.0.

Factories

Core

class small_text.classifiers.factories.SklearnClassifierFactory(base_estimator: BaseEstimator, num_classes: int, kwargs: dict = {})

__init__(base_estimator: BaseEstimator, num_classes: int, kwargs: dict = {})

Parameters:

base_estimator (BaseEstimator) – A scikit-learn estimator which is used as a template for creating new classifier objects.
num_classes (int) – Number of classes.
kwargs (dict) – Keyword arguments that are passed to the constructor of each classifier that is built by the factory.

new() → SklearnClassifier

Creates a new SklearnClassifier instance.

Returns:

classifier – A new instance of SklearnClassifier which is initialized with the given keyword args kwargs.

Return type:

SklearnClassifier
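The factory idea, a template estimator plus constructor keyword arguments producing a fresh classifier on every new() call, can be sketched with stand-in classes. Everything here is hypothetical (EstimatorStub, ClassifierStub, ClassifierFactorySketch), and copying the template with copy.deepcopy is an assumption about how independence between classifiers might be achieved, not a statement about small-text's implementation.

```python
import copy


class EstimatorStub:
    """Hypothetical stand-in for a scikit-learn estimator."""

    def __init__(self, C=1.0):
        self.C = C
        self.fitted = False


class ClassifierStub:
    """Hypothetical stand-in for SklearnClassifier."""

    def __init__(self, model, num_classes, multi_label=False):
        self.model = model
        self.num_classes = num_classes
        self.multi_label = multi_label


class ClassifierFactorySketch:
    def __init__(self, base_estimator, num_classes, kwargs=None):
        self.base_estimator = base_estimator
        self.num_classes = num_classes
        self.kwargs = dict(kwargs or {})  # copy to avoid sharing a mutable default

    def new(self):
        # Copy the template so that classifiers never share fitted state.
        model = copy.deepcopy(self.base_estimator)
        return ClassifierStub(model, self.num_classes, **self.kwargs)


factory = ClassifierFactorySketch(EstimatorStub(C=0.5), num_classes=3,
                                  kwargs={"multi_label": True})
first, second = factory.new(), factory.new()
first.model.fitted = True  # mutating one classifier leaves the other untouched
```

A factory of this kind lets an active learning loop start each iteration from an untrained classifier with identical settings.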

Pytorch Integration

class small_text.integrations.pytorch.classifiers.factories.KimCNNClassifierFactory(num_classes: int, classification_kwargs: dict = {})

__init__(num_classes: int, classification_kwargs: dict = {})

Parameters:

num_classes (int) – Number of classes.
classification_kwargs (dict, default={}) – Keyword arguments that are passed to the constructor of each classifier that is built by the factory.

new() → KimCNNClassifier

Creates a new KimCNNClassifier instance.

Returns:

classifier – A new instance of KimCNNClassifier which is initialized with the given keyword args classification_kwargs.

Return type:

KimCNNClassifier