Classifier API



class small_text.classifiers.classification.Classifier[source]

Abstract base class for classifiers that can be used with the active learning components.

abstract fit(train_set, weights=None)

Trains the model using the given train set.

  • train_set (Dataset) – The dataset used for training the model.

  • weights (np.ndarray[np.float32] or None, default=None) – Sample weights or None.

abstract predict(data_set, return_proba=False)

Predicts the labels for each sample in the given dataset.

  • data_set (Dataset) – A dataset for which the labels are to be predicted.

  • return_proba (bool, default=False) – If True, also returns a probability-like class distribution.

abstract predict_proba(data_set)

Predicts the label distribution for each sample in the given dataset.


data_set (Dataset) – A dataset for which the labels are to be predicted.
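The three abstract methods above define the contract that every classifier must fulfil. The following is a minimal sketch of that contract using a stand-in base class rather than the library class itself; `MajorityClassifier` and the list-of-pairs dataset format are hypothetical and exist only for illustration.

```python
from abc import ABC, abstractmethod
from collections import Counter


class Classifier(ABC):
    """Stand-in for the abstract interface described above (not the library class)."""

    @abstractmethod
    def fit(self, train_set, weights=None): ...

    @abstractmethod
    def predict(self, data_set, return_proba=False): ...

    @abstractmethod
    def predict_proba(self, data_set): ...


class MajorityClassifier(Classifier):
    """Hypothetical toy classifier: always predicts the heaviest training label."""

    def __init__(self, num_classes):
        self.num_classes = num_classes
        self.majority = None

    def fit(self, train_set, weights=None):
        # train_set is assumed to be a list of (text, label) pairs.
        if weights is None:
            weights = [1.0] * len(train_set)
        totals = Counter()
        for (_, label), w in zip(train_set, weights):
            totals[label] += w  # sample weights scale each instance's vote
        self.majority = max(totals, key=totals.get)
        return self

    def predict_proba(self, data_set):
        # One probability-like row per instance.
        row = [1.0 if c == self.majority else 0.0 for c in range(self.num_classes)]
        return [list(row) for _ in data_set]

    def predict(self, data_set, return_proba=False):
        proba = self.predict_proba(data_set)
        predictions = [max(range(self.num_classes), key=p.__getitem__) for p in proba]
        # With return_proba=True the distribution is returned alongside the labels.
        return (predictions, proba) if return_proba else predictions


clf = MajorityClassifier(num_classes=2).fit([("a", 1), ("b", 1), ("c", 0)])
labels, proba = clf.predict(["x", "y"], return_proba=True)
```

Returning `self` from `fit` in this sketch mirrors what the concrete classifiers in this reference document; the abstract method itself does not state a return value.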

class small_text.classifiers.classification.SklearnClassifier(model, num_classes, multi_label=False)[source]

An adapter for using scikit-learn estimators.


The multi-label setting currently assumes that the underlying classifier returns a sparse matrix if trained on sparse data.

__init__(model, num_classes, multi_label=False)
  • model (sklearn.base.BaseEstimator) – A scikit-learn estimator that implements fit and predict_proba.

  • num_classes (int) – Number of classes which are to be trained and predicted.

  • multi_label (bool, default=False) – If False, the classes are mutually exclusive, i.e. the prediction step results in exactly one predicted label per instance.

fit(train_set, weights=None)

Trains the model using the given train set.

  • train_set (SklearnDataset) – The dataset used for training the model.

  • weights (np.ndarray[np.float32] or None, default=None) – Sample weights or None.


clf (SklearnClassifier) – Returns the current classifier with a fitted model.


predict(data_set, return_proba=False)

Predicts the labels for the given dataset.

  • data_set (SklearnDataset) – A dataset for which the labels are to be predicted.

  • return_proba (bool, default=False) – If True, also returns a probability-like class distribution.


  • predictions (np.ndarray[np.int32] or csr_matrix[np.int32]) – List of predictions if the classifier was fitted on single-label data, otherwise a sparse matrix of predictions.

  • probas (np.ndarray[np.float32]) – List of probabilities (or confidence estimates) if return_proba is True.

predict_proba(data_set)


Predicts the label distribution for each sample in the given dataset.


data_set (SklearnDataset) – A dataset for which the labels are to be predicted.
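The difference between the single-label and the multi-label setting can be illustrated with a small sketch that converts class scores into predictions. This is not the adapter's code: the function name and the `threshold` cut-off are assumptions made for this example, and plain lists stand in for the sparse matrix that is returned in the multi-label case.

```python
def to_predictions(proba, multi_label=False, threshold=0.5):
    """Turn per-class scores into predictions.

    threshold is a hypothetical cut-off used only for this illustration.
    """
    if multi_label:
        # Zero or more labels per instance; a real implementation would return
        # these rows as a sparse indicator matrix instead of nested lists.
        return [[c for c, p in enumerate(row) if p >= threshold] for row in proba]
    # Exactly one label per instance: the class with the highest score.
    return [max(range(len(row)), key=row.__getitem__) for row in proba]


proba = [[0.7, 0.2, 0.6], [0.1, 0.3, 0.2]]
single = to_predictions(proba)                    # one label per row
multi = to_predictions(proba, multi_label=True)   # possibly empty label sets
```

Note that the second row receives no label at all in the multi-label case, which cannot happen when classes are mutually exclusive.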

Pytorch Integration

class small_text.integrations.pytorch.classifiers.KimCNNClassifier(num_classes, multi_label=False, embedding_matrix=None, device=None, num_epochs=10, mini_batch_size=25, lr=0.001, max_seq_len=60, out_channels=100, filter_padding=0, dropout=0.5, validation_set_size=0.1, padding_idx=0, kernel_heights=[3, 4, 5], early_stopping=5, early_stopping_acc=-1, class_weight=None, verbosity=VERBOSITY_MORE_VERBOSE)[source]
__init__(num_classes, multi_label=False, embedding_matrix=None, device=None, num_epochs=10, mini_batch_size=25, lr=0.001, max_seq_len=60, out_channels=100, filter_padding=0, dropout=0.5, validation_set_size=0.1, padding_idx=0, kernel_heights=[3, 4, 5], early_stopping=5, early_stopping_acc=-1, class_weight=None, verbosity=VERBOSITY_MORE_VERBOSE)

  • num_classes (int) – Number of classes.

  • multi_label (bool, default=False) – If False, the classes are mutually exclusive, i.e. the prediction step results in exactly one predicted label per instance.

  • embedding_matrix (default=None) – A tensor of embeddings in the shape of (vocab_size, embedding_size).

  • device (str or torch.device, default=None) – Torch device on which the computation will be performed.

  • num_epochs (int, default=10) – Epochs to train.

  • mini_batch_size (int, default=25) – Size of mini batches during training.

  • lr (float, default=0.001) – Learning rate.

  • max_seq_len (int, default=60) – Maximum sequence length.

  • out_channels (int, default=100) – Number of output channels.

  • filter_padding (int, default=0) – Size of the padding to add before and after the sequence before applying the filters.

  • dropout (float, default=0.5) – Dropout probability for the final layer in KimCNN.

  • validation_set_size (float, default=0.1) – The size of the validation set as a fraction of the training set.

  • padding_idx (int, default=0) – Index of the padding token (as given by the vocab).

  • kernel_heights (list of int, default=[3, 4, 5]) – Kernel sizes.

  • early_stopping (int, default=5) –

    Number of epochs with no improvement in validation loss until early stopping is triggered.

    Deprecated since version 1.1.0: Use the early_stopping kwarg in fit() instead.

  • early_stopping_acc (float, default=-1) –

    Accuracy threshold in the interval (0, 1] which triggers early stopping.

    Deprecated since version 1.1.0: Use the early_stopping kwarg in fit() instead.

  • class_weight ('balanced' or None, default=None) – If ‘balanced’, then the loss function is weighted inversely proportional to the label distribution of the current train set.
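To make the ‘balanced’ option more concrete, the sketch below derives one weight per class that is inversely proportional to the class frequency. It uses the `n_samples / (num_classes * count)` heuristic known from scikit-learn; whether this classifier applies exactly that formula is an assumption, and the function name is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels, num_classes):
    """Weights inversely proportional to class frequency.

    Uses the n_samples / (num_classes * count) heuristic; whether the
    library uses this exact formula is an assumption.
    """
    counts = Counter(labels)
    n = len(labels)
    # Classes absent from the train set get weight 0.0 here (an arbitrary choice).
    return [n / (num_classes * counts[c]) if counts[c] else 0.0
            for c in range(num_classes)]


# Class 0 occurs three times, class 1 only once, so class 1 is weighted higher.
weights = balanced_class_weights([0, 0, 0, 1], num_classes=2)
```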

fit(train_set, validation_set=None, weights=None, early_stopping=None, model_selection=None, optimizer=None, scheduler=None)

Trains the model using the given train set.

  • train_set (PytorchTextClassificationDataset) – The dataset used for training the model.

  • validation_set (PytorchTextClassificationDataset) – A validation set used for validation during training, or None. If None, the fit operation will split apart a subset of the train set as a validation set, whose size is set by self.validation_set_size.

  • weights (np.ndarray[np.float32] or None, default=None) – Sample weights or None.

  • early_stopping (EarlyStoppingHandler or 'none') – A strategy for early stopping. Passing ‘none’ disables early stopping.

  • model_selection (ModelSelectionHandler or 'none') – A model selection handler. Passing ‘none’ disables model selection.

  • optimizer (torch.optim.optimizer.Optimizer) – A pytorch optimizer.

  • scheduler (torch.optim._LRScheduler) – A pytorch scheduler.


self (KimCNNClassifier) – Returns the current classifier with a fitted model.
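When validation_set is None, fit holds out a part of the train set whose size is given by validation_set_size. A minimal sketch of such a split follows; the random (non-stratified) sampling, the `seed` parameter, and the function name are assumptions for illustration and may differ from what the library does internally.

```python
import random


def split_validation(train_set, validation_set_size=0.1, seed=0):
    """Hold out a fraction of the train set for validation (random split assumed)."""
    indices = list(range(len(train_set)))
    random.Random(seed).shuffle(indices)
    # Keep at least one validation instance, even for very small train sets.
    n_valid = max(1, int(len(indices) * validation_set_size))
    valid_idx = set(indices[:n_valid])
    train = [x for i, x in enumerate(train_set) if i not in valid_idx]
    valid = [x for i, x in enumerate(train_set) if i in valid_idx]
    return train, valid


train, valid = split_validation(list(range(50)), validation_set_size=0.1)
```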



Obtains validation scores (loss, accuracy) for the given validation set.


validation_set (PytorchTextClassificationDataset) – Validation set.


  • validation_loss (float) – Validation loss.

  • validation_acc (float) – Validation accuracy.

predict(dataset, return_proba=False)

Predicts the labels for the given dataset.

  • dataset (PytorchTextClassificationDataset) – A dataset on whose instances predictions are made.

  • return_proba (bool) – If True, additionally returns the confidence distribution over all classes.


  • predictions (np.ndarray[np.int32] or csr_matrix[np.int32]) – List of predictions if the classifier was fitted on single-label data, otherwise a sparse matrix of predictions.

  • probas (np.ndarray[np.float32] (optional)) – List of probabilities (or confidence estimates) if return_proba is True.

predict_proba(dataset, dropout_sampling=1)

Predicts the label distributions.

  • dataset (PytorchTextClassificationDataset) – A dataset whose labels will be predicted.

  • dropout_sampling (int) – If dropout_sampling > 1 then all dropout modules will be enabled during prediction and multiple rounds of predictions will be sampled for each instance.


scores (np.ndarray) – Distribution of confidence scores over all classes of shape (num_samples, num_classes). If dropout_sampling > 1 then the shape is (num_samples, dropout_sampling, num_classes).
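The two output shapes can be pictured with the following sketch. It does not run a model: random scores stand in for the network output, and nested lists stand in for arrays. The averaging step at the end is one common way to aggregate the sampled rounds and is not something the method does for you.

```python
import random


def predict_proba_sketch(num_samples, num_classes, dropout_sampling=1, seed=0):
    """Mimic the output shapes described above with random stand-in scores."""
    rng = random.Random(seed)

    def one_distribution():
        raw = [rng.random() + 1e-9 for _ in range(num_classes)]
        total = sum(raw)
        return [r / total for r in raw]

    if dropout_sampling <= 1:
        # Shape: (num_samples, num_classes)
        return [one_distribution() for _ in range(num_samples)]
    # Shape: (num_samples, dropout_sampling, num_classes)
    return [[one_distribution() for _ in range(dropout_sampling)]
            for _ in range(num_samples)]


def mean_over_samples(scores):
    """Average the sampled distributions per instance."""
    return [[sum(col) / len(col) for col in zip(*rounds)] for rounds in scores]


flat = predict_proba_sketch(4, 3)
sampled = predict_proba_sketch(4, 3, dropout_sampling=5)
averaged = mean_over_samples(sampled)  # back to (num_samples, num_classes)
```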


Transformers Integration

class small_text.integrations.transformers.classifiers.TransformerBasedClassification(transformer_model: TransformerModelArguments, num_classes: int, multi_label: bool = False, num_epochs: int = 10, lr: float = 2e-05, mini_batch_size: int = 12, validation_set_size: float = 0.1, validations_per_epoch: int = 1, early_stopping_no_improvement: int = 5, early_stopping_acc: float = -1, model_selection: bool = True, fine_tuning_arguments=None, device=None, memory_fix=1, class_weight=None, verbosity=VERBOSITY_MORE_VERBOSE, cache_dir='.active_learning_lib_cache/')[source]
__init__(transformer_model: TransformerModelArguments, num_classes: int, multi_label: bool = False, num_epochs: int = 10, lr: float = 2e-05, mini_batch_size: int = 12, validation_set_size: float = 0.1, validations_per_epoch: int = 1, early_stopping_no_improvement: int = 5, early_stopping_acc: float = -1, model_selection: bool = True, fine_tuning_arguments=None, device=None, memory_fix=1, class_weight=None, verbosity=VERBOSITY_MORE_VERBOSE, cache_dir='.active_learning_lib_cache/')
  • transformer_model (TransformerModelArguments) – Settings for transformer model, tokenizer and config.

  • num_classes (int) – Number of classes.

  • multi_label (bool, default=False) – If False, the classes are mutually exclusive, i.e. the prediction step results in exactly one predicted label per instance.

  • num_epochs (int, default=10) – Epochs to train.

  • lr (float, default=2e-5) – Learning rate.

  • mini_batch_size (int, default=12) – Size of mini batches during training.

  • validation_set_size (float, default=0.1) – The size of the validation set as a fraction of the training set.

  • validations_per_epoch (int, default=1) – Defines how often the validation set is evaluated during the training of a single epoch.

  • early_stopping_no_improvement (int, default=5) –

    Number of epochs with no improvement in validation loss until early stopping is triggered.

    Deprecated since version 1.1.0: Use the early_stopping kwarg in fit() instead.

  • early_stopping_acc (float, default=-1) –

    Accuracy threshold in the interval (0, 1] which triggers early stopping.

    Deprecated since version 1.1.0: Use the early_stopping kwarg in fit() instead.

  • model_selection (bool, default=True) – If True, model selection saves the model after each epoch. At the end of the training step, the model with the lowest validation error is selected.

  • fine_tuning_arguments (FineTuningArguments or None, default=None) – Fine tuning arguments.

  • device (str or torch.device, default=None) – Torch device on which the computation will be performed.

  • memory_fix (int, default=1) – If this value is greater than zero, the CUDA cache is emptied every memory_fix-many epochs to force unused GPU memory to be released.

  • class_weight ('balanced' or None, default=None) – If ‘balanced’, then the loss function is weighted inversely proportional to the label distribution of the current train set.
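The interplay between early stopping on "no improvement" and model selection by lowest validation error can be sketched as follows. The function and its `patience` parameter are hypothetical; they only illustrate the behavior described for early_stopping_no_improvement and model_selection and are not the library's handlers.

```python
def train_with_early_stopping(validation_losses, patience=5):
    """Return (best_epoch, stopped_epoch) for per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    stopped_epoch = None
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            # Model selection: remember the epoch with the lowest validation loss.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                # Early stopping: too many epochs without improvement.
                stopped_epoch = epoch
                break
    return best_epoch, stopped_epoch


losses = [0.9, 0.7, 0.8, 0.75, 0.71, 0.9]
best, stopped = train_with_early_stopping(losses, patience=3)
```

In this run, training stops at epoch 4 (zero-based) while the model from epoch 1 would be the one selected.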

fit(train_set, validation_set=None, weights=None, early_stopping=None, model_selection=None, optimizer=None, scheduler=None)

Trains the model using the given train set.

  • train_set (TransformersDataset) – Training set.

  • validation_set (TransformersDataset, default=None) – A validation set used for validation during training, or None. If None, the fit operation will split apart a subset of the train set as a validation set, whose size is set by self.validation_set_size.

  • weights (np.ndarray[np.float32] or None, default=None) – Sample weights or None.

  • early_stopping (EarlyStoppingHandler or 'none') – A strategy for early stopping. Passing ‘none’ disables early stopping.

  • model_selection (ModelSelectionHandler or 'none') – A model selection handler. Passing ‘none’ disables model selection.

  • optimizer (torch.optim.optimizer.Optimizer or None, default=None) – A pytorch optimizer.

  • scheduler (torch.optim._LRScheduler or None, default=None) – A pytorch scheduler.


self (TransformerBasedClassification) – Returns the current classifier with a fitted model.


predict(dataset, return_proba=False)

Predicts the labels for the given dataset.

  • dataset (TransformersDataset) – A dataset on whose instances predictions are made.

  • return_proba (bool, default=False) – If True, additionally returns the confidence distribution over all classes.


  • predictions (np.ndarray[np.int32] or csr_matrix[np.int32]) – List of predictions if the classifier was fitted on single-label data, otherwise a sparse matrix of predictions.

  • probas (np.ndarray[np.float32], optional) – List of probabilities (or confidence estimates) if return_proba is True.

predict_proba(dataset, dropout_sampling=1)

Predicts the label distributions.

  • dataset (TransformersDataset) – A dataset whose labels will be predicted.

  • dropout_sampling (int, default=1) – If dropout_sampling > 1 then all dropout modules will be enabled during prediction and multiple rounds of predictions will be sampled for each instance.


scores (np.ndarray) – Distribution of confidence scores over all classes of shape (num_samples, num_classes). If dropout_sampling > 1 then the shape is (num_samples, dropout_sampling, num_classes).


class small_text.integrations.transformers.classifiers.classification.TransformerModelArguments(model, tokenizer=None, config=None, model_loading_strategy: ModelLoadingStrategy = ModelLoadingStrategy.DEFAULT)[source]
__init__(model, tokenizer=None, config=None, model_loading_strategy: ModelLoadingStrategy = ModelLoadingStrategy.DEFAULT)
  • model (str) – Name of the transformer model. Will be passed into AutoModel.from_pretrained().

  • tokenizer (str, default=None) – Name of the tokenizer if deviating from the model name. Will be passed into AutoTokenizer.from_pretrained().

  • config (str, default=None) – Name of the config if deviating from the model name. Will be passed into AutoConfig.from_pretrained().

  • model_loading_strategy (ModelLoadingStrategy, default=ModelLoadingStrategy.DEFAULT) – Specifies if there should be attempts to download the model or if only local files should be used.
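The phrase "if deviating from the model name" implies that tokenizer and config fall back to the model name when they are not given. The sketch below shows that fallback with a hypothetical stand-in class; resolving the names in the constructor is an assumption, and "my-tokenizer" is a made-up name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelArgumentsSketch:
    """Hypothetical stand-in; not the library class."""
    model: str
    tokenizer: Optional[str] = None
    config: Optional[str] = None

    def __post_init__(self):
        # Fall back to the model name when no separate name is given.
        if self.tokenizer is None:
            self.tokenizer = self.model
        if self.config is None:
            self.config = self.model


args = ModelArgumentsSketch("bert-base-uncased")
custom = ModelArgumentsSketch("bert-base-uncased", tokenizer="my-tokenizer")
```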

class small_text.integrations.transformers.classifiers.classification.FineTuningArguments(base_lr, layerwise_gradient_decay, gradual_unfreezing=-1, cut_fraction=0.1)[source]

Arguments to enable and configure gradual unfreezing and discriminative learning rates as used in Universal Language Model Fine-tuning (ULMFiT) [HR18].

__init__(base_lr, layerwise_gradient_decay, gradual_unfreezing=-1, cut_fraction=0.1)
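Discriminative learning rates assign smaller learning rates to lower layers. The sketch below shows one plausible reading of base_lr and layerwise_gradient_decay, in which the top layer trains at base_lr and each layer below is scaled by one more factor of the decay. This interpretation and the function name are assumptions; the exact schedule used by the library is not stated above.

```python
def layerwise_learning_rates(num_layers, base_lr, layerwise_gradient_decay):
    """Learning rate per layer, index 0 = lowest layer.

    Assumption: the top layer trains at base_lr and each layer below it is
    scaled by one more factor of layerwise_gradient_decay.
    """
    return [base_lr * layerwise_gradient_decay ** (num_layers - 1 - layer)
            for layer in range(num_layers)]


# A decay of 0.5 is exaggerated here to make the effect visible.
rates = layerwise_learning_rates(num_layers=4, base_lr=2e-5,
                                 layerwise_gradient_decay=0.5)
```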
class small_text.integrations.transformers.classifiers.setfit.SetFitClassification(setfit_model_args, num_classes, multi_label=False, max_seq_len=512, use_differentiable_head=False, mini_batch_size=32, model_kwargs=dict(), trainer_kwargs=dict(), device=None)[source]

A classifier that operates through Sentence Transformer Finetuning (SetFit, [TRE+22]).

This class is a wrapper which encapsulates the Hugging Face SetFit implementation.


This strategy requires the optional dependency setfit.

New in version 1.2.0.

__init__(setfit_model_args, num_classes, multi_label=False, max_seq_len=512, use_differentiable_head=False, mini_batch_size=32, model_kwargs=dict(), trainer_kwargs=dict(), device=None)

  • setfit_model_args (SetFitModelArguments) – Settings for the sentence transformer model to be used.

  • num_classes (int) – Number of classes.

  • multi_label (bool, default=False) – If False, the classes are mutually exclusive, i.e. the prediction step results in exactly one predicted label per instance.

  • use_differentiable_head (bool, default=False) – Uses a differentiable head instead of a logistic regression for the classification head. Corresponds to the keyword argument with the same name in SetFitModel.from_pretrained().

  • model_kwargs (dict) – Keyword arguments used for the SetFit model. The keyword use_differentiable_head is excluded and managed by this class. The other keywords are directly passed to SetFitModel.from_pretrained().

  • trainer_kwargs (dict) – Keyword arguments used for the SetFit trainer. The keyword batch_size is excluded and is instead controlled by the keyword mini_batch_size of this class. The other keywords are directly passed to SetFitTrainer.__init__().

  • device (str or torch.device, default=None) – Torch device on which the computation will be performed.

fit(train_set, validation_set=None, setfit_train_kwargs=dict())

Trains the model using the given train set.

  • train_set (TextDataset) – A dataset used for training the model.

  • validation_set (TextDataset or None, default=None) – A dataset used for validation during training.

  • setfit_train_kwargs (dict) – Additional keyword arguments that are passed to SetFitTrainer.train().


self (SetFitClassification) – Returns the current classifier with a fitted model.


predict(dataset, return_proba=False)

Predicts the labels for the given dataset.

  • dataset (TextDataset) – A dataset on whose instances predictions are made.

  • return_proba (bool, default=False) – If True, additionally returns the confidence distribution over all classes.


  • predictions (np.ndarray[np.int32] or csr_matrix[np.int32]) – List of predictions if the classifier was fitted on single-label data, otherwise a sparse matrix of predictions.

  • probas (np.ndarray[np.float32], optional) – List of probabilities (or confidence estimates) if return_proba is True.

predict_proba(dataset, dropout_sampling=1)

Predicts the label distributions.

  • dataset (TextDataset) – A dataset whose labels will be predicted.

  • dropout_sampling (int, default=1) – If dropout_sampling > 1 then all dropout modules will be enabled during prediction and multiple rounds of predictions will be sampled for each instance.


  • scores (np.ndarray) – Distribution of confidence scores over all classes of shape (num_samples, num_classes). If dropout_sampling > 1 then the shape is (num_samples, dropout_sampling, num_classes).

Warning: This function is not thread-safe if dropout_sampling > 1, since the underlying model gets temporarily modified.

class small_text.integrations.transformers.classifiers.setfit.SetFitModelArguments(sentence_transformer_model: str, model_loading_strategy: ModelLoadingStrategy = ModelLoadingStrategy.DEFAULT)[source]

New in version 1.2.0.

__init__(sentence_transformer_model: str, model_loading_strategy: ModelLoadingStrategy = ModelLoadingStrategy.DEFAULT)
  • sentence_transformer_model (str) – Name of a sentence transformer model.

  • model_loading_strategy (ModelLoadingStrategy, default=ModelLoadingStrategy.DEFAULT) – Specifies if there should be attempts to download the model or if only local files should be used.



class small_text.classifiers.factories.SklearnClassifierFactory(base_estimator, num_classes, kwargs={})[source]
__init__(base_estimator, num_classes, kwargs={})

  • base_estimator (sklearn.base.BaseEstimator) – A scikit-learn estimator which is used as a template for creating new classifier objects.

  • num_classes (int) – Number of classes.

  • kwargs (dict) – Keyword arguments that are passed to the constructor of each classifier that is built by the factory.


Creates a new SklearnClassifier instance.


classifier (SklearnClassifier) – A new instance of SklearnClassifier which is initialized with the given keyword args kwargs.
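The factory idea is that a template estimator and a set of keyword arguments are stored once, and every call produces a fresh, independent classifier. The following sketch uses hypothetical stub classes in place of scikit-learn and the library; the method name `new` is an assumption, since the creating method's signature is not shown above.

```python
import copy


class EstimatorStub:
    """Hypothetical stand-in for a scikit-learn estimator."""
    def __init__(self):
        self.fitted = False


class ClassifierStub:
    """Hypothetical stand-in for the classifier built by the factory."""
    def __init__(self, model, num_classes, multi_label=False):
        self.model = model
        self.num_classes = num_classes
        self.multi_label = multi_label


class FactorySketch:
    def __init__(self, base_estimator, num_classes, kwargs=None):
        self.base_estimator = base_estimator
        self.num_classes = num_classes
        self.kwargs = dict(kwargs or {})

    def new(self):
        # Copy the template so that classifiers never share fitted state.
        return ClassifierStub(copy.deepcopy(self.base_estimator),
                              self.num_classes, **self.kwargs)


factory = FactorySketch(EstimatorStub(), num_classes=3,
                        kwargs={"multi_label": True})
first, second = factory.new(), factory.new()
```

Copying the template matters in active learning, where a new classifier is typically trained in each iteration and must not inherit state from the previous one.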


Pytorch Integration

class small_text.integrations.pytorch.classifiers.factories.KimCNNFactory(classifier_name, num_classes, kwargs={})[source]
__init__(classifier_name, num_classes, kwargs={})

  • classifier_name (str) – Obsolete. Do not use any more.

  • num_classes (int) – Number of classes.

  • kwargs (dict) – Keyword arguments that are passed to the constructor of each classifier that is built by the factory.


Creates a new KimCNNClassifier instance.


classifier (KimCNNClassifier) – A new instance of KimCNNClassifier which is initialized with the given keyword args kwargs.


Transformers Integration

class small_text.integrations.transformers.classifiers.factories.TransformerBasedClassificationFactory(transformer_model_args, num_classes, kwargs={})[source]
__init__(transformer_model_args, num_classes, kwargs={})
  • transformer_model_args (TransformerModelArguments) – Settings for transformer model, tokenizer and config.

  • num_classes (int) – Number of classes.

  • kwargs (dict) – Keyword arguments which will be passed to TransformerBasedClassification.


Creates a new TransformerBasedClassification instance.


classifier (TransformerBasedClassification) – A new instance of TransformerBasedClassification which is initialized with the given keyword args kwargs.


class small_text.integrations.transformers.classifiers.factories.SetFitClassificationFactory(setfit_model_args, num_classes, classification_kwargs={})[source]

New in version 1.2.0.

__init__(setfit_model_args, num_classes, classification_kwargs={})
  • setfit_model_args (SetFitModelArguments) – Settings for the sentence transformer model to be used.

  • num_classes (int) – Number of classes.

  • classification_kwargs (dict) – Keyword arguments which will be passed to SetFitClassification.


Creates a new SetFitClassification instance.


classifier (SetFitClassification) – A new instance of SetFitClassification which is initialized with the given keyword args classification_kwargs.
