Classifier API


Classifiers

Core

class small_text.classifiers.classification.Classifier

Abstract base class for classifiers that can be used with the active learning components.

abstract fit(train_set, weights=None)

Trains the model using the given train set.

Parameters
  • train_set (Dataset) – The dataset used for training the model.

  • weights (np.ndarray[np.float32] or None, default=None) – Sample weights or None.

abstract predict(data_set, return_proba=False)

Predicts the labels for each sample in the given dataset.

Parameters
  • data_set (Dataset) – A dataset for which the labels are to be predicted.

  • return_proba (bool, default=False) – If True, also returns a probability-like class distribution.

abstract predict_proba(data_set)

Predicts the label distribution for each sample in the given dataset.

Parameters

data_set (Dataset) – A dataset for which the labels are to be predicted.
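
The contract described by the three abstract methods can be sketched without the library. In the block below, ClassifierSketch is a local stand-in that only mirrors the documented method names and parameters; it is not the library class. MajorityClassifier and the list of (text, label) pairs used as a dataset are hypothetical and exist only for illustration.

```python
from abc import ABC, abstractmethod
from collections import Counter


class ClassifierSketch(ABC):
    """Local stand-in mirroring the three documented abstract methods."""

    @abstractmethod
    def fit(self, train_set, weights=None): ...

    @abstractmethod
    def predict(self, data_set, return_proba=False): ...

    @abstractmethod
    def predict_proba(self, data_set): ...


class MajorityClassifier(ClassifierSketch):
    """Hypothetical classifier that always predicts the heaviest training label."""

    def __init__(self, num_classes):
        self.num_classes = num_classes
        self.majority = None

    def fit(self, train_set, weights=None):
        # Assumption: train_set is a list of (text, label) pairs.
        if weights is None:
            weights = [1.0] * len(train_set)
        totals = Counter()
        for (_, label), weight in zip(train_set, weights):
            totals[label] += weight
        self.majority = max(totals, key=totals.get)
        return self

    def predict_proba(self, data_set):
        # One row per sample, one column per class.
        row = [0.0] * self.num_classes
        row[self.majority] = 1.0
        return [list(row) for _ in data_set]

    def predict(self, data_set, return_proba=False):
        proba = self.predict_proba(data_set)
        predictions = [max(range(self.num_classes), key=p.__getitem__) for p in proba]
        return (predictions, proba) if return_proba else predictions


train = [("good", 1), ("great", 1), ("bad", 0)]
clf = MajorityClassifier(num_classes=2).fit(train)
labels, probas = clf.predict([("fine", None)], return_proba=True)
```

With return_proba=True the sketch returns a (predictions, probabilities) pair, which matches the "also returns" wording of the parameter description.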

class small_text.classifiers.classification.SklearnClassifier(model, num_classes, multi_label=False)

An adapter for using scikit-learn estimators.

Notes

The multi-label setting currently assumes that the underlying classifier returns a sparse matrix if trained on sparse data.

__init__(model, num_classes, multi_label=False)
Parameters
  • model (sklearn.base.BaseEstimator) – A scikit-learn estimator that implements fit and predict_proba.

  • num_classes (int) – Number of classes which are to be trained and predicted.

  • multi_label (bool, default=False) – If False, the classes are mutually exclusive, i.e. the prediction step results in exactly one predicted label per instance.

fit(train_set, weights=None)

Trains the model using the given train set.

Parameters
  • train_set (SklearnDataset) – The dataset used for training the model.

  • weights (np.ndarray[np.float32] or None, default=None) – Sample weights or None.

Returns

clf – Returns the current classifier with a fitted model.

Return type

SklearnClassifier

predict(data_set, return_proba=False)

Predicts the labels for the given dataset.

Parameters
  • data_set (SklearnDataset) – A dataset for which the labels are to be predicted.

  • return_proba (bool, default=False) – If True, also returns a probability-like class distribution.

Returns

  • predictions (np.ndarray[np.int32] or csr_matrix[np.int32]) – List of predictions if the classifier was fitted on single-label data, otherwise a sparse matrix of predictions.

  • probas (np.ndarray[np.float32]) – List of probabilities (or confidence estimates) if return_proba is True.
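
The difference between the single-label and the multi-label case can be illustrated with plain lists. The helper to_predictions below is hypothetical and not part of the library; the 0.5 threshold is an assumption. The multi-label result is shown as lists of label indices rather than the sparse matrix the library returns.

```python
def to_predictions(probas, multi_label=False, threshold=0.5):
    """Turn per-class scores into label predictions (illustrative only)."""
    if multi_label:
        # Any number of labels per instance: keep every class above the threshold.
        return [[i for i, p in enumerate(row) if p >= threshold] for row in probas]
    # Exactly one label per instance: take the highest-scoring class.
    return [max(range(len(row)), key=row.__getitem__) for row in probas]


scores = [[0.7, 0.2, 0.6], [0.1, 0.8, 0.3]]
single = to_predictions(scores)
multi = to_predictions(scores, multi_label=True)
```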

predict_proba(data_set)

Predicts the label distribution for each sample in the given dataset.

Parameters

data_set (SklearnDataset) – A dataset for which the labels are to be predicted.

Pytorch Integration

class small_text.integrations.pytorch.classifiers.KimCNNClassifier(num_classes, multi_label=False, embedding_matrix=None, device=None, num_epochs=10, mini_batch_size=25, lr=0.001, max_seq_len=60, out_channels=100, filter_padding=0, dropout=0.5, validation_set_size=0.1, padding_idx=0, kernel_heights=[3, 4, 5], early_stopping=5, early_stopping_acc=-1, class_weight=None, verbosity=VERBOSITY_MORE_VERBOSE)
__init__(num_classes, multi_label=False, embedding_matrix=None, device=None, num_epochs=10, mini_batch_size=25, lr=0.001, max_seq_len=60, out_channels=100, filter_padding=0, dropout=0.5, validation_set_size=0.1, padding_idx=0, kernel_heights=[3, 4, 5], early_stopping=5, early_stopping_acc=-1, class_weight=None, verbosity=VERBOSITY_MORE_VERBOSE)
Parameters
  • num_classes (int) – Number of classes.

  • multi_label (bool, default=False) – If False, the classes are mutually exclusive, i.e. the prediction step results in exactly one predicted label per instance.

  • embedding_matrix (torch.FloatTensor) – A tensor of embeddings in the shape of (vocab_size, embedding_size).

  • device (str or torch.device, default=None) – Torch device on which the computation will be performed.

  • num_epochs (int, default=10) – Epochs to train.

  • mini_batch_size (int, default=25) – Size of mini batches during training.

  • lr (float, default=0.001) – Learning rate.

  • max_seq_len (int, default=60) – Maximum sequence length.

  • out_channels (int, default=100) – Number of output channels.

  • filter_padding (int, default=0) – Size of the padding to add before and after the sequence before applying the filters.

  • dropout (float, default=0.5) – Dropout probability for the final layer in KimCNN.

  • validation_set_size (float, default=0.1) – The size of the validation set as a fraction of the training set.

  • padding_idx (int, default=0) – Index of the padding token (as given by the vocab).

  • kernel_heights (list of int, default=[3, 4, 5]) – Kernel sizes.

  • early_stopping (int, default=5) –

    Number of epochs with no improvement in validation loss until early stopping is triggered.

    Deprecated since version 1.1.0: Use the early_stopping kwarg in fit() instead.

  • early_stopping_acc (float, default=-1) –

    Accuracy threshold in the interval (0, 1] which triggers early stopping.

    Deprecated since version 1.1.0: Use the early_stopping kwarg in fit() instead.

  • class_weight ('balanced' or None, default=None) – If ‘balanced’, then the loss function is weighted inversely proportional to the label distribution of the current train set.
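
To make "weighted inversely proportional to the label distribution" concrete, the sketch below computes one weight per class. The helper balanced_class_weights is hypothetical, and the formula n_samples / (num_classes * class_count) is the heuristic used by scikit-learn's 'balanced' mode. That this classifier uses the same formula is an assumption.

```python
from collections import Counter


def balanced_class_weights(labels, num_classes):
    """Return one weight per class; rarer classes receive larger weights."""
    counts = Counter(labels)
    n_samples = len(labels)
    # Assumption: scikit-learn style heuristic. Classes absent from the
    # train set get weight 0.0 here to avoid a division by zero.
    return [
        n_samples / (num_classes * counts[c]) if counts[c] else 0.0
        for c in range(num_classes)
    ]


class_weights = balanced_class_weights([0, 0, 0, 1], num_classes=2)
```

In this toy train set class 1 occurs a third as often as class 0, so its weight is three times as large.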

fit(train_set, validation_set=None, weights=None, early_stopping=None, model_selection=None, optimizer=None, scheduler=None)

Trains the model using the given train set.

Parameters
  • train_set (PytorchTextClassificationDataset) – The dataset used for training the model.

  • validation_set (PytorchTextClassificationDataset) – A validation set used for validation during training, or None. If None, the fit operation will split apart a subset of the train set as a validation set, whose size is set by self.validation_set_size.

  • weights (np.ndarray[np.float32] or None, default=None) – Sample weights or None.

  • early_stopping (EarlyStoppingHandler or 'none') – A strategy for early stopping. Passing ‘none’ disables early stopping.

  • model_selection (ModelSelectionHandler or 'none') – A model selection handler. Passing ‘none’ disables model selection.

  • optimizer (torch.optim.optimizer.Optimizer) – A pytorch optimizer.

  • scheduler (torch.optim.lr_scheduler._LRScheduler) – A pytorch scheduler.

Returns

self – Returns the current classifier with a fitted model.

Return type

KimCNNClassifier
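
When validation_set is None, part of the train set is held out according to validation_set_size. The sketch below shows one way such a split over sample indices could look. The helper split_validation, the seed argument and the plain random (non-stratified) shuffle are assumptions; the library may split differently.

```python
import random


def split_validation(num_samples, validation_set_size=0.1, seed=0):
    """Split sample indices into (train_indices, validation_indices)."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    # validation_set_size is a fraction of the train set.
    num_validation = int(num_samples * validation_set_size)
    return indices[num_validation:], indices[:num_validation]


train_indices, validation_indices = split_validation(100, validation_set_size=0.1)
```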

validate(validation_set)

Obtains validation scores (loss, accuracy) for the given validation set.

Parameters

validation_set (PytorchTextClassificationDataset) – Validation set.

Returns

  • validation_loss (float) – Validation loss.

  • validation_acc (float) – Validation accuracy.

predict(dataset, return_proba=False)

Predicts the labels for the given dataset.

Parameters
  • dataset (PytorchTextClassificationDataset) – A dataset on whose instances predictions are made.

  • return_proba (bool) – If True, additionally returns the confidence distribution over all classes.

Returns

  • predictions (np.ndarray[np.int32] or csr_matrix[np.int32]) – List of predictions if the classifier was fitted on single-label data, otherwise a sparse matrix of predictions.

  • probas (np.ndarray[np.float32] (optional)) – List of probabilities (or confidence estimates) if return_proba is True.

predict_proba(dataset, dropout_sampling=1)

Predicts the label distributions.

Parameters
  • dataset (PytorchTextClassificationDataset) – A dataset whose labels will be predicted.

  • dropout_sampling (int) – If dropout_sampling > 1 then all dropout modules will be enabled during prediction and multiple rounds of predictions will be sampled for each instance.

Returns

scores – Distribution of confidence scores over all classes of shape (num_samples, num_classes). If dropout_sampling > 1 then the shape is (num_samples, dropout_sampling, num_classes).

Return type

np.ndarray
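
With dropout_sampling > 1 the result gains a middle axis holding one score vector per sampling round. A common way to use it is to average over that axis. The sketch below does this on nested lists standing in for the array; the helper mean_over_dropout_samples is hypothetical. With NumPy the same reduction would be scores.mean(axis=1).

```python
def mean_over_dropout_samples(scores):
    """Average scores shaped (num_samples, dropout_sampling, num_classes)
    down to (num_samples, num_classes)."""
    return [
        [sum(column) / len(column) for column in zip(*sample)]
        for sample in scores
    ]


# Two samples, two dropout rounds each, two classes.
sampled = [
    [[0.8, 0.2], [0.6, 0.4]],
    [[0.1, 0.9], [0.3, 0.7]],
]
averaged = mean_over_dropout_samples(sampled)
```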

Transformers Integration

class small_text.integrations.transformers.classifiers.TransformerBasedClassification(transformer_model: TransformerModelArguments, num_classes: int, multi_label: bool = False, num_epochs: int = 10, lr: float = 2e-5, mini_batch_size: int = 12, validation_set_size: float = 0.1, validations_per_epoch: int = 1, early_stopping_no_improvement: int = 5, early_stopping_acc: float = -1, model_selection: bool = True, fine_tuning_arguments=None, device=None, memory_fix=1, class_weight=None, verbosity=VERBOSITY_MORE_VERBOSE, cache_dir='.active_learning_lib_cache/')
__init__(transformer_model: TransformerModelArguments, num_classes: int, multi_label: bool = False, num_epochs: int = 10, lr: float = 2e-5, mini_batch_size: int = 12, validation_set_size: float = 0.1, validations_per_epoch: int = 1, early_stopping_no_improvement: int = 5, early_stopping_acc: float = -1, model_selection: bool = True, fine_tuning_arguments=None, device=None, memory_fix=1, class_weight=None, verbosity=VERBOSITY_MORE_VERBOSE, cache_dir='.active_learning_lib_cache/')
Parameters
  • transformer_model (TransformerModelArguments) – Settings for transformer model, tokenizer and config.

  • num_classes (int) – Number of classes.

  • multi_label (bool, default=False) – If False, the classes are mutually exclusive, i.e. the prediction step results in exactly one predicted label per instance.

  • num_epochs (int, default=10) – Epochs to train.

  • lr (float, default=2e-5) – Learning rate.

  • mini_batch_size (int, default=12) – Size of mini batches during training.

  • validation_set_size (float, default=0.1) – The size of the validation set as a fraction of the training set.

  • validations_per_epoch (int, default=1) – Defines how often the validation set is evaluated during the training of a single epoch.

  • early_stopping_no_improvement (int, default=5) –

    Number of epochs with no improvement in validation loss until early stopping is triggered.

    Deprecated since version 1.1.0: Use the early_stopping kwarg in fit() instead.

  • early_stopping_acc (float, default=-1) –

    Accuracy threshold in the interval (0, 1] which triggers early stopping.

    Deprecated since version 1.1.0: Use the early_stopping kwarg in fit() instead.

  • model_selection (bool, default=True) – If True, model selection saves the model after each epoch. At the end of the training step the model with the lowest validation error is selected.

  • fine_tuning_arguments (FineTuningArguments or None, default=None) – Fine tuning arguments.

  • device (str or torch.device, default=None) – Torch device on which the computation will be performed.

  • memory_fix (int, default=1) – If this value is greater than zero, every memory_fix-many epochs the cuda cache will be emptied to force unused GPU memory being released.

  • class_weight ('balanced' or None, default=None) – If ‘balanced’, then the loss function is weighted inversely proportional to the label distribution of the current train set.
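
The interplay of early_stopping_no_improvement and model_selection can be sketched on a list of per-epoch validation losses. Both helpers below are hypothetical and only mirror the parameter descriptions: training stops after a number of consecutive epochs without a new best loss, and the epoch with the lowest validation loss is the one whose model would be kept. The library's handlers may apply further criteria.

```python
def stopping_epoch(validation_losses, no_improvement=5):
    """Return the epoch index at which early stopping triggers, or None."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= no_improvement:
                return epoch
    return None


def best_epoch(validation_losses):
    """Return the epoch index with the lowest validation loss."""
    return min(range(len(validation_losses)), key=validation_losses.__getitem__)


losses = [1.0, 0.8, 0.9, 0.85, 0.81]
stopped_at = stopping_epoch(losses, no_improvement=3)
selected = best_epoch(losses)
```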

fit(train_set, validation_set=None, weights=None, early_stopping=None, model_selection=None, optimizer=None, scheduler=None)

Trains the model using the given train set.

Parameters
  • train_set (TransformersDataset) – Training set.

  • validation_set (TransformersDataset, default=None) – A validation set used for validation during training, or None. If None, the fit operation will split apart a subset of the train set as a validation set, whose size is set by self.validation_set_size.

  • weights (np.ndarray[np.float32] or None, default=None) – Sample weights or None.

  • early_stopping (EarlyStoppingHandler or 'none') – A strategy for early stopping. Passing ‘none’ disables early stopping.

  • model_selection (ModelSelectionHandler or 'none') – A model selection handler. Passing ‘none’ disables model selection.

  • optimizer (torch.optim.optimizer.Optimizer or None, default=None) – A pytorch optimizer.

  • scheduler (torch.optim.lr_scheduler._LRScheduler or None, default=None) – A pytorch scheduler.

Returns

self – Returns the current classifier with a fitted model.

Return type

TransformerBasedClassification

predict(dataset, return_proba=False)

Predicts the labels for the given dataset.

Parameters
  • dataset (TransformersDataset) – A dataset on whose instances predictions are made.

  • return_proba (bool, default=False) – If True, additionally returns the confidence distribution over all classes.

Returns

  • predictions (np.ndarray[np.int32] or csr_matrix[np.int32]) – List of predictions if the classifier was fitted on single-label data, otherwise a sparse matrix of predictions.

  • probas (np.ndarray[np.float32], optional) – List of probabilities (or confidence estimates) if return_proba is True.

predict_proba(dataset, dropout_sampling=1)

Predicts the label distributions.

Parameters
  • dataset (TransformersDataset) – A dataset whose labels will be predicted.

  • dropout_sampling (int, default=1) – If dropout_sampling > 1 then all dropout modules will be enabled during prediction and multiple rounds of predictions will be sampled for each instance.

Returns

scores – Distribution of confidence scores over all classes of shape (num_samples, num_classes). If dropout_sampling > 1 then the shape is (num_samples, dropout_sampling, num_classes).

Return type

np.ndarray

class small_text.integrations.transformers.classifiers.classification.TransformerModelArguments(model, tokenizer=None, config=None, model_loading_strategy: ModelLoadingStrategy = ModelLoadingStrategy.DEFAULT)
__init__(model, tokenizer=None, config=None, model_loading_strategy: ModelLoadingStrategy = ModelLoadingStrategy.DEFAULT)
Parameters
  • model (str) – Name of the transformer model. Will be passed into AutoModel.from_pretrained().

  • tokenizer (str, default=None) – Name of the tokenizer if deviating from the model name. Will be passed into AutoTokenizer.from_pretrained().

  • config (str, default=None) – Name of the config if deviating from the model name. Will be passed into AutoConfig.from_pretrained().

  • model_loading_strategy (ModelLoadingStrategy, default=ModelLoadingStrategy.DEFAULT) – Specifies if there should be attempts to download the model or if only local files should be used.

class small_text.integrations.transformers.classifiers.classification.FineTuningArguments(base_lr, layerwise_gradient_decay, gradual_unfreezing=-1, cut_fraction=0.1)

Arguments to enable and configure gradual unfreezing and discriminative learning rates as used in Universal Language Model Fine-tuning (ULMFiT) [HR18].

__init__(base_lr, layerwise_gradient_decay, gradual_unfreezing=-1, cut_fraction=0.1)
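
As a rough illustration of discriminative learning rates in the ULMFiT sense, the sketch below gives the top layer base_lr and shrinks the rate geometrically for each layer below it. The helper layerwise_learning_rates, its num_layers argument and the exact decay formula are assumptions for illustration. How the library combines base_lr and layerwise_gradient_decay internally is not stated in this reference.

```python
def layerwise_learning_rates(base_lr, layerwise_gradient_decay, num_layers):
    """Return one learning rate per layer, ordered from bottom to top layer."""
    # Assumption: the top layer trains at base_lr and every layer below it
    # is scaled by one more factor of layerwise_gradient_decay.
    return [
        base_lr * layerwise_gradient_decay ** (num_layers - 1 - layer)
        for layer in range(num_layers)
    ]


layer_lrs = layerwise_learning_rates(base_lr=2e-5, layerwise_gradient_decay=0.5, num_layers=3)
```

With a decay factor below 1, lower layers change more slowly than the layers close to the classification head.
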
class small_text.integrations.transformers.classifiers.setfit.SetFitClassification(setfit_model_args, num_classes, multi_label=False, max_seq_len=512, use_differentiable_head=False, mini_batch_size=32, model_kwargs=dict(), trainer_kwargs=dict(), device=None)

A classifier that operates through Sentence Transformer Finetuning (SetFit, [TRE+22]).

This class is a wrapper which encapsulates the Hugging Face SetFit implementation (https://github.com/huggingface/setfit).

Note

This strategy requires the optional dependency setfit.

New in version 1.2.0.

__init__(setfit_model_args, num_classes, multi_label=False, max_seq_len=512, use_differentiable_head=False, mini_batch_size=32, model_kwargs=dict(), trainer_kwargs=dict(), device=None)
Parameters
  • setfit_model_args (SetFitModelArguments) – Settings for the sentence transformer model to be used.

  • num_classes (int) – Number of classes.

  • multi_label (bool, default=False) – If False, the classes are mutually exclusive, i.e. the prediction step results in exactly one predicted label per instance.

  • use_differentiable_head (bool, default=False) – Uses a differentiable head instead of a logistic regression for the classification head. Corresponds to the keyword argument with the same name in SetFitModel.from_pretrained().

  • model_kwargs (dict) – Keyword arguments used for the SetFit model. The keyword use_differentiable_head is excluded and managed by this class. The other keywords are directly passed to SetFitModel.from_pretrained().

  • trainer_kwargs (dict) – Keyword arguments used for the SetFit trainer. The keyword batch_size is excluded and is instead controlled by the keyword mini_batch_size of this class. The other keywords are directly passed to SetFitTrainer.__init__().

  • device (str or torch.device, default=None) – Torch device on which the computation will be performed.

fit(train_set, validation_set=None, setfit_train_kwargs=dict())

Trains the model using the given train set.

Parameters
  • train_set (TextDataset) – A dataset used for training the model.

  • validation_set (TextDataset or None, default None) – A dataset used for validation during training.

  • setfit_train_kwargs (dict) – Additional keyword arguments that are passed to SetFitTrainer.train()

Returns

self – Returns the current classifier with a fitted model.

Return type

SetFitClassification

predict(dataset, return_proba=False)

Predicts the labels for the given dataset.

Parameters
  • dataset (TextDataset) – A dataset on whose instances predictions are made.

  • return_proba (bool, default=False) – If True, additionally returns the confidence distribution over all classes.

Returns

  • predictions (np.ndarray[np.int32] or csr_matrix[np.int32]) – List of predictions if the classifier was fitted on single-label data, otherwise a sparse matrix of predictions.

  • probas (np.ndarray[np.float32], optional) – List of probabilities (or confidence estimates) if return_proba is True.

predict_proba(dataset, dropout_sampling=1)

Predicts the label distributions.

Parameters
  • dataset (TextDataset) – A dataset whose labels will be predicted.

  • dropout_sampling (int, default=1) – If dropout_sampling > 1 then all dropout modules will be enabled during prediction and multiple rounds of predictions will be sampled for each instance.

Returns

  • scores (np.ndarray) – Distribution of confidence scores over all classes of shape (num_samples, num_classes). If dropout_sampling > 1 then the shape is (num_samples, dropout_sampling, num_classes).

Warning

This function is not thread-safe if dropout_sampling > 1, since the underlying model gets temporarily modified.

class small_text.integrations.transformers.classifiers.setfit.SetFitModelArguments(sentence_transformer_model: str, model_loading_strategy: ModelLoadingStrategy = ModelLoadingStrategy.DEFAULT)

New in version 1.2.0.

__init__(sentence_transformer_model: str, model_loading_strategy: ModelLoadingStrategy = ModelLoadingStrategy.DEFAULT)
Parameters
  • sentence_transformer_model (str) – Name of a sentence transformer model.

  • model_loading_strategy (ModelLoadingStrategy, default=ModelLoadingStrategy.DEFAULT) – Specifies if there should be attempts to download the model or if only local files should be used.

Factories

Core

class small_text.classifiers.factories.SklearnClassifierFactory(base_estimator, num_classes, kwargs={})
__init__(base_estimator, num_classes, kwargs={})
Parameters
  • base_estimator (BaseEstimator) – A scikit-learn estimator which is used as a template for creating new classifier objects.

  • num_classes (int) – Number of classes.

  • kwargs (dict) – Keyword arguments that are passed to the constructor of each classifier that is built by the factory.

new()

Creates a new SklearnClassifier instance.

Returns

classifier – A new instance of SklearnClassifier which is initialized with the given keyword args kwargs.

Return type

SklearnClassifier
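
The factory idea, a stored template plus keyword arguments that yield a fresh classifier on every new() call, can be sketched as follows. ClassifierFactorySketch and WrappedClassifierSketch are hypothetical stand-ins, not the library classes. Using copy.deepcopy to duplicate the template is an assumption about how an unfitted copy might be obtained.

```python
import copy


class WrappedClassifierSketch:
    """Hypothetical stand-in for the classifier that the factory builds."""

    def __init__(self, model, num_classes, **kwargs):
        self.model = model
        self.num_classes = num_classes
        self.kwargs = kwargs


class ClassifierFactorySketch:
    """Hypothetical factory that builds classifiers from a template."""

    def __init__(self, base_estimator, num_classes, kwargs=None):
        self.base_estimator = base_estimator
        self.num_classes = num_classes
        # Copy the dict so that factory and caller do not share mutable state.
        self.kwargs = dict(kwargs or {})

    def new(self):
        # Assumption: a deep copy stands in for cloning the template estimator.
        model = copy.deepcopy(self.base_estimator)
        return WrappedClassifierSketch(model, self.num_classes, **self.kwargs)


template = {"C": 1.0}  # any copyable object serves as a template in this sketch
factory = ClassifierFactorySketch(template, num_classes=3, kwargs={"multi_label": True})
first, second = factory.new(), factory.new()
```

Because each call copies the template, a classifier trained in one iteration does not leak fitted state into the next one.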

Pytorch Integration

class small_text.integrations.pytorch.classifiers.factories.KimCNNFactory(classifier_name, num_classes, kwargs={})
__init__(classifier_name, num_classes, kwargs={})
Parameters
  • classifier_name (str) – Obsolete. Do not use any more.

  • num_classes (int) – Number of classes.

  • kwargs (dict) – Keyword arguments that are passed to the constructor of each classifier that is built by the factory.

new()

Creates a new KimCNNClassifier instance.

Returns

classifier – A new instance of KimCNNClassifier which is initialized with the given keyword args kwargs.

Return type

KimCNNClassifier

Transformers Integration

class small_text.integrations.transformers.classifiers.factories.TransformerBasedClassificationFactory(transformer_model_args, num_classes, kwargs={})
__init__(transformer_model_args, num_classes, kwargs={})
Parameters
  • transformer_model_args (TransformerModelArguments) – Settings for transformer model, tokenizer and config.

  • num_classes (int) – Number of classes.

  • kwargs (dict) – Keyword arguments which will be passed to TransformerBasedClassification.

new()

Creates a new TransformerBasedClassification instance.

Returns

classifier – A new instance of TransformerBasedClassification which is initialized with the given keyword args kwargs.

Return type

TransformerBasedClassification

class small_text.integrations.transformers.classifiers.factories.SetFitClassificationFactory(setfit_model_args, num_classes, classification_kwargs={})

New in version 1.2.0.

__init__(setfit_model_args, num_classes, classification_kwargs={})
Parameters
  • setfit_model_args (SetFitModelArguments) – Settings for the sentence transformer model to be used.

  • num_classes (int) – Number of classes.

  • classification_kwargs (dict) – Keyword arguments which will be passed to SetFitClassification.

new()

Creates a new SetFitClassification instance.

Returns

classifier – A new instance of SetFitClassification which is initialized with the given keyword args classification_kwargs.

Return type

SetFitClassification