Transformers Integration Classes

Overview

Dataset Classes
Classification
- Transformer-based Classification
- Sentence Transformer Finetuning

Dataset Classes

class small_text.integrations.transformers.datasets.TransformersDataset(data, multi_label=False, target_labels=None)[source]

Dataset class for classifiers from Transformers Integration.

__init__(data, multi_label=False, target_labels=None)

Parameters

data (list of 3-tuples (text data [Tensor], mask [Tensor], labels [int or list of int])) – The single items constituting the dataset. For unlabeled instances, the label should be set to small_text.base.LABEL_UNLABELED in single-label datasets and to an empty list in multi-label datasets.
multi_label (bool, default=False) – Indicates if this is a multi-label dataset.
target_labels (numpy.ndarray[int] or None, default=None) – This is a list of (integer) labels to be encountered within this dataset. This is important to set if your data does not contain some labels, e.g. due to dataset splits, where the labels should however be considered by entities such as the classifier. If None, the target labels will be inferred from the labels encountered in self.data.
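As a rough illustration of the inference described for target_labels, the sketch below collects the distinct labels from a list of label entries while skipping unlabeled instances. The helper infer_target_labels and the sentinel value -1 are assumptions made for this example and are not part of the library; real code should use the constant small_text.base.LABEL_UNLABELED rather than a literal.

```python
# Assumption: stand-in for small_text.base.LABEL_UNLABELED. Import the real
# constant from the library instead of hard-coding a value.
LABEL_UNLABELED = -1


def infer_target_labels(labels, multi_label=False):
    """Collect the sorted set of labels that actually occur in `labels`."""
    found = set()
    for label in labels:
        if multi_label:
            # Multi-label: each entry is a list; an empty list means unlabeled.
            found.update(label)
        elif label != LABEL_UNLABELED:
            found.add(label)
    return sorted(found)


single = infer_target_labels([0, 2, LABEL_UNLABELED, 2])
multi = infer_target_labels([[0, 1], [], [3]], multi_label=True)
print(single, multi)
```

In the single-label call, label 1 never occurs in the data, so it is missing from the inferred labels. This is the situation in which passing target_labels explicitly matters.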

property x

Returns the features.

Returns: x
Return type: list of Tensor

property target_labels

Returns the target labels.

Returns: target_labels – List of target labels.
Return type: list of int

classmethod from_arrays(texts, y, tokenizer, target_labels=None, max_length=512)

Constructs a new TransformersDataset from the given text and label arrays.

Parameters

texts (list of str or np.ndarray[str]) – List of text documents.
y (np.ndarray[int] or scipy.sparse.csr_matrix) – List of labels where each label belongs to the features of the respective row. Depending on the type of y the resulting dataset will be single-label (np.ndarray) or multi-label (scipy.sparse.csr_matrix).
tokenizer (tokenizers.Tokenizer) – A Hugging Face tokenizer.
target_labels (numpy.ndarray[int] or None, default=None) – List of possible labels. Will be directly passed to the dataset constructor.
max_length (int) – Maximum sequence length.
train (bool) – If True fits the vectorizer and transforms the data, otherwise just transforms the data.

Returns

dataset – A dataset constructed from the given texts and labels.

Return type

TransformersDataset

Warning

This functionality is still experimental and may be subject to change.

New in version 1.1.0.
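To make the role of max_length and of the mask in each 3-tuple more concrete, the following sketch pads or truncates a sequence of token ids and builds a matching mask. It uses plain lists, a hypothetical helper pad_or_truncate, and an assumed padding id of 0. It is not the library's code: the actual method relies on the given tokenizer, returns tensors, and may treat special tokens and truncation differently.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (ids, mask), both exactly `max_length` long.

    `pad_id=0` is an assumption for this sketch; real tokenizers define
    their own padding token id.
    """
    ids = list(token_ids)[:max_length]   # cut off anything beyond max_length
    mask = [1] * len(ids)                # 1 marks a real token
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding


ids, mask = pad_or_truncate([101, 7592, 2088, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(range(10), max_length=6)
print(ids, mask)
print(long_ids, long_mask)
```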

Classification

Transformer-based Classification

class small_text.integrations.transformers.classifiers.classification.FineTuningArguments(base_lr, layerwise_gradient_decay, gradual_unfreezing=-1, cut_fraction=0.1)[source]

Arguments to enable and configure gradual unfreezing and discriminative learning rates as used in Universal Language Model Fine-tuning (ULMFiT) [HR18].

__init__(base_lr, layerwise_gradient_decay, gradual_unfreezing=-1, cut_fraction=0.1)
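Discriminative learning rates assign smaller learning rates to lower layers. The sketch below shows one plausible reading of base_lr and layerwise_gradient_decay, in which the top layer trains at base_lr and each layer below it is scaled by a further factor of layerwise_gradient_decay. The helper layerwise_learning_rates and the num_layers argument are hypothetical; the exact formula used by the library, and how gradual_unfreezing and cut_fraction interact with it, is not specified here.

```python
def layerwise_learning_rates(base_lr, layerwise_gradient_decay, num_layers):
    """Learning rates ordered from the lowest layer to the top layer.

    Assumption: geometric decay from the top layer downwards.
    """
    return [
        base_lr * layerwise_gradient_decay ** (num_layers - 1 - i)
        for i in range(num_layers)
    ]


rates = layerwise_learning_rates(2e-5, 0.9, num_layers=4)
for layer, rate in enumerate(rates):
    print(f"layer {layer}: lr={rate:.3e}")
```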

class small_text.integrations.transformers.classifiers.classification.TransformerBasedClassification(transformer_model: TransformerModelArguments, num_classes: int, multi_label: bool = False, num_epochs: int = 10, lr: float = 2e-05, mini_batch_size: int = 12, validation_set_size: float = 0.1, validations_per_epoch: int = 1, early_stopping_no_improvement: int = 5, early_stopping_acc: float = -1, model_selection: bool = True, fine_tuning_arguments=None, device=None, memory_fix=1, class_weight=None, verbosity=VERBOSITY_MORE_VERBOSE, cache_dir='.active_learning_lib_cache/')[source]

__init__(transformer_model: TransformerModelArguments, num_classes: int, multi_label: bool = False, num_epochs: int = 10, lr: float = 2e-05, mini_batch_size: int = 12, validation_set_size: float = 0.1, validations_per_epoch: int = 1, early_stopping_no_improvement: int = 5, early_stopping_acc: float = -1, model_selection: bool = True, fine_tuning_arguments=None, device=None, memory_fix=1, class_weight=None, verbosity=VERBOSITY_MORE_VERBOSE, cache_dir='.active_learning_lib_cache/')

Parameters

transformer_model (TransformerModelArguments) – Settings for transformer model, tokenizer and config.
num_classes (int) – Number of classes.
multi_label (bool, default=False) – If False, the classes are mutually exclusive, i.e. the prediction step results in exactly one predicted label per instance.
num_epochs (int, default=10) – Epochs to train.
lr (float, default=2e-5) – Learning rate.
mini_batch_size (int, default=12) – Size of mini batches during training.
validation_set_size (float, default=0.1) – The size of the validation set as a fraction of the training set.
validations_per_epoch (int, default=1) – Defines how often the validation set is evaluated during the training of a single epoch.
early_stopping_no_improvement (int, default=5) –
Number of epochs with no improvement in validation loss until early stopping is triggered.

Deprecated since version 1.1.0: Use the early_stopping kwarg in fit() instead.
early_stopping_acc (float, default=-1) –
Accuracy threshold in the interval (0, 1] which triggers early stopping.

Deprecated since version 1.1.0: Use the early_stopping kwarg in fit() instead.
model_selection (bool, default=True) – If True, the model is saved after each epoch. At the end of the training step, the model with the lowest validation error is selected.
fine_tuning_arguments (FineTuningArguments or None, default=None) – Fine tuning arguments.
device (str or torch.device, default=None) – Torch device on which the computation will be performed.
memory_fix (int, default=1) – If this value is greater than zero, the CUDA cache is emptied every memory_fix epochs to force unused GPU memory to be released.
class_weight ('balanced' or None, default=None) – If ‘balanced’, then the loss function is weighted inversely proportional to the label distribution of the current train set.

fit(train_set, validation_set=None, weights=None, early_stopping=None, model_selection=None, optimizer=None, scheduler=None)

Trains the model using the given train set.

Parameters

train_set (TransformersDataset) – Training set.
validation_set (TransformersDataset, default=None) – A validation set used for validation during training, or None. If None, the fit operation will split off a subset of the train set as a validation set, whose size is set by self.validation_set_size.
weights (np.ndarray[np.float32] or None, default=None) – Sample weights or None.
early_stopping (EarlyStoppingHandler or 'none') – A strategy for early stopping. Passing ‘none’ disables early stopping.
model_selection (ModelSelectionHandler or 'none') – A model selection handler. Passing ‘none’ disables model selection.
optimizer (torch.optim.optimizer.Optimizer or None, default=None) – A PyTorch optimizer.
scheduler (torch.optim.lr_scheduler._LRScheduler or None, default=None) – A PyTorch learning rate scheduler.

Returns

self – Returns the current classifier with a fitted model.

Return type

TransformerBasedClassification
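When validation_set is None, a fraction of the train set is held out for validation. The following sketch shows a simple random split over instance indices. The helper split_off_validation and its seed argument are hypothetical, and the library may split differently (for instance, stratified by label).

```python
import random


def split_off_validation(num_instances, validation_set_size=0.1, seed=42):
    """Return (train_indices, validation_indices) from a random split."""
    indices = list(range(num_instances))
    random.Random(seed).shuffle(indices)  # local RNG, global state untouched
    num_validation = round(num_instances * validation_set_size)
    return indices[num_validation:], indices[:num_validation]


train_idx, valid_idx = split_off_validation(100, validation_set_size=0.1)
print(len(train_idx), len(valid_idx))
```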

predict(dataset, return_proba=False)

Predicts the labels for the given dataset.

Parameters

dataset (TransformersDataset) – A dataset on whose instances predictions are made.
return_proba (bool, default=False) – If True, additionally returns the confidence distribution over all classes.

Returns

predictions (np.ndarray[np.int32] or csr_matrix[np.int32]) – List of predictions if the classifier was fitted on single-label data, otherwise a sparse matrix of predictions.
probas (np.ndarray[np.float32], optional) – List of probabilities (or confidence estimates) if return_proba is True.
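The relation between the returned probabilities and predictions can be sketched as follows: in the single-label case the most probable class is chosen, whereas in the multi-label case every class above a threshold is kept. The helper predictions_from_proba and the threshold of 0.5 are assumptions for illustration. The library returns an np.ndarray or a sparse matrix rather than the nested lists used here.

```python
def predictions_from_proba(proba, multi_label=False, threshold=0.5):
    """Derive predictions from per-class confidence scores."""
    if multi_label:
        # Keep every class whose score exceeds the (assumed) threshold.
        return [[c for c, p in enumerate(row) if p > threshold] for row in proba]
    # Single-label: exactly one label per instance, the argmax.
    return [max(range(len(row)), key=row.__getitem__) for row in proba]


single = predictions_from_proba([[0.1, 0.9], [0.7, 0.3]])
multi = predictions_from_proba([[0.8, 0.6], [0.2, 0.1]], multi_label=True)
print(single, multi)
```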

predict_proba(dataset, dropout_sampling=1)

Predicts the label distributions.

Parameters

dataset (TransformersDataset) – A dataset whose labels will be predicted.
dropout_sampling (int) – If dropout_sampling > 1 then all dropout modules will be enabled during prediction and multiple rounds of predictions will be sampled for each instance.

Returns

scores – Distribution of confidence scores over all classes of shape (num_samples, num_classes). If dropout_sampling > 1 then the shape is (num_samples, dropout_sampling, num_classes).

Return type

np.ndarray
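The change in output shape for dropout_sampling > 1 can be illustrated with the sketch below. Random distributions stand in for the stochastic forward passes of a model with dropout enabled, so the numbers carry no meaning; only the nesting (num_samples, dropout_sampling, num_classes) and the subsequent averaging over rounds are of interest. Both helper functions are hypothetical.

```python
import random


def sample_with_dropout(num_samples, num_classes, dropout_sampling, seed=0):
    """Nested list of shape (num_samples, dropout_sampling, num_classes)."""
    rng = random.Random(seed)

    def noisy_distribution():
        # Placeholder for one forward pass with active dropout.
        raw = [rng.random() + 1e-9 for _ in range(num_classes)]
        total = sum(raw)
        return [value / total for value in raw]

    return [
        [noisy_distribution() for _ in range(dropout_sampling)]
        for _ in range(num_samples)
    ]


def mean_over_rounds(scores):
    """Average the sampled rounds, giving shape (num_samples, num_classes)."""
    return [
        [sum(r[c] for r in rounds) / len(rounds) for c in range(len(rounds[0]))]
        for rounds in scores
    ]


scores = sample_with_dropout(num_samples=3, num_classes=4, dropout_sampling=5)
mean_scores = mean_over_rounds(scores)
print(len(scores), len(scores[0]), len(scores[0][0]))
```

Averaging is only one way to consume the samples; the spread across rounds can also serve as an uncertainty estimate.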

Sentence Transformer Finetuning

class small_text.integrations.transformers.classifiers.setfit.SetFitModelArguments(sentence_transformer_model: str, model_loading_strategy: ModelLoadingStrategy = ModelLoadingStrategy.DEFAULT)[source]

New in version 1.2.0.

__init__(sentence_transformer_model: str, model_loading_strategy: ModelLoadingStrategy = ModelLoadingStrategy.DEFAULT)

Parameters

sentence_transformer_model (str) – Name of a sentence transformer model.
model_loading_strategy (ModelLoadingStrategy, default=ModelLoadingStrategy.DEFAULT) – Specifies if there should be attempts to download the model or if only local files should be used.

class small_text.integrations.transformers.classifiers.setfit.SetFitClassification(setfit_model_args, num_classes, multi_label=False, max_seq_len=512, use_differentiable_head=False, mini_batch_size=32, model_kwargs=dict(), trainer_kwargs=dict(), device=None)[source]

A classifier that operates through Sentence Transformer Finetuning (SetFit, [TRE+22]).

This class is a wrapper which encapsulates the Hugging Face SetFit implementation (https://github.com/huggingface/setfit).

Note

This strategy requires the optional dependency setfit.

New in version 1.2.0.

__init__(setfit_model_args, num_classes, multi_label=False, max_seq_len=512, use_differentiable_head=False, mini_batch_size=32, model_kwargs=dict(), trainer_kwargs=dict(), device=None)

Parameters

setfit_model_args (SetFitModelArguments) – Settings for the sentence transformer model to be used.
num_classes (int) – Number of classes.
multi_label (bool, default=False) – If False, the classes are mutually exclusive, i.e. the prediction step results in exactly one predicted label per instance.
use_differentiable_head (bool) – Uses a differentiable head instead of a logistic regression for the classification head. Corresponds to the keyword argument with the same name in SetFitModel.from_pretrained().
model_kwargs (dict) – Keyword arguments used for the SetFit model. The keyword use_differentiable_head is excluded and managed by this class. The other keywords are directly passed to SetFitModel.from_pretrained().

See also

SetFit: src/setfit/modeling.py

trainer_kwargs (dict) – Keyword arguments used for the SetFit trainer. The keyword batch_size is excluded and is instead controlled by the keyword mini_batch_size of this class. The other keywords are directly passed to SetFitTrainer.__init__().

See also

SetFit: src/setfit/trainer.py

device (str or torch.device, default=None) – Torch device on which the computation will be performed.
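The handling of the managed keywords can be pictured with the sketch below, in which use_differentiable_head and batch_size supplied through the keyword dictionaries are discarded and replaced by the values given to the classifier itself. The helper build_setfit_kwargs is hypothetical; the library may reject such keys instead of overriding them.

```python
def build_setfit_kwargs(model_kwargs, trainer_kwargs,
                        use_differentiable_head, mini_batch_size):
    """Return (model_kwargs, trainer_kwargs) with managed keys enforced."""
    # Copy the dictionaries so that the caller's arguments are not mutated.
    model = {k: v for k, v in model_kwargs.items()
             if k != "use_differentiable_head"}
    model["use_differentiable_head"] = use_differentiable_head

    trainer = {k: v for k, v in trainer_kwargs.items() if k != "batch_size"}
    trainer["batch_size"] = mini_batch_size
    return model, trainer


user_trainer_kwargs = {"batch_size": 8, "num_iterations": 10}
model, trainer = build_setfit_kwargs(
    {"use_differentiable_head": True},
    user_trainer_kwargs,
    use_differentiable_head=False,
    mini_batch_size=32,
)
print(model, trainer)
```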

fit(train_set, validation_set=None, setfit_train_kwargs=dict())

Trains the model using the given train set.

Parameters

train_set (TextDataset) – A dataset used for training the model.
validation_set (TextDataset or None, default=None) – A dataset used for validation during training.
setfit_train_kwargs (dict) – Additional keyword arguments that are passed to SetFitTrainer.train().

Returns

self – Returns the current classifier with a fitted model.

Return type

SetFitClassification

predict(dataset, return_proba=False)

Predicts the labels for the given dataset.

Parameters

dataset (TextDataset) – A dataset on whose instances predictions are made.
return_proba (bool, default=False) – If True, additionally returns the confidence distribution over all classes.

Returns

predictions (np.ndarray[np.int32] or csr_matrix[np.int32]) – List of predictions if the classifier was fitted on single-label data, otherwise a sparse matrix of predictions.
probas (np.ndarray[np.float32], optional) – List of probabilities (or confidence estimates) if return_proba is True.

predict_proba(dataset)

Predicts the label distributions.

Parameters: dataset (TextDataset) – A dataset whose labels will be predicted.
Returns: scores – Distribution of confidence scores over all classes of shape (num_samples, num_classes).
Return type: np.ndarray