PyTorch Integration Classes

Dataset Classes

class small_text.integrations.pytorch.datasets.PytorchTextClassificationDataset(data, vocab, multi_label=False, target_labels=None)[source]

Dataset class for classifiers from the PyTorch integration.

__init__(data, vocab, multi_label=False, target_labels=None)
Parameters
  • data (list of tuples (text data [Tensor], labels [int or list of int])) – The individual items that make up the dataset. For single-label datasets, the label of an unlabeled instance should be set to small_text.base.LABEL_UNLABELED; for multi-label datasets, it should be an empty list.

  • vocab (torchtext.vocab.Vocab) – Vocabulary object.

  • multi_label (bool, default=False) – Indicates if this is a multi-label dataset.

  • target_labels (np.ndarray[int] or None, default=None) – List of (integer) labels that can occur in this dataset. Set this if your data does not contain every label (e.g. due to dataset splits) but the missing labels should still be known to components such as the classifier. If None, the target labels are inferred from the labels encountered in self.data.
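
The layout of data and the effect of leaving target_labels unset can be sketched in plain Python. This is only an illustration, not the library's code: lists of token ids stand in for Tensors, LABEL_UNLABELED is assumed to be -1 (the real constant is small_text.base.LABEL_UNLABELED), and infer_target_labels is a hypothetical helper showing one way the labels could be inferred from the data.

```python
# Assumption: stand-in value for small_text.base.LABEL_UNLABELED.
LABEL_UNLABELED = -1

# Single-label data: (token ids, label). Lists stand in for Tensors here.
single_label_data = [
    ([4, 8, 15], 0),
    ([16, 23], 2),
    ([42, 7, 9], LABEL_UNLABELED),  # unlabeled instance
]

# Multi-label data: (token ids, list of labels); an empty list means unlabeled.
multi_label_data = [
    ([4, 8, 15], [0, 2]),
    ([16, 23], []),  # unlabeled instance
]


def infer_target_labels(data, multi_label=False):
    """Hypothetical helper: collect the sorted set of labels that occur in data."""
    seen = set()
    for _, label in data:
        if multi_label:
            seen.update(label)
        elif label != LABEL_UNLABELED:
            seen.add(label)
    return sorted(seen)


inferred_single = infer_target_labels(single_label_data)
inferred_multi = infer_target_labels(multi_label_data, multi_label=True)
# Label 1 never occurs in either dataset, so it is missing from the inferred
# labels. Passing target_labels explicitly (e.g. [0, 1, 2]) avoids this.
```

In both cases label 1 is absent from the inferred result, which is the situation the target_labels parameter is meant to handle.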

property x

Returns the features.

Returns

x

Return type

list of Tensor

property data

Returns the internal list of tuples storing the data.

Returns

data – List of tuples storing the data.

Return type

list of tuples (text data [Tensor], label)

property vocab

Returns the vocab.

Returns

vocab – Vocab object.

Return type

torchtext.vocab.Vocab

property target_labels

Returns the target labels.

Returns

target_labels – List of target labels.

Return type

list of int

to(other, non_blocking=False, copy=False)

Calls torch.Tensor.to on all Tensors in data.

Returns

self – The object with to having been called on all Tensors in data.

Return type

PytorchTextClassificationDataset

classmethod from_arrays(texts, y, text_field, target_labels=None, train=True)

Constructs a new PytorchTextClassificationDataset from the given text and label arrays.

Parameters
  • texts (list of str or np.ndarray[str]) – List of text documents.

  • y (np.ndarray[int] or scipy.sparse.csr_matrix) – List of labels where each label belongs to the features of the respective row. Depending on the type of y the resulting dataset will be single-label (np.ndarray) or multi-label (scipy.sparse.csr_matrix).

  • text_field (torchtext.data.field.Field or torchtext.legacy.data.field.Field) – A torchtext field used for preprocessing the text and building the vocabulary.

  • vocab (object) – A torch

  • target_labels (numpy.ndarray[int] or None, default=None) – List of possible labels. Will be passed directly to the dataset constructor.

  • train (bool) – If True, fits the vectorizer and transforms the data; otherwise only transforms the data.

Returns

dataset – A dataset constructed from the given texts and labels.

Return type

PytorchTextClassificationDataset

Warning

This functionality is still experimental and may be subject to change.

New in version 1.1.0.
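
The fit-versus-transform distinction behind the train flag can be illustrated without torchtext. The sketch below is an assumption-laden stand-in, not what from_arrays does internally: fit_vocab, transform, and encode are hypothetical helpers, tokenization is plain whitespace splitting, and the special tokens are made up for the example.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # assumed special tokens, for illustration only


def fit_vocab(texts):
    """Build a token -> id mapping from whitespace-tokenized texts."""
    counts = Counter(token for text in texts for token in text.split())
    itos = [PAD, UNK] + sorted(counts)
    return {token: idx for idx, token in enumerate(itos)}


def transform(texts, vocab):
    """Map each text to a list of token ids; unknown tokens map to UNK."""
    return [[vocab.get(tok, vocab[UNK]) for tok in text.split()] for text in texts]


def encode(texts, vocab=None, train=True):
    """Fit the vocabulary only when train=True, then transform the texts."""
    if train:
        vocab = fit_vocab(texts)
    elif vocab is None:
        raise ValueError("a fitted vocab is required when train=False")
    return transform(texts, vocab), vocab


# Training data fits the vocabulary; test data reuses it unchanged.
train_ids, vocab = encode(["good movie", "bad movie"], train=True)
test_ids, _ = encode(["good film"], vocab=vocab, train=False)
```

Here "film" was not seen during fitting, so it maps to the unknown-token id instead of extending the vocabulary.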

Models

class small_text.integrations.pytorch.models.kimcnn.KimCNN(vocabulary_size, max_seq_length, num_classes=2, out_channels=100, embed_dim=300, padding_idx=0, kernel_heights=[3, 4, 5], dropout=0.5, embedding_matrix=None, freeze_embedding_layer=False)[source]
forward(x)
Parameters

x (torch.LongTensor or torch.cuda.LongTensor) – input tensor (batch_size, max_sequence_length) with padded sequences of word ids
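
The input layout expected by forward can be sketched as follows. This is a minimal plain-Python illustration under stated assumptions: pad_sequences is a hypothetical helper, and right-padding plus truncation to max_seq_length is an assumed convention, not confirmed behavior of the library.

```python
def pad_sequences(sequences, max_seq_length, padding_idx=0):
    """Truncate or right-pad each sequence of word ids to max_seq_length."""
    batch = []
    for seq in sequences:
        seq = list(seq)[:max_seq_length]
        batch.append(seq + [padding_idx] * (max_seq_length - len(seq)))
    return batch


# Three sequences of different lengths become a (3, 4) batch of word ids.
batch = pad_sequences([[5, 9, 2], [7], [3, 3, 3, 3, 3, 3]], max_seq_length=4)
```

With PyTorch available, such a nested list could be turned into the expected input via torch.tensor(batch, dtype=torch.long).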

Classification

class small_text.integrations.pytorch.classifiers.kimcnn.KimCNNClassifier(num_classes, multi_label=False, embedding_matrix=None, device=None, num_epochs=10, mini_batch_size=25, lr=0.001, max_seq_len=60, out_channels=100, filter_padding=0, dropout=0.5, validation_set_size=0.1, padding_idx=0, kernel_heights=[3, 4, 5], early_stopping=5, early_stopping_acc=-1, class_weight=None, verbosity=VERBOSITY_MORE_VERBOSE)[source]
fit(train_set, validation_set=None, weights=None, early_stopping=None, model_selection=None, optimizer=None, scheduler=None)

Trains the model using the given train set.

Parameters
  • train_set (PytorchTextClassificationDataset) – The dataset used for training the model.

  • validation_set (PytorchTextClassificationDataset) – A validation set used for validation during training, or None. If None, fit holds out a subset of the train set as the validation set; the size of this subset is given by self.validation_set_size.

  • weights (np.ndarray[np.float32] or None, default=None) – Sample weights or None.

  • early_stopping (EarlyStoppingHandler or 'none') – A strategy for early stopping. Passing 'none' disables early stopping.

  • model_selection (ModelSelectionHandler or 'none') – A model selection handler. Passing 'none' disables model selection.

  • optimizer (torch.optim.optimizer.Optimizer) – A PyTorch optimizer.

  • scheduler (torch.optim.lr_scheduler._LRScheduler) – A PyTorch learning rate scheduler.

Returns

self – Returns the current classifier with a fitted model.

Return type

KimCNNClassifier
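
When validation_set is None, a part of the train set is held out for validation. The following sketch shows one simple way such a split could work. It is an assumption for illustration only: split_validation is a hypothetical helper, and the library may split differently (for example, stratified by label).

```python
import random


def split_validation(num_samples, validation_set_size=0.1, seed=None):
    """Return (train_indices, validation_indices) for a random holdout split."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    # Hold out at least one instance for validation.
    num_valid = max(1, int(round(num_samples * validation_set_size)))
    return sorted(indices[num_valid:]), sorted(indices[:num_valid])


# With 50 instances and validation_set_size=0.1, 5 instances are held out.
train_idx, valid_idx = split_validation(50, validation_set_size=0.1, seed=42)
```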

validate(validation_set)

Obtains validation scores (loss, accuracy) for the given validation set.

Parameters

validation_set (PytorchTextClassificationDataset) – Validation set.

Returns

  • validation_loss (float) – Validation loss.

  • validation_acc (float) – Validation accuracy.
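
What the two returned scores mean can be illustrated for the single-label case. The sketch below assumes a mean cross-entropy loss, which the documentation above does not state; validation_scores is a hypothetical helper operating on precomputed class probabilities rather than on a dataset.

```python
import math


def validation_scores(probas, labels):
    """Compute (mean cross-entropy loss, accuracy) from class probabilities."""
    n = len(labels)
    # Negative log-likelihood of the true class, averaged over all instances.
    loss = -sum(math.log(row[y]) for row, y in zip(probas, labels)) / n
    # Predicted class is the index of the highest probability.
    predicted = [max(range(len(row)), key=row.__getitem__) for row in probas]
    acc = sum(1 for p, y in zip(predicted, labels) if p == y) / n
    return loss, acc


probas = [[0.8, 0.2], [0.4, 0.6], [0.7, 0.3]]
labels = [0, 1, 1]
validation_loss, validation_acc = validation_scores(probas, labels)
```

The third instance is misclassified, so the accuracy is 2/3, while the loss also reflects how confident each prediction was.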

predict(dataset, return_proba=False)

Predicts the labels for the given dataset.

Parameters
  • dataset (PytorchTextClassificationDataset) – A dataset on whose instances predictions are made.

  • return_proba (bool) – If True, additionally returns the confidence distribution over all classes.

Returns

  • predictions (np.ndarray[np.int32] or csr_matrix[np.int32]) – List of predictions if the classifier was fitted on single-label data, otherwise a sparse matrix of predictions.

  • probas (np.ndarray[np.float32] (optional)) – List of probabilities (or confidence estimates) if return_proba is True.
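
The relation between the confidence distribution and the returned predictions can be sketched as follows. This is an illustration under assumptions, not the library's implementation: predict_from_proba is a hypothetical helper, nested lists stand in for arrays and sparse matrices, and the multi-label threshold of 0.5 is assumed.

```python
def predict_from_proba(probas, multi_label=False, threshold=0.5):
    """Derive predictions from a (num_samples, num_classes) score matrix."""
    if multi_label:
        # Multi-label: every class whose score exceeds the threshold.
        return [[c for c, p in enumerate(row) if p > threshold] for row in probas]
    # Single-label: the class with the highest score.
    return [max(range(len(row)), key=row.__getitem__) for row in probas]


single = predict_from_proba([[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]])
multi = predict_from_proba([[0.9, 0.2, 0.6], [0.1, 0.4, 0.3]], multi_label=True)
```

In the multi-label case an instance can receive no label at all, which is why a sparse matrix is a natural return type there.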

predict_proba(dataset, dropout_sampling=1)

Predicts the label distributions.

Parameters
  • dataset (PytorchTextClassificationDataset) – A dataset whose labels will be predicted.

  • dropout_sampling (int) – If dropout_sampling > 1 then all dropout modules will be enabled during prediction and multiple rounds of predictions will be sampled for each instance.

Returns

scores – Distribution of confidence scores over all classes of shape (num_samples, num_classes). If dropout_sampling > 1 then the shape is (num_samples, dropout_sampling, num_classes).

Return type

np.ndarray
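
The two output shapes can be illustrated with simulated scores. This sketch makes several assumptions: sample_scores produces random normalized scores in place of real stochastic forward passes, nested lists stand in for np.ndarray, and averaging over the sampling rounds (mean_over_rounds) is just one common way to aggregate them, not something the documentation above prescribes.

```python
import random


def sample_scores(num_samples, dropout_sampling, num_classes, seed=0):
    """Simulate normalized confidence scores for each stochastic forward pass."""
    rng = random.Random(seed)
    scores = []
    for _ in range(num_samples):
        rounds = []
        for _ in range(dropout_sampling):
            raw = [rng.random() + 1e-9 for _ in range(num_classes)]
            total = sum(raw)
            rounds.append([v / total for v in raw])
        scores.append(rounds)
    return scores  # shape: (num_samples, dropout_sampling, num_classes)


def mean_over_rounds(scores):
    """Average the sampled distributions per instance: (num_samples, num_classes)."""
    return [[sum(col) / len(rounds) for col in zip(*rounds)] for rounds in scores]


scores = sample_scores(num_samples=4, dropout_sampling=3, num_classes=2)
mean_scores = mean_over_rounds(scores)
```

Because each sampled row sums to one, the averaged rows do as well, so the result can be used like the output obtained with dropout_sampling=1.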