Pytorch Integration Classes
Dataset Classes
- class small_text.integrations.pytorch.datasets.PytorchTextClassificationDataset(data, vocab, multi_label=False, target_labels=None)
Dataset class for classifiers from Pytorch Integration.
- __init__(data, vocab, multi_label=False, target_labels=None)
- Parameters
data (list of tuples (text data [Tensor], labels [int or list of int])) – The single items constituting the dataset. For unlabeled instances, the label should be set to small_text.base.LABEL_UNLABELED in single-label datasets and to an empty list in multi-label datasets.
vocab (torchtext.vocab.Vocab) – Vocabulary object.
multi_label (bool, default=False) – Indicates if this is a multi-label dataset.
target_labels (np.ndarray[int] or None, default=None) – The (integer) labels that can occur in this dataset. Set this explicitly if your data does not contain some labels, e.g. due to dataset splits, but those labels should still be considered by components such as the classifier. If None, the target labels are inferred from the labels encountered in self.data.
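As a rough illustration of the fallback described for target_labels, the following sketch infers the labels from the data itself. It is not the library's implementation: infer_target_labels is a hypothetical helper, plain lists stand in for tensors, and the value -1 for LABEL_UNLABELED is an assumption made for this example.

```python
# Hypothetical stand-in for small_text.base.LABEL_UNLABELED (value assumed).
LABEL_UNLABELED = -1


def infer_target_labels(data, multi_label=False):
    """Collect the sorted set of labels that occur in `data`.

    `data` is a list of (features, label) tuples; unlabeled items are skipped.
    """
    seen = set()
    for _, label in data:
        if multi_label:
            seen.update(label)  # an empty list marks an unlabeled item
        elif label != LABEL_UNLABELED:
            seen.add(label)
    return sorted(seen)


# Single-label data in which class 2 happens to be missing from this split.
single = [([4, 8, 15], 0), ([16, 23], 1), ([42], LABEL_UNLABELED)]
print(infer_target_labels(single))  # [0, 1]

# Multi-label data; the second item is unlabeled.
multi = [([4, 8], [0, 3]), ([15], []), ([16, 23], [1])]
print(infer_target_labels(multi, multi_label=True))  # [0, 1, 3]
```

Because class 2 never occurs in `single`, inference yields only [0, 1]. Passing target_labels explicitly is the way to keep such a label visible to the classifier.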
- property x
Returns the features.
- Returns
x
- Return type
list of Tensor
- property data
Returns the internal list of tuples storing the data.
- Returns
data – The internal list of tuples.
- Return type
list of tuples (text data [Tensor], label)
- property vocab
Returns the vocab.
- Returns
vocab – Vocab object.
- Return type
torchtext.vocab.Vocab
- property target_labels
Returns the target labels.
- Returns
target_labels – List of target labels.
- Return type
list of int
- to(other, non_blocking=False, copy=False)
Calls torch.Tensor.to on all Tensors in data.
- Returns
self – The object with to having been called on all Tensors in data.
- Return type
PytorchTextClassificationDataset
See also
torch.Tensor.to (PyTorch documentation)
Models
- class small_text.integrations.pytorch.models.kimcnn.KimCNN(vocabulary_size, max_seq_length, num_classes=2, out_channels=100, embed_dim=300, padding_idx=0, kernel_heights=[3, 4, 5], dropout=0.5, embedding_matrix=None, freeze_embedding_layer=False)
- forward(x)
- Parameters
x (torch.LongTensor or torch.cuda.LongTensor) – input tensor (batch_size, max_sequence_length) with padded sequences of word ids
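The constructor arguments suggest the usual Kim (2014) layout: one convolution per entry in kernel_heights, max-over-time pooling, and concatenation of the pooled features. Under that assumption, and further assuming stride 1 and no padding along the sequence dimension, the intermediate sizes can be worked out as in the sketch below. kimcnn_feature_sizes is a hypothetical helper and not part of small_text.

```python
def kimcnn_feature_sizes(max_seq_length, out_channels=100, kernel_heights=(3, 4, 5)):
    """Return per-kernel convolution output lengths and the pooled feature size.

    Assumes stride 1 and no padding along the sequence dimension.
    """
    conv_lengths = {}
    for h in kernel_heights:
        if h > max_seq_length:
            raise ValueError(f"kernel height {h} exceeds sequence length {max_seq_length}")
        # A window of height h fits at max_seq_length - h + 1 positions.
        conv_lengths[h] = max_seq_length - h + 1
    # Max-over-time pooling keeps one value per channel and kernel height.
    pooled_size = out_channels * len(kernel_heights)
    return conv_lengths, pooled_size


conv_lengths, pooled_size = kimcnn_feature_sizes(max_seq_length=60)
print(conv_lengths)  # {3: 58, 4: 57, 5: 56}
print(pooled_size)   # 300
```

With the defaults shown in the signature, the concatenated feature vector would have 300 entries regardless of max_seq_length, since pooling removes the sequence dimension.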
Classification
- class small_text.integrations.pytorch.classifiers.kimcnn.KimCNNClassifier(num_classes, multi_label=False, embedding_matrix=None, device=None, num_epochs=10, mini_batch_size=25, lr=0.001, max_seq_len=60, out_channels=100, filter_padding=0, dropout=0.5, validation_set_size=0.1, padding_idx=0, kernel_heights=[3, 4, 5], early_stopping=5, early_stopping_acc=0.98, class_weight=None, verbosity=20)
- fit(train_set, validation_set=None, optimizer=None, scheduler=None)
Trains the model using the given train set.
- Parameters
train_set (PytorchTextClassificationDataset) – The dataset used for training the model.
validation_set (PytorchTextClassificationDataset) – A validation set used for validation during training, or None. If None, the fit operation splits off a subset of the train set as a validation set, whose size is set by self.validation_set_size.
optimizer (torch.optim.optimizer.Optimizer) – A PyTorch optimizer.
scheduler (torch.optim.lr_scheduler._LRScheduler) – A PyTorch scheduler.
- Returns
self – Returns the current classifier with a fitted model.
- Return type
KimCNNClassifier
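The fallback for validation_set=None can be pictured as a random index split. The sketch below is only one plausible reading: split_validation_indices and its seed argument are hypothetical, and the library may split differently (for example, stratified by label).

```python
import random


def split_validation_indices(num_items, validation_set_size=0.1, seed=None):
    """Randomly partition range(num_items) into train and validation indices."""
    if not 0.0 < validation_set_size < 1.0:
        raise ValueError("validation_set_size must be in (0, 1)")
    indices = list(range(num_items))
    random.Random(seed).shuffle(indices)
    # The validation share is rounded down to a whole number of items.
    num_valid = int(num_items * validation_set_size)
    return sorted(indices[num_valid:]), sorted(indices[:num_valid])


train_idx, valid_idx = split_validation_indices(50, validation_set_size=0.1, seed=0)
print(len(train_idx), len(valid_idx))  # 45 5
```

Passing an explicit validation_set to fit avoids this implicit split and keeps the validation data fixed across runs.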
- validate(validation_set)
Obtains validation scores (loss, accuracy) for the given validation set.
- Parameters
validation_set (PytorchTextClassificationDataset) – Validation set.
- Returns
validation_loss (float) – Validation loss.
validation_acc (float) – Validation accuracy.
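To make the two returned scores concrete, here is a minimal sketch for the single-label case. It assumes that the loss is the mean cross-entropy over the predicted class probabilities, which the text above does not state; validation_scores is a hypothetical helper, and plain lists stand in for tensors.

```python
import math


def validation_scores(proba, labels):
    """Return (mean cross-entropy loss, accuracy) for single-label predictions."""
    if not labels or len(proba) != len(labels):
        raise ValueError("proba and labels must be non-empty and of equal length")
    loss = 0.0
    correct = 0
    for row, label in zip(proba, labels):
        # Cross-entropy: negative log-probability assigned to the true class.
        loss -= math.log(row[label])
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += int(predicted == label)
    n = len(labels)
    return loss / n, correct / n


proba = [[0.8, 0.2], [0.4, 0.6], [0.7, 0.3]]
labels = [0, 1, 1]
validation_loss, validation_acc = validation_scores(proba, labels)
print(validation_acc)  # two of three predictions are correct
```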
- predict(data_set, return_proba=False)
Predicts the labels for the given dataset.
- Parameters
data_set (PytorchTextClassificationDataset) – A dataset on whose instances predictions are made.
return_proba (bool) – If True, additionally returns the confidence distribution over all classes.
- Returns
predictions (np.ndarray[np.int32] or csr_matrix[np.int32]) – List of predictions if the classifier was fitted on single-label data, otherwise a sparse matrix of predictions.
probas (np.ndarray[np.float32] (optional)) – List of probabilities (or confidence estimates) if return_proba is True.
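The relationship between the two return values can be sketched as follows. This is an illustration only: predictions_from_proba is a hypothetical helper, nested lists stand in for np.ndarray and csr_matrix, and the 0.5 threshold for the multi-label case is an assumption.

```python
def predictions_from_proba(proba, multi_label=False, threshold=0.5):
    """Turn per-class scores into predictions.

    Single-label: index of the highest score per row.
    Multi-label: a 0/1 indicator row marking scores at or above `threshold`.
    """
    if multi_label:
        return [[int(p >= threshold) for p in row] for row in proba]
    return [max(range(len(row)), key=row.__getitem__) for row in proba]


proba = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
print(predictions_from_proba(proba))  # [0, 2]

multi_proba = [[0.9, 0.6, 0.1], [0.2, 0.4, 0.8]]
print(predictions_from_proba(multi_proba, multi_label=True))  # [[1, 1, 0], [0, 0, 1]]
```

The indicator rows in the multi-label case are mostly zeros when there are many classes, which is presumably why the method returns a sparse matrix there.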
- predict_proba(test_set)
- Parameters
test_set (small_text.integrations.pytorch.PytorchTextClassificationDataset) – Test set.
- Returns
scores – Distribution of confidence scores over all classes.
- Return type
np.ndarray