Dataset API

All dataset implementations inherit from the abstract class Dataset. Several such implementations are available, depending on the choice of classifier (and on the installed optional dependencies).

Core

class small_text.data.datasets.Dataset

A dataset contains a set of instances in the form of features, including a respective labeling for every instance.

abstract property x

Returns the features.

Returns

x – Feature representation.

Return type

object

abstract property y

Returns the labels.

Returns

y – The labels as either numpy array (single-label) or sparse matrix (multi-label).

Return type

numpy.ndarray or scipy.sparse.csr_matrix

abstract property target_labels

Returns a list of possible labels.

Returns

target_labels – List of possible labels.

Return type

numpy.ndarray
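
The contract above can be pictured with a small stand-in. The sketch below is not the library's source: DatasetSketch and ListDataset are hypothetical names, and plain Python lists stand in for the array types, purely to show how the three abstract properties fit together.

```python
from abc import ABC, abstractmethod


class DatasetSketch(ABC):
    """Hypothetical stand-in mirroring the three abstract properties."""

    @property
    @abstractmethod
    def x(self):
        """Feature representation."""

    @property
    @abstractmethod
    def y(self):
        """One labeling per instance."""

    @property
    @abstractmethod
    def target_labels(self):
        """All labels that may occur."""


class ListDataset(DatasetSketch):
    """Toy implementation backed by plain lists (illustration only)."""

    def __init__(self, x, y, target_labels=None):
        if len(x) != len(y):
            raise ValueError("x and y must have the same number of instances")
        self._x = x
        self._y = y
        # Assumption: fall back to the sorted distinct labels seen in y.
        self._target_labels = (
            sorted(set(y)) if target_labels is None else list(target_labels)
        )

    @property
    def x(self):
        return self._x

    @property
    def y(self):
        return self._y

    @property
    def target_labels(self):
        return self._target_labels


dataset = ListDataset(x=[[0.1, 0.9], [0.8, 0.2]], y=[1, 0])
```

Instantiating DatasetSketch directly raises a TypeError, because its abstract properties have no implementation; only a subclass that implements all three can be created.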

class small_text.data.datasets.SklearnDataset(x, y, target_labels=None)

A dataset representation which is usable in combination with scikit-learn classifiers.

__init__(x, y, target_labels=None)
Parameters
  • x (numpy.ndarray or scipy.sparse.csr_matrix) – Dense or sparse feature matrix.

  • y (numpy.ndarray[int]) – List of labels where each label belongs to the features of the respective row.

  • target_labels (numpy.ndarray[int] or None, default=None) – List of possible labels. Will be inferred from y if None is passed.
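
How target_labels might be inferred when None is passed can be sketched as follows. This assumes that inference means taking the sorted distinct values of y; infer_target_labels is a hypothetical helper, not a small-text function, and the library may treat special cases differently.

```python
import numpy as np


def infer_target_labels(y):
    """Hypothetical helper: sorted distinct labels found in y."""
    return np.unique(y)


# Four instances (rows of x) with one integer label each.
x = np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [0.2, 0.8]])
y = np.array([2, 0, 2, 0])

# Label 1 never occurs in y, so inference alone cannot discover it.
inferred = infer_target_labels(y)

# Passing the full label set explicitly keeps label 1 known.
explicit = np.array([0, 1, 2])
```

Under this assumption, a label that is absent from y only becomes part of the dataset if target_labels is given explicitly.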

property x

Returns the features.

Returns

x – Dense or sparse feature matrix.

Return type

numpy.ndarray or scipy.sparse.csr_matrix

property y

Returns the labels.

Returns

y – The labels as either numpy array (single-label) or sparse matrix (multi-label).

Return type

numpy.ndarray or scipy.sparse.csr_matrix
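
The two label formats can be illustrated as follows. The sketch assumes that the multi-label matrix is a binary indicator matrix with one row per instance and one column per label, which is a common convention but is not spelled out above.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Single-label: one integer label per instance.
y_single = np.array([0, 2, 1])

# Multi-label (assumed indicator layout): row i has a 1 in column j
# if instance i carries label j. Instance 1 has no labels here.
y_multi = csr_matrix(np.array([
    [1, 0, 1],
    [0, 0, 0],
    [0, 1, 0],
]))

# Recover the label list of each instance from the sparse rows.
labels_per_instance = [
    y_multi[i].indices.tolist() for i in range(y_multi.shape[0])
]
```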

property is_multi_label

Returns True if this is a multi-label dataset, otherwise False.

property target_labels

Returns a list of possible labels.

Returns

target_labels – List of possible labels.

Return type

numpy.ndarray

clone()

Returns an identical copy of the dataset.

Returns

dataset – An exact copy of the dataset.

Return type

Dataset

PyTorch Integration

class small_text.integrations.pytorch.datasets.PytorchTextClassificationDataset(data, vocab, multi_label=False, target_labels=None)

Dataset class for classifiers from the PyTorch Integration.

__init__(data, vocab, multi_label=False, target_labels=None)
Parameters
  • data (list of tuples (text data [Tensor], labels [int or list of int])) – The single items constituting the dataset. For unlabeled instances, the label should be set to small_text.base.LABEL_UNLABELED in single-label datasets, and to an empty list in multi-label datasets.

  • vocab (torchtext.vocab.Vocab) – Vocabulary object.

  • multi_label (bool, default=False) – Indicates if this is a multi-label dataset.

  • target_labels (numpy.ndarray[int] or None, default=None) – The (integer) labels that can occur in this dataset. Set this explicitly if your data lacks some labels (e.g. due to dataset splits) that entities such as the classifier should still take into account. If None, the target labels are inferred from the labels encountered in self.data.
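
The expected shape of data can be sketched with plain Python lists standing in for Tensors. The value -1 for LABEL_UNLABELED is an assumption made for this illustration (in practice, import the constant from small_text.base), and is_unlabeled is a hypothetical helper.

```python
# Assumption for illustration only; real code should import
# small_text.base.LABEL_UNLABELED instead of hard-coding a value.
LABEL_UNLABELED = -1

# Single-label: (token ids, label); the second instance is unlabeled.
single_label_data = [
    ([4, 8, 15], 0),
    ([16, 23], LABEL_UNLABELED),
    ([42, 7, 7, 3], 1),
]

# Multi-label: (token ids, list of labels); an empty list marks "unlabeled".
multi_label_data = [
    ([4, 8, 15], [0, 2]),
    ([16, 23], []),
    ([42, 7, 7, 3], [1]),
]


def is_unlabeled(label, multi_label=False):
    """Hypothetical helper applying the convention described above."""
    if multi_label:
        return len(label) == 0
    return label == LABEL_UNLABELED


# Indices of the unlabeled instances in each list.
unlabeled_single = [
    i for i, (_, label) in enumerate(single_label_data)
    if is_unlabeled(label)
]
unlabeled_multi = [
    i for i, (_, label) in enumerate(multi_label_data)
    if is_unlabeled(label, multi_label=True)
]
```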

property x

Returns the features.

Returns

x

Return type

list of Tensor

property data

Returns the internal list of tuples storing the data.

Returns

data – The internal list of tuples storing the data.

Return type

list of tuples (text data [Tensor], label)

property vocab

Returns the vocab.

Returns

vocab – Vocab object.

Return type

torchtext.vocab.Vocab

property target_labels

Returns the target labels.

Returns

target_labels – List of target labels.

Return type

list of int

to(other, non_blocking=False, copy=False)

Calls torch.Tensor.to on all Tensors in data.

Returns

self – This object, after torch.Tensor.to has been called on all Tensors in data.

Return type

PytorchTextClassificationDataset

Transformers Integration

class small_text.integrations.transformers.datasets.TransformersDataset(data, multi_label=False, target_labels=None)

Dataset class for classifiers from the Transformers Integration.

__init__(data, multi_label=False, target_labels=None)
Parameters
  • data (list of 3-tuples (text data [Tensor], mask [Tensor], labels [int or list of int])) – The single items constituting the dataset. For unlabeled instances, the label should be set to small_text.base.LABEL_UNLABELED in single-label datasets, and to an empty list in multi-label datasets.

  • multi_label (bool, default=False) – Indicates if this is a multi-label dataset.

  • target_labels (numpy.ndarray[int] or None, default=None) – The (integer) labels that can occur in this dataset. Set this explicitly if your data lacks some labels (e.g. due to dataset splits) that entities such as the classifier should still take into account. If None, the target labels are inferred from the labels encountered in self.data.
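
The 3-tuple layout can be sketched with plain lists standing in for Tensors. The helper pad_with_mask, the padding id 0, the token ids, and the fixed length 5 are all hypothetical choices for illustration; in practice a tokenizer would typically produce the ids and the mask.

```python
def pad_with_mask(token_ids, max_length, pad_id=0):
    """Hypothetical helper: pad or truncate ids and build a matching 0/1 mask."""
    ids = token_ids[:max_length]
    mask = [1] * len(ids)          # 1 marks a real token
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding


# Made-up token id sequences, each paired with an integer label.
raw = [([101, 7592, 102], 1), ([101, 2088, 2003, 2307, 999, 102], 0)]

# Each item: (text data, mask, label), mirroring the documented 3-tuple.
data = [(*pad_with_mask(ids, max_length=5), label) for ids, label in raw]
```

The first sequence is padded to length 5 and the second is truncated to it, so every item ends up with ids and mask of equal length.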

property x

Returns the features.

Returns

x

Return type

list of Tensor

property target_labels

Returns the target labels.

Returns

target_labels – List of target labels.

Return type

list of int