Dataset API¶

All dataset implementations inherit from the abstract class Dataset. Several implementations are available; which one to use depends on the chosen classifier and on which optional dependencies are installed.

Overview

Core
Pytorch Integration
Transformers Integration

Core¶

class small_text.data.datasets.Dataset¶

Abstract class for all datasets.

property x¶

Returns the features.

Returns: x – Feature representation.
Return type: object

property y¶

Returns the labels.

Returns: y – Label representation.
Return type: object

property target_labels¶

Returns a list of possible labels.

Returns: target_labels – List of possible labels.
Return type: numpy.ndarray
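The interface above can be pictured as an abstract base class with three read-only properties. The following is a minimal sketch, not the library's actual source: `DatasetSketch` and `InMemoryDataset` are hypothetical names, and plain Python lists stand in for whatever feature and label containers a real implementation uses.

```python
from abc import ABC, abstractmethod


class DatasetSketch(ABC):
    """Hypothetical stand-in mirroring the documented interface."""

    @property
    @abstractmethod
    def x(self):
        """Feature representation."""

    @property
    @abstractmethod
    def y(self):
        """Label representation."""

    @property
    @abstractmethod
    def target_labels(self):
        """Possible labels."""


class InMemoryDataset(DatasetSketch):
    """Toy implementation backed by plain lists (an assumption for illustration)."""

    def __init__(self, x, y, target_labels):
        self._x = x
        self._y = y
        self._target_labels = target_labels

    @property
    def x(self):
        return self._x

    @property
    def y(self):
        return self._y

    @property
    def target_labels(self):
        return self._target_labels


# Two rows of features, one label per row, and the full label set.
dataset = InMemoryDataset([[0.1, 0.9], [0.8, 0.2]], [1, 0], [0, 1])
```

Because all three properties are abstract, the base class in this sketch cannot be instantiated directly; only subclasses that implement every property can.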

class small_text.data.datasets.SklearnDataset(x, y, target_labels=None)¶

A dataset representation that can be used in combination with scikit-learn classifiers.

Parameters

x (numpy.ndarray or scipy.sparse.csr_matrix) – Dense or sparse feature matrix.
y (list of int) – List of labels, where each label belongs to the corresponding row of x.
target_labels (list of int or None) – List of possible labels. Will be inferred from y if None is passed.

__init__(x, y, target_labels=None)¶

property x¶

Returns the features.

Returns: x – Dense or sparse feature matrix.
Return type: numpy.ndarray or scipy.sparse.csr_matrix

property y¶

Returns the labels.

Returns: y – List of labels.
Return type: numpy.ndarray

property target_labels¶

Returns a list of possible labels.

Returns: target_labels – List of possible labels.
Return type: numpy.ndarray
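The fallback described for target_labels (inferred from y when None is passed) could work roughly as follows. This is a sketch under assumptions: `infer_target_labels` is a hypothetical helper, not part of the library, and it returns a plain sorted list to stay dependency-free, whereas the documented return type is numpy.ndarray.

```python
def infer_target_labels(y, target_labels=None):
    """Hypothetical helper: fall back to the distinct values found in y."""
    if target_labels is not None:
        # An explicit label set always wins over inference.
        return list(target_labels)
    # Otherwise, only the labels that actually occur in y are known.
    return sorted(set(y))


y = [2, 0, 2, 1, 0]
inferred = infer_target_labels(y)
explicit = infer_target_labels(y, target_labels=[0, 1, 2, 3])
```

Under this sketch, passing target_labels explicitly matters whenever y does not contain every class, since inference can only see the labels that are present.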

Pytorch Integration¶

class small_text.integrations.pytorch.datasets.PytorchTextClassificationDataset(data, vocab, target_labels=None, device=None)¶

Dataset class for classifiers from Pytorch Integration.

__init__(data, vocab, target_labels=None, device=None)¶

Parameters

data (list of tuples (text data [Tensor], label)) – Data set.
vocab (torchtext.vocab.Vocab) – Vocabulary object.

property x¶

Returns the features.

Returns: x – Feature representation.
Return type: object

property y¶

Returns the labels.

Returns: y – Label representation.
Return type: object

property data¶

Returns the internal list of tuples storing the data.

Returns: data – Internal list of tuples storing the data.
Return type: list of tuples (text data [Tensor], label)

property vocab¶

Returns the vocab.

Returns: vocab – Vocab object.
Return type: torchtext.vocab.Vocab

property target_labels¶

Returns a list of possible labels.

Returns: target_labels – List of possible labels.
Return type: numpy.ndarray

to(device=None, dtype=None, non_blocking=False, copy=False, memory_format=torch.preserve_format)¶

Calls torch.Tensor.to on all Tensors in data.

Returns: self – This dataset, after torch.Tensor.to has been called on all Tensors in data.
Return type: PytorchTextClassificationDataset
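The behavior of to described above (apply the call to every Tensor in data, then return self) can be sketched without PyTorch installed. Everything here is an assumption for illustration: `FakeTensor` is a stand-in for torch.Tensor that only tracks a device string, and `TextDatasetSketch` is a hypothetical class, not the library's implementation.

```python
class FakeTensor:
    """Stand-in for torch.Tensor so the sketch runs without PyTorch."""

    def __init__(self, values, device="cpu"):
        self.values = values
        self.device = device

    def to(self, device=None):
        # Return a tensor on the requested device (or the current one).
        return FakeTensor(self.values, device=device or self.device)


class TextDatasetSketch:
    """Hypothetical dataset holding (text tensor, label) tuples."""

    def __init__(self, data):
        self.data = data

    def to(self, device=None):
        # Move every text tensor, leave the labels untouched, and
        # return self so that calls can be chained.
        self.data = [(text.to(device=device), label) for text, label in self.data]
        return self


dataset = TextDatasetSketch([(FakeTensor([4, 8, 15]), 0), (FakeTensor([16, 23]), 1)])
moved = dataset.to(device="cuda:0")
```

Returning self rather than a copy means `moved` and `dataset` refer to the same object in this sketch, which matches the documented return value.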

Transformers Integration¶

class small_text.integrations.transformers.datasets.TransformersDataset(data, target_labels=None, device=None)¶

Dataset class for classifiers from Transformers Integration.

__init__(data, target_labels=None, device=None)¶

Parameters: data (list of 3-tuples (text data [Tensor], mask [Tensor], label [int])) – Data set.

property x¶

Returns the features.

Returns: x – Feature representation.
Return type: object

property y¶

Returns the labels.

Returns: y – Label representation.
Return type: object

property target_labels¶

Returns a list of possible labels.

Returns: target_labels – List of possible labels.
Return type: numpy.ndarray
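The 3-tuple layout that TransformersDataset expects for data (text, mask, label) can be illustrated with plain lists. This sketch assumes the mask is an attention mask with 1 for real tokens and 0 for padding, which the text above does not state; `encode` is a hypothetical helper, the token ids are made up, and lists stand in for the Tensors the class actually takes.

```python
def encode(token_ids, label, max_length, pad_id=0):
    """Hypothetical helper: pad or truncate token ids and build a matching mask."""
    token_ids = token_ids[:max_length]
    padding = max_length - len(token_ids)
    text = token_ids + [pad_id] * padding
    # 1 marks a real token, 0 marks padding (assumed convention).
    mask = [1] * len(token_ids) + [0] * padding
    return text, mask, label


data = [
    encode([101, 7592, 102], label=1, max_length=5),
    encode([101, 2023, 2003, 2307, 102, 999], label=0, max_length=5),
]
```

Each element of `data` is then a (text, mask, label) tuple of equal text and mask length, which is the shape the constructor's data parameter describes.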