Dataset API
All dataset implementations inherit from the abstract class Dataset.
Several such implementations are available, depending on the choice of classifier (and on the installed optional dependencies).
Core
- class small_text.data.datasets.Dataset
A dataset contains a set of instances in the form of features, including a respective labeling for every instance.
- abstract property x: DATA
Returns the features.
- Returns:
x – Feature representation.
- Return type:
object
- abstract property y: LABELS
Returns the labels.
- Returns:
y – The labels as either numpy array (single-label) or sparse matrix (multi-label).
- Return type:
numpy.ndarray or scipy.sparse.csr_matrix
- abstract property target_labels
Returns a list of possible labels.
- Returns:
target_labels – List of possible labels.
- Return type:
- class small_text.data.datasets.SklearnDataset(x, y, target_labels=None)
A dataset representation that is usable in combination with scikit-learn classifiers.
- __init__(x, y, target_labels=None)
- Parameters:
x (numpy.ndarray or scipy.sparse.csr_matrix) – Dense or sparse feature matrix.
y (numpy.ndarray[int] or scipy.sparse.csr_matrix) – List of labels where each label belongs to the features of the respective row.
target_labels (numpy.ndarray[int] or None, default=None) – List of possible labels. Will be inferred from y if None is passed.
- property x
Returns the features.
- Returns:
x – Dense or sparse feature matrix.
- Return type:
numpy.ndarray or scipy.sparse.csr_matrix
- property y
Returns the labels.
- Returns:
y – The labels as either numpy array (single-label) or sparse matrix (multi-label).
- Return type:
numpy.ndarray or scipy.sparse.csr_matrix
- property is_multi_label
Returns True if this is a multi-label dataset, otherwise False.
- property target_labels
Returns a list of possible labels.
- Returns:
target_labels – List of possible labels.
- Return type:
- clone()
Returns an identical copy of the dataset.
- Returns:
dataset – An exact copy of the dataset.
- Return type:
SklearnDataset
- classmethod from_arrays(texts, y, vectorizer, target_labels=None, train=True)
Constructs a new SklearnDataset from the given text and label arrays.
- Parameters:
texts (list of str or np.ndarray[str]) – List of text documents.
y (np.ndarray[int] or scipy.sparse.csr_matrix) – List of labels where each label belongs to the features of the respective row. Depending on the type of y the resulting dataset will be single-label (np.ndarray) or multi-label (scipy.sparse.csr_matrix).
vectorizer (object) – A scikit-learn vectorizer which is used to construct the feature matrix.
target_labels (numpy.ndarray[int] or None, default=None) – List of possible labels. Will be directly passed to the dataset constructor.
train (bool) – If True fits the vectorizer and transforms the data, otherwise just transforms the data.
- Returns:
dataset – A dataset constructed from the given texts and labels.
- Return type:
SklearnDataset
Added in version 1.1.0.
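The effect of the train flag described above can be sketched with scikit-learn alone. This is a minimal illustration, not small-text's implementation: the example texts are made up, and TfidfVectorizer is just one possible choice of vectorizer.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up example data (assumption, for illustration only).
train_texts = ["the cat sat", "dogs bark loudly", "the cat purrs"]
test_texts = ["the dog sat"]
y_train = np.array([0, 1, 0])  # single-label: one integer per document

vectorizer = TfidfVectorizer()

# Corresponds to train=True: fit the vocabulary, then transform.
x_train = vectorizer.fit_transform(train_texts)

# Corresponds to train=False: only transform with the already fitted vocabulary.
x_test = vectorizer.transform(test_texts)

# Both matrices are sparse and share the same feature columns.
print(x_train.shape, x_test.shape)
```

With small-text installed, the same texts, labels, and vectorizer would be passed to SklearnDataset.from_arrays, once with train=True for the training set and once with train=False for the test set, so that both datasets share one feature space.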
- class small_text.data.datasets.TextDataset(x, y, target_labels=None)
A dataset representation consisting of raw text data.
- __init__(x, y, target_labels=None)
- Parameters:
x (list of str) – List of texts.
y (numpy.ndarray[int] or scipy.sparse.csr_matrix) – List of labels where each label belongs to the features of the respective row.
target_labels (numpy.ndarray[int] or None, default=None) – List of possible labels. Will be inferred from y if None is passed.
- property x
Returns the features.
- Returns:
x – List of texts.
- Return type:
list of str
- property y
Returns the labels.
- Returns:
y – The labels as either numpy array (single-label) or sparse matrix (multi-label).
- Return type:
numpy.ndarray or scipy.sparse.csr_matrix
- property is_multi_label
Returns True if this is a multi-label dataset, otherwise False.
- property target_labels
Returns a list of possible labels.
- Returns:
target_labels – List of possible labels.
- Return type:
- clone() → TextDataset
Returns an identical copy of the dataset.
- Returns:
dataset – An exact copy of the dataset.
- Return type:
TextDataset
- classmethod from_arrays(texts, y, target_labels=None)
Constructs a new TextDataset from the given text and label arrays.
- Parameters:
texts (list of str) – List of text documents.
y (np.ndarray[int] or scipy.sparse.csr_matrix) – List of labels where each label belongs to the features of the respective row. Depending on the type of y the resulting dataset will be single-label (np.ndarray) or multi-label (scipy.sparse.csr_matrix).
target_labels (numpy.ndarray[int] or None, default=None) – List of possible labels. Will be directly passed to the dataset constructor.
- Returns:
dataset – A dataset constructed from the given texts and labels.
- Return type:
TextDataset
Added in version 1.2.0.
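Both core datasets take y either as a dense integer array (single-label) or as a sparse matrix (multi-label). The following sketch shows what the two representations could look like, using only NumPy and SciPy. The helper looks_multi_label is hypothetical and only mirrors the type-based distinction stated in the parameter descriptions; it is not part of small-text.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse

# Single-label: one integer class per instance.
y_single = np.array([0, 2, 1, 2])

# Multi-label: sparse indicator matrix of shape (n_instances, n_classes).
y_multi = csr_matrix(np.array([
    [1, 0, 1],   # instance 0 has labels 0 and 2
    [0, 0, 0],   # instance 1 has no labels
    [0, 1, 0],   # instance 2 has label 1
    [1, 1, 0],   # instance 3 has labels 0 and 1
]))


def looks_multi_label(y):
    """Hypothetical helper: treat sparse label matrices as multi-label."""
    return issparse(y)


# Labels of a single multi-label instance are the column indices of its row.
labels_of_first = sorted(y_multi[0].indices.tolist())

print(looks_multi_label(y_single), looks_multi_label(y_multi), labels_of_first)
```

Either array could then be passed as y to the constructors or to from_arrays; according to the descriptions above, the type of y decides whether the resulting dataset is single-label or multi-label.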
PyTorch Integration
- class small_text.integrations.pytorch.datasets.PytorchTextClassificationDataset(data, multi_label=False, target_labels=None)
Dataset class for classifiers from the PyTorch Integration.
- __init__(data, multi_label=False, target_labels=None)
- Parameters:
data (list of tuples (text data [Tensor], labels [int or list of int])) – The single items constituting the dataset. For unlabeled instances, the label should be set to small_text.base.LABEL_UNLABELED in single-label datasets, and to an empty list in multi-label datasets.
multi_label (bool, default=False) – Indicates if this is a multi-label dataset.
target_labels (np.ndarray[int] or None, default=None) – This is a list of (integer) labels to be encountered within this dataset. This is important to set if your data does not contain some labels, e.g. due to dataset splits, where the labels should however be considered by entities such as the classifier. If None, the target labels will be inferred from the labels encountered in self.data.
- property x
Returns the features.
- Returns:
x
- Return type:
list of Tensor
- property data
Returns the internal list of tuples storing the data.
- Returns:
data – Internal list of tuples storing examples and labels.
- Return type:
list of tuples (text data [Tensor], label)
- property target_labels
Returns the target labels.
- Returns:
target_labels – List of target labels.
- Return type:
list of int
- to(other, non_blocking=False, copy=False)
Calls torch.Tensor.to on all Tensors in data.
- Returns:
self – The object with to having been called on all Tensors in data.
- Return type:
PytorchTextClassificationDataset
See also: torch.Tensor.to in the PyTorch documentation.
- classmethod from_arrays(texts, y, tokenizer, target_labels=None, max_length=512)
Constructs a new PytorchTextClassificationDataset from the given text and label arrays.
- Parameters:
texts (list of str or np.ndarray[str]) – List of text documents.
y (np.ndarray[int] or scipy.sparse.csr_matrix) – List of labels where each label belongs to the features of the respective row. Depending on the type of y the resulting dataset will be single-label (np.ndarray) or multi-label (scipy.sparse.csr_matrix).
tokenizer (tokenizers.Tokenizer) – A tokenizer from the tokenizers library that is used to convert each of the given text documents into tokens.
target_labels (numpy.ndarray[int] or None, default=None) – List of possible labels. Will be directly passed to the dataset constructor.
max_length (int) – Maximum sequence length.
- Returns:
dataset – A dataset constructed from the given texts and labels.
- Return type:
PytorchTextClassificationDataset
Added in version 1.1.0.
Changed in version 2.0.0.
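The note on target_labels above (labels that are missing from a split) can be illustrated with plain NumPy. This is a sketch under assumptions: the label array and the split indices are made up, and np.unique merely stands in for whatever inference the library performs when target_labels is None.

```python
import numpy as np

# Made-up labels for a full dataset with three classes (assumption).
y_all = np.array([0, 1, 2, 2, 1, 0])
all_labels = np.unique(y_all)

# A split that happens to contain no instance of class 2.
split_indices = np.array([0, 1, 4, 5])
y_split = y_all[split_indices]

# Inferring the label set from the split alone loses class 2.
inferred_from_split = np.unique(y_split)
missing = np.setdiff1d(all_labels, inferred_from_split)

print(inferred_from_split, missing)
```

In such a case, passing the full label set explicitly as target_labels when constructing the dataset keeps the missing class visible to components such as the classifier.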
Transformers Integration
- class small_text.integrations.transformers.datasets.TransformersDataset(data, multi_label=False, target_labels=None)
Dataset class for classifiers from the Transformers Integration.
- __init__(data, multi_label=False, target_labels=None)
- Parameters:
data (list of 3-tuples (text data [Tensor], mask [Tensor], labels [int or list of int])) – The single items constituting the dataset. For unlabeled instances, the label should be set to small_text.base.LABEL_UNLABELED in single-label datasets, and to an empty list in multi-label datasets.
multi_label (bool, default=False) – Indicates if this is a multi-label dataset.
target_labels (numpy.ndarray[int] or None, default=None) – This is a list of (integer) labels to be encountered within this dataset. This is important to set if your data does not contain some labels, e.g. due to dataset splits, where the labels should however be considered by entities such as the classifier. If None, the target labels will be inferred from the labels encountered in self.data.
- property x
Returns the features.
- Returns:
x
- Return type:
list of Tensor
- property target_labels
Returns the target labels.
- Returns:
target_labels – List of target labels.
- Return type:
list of int
- classmethod from_arrays(texts, y, tokenizer, target_labels=None, max_length=512)
Constructs a new TransformersDataset from the given text and label arrays.
- Parameters:
texts (list of str or np.ndarray[str]) – List of text documents.
y (np.ndarray[int] or scipy.sparse.csr_matrix) – List of labels where each label belongs to the features of the respective row. Depending on the type of y the resulting dataset will be single-label (np.ndarray) or multi-label (scipy.sparse.csr_matrix).
tokenizer (tokenizers.Tokenizer) – A tokenizer from the tokenizers library that is used to convert each of the given text documents into tokens.
target_labels (numpy.ndarray[int] or None, default=None) – List of possible labels. Will be directly passed to the dataset constructor.
max_length (int) – Maximum sequence length.
- Returns:
dataset – A dataset constructed from the given texts and labels.
- Return type:
TransformersDataset
Added in version 1.1.0.
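To make the 3-tuple layout (text data, mask, label) and the role of max_length more concrete, here is a rough sketch that uses NumPy arrays as stand-ins for tensors. Everything in it is an assumption made for illustration: the token ids are invented, pad_and_mask is a hypothetical helper, and the placeholder value for the unlabeled marker is not taken from the library (the real constant should be imported from small_text.base). It does not reproduce what from_arrays does internally.

```python
import numpy as np

# Placeholder only (assumption): use small_text.base.LABEL_UNLABELED in real code.
LABEL_UNLABELED = -1


def pad_and_mask(token_ids, max_length, pad_id=0):
    """Hypothetical helper: truncate or pad to max_length and build a 0/1 mask."""
    ids = list(token_ids)[:max_length]
    mask = [1] * len(ids)
    padding = max_length - len(ids)
    return np.array(ids + [pad_id] * padding), np.array(mask + [0] * padding)


# Invented token ids for two documents; the second one is unlabeled.
encoded = [[101, 7592, 102], [101, 2023, 2003, 1037, 2936, 102]]
labels = [1, LABEL_UNLABELED]

# Each item mirrors the documented layout: (text data, mask, label).
data = [pad_and_mask(ids, max_length=5) + (label,)
        for ids, label in zip(encoded, labels)]

for text_ids, mask, label in data:
    print(text_ids, mask, label)
```

The first document is padded up to the maximum length and the second is truncated to it, so every item has the same sequence length, which is what a fixed max_length is meant to guarantee.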