Datasets

Small-Text’s basic data structures are called Datasets and represent text data for single-label and multi-label classification. Besides features and labels, these datasets also hold meta information about the underlying data, namely the number of classes and whether the labeling is single- or multi-label.


Dataset Overview

While most other components can be combined freely, the dataset and the classifier have to match, since their underlying representations can be quite different (a short example follows the table below).

Dataset Implementation              Classifier(s)
--------------------------------    ------------------------------
SklearnDataset                      SklearnClassifier
PytorchTextClassificationDataset    KimCNNClassifier
TextDataset                         SetFitClassification
TransformersDataset                 TransformerBasedClassification
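For example, an SklearnDataset is paired with an SklearnClassifier. A minimal sketch of such a matching pair, assuming SklearnClassifier and ConfidenceEnhancedLinearSVC can be imported from small_text.classifiers and that the classifier is constructed from a wrapped scikit-learn model and the number of classes:

import numpy as np
from small_text.classifiers import ConfidenceEnhancedLinearSVC, SklearnClassifier
from small_text.data import SklearnDataset

# create exemplary features and labels randomly
x = np.random.rand(100, 30)
y = np.random.randint(0, 2, size=100)
train = SklearnDataset(x, y, target_labels=np.arange(2))

# the classifier type has to match the dataset type
clf = SklearnClassifier(ConfidenceEnhancedLinearSVC(), 2)
clf.fit(train)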

SklearnDatasets

Disregarding any integrations, small-text’s core is built around dense (numpy) and sparse (scipy) matrices, which can be easily used for active learning via SklearnDataset. This dataset is compatible with SklearnClassifier classifiers.

The form of the features and labels can vary as follows:

  • The features can either be dense or sparse.

  • The labeling can either be single- or multi-label targets.

Note

Despite all integration efforts, in the end the model in use determines which combinations of dense/sparse features and single-/multi-label targets are supported.

Sparse Features

Traditional text classification methods relied on the Bag-of-Words representation, which can be efficiently represented as a sparse matrix.

import numpy as np
from scipy.sparse import csr_matrix, random
from small_text.data import SklearnDataset

# create exemplary features and labels randomly
x = random(100, 2000, density=0.15, format='csr')
y = np.random.randint(0, 2, size=100)

dataset = SklearnDataset(x, y, target_labels=np.arange(2))
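In practice, such a sparse feature matrix is usually not random but produced by a vectorizer. A small sketch using scikit-learn’s TfidfVectorizer (the texts and labels here are made up for illustration):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from small_text.data import SklearnDataset

texts = ['the first document', 'another short document', 'yet another document']
y = np.array([0, 1, 1])

# fit_transform returns a sparse document-term matrix
x = TfidfVectorizer().fit_transform(texts)

dataset = SklearnDataset(x, y, target_labels=np.arange(2))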

Dense Features

Or similarly with dense features:

import numpy as np
from small_text.data import SklearnDataset

# create exemplary features and labels randomly
x = np.random.rand(100, 30)
y = np.random.randint(0, 2, size=100)

dataset = SklearnDataset(x, y, target_labels=np.arange(2))
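Besides the features and labels, the resulting dataset object also carries the meta information mentioned at the beginning. A short sketch, assuming SklearnDataset exposes x, y, is_multi_label, and target_labels attributes:

# continuing the example above
print(dataset.x.shape)         # (100, 30)
print(dataset.y.shape)         # (100,)
print(dataset.is_multi_label)  # False
print(dataset.target_labels)   # [0 1]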

Multi-Label

The previous two examples were single-label datasets, i.e. each instance had exactly one label assigned. For multi-label problems, you need to pass the labels as a scipy csr_matrix instead. This matrix must be a multi-label indicator matrix, i.e. a matrix of shape (num_documents, num_labels) in which every non-zero entry equals 1 and represents an assigned label.

import numpy as np
from scipy import sparse
from small_text.data import SklearnDataset

x = sparse.random(100, 2000, density=0.15, format='csr')

# create a random sparse label matrix
y = sparse.random(100, 5, density=0.5, format='csr')
# set all non-zero entries to 1, making y a label indicator matrix
y.data[:] = 1

dataset = SklearnDataset(x, y, target_labels=np.arange(5))
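If the labels are given as lists of label ids per document, such an indicator matrix can also be built with scikit-learn’s MultiLabelBinarizer. A sketch (the label lists here are made up for illustration):

import numpy as np
from scipy import sparse
from sklearn.preprocessing import MultiLabelBinarizer
from small_text.data import SklearnDataset

x = sparse.random(4, 2000, density=0.15, format='csr')

# one list of assigned label ids per document
labels = [[0, 2], [1], [], [0, 1, 4]]

# build a sparse (num_documents, num_labels) indicator matrix
y = MultiLabelBinarizer(classes=np.arange(5), sparse_output=True).fit_transform(labels)
y = sparse.csr_matrix(y)

dataset = SklearnDataset(x, y, target_labels=np.arange(5))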

Indexing and Views

Accessing a dataset object by index or range, such as dataset[selector], is called indexing, where selector can be an index (dataset[10]), a range (dataset[2:10]), or an array of indices (dataset[[1, 5, 10]]). Similar to numpy indexing, dataset indexing does not copy the selected subset but creates a view on it. DatasetView objects behave similarly to Datasets, but are read-only.

import numpy as np
from small_text.data import SklearnDataset

# create exemplary features and labels randomly
x = np.random.rand(100, 30)
y = np.random.randint(0, 2, size=100)

dataset = SklearnDataset(x, y, target_labels=np.arange(2))

# returns a DatasetView of the first ten items in x
dataset_sub = dataset[0:10]
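The other selector types mentioned above work analogously; a short continuation of the example:

# continuing the example above
single_item = dataset[10]          # single index
subset_range = dataset[2:10]       # range
subset_list = dataset[[1, 5, 10]]  # array of indices

print(len(subset_range))  # 8
print(len(subset_list))   # 3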


Copying a Dataset

While indexing creates a view rather than a copy, there are cases where you do want a real copy.

dataset_copy = dataset.clone()
print(type(dataset_copy).__name__)

Output:

SklearnDataset

This also works on DatasetView instances; however, the clone() operation dissolves the view and returns a dataset again:

dataset_view = dataset[0:5]
dataset_view_copy = dataset_view.clone()
print(type(dataset_view_copy).__name__)

Output:

SklearnDataset

Constructing an Unlabeled Dataset

Unless you are running a simulated experiment, you will need to deal with (partly or completely) unlabeled data. We show how to construct an unlabeled dataset using SklearnDataset as an example, but the concept is the same for PytorchTextClassificationDataset and TransformersDataset.

Here, we have to distinguish between the single-label and the multi-label setting. For the single-label case, there is a special label constant LABEL_UNLABELED, which indicates that an instance is unlabeled:

import numpy as np
from small_text.base import LABEL_UNLABELED
from small_text.data import SklearnDataset

x = np.random.rand(100, 30)
# a label array of size 100 where each entry means "unlabeled"
y = np.array([LABEL_UNLABELED] * 100)

dataset = SklearnDataset(x, y, target_labels=np.arange(2))

For the multi-label case, creating unlabeled datasets is trivial. The sparse label matrix works as usual, and unlabeled instances simply correspond to empty rows:

import numpy as np
from scipy import sparse
from small_text.data import SklearnDataset

num_labels = 3

x = sparse.random(100, 2000, density=0.15, format='csr')
y = sparse.csr_matrix((100, num_labels))  # <-- an empty sparse matrix

dataset = SklearnDataset(x, y, target_labels=np.arange(num_labels))

For partially labeled data, the sparse label matrix y has empty and non-empty rows.
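The indices of unlabeled instances can be recovered directly from this label structure, for example with plain numpy/scipy (this is just a sketch, not a small-text API):

import numpy as np
from scipy import sparse
from small_text.base import LABEL_UNLABELED

# single-label: unlabeled instances carry the LABEL_UNLABELED constant
y_single = np.array([0, LABEL_UNLABELED, 1, LABEL_UNLABELED])
unlabeled_indices = np.argwhere(y_single == LABEL_UNLABELED)[:, 0]
print(unlabeled_indices)  # [1 3]

# multi-label: unlabeled instances are empty rows in the indicator matrix
y_multi = sparse.csr_matrix(np.array([[1, 0, 1],
                                      [0, 0, 0],
                                      [0, 1, 0]]))
unlabeled_indices = np.argwhere(y_multi.getnnz(axis=1) == 0)[:, 0]
print(unlabeled_indices)  # [1]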


Integration Data Structures

Both the Pytorch Integration and the Transformers Integration bring their own Datasets (each subclassing Dataset), which rely on different representations and provide additional methods for handling GPU-related operations. See the respective integration’s page for more information.


Building your own Dataset implementation

In general, you can implement any data structure that your classifier can handle. Custom Datasets should work with the existing parts of the library, provided the following conditions are met:

  1. Indexing (using integers, lists, ndarray, slices) must be supported

  2. Iteration must be supported

  3. The length of the dataset (__len__) must return the number of data instances

See small_text.integrations.transformers.datasets.TransformersDataset for an example.
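For illustration, a minimal sketch of a hypothetical list-backed dataset that fulfills these three conditions (a real implementation would additionally expose features, labels, and the view/clone behavior described above):

import numpy as np

class ListDataset:
    """A minimal, hypothetical dataset backed by a list of instances."""

    def __init__(self, instances):
        self._instances = instances

    def __getitem__(self, item):
        # 1. supports indexing with integers, slices, lists, and ndarrays
        if isinstance(item, (int, np.integer)):
            return ListDataset([self._instances[item]])
        elif isinstance(item, slice):
            return ListDataset(self._instances[item])
        else:  # list or ndarray of indices
            return ListDataset([self._instances[i] for i in item])

    def __iter__(self):
        # 2. supports iteration over the instances
        return iter(self._instances)

    def __len__(self):
        # 3. returns the number of data instances
        return len(self._instances)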