Datasets
Small-Text’s basic data structures are called Datasets and represent text data for single-label and multi-label classification.
Besides features and labels, these datasets also hold meta information about the underlying data, namely the number of classes and whether the labeling is single- or multi-label.
Overview
While most other components are largely interchangeable, dataset and classifier implementations have to match, since the underlying representations can be quite different.
Dataset Overview
Dataset Implementation | Classifier(s)
---|---
SklearnDataset | SklearnClassifier
PytorchTextClassificationDataset | see the Pytorch Integration
TransformersDataset | see the Transformers Integration
SklearnDataset
Disregarding any integrations, small-text’s core is built around dense (numpy) and sparse (scipy) matrices, which can easily be used for active learning via SklearnDataset.
This dataset is compatible with SklearnClassifier classifiers.
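For illustration, the following sketch pairs an SklearnDataset with an SklearnClassifier that wraps a scikit-learn estimator. The exact constructor arguments shown here are an assumption and may differ between versions; see the classifier documentation for details.
import numpy as np
from sklearn.linear_model import LogisticRegression
from small_text.classifiers import SklearnClassifier
from small_text.data import SklearnDataset
# create exemplary features and labels randomly
x = np.random.rand(100, 30)
y = np.random.randint(0, 2, size=100)
train = SklearnDataset(x, y, target_labels=np.arange(2))
# wrap a scikit-learn estimator (constructor arguments are an assumption)
clf = SklearnClassifier(LogisticRegression(), 2)
clf.fit(train)
predictions = clf.predict(train)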
The form of the features and labels can vary as follows:
The features can either be dense or sparse.
The labeling can either be single- or multi-label targets.
Note
Despite all integration efforts, which combinations of dense/sparse features and single-/multi-label targets are supported ultimately depends on the model in use.
Sparse Features
Traditional text classification methods relied on the Bag-of-Words representation, which can be efficiently represented as a sparse matrix.
import numpy as np
from scipy.sparse import random
from small_text.data import SklearnDataset
# create exemplary features and labels randomly
x = random(100, 2000, density=0.15, format='csr')
y = np.random.randint(0, 2, size=100)
dataset = SklearnDataset(x, y, target_labels=np.arange(2))
Dense Features
Or similarly with dense features:
import numpy as np
from small_text.data import SklearnDataset
# create exemplary features and labels randomly
x = np.random.rand(100, 30)
y = np.random.randint(0, 2, size=100)
dataset = SklearnDataset(x, y, target_labels=np.arange(2))
Multi-Label
The previous two examples were single-label datasets, i.e. each instance has exactly one label assigned. For multi-label classification, you need to pass the labels as a scipy csr_matrix. This matrix must be a multi-label indicator matrix, i.e. a matrix of shape (num_documents, num_labels) in which each non-zero entry is exactly 1 and represents an assigned label.
import numpy as np
from scipy import sparse
from small_text.data import SklearnDataset
# create random sparse features
x = sparse.random(100, 2000, density=0.15, format='csr')
# create a random sparse label matrix of shape (num_documents, num_labels)
y = sparse.random(100, 5, density=0.5, format='csr')
# set all non-zero entries to 1, turning y into a multi-label indicator matrix
y.data[np.s_[:]] = 1
dataset = SklearnDataset(x, y, target_labels=np.arange(5))
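In practice, multi-label annotations often come as lists of label ids per document rather than as a ready-made matrix. As a sketch, such lists can be converted into a sparse indicator matrix using scikit-learn’s MultiLabelBinarizer (which is not part of small-text):
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MultiLabelBinarizer
from small_text.data import SklearnDataset
# exemplary raw labels: one list of label ids per document
label_lists = [[0, 2], [1], [3, 4], [2]]
x = sparse.random(len(label_lists), 2000, density=0.15, format='csr')
# build a sparse multi-label indicator matrix of shape (num_documents, num_labels)
y = MultiLabelBinarizer(classes=np.arange(5), sparse_output=True).fit_transform(label_lists)
dataset = SklearnDataset(x, y, target_labels=np.arange(5))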
Indexing and Views
Accessing a dataset object by index or range, such as dataset[selector], is called indexing, where selector can be an index (dataset[10]), a slice (dataset[2:10]), or an array of indices (dataset[[1, 5, 10]]).
Similarly to numpy indexing, dataset indexing does not create a copy of the selected subset but creates a view on it.
DatasetView objects behave similarly to Datasets, but are read-only.
import numpy as np
from small_text.data import SklearnDataset
# create exemplary features and labels randomly
x = np.random.rand(100, 30)
y = np.random.randint(0, 2, size=100)
dataset = SklearnDataset(x, y, target_labels=np.arange(2))
# returns a DatasetView of the first ten items in the dataset
dataset_sub = dataset[0:10]
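Continuing the previous snippet, and assuming that integer and array selectors behave like the slice above (i.e. they also return views), the other selector types can be used analogously:
# a single index and an array of indices also yield DatasetView objects
single_view = dataset[10]
subset_view = dataset[[1, 5, 10]]
print(type(subset_view).__name__)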
Copying a Dataset
While indexing creates a view rather than a copy, there are cases where you need an actual copy.
dataset_copy = dataset.clone()
print(type(dataset_copy).__name__)
Output:
SklearnDataset
This also works on DatasetView instances; however, the clone() operation dissolves the view and returns a full dataset again:
dataset_view = dataset[0:5]
dataset_view_copy = dataset_view.clone()
print(type(dataset_view_copy).__name__)
Output:
SklearnDataset
Constructing an Unlabeled Dataset
Unless you are doing a simulated experiment, you will need to deal with (partly or completely) unlabeled data. We show how to construct an unlabeled dataset using SklearnDataset as an example, but the concept is the same for PytorchTextClassificationDataset and TransformersDataset.
Here we must distinguish between the single-label and the multi-label setting. For the single-label case, there is a special label constant, LABEL_UNLABELED, which indicates that an instance is unlabeled:
import numpy as np
from small_text.base import LABEL_UNLABELED
from small_text.data import SklearnDataset
x = np.random.rand(100, 30)
# a label array of size 100 where each entry means "unlabeled"
y = np.array([LABEL_UNLABELED] * 100)
dataset = SklearnDataset(x, y, target_labels=np.arange(2))
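For partially labeled data in the single-label case, labeled entries and the LABEL_UNLABELED constant can simply be mixed in the label array (continuing the previous snippet; the split chosen here is arbitrary):
# the first 10 instances carry labels, the remaining 90 are marked as unlabeled
y_partial = np.array([0, 1] * 5 + [LABEL_UNLABELED] * 90)
dataset_partial = SklearnDataset(x, y_partial, target_labels=np.arange(2))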
For the multi-label case, creating unlabeled datasets is trivial. The sparse label matrix works as usual, and unlabeled instances simply correspond to empty rows:
import numpy as np
from scipy import sparse
from small_text.data import SklearnDataset
num_labels = 3
x = sparse.random(100, 2000, density=0.15, format='csr')
y = sparse.csr_matrix((100, num_labels))  # <-- an empty sparse matrix, i.e. all instances are unlabeled
dataset = SklearnDataset(x, y, target_labels=np.arange(num_labels))
For partially labeled data, the sparse label matrix y simply contains a mix of empty (unlabeled) and non-empty (labeled) rows.
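As one possible sketch, the non-zero entries of such a partially labeled matrix can be specified explicitly; the choice of which rows are labeled is arbitrary here:
import numpy as np
from scipy import sparse
from small_text.data import SklearnDataset
num_labels = 3
x = sparse.random(100, 2000, density=0.15, format='csr')
# label only the first 10 instances; rows 10 to 99 remain empty (unlabeled)
rows = np.arange(10)
cols = np.random.randint(0, num_labels, size=10)
y = sparse.csr_matrix((np.ones(10), (rows, cols)), shape=(100, num_labels))
dataset = SklearnDataset(x, y, target_labels=np.arange(num_labels))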
Integration Data Structures
Both the Pytorch Integration and the Transformers Integration bring their own datasets (each subclassing Dataset), which rely on different underlying representations and provide additional methods for handling GPU-related operations.
See the respective integration’s page for more information.
Building your own Dataset implementation
In general, you can implement any data structure that your classifier can handle. Custom datasets should work with existing parts of the library, provided the following conditions are met:
Indexing (using integers, lists, ndarrays, and slices) must be supported
Iteration must be supported
The length of the dataset (__len__) must return the number of data instances
See small_text.integrations.transformers.datasets.TransformersDataset for an example.
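For illustration only, the following hypothetical class (not part of small-text) satisfies the three conditions above; a real implementation would additionally expose features, labels, and clone() as the built-in datasets do:
import numpy as np

class CustomTextDataset:

    def __init__(self, texts, labels):
        self._texts = list(texts)
        self._labels = np.asarray(labels)

    def __getitem__(self, item):
        # indexing with integers, lists, ndarrays, and slices
        if isinstance(item, (int, np.integer)):
            item = [item]
        if isinstance(item, slice):
            return CustomTextDataset(self._texts[item], self._labels[item])
        indices = np.asarray(item)
        return CustomTextDataset([self._texts[i] for i in indices], self._labels[indices])

    def __iter__(self):
        # iteration over single-instance subsets
        return (self[i] for i in range(len(self)))

    def __len__(self):
        # number of data instances
        return len(self._texts)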