Data Structures

To make the integrated libraries and all extensions accessible in the same way, classifiers (as well as some of the more specialized query strategies) rely on dataset abstractions based on the Dataset interface.

Basic Data Structures

Dense (numpy) and sparse (scipy) matrices can easily be used within datasets via SklearnDataset, which is compatible with all SklearnClassifier classifiers.

Sparse Vectors

Traditional text classification methods relied on the Bag-of-Words representation, which can be stored efficiently as a sparse matrix.

import numpy as np
from scipy.sparse import csr_matrix, random
from small_text.data import SklearnDataset

# randomly generate sparse example features and binary labels
x = random(100, 2000, density=0.15, format='csr')
y = np.random.randint(0, 2, size=100)

dataset = SklearnDataset(x, y)

Dense Vectors

Or similarly with dense features:

import numpy as np
from small_text.data import SklearnDataset

# randomly generate dense example features and binary labels
x = np.random.rand(100, 30)
y = np.random.randint(0, 2, size=100)

dataset = SklearnDataset(x, y)
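
Once constructed, the features and labels remain accessible through the dataset's x and y properties. A quick check, continuing the dense example above:

print(dataset.x.shape)  # (100, 30)
print(dataset.y.shape)  # (100,)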

Integration Data Structures

Both the Pytorch Integration and the Transformers Integration bring their own datasets (each subclassing Dataset), which rely on different representations and provide additional methods for handling GPU-related operations.
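
For illustration, a TransformersDataset is built from tokenized text. The following is a minimal sketch, assuming the transformers integration and the Hugging Face transformers library are installed; the model name and tokenizer settings are arbitrary example choices, and the exact instance format expected may differ between versions, so consult the integration's documentation:

from transformers import AutoTokenizer
from small_text.integrations.transformers.datasets import TransformersDataset

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

texts = ['this is an example sentence', 'another example sentence']
labels = [0, 1]

# each instance is a tuple of (input_ids, attention_mask, label)
data = []
for text, label in zip(texts, labels):
    encoded = tokenizer.encode_plus(text,
                                    add_special_tokens=True,
                                    padding='max_length',
                                    max_length=60,
                                    return_attention_mask=True,
                                    return_tensors='pt',
                                    truncation='longest_first')
    data.append((encoded['input_ids'], encoded['attention_mask'], label))

dataset = TransformersDataset(data)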

Indexing and Views

Datasets can be indexed like numpy arrays:

import numpy as np
from small_text.data import SklearnDataset

# randomly generate dense example features and binary labels
x = np.random.rand(100, 30)
y = np.random.randint(0, 2, size=100)

dataset = SklearnDataset(x, y)

# returns a DatasetView of the first ten items in the dataset
dataset_sub = dataset[0:10]

As in numpy, indexing does not create a copy of the selected subset, but a view on it. DatasetView objects behave similarly to Dataset objects, but are read-only.
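
Read-only means that assignments to a view's data are rejected. A small sketch, continuing the example above (the exact exception type raised may differ between versions):

# indexing a view yields another view
nested_view = dataset_sub[[0, 2, 4]]
print(len(nested_view))  # 3

try:
    dataset_sub.y = np.random.randint(0, 2, size=10)
except Exception as e:
    print(f'write rejected: {e.__class__.__name__}')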

Further Extensions

In general, any data structure that your classifier can handle can be implemented. Custom Datasets should work with the existing parts of the library, provided the following conditions are met (see the sketch after this list):

  1. Indexing (using integers, lists, ndarrays, or slices) must be supported

  2. Iteration must be supported

  3. The length of the dataset (__len__) must return the number of data instances
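
The following is a minimal, hypothetical sketch of a class that meets these three conditions. ListDataset is not part of small_text; it merely illustrates the conditions above (the x and y properties mirror the accessors of the built-in datasets):

import numpy as np

class ListDataset:
    """A hypothetical dataset backed by a plain Python list (illustration only)."""

    def __init__(self, instances, labels):
        self._instances = instances
        self._labels = np.asarray(labels)

    @property
    def x(self):
        return self._instances

    @property
    def y(self):
        return self._labels

    def __getitem__(self, item):
        # condition 1: indexing by integer, list, ndarray, or slice
        if isinstance(item, (int, np.integer)):
            return ListDataset([self._instances[item]], self._labels[[item]])
        elif isinstance(item, slice):
            return ListDataset(self._instances[item], self._labels[item])
        indices = np.asarray(item)
        return ListDataset([self._instances[i] for i in indices],
                           self._labels[indices])

    def __iter__(self):
        # condition 2: iteration over the dataset's instances
        return (self[i] for i in range(len(self)))

    def __len__(self):
        # condition 3: the number of data instances
        return len(self._instances)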

See small_text.integrations.transformers.datasets.TransformersDataset for an example.