Classifiers

So that the active learner can work with different models, query strategies, and stopping criteria, we provide classifier abstractions that expose a unified interface.

Interface

The classifier interface is very simple and scikit-learn-like, with the difference that it operates on Dataset objects. Call the fit() method with a training set as its argument to train the classifier, and use predict() to obtain predictions.

class Classifier(ABC):
    """Abstract base class for classifiers that can be used with the active learning components.
    """

    @abstractmethod
    def fit(self, train_set: Dataset, weights: "Union[npt.NDArray[np.double], None]" = None) -> "Classifier":
        """Train the model using the given train set.

        Parameters
        ----------
        train_set : Dataset
            The dataset used for training the model.
        weights : np.ndarray[np.double] or None, default=None
            Sample weights or None.
        """
        pass

    @abstractmethod
    def predict(self,
                data_set: Dataset,
                return_proba: bool = False,
                multi_label_threshold: float = 0.5,
                **kwargs) \
            -> "Union[npt.NDArray[np.uint], Tuple[npt.NDArray[np.uint], npt.NDArray[np.double]]," \
               "csr_matrix, Tuple[csr_matrix, csr_matrix]]":
        """Predicts the labels for each sample in the given dataset.

        Parameters
        ----------
        data_set : Dataset
            A dataset for which the labels are to be predicted.
        return_proba : bool, default=False
            If `True`, also returns a probability-like class distribution.
        multi_label_threshold : float, default=0.5
            In multi-label classification, a label is predicted for a sample only if the respective probability value
            is greater than `multi_label_threshold`. Must be between 0.0 and 1.0. Ignored when `multi_label` is False.
        """
        pass

    @abstractmethod
    def predict_proba(self,
                      data_set: Dataset,
                      multi_label_threshold: float = 0.5,
                      **kwargs) -> "npt.NDArray[np.double]":
        """Predicts the label distribution for each sample in the given dataset.

        Parameters
        ----------
        data_set : Dataset
            A dataset for which the labels are to be predicted.
        multi_label_threshold : float, default=0.5
            In multi-label classification, a label is predicted for a sample only if the respective probability value
            is greater than `multi_label_threshold`. Must be between 0.0 and 1.0. Ignored when `multi_label` is False.
        """
        pass
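
The thresholding rule described for `multi_label_threshold` can be illustrated in isolation. The following is a minimal sketch in plain Python, not the library's implementation: the helper name `apply_multi_label_threshold` and the list-based probability matrix are assumptions made for illustration only.

```python
from typing import List


def apply_multi_label_threshold(proba: List[List[float]],
                                threshold: float = 0.5) -> List[List[int]]:
    """Hypothetical helper: per sample, return the indices of all labels
    whose probability is strictly greater than the threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError('threshold must be between 0.0 and 1.0')
    return [[label for label, p in enumerate(row) if p > threshold]
            for row in proba]


# one row per sample, one column per label
proba = [
    [0.9, 0.2, 0.6],
    [0.1, 0.5, 0.3],
    [0.4, 0.8, 0.7],
]
predicted = apply_multi_label_threshold(proba)
print(predicted)  # [[0, 2], [], [1, 2]]
```

Note that the second sample receives no label at all, since a probability of exactly 0.5 is not greater than the threshold.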

Example

This is a simple example which shows how to train a classifier on a tiny toy dataset.

import numpy as np
from small_text.classifiers import ConfidenceEnhancedLinearSVC, SklearnClassifier
from small_text.data import SklearnDataset

# this is a linear SVM classifier which has been extended to return confidence estimates
model = ConfidenceEnhancedLinearSVC()
num_classes = 2
clf = SklearnClassifier(model, num_classes)

x = np.array([
    [0, 0],
    [0, 0.5],
    [0.5, 1],
    [1, 1]
])
y = np.array([0, 0, 1, 1])
train_set = SklearnDataset(x, y)
clf.fit(train_set)

# Generate predictions on the train set
# (Only for the purpose of demonstration;
#  usually you would be more interested in obtaining predictions on new, unseen data.)
y_train_pred = clf.predict(train_set)
print(y_train_pred)

Output:

[0 0 1 1]
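
To show the shape of the interface independently of any particular model, here is a minimal sketch that uses stand-in classes instead of the library's own. `ToyDataset` and `MajorityClassifier` are hypothetical names introduced only for this illustration; the sketch assumes that `fit()` returns the classifier itself and that `predict()` with `return_proba=True` returns a tuple of labels and a class distribution, as the signatures above suggest.

```python
class ToyDataset:
    """Hypothetical stand-in for a Dataset: holds features x and labels y."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


class MajorityClassifier:
    """Hypothetical classifier that predicts the (weighted) label distribution
    seen during training for every sample."""

    def __init__(self, num_classes):
        self.num_classes = num_classes
        self.proba_ = None

    def fit(self, train_set, weights=None):
        # without weights, every sample counts equally
        if weights is None:
            weights = [1.0] * len(train_set.y)
        totals = [0.0] * self.num_classes
        for label, weight in zip(train_set.y, weights):
            totals[label] += weight
        total = sum(totals)
        self.proba_ = [t / total for t in totals]
        return self

    def predict_proba(self, data_set):
        # the same distribution for every sample
        return [list(self.proba_) for _ in data_set.x]

    def predict(self, data_set, return_proba=False):
        proba = self.predict_proba(data_set)
        labels = [max(range(self.num_classes), key=row.__getitem__)
                  for row in proba]
        return (labels, proba) if return_proba else labels


train_set = ToyDataset([[0, 0], [0, 0.5], [1, 1]], [0, 0, 1])
clf = MajorityClassifier(num_classes=2).fit(train_set)

labels, proba = clf.predict(train_set, return_proba=True)
print(labels)  # [0, 0, 0]
```

Passing sample weights changes the outcome: with `weights=[1, 1, 5]`, the single sample of class 1 outweighs the two samples of class 0, so this toy classifier would predict class 1 instead.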

Factories

To configure the active learner with a classifier, a factory object is required, because a new classifier object is created at each iteration (unless explicitly configured otherwise). Creating a new instance requires knowing what to pass to the constructor, and the factory holds exactly this knowledge. If all constructors took zero arguments, factories would not be needed here.

from small_text.classifiers import ConfidenceEnhancedLinearSVC
from small_text.classifiers.factories import SklearnClassifierFactory

clf_template = ConfidenceEnhancedLinearSVC()
num_classes = 2

clf_factory = SklearnClassifierFactory(clf_template, num_classes)
clf = clf_factory.new()

This also means that any classifier parameters, e.g. for multi-label classification, are managed by the factory:

from small_text.classifiers import ConfidenceEnhancedLinearSVC
from small_text.classifiers.factories import SklearnClassifierFactory

clf_template = ConfidenceEnhancedLinearSVC()
num_classes = 2
classifier_kwargs = {'multi_label': True}

clf_factory = SklearnClassifierFactory(clf_template, num_classes, kwargs=classifier_kwargs)
clf = clf_factory.new()
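
The idea behind the factories can also be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's code: `ToyClassifier` and `ToyClassifierFactory` are hypothetical names, and the sketch only mirrors the `new()` method and the `kwargs` argument used above.

```python
class ToyClassifier:
    """Hypothetical classifier whose constructor requires arguments."""

    def __init__(self, num_classes, multi_label=False):
        self.num_classes = num_classes
        self.multi_label = multi_label


class ToyClassifierFactory:
    """Hypothetical factory: remembers the constructor arguments so that
    fresh classifier instances can be created on demand."""

    def __init__(self, num_classes, kwargs=None):
        self.num_classes = num_classes
        # copy to avoid sharing a mutable dict with the caller
        self.kwargs = dict(kwargs) if kwargs is not None else {}

    def new(self):
        return ToyClassifier(self.num_classes, **self.kwargs)


factory = ToyClassifierFactory(2, kwargs={'multi_label': True})

# each call yields a new, independently configured instance
first = factory.new()
second = factory.new()
print(first is second)  # False
print(second.multi_label)  # True
```

Because the caller only needs `new()`, code that re-creates the classifier at each iteration does not have to know which constructor arguments a specific classifier requires.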