Active Learning

Active learning aims to create training data for classification algorithms efficiently, for cases in which a large amount of unlabeled data is available but labels are not. Labeling such data is usually time-consuming and expensive. To avoid having to label the full dataset, active learning selectively chooses data points that are expected to improve the model. This is done iteratively, in a process that alternates between an algorithm that selects data to label and a human annotator who assigns the true labels to the selected samples. The goal is to maximize the quality of the model while keeping the annotation effort to a minimum. A comprehensive introduction to active learning can be found in (Settles, 2010) [Set10].
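To make the alternation between selecting and labeling concrete, here is a minimal, library-independent sketch of a pool-based loop. Everything in it is an assumption made for illustration: the toy one-dimensional threshold classifier, the function names `train`, `query`, `oracle`, and `active_learning_loop`, and the oracle's ground truth of 0.6. It is not the library's API.

```python
import random

def train(labeled):
    """Fit a toy 1D threshold classifier: midpoint between the class means."""
    zeros = [x for x, y in labeled if y == 0]
    ones = [x for x, y in labeled if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2

def query(threshold, pool, n):
    """Select the n pool points closest to the current decision threshold."""
    return sorted(pool, key=lambda x: abs(x - threshold))[:n]

def oracle(x):
    """Stand-in for the human annotator (hypothetical ground truth)."""
    return int(x >= 0.6)

def active_learning_loop(pool, labeled, iterations=5, batch_size=2):
    pool, labeled = list(pool), list(labeled)
    for _ in range(iterations):
        threshold = train(labeled)                 # 1. train on current labels
        batch = query(threshold, pool, batch_size) # 2. algorithm selects data
        for x in batch:
            labeled.append((x, oracle(x)))         # 3. annotator assigns labels
            pool.remove(x)
    return train(labeled), labeled

rng = random.Random(0)
pool = [rng.random() for _ in range(200)]
initial = [(0.0, 0), (1.0, 1)]  # one manually labeled instance per class
threshold, labeled = active_learning_loop(pool, initial)
```

In this sketch only 10 of the 200 pool points are ever labeled; the rest stay unlabeled.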

Components

An active learning process encompasses several, usually interchangeable, components: an initialization strategy, a query strategy, and (optionally) a stopping criterion.

You can mix and match these around the PoolBasedActiveLearner, which results in a full active learning setup in just a few lines of code. Some combinations are conceptually incompatible, e.g. a gradient-based query strategy requires a classifier that provides gradients, but in general the library does not impose any restrictions.

Initialization Strategies

While there are exceptions, in many cases you will already need an initial model to apply a query strategy. This may sound like a contradiction, since you are using active learning to create a model in the first place, but usually a weak model using very few training instances suffices.

In a practical setting, this can be solved by manually labeling some instances for each class.
In an experimental setting, we simulate the choice of the initial samples.

For the latter case, we use initialization strategies, which select an initial set of documents. They are simply sampling methods that take the label distribution into account.
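As an illustration, one possible sampling method of this kind is class-balanced sampling, which draws the same number of instances from every class. The sketch below is a simplified example under that assumption; the function name `balanced_initialization` and its parameters are hypothetical and do not refer to the library's own functions. It relies on the true labels being known, which holds only in the simulated (experimental) setting.

```python
import random
from collections import defaultdict

def balanced_initialization(labels, n_per_class, seed=0):
    """Return sorted indices with up to n_per_class samples from each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    selected = []
    for label in sorted(by_class):
        indices = by_class[label]
        # Guard against classes with fewer than n_per_class instances.
        selected.extend(rng.sample(indices, min(n_per_class, len(indices))))
    return sorted(selected)

labels = [0] * 90 + [1] * 10  # imbalanced toy labels
initial_indices = balanced_initialization(labels, n_per_class=5)
```

Purely random sampling on such an imbalanced dataset could easily return no instance of the minority class, which is why the label distribution is taken into account.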

Query Strategies

Query strategies decide which instances from the pool of unlabeled data will be labeled next. They are the most critical component, as they influence both effectiveness and efficiency. Moreover, they exist in many different forms, which can yield different results and varying runtimes. If you are not sure which one to choose: uncertainty-based query strategies [LG94] have been shown to be a strong (and conceptually simple) baseline for both traditional and modern [SNP22] classification models.
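To show what an uncertainty-based strategy does, the following sketch implements least-confidence sampling, one common uncertainty measure, in plain Python. The function names and the toy probabilities are assumptions made for this example; the class probabilities would normally come from the current classifier.

```python
def least_confidence(probabilities):
    """Uncertainty score: 1 minus the probability of the most likely class."""
    return 1.0 - max(probabilities)

def query_most_uncertain(pool_probabilities, n):
    """Return the indices of the n pool instances with the highest uncertainty."""
    scores = [least_confidence(p) for p in pool_probabilities]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]

# Hypothetical predicted class probabilities for four unlabeled instances.
pool_probabilities = [
    [0.95, 0.05],
    [0.55, 0.45],
    [0.70, 0.30],
    [0.51, 0.49],
]
queried = query_most_uncertain(pool_probabilities, n=2)  # -> [3, 1]
```

The instances at indices 3 and 1 are selected because their predictions are closest to a coin flip, so a label for them is assumed to be the most informative.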

Stopping Criteria

How often do we need to query the dataset? Stopping criteria indicate whether the active learning process should be stopped or continued.
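As a rough illustration, one simple idea is to stop once the model's predictions on a fixed set of instances barely change between two consecutive iterations. The sketch below assumes this stability-based idea; the class name `PredictionStabilityCriterion`, its `stop` method, and the threshold value are hypothetical and not taken from the library.

```python
class PredictionStabilityCriterion:
    """Signal a stop once predictions barely change between iterations."""

    def __init__(self, max_changed_fraction=0.05):
        self.max_changed_fraction = max_changed_fraction
        self.previous = None

    def stop(self, predictions):
        """Return True if the process should stop, given current predictions."""
        previous, self.previous = self.previous, list(predictions)
        if previous is None:
            return False  # first iteration: nothing to compare against yet
        changed = sum(a != b for a, b in zip(previous, predictions))
        return changed / len(predictions) <= self.max_changed_fraction

criterion = PredictionStabilityCriterion(max_changed_fraction=0.1)
history = [
    [0, 0, 1, 1, 0, 1, 0, 1, 1, 0],
    [0, 1, 1, 1, 0, 1, 1, 1, 1, 0],  # 2 of 10 predictions changed
    [0, 1, 1, 1, 0, 1, 1, 1, 0, 0],  # 1 of 10 predictions changed
]
decisions = [criterion.stop(p) for p in history]  # -> [False, False, True]
```

The criterion is called once per iteration with the predictions on the same instances, and the loop ends as soon as it returns True.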