Active learning aims at creating training data for classification algorithms in a very efficient manner, for cases in which a large amount of unlabeled data is available but labels are not. Labeling such data is usually time-consuming and expensive. To avoid having to label the entire dataset, active learning selectively chooses data points that are assumed to improve the model. This is done iteratively, in a process that alternates between an algorithm selecting data to label, and a human annotator who assigns the true labels to given samples. The goal here is to maximize the quality of the model while keeping the annotation efforts at a minimum. A comprehensive introduction to active learning can be found in (Settles, 2010) [Set10].
You can mix and match these components around the PoolBasedActiveLearner, which results in a full active learning setup in just a few lines of code. In some cases, however, components may be conceptually incompatible (for example, a gradient-based query strategy requires a classifier that has gradients), but in general the library does not impose any restrictions.
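To illustrate the idea of interchangeable components, here is a minimal, library-independent sketch of a pool-based loop in plain Python. All names (train_threshold, query_most_uncertain, query_random, active_learning_loop) are hypothetical and are not the library's API; the toy classifier assumes one-dimensional data with two classes, where class 1 has the larger values and both classes are present in the labeled set.

```python
import random


def train_threshold(xs, ys):
    """Toy 1-D classifier (assumption): threshold halfway between the class means."""
    zeros = [x for x, y in zip(xs, ys) if y == 0]
    ones = [x for x, y in zip(xs, ys) if y == 1]
    midpoint = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    # Returns a scoring function: a clipped, linear pseudo-probability of class 1.
    return lambda x: min(1.0, max(0.0, 0.5 + (x - midpoint)))


def query_most_uncertain(predict, pool, candidates, n):
    """Select the n candidates whose score is closest to 0.5."""
    return sorted(candidates, key=lambda i: abs(predict(pool[i]) - 0.5))[:n]


def query_random(predict, pool, candidates, n):
    """Baseline: ignore the model and sample uniformly at random."""
    return random.sample(candidates, n)


def active_learning_loop(pool, oracle, initial, train, query, iterations, batch_size):
    """Alternate between training, querying, and labeling."""
    labeled = list(initial)
    labels = {i: oracle(pool[i]) for i in labeled}
    for _ in range(iterations):
        predict = train([pool[i] for i in labeled], [labels[i] for i in labeled])
        candidates = [i for i in range(len(pool)) if i not in labels]
        for i in query(predict, pool, candidates, batch_size):
            labels[i] = oracle(pool[i])  # the oracle stands in for the annotator
            labeled.append(i)
    final_model = train([pool[i] for i in labeled], [labels[i] for i in labeled])
    return labeled, final_model


pool = [i / 10 for i in range(11)]
oracle = lambda x: int(x > 0.5)  # simulated annotator
labeled, model = active_learning_loop(
    pool, oracle, initial=[0, 10], train=train_threshold,
    query=query_most_uncertain, iterations=2, batch_size=1,
)
```

Swapping query_most_uncertain for query_random changes the selection behavior without touching the rest of the loop, which is the kind of mix-and-match the text describes.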
While there are exceptions, in many cases you will already need an initial model to apply a query strategy. This may sound like a contradiction, since you are using active learning to create a model in the first place, but usually a weak model trained on very few instances suffices.
In a practical setting, this can be solved by manually labeling a few instances for each class.
In an experimental setting, we simulate the choice of the initial samples.
For the latter case, we use initialization strategies, which select an initial set of documents. They are simply sampling methods that take the label distribution into account.
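As an illustration of such a sampling method, the following sketch draws an initial set in which every class is represented as evenly as possible. It assumes a simulated experiment in which the true labels are known in advance; the function name balanced_initial_indices and its parameters are hypothetical and not taken from the library.

```python
import random
from collections import defaultdict


def balanced_initial_indices(labels, n_samples, seed=0):
    """Return n_samples indices, cycling over the classes in round-robin order."""
    rng = random.Random(seed)

    # Group instance indices by their (known) label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    classes = sorted(by_class)
    for label in classes:
        rng.shuffle(by_class[label])

    selected = []
    position = 0
    while len(selected) < n_samples:
        progressed = False
        for label in classes:
            if position < len(by_class[label]) and len(selected) < n_samples:
                selected.append(by_class[label][position])
                progressed = True
        if not progressed:
            raise ValueError("n_samples exceeds the number of available instances")
        position += 1
    return selected


# A skewed pool: 90 instances of class 0 and 10 of class 1.
labels = [0] * 90 + [1] * 10
initial = balanced_initial_indices(labels, n_samples=10)
```

With plain random sampling, the minority class could easily be missing from such a small initial set; cycling over the classes guarantees that each one is represented as long as it has enough instances.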
Query strategies decide which instances from the pool of unlabeled data will be labeled next. They are the most critical component, as they influence both the effectiveness and the efficiency of the process. Moreover, they exist in many different forms, which can yield different results and varying runtimes. In case you are not sure which one to choose: uncertainty-based query strategies [LG94] have been shown to be a strong (and conceptually simple) baseline for both traditional and modern [SNP22] classification models.
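One common uncertainty-based variant is least confidence, which selects the instances whose most likely class has the lowest predicted probability. The sketch below assumes the classifier provides one probability vector per unlabeled instance; the function name least_confidence_query is hypothetical and not part of the library.

```python
def least_confidence_query(probabilities, n):
    """Return the indices of the n instances with the lowest top-class probability.

    probabilities: one list of class probabilities per unlabeled instance.
    """
    confidence = [max(p) for p in probabilities]
    ranked = sorted(range(len(probabilities)), key=lambda i: confidence[i])
    return ranked[:n]


probabilities = [
    [0.90, 0.10],  # confident prediction
    [0.55, 0.45],  # close to the decision boundary
    [0.60, 0.40],
    [0.99, 0.01],
]
queried = least_confidence_query(probabilities, n=2)
```

Here the two instances nearest to the decision boundary (indices 1 and 2) are selected for annotation.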
How often do we need to query the dataset? Stopping criteria indicate whether the active learning process should be stopped or not.
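As one possible illustration, the sketch below stops once the model's predictions on a fixed set of instances have stabilized over several consecutive iterations. This is a simplified heuristic chosen for the example, not a criterion taken from the library; the class name StablePredictionsStop and its parameters are hypothetical.

```python
class StablePredictionsStop:
    """Signal a stop when predictions barely change between iterations."""

    def __init__(self, min_agreement=0.99, patience=2):
        self.min_agreement = min_agreement  # required fraction of unchanged predictions
        self.patience = patience            # consecutive stable iterations required
        self._previous = None
        self._stable_iterations = 0

    def stop(self, predictions):
        """predictions: predicted labels on the same fixed set in every iteration."""
        if self._previous is not None:
            same = sum(a == b for a, b in zip(self._previous, predictions))
            agreement = same / len(predictions)
            if agreement >= self.min_agreement:
                self._stable_iterations += 1
            else:
                self._stable_iterations = 0
        self._previous = list(predictions)
        return self._stable_iterations >= self.patience


criterion = StablePredictionsStop(min_agreement=0.9, patience=2)
for predictions in ([0, 1, 0, 1], [0, 1, 1, 1], [0, 1, 1, 1], [0, 1, 1, 1]):
    if criterion.stop(predictions):
        break
```

The patience parameter guards against stopping after a single iteration that happens to leave the predictions unchanged.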