Initialization
Initialization (sampling) strategies provide the initial labeled set from which the first classifier is trained. Some of them require knowledge of the true labels and are therefore intended only for experimental purposes.
In an application setting, you must provide an initial set of labels yourself (or use a cold start approach, which is not yet supported).
Initialization Strategies
For the single-label scenario:
- random_initialization_balanced
For single-label and multi-label scenarios:
- random_initialization
- random_initialization_stratified
Methods
- small_text.initialization.strategies.random_initialization(x, n_samples=10)
Randomly draws a subset from the given dataset.
- Parameters
x (Dataset) – A dataset.
n_samples (int, default=10) – Number of samples to draw.
- Returns
indices – Indices relative to x.
- Return type
np.ndarray[int]
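The following is a minimal NumPy-only sketch of what a uniform random draw of this kind looks like. It is not the library's implementation: the helper name draw_random_indices, the seed parameter, and the assumption that indices are drawn without replacement are all illustrative.

```python
import numpy as np


def draw_random_indices(n_total, n_samples=10, seed=None):
    """Hypothetical helper: draw n_samples distinct indices from range(n_total)."""
    if n_samples > n_total:
        raise ValueError("n_samples must not exceed the dataset size")
    rng = np.random.default_rng(seed)
    # Assumption: sampling is done without replacement, so no index repeats.
    return rng.choice(n_total, size=n_samples, replace=False)


# Stand-in for a dataset with 100 instances.
indices = draw_random_indices(100, n_samples=10, seed=42)
```

The returned indices are positions in the dataset, which matches the "Indices relative to x" contract described above.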
- small_text.initialization.strategies.random_initialization_stratified(y, n_samples=10, multilabel_strategy='labelsets')
Randomly draws a subset stratified by class labels.
- Parameters
y (np.ndarray[int] or csr_matrix) – Labels to be used for stratification.
n_samples (int, default=10) – Number of samples to draw.
multilabel_strategy ({'labelsets'}, default='labelsets') – The multi-label strategy to be used in case of a multi-label labeling. This is only used if y is of type csr_matrix.
- Returns
indices – Indices relative to y.
- Return type
np.ndarray[int]
See also
small_text.data.sampling.multilabel_stratified_subsets_sampling
Details on the labelsets multi-label strategy.
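To illustrate what "stratified by class labels" means in the single-label case, here is a rough NumPy-only sketch. It is an assumption-laden illustration, not the library's code: the helper name stratified_indices, the seed parameter, and the largest-remainder rounding rule are hypothetical choices, and the multi-label labelsets strategy is not covered.

```python
import numpy as np


def stratified_indices(y, n_samples=10, seed=None):
    """Hypothetical helper: draw indices so class proportions mirror those in y."""
    y = np.asarray(y)
    if n_samples > len(y):
        raise ValueError("n_samples must not exceed the number of labels")
    rng = np.random.default_rng(seed)

    classes, counts = np.unique(y, return_counts=True)
    # Ideal (fractional) number of samples per class.
    exact = n_samples * counts / counts.sum()
    quota = np.floor(exact).astype(int)
    # Assumption: leftover slots go to the classes with the largest remainder.
    leftover = n_samples - quota.sum()
    order = np.argsort(-(exact - quota), kind="stable")
    quota[order[:leftover]] += 1

    # Draw each class's quota without replacement, then shuffle the result.
    picked = [
        rng.choice(np.flatnonzero(y == c), size=q, replace=False)
        for c, q in zip(classes, quota)
    ]
    return rng.permutation(np.concatenate(picked))


# Imbalanced toy labels: 60% class 0, 30% class 1, 10% class 2.
y = np.array([0] * 60 + [1] * 30 + [2] * 10)
indices = stratified_indices(y, n_samples=10, seed=0)
```

With these toy labels, the drawn subset keeps the 6:3:1 class ratio of the full label array.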
- small_text.initialization.strategies.random_initialization_balanced(y, n_samples=10)
Randomly draws a subset which is (approximately) balanced in the distribution of its class labels.
- Parameters
y (np.ndarray[int] or csr_matrix) – Labels to be used for balanced sampling.
n_samples (int, default=10) – Number of samples to draw.
- Returns
indices – Indices relative to y.
- Return type
np.ndarray[int]
Notes
This is only applicable to single-label classification.
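For contrast with stratified sampling, the sketch below shows one possible way to obtain an (approximately) balanced single-label subset using only NumPy. It is not the library's implementation: the helper name balanced_indices, the seed parameter, and the round-robin selection scheme are assumptions made for illustration.

```python
import numpy as np


def balanced_indices(y, n_samples=10, seed=None):
    """Hypothetical helper: draw indices with roughly equal counts per class."""
    y = np.asarray(y)
    if n_samples > len(y):
        raise ValueError("n_samples must not exceed the number of labels")
    rng = np.random.default_rng(seed)

    # One randomly shuffled pool of indices per class.
    pools = [list(rng.permutation(np.flatnonzero(y == c))) for c in np.unique(y)]

    picked = []
    # Assumption: take one index per class in turn; exhausted classes are
    # skipped, which is why the result is only approximately balanced.
    while len(picked) < n_samples:
        for pool in pools:
            if pool and len(picked) < n_samples:
                picked.append(pool.pop())
    return np.array(picked, dtype=int)


# Heavily imbalanced toy labels: 90 instances of class 0, 10 of class 1.
y = np.array([0] * 90 + [1] * 10)
indices = balanced_indices(y, n_samples=10, seed=0)
```

Here the subset contains five instances of each class even though class 1 makes up only 10% of the labels, whereas a stratified draw of the same size would contain roughly one.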