Initialization

Initialization (sampling) strategies provide the initial labeling from which the first classifier is created. Some of them require knowledge of the true labels and are therefore intended for experimental purposes only.

In an application setting, you must instead provide an initial set of labels (or use a cold start approach, which is not yet supported).

Initialization Strategies

For the single-label scenario:
  • random_initialization_balanced

For single-label and multi-label scenarios:
  • random_initialization
  • random_initialization_stratified

Methods

small_text.initialization.strategies.random_initialization(x, n_samples=10)

Randomly draws a subset from the given dataset.

Parameters
  • x (Dataset) – A dataset.

  • n_samples (int, default=10) – Number of samples to draw.

Returns

indices – Indices relative to x.

Return type

np.ndarray[int]
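The idea of drawing a random subset of indices can be sketched with NumPy alone. This is a minimal illustration, not the library's implementation: the helper name sketch_random_indices and its seed parameter are hypothetical, and the dataset is represented only by its size.

```python
import numpy as np


def sketch_random_indices(n_total, n_samples=10, seed=None):
    """Hypothetical helper: draw n_samples distinct indices from range(n_total)."""
    rng = np.random.default_rng(seed)
    # replace=False guarantees that no index is drawn twice.
    return rng.choice(n_total, size=n_samples, replace=False)


# Assumption: a dataset with 100 instances.
indices = sketch_random_indices(100, n_samples=10, seed=0)
```

The returned indices are relative to the dataset they were drawn from, so they can be used directly to select the initially labeled instances.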

small_text.initialization.strategies.random_initialization_stratified(y, n_samples=10, multilabel_strategy='labelsets')

Randomly draws a subset stratified by class labels.

Parameters
  • y (np.ndarray[int] or csr_matrix) – Labels to be used for stratification.

  • n_samples (int, default=10) – Number of samples to draw.

  • multilabel_strategy ({'labelsets'}, default='labelsets') – The multi-label strategy to be used in case of a multi-label labeling. This is only used if y is of type csr_matrix.

Returns

indices – Indices relative to y.

Return type

np.ndarray[int]

See also

small_text.data.sampling.multilabel_stratified_subsets_sampling

Details on the labelsets multi-label strategy.
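For the single-label case, stratification can be illustrated as follows. This is a sketch under stated assumptions rather than the library's code: the helper sketch_stratified_indices is hypothetical, it allocates samples to classes proportionally using a largest-remainder rule (the library may round differently), and it does not cover the multi-label labelsets strategy.

```python
import numpy as np


def sketch_stratified_indices(y, n_samples=10, seed=None):
    """Hypothetical helper: draw indices so that class proportions follow y."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    total = counts.sum()
    # Integer arithmetic avoids floating-point rounding surprises.
    alloc = (counts * n_samples) // total
    remainders = (counts * n_samples) % total
    # Hand the leftover samples to the classes with the largest remainders.
    leftover = n_samples - alloc.sum()
    order = np.argsort(-remainders, kind="stable")
    alloc[order[:leftover]] += 1
    parts = [
        rng.choice(np.flatnonzero(y == c), size=k, replace=False)
        for c, k in zip(classes, alloc)
    ]
    return np.concatenate(parts)


# Assumption: 70 instances of class 0 and 30 instances of class 1.
y = np.array([0] * 70 + [1] * 30)
stratified = sketch_stratified_indices(y, n_samples=10, seed=0)
```

With this allocation rule, the 70/30 split in y yields seven indices of class 0 and three of class 1.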

small_text.initialization.strategies.random_initialization_balanced(y, n_samples=10)

Randomly draws a subset which is (approximately) balanced in the distribution of its class labels.

Parameters
  • y (np.ndarray[int] or csr_matrix) – Labels to be used for balanced sampling.

  • n_samples (int, default=10) – Number of samples to draw.

Returns

indices – Indices relative to y.

Return type

np.ndarray[int]

Notes

This is only applicable to single-label classification.
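To show what "(approximately) balanced" can mean, here is a minimal single-label sketch. The helper sketch_balanced_indices is hypothetical and not the library's implementation; it assumes that every class has enough instances to fill its share, and it gives the samples left over after an even split to randomly chosen classes.

```python
import numpy as np


def sketch_balanced_indices(y, n_samples=10, seed=None):
    """Hypothetical helper: draw roughly the same number of indices per class."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    # Even split, plus one extra sample for `extra` randomly chosen classes.
    base, extra = divmod(n_samples, len(classes))
    alloc = np.full(len(classes), base)
    alloc[rng.choice(len(classes), size=extra, replace=False)] += 1
    parts = [
        rng.choice(np.flatnonzero(y == c), size=k, replace=False)
        for c, k in zip(classes, alloc)
    ]
    return np.concatenate(parts)


# Assumption: a heavily imbalanced label array with three classes.
y_imbalanced = np.array([0] * 90 + [1] * 5 + [2] * 5)
balanced = sketch_balanced_indices(y_imbalanced, n_samples=10, seed=0)
```

Although class 0 dominates y_imbalanced, each class contributes three or four indices, because ten samples cannot be split evenly across three classes.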