Semi-supervised Learning Data Selection Strategies

In this section, we consider different data selection strategies geared towards efficient and robust learning in standard semi-supervised learning setting.

Data Selection Strategy - Base Class

class cords.selectionstrategies.SSL.dataselectionstrategy.DataSelectionStrategy(trainloader, valloader, model, tea_model, ssl_alg, num_classes, linear_layer, loss, device, logger)[source]

Bases: object

Implementation of Data Selection Strategy class which serves as base class for other dataselectionstrategies for semi-supervised learning frameworks. :param trainloader: Loading the training data using pytorch dataloader :type trainloader: class :param valloader: Loading the validation data using pytorch dataloader :type valloader: class :param model: Model architecture used for training :type model: class :param tea_model: Teacher model architecture used for training :type tea_model: class :param ssl_alg: SSL algorithm class :type ssl_alg: class :param num_classes: Number of target classes in the dataset :type num_classes: int :param linear_layer: If True, we use the last fc layer weights and biases gradients

If False, we use the last fc layer biases gradients

Parameters

loss (class) – Consistency loss function for unlabeled data with no reduction
device (str) – The device being utilized - cpu | cuda
logger (class) – logger file for printing the info

compute_gradients(valid=False, perBatch=False, perClass=False, store_t=False)[source]

Computes the gradient of each element.

Here, the gradients are computed in a closed form using CrossEntropyLoss with reduction set to ‘none’. This is done by calculating the gradients in last layer through addition of softmax layer.

Using different loss functions, the way we calculate the gradients will change.

For LogisticLoss we measure the Mean Absolute Error(MAE) between the pairs of observations. With reduction set to ‘none’, the loss is formulated as:

\[\ell(x, y) = L = \{l_1,\dots,l_N\}^\top, \quad l_n = \left| x_n - y_n \right|,\]

where $N$ is the batch size.

For MSELoss, we measure the Mean Square Error(MSE) between the pairs of observations. With reduction set to ‘none’, the loss is formulated as:

\[\ell(x, y) = L = \{l_1,\dots,l_N\}^\top, \quad l_n = \left( x_n - y_n \right)^2,\]

where $N$ is the batch size. :param valid: if True, the function also computes the validation gradients :type valid: bool :param batch: if True, the function computes the gradients of each mini-batch :type batch: bool :param perClass: if True, the function computes the gradients using perclass dataloaders :type perClass: bool :param store_t: if True, the function stores the hypothesized weak augmentation targets and masks for unlabeled set. :type store_t: bool

get_labels(valid=False)[source]

Function that iterates over labeled or unlabeled data and returns target or hypothesized labels.

Parameters: valid (bool) – If True, iterate over the labeled set

select(budget, model_params, tea_model_params)[source]: Abstract select function that is overloaded by the child classes

ssl_loss(ul_weak_data, ul_strong_data, labels=False)[source]

Function that computes contrastive semi-supervised loss

Parameters

ul_weak_data – Weak agumented version of unlabeled data
ul_strong_data – Strong agumented version of unlabeled data
labels (bool) – if labels, just return hypothesized labels of the unlabeled data

Returns

L_consistency (Consistency loss)
y (Actual labels(Not used anywhere))
l1_strong (Penultimate layer outputs for strongly augmented version of unlabeled data)
targets (Hypothesized labels)
mask (mask vector of the unlabeled data)

update_model(model_params, tea_model_params)[source]

Update the models parameters

Parameters

model_params (OrderedDict) – Python dictionary object containing model’s parameters
tea_model_params (OrderedDict) – Python dictionary object containing teacher model’s parameters

RETRIEVE Strategy 1 

class cords.selectionstrategies.SSL.retrievestrategy.RETRIEVEStrategy(trainloader, valloader, model, tea_model, ssl_alg, loss, eta, device, num_classes, linear_layer, selection_type, greedy, logger, r=15, valid=True)[source]

Bases: cords.selectionstrategies.SSL.dataselectionstrategy.DataSelectionStrategy

Implementation of RETRIEVE Strategy from the paper 1 for efficient and robust semi-supervised learning frameworks. RETRIEVE method tries to solve the bi-level optimization problem given below:

\[\overbrace{\mathcal{S}_{t} = \underset{\mathcal{S} \subseteq \mathcal{U}:|\mathcal{S}| \leq k}{\operatorname{argmin\hspace{0.7mm}}}L_S\Big(\mathcal{D}, \underbrace{\underset{\theta}{\operatorname{argmin\hspace{0.7mm}}}\big(L_S(\mathcal{D}, \theta_t) + \lambda_t \underset{j \in \mathcal{S}}{\sum} \mathbf{m}_{jt} l_u(x_j, \theta_t) \big)}_{inner-level}\Big)}^{outer-level}\]

Notation: Denote :math: mathcal{D} = {x_i, y_i}_{i=1}^n to be the labeled set with :math: n labeled data points, and :math: mathcal{U} = {x_j}_{j=1}^m to be the unlabeled set with :math: m data points. Let :math: theta be the classifier model parameters, :math: l_s be the labeled set loss function (such as cross-entropy loss) and :math: l_u be the unlabeled set loss, e.g. consistency-regularization loss, entropy loss, etc. Denote :math: L_S(mathcal{D}, theta) = underset{i in mathcal{D}}{sum}l_{s}(theta, x_i, y_i) and :math: L_U(mathcal{U}, theta, mathbf{m}) = underset{j in mathcal{U}}{sum} mathbf{m}_i l_u(x_j, theta) where :math: mathbf{m} in {0, 1}^m is the binary mask vector for unlabeled set. For notational convenience, we denote :math: l_{si}(theta) = l_s(x_i, y_i, theta) and denote :math: $l_{uj}(theta) = l_u(x_j, theta)$.

Since, solving the complete inner-optimization is expensive, RETRIEVE adopts a online one-step meta approximation where we approximate the solution to inner problem by taking a single gradient step.

The optimization problem after the approximation is as follows:

\[\mathcal{S}_{t} = \underset{\mathcal{S} \subseteq \mathcal{U}:|\mathcal{S}| \leq k}{\operatorname{argmin\hspace{0.7mm}}}L_S(\mathcal{D}, \theta_t - \alpha_t \nabla_{\theta}L_S(\mathcal{D}, \theta_t) - \alpha_t \lambda_t \underset{j \in \mathcal{S}}{\sum} \mathbf{m}_{jt} \nabla_{\theta}l_u(x_j, \theta_t))\text{\hspace{1.7cm}}\]

In the above equation, $\alpha_t$ denotes the step-size used for one-step gradient update.

RETRIEVE-ONLINE also makes an additional approximation called Taylor-Series approximation to easily solve the outer problem using a greedy selection algorithm. The Taylor series approximation is as follows:

\[L_S(\mathcal{D}, \theta_t - \alpha_t \nabla_{\theta}L_S(\mathcal{D}, \theta_t) - \alpha_t \lambda_t \underset{j \in \mathcal{S}}{\sum} \mathbf{m}_{jt} \nabla_{\theta}l_u(x_j, \theta_t)) \approx L_S(\mathcal{D}, \theta^{S}) - \alpha_t \lambda_t {\nabla_{\theta}L_S(\mathcal{D}, \theta^S)}^T \mathbf{m}_{et} \nabla_{\theta}l_u(x_e, \theta_t)\]

Taylor’s series approximation reduces the time complexity by reducing the need of calculating the labeled set loss for each element during greedy selection step which means reducing the number of forward passes required.

RETRIEVE-ONLINE is an adaptive subset selection algorithm that tries to select a subset every $L$ epochs and the parameter L can be set in the original training loop.

Parameters

trainloader: class

Loading the training data using pytorch DataLoader

valloader: class

Loading the validation data using pytorch DataLoader

model: class

Model architecture used for training

tea_model: class

Teacher model architecture used for training

ssl_alg: class

SSL algorithm class

loss: class

Consistency loss function for unlabeled data with no reduction

eta: float

Learning rate. Step size for the one step gradient update

device: str

The device being utilized - cpu | cuda

num_classes: int

The number of target classes in the dataset

linear_layer: bool

If True, we use the last fc layer weights and biases gradients If False, we use the last fc layer biases gradients

selection_type: str

Type of selection algorithm - - ‘PerBatch’ : PerBatch method is where RETRIEVE algorithm is applied on each minibatch data points. - ‘PerClass’ : PerClass method is where RETRIEVE algorithm is applied on each class data points seperately. - ‘Supervised’ : Supervised method is where RETRIEVE algorithm is applied on entire training data.

greedy: str

Type of greedy selection algorithm - - ‘RGreedy’ : RGreedy Selection method is a variant of naive greedy where we just perform r rounds of greedy selection by choosing k/r points in each round. - ‘Stochastic’ : Stochastic greedy selection method is based on the algorithm presented in this paper 2 - ‘Naive’ : Normal naive greedy selection method that selects a single best element every step until the budget is fulfilled

logger: class

Logger class for logging the information

rint, optional

Number of greedy selection rounds when selection method is RGreedy (default: 15)

valid: bool

If True, we select subset that maximizes the performance on the labeled set.
If False, we select subset that maximizes the performance on the unlabeled set.

eval_taylor_modular(grads)[source]

Evaluate gradients

Parameters: grads (Tensor) – Gradients
Returns: gains – Matrix product of two tensors
Return type: Tensor

greedy_algo(budget)[source]

Implement various greedy algorithms for data subset selection.

Parameters: budget (int) – Budget of data points that needs to be sampled

select(budget, model_params, tea_model_params)[source]

Apply naive greedy method for data selection

Parameters

budget (int) – The number of data points to be selected
model_params (OrderedDict) – Python dictionary object containing model’s parameters
tea_model_params (OrderedDict) – Python dictionary object containing teacher model’s parameters

Returns

greedySet (list) – List containing indices of the best datapoints,
budget (Tensor) – Tensor containing gradients of datapoints present in greedySet

CRAIG Strategy 3 

class cords.selectionstrategies.SSL.craigstrategy.CRAIGStrategy(trainloader, valloader, model, tea_model, ssl_alg, loss, device, num_classes, linear_layer, if_convex, selection_type, logger, optimizer='lazy')[source]

Bases: cords.selectionstrategies.SSL.dataselectionstrategy.DataSelectionStrategy

Adapted Implementation of CRAIG Strategy from the paper 3 for semi-supervised learning setting.

CRAIG strategy tries to solve the optimization problem given below for convex loss functions:

\[\sum_{i\in \mathcal{U}} \min_{j \in S, |S| \leq k} \| x^i - x^j \|\]

In the above equation, $\mathcal{U}$ denotes the training set where $(x^i, y^i)$ denotes the $i^{th}$ training data point and label respectively, $L_T$ denotes the training loss, $S$ denotes the data subset selected at each round, and $k$ is the budget for the subset.

Since, the above optimization problem is not dependent on model parameters, we run the subset selection only once right before the start of the training.

CRAIG strategy tries to solve the optimization problem given below for non-convex loss functions:

\[\underset{\mathcal{S} \subseteq \mathcal{U}:|\mathcal{S}| \leq k}{\operatorname{argmin\hspace{0.7cm}}}\underset{i \in \mathcal{U}}{\sum} \underset{j \in \mathcal{S}}{\min} \left \Vert \mathbf{m}_i \nabla_{\theta}l_u(x_i, \theta) - \mathbf{m}_j \nabla_{\theta}l_u(x_j, \theta) \right \Vert\]

In the above equation, $\mathcal{U}$ denotes the unlabeled set, $l_u$ denotes the unlabeled loss, $\mathcal{S}$ denotes the data subset selected at each round, and $k$ is the budget for the subset. In this case, CRAIG acts an adaptive subset selection strategy that selects a new subset every epoch.

Both the optimization problems given above are an instance of facility location problems which is a submodular function. Hence, it can be optimally solved using greedy selection methods.

Parameters

trainloader: class

Loading the training data using pytorch DataLoader

valloader: class

Loading the validation data using pytorch DataLoader

model: class

Model architecture used for training

tea_model: class

Teacher model architecture used for training

ssl_alg: class

SSL algorithm class

loss: class

Consistency loss function for unlabeled data with no reduction

device: str

The device being utilized - cpu | cuda

num_classes: int

The number of target classes in the dataset

linear_layer: bool

Apply linear transformation to the data

if_convex: bool

If convex or not

selection_type: str

Type of selection:

‘PerClass’: PerClass Implementation where the facility location problem is solved for each class seperately for speed ups.
‘Supervised’: Supervised Implementation where the facility location problem is solved using a sparse similarity matrix by assigning the similarity of a point with other points of different class to zero.
‘PerBatch’: PerBatch Implementation where the facility location problem tries to select subset of mini-batches.

logger: class

Logger class for logging the information

optimizer: str

Type of Greedy Algorithm

compute_gamma(idxs)[source]

Compute the gamma values for the indices.

Parameters: idxs (list) – The indices
Returns: gamma – Gradient values of the input indices
Return type: list

compute_score(model_params, tea_model_params, idxs)[source]

Compute the score of the indices.

Parameters

model_params (OrderedDict) – Python dictionary object containing model’s parameters
tea_model_params (OrderedDict) – Python dictionary object containing teacher model’s parameters
idxs (list) – The indices

distance(x, y, exp=2)[source]

Compute the distance.

Parameters

x (Tensor) – First input tensor
y (Tensor) – Second input tensor
exp (float, optional) – The exponent value (default: 2)

Returns

dist – Output tensor

Return type

Tensor

get_similarity_kernel()[source]

Obtain the similarity kernel.

Returns: kernel – Array of kernel values
Return type: ndarray

select(budget, model_params, tea_model_params)[source]

Data selection method using different submodular optimization functions.

Parameters

budget (int) – The number of data points to be selected
model_params (OrderedDict) – Python dictionary object containing models parameters
optimizer (str) – The optimization approach for data selection. Must be one of ‘random’, ‘modular’, ‘naive’, ‘lazy’, ‘approximate-lazy’, ‘two-stage’, ‘stochastic’, ‘sample’, ‘greedi’, ‘bidirectional’

Returns

total_greedy_list (list) – List containing indices of the best datapoints
gammas (list) – List containing gradients of datapoints present in greedySet

GradMatch Strategy 4 

class cords.selectionstrategies.SSL.gradmatchstrategy.GradMatchStrategy(trainloader, valloader, model, tea_model, ssl_alg, loss, eta, device, num_classes, linear_layer, selection_type, logger, valid=False, v1=True, lam=0, eps=0.0001)[source]

Bases: cords.selectionstrategies.SSL.dataselectionstrategy.DataSelectionStrategy

Implementation of OMPGradMatch Strategy from the paper 4 for supervised learning frameworks.

OMPGradMatch strategy tries to solve the optimization problem given below:

\[\underset{\mathcal{S} \subseteq \mathcal{U}:|\mathcal{S}| \leq k, \{\mathbf{w}_j\}_{j \in [1, |\mathcal{S}|]}:\forall_{j} \mathbf{w}_j \geq 0}{\operatorname{argmin\hspace{0.7mm}}} \left \Vert \underset{i \in \mathcal{U}}{\sum} \mathbf{m}_i \nabla_{\theta}l_u(x_i, \theta) - \underset{j \in \mathcal{S}}{\sum} \mathbf{m}_j \mathbf{w}_j \nabla_{\theta} l_u(x_j, \theta)\right \Vert\]

In the above equation, $\mathbf{w}$ denotes the weight vector that contains the weights for each data instance, $\mathcal{U}$ denotes the unlabeled set where $(x^i, y^i)$ denotes the $i^{th}$ training data point and label respectively, $l_u$ denotes the unlabeled loss, $\mathcal{S}$ denotes the data subset selected at each round, and $k$ is the budget for the subset.

The above optimization problem is solved using the Orthogonal Matching Pursuit(OMP) algorithm.

Parameters
trainloader: class: Loading the training data using pytorch DataLoader
valloader: class: Loading the validation data using pytorch DataLoader
model: class: Model architecture used for training
tea_model: class: Teacher model architecture used for training
ssl_alg: class: SSL algorithm class
loss: class: Consistency loss function for unlabeled data with no reduction
eta: float: Learning rate. Step size for the one step gradient update
device: str: The device being utilized - cpu | cuda
num_classes: int: The number of target classes in the dataset
linear_layer: bool: Apply linear transformation to the data
selection_type: str: Type of selection - - ‘PerClass’: PerClass method is where OMP algorithm is applied on each class data points seperately. - ‘PerBatch’: PerBatch method is where OMP algorithm is applied on each minibatch data points. - ‘PerClassPerGradient’: PerClassPerGradient method is same as PerClass but we use the gradient corresponding to classification layer of that class only.
loggerclass: logger file for printing the info
validbool, optional: If valid==True we use validation dataset gradient sum in OMP otherwise we use training dataset (default: False)
v1bool: If v1==True, we use newer version of OMP solver that is more accurate
lamfloat: Regularization constant of OMP solver
epsfloat: Epsilon parameter to which the above optimization problem is solved using OMP algorithm

ompwrapper(X, Y, bud)[source]

Wrapper function that instantiates the OMP algorithm

Parameters
X:: Individual datapoint gradients
Y:: Gradient sum that needs to be matched to.
bud:: Budget of datapoints that needs to be sampled from the unlabeled set

Returns

idxs (list) – List containing indices of the best datapoints,
gammas (weights tensors) – Tensor containing weights of each instance

select(budget, model_params, tea_model_params)[source]

Apply OMP Algorithm for data selection

Parameters

budget (int) – The number of data points to be selected
model_params (OrderedDict) – Python dictionary object containing model’s parameters
tea_model_params (OrderedDict) – Python dictionary object containing teacher model’s parameters

Returns

idxs (list) – List containing indices of the best datapoints,
gammas (weights tensors) – Tensor containing weights of each instance

Random Strategy

class cords.selectionstrategies.SSL.randomstrategy.RandomStrategy(trainloader, online=False)[source]

Bases: object

This is the Random Selection Strategy class where we select a set of random points as a datasubset and often acts as baselines to compare other subset selection strategies.

Parameters: trainloader (class) – Loading the training data using pytorch DataLoader

select(budget)[source]

Perform random sampling of indices of size budget.

Parameters

budget (int) – The number of data points to be selected

Returns

indices (ndarray) – Array of indices of size budget selected randomly
gammas (Tensor) – Gradient weight values of selected indices

REFERENCES

1(1,2): Krishnateja Killamsetty, Xujiang Zhao, Feng Chen, and Rishabh K Iyer. RETRIEVE: coreset selection for efficient and robust semi-supervised learning. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems. 2021. URL: https://openreview.net/forum?id=jSz59N8NvUP.
2: Baharan Mirzasoleiman, Ashwinkumar Badanidiyuru, Amin Karbasi, Jan Vondrak, and Andreas Krause. Lazier than lazy greedy. 2014. arXiv:1409.7938.
3(1,2): Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec. Coresets for data-efficient training of machine learning models. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, 6950–6960. PMLR, 13–18 Jul 2020. URL: https://proceedings.mlr.press/v119/mirzasoleiman20a.html.
4(1,2): Krishnateja Killamsetty, Durga S, Ganesh Ramakrishnan, Abir De, and Rishabh Iyer. Grad-match: gradient matching based data subset selection for efficient deep model training. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, 5464–5474. PMLR, 18–24 Jul 2021. URL: https://proceedings.mlr.press/v139/killamsetty21a.html.

Semi-supervised Learning Data Selection Strategies

Data Selection Strategy - Base Class

RETRIEVE Strategy 1

CRAIG Strategy 3

GradMatch Strategy 4

Random Strategy

REFERENCES

RETRIEVE Strategy 1 

CRAIG Strategy 3 

GradMatch Strategy 4 