Supervised Learning Data Selection Strategies
In this section, we consider different data selection strategies geared towards efficient and robust learning in standard supervised learning setting.
Data Selection Strategy (Base Class)
- class cords.selectionstrategies.SL.dataselectionstrategy.DataSelectionStrategy(trainloader, valloader, model, num_classes, linear_layer, loss, device, logger)[source]
Bases:
object
Implementation of Data Selection Strategy class which serves as base class for other dataselectionstrategies for general learning frameworks. :param trainloader: Loading the training data using pytorch dataloader :type trainloader: class :param valloader: Loading the validation data using pytorch dataloader :type valloader: class :param model: Model architecture used for training :type model: class :param num_classes: Number of target classes in the dataset :type num_classes: int :param linear_layer: If True, we use the last fc layer weights and biases gradients
If False, we use the last fc layer biases gradients
- Parameters
loss (class) – PyTorch Loss function
device (str) – The device being utilized - cpu | cuda
logger (class) – logger object for logging the information
- compute_gradients(valid=False, perBatch=False, perClass=False)[source]
Computes the gradient of each element.
Here, the gradients are computed in a closed form using CrossEntropyLoss with reduction set to ‘none’. This is done by calculating the gradients in last layer through addition of softmax layer.
Using different loss functions, the way we calculate the gradients will change.
For LogisticLoss we measure the Mean Absolute Error(MAE) between the pairs of observations. With reduction set to ‘none’, the loss is formulated as:
\[\ell(x, y) = L = \{l_1,\dots,l_N\}^\top, \quad l_n = \left| x_n - y_n \right|,\]where \(N\) is the batch size.
For MSELoss, we measure the Mean Square Error(MSE) between the pairs of observations. With reduction set to ‘none’, the loss is formulated as:
\[\ell(x, y) = L = \{l_1,\dots,l_N\}^\top, \quad l_n = \left( x_n - y_n \right)^2,\]where \(N\) is the batch size. :param valid: if True, the function also computes the validation gradients :type valid: bool :param perBatch: if True, the function computes the gradients of each mini-batch :type perBatch: bool :param perClass: if True, the function computes the gradients using perclass dataloaders :type perClass: bool
GLISTER 1
- class cords.selectionstrategies.SL.glisterstrategy.GLISTERStrategy(trainloader, valloader, model, loss_func, eta, device, num_classes, linear_layer, selection_type, greedy, logger, r=15)[source]
Bases:
cords.selectionstrategies.SL.dataselectionstrategy.DataSelectionStrategy
Implementation of GLISTER-ONLINE Strategy from the paper 1 for supervised learning frameworks. GLISTER-ONLINE methods tries to solve the bi-level optimization problem given below:
\[\overbrace{\underset{{S \subseteq {\mathcal U}, |S| \leq k}}{\operatorname{argmin\hspace{0.7mm}}} L_V(\underbrace{\underset{\theta}{\operatorname{argmin\hspace{0.7mm}}} L_T( \theta, S)}_{inner-level}, {\mathcal V})}^{outer-level}\]In the above equation, \(\mathcal{U}\) denotes the training set, \(\mathcal{V}\) denotes the validation set that guides the subset selection process, \(L_T\) denotes the training loss, \(L_V\) denotes the validation loss, \(S\) denotes the data subset selected at each round, and \(k\) is the budget for the subset.
Since, solving the complete inner-optimization is expensive, GLISTER-ONLINE adopts a online one-step meta approximation where we approximate the solution to inner problem by taking a single gradient step.
The optimization problem after the approximation is as follows:
\[\overbrace{\underset{{S \subseteq {\mathcal U}, |S| \leq k}}{\operatorname{argmin\hspace{0.7mm}}} L_V(\underbrace{\theta - \eta \nabla_{\theta}L_T(\theta, S)}_{inner-level}, {\mathcal V})}^{outer-level}\]In the above equation, \(\eta\) denotes the step-size used for one-step gradient update.
GLISTER-ONLINE also makes an additional approximation called Taylor-Series approximation to easily solve the outer problem using a greedy selection algorithm. The Taylor series approximation is as follows:
\[L_V(\theta - \eta \nabla_{\theta}L_T(\theta, S), {\mathcal V}) \approx L_V(\theta) - \eta {\nabla_{\theta}L_T(\theta, S)}^T \nabla_{\theta}L_V(\theta, {\mathcal V})\]The Optimization problem after the Taylor series approximation is as follows:
\[\underset{{S \subseteq {\mathcal U}, |S| \leq k}}{\operatorname{argmin\hspace{0.7mm}}}L_V(\theta - \eta \nabla_{\theta}L_T(\theta, S), {\mathcal V}) \approx L_V(\theta) - \eta {\nabla_{\theta}L_T(\theta, S)}^T \nabla_{\theta}L_V(\theta, {\mathcal V})\]Taylor’s series approximation reduces the time complexity by reducing the need of calculating the validation loss for each element during greedy selection step which means reducing the number of forward passes required.
GLISTER-ONLINE is an adaptive subset selection algorithm that tries to select a subset every \(L\) epochs and the parameter L can be set in the original training loop.
- Parameters
- trainloader: class
Loading the training data using pytorch DataLoader
- valloader: class
Loading the validation data using pytorch DataLoader
- model: class
Model architecture used for training
- loss_func: object
Loss function object
- eta: float
Learning rate. Step size for the one step gradient update
- device: str
The device being utilized - cpu | cuda
- num_classes: int
The number of target classes in the dataset
- linear_layer: bool
If True, we use the last fc layer weights and biases gradients If False, we use the last fc layer biases gradients
- selection_type: str
Type of selection algorithm - - ‘PerBatch’ : PerBatch method is where GLISTER algorithm is applied on each minibatch data points. - ‘PerClass’ : PerClass method is where GLISTER algorithm is applied on each class data points seperately. - ‘Supervised’ : Supervised method is where GLISTER algorithm is applied on entire training data.
- greedy: str
Type of greedy selection algorithm - - ‘RGreedy’ : RGreedy Selection method is a variant of naive greedy where we just perform r rounds of greedy selection by choosing k/r points in each round. - ‘Stochastic’ : Stochastic greedy selection method is based on the algorithm presented in this paper 2 - ‘Naive’ : Normal naive greedy selection method that selects a single best element every step until the budget is fulfilled
- logger: class
logger class for logging the information
- rint, optional
Number of greedy selection rounds when selection method is RGreedy (default: 15)
- eval_taylor_modular(grads)[source]
Evaluate gradients
- Parameters
grads (Tensor) – Gradients
- Returns
gains – Matrix product of two tensors
- Return type
Tensor
- select(budget, model_params)[source]
Apply naive greedy method for data selection
- Parameters
budget (int) – The number of data points to be selected
model_params (OrderedDict) – Python dictionary object containing models parameters
- Returns
greedySet (list) – List containing indices of the best datapoints,
budget (Tensor) – Tensor containing gradients of datapoints present in greedySet
CRAIG 3
- class cords.selectionstrategies.SL.craigstrategy.CRAIGStrategy(trainloader, valloader, model, loss, device, num_classes, linear_layer, if_convex, selection_type, logger, optimizer='lazy')[source]
Bases:
cords.selectionstrategies.SL.dataselectionstrategy.DataSelectionStrategy
Implementation of CRAIG Strategy from the paper 3 for supervised learning frameworks.
CRAIG strategy tries to solve the optimization problem given below for convex loss functions:
\[\sum_{i\in \mathcal{U}} \min_{j \in S, |S| \leq k} \| x^i - x^j \|\]In the above equation, \(\mathcal{U}\) denotes the training set where \((x^i, y^i)\) denotes the \(i^{th}\) training data point and label respectively, \(L_T\) denotes the training loss, \(S\) denotes the data subset selected at each round, and \(k\) is the budget for the subset.
Since, the above optimization problem is not dependent on model parameters, we run the subset selection only once right before the start of the training.
CRAIG strategy tries to solve the optimization problem given below for non-convex loss functions:
\[\sum_{i\in \mathcal{U}} \min_{j \in S, |S| \leq k} \| \nabla_{\theta} {L_T}^i(\theta) - \nabla_{\theta} {L_T}^j(\theta) \|\]In the above equation, \(\mathcal{U}\) denotes the training set, \(L_T\) denotes the training loss, \(S\) denotes the data subset selected at each round, and \(k\) is the budget for the subset. In this case, CRAIG acts an adaptive subset selection strategy that selects a new subset every epoch.
Both the optimization problems given above are an instance of facility location problems which is a submodular function. Hence, it can be optimally solved using greedy selection methods.
- Parameters
- trainloader: class
Loading the training data using pytorch DataLoader
- valloader: class
Loading the validation data using pytorch DataLoader
- model: class
Model architecture used for training
- loss_type: class
PyTorch Loss Function
- device: str
The device being utilized - cpu | cuda
- num_classes: int
The number of target classes in the dataset
- linear_layer: bool
Apply linear transformation to the data
- if_convex: bool
If convex or not
- selection_type: str
- Type of selection:
‘PerClass’: PerClass Implementation where the facility location problem is solved for each class seperately for speed ups.
- ‘Supervised’: Supervised Implementation where the facility location problem is solved using a sparse similarity matrix by
assigning the similarity of a point with other points of different class to zero.
‘PerBatch’: PerBatch Implementation where the facility location problem tries to select subset of mini-batches.
- loggerclass
logger object for logging the information
- optimizer: str
Type of Greedy Algorithm
- compute_gamma(idxs)[source]
Compute the gamma values for the indices.
- Parameters
idxs (list) – The indices
- Returns
gamma – Gradient values of the input indices
- Return type
list
- compute_score(model_params, idxs)[source]
Compute the score of the indices.
- Parameters
model_params (OrderedDict) – Python dictionary object containing models parameters
idxs (list) – The indices
- distance(x, y, exp=2)[source]
Compute the distance.
- Parameters
x (Tensor) – First input tensor
y (Tensor) – Second input tensor
exp (float, optional) – The exponent value (default: 2)
- Returns
dist – Output tensor
- Return type
Tensor
- get_similarity_kernel()[source]
Obtain the similarity kernel.
- Returns
kernel – Array of kernel values
- Return type
ndarray
- select(budget, model_params)[source]
Data selection method using different submodular optimization functions.
- Parameters
budget (int) – The number of data points to be selected
model_params (OrderedDict) – Python dictionary object containing models parameters
optimizer (str) – The optimization approach for data selection. Must be one of ‘random’, ‘modular’, ‘naive’, ‘lazy’, ‘approximate-lazy’, ‘two-stage’, ‘stochastic’, ‘sample’, ‘greedi’, ‘bidirectional’
- Returns
total_greedy_list (list) – List containing indices of the best datapoints
gammas (list) – List containing gradients of datapoints present in greedySet
GradMatch 4
- class cords.selectionstrategies.SL.gradmatchstrategy.GradMatchStrategy(trainloader, valloader, model, loss, eta, device, num_classes, linear_layer, selection_type, logger, valid=False, v1=True, lam=0, eps=0.0001)[source]
Bases:
cords.selectionstrategies.SL.dataselectionstrategy.DataSelectionStrategy
Implementation of GradMatch Strategy from the paper 4 for supervised learning frameworks.
GradMatch strategy tries to solve the optimization problem given below:
\[\min_{\mathbf{w}, S: |S| \leq k} \Vert \sum_{i \in S} w_i \nabla_{\theta}L_T^i(\theta) - \nabla_{\theta}L(\theta)\Vert\]In the above equation, \(\mathbf{w}\) denotes the weight vector that contains the weights for each data instance, \(\mathcal{U}\) training set where \((x^i, y^i)\) denotes the \(i^{th}\) training data point and label respectively, \(L_T\) denotes the training loss, \(L\) denotes either training loss or validation loss depending on the parameter valid, \(S\) denotes the data subset selected at each round, and \(k\) is the budget for the subset.
The above optimization problem is solved using the Orthogonal Matching Pursuit(OMP) algorithm.
- Parameters
- trainloader: class
Loading the training data using pytorch DataLoader
- valloader: class
Loading the validation data using pytorch DataLoader
- model: class
Model architecture used for training
- loss: class
PyTorch loss function for training
- eta: float
Learning rate. Step size for the one step gradient update
- device: str
The device being utilized - cpu | cuda
- num_classes: int
The number of target classes in the dataset
- linear_layer: bool
Apply linear transformation to the data
- selection_type: str
Type of selection - - ‘PerClass’: PerClass method is where OMP algorithm is applied on each class data points seperately. - ‘PerBatch’: PerBatch method is where OMP algorithm is applied on each minibatch data points. - ‘PerClassPerGradient’: PerClassPerGradient method is same as PerClass but we use the gradient corresponding to classification layer of that class only.
- loggerclass
logger object for logging the information
- validbool
If valid==True, we use validation dataset gradient sum in OMP otherwise we use training dataset (default: False)
- v1bool
If v1==True, we use newer version of OMP solver that is more accurate
- lamfloat
Regularization constant of OMP solver
- epsfloat
Epsilon parameter to which the above optimization problem is solved using OMP algorithm
- select(budget, model_params)[source]
Apply OMP Algorithm for data selection
- Parameters
budget (int) – The number of data points to be selected
model_params (OrderedDict) – Python dictionary object containing models parameters
- Returns
idxs (list) – List containing indices of the best datapoints,
gammas (weights tensors) – Tensor containing weights of each instance
Random Strategy
- class cords.selectionstrategies.SL.randomstrategy.RandomStrategy(trainloader, online=False)[source]
Bases:
object
This is the Random Selection Strategy class where we select a set of random points as a datasubset and often acts as baselines to compare other selection strategies.
- Parameters
trainloader (class) – Loading the training data using pytorch DataLoader
Submodular Selection Strategy
- class cords.selectionstrategies.SL.submodularselectionstrategy.SubmodularSelectionStrategy(trainloader, valloader, model, loss, device, num_classes, linear_layer, if_convex, selection_type, submod_func_type, optimizer)[source]
Bases:
cords.selectionstrategies.SL.dataselectionstrategy.DataSelectionStrategy
This class extends
selectionstrategies.supervisedlearning.dataselectionstrategy.DataSelectionStrategy
to include submodular optmization functions using apricot for data selection.- Parameters
trainloader (class) – Loading the training data using pytorch DataLoader
valloader (class) – Loading the validation data using pytorch DataLoader
model (class) – Model architecture used for training
loss_type (class) – The type of loss criterion
device (str) – The device being utilized - cpu | cuda
num_classes (int) – The number of target classes in the dataset
linear_layer (bool) – Apply linear transformation to the data
if_convex (bool) – If convex or not
selection_type (str) – PerClass or Supervised
submod_func_type (str) – The type of submodular optimization function. Must be one of ‘facility-location’, ‘graph-cut’, ‘sum-redundancy’, ‘saturated-coverage’
- compute_gamma(idxs)[source]
Compute the gamma values for the indices.
- Parameters
idxs (list) – The indices
- Returns
gamma – Gradient values of the input indices
- Return type
list
- compute_score(model_params, idxs)[source]
Compute the score of the indices.
- Parameters
model_params (OrderedDict) – Python dictionary object containing models parameters
idxs (list) – The indices
- distance(x, y, exp=2)[source]
Compute the distance.
- Parameters
x (Tensor) – First input tensor
y (Tensor) – Second input tensor
exp (float, optional) – The exponent value (default: 2)
- Returns
dist – Output tensor
- Return type
Tensor
- get_similarity_kernel()[source]
Obtain the similarity kernel.
- Returns
kernel – Array of kernel values
- Return type
ndarray
- select(budget, model_params)[source]
Data selection method using different submodular optimization functions.
- Parameters
budget (int) – The number of data points to be selected
model_params (OrderedDict) – Python dictionary object containing models parameters
optimizer (str) – The optimization approach for data selection. Must be one of ‘random’, ‘modular’, ‘naive’, ‘lazy’, ‘approximate-lazy’, ‘two-stage’, ‘stochastic’, ‘sample’, ‘greedi’, ‘bidirectional’
- Returns
total_greedy_list (list) – List containing indices of the best datapoints
gammas (list) – List containing gradients of datapoints present in greedySet
REFERENCES
- 1(1,2)
Krishnateja Killamsetty, Durga Sivasubramanian, Ganesh Ramakrishnan, and Rishabh Iyer. Glister: generalization based data subset selection for efficient and robust learning. Proceedings of the AAAI Conference on Artificial Intelligence, 35(9):8110–8118, May 2021. URL: https://ojs.aaai.org/index.php/AAAI/article/view/16988.
- 2
Baharan Mirzasoleiman, Ashwinkumar Badanidiyuru, Amin Karbasi, Jan Vondrak, and Andreas Krause. Lazier than lazy greedy. 2014. arXiv:1409.7938.
- 3(1,2)
Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec. Coresets for data-efficient training of machine learning models. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, 6950–6960. PMLR, 13–18 Jul 2020. URL: https://proceedings.mlr.press/v119/mirzasoleiman20a.html.
- 4(1,2)
Krishnateja Killamsetty, Durga S, Ganesh Ramakrishnan, Abir De, and Rishabh Iyer. Grad-match: gradient matching based data subset selection for efficient deep model training. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, 5464–5474. PMLR, 18–24 Jul 2021. URL: https://proceedings.mlr.press/v139/killamsetty21a.html.