Contains the CVMatrix class which implements methods for fast computation of training set kernel matrices in cross-validation using the fast algorithms described in the paper by O.-C. G. Engstrøm and M. H. Jensen: https://analyticalsciencejournals.onlinelibrary.wiley.com/doi/full/10.1002/cem.70008

The implementation is written using NumPy.

Author: Ole-Christian Galbo Engstrøm
E-mail: ocge@foss.dk

Classes

CVMatrix(center_X, center_Y, scale_X, ...)

Implements the fast cross-validation algorithms for kernel matrix-based models such as PCA, PCR, PLS, and OLS.

class cvmatrix.cvmatrix.CVMatrix(center_X: bool = True, center_Y: bool = True, scale_X: bool = True, scale_Y: bool = True, ddof: int = 1, dtype: type[numpy.floating] = numpy.float64, copy: bool = True)

Bases: object

Implements the fast cross-validation algorithms for kernel matrix-based models such as PCA, PCR, PLS, and OLS. The algorithms are based on Algorithms 2-7 and all the extensions in Table 1 in the paper by O.-C. G. Engstrøm and M. H. Jensen: https://analyticalsciencejournals.onlinelibrary.wiley.com/doi/full/10.1002/cem.70008. This class is designed to be used in conjunction with cvmatrix.validation_partitioner.Partitioner which implements Algorithm 1 from the same paper.

Parameters:
  • center_X (bool, optional, default=True) – Whether to center X before computation of \(\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{X}\) and \(\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{Y}\) by subtracting its row of column-wise weighted means from each row. The row of column-wise weighted means is computed on the training set for each fold to avoid data leakage.

  • center_Y (bool, optional, default=True) – Whether to center Y before computation of \(\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{Y}\) by subtracting its row of column-wise weighted means from each row. The row of column-wise weighted means is computed on the training set for each fold to avoid data leakage. This parameter is ignored if Y is None.

  • scale_X (bool, optional, default=True) – Whether to scale X before computation of \(\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{X}\) and \(\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{Y}\) by dividing each row by the row of X’s column-wise weighted standard deviations. The row of column-wise weighted standard deviations is computed on the training set for each fold to avoid data leakage.

  • scale_Y (bool, optional, default=True) – Whether to scale Y before computation of \(\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{Y}\) by dividing each row by the row of Y’s column-wise weighted standard deviations. The row of column-wise weighted standard deviations is computed on the training set for each fold to avoid data leakage. This parameter is ignored if Y is None.

  • ddof (int, optional, default=1) – The delta degrees of freedom used in the computation of the sample standard deviation. The default is 1, which corresponds to Bessel’s correction for the unbiased estimate of the sample standard deviation. If ddof is set to 0, the population standard deviation is computed instead.

  • dtype (type[np.floating], optional, default=np.float64) – The data type used for the computations. The default is np.float64.

  • copy (bool, optional, default=True) – Whether to make a copy of the input arrays. If True, a copy is always made. If False and the input arrays are already NumPy arrays of type dtype, no copy is made. If False and the input arrays are not NumPy arrays of type dtype, a copy is made. If no copy is made, external modifications to X or Y result in undefined behavior.
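The copy semantics described above match what plain NumPy conversion functions already do. The following is a minimal sketch of that behavior, not the library's code; load_array is a hypothetical helper introduced only for illustration.

```python
import numpy as np


def load_array(a, dtype=np.float64, copy=True):
    """Hypothetical helper mirroring the documented ``copy`` behavior."""
    if copy:
        # Always allocate a fresh array of the requested dtype.
        return np.array(a, dtype=dtype, copy=True)
    # Reuses the input when it is already an ndarray of this dtype;
    # otherwise the conversion necessarily produces a new array.
    return np.asarray(a, dtype=dtype)


X = np.arange(6, dtype=np.float64).reshape(3, 2)
shared = load_array(X, copy=False)     # same memory as X
copied = load_array(X, copy=True)      # independent buffer
converted = load_array(X.astype(np.float32), copy=False)  # dtype differs, so a new array
```

Because shared aliases X, an in-place change to X after loading would silently change shared as well, which is the situation the undefined-behavior warning refers to.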

fit(X: ArrayLike, Y: ArrayLike | None = None, weights: ArrayLike | None = None) None

Loads and stores X, Y, and weights for cross-validation. Computes dataset-wide \(\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{X}\) and, if Y is not None, \(\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{Y}\). If center_X, center_Y, scale_X, or scale_Y is True, the corresponding global statistics are also computed.

Parameters:
  • X (Array-like of shape (N, K) or (N,)) – Predictor variables for the entire dataset.

  • Y (None or array-like of shape (N, M) or (N,), optional, default=None) – Response variables for the entire dataset. If None, subsequent calls to training_XTY and training_XTX_XTY will raise a ValueError.

  • weights (None or array-like of shape (N,) or (N, 1), optional, default=None) – Weights for each sample in X and Y. If None, no weights are used in the computations. If provided, the weights must be non-negative.
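As an illustration of the quantities involved, the weighted products can be formed with plain NumPy without materializing the N x N diagonal matrix \(\mathbf{W}\). This is a sketch under the assumption that \(\mathbf{W}\) is the diagonal matrix of sample weights; the variable names are chosen for the example and are not the library's internals.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, M = 6, 3, 2
X = rng.normal(size=(N, K))
Y = rng.normal(size=(N, M))
weights = rng.uniform(0.5, 2.0, size=N)  # non-negative sample weights

# Store the weights as a column so they broadcast over the rows of X and Y.
w = weights.reshape(-1, 1)

# W X and W Y without forming the N x N diagonal matrix W.
WX = w * X
WY = w * Y

XTWX = X.T @ WX  # shape (K, K)
XTWY = X.T @ WY  # shape (K, M)

# Reference computation with an explicit diagonal W, for comparison only.
W = np.diag(weights)
```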

X

The total predictor matrix X for the entire dataset.

Type:

np.ndarray

Y

The total response matrix Y for the entire dataset. If Y is None, this is None.

Type:

np.ndarray or None

N

The number of samples in the dataset.

Type:

int

K

The number of predictor variables in X.

Type:

int

M

The number of response variables in Y. If Y is None, this is None.

Type:

int or None

XTX

The total matrix \(\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{X}\) for the entire dataset.

Type:

np.ndarray

XTY

The total matrix \(\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{Y}\) for the entire dataset. This is computed only if Y is not None.

Type:

np.ndarray or None

sum_X

The row of column-wise weighted sums of X for the entire dataset. This is computed only if center_X, scale_X, or center_Y is True. This is the row of column-wise sums of \(\mathbf{W}\mathbf{X}\) if weights are provided and otherwise the row of column-wise sums of \(\mathbf{X}\).

Type:

np.ndarray or None

sum_Y

The row of column-wise weighted sums of Y for the entire dataset. This is computed only if center_Y, scale_Y, or center_X is True and Y is not None. This is the row of column-wise sums of \(\mathbf{W}\mathbf{Y}\) if weights are provided and otherwise the row of column-wise sums of \(\mathbf{Y}\).

Type:

np.ndarray or None

sum_sq_X

The row of column-wise weighted sums of squares of X for the entire dataset. This is computed only if scale_X is True. This is the row of column-wise sums of \(\mathbf{W}\mathbf{X}\odot\mathbf{X}\) if weights are provided and otherwise the row of column-wise sums of \(\mathbf{X}\odot\mathbf{X}\).

Type:

np.ndarray or None

sum_sq_Y

The row of column-wise weighted sums of squares of Y for the entire dataset. This is computed only if scale_Y is True and Y is not None. This is the row of column-wise sums of \(\mathbf{W}\mathbf{Y}\odot\mathbf{Y}\) if weights are provided and otherwise the row of column-wise sums of \(\mathbf{Y}\odot\mathbf{Y}\).

Type:

np.ndarray or None

sq_X

The total weighted squared predictor matrix X for the entire dataset. This is \(\mathbf{W}\mathbf{X}\odot\mathbf{X}\). This is computed only if scale_X is True.

Type:

np.ndarray or None

sq_Y

The total weighted squared response matrix Y for the entire dataset. This is \(\mathbf{W}\mathbf{Y}\odot\mathbf{Y}\). This is computed only if scale_Y is True and Y is not None.

Type:

np.ndarray or None

WX

The total weighted predictor matrix X for the entire dataset. This is \(\mathbf{W}\mathbf{X}\).

Type:

np.ndarray or None

WY

The total weighted response matrix Y for the entire dataset. This is \(\mathbf{W}\mathbf{Y}\). This is computed only if Y is not None.

Type:

np.ndarray or None

weights

The total weights for the entire dataset. This is an array of shape (N, 1). If weights is None, this is None.

Type:

np.ndarray or None

sum_w

The sum of the weights for the entire dataset. If weights is None, this is None.

Type:

float or None

num_nonzero_w

The number of non-zero weights for the entire dataset. If weights is None, this is None.

Type:

int or None

Return type:

None

Raises:

ValueError – If weights is provided and contains negative values.
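The dataset-wide statistics listed among the attributes can be reproduced by hand on a tiny example. The sketch below follows the attribute descriptions; keeping each result as a row of shape (1, K) is an assumption made here for illustration.

```python
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
w = np.array([[1.0], [0.0], [2.0]])  # shape (N, 1); one sample has zero weight

sum_X = np.sum(w * X, axis=0, keepdims=True)         # row of weighted column sums
sum_sq_X = np.sum(w * X * X, axis=0, keepdims=True)  # row of weighted sums of squares
sum_w = float(np.sum(w))                             # total weight
num_nonzero_w = int(np.count_nonzero(w))             # samples that actually contribute
```

The zero-weight sample contributes nothing to any of the sums, and it is excluded from num_nonzero_w, which is the count the ddof checks in the training methods refer to.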

training_XTX(validation_indices: ndarray[tuple[Any, ...], dtype[int64]]) Tuple[ndarray, Tuple[ndarray | None, ndarray | None, None, None]]

Computes the training set \(\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{X}\) corresponding to every sample except those at the validation_indices. Also computes the row of column-wise weighted means for X and the row of column-wise weighted standard deviations for X.

Parameters:

validation_indices (npt.NDArray[np.int_]) – An integer array of indices for the validation set for which to return the corresponding training set \(\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{X}\). The validation indices should be obtained using the get_validation_indices method of a cvmatrix.validation_partitioner.Partitioner object.

Returns:

The first element is an array of shape (K, K) corresponding to the training set \(\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{X}\). The second element is a tuple containing the row of column-wise weighted means for X, the row of column-wise weighted standard deviations for X, and two None corresponding to the non-computed rows of column-wise weighted means and standard deviations for Y. If a statistic is not computed, it is None.

Return type:

Tuple of two elements.

Raises:

ValueError

  • If self.center_X or self.scale_X is True and the corresponding weighted means or standard deviations cannot be computed because the training set has no non-zero weights.

  • If self.scale_X is True and the weighted standard deviations cannot be computed because self.ddof is greater than or equal to the number of non-zero weights in the training set.

See also

training_XTY

Computes the training set \(\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{Y}\) and weighted statistics.

training_XTX_XTY

Computes the training set \(\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{X}\), \(\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{Y}\), and weighted statistics. This method is faster than calling training_XTX and training_XTY separately.
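The basic idea behind obtaining a training set matrix from the dataset-wide one can be sketched in a few lines: subtract the validation samples' contribution instead of recomputing over the training rows. This is an illustrative sketch for the case without centering or scaling, not the library's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 8, 3
X = rng.normal(size=(N, K))
w = rng.uniform(0.5, 2.0, size=(N, 1))

# Computed once for the whole dataset.
XTX_total = X.T @ (w * X)

validation_indices = np.array([1, 4, 5])
X_val = X[validation_indices]
w_val = w[validation_indices]

# Training set matrix: remove the validation samples' contribution.
XTX_train = XTX_total - X_val.T @ (w_val * X_val)

# Direct computation on the training rows, for comparison only.
train_mask = np.ones(N, dtype=bool)
train_mask[validation_indices] = False
XTX_direct = X[train_mask].T @ (w[train_mask] * X[train_mask])
```

The subtraction only touches the validation rows, so its cost depends on the size of the validation set rather than on the size of the training set.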

training_XTY(validation_indices: ndarray[tuple[Any, ...], dtype[int64]]) Tuple[ndarray, Tuple[ndarray | None, ndarray | None, ndarray | None, ndarray | None]]

Computes the training set \(\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{Y}\) corresponding to every sample except those at the validation_indices. Also computes the row of column-wise weighted means for X, the row of column-wise weighted standard deviations for X, the row of column-wise weighted means for Y, and the row of column-wise weighted standard deviations for Y. If a statistic is not computed, it is None.

Parameters:

validation_indices (npt.NDArray[np.int_]) – An integer array of indices for the validation set for which to return the corresponding training set \(\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{Y}\). The validation indices should be obtained using the get_validation_indices method of a cvmatrix.validation_partitioner.Partitioner object.

Returns:

The first element is an array of shape (K, M) corresponding to the training set \(\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{Y}\). The second element is a tuple containing the row of column-wise weighted means for X, the row of column-wise weighted standard deviations for X, the row of column-wise weighted means for Y, and the row of column-wise weighted standard deviations for Y. If a statistic is not computed, it is None.

Return type:

Tuple of two elements.

Raises:

ValueError

  • If Y is None.

  • If self.center_X, self.center_Y, self.scale_X, or self.scale_Y is True and the corresponding weighted means and standard deviations cannot be computed because the training set has no non-zero weights.

  • If self.scale_X or self.scale_Y is True and the corresponding weighted standard deviations cannot be computed because self.ddof is greater than or equal to the number of non-zero weights in the training set.

See also

training_XTX

Computes the training set \(\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{X}\) and weighted statistics.

training_XTX_XTY

Computes the training set \(\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{X}\), \(\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{Y}\), and weighted statistics. This method is faster than calling training_XTX and training_XTY separately.
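When centering is enabled, the training set product can still be derived from dataset-wide sums, because the weighted centered cross-product equals the uncentered one minus the total weight times the outer product of the two mean rows. The sketch below checks this identity with NumPy; it assumes unique validation indices and is not taken from the library's code.

```python
import numpy as np

rng = np.random.default_rng(2)
N, K, M = 10, 3, 2
X = rng.normal(size=(N, K))
Y = rng.normal(size=(N, M))
w = rng.uniform(0.5, 2.0, size=(N, 1))

# Dataset-wide quantities, computed once.
XTY_total = X.T @ (w * Y)
sum_X = np.sum(w * X, axis=0, keepdims=True)
sum_Y = np.sum(w * Y, axis=0, keepdims=True)
sum_w = np.sum(w)

val = np.array([0, 3, 7])
X_val, Y_val, w_val = X[val], Y[val], w[val]

# Training set sums obtained by subtracting the validation contribution.
XTY_train = XTY_total - X_val.T @ (w_val * Y_val)
sum_X_train = sum_X - np.sum(w_val * X_val, axis=0, keepdims=True)
sum_Y_train = sum_Y - np.sum(w_val * Y_val, axis=0, keepdims=True)
sum_w_train = sum_w - np.sum(w_val)

# Weighted training means, then the centering correction.
mean_X = sum_X_train / sum_w_train
mean_Y = sum_Y_train / sum_w_train
XTY_centered = XTY_train - sum_w_train * (mean_X.T @ mean_Y)
```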

training_XTX_XTY(validation_indices: ndarray[tuple[Any, ...], dtype[int64]]) Tuple[Tuple[ndarray, ndarray], Tuple[ndarray | None, ndarray | None, ndarray | None, ndarray | None]]

Computes the training set \(\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{X}\) and \(\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{Y}\) corresponding to every sample except those at the validation_indices. Also computes the row of column-wise weighted means for X, the row of column-wise weighted standard deviations for X, the row of column-wise weighted means for Y, and the row of column-wise weighted standard deviations for Y. If a statistic is not computed, it is None.

Parameters:

validation_indices (npt.NDArray[np.int_]) – An integer array of indices for the validation set for which to return the corresponding training set \(\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{X}\) and \(\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{Y}\). The validation indices should be obtained using the get_validation_indices method of a cvmatrix.validation_partitioner.Partitioner object.

Returns:

The first tuple contains arrays of shapes (K, K) and (K, M). These are the training set \(\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{X}\) and \(\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{Y}\). The second tuple contains the row of column-wise weighted means for X, the row of column-wise weighted standard deviations for X, the row of column-wise weighted means for Y, and the row of column-wise weighted standard deviations for Y. If a statistic is not computed, it is None.

Return type:

Tuple of two tuples.

Raises:

ValueError

  • If Y is None.

  • If self.center_X, self.center_Y, self.scale_X, or self.scale_Y is True and the corresponding weighted means and standard deviations cannot be computed because the training set has no non-zero weights.

  • If self.scale_X or self.scale_Y is True and the corresponding weighted standard deviations cannot be computed because self.ddof is greater than or equal to the number of non-zero weights in the training set.

See also

training_XTX

Computes the training set \(\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{X}\) and weighted statistics.

training_XTY

Computes the training set \(\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{Y}\) and weighted statistics.
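Scaling does not require touching every row of the training data either: dividing the centered (K, K) and (K, M) products by outer products of the standard deviation rows gives the same result as standardizing the data first. The following unweighted sketch illustrates this; the centered products are formed directly for brevity, and the names are chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
N, K, M = 12, 3, 2
X = rng.normal(size=(N, K))
Y = rng.normal(size=(N, M))

val = np.array([2, 5, 9])
train = np.setdiff1d(np.arange(N), val)
X_train, Y_train = X[train], Y[train]

ddof = 1
mean_X = X_train.mean(axis=0, keepdims=True)
mean_Y = Y_train.mean(axis=0, keepdims=True)
std_X = X_train.std(axis=0, ddof=ddof, keepdims=True)
std_Y = Y_train.std(axis=0, ddof=ddof, keepdims=True)

# Centered matrix products (formed directly here for brevity).
Xc, Yc = X_train - mean_X, Y_train - mean_Y
XTX_c = Xc.T @ Xc
XTY_c = Xc.T @ Yc

# Apply the scaling to the small result matrices
# instead of to every row of the training data.
XTX_cs = XTX_c / (std_X.T @ std_X)
XTY_cs = XTY_c / (std_X.T @ std_Y)
```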

training_statistics(validation_indices: ndarray[tuple[Any, ...], dtype[int64]]) Tuple[ndarray | None, ndarray | None, ndarray | None, ndarray | None]

Computes the rows of column-wise weighted means and standard deviations for X and Y corresponding to every sample except those at validation_indices. The statistics that can be computed depend on the arguments provided to the constructor. The X mean can be computed if center_X, scale_X, or center_Y is True. The X standard deviation can be computed if scale_X is True. The Y mean can be computed if center_X, center_Y, or scale_Y is True and Y is provided. The Y standard deviation can be computed if scale_Y is True and Y is provided.

Parameters:

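validation_indices (npt.NDArray[np.int_]) – An integer array of indices for the validation set for which to return the corresponding training set statistics. The validation indices should be obtained using the get_validation_indices method of a cvmatrix.validation_partitioner.Partitioner object.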

Returns:

A tuple containing the row of column-wise weighted means for X, the row of column-wise weighted standard deviations for X, the row of column-wise weighted means for Y, and the row of column-wise weighted standard deviations for Y. If a statistic is not computed, it is None.

Return type:

Tuple of four elements of Optional[np.ndarray]

Raises:

ValueError

  • If self.center_X, self.center_Y, self.scale_X, or self.scale_Y is True and the corresponding weighted means and standard deviations cannot be computed because the training set has no non-zero weights.

  • If self.scale_X or self.scale_Y is True and the corresponding weighted standard deviations cannot be computed because self.ddof is greater than or equal to the number of non-zero weights in the training set.
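For the unweighted case, training set means and standard deviations can be recovered from dataset-wide sums and sums of squares, and the ddof condition above becomes a simple check on the number of training samples. The sketch below is a hypothetical helper (training_mean_std is not part of the library) that assumes unique validation indices and no sample weights.

```python
import numpy as np


def training_mean_std(X, validation_indices, ddof=1):
    """Hypothetical helper: unweighted training mean and std from sums."""
    N = X.shape[0]
    sum_X = X.sum(axis=0, keepdims=True)
    sum_sq_X = (X * X).sum(axis=0, keepdims=True)

    X_val = X[validation_indices]
    n_train = N - X_val.shape[0]
    if n_train == 0:
        raise ValueError("The training set is empty.")
    if ddof >= n_train:
        raise ValueError("ddof must be smaller than the number of training samples.")

    # Remove the validation samples' contribution from the totals.
    sum_train = sum_X - X_val.sum(axis=0, keepdims=True)
    sum_sq_train = sum_sq_X - (X_val * X_val).sum(axis=0, keepdims=True)

    mean = sum_train / n_train
    # Sum of squared deviations: sum(x^2) - n * mean^2.
    ss = sum_sq_train - n_train * mean * mean
    # Clip tiny negative values caused by floating-point rounding.
    std = np.sqrt(np.maximum(ss, 0.0) / (n_train - ddof))
    return mean, std


rng = np.random.default_rng(4)
X = rng.normal(size=(9, 3))
val = np.array([0, 4])
mean, std = training_mean_std(X, val, ddof=1)
```

With ddof=1 the result corresponds to the sample standard deviation with Bessel's correction, and with ddof=0 to the population standard deviation, in line with the ddof parameter of the constructor.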