
Contains the CVMatrix class which implements methods for fast computation of training set kernel matrices in cross-validation using the fast algorithms described in the paper by O.-C. G. Engstrøm and M. H. Jensen: https://analyticalsciencejournals.onlinelibrary.wiley.com/doi/full/10.1002/cem.70008

The implementation is written using NumPy.

Author: Ole-Christian Galbo Engstrøm
E-mail: ocge@foss.dk

Classes

CVMatrix(center_X, center_Y, scale_X, ...)

Implements the fast cross-validation algorithms for kernel matrix-based models such as PCA, PCR, PLS, and OLS.

class cvmatrix.cvmatrix.CVMatrix(center_X: bool = True, center_Y: bool = True, scale_X: bool = True, scale_Y: bool = True, ddof: int = 1, dtype: DTypeLike = <class 'numpy.float64'>, copy: bool = True, backend: Literal['numpy', 'jax']='numpy')

Bases: object

Implements the fast cross-validation algorithms for kernel matrix-based models such as PCA, PCR, PLS, and OLS. The algorithms are based on Algorithms 2-7 and all the extensions in Table 1 in the paper by O.-C. G. Engstrøm and M. H. Jensen: https://analyticalsciencejournals.onlinelibrary.wiley.com/doi/full/10.1002/cem.70008. This class is designed to be used in conjunction with cvmatrix.validation_partitioner.Partitioner which implements Algorithm 1 from the same paper.

Parameters:

center_X (bool, optional, default=True) – Whether to center X before computation of \(\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{X}\) and \(\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{Y}\) by subtracting its row of column-wise weighted means from each row. The row of column-wise weighted means is computed on the training set for each fold to avoid data leakage.
center_Y (bool, optional, default=True) – Whether to center Y before computation of \(\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{Y}\) by subtracting its row of column-wise weighted means from each row. The row of column-wise weighted means is computed on the training set for each fold to avoid data leakage. This parameter is ignored if Y is None.
scale_X (bool, optional, default=True) – Whether to scale X before computation of \(\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{X}\) and \(\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{Y}\) by dividing each row by the row of X’s column-wise weighted standard deviations. The row of column-wise weighted standard deviations is computed on the training set for each fold to avoid data leakage.
scale_Y (bool, optional, default=True) – Whether to scale Y before computation of \(\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{Y}\) by dividing each row by the row of Y’s column-wise weighted standard deviations. The row of column-wise weighted standard deviations is computed on the training set for each fold to avoid data leakage. This parameter is ignored if Y is None.
ddof (int, optional, default=1) – The delta degrees of freedom used in the computation of the sample standard deviation. The default is 1, which applies Bessel’s correction and yields the unbiased estimate of the sample variance. If ddof is set to 0, the population standard deviation is computed instead.
dtype (type[np.floating], optional, default=np.float64) – The data type used for the computations. The default is np.float64.
copy (bool, optional, default=True) – Whether to make a copy of the input arrays. If True, a copy is always made. If False, a copy is made only when an input array is not already a NumPy array of type dtype. If no copy is made, then external modifications to X or Y will result in undefined behavior.
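The copy rules described above match the usual NumPy distinction between np.array(..., copy=True) and np.asarray. The following is a minimal sketch of that behaviour only; as_working_array is a hypothetical helper introduced for illustration and is not part of cvmatrix.

```python
import numpy as np


def as_working_array(a, dtype=np.float64, copy=True):
    """Hypothetical helper (not part of cvmatrix) mimicking the copy rules."""
    if copy:
        # Always returns a fresh array, even if `a` already has the right dtype.
        return np.array(a, dtype=dtype, copy=True)
    # Returns `a` itself when it is already an ndarray of `dtype`;
    # otherwise a converted copy is created.
    return np.asarray(a, dtype=dtype)


X = np.arange(6, dtype=np.float64).reshape(3, 2)
X32 = X.astype(np.float32)

same = as_working_array(X, copy=False)         # no copy: shares memory with X
fresh = as_working_array(X, copy=True)         # independent copy
converted = as_working_array(X32, copy=False)  # dtype differs, so a copy is made

print(np.shares_memory(X, same))
print(np.shares_memory(X, fresh))
print(np.shares_memory(X32, converted))
```

When no copy is made, writing to X after the fact would also change the array held by the consumer, which is why the parameter description warns about external modifications.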

fit(X: ArrayLike, Y: ArrayLike | None = None, weights: ArrayLike | None = None) → None

Loads and stores X, Y, and weights for cross-validation. Computes dataset-wide \(\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{X}\) and, if Y is not None, \(\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{Y}\). If center_X, center_Y, scale_X, or scale_Y is True, the corresponding global statistics are also computed.
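As a rough illustration of the two dataset-wide products, the sketch below computes them with plain NumPy, treating \(\mathbf{W}\) as a diagonal matrix of per-sample weights. The variable names and random data are assumptions made for the example; this is not the library's internal code.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, M = 6, 3, 2
X = rng.normal(size=(N, K))
Y = rng.normal(size=(N, M))
w = rng.uniform(0.5, 2.0, size=(N, 1))  # non-negative weights, shape (N, 1)

# Broadcasting w over the columns equals diag(w) @ X
# without materialising the N x N diagonal matrix.
WX = w * X
XTX = WX.T @ X  # X^T W X, shape (K, K)
XTY = WX.T @ Y  # X^T W Y, shape (K, M)

print(XTX.shape, XTY.shape)
```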

Parameters:

X (Array-like of shape (N, K) or (N,)) – Predictor variables for the entire dataset.
Y (None or array-like of shape (N, M) or (N,), optional, default=None) – Response variables for the entire dataset. If None, subsequent calls to training_XTY and training_XTX_XTY will raise a ValueError.
weights (None or array-like of shape (N,) or (N, 1), optional, default=None) – Weights for each sample in X and Y. If None, no weights are used in the computations. If provided, the weights must be non-negative.

X

The total predictor matrix X for the entire dataset.

Type: Array

Y

The total response matrix Y for the entire dataset. If Y is None, this is None.

Type: Array or None

N

The number of samples in the dataset.

Type: int

K

The number of predictor variables in X.

Type: int

M

The number of response variables in Y. If Y is None, this is None.

Type: int or None

XTX

The total matrix \(\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{X}\) for the entire dataset.

Type: Array

XTY

The total matrix \(\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{Y}\) for the entire dataset. This is computed only if Y is not None.

Type: Array or None

sum_X

The row of column-wise weighted sums of X for the entire dataset. This is computed only if center_X, scale_X, or center_Y is True. This is the row of column-wise sums of \(\mathbf{W}\mathbf{X}\) if weights are provided and otherwise the row of column-wise sums of \(\mathbf{X}\).

Type: Array or None

sum_Y

The row of column-wise weighted sums of Y for the entire dataset. This is computed only if center_Y, scale_Y, or center_X is True and Y is not None. This is the row of column-wise sums of \(\mathbf{W}\mathbf{Y}\) if weights are provided and otherwise the row of column-wise sums of \(\mathbf{Y}\).

Type: Array or None

sum_sq_X

The row of column-wise weighted sums of squares of X for the entire dataset. This is computed only if scale_X is True. This is the row of column-wise sums of \(\mathbf{W}\mathbf{X}\odot\mathbf{X}\) if weights are provided and otherwise the row of column-wise sums of \(\mathbf{X}\odot\mathbf{X}\).

Type: Array or None

sum_sq_Y

The row of column-wise weighted sums of squares of Y for the entire dataset. This is computed only if scale_Y is True and Y is not None. This is the row of column-wise sums of \(\mathbf{W}\mathbf{Y}\odot\mathbf{Y}\) if weights are provided and otherwise the row of column-wise sums of \(\mathbf{Y}\odot\mathbf{Y}\).

Type: Array or None
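To see why rows of sums and sums of squares are sufficient for centering and scaling, the sketch below recovers column-wise means and standard deviations from them. It covers the unweighted case only, because the exact weighted formula is not spelled out in this section; the variable names mirror the attributes above but the code is an illustration, not the library's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))
N, ddof = X.shape[0], 1

sum_X = X.sum(axis=0, keepdims=True)           # row of column-wise sums
sum_sq_X = (X * X).sum(axis=0, keepdims=True)  # row of column-wise sums of squares

mean_X = sum_X / N
# Sum of squared deviations from the mean: sum(x^2) - N * mean^2
var_X = (sum_sq_X - N * mean_X**2) / (N - ddof)
std_X = np.sqrt(var_X)

print(mean_X.shape, std_X.shape)
```

Because both quantities are plain sums over samples, the contribution of any subset of rows can be subtracted afterwards, which is what makes per-fold statistics cheap to obtain.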

sq_X

The total weighted squared predictor matrix X for the entire dataset. This is \(\mathbf{W}\mathbf{X}\odot\mathbf{X}\). This is computed only if scale_X is True.

Type: Array or None

sq_Y

The total weighted squared response matrix Y for the entire dataset. This is \(\mathbf{W}\mathbf{Y}\odot\mathbf{Y}\). This is computed only if scale_Y is True and Y is not None.

Type: Array or None

WX

The total weighted predictor matrix X for the entire dataset. This is \(\mathbf{W}\mathbf{X}\).

Type: Array or None

WY

The total weighted response matrix Y for the entire dataset. This is \(\mathbf{W}\mathbf{Y}\). This is computed only if Y is not None.

Type: Array or None

weights

The total weights for the entire dataset. This is an array of shape (N, 1). If weights is None, this is None.

Type: Array or None

sum_w

The sum of the weights for the entire dataset. If weights is None, this is None.

Type: float or None

num_nonzero_w

The number of non-zero weights for the entire dataset. If weights is None, this is None.

Type: int or None

Return type: None
Raises: ValueError – If weights is provided and contains negative values.

training_XTX(validation_indices: NDArray[int64]) → Tuple[ndarray, Tuple[ndarray | None, ndarray | None, None, None]]

Computes the training set \(\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{X}\) corresponding to every sample except those at the validation_indices. Also computes the row of column-wise weighted means for X and the row of column-wise weighted standard deviations for X.
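The idea behind computing a training set matrix from "every sample except those at the validation_indices" can be sketched as a subtraction: the validation samples' contribution is removed from the dataset-wide product instead of rebuilding the training set. The example below shows this for the uncentered, unscaled, unweighted case using plain NumPy; it is an illustration under those assumptions, not the library's code.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 4))
validation_indices = np.array([1, 4, 7], dtype=np.int64)

XTX_total = X.T @ X                      # computed once for the whole dataset
X_val = X[validation_indices]
XTX_train = XTX_total - X_val.T @ X_val  # remove the validation contribution

# Reference: build the training set explicitly and recompute from scratch.
train_mask = np.ones(X.shape[0], dtype=bool)
train_mask[validation_indices] = False
XTX_direct = X[train_mask].T @ X[train_mask]

print(np.allclose(XTX_train, XTX_direct))
```

The subtraction touches only the validation rows, so its cost depends on the validation set size rather than on the (usually much larger) training set size.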

Parameters:

validation_indices (IndexArray) – An integer array of indices for the validation set for which to return the corresponding training set \(\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{X}\). The validation indices should be obtained using the get_validation_indices method of a cvmatrix.validation_partitioner.Partitioner object.

Returns:

The first element is an array of shape (K, K) corresponding to the training set \(\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{X}\). The second element is a tuple containing the row of column-wise weighted means for X, the row of column-wise weighted standard deviations for X, and two None values standing in for the rows of column-wise weighted means and standard deviations for Y, which this method does not compute. Any statistic that is not computed is None.

Return type:

Tuple of two elements.
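The shape of the return value, and how centering and scaling can be applied without forming the training set, can be sketched as follows. training_xtx_sketch is a hypothetical function written for this example; it handles only the unweighted case with both centering and scaling enabled, and it recomputes the dataset-wide totals on every call for brevity, whereas the class stores them in fit.

```python
import numpy as np


def training_xtx_sketch(X, validation_indices, ddof=1):
    """Hypothetical, unweighted illustration; not the cvmatrix implementation."""
    X_val = X[validation_indices]
    N_train = X.shape[0] - X_val.shape[0]

    # Training-set totals: dataset totals minus the validation contribution.
    XTX_train = X.T @ X - X_val.T @ X_val
    sum_train = X.sum(axis=0, keepdims=True) - X_val.sum(axis=0, keepdims=True)
    sum_sq_train = (X * X).sum(axis=0, keepdims=True) - (X_val * X_val).sum(
        axis=0, keepdims=True
    )

    mean = sum_train / N_train
    std = np.sqrt((sum_sq_train - N_train * mean**2) / (N_train - ddof))

    # Centering: sum_n (x_n - m)^T (x_n - m) = X^T X - N_train * m^T m
    XTX_centered = XTX_train - N_train * (mean.T @ mean)
    # Scaling: entry (i, j) is divided by std_i * std_j.
    XTX_scaled = XTX_centered / (std.T @ std)
    return XTX_scaled, (mean, std, None, None)


rng = np.random.default_rng(3)
X = rng.normal(size=(12, 3))
val = np.array([0, 5, 9], dtype=np.int64)
XTX, (X_mean, X_std, Y_mean, Y_std) = training_xtx_sketch(X, val)
print(XTX.shape, X_mean.shape, X_std.shape, Y_mean, Y_std)
```

The means and standard deviations come from the training rows only, which is the property the parameter descriptions refer to as avoiding data leakage.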

Raises:

ValueError –

- If self.center_X or self.scale_X is True but the corresponding statistics cannot be computed because the training set has no non-zero weights.
- If self.scale_X is True and the corresponding weighted standard deviations cannot be computed because self.ddof is greater than or equal to the number of non-zero weights in the training set.
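The two error conditions can be pictured as a guard on the training set weights. The helper below, check_training_weights, is hypothetical: its name, signature, and error messages are invented for illustration, and only the two conditions themselves are taken from the description above.

```python
import numpy as np


def check_training_weights(train_weights, ddof, center, scale):
    """Hypothetical guard mirroring the two documented error conditions."""
    num_nonzero = int(np.count_nonzero(train_weights))
    if (center or scale) and num_nonzero == 0:
        # No sample carries weight, so means and deviations are undefined.
        raise ValueError("Training set has no non-zero weights.")
    if scale and ddof >= num_nonzero:
        # The standard deviation needs more non-zero weights than ddof.
        raise ValueError(
            "ddof must be smaller than the number of non-zero training weights."
        )
    return num_nonzero


print(check_training_weights(np.array([1.0, 0.0, 2.0]), ddof=1, center=True, scale=True))
```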