Removal of cell cycle effect using batchelor::regressBatches() and proper usage of corrected UMI matrix
1
0
Entering edit mode
jirkanov • 0
@jirkanov-14624
Last seen 18 months ago

Hello, I am working with 10x Genomics scRNA-seq data. I am using Seurat::CellCycleScoring() to assign cell cycle stages to cells and later batchelor::regressBatches() to remove this unwanted source of variation. My question is regarding output from regressBatches(). As documentation says, the output is "A SingleCellExperiment object containing the corrected assay.", where corrected assay is object of class ResidualMatrix, and "the residuals represent the expression values after correcting for the batch effect".

I am really not sure if I am using the ResidualMatrix properly in downstream analyses. I have used this matrix to calculate dimreds, clustering and to find cluster markers. Although I can see that some genes are relatively good markers (log2FC > 1, FDR < 0.1; Seurat::FindAllMarkers() using Wilcox test), the heatmap of such genes is not very convincing (see below - top 15 upregulated genes from each cluster). For example, the first gene (IGFBP5) has log2FC of 2.2 in the first cluster, but I can barely see a difference in the heatmap. I am not sure if this is just caused by improper color gradient, but you can see that are negative values in the ResidualMatrix.

TLDR: Can I use ResidualMatrix obtained from regressBatches() in the same way as e.g. corrected UMI matrix obtained from Seurat::SCTransform() (both used to remove cell cycle effect)?

0
Entering edit mode
Aaron Lun ★ 27k
@alun
Last seen 8 hours ago
The city by the bay

I'll assume you've already read the bit about the dangers of regression in single-cell applications, so I'll skip that.

I am not sure if this is just caused by improper color gradient, but you can see that are negative values in the ResidualMatrix.

Yes, negative values in the residual matrix are to be expected - it is, after all a matrix of residuals, so some of the values have to be negative by definition.

Anyway, there should be nothing (too) wrong with using the residual matrix in the standard Wilcoxon tests. For example:

set.seed(100)
blah <- matrix(rnorm(20000), ncol=100)
covariate <- runif(ncol(blah))

# Double-transpose as we are fitting the models to to columns, not to rows.
library(ResidualMatrix)
resids <- t(ResidualMatrix(t(blah), design=model.matrix(~covariate)))
dim(resids)
## [1] 200 100

library(scran)
groups <- rep(1:2, 50) # two groups of 50 each.
p.out <- findMarkers(resids, groups, test.type="wilcox")
p.out[[1]]
## DataFrame with 200 rows and 5 columns
##           Top    p.value       FDR summary.AUC     AUC.2
##     <integer>  <numeric> <numeric>   <numeric> <numeric>
## 37          1 0.00164939  0.306746      0.6828    0.6828
## 44          2 0.00306746  0.306746      0.3280    0.3280
## 162         3 0.00465607  0.310404      0.3356    0.3356
## 199         4 0.01345722  0.615416      0.6436    0.6436
## 116         5 0.01538540  0.615416      0.3592    0.3592
## ...       ...        ...       ...         ...       ...
## 170       196   0.986249  0.996212      0.4988    0.4988
## 188       197   0.986249  0.996212      0.5012    0.5012
## 198       198   0.986249  0.996212      0.4988    0.4988
## 197       199   0.997250  1.000000      0.4996    0.4996
## 146       200   1.000000  1.000000      0.5000    0.5000

# Top gene agrees with the reference R implementation:
wilcox.test(resids[37,groups==1], resids[37,groups==2])$p.value ## [1] 0.001649387 # And the residual values match up to a reference calculation: ref <- lm.fit(t(blah), x=model.matrix(~covariate)) all.equal(ref$residuals[,37], resids[37,])
## [1] TRUE


So, there's nothing wrong on my end. Maybe Seurat::FindMarkers() just doesn't like negative values? The Wilcoxon test doesn't have any limits on the range of values it can operate over, so there are no theoretical problems with supplying negatives. (There are a whole bunch of other theoretical problems with using residuals in hypothesis tests, but all the $p$-values from these tests are broken anyway, so these concerns barely matter.)

0
Entering edit mode

Many thanks Aaron! Especially your comment about "the dangers of regression in single-cell applications" was quite useful for me, as I was not fully aware of potential troubles. After further reading of this chapter in OSCA and this and this question, I am more convinced that in my case, regression was not the ideal tool for dealing with cell cycle effects.

I will try to use the "CC driver genes removal strategy" described here (or alternatively in a comment of the second question above).