We're analysing some RNA-seq data using limma voom. The design of the study is very similar to https://support.bioconductor.org/p/52920/. In the past I have used blocking (subject and treatment or time point) and then used voom with duplicateCorrelation. This time in addition to subject and treatment effects we're also trying to remove the effect of a contaminating cell line. There will be varying levels of contamination but there is a very specific gene that is expressed in this cell line that can be used to quantify the amount of contamination present in each sample.
The question I have is: do we add the marker gene together with the blocking variables (i.e. part of "treat") or regress out the residuals like what is commonly done with nuisance variables in single cell RNA seq (https://satijalab.org/seurat/v3.0/cellcyclevignette.html) prior to performing DGE analysis?
Is it really common to do that? I would be rather surprised, but maybe they are doing something different from what you describe.
Anyway, regressing things out and then fitting the model on the residuals isn't really copacetic, because you are removing a source of variation invisibly to the model you ultimately fit. Put a different way, you have to account for all degrees of freedom that are lost by computing statistics on the data, and if you do a two-stage model like that, the second model doesn't know about the loss of df from the first. This is why you see things like this, in the help for removeBatchEffect:
Note:
This function is not intended to be used prior to linear
modelling. For linear modelling, it is better to include the batch
factors in the linear model.
Yes, it is pretty common to see people removing the batch effect in scRNA-seq data - this is because >90% of scRNA-seq data analysis is exploratory and involves model-free methods like clustering, PCA, etc. Once people get to the point where they can actually do some rigorous statistical analysis, we have always recommended to account for the batch in the linear model, e.g., here.
Thank you both for your replies. Given that a surprising number of people are starting to use tools like Seurat for bulk RNA sequencing (I think it's the ease of generating the object and running the analysis) this is important to know for those of us with biology backgrounds. Cheers.
Given that a surprising number of people are starting to use tools like Seurat for bulk RNA sequencing
This comment made me snort so hard, my taco came out of my nose. Seriously? Who are these people? A standard bulk RNA-seq data analysis is <10 lines of code, and if they're finding that too hard... well, they probably shouldn't quit their day jobs.
I have to say, I personally really appreciate your contributions to the bioconductor community (especially the scRNAseq packages). Even more reason for me to be worried about the taco causing you problems.
No worries, the rest of the taco went down pretty quickly. Waste not, want not.
Anyway, onto the thread. Some of the comments about using the Wilcoxon rank sum test are quite concerning, see Gordon's comments here. Even for single-cell data, the validity of the Wilcoxon p-values is dubious - in addition to the problems mentioned by Gordon, the sheer number of ties at low counts and zeroes will distort the distribution of the test statistic.
The use of Seurat as a wrapper around DESeq2 is less objectionable. However, this doesn't mean it's a good strategy. Absent actual work done by the wrapper, it introduces an unnecessary layer of re-direction that has practical consequences. From a developer perspective, there is more scope for bugs, e.g., if an argument is not passed along properly. From a user perspective, the most relevant documentation is no longer the first port of call, and you end up with situations like this entire thread.
The real solution would be for Seurat to use standard data structures like the SummarizedExperiment (or even better, a SingleCellExperiment), so that you could seamlessly interoperate between Seurat and Bioconductor packages. But that's a long, long rant for another day and on another forum.
Yes, it is pretty common to see people removing the batch effect in scRNA-seq data - this is because >90% of scRNA-seq data analysis is exploratory and involves model-free methods like clustering, PCA, etc. Once people get to the point where they can actually do some rigorous statistical analysis, we have always recommended to account for the batch in the linear model, e.g., here.
Thank you both for your replies. Given that a surprising number of people are starting to use tools like Seurat for bulk RNA sequencing (I think it's the ease of generating the object and running the analysis) this is important to know for those of us with biology backgrounds. Cheers.
This comment made me snort so hard, my taco came out of my nose. Seriously? Who are these people? A standard bulk RNA-seq data analysis is <10 lines of code, and if they're finding that too hard... well, they probably shouldn't quit their day jobs.
Unfortunately I can only up-vote this one time. Otherwise I would have run it up to at least 100!
You should at least consider your potential as A) standup comedian or B) comedy writer. That's some comedy gold right there.
This is what I meant:https://bioinformatics.stackexchange.com/questions/5168/seurat-for-clustering-bulk-rna-seq/5204#5204 I truly hope you can stomach the thread and it's not too detrimental to your health. On a serious note, there's no need to be mean. The Seurat authors themselves were suggesting it's use for bulk RNA seq. https://github.com/satijalab/seurat/issues/826
I have to say, I personally really appreciate your contributions to the bioconductor community (especially the scRNAseq packages). Even more reason for me to be worried about the taco causing you problems.
No worries, the rest of the taco went down pretty quickly. Waste not, want not.
Anyway, onto the thread. Some of the comments about using the Wilcoxon rank sum test are quite concerning, see Gordon's comments here. Even for single-cell data, the validity of the Wilcoxon p-values is dubious - in addition to the problems mentioned by Gordon, the sheer number of ties at low counts and zeroes will distort the distribution of the test statistic.
The use of Seurat as a wrapper around DESeq2 is less objectionable. However, this doesn't mean it's a good strategy. Absent actual work done by the wrapper, it introduces an unnecessary layer of re-direction that has practical consequences. From a developer perspective, there is more scope for bugs, e.g., if an argument is not passed along properly. From a user perspective, the most relevant documentation is no longer the first port of call, and you end up with situations like this entire thread.
The real solution would be for Seurat to use standard data structures like the
SummarizedExperiment
(or even better, aSingleCellExperiment
), so that you could seamlessly interoperate between Seurat and Bioconductor packages. But that's a long, long rant for another day and on another forum.