Hello everyone, I am working on scRNA-Seq data analysis and I have a technical question. We can combine different scRNA-Seq experiments with batch correction methods such as MNN or CCA. As far as I know, we should account for the batch effect when doing differential expression analysis; for example, the Scater/Scran packages provide a block parameter for this.
But the point is, if our data sets come from different conditions (let's say healthy and disease), the real source of the batch effect is the condition, and we want to compare the transcriptomes of specific cell types between conditions, what should we do? We cannot block on batch or correct for it, since batch and condition coincide and the condition effect is exactly what we want to see.
I identified the cell types of the clusters in each data set separately. Now I have 2 data sets from 2 conditions, I know which cluster is which cell type, and I want to compare specific cell types between these 2 conditions.
Would it be okay to treat the scRNA-Seq data as bulk RNA-Seq data and use the raw counts (after removing non-expressed genes, of course) with methods such as DESeq2 or edgeR?
Thank you in advance.
This may be of interest, as well as the vignettes here and here. Note that the last link is getting fixed in the next build cycle to use the summed counts directly, so just ignore the weighting part.
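For readers who have not seen the summed-counts ("pseudo-bulk") idea before, here is a minimal sketch of the summation step only. It is plain Python rather than the R/Bioconductor code in the linked vignettes, and the function name `pseudobulk`, the toy counts, and the sample and cell type labels are all hypothetical, chosen for illustration.

```python
from collections import defaultdict


def pseudobulk(counts, sample_ids, cell_types):
    """Sum raw counts over all cells sharing a (sample, cell type) label.

    counts     -- list of per-cell count vectors (one integer per gene)
    sample_ids -- sample of origin for each cell
    cell_types -- annotated cell type for each cell
    Returns a dict mapping (sample, cell_type) to a summed count vector.
    """
    if not (len(counts) == len(sample_ids) == len(cell_types)):
        raise ValueError("need exactly one sample id and one cell type per cell")
    n_genes = len(counts[0]) if counts else 0
    sums = defaultdict(lambda: [0] * n_genes)
    for cell, sample, ctype in zip(counts, sample_ids, cell_types):
        profile = sums[(sample, ctype)]
        for gene_index, count in enumerate(cell):
            profile[gene_index] += count
    return dict(sums)


# Toy data (hypothetical): 5 cells x 3 genes, two samples, two cell types.
counts = [[1, 0, 2], [0, 3, 1], [2, 2, 0], [4, 0, 0], [1, 1, 1]]
sample_ids = ["healthy_1", "healthy_1", "healthy_1", "disease_1", "disease_1"]
cell_types = ["T", "T", "B", "T", "T"]

profiles = pseudobulk(counts, sample_ids, cell_types)
for key in sorted(profiles):
    print(key, profiles[key])
```

Each summed profile then plays the role of one bulk RNA-Seq library for that cell type, so the number of replicates in a downstream DESeq2 or edgeR analysis is the number of samples, not the number of cells.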
Ah yes, I forgot about your Biostatistics paper (1st link) exploring just this question. Thanks.
Hello again Aaron, Micheal, and thank you for your answers. I see that summing counts to create a pseudo-bulk RNA-Seq profile is an effective method and is suitable for specific cell types, but I face another question. What if we don't have replicates, meaning only 1 data set for healthy and 1 data set for disease? Summing counts will create pseudo-bulk RNA-Seq data without replicates, and without replicates standard statistical methods are not suitable. Do you think using raw counts without summing and treating each cell as a replicate is a proper way to do it (with DESeq2, for example)?
Thank you for all your answers.
Is it "proper"? No. You're treating the cells as replicates, which is inappropriate for various reasons. The most obvious is that, if you have multiple cell types/states in a cluster, your replicates will exhibit hidden correlations that compromise the DE analysis. Even more problematic is the fact that cells are not units of experimental replication, and treating them as such makes little sense.
To understand the latter reason, imagine what would happen if you or someone else tried to repeat the experiment. In the vast majority of cases, you will not have access to the same population of cells. Rather, you will generate a new population of cells from a different sample (e.g., patient, animal, cell culture) to use in your experiment. These samples are the relevant units of replication in an experimental context, not the individual cells themselves. Indeed, all of classical hypothesis testing is about the long-run expected results from repeated experiments (e.g., type I error rates, expected FDRs), so this is what the replication should reflect.
If you only have one sample per condition, you are in the same position as if you have only one bulk RNA-seq sample. Cell-to-cell variability doesn't tell you anything about the sample-to-sample variability; you can have highly heterogeneous populations (high cell-to-cell variability) that are very consistent across samples, as well as populations that consist of one cell type/kind (low cell-to-cell variability) but that cell type/kind differs across samples.
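The point above can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not anything from the linked paper or packages: one gene, normally distributed log-expression, one sample per condition, no true condition effect, and a plain Welch t-test across cells with a normal-approximation cutoff of 1.96. The function names and parameter values are made up for illustration.

```python
import random
import statistics


def welch_t(x, y):
    """Welch t-statistic for two groups of cells."""
    se = (statistics.variance(x) / len(x) + statistics.variance(y) / len(y)) ** 0.5
    return (statistics.mean(x) - statistics.mean(y)) / se


def false_positive_rate(sd_sample, sd_cell=1.0, n_cells=200, n_sims=300, seed=1):
    """Fraction of null experiments called 'DE' when cells are treated as replicates.

    Each simulated experiment has one sample per condition and NO condition
    effect. Each sample gets its own random offset (sample-to-sample
    variability, sd_sample); cells scatter around that offset (sd_cell).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        mean_a = rng.gauss(0.0, sd_sample)  # sample-level offset, condition A
        mean_b = rng.gauss(0.0, sd_sample)  # sample-level offset, condition B
        cells_a = [rng.gauss(mean_a, sd_cell) for _ in range(n_cells)]
        cells_b = [rng.gauss(mean_b, sd_cell) for _ in range(n_cells)]
        if abs(welch_t(cells_a, cells_b)) > 1.96:  # nominal 5% cutoff
            hits += 1
    return hits / n_sims


rate_no_sample_effect = false_positive_rate(sd_sample=0.0)
rate_with_sample_effect = false_positive_rate(sd_sample=1.0)
print("no sample-to-sample variability:  ", rate_no_sample_effect)
print("with sample-to-sample variability:", rate_with_sample_effect)
```

When samples do not differ from each other, the cell-level test holds roughly its nominal error rate. As soon as samples vary, the test sees only the tight cell-to-cell spread around each sample's own mean and calls most null genes significant, because nothing in a single sample per condition measures how far those means would move in a repeat experiment.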
Having said that, people frequently pretend that cells are replicates (including me, e.g., here). Sounds bad, but the excuse is that the aim of the analysis is to simply rank the genes to identify good markers for particular clusters. Cell-to-cell variability becomes relevant in such cases as the most appealing markers are consistently upregulated in one population compared to another. Importantly: we don't bother interpreting the magnitude of the p-value to define significant DE genes, for reasons related to the replication described above, and also because of the circularity of computing p-values from the same data used to define clusters.
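As a rough sketch of "rank the genes, ignore the p-values", the snippet below scores each gene by the probability that a random cell from one cluster has higher expression than a random cell from another (an AUC-style statistic) and sorts by that score. This statistic is swapped in here for simplicity and is not necessarily what the linked workflow uses; the function names, gene names, and expression values are hypothetical.

```python
def auc(cluster_a, cluster_b):
    """Probability that a random cell in A exceeds a random cell in B (ties count half)."""
    wins = 0.0
    for a in cluster_a:
        for b in cluster_b:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(cluster_a) * len(cluster_b))


def rank_markers(expression):
    """Rank genes from most to least consistently upregulated in cluster A.

    expression -- dict mapping gene name to (values in cluster A, values in cluster B)
    Returns a list of (gene, score) pairs sorted by descending score.
    """
    scores = {gene: auc(a, b) for gene, (a, b) in expression.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy expression values (hypothetical), three cells per cluster.
expression = {
    "geneX": ([5, 6, 7], [1, 2, 3]),  # consistently higher in A
    "geneY": ([1, 2, 3], [1, 2, 3]),  # no difference
    "geneZ": ([0, 0, 1], [4, 5, 6]),  # consistently lower in A
}

ranking = rank_markers(expression)
for gene, score in ranking:
    print(gene, score)
```

The output is an ordering of candidate markers only. It says nothing about whether the difference would reproduce in another sample, which is why the scores are not turned into significance calls.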
If you care about having valid p-values to define DE genes, then you need replicate samples. This is no different from bulk RNA-seq; you can't talk your way out of it by pretending cells are replicates.