Hello!
I am currently analyzing a RNA-Seq experiment where two cells lines were transfected with 10 different inserts that should decrease the level of 10 different targets that are present for sure in those cells. My goal is to detect the effect of the decrease of a given target at the transcriptomic level. The experiment has the following informations:
colData
name cell vector CellA_1 A vector1 CellB_1 B vector1 ... ... ... CellA_10 A vector10 CellB_10 B vector10 CellA_empty A empty CellB_empty B empty
An empty vector is used as control.
I ran limma-voom and DESeq2 to detect the differentially expressed genes. The following is a summary of the analysis:
Design
limma
vector <- factor(colData$vector) cell <- factor(colData$cell) design_Subtypes <- model.matrix(~0+ vector + cell )
DESeq2
design <- cell + vector
Contrast
contrast.limma <- makeContrasts(vector1-empty, vector2-empty,...) contrast.DESeq <- c("vector", "vector1", "empty")
PCA plot
I also generated a PCA/MDS plots (here is the PCA plot using the rld counts from DESeq2 and an intgroup="vector"). The empty vector is represented by the first color in the legend. PCA2=1% and PCA1=97%. The samples on the right of the graph are the one from cell type B and the one on the left the one from cell type A.
Analysis
At the end, I obtained differentially expressed genes for the contrasts Vector1-Empty, .... Vector10-Empty.
When pulled together, I obtained 36 targets with limma and 172 targets with DESeq2. Among them 8 are shared between the two softwares.
Questions
I had a look to the DESeq2 normalized counts of those targets and found that among them I had cases like those one:
CellA_Vec1 | CellA_Vec2 | CellA_Vec3 | CellB_Vec1 | CellB_Vec2 | CellB_Vec3 | |
GeneA | 500,00 | 550,00 | 300,00 | 0,00 | 0,00 | 0,00 |
GeneB | 0,00 | 0,00 | 1,00 | 10,00 | 20,00 | 15,00 |
So GeneA and GeneB are differentially expressed even if one of the two cell line is not expressing the Gene. This biological situation is expected but I was wondering the impact of such difference on the statist of limma/DESeq. To be precise, ~30% of the genes found by limma have this issue and 0.5% with DESeq2.
Question1: Should I remove those genes and consider them as false positives or the statistics behind limma/DESeq can deal this biological situation? If the statistics can handle that, could you try to explain me how or point me to the thing I should read again the DESeq2/limma vignette?
Question2: I was looking at the DESeq2 normalized count so far but is it an limma-voom equivalent that I could use detect weird "normalized" counts like here?
Thanks in advance for your answers!
Thanks! That is what I suspected and you are right when the cell line A or B doesn't express any read for a given gene it is also the case for the empty vector so fold-change of 0/0.
Do I have to consider that for those genes I am running an "analysis without replicates"-like with all the problems it involves? From what I have red about it is that DESeq2 can do it but the result needs to be carefully analyzed and that limma can't compute the eBayes on an no replication analysis. So following this reasoning, would it not "trick limma" and become an issue?
Even if they can't be considered as false positives as you explained, should I still keep them in my pool of differentially expressed genes for limma and DESeq? If this is more complicated than a simple "yes you should" or "no don't", does it exist a right way to decide?
If you have a large number of zeros, the true residual d.f. in the design may be lower than what
limma
expects. At the most extreme, if you have a gene where all the counts for cell B are zero, you should have no residual d.f. to estimate the variance.limma
won't acknowledge this as it works with log-CPMs and doesn't really have a concept of a zero count. This means that the variance will probably be underestimated for these genes, such that the p-values will also be underestimated (though this should be mitigated to some extent by EB shrinkage - setrobust=TRUE
ineBayes
for more protection). If you want to handle it properly, you should tryedgeR
's QL framework - check out?glmQLFit
for more details.For your second question, I reckon you should keep them in. These genes are putatively DE in cell line A upon vector treatment, which seems like it would be something that's biologically interesting. The fact that those genes are not expressed in cell line B doesn't make the vector response in A any less interesting - besides, that's largely beyond your control.
The text in the DESeq2 help about experiments without replicates doesn't really apply here, because you have if i count correctly 10 residual degrees of freedom for estimating the dispersion. The text in ?DESeq is for when there are no residual degrees of freedom, and in this case a warning is printed to console about "Experiments without replicates...".
I think the point is that, you could separate them out in the results descriptively, of the total set of targets with small adjusted p-value, X were only expressed in one cell. And the way to have better inference than this is to perform with replicates within each cell.
Thanks to both of you! I understand more what's going on here.