Question: Limma/DESeq2: Statistic and readcount of biological replicates
0
3.8 years ago by
Belgium

Hello!

I am currently analyzing a RNA-Seq experiment where two cells lines were transfected with 10 different inserts that should decrease the level of 10 different targets that are present for sure in those cells. My goal is to detect the effect of the decrease of a given target at the transcriptomic level. The experiment has the following informations:

## colData

name    cell    vector
CellA_1    A    vector1
CellB_1    B    vector1
...    ...   ...
CellA_10   A    vector10
CellB_10   B    vector10
CellA_empty    A    empty
CellB_empty    B    empty

An empty vector is used as control.

I ran limma-voom and DESeq2 to detect the differentially expressed genes. The following is a summary of the analysis:

## Design

limma

vector <- factor(colData$vector) cell <- factor(colData$cell)
design_Subtypes <- model.matrix(~0+ vector + cell )

DESeq2

design <- cell + vector

## Contrast

contrast.limma <- makeContrasts(vector1-empty, vector2-empty,...)
contrast.DESeq <- c("vector", "vector1", "empty")

## PCA plot

I also generated a PCA/MDS plots (here is the PCA plot using the rld counts from DESeq2 and an intgroup="vector"). The empty vector is represented by the first color in the legend. PCA2=1% and PCA1=97%. The samples on the right of the graph are the one from cell type B and the one on the left the one from cell type A.

## Analysis

At the end, I obtained differentially expressed genes for the contrasts Vector1-Empty, .... Vector10-Empty.

When pulled together, I obtained 36 targets with limma and 172 targets with DESeq2. Among them 8 are shared between the two softwares.

## Questions

I had a look to the DESeq2 normalized counts of those targets and found that among them I had cases like those one:

 CellA_Vec1 CellA_Vec2 CellA_Vec3 CellB_Vec1 CellB_Vec2 CellB_Vec3 GeneA 500,00 550,00 300,00 0,00 0,00 0,00 GeneB 0,00 0,00 1,00 10,00 20,00 15,00

So GeneA and GeneB are differentially expressed even if one of the two cell line is not expressing the Gene. This biological situation is expected but I was wondering the impact of such difference on the statist of limma/DESeq. To be precise, ~30% of the genes found by limma have this issue and 0.5% with DESeq2.

Question1: Should I remove those genes and consider them as false positives or the statistics behind limma/DESeq can deal this biological situation? If the statistics can handle that, could you try to explain me how or point me to the thing I should read again the DESeq2/limma vignette?

Question2: I was looking at the DESeq2 normalized count so far but is it an limma-voom equivalent that I could use detect weird "normalized" counts like here?

normalization limma deseq2 counts • 1.8k views
modified 3.8 years ago by Aaron Lun25k • written 3.8 years ago by Radek60
2
3.8 years ago by
Aaron Lun25k
Cambridge, United Kingdom
Aaron Lun25k wrote:

Your design assumes that the response to each vector is the same in both cell lines. This seems like a strong qualitative (and even stronger quantitative) assumption to me. A safer approach would be to do the experiment with multiple replicates for each cell line/vector combination, and then parametrize the design as a one-way layout where each cell line/vector combination is defined as a separate group. Nonetheless, what's done is done, so let's do the best we can with your current set-up.

For your first question, I suspect that the empty vectors also have counts of zero. This means that, e.g., for gene A and vector 1, you have a fold change of 0/0 for cell line B. This provides no evidence for DE but it also provides no evidence against DE, as it doesn't indicate that the fold change should be equal or different to 1. In fact, it provides no information at all, which means that significance is determined entirely by the difference between vectors for cell line A. This is not a false positive; for all we know, the true log-fold change due to vector 1 might be huge, but any fold change multiplied by an expression of zero in cell line B will always give back zero. Rather, the unintuitive nature of this result is due to your parametrization where the vector response is assumed to be replicated across cell lines. If you tested the response in each cell line individually, you'd probably find significance in cell line A but not B, as there's no evidence in the latter to support DE.

The above answer mostly pertains to GLMs where zeroes are explicitly considered, but the same roughly applies for voom. Log-CPMs computed from zero counts should have small precision weights, because the mean-variance relationship typically goes up at low abundances (and zero counts should be the lowest of them all). As such, a pair of zero counts for cell line B won't provide a lot of information on the vector 1 fold change, such that cell line A will again dominate the significance calculations.

For your second question, if you want to look at "normalized counts" of zero, you can use the cpm function in edgeR after setting prior.count=0. voom automatically adds 0.5 to its counts prior to log-transformation, so you'll only ever get log-CPMs out of it.

Thanks! That is what I suspected and you are right when the cell line A or B doesn't express any read for a given gene it is also the case for the empty vector so fold-change of 0/0.

Do I have to consider that for those genes I am running an "analysis without replicates"-like with all the problems it involves? From what I have red about it is that DESeq2 can do it but the result needs to be carefully analyzed and that limma can't compute the eBayes on an no replication analysis. So  following this reasoning, would it not "trick limma" and become an issue?

Even if they can't be considered as false positives as you explained, should I still keep them in my pool of differentially expressed genes for limma and DESeq? If this is more complicated than a simple "yes you should" or "no don't", does it exist a right way to decide?

If you have a large number of zeros, the true residual d.f. in the design may be lower than what limma expects. At the most extreme, if you have a gene where all the counts for cell B are zero, you should have no residual d.f. to estimate the variance. limma won't acknowledge this as it works with log-CPMs and doesn't really have a concept of a zero count. This means that the variance will probably be underestimated for these genes, such that the p-values will also be underestimated (though this should be mitigated to some extent by EB shrinkage - set robust=TRUE in eBayes for more protection). If you want to handle it properly, you should try edgeR's QL framework - check out ?glmQLFit for more details.

For your second question, I reckon you should keep them in. These genes are putatively DE in cell line A upon vector treatment, which seems like it would be something that's biologically interesting. The fact that those genes are not expressed in cell line B doesn't make the vector response in A any less interesting - besides, that's largely beyond your control.

The text in the DESeq2 help about experiments without replicates doesn't really apply here, because you have if i count correctly 10 residual degrees of freedom for estimating the dispersion. The text in ?DESeq is for when there are no residual degrees of freedom, and in this case a warning is printed to console about "Experiments without replicates...".

I think the point is that, you could separate them out in the results descriptively, of the total set of targets with small adjusted p-value, X were only expressed in one cell. And the way to have better inference than this is to perform with replicates within each cell.