Question: too many genes with low read counts called differentially expressed in DESeq2
2.9 years ago by
anand_mt80 wrote:

Hi all,

I am analyzing an RNA-seq dataset (treated vs. control) with three replicates in each group. After performing differential expression with DESeq2, there are way too many differentially expressed genes! Most of them have low read counts (<5). Here is the MA plot, with a vertical line at read count 5.

Here is my summary of results:

>summary(dds.24h.res, alpha = 0.01)

out of 41405 with nonzero total read count
LFC > 0 (up)     : 7163, 17%
LFC < 0 (down)   : 6146, 15%
outliers [1]     : 0, 0%
low counts [2]   : 11791, 28%
(mean count < 1)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

Then I looked at the results metadata for the filtering threshold:

>metadata(dds.24h.res)$filterThreshold
48.74028% 
0.7043008 

It's too low! I then tried turning independentFiltering off (FALSE), yet I get the same results (too many DEGs):

>summary(res.24.noFilt, alpha = 0.01)

out of 41405 with nonzero total read count
adjusted p-value < 0.01
LFC > 0 (up)     : 6910, 17%
LFC < 0 (down)   : 6056, 15%
outliers [1]     : 0, 0%
low counts [2]   : 0, 0%
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

I also manually filtered the raw counts (row mean > 5) before running DESeq, but I still get many DEGs. The PCA shows nice clustering of samples and replicates, so no problem there. Is there anything I am missing?

Thanks,
-Anand.

deseq2 rna-seq

Answer: too many genes with low read counts called differentially expressed in DESeq2

2.9 years ago by
Michael Love wrote:

Hi Anand,

"there are way too many differentially expressed genes! Most of them have low read counts (<5)"

There is nothing here which suggests to me that the DE calls are wrong or that the filtering threshold is too low. I would guess that your treatment induces very strong differences, and that the amount of biological and technical variation among replicates is low compared to the differences across treatment. You can get a sense of this by looking at a PCA plot.

I don't think that most of the DE genes have read count < 5; you could actually quantify this by binning the genes with low p-values by mean count. There is such a plot in the vignette. A gene with a mean count of 5 can still have a strong DE signal, for example if you have 0,0,0 vs 10,10,10 in an experiment with low biological and technical variation among replicates.
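[Editor's note] The binning Michael suggests (stratifying low-p-value genes by mean count) can be sketched in a few lines of Python. The gene tuples, bin edges, and alpha below are made-up illustration, not Anand's data:

```python
# Count significant genes (padj < alpha) per mean-count bin.
# Each tuple is (mean normalized count, adjusted p-value) -- toy values.
genes = [(2.1, 0.001), (4.5, 0.5), (8.0, 0.002), (50.0, 0.0001),
         (3.2, 0.004), (700.0, 0.03), (120.0, 0.005)]

bins = [(0, 5), (5, 20), (20, 100), (100, float("inf"))]
alpha = 0.01

counts_per_bin = {}
for lo, hi in bins:
    n_sig = sum(1 for mean, padj in genes if lo <= mean < hi and padj < alpha)
    counts_per_bin[(lo, hi)] = n_sig

print(counts_per_bin)
```

If most significant genes really did fall in the lowest bin, a table like this would show it directly; strong treatments usually yield significant genes across the whole count range.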
Also, regarding too many DE genes: there was a recent similar question on the support site, and I recommended that they take a look at the DESeq2 paper, specifically the part on the use of tests incorporating a log2 fold change threshold.

Hi Anand and Michael,

To illustrate, I also have some trouble with a few genes that have low read counts but are DE even with a high log2FC threshold. Here are my top DE genes for padj < 0.01 and |log2FC| > 2. See for example gene G1: log2FC = -22.01, BUT the mean of condition 1 (BC.mean) = 59.93 and 0 for condition 2 (BV.mean).

So, first: how should one deal with these "low read-count" genes? Clearly the filtering on log2FC is not relevant here. Moreover, it is not possible to validate these genes with qPCR because of the low reads.

Second: are these genes DE or not according to the model used in DESeq2? My design has two conditions (virus-infected versus control) and two genotypes, B and A. When I analysed the data according to "Example 2" from the DESeq2 vignette (which fits my design adequately), I obtained these genes; whereas when I analysed the design as two conditions only (B virus / B control), these genes are not DE. Any idea? Comment?
names  BC.mean  BV.mean  AC.mean  AV.mean  log2FC  padj
G1     59.93    0.00     1048.90  1263.14  -22.01  1.30E-15
G4     8.90     0.41     62.91    110.18   -4.35   2.87E-03
G5     4.40     0.00     110.33   102.44   -5.28   1.56E-03
G6     97.07    0.00     1893.44  1897.97  -9.61   5.94E-04
G7     24.81    2.58     302.67   147.67   -3.27   1.88E-03
G8     67.37    16.34    35.24    60.79    -2.05   8.63E-05
G9     63.68    0.00     1530.28  1982.28  -9.00   2.79E-04
G10    18.91    0.00     365.41   336.67   -7.26   4.10E-03

written 2.9 years ago by cagenet34

I'll reply to your post here: DESeq2: DE genes with high log2FC and few or null counts in some samples

written 2.9 years ago by Michael Love

Answer: too many genes with low read counts called differentially expressed in DESeq2

2.8 years ago by
anand_mt80 wrote:

Hi Michael,

Thank you for the reply; you were right about the effect of the treatment. I have a few more questions. My data here come from kallisto, and I have summarized the counts and TPM values to the gene level. However, for a few genes the TPM values are very different between the two conditions, yet the gene is not differentially expressed, or its log2 fold change is very low, in the DESeq2 results. For example, here the TPM values differ by about two-fold between treatment and control, but the normalized counts show no difference. I also calculated FPKM values with fpkm():

condition  counts  normalizedCounts  TPM        FPKM
treat1     17334   11569.34          85.75486   36.79019
treat2      8242   15883.92          130.43380  50.51046
treat3      8777   15182.55          121.49322  48.28013
control1   41283   13487.88          222.30643  42.89112
control2   11569   14882.25          259.38771  47.32518
control3   14657   14369.24          252.28415  45.69382

I am a bit confused about how to explain this difference between the TPM values and the DESeq2 result. I know TPM/FPKM are not comparable between samples, but if I have to draw a heatmap, which values are better: TPM or rlog-transformed?

Thanks again,
-Anand.

Are you using the tximport package to build the DESeqDataSet?
You should if you want to make a reasonable comparison here. See the tximport or DESeq2 vignette.

written 2.8 years ago by Michael Love

Hi Michael,

Thanks for the suggestion. I used tximport and repeated DESeq2, but the result is still the same as above:

txi = tximport(files = u87.coldf$file, type = 'kallisto', tx2gene = t2g, reader = fread)

txi.dds = DESeqDataSetFromTximport(txi = txi, colData = u87.coldf, design = ~Batch + Condition)

txi.dds = DESeq(txi.dds)  # fit the model before extracting results

txi.res = results(object = txi.dds, contrast = c('Condition', 'treated', 'control'))

Table for same gene:

condition  counts  normalizedCounts  TPM        FPKM
treat1     17334   11841.23          85.75486   103.3063
treat2      8242   15588.66          130.43380  135.9999
treat3      8777   14805.13          121.49322  129.1642
control1   41283   14278.97          222.30643  124.5738
control2   11569   15200.54          259.38771  132.6139
control3   14657   14798.43          252.28415  129.1057

Looking at the TPM values, this gene's expression changes by over two-fold.

Why are the counts exactly the same after using tximport? Are you sure you are computing from a new DESeqDataSet?

Also, what does the batch look like? You can't just look at the normalized counts alone if you have batch variables in the design. What does the full colData(dds) look like?

Hi Michael,

Yes, I am using the proper DESeqDataSet generated from tximport. There are two batches: batch one contains samples treat1 and control1, which were sequenced in May at high coverage, whereas batch two contains the rest of the samples, sequenced at relatively low coverage in July. The colData looks like this:

            condition   batch                     files
control_1   control     batch1  control_1/abundance.tsv
control_2   control     batch2  control_2/abundance.tsv
control_3   control     batch2  control_3/abundance.tsv
treated_1   treated     batch1  treated_1/abundance.tsv
treated_2   treated     batch2  treated_2/abundance.tsv
treated_3   treated     batch2  treated_3/abundance.tsv

I am also sharing the raw data and the reproducible code that I used, in case you want to take a look: https://www.dropbox.com/s/ffswpwv6qa9ctft/bioconductor_85509.zip?dl=0 (it's about 20 MB).

I am confused by the discrepancy between the DESeq2 results and the TPM values.


DESeq2 (and edgeR and limma) use a robust estimation procedure for library size (which also takes into account feature length if you use the tximport pipeline), compared to TPM which cannot be described as robust. An example is if you have large changes in expression across condition for some genes -- as it seems you do from the MA plots -- this will induce large changes on the TPMs of all other genes, even if the true expression of these genes did not change, because the TPM is restricted to sum to 1e6. In contrast, the library size procedure in DESeq2 is robust to large differential expression for some but not all genes. If there is a discrepancy between TPM and normalized counts (where the normalized counts are derived using the tximport pipeline), I'd guess it has to do with other genes with large differences across condition.
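[Editor's note] Michael's point about TPM not being robust can be illustrated with a toy two-sample calculation. All numbers, and the equal-gene-length simplification, are mine and not from the thread:

```python
import statistics

# Five genes, equal lengths assumed, so TPM reduces to scaling counts
# to sum to 1e6. In sample B only gene 0 is truly induced (10x);
# the other four genes are unchanged.
counts_a = [10, 10, 10, 10, 10]
counts_b = [100, 10, 10, 10, 10]

def tpm(counts):
    total = sum(counts)
    return [c / total * 1e6 for c in counts]

tpm_a, tpm_b = tpm(counts_a), tpm(counts_b)
# The four unchanged genes appear ~2.8x lower in TPM in sample B:
print(tpm_a[1], tpm_b[1])  # 200000.0 vs ~71428.6

# Median-of-ratios size factor -- the idea behind DESeq2's
# estimateSizeFactors (which actually uses a geometric-mean
# pseudo-reference across samples):
ratios = [b / a for a, b in zip(counts_a, counts_b)]
size_factor_b = statistics.median(ratios)  # median of [10, 1, 1, 1, 1] = 1
norm_b = [c / size_factor_b for c in counts_b]
print(norm_b[1])  # unchanged gene stays at 10
```

The single induced gene drags down the TPM of every unchanged gene, while the median-of-ratios size factor ignores it and leaves the unchanged genes untouched.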

Another point to make is that here treatment seems to induce very large changes across many genes, and so any effort at normalization becomes difficult. Is your goal to identify any genes which changed at all? If you increase your sample size by 1 or 2 per group, you would likely gain thousands of genes given the MA plot. You might therefore consider using a log fold change threshold to define a more manageable set of differentially expressed genes (see DESeq2 paper and vignette).

Hi Michael,

Thanks for the detailed clarification! Yes, the drug being used here is a global repressor, so a huge change is expected, and we were looking for robust DEGs. For now I am using a fold-change threshold on top of the FDR cutoff to filter the list. I have one more question:

Since most of the postdocs in the lab are used to expression units (TPM/FPKM), how do you suggest estimating such values here? In my opinion, using DESeq2::fpkm() on the tximport dds seems to give reasonable expression values (proportional to the normalized counts / DESeq2 results). But the same cannot be said for fpm() (it's still messy).
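[Editor's note] For what it's worth, FPKM and TPM differ only by a per-sample rescaling, which is easy to show with toy numbers (the gene lengths and counts below are hypothetical):

```python
# Toy illustration of the FPKM -> TPM relationship for one sample:
# TPM is FPKM rescaled so that the per-sample sum is 1e6.
lengths_kb = [1.0, 2.0, 0.5]   # hypothetical gene lengths in kilobases
counts = [100, 400, 25]        # hypothetical read counts

million_reads = sum(counts) / 1e6
fpkm = [c / million_reads / l for c, l in zip(counts, lengths_kb)]
tpm = [f / sum(fpkm) * 1e6 for f in fpkm]

print([round(t) for t in tpm])  # sums to 1e6 by construction
```

Because the rescaling is per sample, neither unit fixes the cross-sample comparability problem discussed above.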

EDIT: I was looking at the vignette section on the rlog transformation (section 2.1.1, Blind dispersion estimation) and it says:

However, blind dispersion estimation is not the appropriate choice if one expects
that many or the majority of genes (rows) will have large differences in counts which
are explainable by the experimental design, and one wishes to transform the data
for downstream analysis.

This appears to be my case, so should blind be FALSE for a better heatmap / downstream processing?


I don't have a strong preference between TPM and FPKM. TPM are much easier to explain. The least confusing thing would be to present the gene-level TPM derived from the upstream software if you are using tximport. But if you're trying to explain about the difference between what you see with TPM and what you see in the MA plot above, you can use this contrived example to try to demonstrate the difference:

Suppose we have 5 genes only, and we are using TPT = transcripts per ten, as a measure of expression. Let's say the control sample has true transcript counts [1,1,1,1,6]. And then there is a treated sample with true transcript counts [1,1,1,1,12]. The first sample is already normalized as TPT because it sums to 10, but the second sample has TPT of [5/8, 5/8, 5/8, 5/8, 7 1/2]. Note that there is nothing wrong with the TPT, it accurately represents the proportion each gene contributes to the total, scaled such that the total is ten. However, increases in transcription for the genes with the highest TPT (or TPM) will affect the values for all the other genes, even if those genes did not undergo differences in the amount of transcription upon treatment.
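[Editor's note] The TPT arithmetic above can be checked mechanically with a few lines of Python:

```python
# "Transcripts per ten": scale each sample's counts to sum to 10.
control = [1, 1, 1, 1, 6]    # sums to 10, already in TPT units
treated = [1, 1, 1, 1, 12]   # sums to 16

def tpt(counts):
    total = sum(counts)
    return [c / total * 10 for c in counts]

print(tpt(control))  # [1.0, 1.0, 1.0, 1.0, 6.0]
print(tpt(treated))  # [0.625, 0.625, 0.625, 0.625, 7.5]
```

The four genes whose true counts never changed drop from 1.0 to 0.625 TPT, purely because the highly expressed gene doubled.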

Yes, I would recommend blind=FALSE here, and for most visualizations.

Excellent! A lot to think about. Sorry, one last question: in one of the comments above, you mentioned:

You can't just look at the normalized counts alone if you have batch variables in the design.

I understand that the batch variables are not reflected in the normalized counts. What's the rationale behind that sentence?

I just wanted to remind you that when you have a batch variable in the design, the LFC is not simply, e.g., the average of the group 2 normalized counts divided by the average of the group 1 normalized counts. The LFC is calculated within the GLM and takes differences due to batch into account.
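[Editor's note] A sketch of why the GLM matters here: with an unbalanced batch layout, a naive pooled-mean LFC and a within-batch estimate can disagree completely. All numbers below are hypothetical, and the within-batch comparison only stands in for what the GLM with ~batch + condition does properly:

```python
import math

# Hypothetical normalized counts for one gene. batch1 carries a 4x
# gene-specific batch effect, and the layout is unbalanced:
# controls: one batch1 sample + two batch2 samples
# treated:  three batch2 samples
control = {"batch1": [400.0], "batch2": [100.0, 100.0]}
treated = {"batch1": [],      "batch2": [200.0, 200.0, 200.0]}

def mean(xs):
    return sum(xs) / len(xs)

# Naive LFC from pooled group means ignores batch entirely:
naive_lfc = math.log2(mean(treated["batch1"] + treated["batch2"]) /
                      mean(control["batch1"] + control["batch2"]))
print(naive_lfc)  # 0.0 -- the batch effect cancels the treatment effect

# Comparing within batch2, the only batch containing both groups,
# recovers the true two-fold (LFC = 1) treatment effect:
adjusted_lfc = math.log2(mean(treated["batch2"]) / mean(control["batch2"]))
print(adjusted_lfc)  # 1.0
```

This is why eyeballing normalized counts can disagree with the reported LFC whenever batch is in the design.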