Question

too many genes with low read-counts as differential expressed in DESeq2

4

Entering edit mode

anand_mt ▴ 90

@anand_mt-10016

Last seen 3.1 years ago

Hi all,

I am analyzing an rna-seq dataset (treated vs control) with three replicates in each. After performing differential expression with DESeq2, there are way too many differentially expressed genes ! Most of them are with low readcounts (<5). Here is MA plot, vertical line at readcount 5.

Here is my summary of results:

>summary(dds.24h.res, alpha = 0.01)

out of 41405 with nonzero total read count
adjusted p-value < 0.01
LFC > 0 (up)     : 7163, 17% 
LFC < 0 (down)   : 6146, 15% 
outliers [1]     : 0, 0% 
low counts [2]   : 11791, 28% 
(mean count < 1)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

Then I looked at metadata of results for threshold:

>metadata(dds.24h.res)$filterThreshold

48.74028% 
0.7043008

Its too low! Then I tried turning of independentFiltering false, yet I have the same results (too many DEGs).

> summary(res.24.noFilt, alpha = 0.01)

out of 41405 with nonzero total read count
adjusted p-value < 0.01
LFC > 0 (up)     : 6910, 17% 
LFC < 0 (down)   : 6056, 15% 
outliers [1]     : 0, 0% 
low counts [2]   : 0, 0% 
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

I also manually filtered raw counts (rowmean > 5) before running DESeq, but still I have many DEGs. PCA results show nice clustering of samples and replicates so no problem there.

Is there any thing I am missing ?

Thanks.

-Anand.

deseq2 rna-seq • 8.5k views

ADD COMMENT • link 7.7 years ago anand_mt ▴ 90

score 2 · Answer 1 · 2016-07-28

2

Entering edit mode

Michael Love 41k

@mikelove

Last seen 3 hours ago

United States

hi Anand,

"there are way too many differentially expressed genes ! Most of them are with low readcounts (<5)"

There is nothing here which suggests to me that the DE calls or the filtering is too low. I would guess that your treatment induces very strong differences, and that the amount of biological and technical variation among replicates is low compared to the differences across treatment. You can get a sense of this by looking at a PCA plot.

I don't think that most of the DE genes have read count < 5, you could actually quantify this by binning the genes which low p-value by mean count. There is such a plot in the vignette.

A gene with a mean count of 5 can still have strong DE signal, for example if you have 0,0,0 vs 10,10,10 in an experiment with low biological and technical variation among replicates.

Also, regarding, too many DE genes, there was a recent similar question on the support site, and I recommended that they take a look at the DESeq2 paper, specifically the part on the use of tests incorporating a log2 fold change threshold.

ADD COMMENT • link 7.7 years ago Michael Love 41k

0

Entering edit mode

Hi Anand and Michael,

To illustrate that I also have some troubles with few genes that have low read-counts but are DE with a high log2FC threshold. Here is my top DE genes for padj <0.01 and |log2FC| >2. See For example Gene G1 : lo2FC= - 22.01 BUT mean condition 1 (BC.mean)= 59.93 and 0 for condition2 (BV.mean).

So, First : How to deal with these "low read-counts" genes ?

Clearly here the filtering according to Log2FC is not relevant. Moreover It is not possible to validate these genes with QPCR because of low reads.

Second : These genes are DE or not according to the model used in DESeq2 : my design is two conditions (virus-infected versus control) two genotypes B and A. When I analysed the data according to the "Example 2" from DEseq2 vignette (which fit adequately to my design). I obtained these genes. Whereas when I analysed my design as 2 conditions only B Virus/ B control, these genes are not DE.

Any idea ? Comment ?

names	BC.m	BV.m	AC.m	AV.m	log2FC	padj
G1	59	0,00	1048,90	1263,14	-22,01	1,30E-15
G4	8,9	0,41	62,91	110,18	-4,35	2,87E-03
G5	4,40	0,00	110,33	102,44	-5,28	1,56E-03
G6	97,07	0,00	1893,44	1897,97	-9,61	5,94E-04
G7	24,81	2,58	302,67	147,67	-3,27	1,88E-03
G8	67,37	16,34	35,24	60,79	-2,05	8,63E-05
G9	63,68	0,00	1530,28	1982,28	-9,00	2,79E-04
G10	18,91	0,00	365,41	336,67	-7,26	4,10E-03

ADD REPLY • link 7.7 years ago cagenet34 ▴ 20

0

Entering edit mode

I'll reply to your post here: DESeq2 : DE genes witth high log2FC and few or null counts in some samples

ADD REPLY • link 7.7 years ago Michael Love 41k

score 0 · Answer 2 · 2016-08-25

0

Entering edit mode

anand_mt ▴ 90

@anand_mt-10016

Last seen 3.1 years ago

Hi Michael,

Thank you for the reply and you were rite about the affect of treatment.

I have few more questions. My data here comes from kallisto, I have summarized counts and TPM values to gene level. However, when I see for few genes TPM values are very diffrent between two conditions , but still the gene is not differentially expressed or log2Foldchage is too low in DESeq results. For ex here, TPM values differ by 2 fold change between treatment and control, but normalized counts show no diffrenece. Then I calculated FPKM values with `fpkm()`.

condition	counts	normalizedCounts	TPM	FPKM
treat1	17334	11569.34	85.75486	36.79019
treat2	8242	15883.92	130.43380	50.51046
treat3	8777	15182.55	121.49322	48.28013
control1	41283	13487.88	222.30643	42.89112
control2	11569	14882.25	259.38771	47.32518
control3	14657	14369.24	252.28415	45.69382

I am bit confused how to explain this difference between TPM values and DESeq result. I know TPM/FPKM are not comparable between samples, but if I have draw heat-map, which values are better? TPM or rlog transfomed ?

Thanks again,

-Anand.

ADD COMMENT • link 7.7 years ago anand_mt ▴ 90

2

Entering edit mode

Are you using the tximport package to build the DESeqDataSet? You should if you want to make a reasonable comparison here. See the tximport or DESeq2 vignette.

ADD REPLY • link 7.7 years ago Michael Love 41k

0

Entering edit mode

Hi Michael,

Thanks for suggestion. I used tximport and repeated DESeq but still result is same as above.

txi = tximport(files = u87.coldf$file, type = 'kallisto', tx2gene = t2g, reader = fread)

txi.dds = DESeqDataSetFromTximport(txi = txi, colData = u87.coldf, design = ~Batch + Condition)

txi.res = results(object = txi.dds, contrast = c('Condition', 'treated', 'control'))

Table for same gene:

condition	counts	normalizedCounts	TPM	FPKM
treat1	17334	11841.23	85.75486	103.3063
treat2	8242	15588.66	130.43380	135.9999
treat3	8777	14805.13	121.49322	129.1642
control1	41283	14278.97	222.30643	124.5738
control2 control3	11569 14657	15200.54 14798.43	259.38771 252.28415	132.6139 129.1057

If looking at TPM values, this genes expression changes by over 2 fold.

ADD REPLY • link 7.7 years ago anand_mt ▴ 90

0

Entering edit mode

Why are the counts exactly the same after using tximport? Are you sure you are computing from a new DESeqDataSet?

Also, what does the batch look like? You can't just look at the normalized counts alone if you have batch variables in the design. What does the full colData(dds) look like?

ADD REPLY • link 7.7 years ago Michael Love 41k

0

Entering edit mode

Hi Michael,

Yes I am using proper DESeqDataSet generated from tximport. There are two batches, batch one contains samples treat1 and control1 which were sequened in May at high coverage, whereas batch two contains rest of samples sequenced at relatively low coverage in July. coldata looks like this,

condition  batch                   files
control_1   control batch1 control_1/abundance.tsv
control_2   control batch2 control_2/abundance.tsv
control_3   control batch2 control_3/abundance.tsv
treated_1   treated batch1 treated_1/abundance.tsv
treated_2   treated batch2 treated_2/abundance.tsv
treated_3   treated batch2 treated_3/abundance.tsv

I am also sharing the raw data and reproducible code that I used in case you want to take a look. Here: https://www.dropbox.com/s/ffswpwv6qa9ctft/bioconductor_85509.zip?dl=0 (Its about 20mb)

I am confused regarding desecrepency between deseq results and tpm values.

ADD REPLY • link 7.7 years ago anand_mt ▴ 90

1

Entering edit mode

DESeq2 (and edgeR and limma) use a robust estimation procedure for library size (which also takes into account feature length if you use the tximport pipeline), compared to TPM which cannot be described as robust. An example is if you have large changes in expression across condition for some genes -- as it seems you do from the MA plots -- this will induce large changes on the TPMs of all other genes, even if the true expression of these genes did not change, because the TPM is restricted to sum to 1e6. In contrast, the library size procedure in DESeq2 is robust to large differential expression for some but not all genes. If there is a discrepancy between TPM and normalized counts (where the normalized counts are derived using the tximport pipeline), I'd guess it has to do with other genes with large differences across condition.

Another point to make is that here treatment seems to induce very large changes across many genes, and so any effort at normalization becomes difficult. Is your goal to identify any genes which changed at all? If you increase your sample size by 1 or 2 per group, you would likely gain thousands of genes given the MA plot. You might therefore consider using a log fold change threshold to define a more manageable set of differentially expressed genes (see DESeq2 paper and vignette).

ADD REPLY • link 7.7 years ago Michael Love 41k

0

Entering edit mode

Hi Michael,

Thanks for detailed clarification! Yes, the drug being used here is global repressor, so huge change is expected, and we were looking for robust DEGs. For now I am using fold change threshold on top of fdr to filter out the list. I have one more question,

Since most of the postdocs in lab are used to expression units (tpm/fpkm), how do you suggest to estimate such values here? In my opinion using `DESeq2::fpkm()` on tximport dds seems to give reasonable expression (proportional to normalized counts/DESeq results). But same cannot be said for fpm (it's still messy).

EDIT: I was looking at vignette on rlog transofrmation (section 2.1.1 Blind dispersion estimation) and it says -

However, blind dispersion estimation is not the appropriate choice if one expects
that many or the majority of genes (rows) will have large differences in counts which
are explainable by the experimental design, and one wishes to transform the data
for downstream analysis.

This appears to be my case, so blind should be FALSE for better heatmap/downstream processes ?

ADD REPLY • link 7.7 years ago anand_mt ▴ 90

1

Entering edit mode

I don't have a strong preference between TPM and FPKM. TPM are much easier to explain. The least confusing thing would be to present the gene-level TPM derived from the upstream software if you are using tximport. But if you're trying to explain about the difference between what you see with TPM and what you see in the MA plot above, you can use this contrived example to try to demonstrate the difference:

Suppose we have 5 genes only, and we are using TPT = transcripts per ten, as a measure of expression. Let's say the control sample has true transcript counts [1,1,1,1,6]. And then there is a treated sample with true transcript counts [1,1,1,1,12]. The first sample is already normalized as TPT because it sums to 10, but the second sample has TPT of [5/8, 5/8, 5/8, 5/8, 7 1/2]. Note that there is nothing wrong with the TPT, it accurately represents the proportion each gene contributes to the total, scaled such that the total is ten. However, increases in transcription for the genes with the highest TPT (or TPM) will affect the values for all the other genes, even if those genes did not undergo differences in the amount of transcription upon treatment.

Yes, I would recommend blind=FALSE here, and for most visualizations.

ADD REPLY • link 7.7 years ago Michael Love 41k

0

Entering edit mode

Excellent! A lot to think about. Sorry one last question, in one of the comment above, you mentioned -

You can't just look at the normalized counts alone if you have batch variables in the design.

I understand that the batch variables are not reflected in normalized counts. Whats the rational behind the sentence ?

ADD REPLY • link 7.7 years ago anand_mt ▴ 90

0

Entering edit mode

I just wanted to remind you that when you have a batch variable in the design, the LFC is not simply, e.g. the average of group 2 normalized counts divided by group 1 normalized counts. The LFC is calculated within the GLM and takes into account differences due to batch.