Question

High fold change reported for comparison of samples with zero values in all samples

Laura • 0

@305347ea

Last seen 3.0 years ago

Spain

Hi everybody,

I performed a differential gene expression analysis of RNA-seq data with edgeR, testing for DE genes with glmLRT. For each of the 8 localities in the experiment, DE comparisons were performed between 2 sample groups (treatment vs control) with 3 replicates each (i.e. controls_locality1 vs treatments_locality1; controls_locality2 vs treatments_locality2).

For some genes, the raw counts are zero in both sample groups of the same locality, yet a high log fold-change (LFC) was estimated (adjusted p-value > 0.05).

After consulting several posts and the edgeR manual, my basic understanding is that these LFC results come from the internal transformations and normalization (pseudo-count addition, library-size adjustment) that edgeR applies to the raw counts, so that zero counts are shifted to a value larger than zero and an FC can be estimated.

a) Is that correct?

b) Since I calculated DE between sample groups of the same locality, it is hard for me to understand how a gene with zero values in all six replicates can be highly up-regulated (e.g. logFC=8.3) in locality1.

Thank you very much for your help.

edgeR • 826 views

ADD COMMENT • link 3.0 years ago Laura • 0

score 0 · Answer 1 · 2021-04-15

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 16 hours ago

United States

The universal recommendation is to remove all genes with consistently low (or all zero) expression values prior to doing any modeling. It is quite easy to get large fold changes from very small values, and part of the analysis pipeline includes adding a small prior to eliminate zeros, and then adjusting for library size. So it's not unexpected that you could have large fold changes for all zeros, which, again, is why you should remove them first.

You can use filterByExpr to remove unexpressed genes.
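A minimal sketch of that filtering step, run on simulated counts rather than the data from the question. The group labels, gene names, and count distribution are made up for illustration; `DGEList`, `filterByExpr`, and `calcNormFactors` are edgeR functions.

```r
library(edgeR)

set.seed(1)

# Hypothetical toy design: 2 localities x (control, treatment) x 3 replicates
group <- factor(rep(c("C_loc1", "T_loc1", "C_loc2", "T_loc2"), each = 3))

# Simulated counts for 200 genes in 12 libraries (illustration only)
counts <- matrix(rnbinom(200 * 12, mu = 50, size = 5), nrow = 200,
                 dimnames = list(paste0("gene", 1:200), paste0("s", 1:12)))

# Make the first 20 genes unexpressed in every library
counts[1:20, ] <- 0

y <- DGEList(counts = counts, group = group)

# Logical vector: TRUE for genes with enough counts to be worth modelling
keep <- filterByExpr(y, group = group)

# Drop the filtered genes and recompute library sizes, then normalize
y <- y[keep, , keep.lib.sizes = FALSE]
y <- calcNormFactors(y)
```

Note that `filterByExpr` looks across all libraries, using the smallest group size as the minimum number of samples a gene must be expressed in. A gene that is zero in one pair of groups but expressed in enough other libraries can therefore still pass the filter.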

ADD COMMENT • link 3.0 years ago James W. MacDonald 65k

Hello, thank you so much for your response. In fact, I did filter: I kept only those genes with a CPM of at least 1 in at least three samples (the size of the smallest group of replicates).

dgefilt <- rowSums(cpm(dge)>=1) >= 3

I expected that those genes with counts below 1 CPM in 3 or more replicates would be removed.

TRINITY_DN12592_c0_g1 (sample names are <column>_<locality>, e.g. c1_sar):

locality     c1     c2     c3     t1     t2     t3
sar        0.00   0.00   0.00   0.00   0.00   0.00
can       57.04  37.41   0.00   0.00   0.00  54.99
cre        0.00   0.00  36.01   0.00   0.00   0.00
ali        0.00   0.00   0.00   0.00   0.03   0.00
cor        0.00   0.00   0.00   0.00   0.00   0.00
maj        0.00   7.81   0.00   3.07   0.00   0.00
cro        0.00   0.00   0.00   0.00   0.00   0.00
hal        0.00   0.00   0.00   0.00   0.00   0.00

The comparison between T_SARvsC_SAR estimated a logFC=8.3 for this gene.
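The following base-R sketch may help explain why this gene survived the filter. It applies the same rule (value >= 1 in at least 3 samples) to the 48 values listed for TRINITY_DN12592_c0_g1, first across all libraries and then to the six sar libraries only. It assumes the listed values are on the CPM scale, which the post does not state, and the variable names are illustrative.

```r
# Values for TRINITY_DN12592_c0_g1 copied from the post
# (assumption: they are on the CPM scale)
localities <- c("sar", "can", "cre", "ali", "cor", "maj", "cro", "hal")
samples <- paste0(rep(c("c1", "c2", "c3", "t1", "t2", "t3"), times = 8), "_",
                  rep(localities, each = 6))
x <- c(0, 0, 0, 0, 0, 0,               # sar
       57.04, 37.41, 0, 0, 0, 54.99,   # can
       0, 0, 36.01, 0, 0, 0,           # cre
       0, 0, 0, 0, 0.03, 0,            # ali
       0, 0, 0, 0, 0, 0,               # cor
       0, 7.81, 0, 3.07, 0, 0,         # maj
       0, 0, 0, 0, 0, 0,               # cro
       0, 0, 0, 0, 0, 0)               # hal
names(x) <- samples

# The rule from the post counts passing samples across all 48 libraries
n_pass <- sum(x >= 1)
kept <- n_pass >= 3

# The same rule restricted to the six sar libraries
sar <- x[grepl("_sar$", names(x))]
kept_in_sar <- sum(sar >= 1) >= 3

print(c(n_pass = n_pass, kept = kept, kept_in_sar = kept_in_sar))
```

Under that assumption, six libraries (in can, cre and maj) reach the threshold, so the gene is kept even though every sar library is zero. The filter is global, not per comparison.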

Thanks for your help.

ADD REPLY • link 3.0 years ago Laura • 0