Question

library size and fold changes

Bogdan ▴ 670

@bogdan-2367

Last seen 2.2 years ago

Palo Alto, CA, USA

Dear Mark, Gordon,

Probably a naive statistical question on edgeR. Consider 3 samples with 3 different library sizes: a) 10 million reads, b) 30 million reads, and c) 12 million reads.

After applying edgeR, I obtain:

1) > 2000 genes differentially expressed between a) and b) (FDR < 0.01, FC > 2), and
2) only ~200 genes differentially expressed between a) and c) (FDR < 0.01, FC > 2).

My question: given that the number of differentially expressed genes depends on the library size, would it be valid to compare and contrast set 1) of 2000 differentially expressed genes (FDR < 0.01, FC > 2) with an expanded set 2) of 200 + 800 differentially expressed genes (FDR < 0.01, but FC > 1.2)?

Thanks a lot,
Bogdan

edgeR edgeR • 1.4k views

ADD COMMENT • link updated 21 months ago by Gordon Smyth 53k • written 14.1 years ago by Bogdan ▴ 670

score 0 · Answer 1 · 2011-12-08

Gordon Smyth 53k

@gordon-smyth

Last seen 6 hours ago

WEHI, Melbourne, Australia

Dear Bogdan,

Given that library size may affect FDR but will not affect FC (it might even increase it slightly), it would seem more natural to me to relax the FDR cutoff rather than the FC cutoff. I would use the same FC cutoff regardless of library size.

This is especially so because, once counts get to a certain size, the p-value under the negative binomial model depends only on the fold change, and further increases in count size make little or no difference. This is because the sequencing variability becomes negligible for large counts, after which biological inter-library variability is the only source of variation.
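To make this concrete, here is a minimal Python sketch, assuming the usual negative binomial variance form mu + phi * mu^2, so that the squared coefficient of variation is 1/mu + phi. The function name `squared_cv` and the dispersion value 0.04 (a biological CV of 20%) are illustrative assumptions, not values from this thread or from edgeR output.

```python
import math


def squared_cv(mean_count, dispersion):
    """Squared coefficient of variation of a negative binomial count.

    With Var = mu + dispersion * mu^2, dividing by mu^2 gives
    CV^2 = 1/mu + dispersion: a technical (sequencing) part that
    shrinks with the count, plus a biological part that does not.
    """
    return 1.0 / mean_count + dispersion


# Hypothetical dispersion, chosen only for illustration.
dispersion = 0.04

for mean_count in (5, 50, 500, 5000):
    total = squared_cv(mean_count, dispersion)
    # Fraction of the total squared CV that comes from sequencing noise.
    technical_share = (1.0 / mean_count) / total
    print(
        f"mean={mean_count:>5}  CV={math.sqrt(total):.3f}  "
        f"technical share={technical_share:.1%}"
    )
```

As the mean count grows, the 1/mu term vanishes and the total CV settles at the square root of the dispersion, so sequencing deeper adds almost no extra precision for genes that already have large counts.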

What is a sensible analysis for your current data might of course depend on many things, which we don't know from your email.

Best wishes
Gordon

ADD COMMENT • link 14.1 years ago • updated 21 months ago Gordon Smyth 53k

Hi Gordon,

I have a similar question. I have two RNA-Seq experiments that were sequenced at different times with exactly the same protocol, but the library sizes are different.

In the first experiment, I have 4 samples with three replicates each and an average library size of 11 million reads:

untreated A and B cells.

Drug1 treated A and B cells.

In the second experiment, I have 4 samples with two replicates each and an average library size of 23 million reads:

untreated A and B cells.

Drug2 treated A and B cells.

I have analysed Experiment1 and Experiment2 separately.

Now, when I try to compare Experiment1 with Experiment2, I run into the following problem.

Comparing untreated A cells from Experiment1 against untreated A cells from Experiment2, I get 3633 differentially expressed genes [abs(logFC) >= 1.0 and FDR < 0.1]. The comparison between the B cells gives a similar result (4550 genes). I expect some differences, but these numbers are very high. Could this be because of the library sizes? Do you have any suggestions for the normalization?

best,

ilyas.

ADD REPLY • link 9.3 years ago Mehmet Ilyas Cosacak • 0