Hi Jim,
I posted my first question as a guest and only became a member today,
so it's a bit of a learning curve for me on how to use the list. Sorry
for the email; I hope I am using the list correctly now by emailing
this address.
About my edgeR question:
Thank you for your elaboration and also the example. I can see that in
your example you have samples with 0 reads and the logFC is okay, which
is what it logically should be. However, in my dataset I see many cases
of logFC of ~144269489 (or the negative of roughly this value). When I
check the genes, I see that these are the cases where all the replicates
of one sample have 0 reads mapped to them, whereas the other groups of
samples have many reads. These are the cases that the cpm filter didn't
remove. That's why I tried to use a more restrictive cpm filter to get
rid of these genes.
Any thoughts on why these uninterpretable logFC values occur would be
greatly appreciated.
Thanks,
John
Hi John,
Please don't take things off-list. Even if you are not a subscriber
(and if you are using BioC stuff you should be, and you can always stop
delivery but remain a subscriber), I believe that replying to an
existing thread will work.
I don't see any zero counts causing a problem. Using the example for
cpm() as a starting point, and modifying it to have a set with zero
counts, I get this:
> y
     [,1] [,2] [,3] [,4]
[1,]    1    2   14   11
[2,]   11   25    1   26
[3,]    1   22    2   19
[4,]    5    6   15    6
[5,]    0    0    1    5
> d <- DGEList(counts=y, lib.size=1001:1004, group=factor(c(1,1,2,2)))
> d <- estimateCommonDisp(d)
> d <- estimateTagwiseDisp(d)
> topTags(exactTest(d))
Comparison of groups:  2-1
       logFC   logCPM       PValue          FDR
1  2.9550376 12.76964 6.109348e-05 0.0003054674
5  4.6421574 10.54712 1.283343e-01 0.3208358043
4  0.9149142 12.96222 2.668415e-01 0.4447357815
2 -0.4149407 13.93933 8.539261e-01 0.9783799675
3 -0.1325391 13.42121 9.783800e-01 0.9783799675
So the row with zero counts in group 1 (row 5) is the second row in the
topTags() output, and there is no problem computing a logFC value for it.
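
To make the point concrete, here is a rough base-R sketch (not edgeR's
actual algorithm) of why an all-zero group breaks a naive ratio and why
a small added count keeps the log fold change finite. It reuses the
counts from row 5 and the library sizes from the DGEList() call; the
helper group_cpm() and its `prior` argument are made up for this
illustration. As far as I know, exactTest() has a prior.count argument
that serves a related purpose, but its internal calculation differs from
what is shown here.

```r
# Counts from row 5 of the example matrix; library sizes as in the DGEList() call.
counts   <- c(0, 0, 1, 5)
lib.size <- 1001:1004
group    <- factor(c(1, 1, 2, 2))

# Hypothetical helper: mean counts-per-million per group, with an optional
# pseudo-count added to every count (`prior` is NOT an edgeR argument).
group_cpm <- function(counts, lib.size, group, prior = 0) {
  cpm <- (counts + prior) / lib.size * 1e6
  tapply(cpm, group, mean)
}

# Naive ratio: group 1 is all zeros, so we divide by 0 and get Inf.
raw <- group_cpm(counts, lib.size, group)
naive_logFC <- log2(as.numeric(raw["2"]) / as.numeric(raw["1"]))

# With a small pseudo-count the denominator is positive and the ratio is finite.
shrunk <- group_cpm(counts, lib.size, group, prior = 0.5)
shrunk_logFC <- log2(as.numeric(shrunk["2"]) / as.numeric(shrunk["1"]))

print(c(naive = naive_logFC, shrunk = shrunk_logFC))
```

The size of the pseudo-count (0.5 here) is an arbitrary choice for the
sketch; larger values pull the log fold change further toward zero.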
Best,
Jim
On 2/11/2013 4:30 PM, John Sperry wrote:
> Hi again Jim,
>
>
> One more thing: in the microarray days, people used to add a small
> value, let's say 1, to the 0 values to avoid nonsense fold changes.
> That's not the case in NGS any more, right? So it's not possible to do
> that in edgeR, right? That's why I was thinking about filtering with
> cpm.
>
> Thanks,
> John
>
>
>
> ------------------------------------------------------------------------
> *From:* John Sperry <johnsperry51@yahoo.com>
> *To:* "jmacdon@uw.edu" <jmacdon@uw.edu>
> *Sent:* Monday, February 11, 2013 1:47 PM
> *Subject:* [BioC] edgeR cpm filtering
>
> Hi Jim,
>
> I'm very new to edgeR and BioC. I couldn't reply to your post in
> BioC, so here is my post in an email :D
>
> I still cannot see why 1M is chosen, but I appreciate your
> explanations.
>
>
> About the cpm filtering: the reason I chose '> 2' for 3 samples, each
> with 2 replicates, was that I thought edgeR would be smart enough to
> figure out that when I say more than 5 reads per million for more than
> 2 samples, it means for ALL the replicates of each sample, which
> apparently is not the case. Thanks for pointing that out!
>
> As for why I want to get rid of the cases where sample 3's 2
> replicates have 0 reads mapped to them: they cause the logFC to become
> huge nonsense numbers! I guess dividing by 0 causes the problem. So I
> thought that, to avoid seeing weird values when the significant genes
> are selected, it's better to get rid of genes that have 0 reads mapped
> in any of their groups. Does that make sense?
>
> d_DGEList <- d_DGEList[rowSums(cpm_filtered > 5) > 2, ]
>
> Thanks,
> John
>
>
-- James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099