edgeR cpm filtering

0

Entering edit mode

Guest User ★ 13k

@guest-user-4897

Last seen 11.4 years ago

All, I am a new edgeR user. I have difficulty understanding the meaning of the ???cpm??? function of edgeR package. I mean I understand that each value is divided by the total library value, and then multiplied by 1,000,000. But why 1M? I don???t understand what is the logic behind using 1M? is it 1M reads? Or bases? And why not 10M? or 1000? Any specific reason for using 1M? Another issues that I have is that how can I enforce filtering the samples that have 0 reads in one group of samples, but very large number of reads in another groups? Here is an example: Samples, Sample 1-replicate 1, Sample 1-replicate 2, Sample 2-replicate 1, Sample 2- replicate 2, Sample 3-replicate 1, Sample 3- replicate 2 Gene_X, 150,100, 270,320,0,0 I used: d_DGEList <- d_DGEList[rowSums(cpm_filtered > 5) > 2,] But still Gene_X is not filtered. Many genes with low number of reads are filtered, but very few like Gene_X are still there. I think that having many reads mapped to samples 1 and 2 qualifies it for passing the cpm filtering. How can I filter genes like this? Is it OK if I manually delete cases like this? Thank you. John -- output of sessionInfo(): > sessionInfo() R version 2.15.0 (2012-03-30) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] edgeR_2.6.0 limma_3.12.0 > -- Sent via the guest posting facility at bioconductor.org.

edgeR edgeR • 6.5k views

ADD COMMENT • link updated 12.9 years ago by John Sperry ▴ 20 • written 13.0 years ago by Guest User ★ 13k

0

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 16 minutes ago

United States

Hi John, On 2/11/2013 11:54 AM, John [guest] wrote: > All, > > I am a new edgeR user. I have difficulty understanding the meaning of the ???cpm??? function of edgeR package. I mean I understand that each value is divided by the total library value, and then multiplied by 1,000,000. But why 1M? I don???t understand what is the logic behind using 1M? is it 1M reads? Or bases? And why not 10M? or 1000? Any specific reason for using 1M? It's reads. You are inputting read counts per gene, so by definition edgeR knows nothing about bases. As for why 1M, I don't know for sure, but would imagine it is because a 'reasonable' sized RNA-Seq experiment is based on somewhere around 10M reads or so. In other words, say you have a sample with 10M reads, and one gene has 60 reads that align. You would have 6 cpm, which looks pretty low, and it is. However you could do statistics on it based on a discrete distribution like the negative binomial. If you used 10M to normalize, the cp10M would be 0.6, so you would have to throw that gene out. If you used cpk, it would be 6000 cpk and it would look like you had lots of reads when in fact you only had 60. So cpm looks like a reasonable compromise to me. > > Another issues that I have is that how can I enforce filtering the samples that have 0 reads in one group of samples, but very large number of reads in another groups? Here is an example: > > Samples, Sample 1-replicate 1, Sample 1-replicate 2, Sample 2-replicate 1, Sample 2- replicate 2, Sample 3-replicate 1, Sample 3- replicate 2 > Gene_X, 150,100, 270,320,0,0 > > I used: > > d_DGEList<- d_DGEList[rowSums(cpm_filtered> 5)> 2,] > > But still Gene_X is not filtered. Many genes with low number of reads are filtered, but very few like Gene_X are still there. I think that having many reads mapped to samples 1 and 2 qualifies it for passing the cpm filtering. How can I filter genes like this? Is it OK if I manually delete cases like this? Setting aside the logic of removing these genes (which IMO is missing the point of looking for differential expression), note the logic of your filter. Let's break it down by section: rowSums(cpm_filtered > 5) This gives the count of samples where the cpm value is > 5. In the case you mention, this would be four. rowSums(cpm_filtered > 5) > 2 This results in TRUE if the count of samples for a given gene with a cpm value > 5 is greater than two. So you are saying that at least two samples have to have a cpm > 5. In the instance you want to filter, the count is 4, and 4 > 2, so this passes the filter. So what you apparently want are rows where the cpm is greater than some value in ALL samples, not just two of them, so you would want to change the > 2 part to == 6. Note that this doesn't really make any sense. You are in effect saying that you are completely uninterested in any genes that appear not to be expressed in one of your samples, but that might be highly expressed in one or more of the other samples. That to me is actually really interesting, and I am not sure why you would want to filter out any gene that fulfills that criterion. Best, Jim > > Thank you. > John > > > -- output of sessionInfo(): > >> sessionInfo() > R version 2.15.0 (2012-03-30) > > Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) > > > locale: > > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > > > attached base packages: > > [1] stats graphics grDevices utils datasets methods base > > > other attached packages: > > [1] edgeR_2.6.0 limma_3.12.0 > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099

ADD COMMENT • link 13.0 years ago James W. MacDonald 68k

0

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 16 minutes ago

United States

Hi John, Please don't take things off-list. Even if you are not a subscriber (and if you are using BioC stuff you should be, and you can always stop delivery but remain a subscriber), I believe that replying to an existing thread will work. I don't see any zero counts causing a problem. Using the example for cpm() as a starting point, and modifying to have a set with zero counts, I get this: > y [,1] [,2] [,3] [,4] [1,] 1 2 14 11 [2,] 11 25 1 26 [3,] 1 22 2 19 [4,] 5 6 15 6 [5,] 0 0 1 5 > d <-DGEList(counts=y, lib.size=1001:1004, group=factor(c(1,1,2,2))) > d <- estimateCommonDisp(d) > d <- estimateTagwiseDisp(d) > topTags(exactTest(d)) Comparison of groups: 2-1 logFC logCPM PValue FDR 1 2.9550376 12.76964 6.109348e-05 0.0003054674 5 4.6421574 10.54712 1.283343e-01 0.3208358043 4 0.9149142 12.96222 2.668415e-01 0.4447357815 2 -0.4149407 13.93933 8.539261e-01 0.9783799675 3 -0.1325391 13.42121 9.783800e-01 0.9783799675 So the sample with zero counts (sample 5), is the second row in the topTags() output, and it has no problem computing a logFC value. Best, Jim On 2/11/2013 4:30 PM, John Sperry wrote: > Hi again Jim, > > One more thing, in microarray days, people used to add a small value, > let say 1 to the 0 values to avoid non-sense fold changes. It's not > the case in NGS any more right? so it's not possible to do that in > edgeR, right? that's why I was thinking about filtering out with cpm. > > Thanks, > John > > > > -------------------------------------------------------------------- ---- > *From:* John Sperry <johnsperry51 at="" yahoo.com=""> > *To:* "jmacdon at uw.edu" <jmacdon at="" uw.edu=""> > *Sent:* Monday, February 11, 2013 1:47 PM > *Subject:* [BioC] edgeR cpm filtering > > Hi Jim, > > I'm very new to edgeR and BioC. I couldn't reply to your post in BioC, > so here is my post in an email :D > > I still cannot see why 1M is chosen, but I appreciate your explanations. > > About the cpm filtering, the reason that I chose '> 2' for 3 samples > with each having 2 replicates was that I though edgeR must be smart > enough to figure out that when I say more than 5 reads per million for > more than 2 samples, it means for ALL the replicates of each samples! > which apparently is not the case! thanks for pointing that out! > > as for the reason for wanting to get rid of the sample 3 with 2 > replicates that have 0 reads mapped to them, I don't want them, > because, they cause the logFC to become huge non-sense numbers! i > guess dividing be 0 causes the problem! so I thought for not seeing > weird values when the significant genes are selected, it's better to > get rid of genes that have 0 reads mapped to any of their groups. Does > it make sense? > > d_DGEList<- d_DGEList[rowSums(cpm_filtered> 5)> 2,] > > Thanks, > John > > -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099

ADD COMMENT • link 13.0 years ago James W. MacDonald 68k

0

Entering edit mode

John Sperry ▴ 20

@john-sperry-5759

Last seen 11.4 years ago

Hi Jim, I posted my first question as a guest and just became a member today, so itâs a bit of learning curve for me on how to use the list. Sorry for the email and I hope I am using the list correctly now by emailing this address. About my edgeR question: Thank you for your elaboration and also the example. I can see that in your example, you have samples with 0 reads and logFC is okay, and thatâs what logically should be. However, in my dataset, I see many cases of logFC of ~144269489 (or negative of ~ this value). When I check the genes, I see that these are the cases where all the replicates of one samples have 0 reads mapped to them, whereas the other groups of samples have many reads. These are the cases that cpm didnât filter them. Thatâs why I tried to use more restrictive cpm filtering to get rid of these genes. Any thoughts on why this non-interpretive logFC cases happen are greatly appreciated. Thanks, John Hi John, Please don't take things off-list. Even if you are not a subscriber (and if you are using BioC stuff you should be, and you can always stop delivery but remain a subscriber), I believe that replying to an existing thread will work. I don't see any zero counts causing a problem. Using the example for cpm() as a starting point, and modifying to have a set with zero counts, I get this: > y Â Â [,1] [,2] [,3] [,4] [1,]Â Â 1Â Â 2Â 14Â 11 [2,]Â 11Â 25Â Â 1Â 26 [3,]Â Â 1Â 22Â Â 2Â 19 [4,]Â Â 5Â Â 6Â 15Â Â 6 [5,]Â Â 0Â Â 0Â Â 1Â Â 5 > d <-DGEList(counts=y, lib.size=1001:1004, group=factor(c(1,1,2,2))) > d <- estimateCommonDisp(d) > d <- estimateTagwiseDisp(d) > topTags(exactTest(d)) Comparison of groups:Â 2-1 Â Â Â logFCÂ logCPMÂ Â Â PValueÂ Â Â Â Â FDR 1Â 2.9550376 12.76964 6.109348e-05 0.0003054674 5Â 4.6421574 10.54712 1.283343e-01 0.3208358043 4Â 0.9149142 12.96222 2.668415e-01 0.4447357815 2 -0.4149407 13.93933 8.539261e-01 0.9783799675 3 -0.1325391 13.42121 9.783800e-01 0.9783799675 So the sample with zero counts (sample 5), is the second row in the topTags() output, and it has no problem computing a logFC value. Best, Jim On 2/11/2013 4:30 PM, John Sperry wrote: > Hi again Jim, > > One more thing, in microarray days, people used to add a small value, let say 1 to the 0 values to avoid non-sense fold changes. It's not the case in NGS any more right? so it's not possible to do that in edgeR, right? that's why I was thinking about filtering out with cpm. > > Thanks, > John > > > > -------------------------------------------------------------------- ---- > *From:* John Sperry <johnsperry51@yahoo.com> > *To:* "jmacdon@uw.edu" <jmacdon@uw.edu> > *Sent:* Monday, February 11, 2013 1:47 PM > *Subject:* [BioC] edgeR cpm filtering > > Hi Jim, > > I'm very new to edgeR and BioC. I couldn't reply to your post in BioC, so here is my post in an email :D > > I still cannot see why 1M is chosen, but I appreciate your explanations. > > About the cpm filtering, the reason that I chose '> 2' for 3 samples with each having 2 replicates was that I though edgeR must be smart enough to figure out that when I say more than 5 reads per million for more than 2 samples, it means for ALL the replicates of each samples! which apparently is not the case! thanks for pointing that out! > > as for the reason for wanting to get rid of the sample 3 with 2 replicates that have 0 reads mapped to them, I don't want them, because, they cause the logFC to become huge non-sense numbers! i guess dividing be 0 causes the problem! so I thought for not seeing weird values when the significant genes are selected, it's better to get rid of genes that have 0 reads mapped to any of their groups. Does it make sense? > > d_DGEList<- d_DGEList[rowSums(cpm_filtered> 5)> 2,] > > Thanks, > John > > -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099 [[alternative HTML version deleted]]

ADD COMMENT • link 13.0 years ago John Sperry ▴ 20

0

Entering edit mode

Hi John, On Tue, Feb 12, 2013 at 12:45 AM, John Sperry <johnsperry51 at="" yahoo.com=""> wrote: > Hi Jim, > > I posted my first question as a guest and just became a > member today, so it?s a bit of learning curve for me on how to use the list. Sorry > for the email and I hope I am using the list correctly now by emailing this > address. > > About my edgeR question: > > Thank you for your elaboration and also the example. I can > see that in your example, you have samples with 0 reads and logFC is okay, and > that?s what logically should be. However, in my dataset, I see many cases of > logFC of ~144269489 (or negative of ~ this value). When I check the genes, I see > that these are the cases where all the replicates of one samples have 0 reads > mapped to them, whereas the other groups of samples have many reads. These are > the cases that cpm didn?t filter them. That?s why I tried to use more restrictive > cpm filtering to get rid of these genes. > > Any thoughts on why this non-interpretive logFC cases happen > are greatly appreciated. >From the looks of your previous (original) email, it looks like you've landed w/ a (very) strange bioconductor setup given the version of R you are using -- you are using R-2.15.0, with edgeR version 2.6.0 Please update to the latest version of R (2.15.2 -- cuz, hey it's a good idea), but more importantly, the latest version of edgeR: http://bioconductor.org/packages/2.11/bioc/html/edgeR.html You should be installing your bioconductor packages w/ the BiocInstaller package/biocLite functionality. Unwinding yourself from the bizarre package versions you have running to ensure that you are on the latest bioc release train (2.11) might get a bit tricky, and sometimes the easiest way is to blow out your R install and start from scratch. After you do that, do you still get such extreme logFC's? -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact

ADD REPLY • link 13.0 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

John Sperry ▴ 20

@john-sperry-5759

Last seen 11.4 years ago

Hi Steve, Thank you for the suggestion. That actually solved my problem! Upgrading edgeR package to its latest version made all the huge (or very low) logFC values to disappear! I think the issue has been taken care of in the latest version. This is great! Thanks, John [[alternative HTML version deleted]]

ADD COMMENT • link 12.9 years ago John Sperry ▴ 20

Login before adding your answer.