multi-mapped reads: Cufflinks + baySeq? edgeR?

Guest User ★ 13k

@guest-user-4897

Last seen 9.6 years ago

Hi,

I am trying to analyze RNA-Seq data for gene-level differential expression between treatments, in a design incorporating multiple factors (effects of species, treatment, and their interaction, with 4 replicates for each combination). I have single-end reads that map to multiple locations. I first used Bowtie2/TopHat followed by htseq, which discards multi-mapping reads (multihits), and then applied the GLM and baySeq approaches. It was then suggested that I go back and include the multi-mapping hits.

I know Cufflinks can incorporate multi-mapping reads (the Mortazavi method, I think), and I know its output is not compatible with the GLM methods of edgeR/DESeq because it uses FPKM. Does that incompatibility apply to baySeq as well?

Using Cuffdiff seems problematic because it only does pairwise tests. I can do that, but I think a full model that tests the individual effects and their interaction is probably more statistically accurate, especially as our real interest is the species*treatment interaction. So I'm not sure how to proceed; any suggestions would be greatly appreciated if someone has time!

Thank you,
Hilary
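For reference, the "full model" described above (main effects plus the species*treatment interaction) can be written out as a design matrix. The sketch below is only an illustration in plain Python: the level names are hypothetical, and it assumes two species and two treatments, which the post does not actually state. Only the 4 replicates per combination come from the question.

```python
from itertools import product

# Hypothetical factor levels; the post does not name the species or treatments,
# and two levels per factor is an assumption made for this illustration.
SPECIES = ["speciesA", "speciesB"]
TREATMENT = ["control", "treated"]
REPLICATES = 4  # four replicates per combination, as stated in the question


def design_matrix(species_levels, treatment_levels, n_reps):
    """Build a treatment-contrast design matrix for a two-level species * treatment model."""
    columns = ["intercept", "species", "treatment", "species:treatment"]
    labels, rows = [], []
    for sp, tr in product(species_levels, treatment_levels):
        s = int(sp != species_levels[0])    # 1 for the non-reference species
        t = int(tr != treatment_levels[0])  # 1 for the non-reference treatment
        for rep in range(1, n_reps + 1):
            labels.append(f"{sp}_{tr}_rep{rep}")
            # The interaction column is the product of the two main-effect columns.
            rows.append([1, s, t, s * t])
    return columns, labels, rows


columns, labels, rows = design_matrix(SPECIES, TREATMENT, REPLICATES)
for label, row in zip(labels, rows):
    print(label, row)
```

The last column is the one a test of the species*treatment interaction would target; a pairwise comparison between two groups does not isolate that term.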

GO baySeq • 2.8k views

11.2 years ago Guest User ★ 13k

Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 8 months ago

Scripps Research, La Jolla, CA

Hi Hilary,

If you want to include multi-mapped reads in your counts, I believe that the latest version of Cufflinks reports an estimate of the counts for each transcript/gene in addition to FPKM. See the Cufflinks papers and manual for how it splits the counts for multi-mapping reads.

However, if you decide to use count estimates based on multi-mapping reads in edgeR or baySeq, be aware that you are feeding them estimates with some degree of uncertainty, while they expect exact raw counts. As a result, edgeR and baySeq will ignore the uncertainty in the count estimates, leading to an underestimate of variability and an overestimate of significance.

If most of your reads are uniquely mapped, then this effect is probably quite modest. You should compare the multi-mapped counts to the unique-only counts from htseq to see what fraction of your reads are counted as multi-mapped. The higher this fraction, the greater your risk of overstating the significance of differential expression calls. In truth, though, no one really knows how big an effect this might have, or whether it negates the benefit of including the additional data from multi-mapping reads.

Hope this helps,
-Ryan

In reply to Hilary [guest], Fri 01 Mar 2013 06:51:33 AM PST.
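The comparison suggested above (unique-only counts versus estimates that include multi-mapped reads) could be sketched roughly as follows. This is a minimal illustration, not part of any of the tools mentioned: the function name, the dict-based inputs, and the toy numbers are all assumptions, and loading the real htseq and Cufflinks output into these dicts is left out.

```python
def multimapped_fraction(unique_counts, estimated_counts):
    """Compare unique-only counts with estimates that include multi-mapped reads.

    Both arguments are dicts mapping gene ID -> count (hypothetical input format).
    Returns (overall_fraction, per_gene_fraction), where each fraction is the
    share of the estimated count that is not backed by uniquely mapped reads.
    """
    per_gene = {}
    total_unique = 0.0
    total_estimated = 0.0
    for gene, est in estimated_counts.items():
        uniq = unique_counts.get(gene, 0)
        total_unique += uniq
        total_estimated += est
        # Clamp at zero in case an estimate falls below the unique-only count.
        per_gene[gene] = max(est - uniq, 0.0) / est if est > 0 else 0.0
    if total_estimated > 0:
        overall = max(total_estimated - total_unique, 0.0) / total_estimated
    else:
        overall = 0.0
    return overall, per_gene


# Toy numbers for illustration only; they are not from any real dataset.
unique = {"geneA": 90, "geneB": 50, "geneC": 0}
estimated = {"geneA": 100.0, "geneB": 50.0, "geneC": 25.0}

overall, per_gene = multimapped_fraction(unique, estimated)
print(f"overall multi-mapped fraction: {overall:.2f}")
for gene, frac in sorted(per_gene.items(), key=lambda kv: -kv[1]):
    print(gene, f"{frac:.2f}")
```

Genes near the top of the sorted list are the ones whose counts rest mostly on multi-mapped reads, so their significance calls deserve the most caution.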

11.2 years ago Ryan C. Thompson ★ 7.9k

Guest User ★ 13k

@guest-user-4897

Last seen 9.6 years ago

Thanks Ryan; this is very helpful. It's too bad that there isn't a GLM/Bayesian approach (beyond pairwise comparisons) that can model the uncertainty from multi-mapped reads.

11.2 years ago Guest User ★ 13k
