easyRNAseq question

Akula, Nirmala NIH/NIMH [C] ▴ 190

Thank you Simon.

I tried the Ensembl GTF file with HTSeq and got ~50,000 genes for testing with DESeq. We filtered out the genes with low counts, and the resulting file had ~23,000 genes. The problem now is that the QQ-plot is way above the expected. Please see the attachment.

Analysis pipeline: Tophat-HTSeq-DESeq

Any suggestions will be greatly helpful.

Thank you very much.

Regards,
Nirmala

-----Original Message-----
From: Simon Anders [mailto:anders@embl.de]
Sent: Thursday, May 31, 2012 2:31 AM
To: bioconductor at r-project.org
Subject: Re: [BioC] easyRNAseq question

Dear Nirmala

On 2012-05-27 02:25, Akula, Nirmala (NIH/NIMH) [C] wrote:
> I used HTSeq (similar to your geneModel method) which takes the counts
> of disjoint exons for the genes. The problem with this method is that
> too many reads are assigned to ambiguous category and sometimes total
> number of reads that fall on disjoint exons are too few to get a valid
> DESeq result. Using RefSeq genes the total number of genes counted by
> HTSeq on my data is ~14000 whereas using the bestExon method we get
> ~22000. Do you observe similar counts with your data?

It does not quite make sense that counting only for the best exons gives you more counts than counting for all exons.

Could it be that the issue with UCSC GTF files described here is the source of your problems:

https://stat.ethz.ch/pipermail/bioconductor/2012-April/044717.html

Simon

_______________________________________________
Bioconductor mailing list
Bioconductor at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

Category DESeq easyRNASeq

Wolfgang Huber ★ 13k

Dear Nirmala

It seems that the attachment did not come through the mailing list server. Can you use a public (picture) server for posting the plot? And please provide a reproducible code example.

Also, could you be more clear about what you mean by "the QQ-plot is way above the expected"?

Thanks and best wishes
Wolfgang

Jul/2/12 11:10 PM, Akula, Nirmala (NIH/NIMH) [C] scripsit:
> Thank you Simon. I tried Ensemble GTF file with HTSeq and got ~50,000 genes for testing by DESeq. We filtered the genes with low counts and the resulting file had ~23,000 genes. The problem now is the QQ-plot is way above the expected. Please see the attachment.
>
> Analysis pipeline: Tophat-HTSeq-DESeq
>
> Any suggestions will be greatly helpful.
>
> Thank you very much.
>
> Regards,
> Nirmala

--
Best wishes
Wolfgang

Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber

Wolfgang Huber ★ 13k

Dear Nirmala

Thank you. What you call expected is expected only if all null hypotheses are true. (If this sentence does not make sense to you, please consult a local statistician or a book on hypothesis testing.) In your case, you have many small p values. One needs to know more about the data to tell whether this could make biological sense. If not, then you need to explore your data for batch effects or problems with the experimental design or data quality.

I posted your plot here:
http://www-huber.embl.de/users/whuber/bioc-list/120708/EnsemblGTF_qqplot.png

Its axes, however, do not match what you claim below ("expected is calculated using the formula - rank/(n+1)").

PS There are many ways to post an image on the internet, e.g. Facebook, Flickr, Imagevenue, Google+, Tumblr and many others. You can take your pick. Alternatively, I am sure that you have an IT department that is able to teach you how to best do this.

Best wishes
Wolfgang

On 7/5/12 10:16 PM, Akula, Nirmala (NIH/NIMH) [C] wrote:
> Hi,
>
> My analysis pipeline in detail:
>
> 1. Used Tophat 2.0.4 for mapping the reads
> 2. Used the Ensembl GTF file for counting using HTSeq
> 3. Then DESeq to find the differentially expressed genes
> 4. The genes are then ranked in the ascending order of p-values. The expected is calculated using the formula - rank/(n+1), where n is the total number of genes. Observed is -log(pvalue). The QQ plot is expected vs observed.
>
> Please let me know if you need additional details.
>
> Sorry, I am not sure what public server you are talking about. I have attached the plot to this e-mail so please post it to the server.
>
> Thank you very much.
>
> Best Regards,
> Nirmala
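For readers following the QQ-plot discussion, here is a minimal Python sketch of the construction described in step 4 of the quoted pipeline, together with the point that the "expected" line only applies when all null hypotheses are true. It is not the code used in the thread. It assumes that both axes are on the -log10 scale (the message gives only "rank/(n+1)" and "-log(pvalue)" without a log base), and the gene count, the seed and the 30% mixture are hypothetical values chosen purely for illustration.

```python
import math
import random


def qq_points(pvalues):
    """Return (expected, observed) pairs for a p-value QQ plot.

    Assumption: both axes use -log10. The thread only states
    expected = rank/(n+1) and observed = -log(pvalue).
    """
    n = len(pvalues)
    ordered = sorted(pvalues)  # ascending, so the smallest p-value has rank 1
    points = []
    for rank, p in enumerate(ordered, start=1):
        expected = -math.log10(rank / (n + 1))  # uniform quantile under the null
        observed = -math.log10(p)
        points.append((expected, observed))
    return points


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_genes = 20000         # hypothetical number of tested genes

# Case 1: every null hypothesis is true, so p-values are uniform.
# 1 - random() lies in (0, 1], which avoids taking log10 of zero.
null_p = [1.0 - rng.random() for _ in range(n_genes)]

# Case 2: hypothetical mixture in which 30% of the genes get smaller p-values.
mixed_p = [p ** 4 if i < 0.3 * n_genes else p for i, p in enumerate(null_p)]

null_points = qq_points(null_p)
mixed_points = qq_points(mixed_p)

mid = n_genes // 2
print("median-rank point, all nulls true:", null_points[mid])
print("median-rank point, 30% non-null:  ", mixed_points[mid])
```

In the first case the points stay close to the diagonal, while in the second they rise above it. A curve above the diagonal therefore says only that many p-values are small. Whether that reflects biology, batch effects or a design problem has to be checked on the data themselves.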
