Different number of genes when using HTseq and cuffdiff
@fatemehsadat-seyednasrollah-5367
Last seen 9.6 years ago
Hi, I used the same TopHat BAM files as input for both HTSeq (followed by DESeq) and cuffdiff to find DE genes. I do not understand why the number of genes differs between the cuffdiff and HTSeq results even though the BAM files and the reference are the same. I expected the counts for each gene to differ, but I did not expect HTSeq to report more genes (nearly 100 more) than cuffdiff. Thank you in advance.
@steve-lianoglou-2771
Last seen 14 months ago
United States
Hi,

On Fri, Dec 14, 2012 at 9:05 AM, Fatemehsadat Seyednasrollah <fatsey at utu.fi> wrote:
> [...] Actually I expected to have different number of counts for each gene but not getting more (nearly 100) number of genes in HTseq comparing to cuffdiff.

What do you mean by "more genes"? Do you mean that more genes are called as differentially expressed? Or are you using some pipeline that just counts reads over genes, and the two pipelines give a different number of genes as "input"?

If it's the former: cuffdiff and DESeq do rather different things to assess differential expression, so your result should not be a surprise. While I haven't actually read the paper, I would imagine the new publication on cuffdiff2 would be rather informative in this regard: http://www.nature.com/nbt/journal/vaop/ncurrent/full/nbt.2450.html

You haven't said what version of each software you are using, but I guess you're using cuffdiff 2?

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
Hi,

By "more genes" I do not mean the number of DE genes but the total number of input genes. I used the same TopHat BAM files as input for HTSeq and cuffdiff, but regardless of how many DE genes I get, the total number of genes in the results differs. And yes, I am using the latest version of both (cuffdiff2).

Thanks in advance
Hi,

On Fri, Dec 14, 2012 at 9:45 AM, Fatemehsadat Seyednasrollah <fatsey at utu.fi> wrote:
> By more genes I do not mean number of DE genes but the number of total input genes.
> [...]

So, isn't this a function of the gene annotation files (GTFs) you are using for each tool?

-steve
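One way to check the annotation side is to look at which gene_id values each feature type in the GTF contributes. The following is a minimal Python sketch, assuming a standard nine-column GTF with `gene_id "..."` attributes; the function name and the toy records are made up for illustration. htseq-count by default counts only `exon` features grouped by `gene_id`, so a gene without exon records would not appear in its table. How cuffdiff treats such genes is something to verify separately; this sketch does not show it.

```python
import re
from collections import defaultdict

# Matches the gene_id attribute in the ninth GTF column.
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')

def gene_ids_by_feature(gtf_lines):
    """Group the gene IDs found in a GTF by feature type (column 3)."""
    by_feature = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        match = GENE_ID_RE.search(fields[8])
        if match:
            by_feature[fields[2]].add(match.group(1))
    return by_feature

# Hypothetical records standing in for a real annotation file.
example_gtf = [
    '2L\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "geneA"; transcript_id "tA";\n',
    '2L\tsrc\tCDS\t120\t180\t.\t+\t0\tgene_id "geneA"; transcript_id "tA";\n',
    '2L\tsrc\tCDS\t500\t600\t.\t-\t0\tgene_id "geneB"; transcript_id "tB";\n',
]

by_feature = gene_ids_by_feature(example_gtf)
all_genes = set().union(*by_feature.values())
genes_without_exons = all_genes - by_feature["exon"]

print("genes in GTF:", len(all_genes))
print("genes with no exon record:", sorted(genes_without_exons))
```

For a real file, pass an open file handle instead of the list, for example `gene_ids_by_feature(open("genes.gtf"))`, where the file name is a placeholder.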
I am using the same GTF file for both, which is why I do not understand the variation between them. I was thinking that cuffdiff may discard some genes with a very low number of counts, but I did not find anything that supports this idea.
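A direct way to narrow this down is to list exactly which gene IDs appear in one output but not the other. The sketch below is a rough Python illustration, assuming a two-column tab-separated htseq-count table and a cuffdiff `gene_exp.diff`-style table with `gene_id` and `status` columns. The toy data and function names are hypothetical. The status column is included because, to my understanding, cuffdiff reports genes with too few reads as `NOTEST` rather than silently dropping them, which is worth confirming against your own output.

```python
import csv
import io

# Summary rows that htseq-count appends after the per-gene counts
# (newer versions prefix them with "__").
HTSEQ_SUMMARY_ROWS = {
    "no_feature", "ambiguous", "too_low_aQual",
    "not_aligned", "alignment_not_unique",
}

def htseq_gene_ids(handle):
    """Return the gene IDs in a two-column htseq-count table."""
    ids = set()
    for line in handle:
        if not line.strip():
            continue
        gene_id = line.split("\t")[0].strip()
        if gene_id.lstrip("_") in HTSEQ_SUMMARY_ROWS:
            continue  # not a gene, just a summary counter
        ids.add(gene_id)
    return ids

def cuffdiff_status_by_gene(handle):
    """Map gene_id to status from a gene_exp.diff-style table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["gene_id"]: row["status"] for row in reader}

# Toy stand-ins for the real files (hypothetical values).
htseq_text = "geneA\t120\ngeneB\t0\ngeneC\t15\nno_feature\t300\n"
cuffdiff_text = (
    "test_id\tgene_id\tgene\tlocus\tsample_1\tsample_2\tstatus\n"
    "geneA\tgeneA\t-\t2L:100-200\tq1\tq2\tOK\n"
    "geneC\tgeneC\t-\t2L:900-990\tq1\tq2\tNOTEST\n"
)

htseq_ids = htseq_gene_ids(io.StringIO(htseq_text))
cuffdiff_status = cuffdiff_status_by_gene(io.StringIO(cuffdiff_text))

only_in_htseq = sorted(htseq_ids - set(cuffdiff_status))
only_in_cuffdiff = sorted(set(cuffdiff_status) - htseq_ids)

print("only in HTSeq:", only_in_htseq)
print("only in cuffdiff:", only_in_cuffdiff)
```

Once the missing IDs are known, they can be looked up in the GTF and in a genome browser to see what they have in common.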
On Fri, Dec 14, 2012 at 9:51 AM, Fatemehsadat Seyednasrollah <fatsey at utu.fi> wrote:
> I am using the same gtf file for both. Actually this is why I do not understand this variation between them.
> [...]

This is hard to debug blind. Have you tried looking at the genes and your data in something like IGV to see if you can figure this out?

Does this have something to do with how each tool handles "overlapping" regions? Are there multimapped reads that these tools handle differently? Are they filtering reads differently based on alignment quality, maybe?

Lots of things to check.

-steve
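The read-level checks suggested above (multimapped reads, alignment-quality filtering) can be roughed out by tallying reads in a SAM file. This is a minimal Python sketch working on plain SAM text lines, with no external library. The threshold of 10 mirrors what I understand to be htseq-count's default minimum alignment quality, and the `NH` tag is what htseq-count uses to flag non-unique alignments; treat both as assumptions to check against the versions you run. The read records are invented.

```python
from collections import Counter

def classify_sam_line(line, min_mapq=10):
    """Put one SAM alignment line into a coarse category."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    mapq = int(fields[4])
    if flag & 0x4:  # SAM flag bit 0x4: read is unmapped
        return "unmapped"
    # NH:i:<n> gives the number of reported alignments for the read.
    num_hits = 1
    for tag in fields[11:]:
        if tag.startswith("NH:i:"):
            num_hits = int(tag[len("NH:i:"):])
    if num_hits > 1:
        return "multi_mapped"
    if mapq < min_mapq:
        return "low_mapq"
    return "unique"

# Hypothetical alignments; header lines (starting with "@") are skipped.
sam_lines = [
    "@HD\tVN:1.0\tSO:coordinate",
    "r1\t0\t2L\t100\t50\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1",
    "r2\t0\t2L\t300\t3\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:2",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r4\t0\t2L\t700\t5\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1",
]

tally = Counter(
    classify_sam_line(line) for line in sam_lines if not line.startswith("@")
)
for category, count in sorted(tally.items()):
    print(category, count)
```

To run it on real data, convert the BAM to SAM text first (for example with `samtools view`) and feed the lines to `classify_sam_line`. A large `multi_mapped` or `low_mapq` share over a particular gene would point to read filtering as a source of disagreement.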
Hi

It is true that HTSeq (Python) and easyRNASeq give different counts using the same annotation file (GTF from Ensembl). This was for Drosophila melanogaster. The BAM file used to get the count tables came from TopHat/Bowtie.

easyRNASeq:

    > head(countTable)
                  accepted_hits.bam
    "FBgn0000003"              1632
    "FBgn0000008"              1156
    "FBgn0000014"               198
    "FBgn0000015"               129
    "FBgn0000017"              5270
    "FBgn0000018"               616

HTSeq:

    FBgn0000003       0
    FBgn0000008    1229
    FBgn0000014     206
    FBgn0000015     137
    FBgn0000017    5508
    FBgn0000018     631

Nicco directed me to previous posts where someone had studied this. If you are interested, google "counting RNA-seq reads in R/BioC" (with quotes) to find the posts.

Silav
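The two count tables in the post above can be lined up gene by gene to see where the tools disagree most. A small Python sketch using only the six values quoted in the post; the function and variable names are illustrative.

```python
# Counts copied from the post above.
easyrnaseq = {
    "FBgn0000003": 1632, "FBgn0000008": 1156, "FBgn0000014": 198,
    "FBgn0000015": 129, "FBgn0000017": 5270, "FBgn0000018": 616,
}
htseq = {
    "FBgn0000003": 0, "FBgn0000008": 1229, "FBgn0000014": 206,
    "FBgn0000015": 137, "FBgn0000017": 5508, "FBgn0000018": 631,
}

def compare_counts(counts_a, counts_b):
    """Return (gene, count_a, count_b, count_b - count_a) per gene.

    A gene missing from one table is treated as a count of 0.
    """
    rows = []
    for gene in sorted(set(counts_a) | set(counts_b)):
        a = counts_a.get(gene, 0)
        b = counts_b.get(gene, 0)
        rows.append((gene, a, b, b - a))
    return rows

rows = compare_counts(easyrnaseq, htseq)

# Genes counted by one tool but reported as zero by the other.
zero_in_one_tool = [gene for gene, a, b, _ in rows if (a == 0) != (b == 0)]

print("gene\teasyRNASeq\tHTSeq\tdiff")
for gene, a, b, diff in rows:
    print(f"{gene}\t{a}\t{b}\t{diff:+d}")
print("zero in exactly one tool:", zero_in_one_tool)
```

In these six genes the only all-or-nothing disagreement is FBgn0000003 (1632 versus 0), which makes it a natural first candidate to inspect in the GTF and in a genome browser.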