Different number of genes when using HTseq and cuffdiff

Fatemehsadat Seyednasrollah ▴ 260

@fatemehsadat-seyednasrollah-5367

Last seen 9.6 years ago

Hi, I used the same TopHat BAM files as input for both HTSeq (followed by DESeq) and cuffdiff to find DE genes. I do not understand why the number of genes differs between the cuffdiff and HTSeq results even though the BAM files and references are the same. I expected the counts per gene to differ, but I did not expect HTSeq to report more genes (nearly 100 more) than cuffdiff. Thank you in advance.

ADD COMMENT • link updated 11.4 years ago by Steve Lianoglou ★ 13k • written 11.4 years ago by Fatemehsadat Seyednasrollah ▴ 260

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 14 months ago

United States

Hi,

On Fri, Dec 14, 2012 at 9:05 AM, Fatemehsadat Seyednasrollah wrote:
> Actually I expected to have different number of counts for each gene but not getting more (nearly 100) number of genes in HTseq comparing to cuffdiff.

What do you mean by "more genes"? Do you mean more genes are called as differentially expressed? Or is there some pipeline that you are using to just count reads over genes, and these two pipelines are giving a different number of genes as "input"?

If it's the former, cuffdiff and DESeq do rather different things to assess differential expression, so your result should not be a surprise. While I haven't actually read the paper, I would imagine the new publication on cuffdiff2 would be rather informative in this regard: http://www.nature.com/nbt/journal/vaop/ncurrent/full/nbt.2450.html

You haven't said what version of each software you are using, but I guess you're using cuffdiff 2?

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

ADD COMMENT • link 11.4 years ago Steve Lianoglou ★ 13k

Hi,

By "more genes" I do not mean the number of DE genes but the total number of input genes. I used the same TopHat BAM files as input for HTSeq and cuffdiff, but regardless of how many DE genes I get, the total number of genes in the results differs. And yes, I am using the latest version of both (cuffdiff2).

Thanks in advance

ADD REPLY • link 11.4 years ago Fatemehsadat Seyednasrollah ▴ 260

Hi,

On Fri, Dec 14, 2012 at 9:45 AM, Fatemehsadat Seyednasrollah wrote:
> By more genes I do not mean number of DE genes but the number of total input genes.

So, isn't this a function of the gene annotation files (GTFs) you are using for each tool?

-steve

ADD REPLY • link 11.4 years ago Steve Lianoglou ★ 13k

I am using the same GTF file for both, which is why I do not understand this variation between them. I was thinking that cuffdiff might discard some genes with a very low number of counts, but I did not find anything that supports this idea.
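One way to narrow this down is to list exactly which gene IDs appear in one result and not in the other. The sketch below is only an illustration: it assumes the HTSeq side is the tab-separated output of htseq-count (gene rows followed by summary rows such as no_feature and ambiguous) and the cuffdiff side is a gene_exp.diff table with gene_id and status columns. The thread does not say which files were actually compared, and the gene names and the trimmed column set in the demo data are made up.

```python
import csv
import io
from collections import Counter

# Summary rows that htseq-count appends after the gene rows. Newer releases
# prefix them with two underscores, so the prefix is stripped before matching.
HTSEQ_SUMMARY_ROWS = {
    "no_feature", "ambiguous", "too_low_aQual",
    "not_aligned", "alignment_not_unique",
}


def htseq_gene_ids(handle):
    """Collect gene IDs from htseq-count output, skipping the summary rows."""
    ids = set()
    for row in csv.reader(handle, delimiter="\t"):
        if row and row[0].lstrip("_") not in HTSEQ_SUMMARY_ROWS:
            ids.add(row[0])
    return ids


def cuffdiff_status_by_gene(handle):
    """Map gene_id to the status column of a gene_exp.diff style table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["gene_id"]: row["status"] for row in reader}


# Hypothetical demo data standing in for the real files.
htseq_demo = io.StringIO(
    "geneA\t10\ngeneB\t0\ngeneC\t5\nno_feature\t100\nambiguous\t3\n"
)
cuffdiff_demo = io.StringIO(
    "test_id\tgene_id\tgene\tstatus\n"
    "geneA\tgeneA\tA\tOK\n"
    "geneC\tgeneC\tC\tNOTEST\n"
)

htseq_ids = htseq_gene_ids(htseq_demo)
cuff_status = cuffdiff_status_by_gene(cuffdiff_demo)

only_in_htseq = sorted(htseq_ids - set(cuff_status))
only_in_cuffdiff = sorted(set(cuff_status) - htseq_ids)
status_tally = Counter(cuff_status.values())

print("only in HTSeq:", only_in_htseq)
print("only in cuffdiff:", only_in_cuffdiff)
print("cuffdiff status tally:", dict(status_tally))
```

With real files, the open file handles would replace the StringIO objects. If the extra rows on the HTSeq side turn out to be summary rows, or the IDs on the two sides follow different naming schemes, that alone could account for part of the gap. The status tally shows whether low-coverage genes are absent from the cuffdiff table or present but flagged as untested.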

ADD REPLY • link 11.4 years ago Fatemehsadat Seyednasrollah ▴ 260

On Fri, Dec 14, 2012 at 9:51 AM, Fatemehsadat Seyednasrollah wrote:
> I am using the same gtf file for both. Actually this is why I do not understand this variation between them.
> I was thinking maybe cuffdiff may discard some genes with very low number of counts but I did not find something which support this idea.

This is hard to debug blind. Have you tried looking at the genes and your data in something like IGV to see if you can figure this out?

Does this have something to do with how each tool handles "overlapping" regions? Are there multimapped reads that these tools handle differently? Are they filtering reads differently based on alignment quality, maybe?

Lots of things to check.

-steve
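To make the three checks above concrete, here is a toy read counter in which each rule can be switched on or off. It is not HTSeq or cuffdiff code; the function names, the gene coordinates, and the reads are invented for illustration, strand and chromosome are ignored, and intervals are half-open. The reason labels merely borrow the wording of htseq-count's summary rows.

```python
from collections import Counter


def assign_read(read, genes, min_mapq=0, skip_multimapped=True):
    """Return the gene a read is counted for, or a label saying why it was dropped."""
    if skip_multimapped and read["nh"] > 1:
        return "alignment_not_unique"
    if read["mapq"] < min_mapq:
        return "too_low_aQual"
    # Genes whose [start, end) interval overlaps the read's interval.
    hits = {
        gene_id
        for gene_id, (start, end) in genes.items()
        if read["start"] < end and start < read["end"]
    }
    if not hits:
        return "no_feature"
    if len(hits) > 1:
        return "ambiguous"
    return next(iter(hits))


def count_reads(reads, genes, **rules):
    """Tally the outcome of assign_read over all reads."""
    return Counter(assign_read(read, genes, **rules) for read in reads)


# Two hypothetical genes that overlap between positions 180 and 200.
genes = {"geneA": (100, 200), "geneB": (180, 300)}

reads = [
    {"start": 110, "end": 160, "mapq": 50, "nh": 1},  # unique, inside geneA
    {"start": 170, "end": 220, "mapq": 50, "nh": 1},  # spans the overlap
    {"start": 250, "end": 290, "mapq": 3, "nh": 1},   # geneB, low quality
    {"start": 120, "end": 170, "mapq": 50, "nh": 2},  # geneA, multimapped
    {"start": 400, "end": 450, "mapq": 50, "nh": 1},  # intergenic
]

strict = count_reads(reads, genes, min_mapq=10, skip_multimapped=True)
lenient = count_reads(reads, genes, min_mapq=0, skip_multimapped=False)

print("strict: ", dict(strict))
print("lenient:", dict(lenient))
```

Under the strict rules geneB receives no reads at all, while under the lenient rules it receives one. A tool that omits or filters zero-count genes would then report a different total number of genes from the same alignments, which is one possible mechanism to check in IGV.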

ADD REPLY • link 11.4 years ago Steve Lianoglou ★ 13k

Hi,

It is true that HTSeq (Python) and easyRNASeq give different counts when using the same annotation file (GTF from Ensembl). This was for D. melanogaster. The BAM file used to build the count tables came from TopHat/Bowtie.

First rows of each count table (easyRNASeq: head(countTable), column accepted_hits.bam):

Gene          easyRNASeq   HTSeq
FBgn0000003   1632         0
FBgn0000008   1156         1229
FBgn0000014   198          206
FBgn0000015   129          137
FBgn0000017   5270         5508
FBgn0000018   616          631

Nicco directed me to previous posts where someone had studied this. If you are interested, google "counting RNA-seq reads in R/BioC" (with quotes) to find the posts.

Silav
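The six counts quoted in this post can be compared programmatically to see where the two tools disagree most. The sketch below uses only those numbers; the helper names compare_counts and zero_in_one_only are made up for illustration.

```python
# Counts copied from the post above, keyed by FlyBase gene ID.
easyrnaseq = {
    "FBgn0000003": 1632, "FBgn0000008": 1156, "FBgn0000014": 198,
    "FBgn0000015": 129, "FBgn0000017": 5270, "FBgn0000018": 616,
}
htseq = {
    "FBgn0000003": 0, "FBgn0000008": 1229, "FBgn0000014": 206,
    "FBgn0000015": 137, "FBgn0000017": 5508, "FBgn0000018": 631,
}


def compare_counts(a, b):
    """Return (gene, count_a, count_b, b - a) for every gene seen in either table."""
    rows = []
    for gene in sorted(set(a) | set(b)):
        x, y = a.get(gene), b.get(gene)
        diff = None if x is None or y is None else y - x
        rows.append((gene, x, y, diff))
    return rows


def zero_in_one_only(a, b):
    """Genes with reads in exactly one table and zero (or no entry) in the other."""
    genes = sorted(set(a) | set(b))
    return [g for g in genes if (a.get(g, 0) == 0) != (b.get(g, 0) == 0)]


rows = compare_counts(easyrnaseq, htseq)
for gene, x, y, diff in rows:
    print(f"{gene}\t{x}\t{y}\t{diff:+d}")

print("zero in one table only:", zero_in_one_only(easyrnaseq, htseq))
```

Genes that are zero in only one table, such as FBgn0000003 here, are good first candidates to inspect in a genome browser, since a small shift in counting rules would not normally remove more than a thousand reads from a single gene.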

ADD REPLY • link 11.4 years ago Silav Bremos ▴ 80
