Question

htseq and dexseq

0

Entering edit mode

Andreia Fonseca ▴ 810

@andreia-fonseca-3796

Last seen 7.2 years ago

Dear all, sorry if this question is not appropriate to this list, but if you are using DEXSeq probably you used htseq to make exon count previously. While making the counts using the python script available we saw that counts are being attributted to regions overlaping 2 genes, that what we think it means the counts corrseponding to IDs that have 2 or more ensembl ids. But then the most weird thing is that the total is being much higher than the number of reads aligned, id htseq counting reads in the individual genes and also in the gene fusions? Below is a summary of the counts. Thanks in advance for the help. Kind regards, Andreia HTseq isoform read count summary for DEXSeq NS_1 _ambiguous 0 _empty 5338242 _lowaqual 0 _notaligned 7690251 With feature 48284659 with feature no overlap 43660100 with feature overlap 4624559 total 61313152 sam file counts NS_1 Real Pair Number 43,324,075 against total DESeq counts 260,581 against total DEXseq counts -17,989,077 -- ---------------------------------------------------------------------- ----------------------- Andreia J. Amaral, PhD BioFIG - Center for Biodiversity, Functional and Integrative Genomics Instituto de Medicina Molecular University of Lisbon Tel: +352 217500000 (ext. office: 28253) email:andreiaamaral@fm.ul.pt ; andreiaamaral@fc.ul.pt [[alternative HTML version deleted]]

DEXSeq DEXSeq • 2.1k views

ADD COMMENT • link updated 10.8 years ago by Steve Lianoglou ★ 13k • written 10.8 years ago by Andreia Fonseca ▴ 810

score 0 · Answer 1 · 2013-06-26

0

Entering edit mode

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 14 months ago

United States

Hi Andreia, It's hard to parse the read stats you put into the email due to some forced wrapping, I guess, so I'm not quite sure what the numbers are that you are showing, but you say regarding the htseq counting for exons: > But then the most weird thing is that the total is being much higher than the > number of reads aligned, While I've never actually used ht-seq to count for DEXSeq, from the chatter on this list I'm pretty sure that it will likely double count a read if it spans more than one exon, ie. one exon will assign a +1 to the count for every exon it spans. If that can explain the issue that you are seeing, then this is expected behavior -- and moreover, you actually want that. As an aside, the vignette that comes with the parathyroidSE data packge has a section that shows you how to do DEXSeq entirely using R/bioconductor tools: http://bioconductor.org/packages/release/data/experiment/vignettes/par athyroidSE/inst/doc/parathyroidSE.pdf HTH, -steve -- Steve Lianoglou Computational Biologist Bioinformatics and Computational Biology Genentech

ADD COMMENT • link 10.8 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Thanks Steve, The vignette looks very useful. ;y question was not related with counting into two exons of same gene. With Htseq we get counts for exons that overlap different genes, so it creates feature fusing two or more genes, and I don't know how to handle this. should I look into the counts of the exons of the single gene , or of the counts of exons the gene fusions? thanks a bunch On Wed, Jun 26, 2013 at 5:27 PM, Steve Lianoglou <lianoglou.steve@gene.com>wrote: > Hi Andreia, > > It's hard to parse the read stats you put into the email due to some > forced wrapping, I guess, so I'm not quite sure what the numbers are > that you are showing, but you say regarding the htseq counting for > exons: > > > But then the most weird thing is that the total is being much higher > than the > > number of reads aligned, > > While I've never actually used ht-seq to count for DEXSeq, from the > chatter on this list I'm pretty sure that it will likely double count > a read if it spans more than one exon, ie. one exon will assign a +1 > to the count for every exon it spans. If that can explain the issue > that you are seeing, then this is expected behavior -- and moreover, > you actually want that. > > As an aside, the vignette that comes with the parathyroidSE data > packge has a section that shows you how to do DEXSeq entirely using > R/bioconductor tools: > > > http://bioconductor.org/packages/release/data/experiment/vignettes/p arathyroidSE/inst/doc/parathyroidSE.pdf > > HTH, > > -steve > > -- > Steve Lianoglou > Computational Biologist > Bioinformatics and Computational Biology > Genentech > -- ---------------------------------------------------------------------- ----------------------- Andreia J. Amaral, PhD BioFIG - Center for Biodiversity, Functional and Integrative Genomics Instituto de Medicina Molecular University of Lisbon Tel: +352 217500000 (ext. office: 28253) email:andreiaamaral@fm.ul.pt ; andreiaamaral@fc.ul.pt [[alternative HTML version deleted]]

ADD REPLY • link 10.8 years ago Andreia Fonseca ▴ 810

0

Entering edit mode

Hi, On Wed, Jun 26, 2013 at 2:00 PM, Andreia Fonseca <andreia.fonseca at="" gmail.com=""> wrote: > Thanks Steve, > > The vignette looks very useful. ;y question was not related with counting > into two exons of same gene. With Htseq we get counts for exons that > overlap different genes, so it creates feature fusing two or more genes, > and I don't know how to handle this. should I look into the counts of the > exons of the single gene > , or of the counts of exons the gene fusions? Remember that I've never used ht-seq, but I'll assume that it's easy enough to find the "bins" that result from an exonic region shared between two different genes. The way I'd handle these special cases is that I wouldn't handle them at all and I'd likely prefer to remove those bins from my initial analysis -- I suspect the number of such cases would be relatively few when compared to the number of "kosher" bins. HTH, -steve -- Steve Lianoglou Computational Biologist Bioinformatics and Computational Biology Genentech

ADD REPLY • link 10.8 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Hi, You are right these cases are the minority. My feelling was to leave these apart but I was affraid to be discarding something important. Thanks for the tips. Kind regards, Andreia Sent from my iPad On 26/06/2013, at 22:56, Steve Lianoglou <lianoglou.steve at="" gene.com=""> wrote: > Hi, > > On Wed, Jun 26, 2013 at 2:00 PM, Andreia Fonseca > <andreia.fonseca at="" gmail.com=""> wrote: >> Thanks Steve, >> >> The vignette looks very useful. ;y question was not related with counting >> into two exons of same gene. With Htseq we get counts for exons that >> overlap different genes, so it creates feature fusing two or more genes, >> and I don't know how to handle this. should I look into the counts of the >> exons of the single gene >> , or of the counts of exons the gene fusions? > > Remember that I've never used ht-seq, but I'll assume that it's easy > enough to find the "bins" that result from an exonic region shared > between two different genes. > > The way I'd handle these special cases is that I wouldn't handle them > at all and I'd likely prefer to remove those bins from my initial > analysis -- I suspect the number of such cases would be relatively few > when compared to the number of "kosher" bins. > > HTH, > -steve > > -- > Steve Lianoglou > Computational Biologist > Bioinformatics and Computational Biology > Genentech

ADD REPLY • link 10.8 years ago Andreia Fonseca ▴ 810

0

Entering edit mode

here is an attach with the summary of the counts I sent On Wed, Jun 26, 2013 at 10:00 PM, Andreia Fonseca <andreia.fonseca at="" gmail.com=""> wrote: > Thanks Steve, > > The vignette looks very useful. ;y question was not related with counting > into two exons of same gene. With Htseq we get counts for exons that > overlap different genes, so it creates feature fusing two or more genes, > and I don't know how to handle this. should I look into the counts of the > exons of the single gene > , or of the counts of exons the gene fusions? > > thanks a bunch > > > On Wed, Jun 26, 2013 at 5:27 PM, Steve Lianoglou <lianoglou.steve at="" gene.com=""> > wrote: > >> Hi Andreia, >> >> It's hard to parse the read stats you put into the email due to some >> forced wrapping, I guess, so I'm not quite sure what the numbers are >> that you are showing, but you say regarding the htseq counting for >> exons: >> >> > But then the most weird thing is that the total is being much higher >> than the >> > number of reads aligned, >> >> While I've never actually used ht-seq to count for DEXSeq, from the >> chatter on this list I'm pretty sure that it will likely double count >> a read if it spans more than one exon, ie. one exon will assign a +1 >> to the count for every exon it spans. If that can explain the issue >> that you are seeing, then this is expected behavior -- and moreover, >> you actually want that. >> >> As an aside, the vignette that comes with the parathyroidSE data >> packge has a section that shows you how to do DEXSeq entirely using >> R/bioconductor tools: >> >> >> http://bioconductor.org/packages/release/data/experiment/vignettes/ parathyroidSE/inst/doc/parathyroidSE.pdf >> >> HTH, >> >> -steve >> >> -- >> Steve Lianoglou >> Computational Biologist >> Bioinformatics and Computational Biology >> Genentech >> > > > > -- > > -------------------------------------------------------------------- ------------------------- > Andreia J. Amaral, PhD > BioFIG - Center for Biodiversity, Functional and Integrative Genomics > Instituto de Medicina Molecular > University of Lisbon > Tel: +352 217500000 (ext. office: 28253) > email:andreiaamaral at fm.ul.pt ; andreiaamaral at fc.ul.pt > -- ---------------------------------------------------------------------- ----------------------- Andreia J. Amaral, PhD BioFIG - Center for Biodiversity, Functional and Integrative Genomics Instituto de Medicina Molecular University of Lisbon Tel: +352 217500000 (ext. office: 28253) email:andreiaamaral at fm.ul.pt ; andreiaamaral at fc.ul.pt

ADD REPLY • link 10.8 years ago Andreia Fonseca ▴ 810

0

Entering edit mode

Hi, I'm not sure if I completely understand what the question is, but maybe the following remarks help: If a read overlaps with several exons of the _same_ gene, dexseq_count.py counts the read for each of these exons. If the read overlaps with exons from different genes, it gets discarded. Unfortunately, annotation files such as Ensembl's GTF file for humans, have a few overlapping genes: The right-most exons of one gene overlap or are even identical with the lef-most exons of the neighboring gene. According to the rules just stated, reads mapping to these exons should get discarded, because we would not be able anyway to say which of the two gene to count this for. One might, however, also argue that if an exon is assigned to two genes then these are not two genes, but different sets of isoforms of one and the same gene. Therefore, dexseq_prepare_annotation.py will "fuse" these genes into an "aggregate gene" and give it a name formed by concatenating the fused genes' names with plus signs. This sounded good in theory but, in practice, turned out to be not that useful, because such aggregate genes are hard to interpret. Furthermore, the Ensembl curators probably had good reasons to not fuse the two genes, and the better strategy might hence be to keep the genes seperate and accept that we cannot do inference on the one or two exons that overlap between he genes. Hence, the new version of dexseq_prepare_annotation.py now accepts a new option, '-r', which switches off gene aggregation ('-r no'). Simon

ADD REPLY • link 10.8 years ago Simon Anders ★ 3.7k

0

Entering edit mode

dear Simon, thanks for your reply, now its more clear what is happening. My issue was that even discarding the reads of the gene fusions, the total of counts of the exons was still much higher that the total number of reads aligned, now I understand from your reply and Steven that this will occurs when reads overlap 2 exons of the same gene, and then this explains the multiplication of reads that we observe. We will disregard the counts of gene fusions. Kind regards, Andreia On Thu, Jun 27, 2013 at 10:01 AM, Simon Anders <anders@embl.de> wrote: > Hi, > > I'm not sure if I completely understand what the question is, but maybe > the following remarks help: > > If a read overlaps with several exons of the _same_ gene, dexseq_count.py > counts the read for each of these exons. If the read overlaps with exons > from different genes, it gets discarded. > > Unfortunately, annotation files such as Ensembl's GTF file for humans, > have a few overlapping genes: The right-most exons of one gene overlap or > are even identical with the lef-most exons of the neighboring gene. > According to the rules just stated, reads mapping to these exons should get > discarded, because we would not be able anyway to say which of the two gene > to count this for. > > One might, however, also argue that if an exon is assigned to two genes > then these are not two genes, but different sets of isoforms of one and the > same gene. Therefore, dexseq_prepare_annotation.py will "fuse" these genes > into an "aggregate gene" and give it a name formed by concatenating the > fused genes' names with plus signs. > > This sounded good in theory but, in practice, turned out to be not that > useful, because such aggregate genes are hard to interpret. Furthermore, > the Ensembl curators probably had good reasons to not fuse the two genes, > and the better strategy might hence be to keep the genes seperate and > accept that we cannot do inference on the one or two exons that overlap > between he genes. > > Hence, the new version of dexseq_prepare_annotation.py now accepts a new > option, '-r', which switches off gene aggregation ('-r no'). > > Simon > -- ---------------------------------------------------------------------- ----------------------- Andreia J. Amaral, PhD BioFIG - Center for Biodiversity, Functional and Integrative Genomics Instituto de Medicina Molecular University of Lisbon Tel: +352 217500000 (ext. office: 28253) email:andreiaamaral@fm.ul.pt ; andreiaamaral@fc.ul.pt [[alternative HTML version deleted]]

ADD REPLY • link 10.8 years ago Andreia Fonseca ▴ 810