aggregate genes in DEXSeq

0

Entering edit mode

Rao,Xiayu ▴ 550

@raoxiayu-6003

Last seen 8.9 years ago

United States

Alejandro Reyes <alejandro.reyes at="" ...=""> writes: > > Dear Julien, Dear Mar and people interested in DEXSeq , > > You recently reported some problems in DEXSeq that had to do with the > way the HTSeq python scripts deal with the exons that overlap with more > than one gene ID. > > The solution that we had taken so far was that the gene IDs sharing an > exon were merged into an "aggregate gene" ID. From the input of some > users and our own experience, we know that it was not the most > appropriate solution: when the merged genes were differentially > expressed, DEXSeq falsely calls differential usage in other exons of the > aggregate genes. We have included a "-r" parameter in the script > "prepare_annotation_dexseq.py", for the user to decide what to do with > these exons: either to ignore the exons associated with more than one > gene IDs and treat each gene separately, or to merge the genes and take > these exons into account. > > Additionally, we have implemented the R/Bioconductor functions > equivalent to the python scripts. These functions were implemented using > code contributed by Mike Love. > > All these changes are available in the last svn version (1.5.9). > > Best regards, > Alejandro Reyes > > Hi Alejandro, > Just to let you know that adding the junctions to the test of > differential expression of DEXSeq worked fine! The "hack" was actually > straightforward, I just had to modify the counts files taken as input. > > On a different note, I noticed that many false positives were generated > because of "aggregate" gene models that were composed on different > overlapping genes. When these overlapping genes have different behavior > in different conditions, this is interpreted as differential expression > of some exons, while it is differential expression of genes... See the > attached picture, this might turn out to be easier to understand > Did you notice this behavior of DEXSeq, and do you have any comment on > this? > > Thanks again for your work on DEXSeq > Julien > > _______________________________________________ > Bioconductor mailing list > Bioconductor at ... > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > Hi, Alejandro We recently used the python script dexseq_prepare_annotation.py to generate exon.gtf from genes.gtf. Similar to the situations in your previous discussion, we found some overlapping exons which share patial/all regions and in the meanwhile belong to different genes, which you can see below, the 1st and the 4th line. Even after using "-r yes" parameter, we still see these happening. "-r yes" should be the default. We thought by merging the genes and taking all these exons into account would generate non-overlapping exon.gtf in the end, but now the gtf still have some overlapping exons. We are concerned if the further count files generated would be biased in these regions. Do you have any lastest update about the solution of this problem? In your last message, you mentioned about the R functions equivalent to the python scripts which were contributed by Mike Love. Could you provide more details about how we can find these functions, like the name or the website of it? Thank you very much in advance! Thanks, Xiayu 1 dexseq_prepare_annotation.py exonic_part 13671 14409 . + . transcripts "ENST00000456328+ENST00000515242+ENST00000518655"; exonic_ part_number "018"; gene_id "ENSG00000223972" 1 dexseq_prepare_annotation.py exonic_part 14410 14412 . + . transcripts "ENST00000515242"; exonic_part_number "019"; gene_id "ENSG 00000223972" 1 dexseq_prepare_annotation.py aggregate_gene 14363 29806 . - . gene_id "ENSG00000227232" 1 dexseq_prepare_annotation.py exonic_part 14363 14403 . - . transcripts "ENST00000541675+ENST00000423562+ENST00000438504"; exonic_ part_number "001"; gene_id "ENSG00000227232"

DEXSeq DEXSeq • 1.7k views

ADD COMMENT • link updated 10.1 years ago by Alejandro Reyes ★ 1.9k • written 10.1 years ago by Rao,Xiayu ▴ 550

0

Entering edit mode

Alejandro Reyes ★ 1.9k

@alejandro-reyes-5124

Last seen 7 days ago

Novartis Institutes for BioMedical Rese…

Dear Xiayu, To avoid genes from merging you have to use "-r no". You could do the counding steps inside R using the function "buildExonCountSet" Best regards, Alejandro > Alejandro Reyes <alejandro.reyes at="" ...=""> writes: > >> Dear Julien, Dear Mar and people interested in DEXSeq , >> >> You recently reported some problems in DEXSeq that had to do with the >> way the HTSeq python scripts deal with the exons that overlap with more >> than one gene ID. >> >> The solution that we had taken so far was that the gene IDs sharing an >> exon were merged into an "aggregate gene" ID. From the input of some >> users and our own experience, we know that it was not the most >> appropriate solution: when the merged genes were differentially >> expressed, DEXSeq falsely calls differential usage in other exons of the >> aggregate genes. We have included a "-r" parameter in the script >> "prepare_annotation_dexseq.py", for the user to decide what to do with >> these exons: either to ignore the exons associated with more than one >> gene IDs and treat each gene separately, or to merge the genes and take >> these exons into account. >> >> Additionally, we have implemented the R/Bioconductor functions >> equivalent to the python scripts. These functions were implemented using >> code contributed by Mike Love. >> >> All these changes are available in the last svn version (1.5.9). >> >> Best regards, >> Alejandro Reyes >> >> Hi Alejandro, >> Just to let you know that adding the junctions to the test of >> differential expression of DEXSeq worked fine! The "hack" was actually >> straightforward, I just had to modify the counts files taken as input. >> >> On a different note, I noticed that many false positives were generated >> because of "aggregate" gene models that were composed on different >> overlapping genes. When these overlapping genes have different behavior >> in different conditions, this is interpreted as differential expression >> of some exons, while it is differential expression of genes... See the >> attached picture, this might turn out to be easier to understand >> Did you notice this behavior of DEXSeq, and do you have any comment on >> this? >> >> Thanks again for your work on DEXSeq >> Julien >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at ... >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > Hi, Alejandro > > We recently used the python script dexseq_prepare_annotation.py to > generate exon.gtf from genes.gtf. Similar to the situations in your > previous discussion, we found some overlapping exons which share > patial/all regions and in the meanwhile belong to different genes, which > you can see below, the 1st and the 4th line. Even after using "-r yes" > parameter, we still see these happening. "-r yes" should be the default. > We thought by merging the genes and taking all these exons into account > would generate non-overlapping exon.gtf in the end, but now the gtf still > have some overlapping exons. We are concerned if the further count files > generated would be biased in these regions. Do you have any lastest update > about the solution of this problem? > > In your last message, you mentioned about the R functions equivalent to > the python scripts which were contributed by Mike Love. Could you provide > more details about how we can find these functions, like the name or the > website of it? > > Thank you very much in advance! > > Thanks, > Xiayu > > > 1 dexseq_prepare_annotation.py exonic_part 13671 > 14409 . + . > transcripts "ENST00000456328+ENST00000515242+ENST00000518655"; exonic_ > part_number "018"; gene_id "ENSG00000223972" > 1 dexseq_prepare_annotation.py exonic_part 14410 > 14412 . + . transcripts "ENST00000515242"; > exonic_part_number "019"; gene_id "ENSG > 00000223972" > 1 dexseq_prepare_annotation.py aggregate_gene 14363 > 29806 . - . gene_id "ENSG00000227232" > 1 dexseq_prepare_annotation.py exonic_part 14363 > 14403 . - . > transcripts "ENST00000541675+ENST00000423562+ENST00000438504"; exonic_ > part_number "001"; gene_id "ENSG00000227232" > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 10.1 years ago Alejandro Reyes ★ 1.9k

Login before adding your answer.