Recommended gene model for DESeq

0

Entering edit mode

Assaf Gordon ▴ 30

@assaf-gordon-5206

Last seen 9.7 years ago

Hello, I have a question regarding the gene model source to use with DESeq. Assuming the following workflow: 1. Map reads to genome (bowtie/tophat/bwa/etc). 2. Count hits-per-gene (HTSeq / CoverageBed / etc. ) 3. Repeat 1,2 for all samples, merge together into one table. 4. Run DESeq on merged table. My question is about step 2: What is the recommended gene model to use when counting hits-per-gene ? RefSeq-Genes, UCSC Known Genes, Ensembl Genes and others come to mind, but those usually contain multiple transcripts per gene as different records - would that skew the DESeq results? (Note that I'm interested in gene-level differential expression, not worried about isoform-level differential expression). I've read previous discussions about transcript vs. gene level [1] and exon level considerations [2] but perhaps I've missed the bottom line: Is it OK to have multiple isoforms per gene (and treat each transcript as "gene record", which will result in some double-counting of reads), or do I need to pre-process the gene model file, to ensure there are no overlaps (e.g. by merging all isoforms of a single gene) ? Or, is some post-processing needed to the DESeq results (from nbinomTest()) to "normalize" genes with multiple isoforms? Any suggestions and comments will be appreciated (or corrections, if something above is wrong). Thanks, -gordon [1] http://article.gmane.org/gmane.science.biology.informatics.conduct or/38805/ [2] http://article.gmane.org/gmane.science.biology.informatics.conduct or/38915/

DESeq DESeq • 2.4k views

ADD COMMENT • link updated 12.1 years ago by Simon Anders ★ 3.7k • written 12.1 years ago by Assaf Gordon ▴ 30

0

Entering edit mode

Simon Anders ★ 3.7k

@simon-anders-3855

Last seen 3.8 years ago

Zentrum für Molekularbiologie, Universi…

Hi Gordon On 04/04/2012 10:30 PM, Assaf Gordon wrote: > I've read previous discussions about transcript vs. gene level [1] > and exon level considerations [2] but perhaps I've missed the bottom > line: Is it OK to have multiple isoforms per gene (and treat each > transcript as "gene record", which will result in some > double-counting of reads), or do I need to pre-process the gene model > file, to ensure there are no overlaps (e.g. by merging all isoforms > of a single gene) ? Or, is some post-processing needed to the DESeq > results (from nbinomTest()) to "normalize" genes with multiple > isoforms? You should make sure that each read is counted only once per gene. Hence, counting per transcript and the summing up the counts for the transcripts from the same gene is not a good idea. I use my htseq-count script (part of HTSeq, http://www-huber.embl.de/users/anders/HTSeq/). This Python script takes a GTF file and a SAM file and output a count table per gene. To this end, it finds all the exons with which the read overlaps and then checks whether they all have the same Gene ID. If so, it counts the read for this gene. If you want to stay in R: Valerie Obenchain has recently added functionality to GenomicRanges to perform counting in a way similar to my htseq-count script, and described this in the vignette 'countGenomicOverlaps.pdf' (in the GenomicRanges package). Also look at Nico Delhomme's new package 'easyRNASeq' which is also meant to facilitate this task. A subtle but very annoying problem is this: According to the GTF file standard specs, an "exon" or "CDS" line in a GTF file shlould have two attributes labelled "gene_id" and "transcript_id". Exon lines describing different transcripts of the same gene should hence have different transcript IDs but the same gene ID. Ensembl honors this rule while GTF files downloaded from the UCSC Table browser just have the transcript ID repeated as gene ID, which defeats the purpose, of course. A script won't notice that two transcripts belong to the same gene because they do not have the same gene ID. Possible fixes: (i) Use GTF files from Ensembl. (ii) Ask the UCSC people to fix the issue. (iii) Fix a UCSC GTF file by removing from the gene_id attibute the number behind the last dot. Hence, taking e.g., this line from the UCSC GTF file for hg19: chr1 hg19_knownGene exon 16858 17055 0.000000 - . gene_id "uc009vjd.2"; transcript_id "uc009vjd.2"; change it to this chr1 hg19_knownGene exon 16858 17055 0.000000 - . gene_id "uc009vjd"; transcript_id "uc009vjd.2"; This change should be easy to make on the fly by any of the three methods mentioned above. At least mine does not offer this functionality, but I should add it, I suppose. Valerie and Nico, how do your functions deal with this? Simon

ADD COMMENT • link 12.1 years ago Simon Anders ★ 3.7k

0

Entering edit mode

Hi Simon, I support fully Ensembl GTF files. As for other GFF files, I describe in the vignette of the easyRNASeq package how to generate/modify them so that I can deal with them. I'm as well performing a number of validity checks on the resulting gene models and issue warnings whenever something that might result in double-counting occurs. This implies that the generated gene model might need to be further filtered after generation. That's something I need to describe better how to do in the vignette. Gordon, if you try easyRNASeq, let me know what's your GFF file of choice and I'll help out as that would be helpful to generate the missing documentation :-) Cheers, Nico --------------------------------------------------------------- Nicolas Delhomme Genome Biology Computational Support European Molecular Biology Laboratory Tel: +49 6221 387 8310 Email: nicolas.delhomme at embl.de Meyerhofstrasse 1 - Postfach 10.2209 69102 Heidelberg, Germany --------------------------------------------------------------- On 5 Apr 2012, at 11:11, Simon Anders wrote: > Hi Gordon > > On 04/04/2012 10:30 PM, Assaf Gordon wrote: >> I've read previous discussions about transcript vs. gene level [1] >> and exon level considerations [2] but perhaps I've missed the bottom >> line: Is it OK to have multiple isoforms per gene (and treat each >> transcript as "gene record", which will result in some >> double-counting of reads), or do I need to pre-process the gene model >> file, to ensure there are no overlaps (e.g. by merging all isoforms >> of a single gene) ? Or, is some post-processing needed to the DESeq >> results (from nbinomTest()) to "normalize" genes with multiple >> isoforms? > > You should make sure that each read is counted only once per gene. Hence, counting per transcript and the summing up the counts for the transcripts from the same gene is not a good idea. > > I use my htseq-count script (part of HTSeq, http://www- huber.embl.de/users/anders/HTSeq/) This Python script takes a GTF file and a SAM file and output a count table per gene. To this end, it finds all the exons with which the read overlaps and then checks whether they all have the same Gene ID. If so, it counts the read for this gene. > > If you want to stay in R: Valerie Obenchain has recently added functionality to GenomicRanges to perform counting in a way similar to my htseq-count script, and described this in the vignette 'countGenomicOverlaps.pdf' (in the GenomicRanges package). Also look at Nico Delhomme's new package 'easyRNASeq' which is also meant to facilitate this task. > > > A subtle but very annoying problem is this: > > According to the GTF file standard specs, an "exon" or "CDS" line in a GTF file shlould have two attributes labelled "gene_id" and "transcript_id". Exon lines describing different transcripts of the same gene should hence have different transcript IDs but the same gene ID. Ensembl honors this rule while GTF files downloaded from the UCSC Table browser just have the transcript ID repeated as gene ID, which defeats the purpose, of course. A script won't notice that two transcripts belong to the same gene because they do not have the same gene ID. > > Possible fixes: (i) Use GTF files from Ensembl. (ii) Ask the UCSC people to fix the issue. (iii) Fix a UCSC GTF file by removing from the gene_id attibute the number behind the last dot. Hence, taking e.g., this line from the UCSC GTF file for hg19: > > chr1 hg19_knownGene exon 16858 17055 0.000000 - . gene_id "uc009vjd.2"; transcript_id "uc009vjd.2"; > > change it to this > > chr1 hg19_knownGene exon 16858 17055 0.000000 - . gene_id "uc009vjd"; transcript_id "uc009vjd.2"; > > > This change should be easy to make on the fly by any of the three methods mentioned above. At least mine does not offer this functionality, but I should add it, I suppose. Valerie and Nico, how do your functions deal with this? > > > Simon > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 12.1 years ago delhomme@embl.de ★ 1.2k

0

Entering edit mode

On 04/05/2012 02:11 AM, Simon Anders wrote: > Hi Gordon > > On 04/04/2012 10:30 PM, Assaf Gordon wrote: >> I've read previous discussions about transcript vs. gene level [1] >> and exon level considerations [2] but perhaps I've missed the bottom >> line: Is it OK to have multiple isoforms per gene (and treat each >> transcript as "gene record", which will result in some >> double-counting of reads), or do I need to pre-process the gene model >> file, to ensure there are no overlaps (e.g. by merging all isoforms >> of a single gene) ? Or, is some post-processing needed to the DESeq >> results (from nbinomTest()) to "normalize" genes with multiple >> isoforms? > > You should make sure that each read is counted only once per gene. > Hence, counting per transcript and the summing up the counts for the > transcripts from the same gene is not a good idea. > > I use my htseq-count script (part of HTSeq, > http://www-huber.embl.de/users/anders/HTSeq/). This Python script > takes a GTF file and a SAM file and output a count table per gene. To > this end, it finds all the exons with which the read overlaps and then > checks whether they all have the same Gene ID. If so, it counts the > read for this gene. > > If you want to stay in R: Valerie Obenchain has recently added > functionality to GenomicRanges to perform counting in a way similar to > my htseq-count script, and described this in the vignette > 'countGenomicOverlaps.pdf' (in the GenomicRanges package). We renamed the vignette to 'Overview of summarizeOverlaps'. Also see ?summarizeOverlaps. Valerie > Also look at Nico Delhomme's new package 'easyRNASeq' which is also > meant to facilitate this task. > > > A subtle but very annoying problem is this: > > According to the GTF file standard specs, an "exon" or "CDS" line in a > GTF file shlould have two attributes labelled "gene_id" and > "transcript_id". Exon lines describing different transcripts of the > same gene should hence have different transcript IDs but the same gene > ID. Ensembl honors this rule while GTF files downloaded from the UCSC > Table browser just have the transcript ID repeated as gene ID, which > defeats the purpose, of course. A script won't notice that two > transcripts belong to the same gene because they do not have the same > gene ID. > > Possible fixes: (i) Use GTF files from Ensembl. (ii) Ask the UCSC > people to fix the issue. (iii) Fix a UCSC GTF file by removing from > the gene_id attibute the number behind the last dot. Hence, taking > e.g., this line from the UCSC GTF file for hg19: > > chr1 hg19_knownGene exon 16858 17055 0.000000 - > . gene_id "uc009vjd.2"; transcript_id "uc009vjd.2"; > > change it to this > > chr1 hg19_knownGene exon 16858 17055 0.000000 - > . gene_id "uc009vjd"; transcript_id "uc009vjd.2"; > > > This change should be easy to make on the fly by any of the three > methods mentioned above. At least mine does not offer this > functionality, but I should add it, I suppose. Valerie and Nico, how > do your functions deal with this? > > > Simon > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 12.1 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Thank you for the quick answer. A side-note regarding UCSC's GTF export: Simon Anders wrote, On 04/05/2012 05:11 AM: > > A subtle but very annoying problem is this: > > According to the GTF file standard specs, an "exon" or "CDS" line in > a GTF file shlould have two attributes labelled "gene_id" and > "transcript_id". Exon lines describing different transcripts of the > same gene should hence have different transcript IDs but the same > gene ID. Ensembl honors this rule while GTF files downloaded from the > UCSC Table browser just have the transcript ID repeated as gene ID, > which defeats the purpose, of course. A script won't notice that two > transcripts belong to the same gene because they do not have the same > gene ID. > > Possible fixes: (i) Use GTF files from Ensembl. (ii) Ask the UCSC > people to fix the issue. (iii) Fix a UCSC GTF file by removing from > the gene_id attibute the number behind the last dot. Hence, taking > e.g., this line from the UCSC GTF file for hg19: > The UCSC Table Browser indeed produces unsuitable GTF files, it's a known issue and they're not going to fix it (technically: the format conversion code uses a BED-like structure for all formats, and BED has only one ID per record). The correct way to export usable GFF files directly from UCSC: 1. Download a program called "genePredToGtf" from here: http://hgdownload.cse.ucsc.edu/admin/exe/ or compile it from source ( http://genome.ucsc.edu/admin/git.html ) if you're feeling lucky. 2. Create the following file in your home directory: $ cat ~/.hg.conf db.host=genome-mysql.cse.ucsc.edu db.user=genomep db.password=password # the file's permissions must be user-only $ chmod 0600 ~/.hg.conf 3. run "genePredToGtf" with any organism and any table that is in "genePred" format: ## mm9/RefSeq Genes $ genePredToGtf mm9 refGene refGene.gtf ## hg19/Ensemble Genes $ genePredToGtf hg19 ensGene ensGene.gtf ## hg18/UCSC Known Genes $ genePredToGtf hg18 knownGene knownGene.gtf This will save "refGene.gtf" with all the required attributes (gene_id, gene_name transcript_id, exon_number). (but still, per your explanation, not directly usable with DESeq because of multiple isoforms per gene). -gordon

ADD REPLY • link 12.1 years ago Assaf Gordon ▴ 30

0

Entering edit mode

Thank you all for your responses. I'm still looking for the optimal way to count hits for RNA-Seq Paired-end data, may I ask for couple of clarifications? Simon Anders wrote, On 04/05/2012 05:11 AM: > You should make sure that each read is counted only once per gene. Once per gene - got it. What about a case where a read matches multiple genes? (described as "ambiguous" in HTSeq-Count/GenomicRanges "modes") Is it OK to count this read several times (once for each gene, multiple different genes), or would that invalidate the results? It seems "easyRNASeq" will count a read multiple times (once per gene) when using "geneModels" summarization mode (based on [1], page 10) - so can it be used? > If you want to stay in R: Valerie Obenchain has recently added functionality to GenomicRanges to perform counting in a way similar to my htseq-count script Related question: handling paired-end data correctly. It seems only GenomicRanges does not handle paired-end reads at all (based on [2], page 2, section "3. counting mode") - so the only option is "htseq-count" - is that correct ? I also couldn't find any mention of paired-end data in "easyRNASeq" PDF [1], so I don't know if it handles that or not. Thanks, -gordon [1] http://bioconductor.org/packages/release/bioc/vignettes/easyRNASeq /inst/doc/easyRNASeq.pdf [2] http://bioconductor.org/packages/release/bioc/vignettes/GenomicRan ges/inst/doc/summarizeOverlaps-modes.pdf

ADD REPLY • link 12.1 years ago Assaf Gordon ▴ 30

0

Entering edit mode

Hi Gordon On 2012-04-06 23:24, Assaf Gordon wrote: > Once per gene - got it. What about a case where a read matches > multiple genes? (described as "ambiguous" in > HTSeq-Count/GenomicRanges "modes") Is it OK to count this read > several times (once for each gene, multiple different genes), or > would that invalidate the results? HTSeq-count discards reads that map to several genes and counts vthem as "ambiguous". I've explained the reason a while ago in a SeqAnswers thread, in post #4 here: http://seqanswers.com/forums/showthread.php?t=9129 "Imagine we have two paralogous genes that have identical sequence at one half or their length and divergent sequence at the other half, and one of these genes is differentially expressed and the other is not. All reads that stem from the identical-sequence parts of the transcripts will map to both genes, and if we include them in our counts, both genes will appear to be differentially expressed, even though only one is really. If we count only the uniquely mapping reads (i.e., those stemming from the divergent parts of the transcripts), we are safe." Simon

ADD REPLY • link 12.1 years ago Simon Anders ★ 3.7k

0

Entering edit mode

On 6 Apr 2012, at 23:24, Assaf Gordon wrote: > Thank you all for your responses. > > I'm still looking for the optimal way to count hits for RNA-Seq Paired-end data, may I ask for couple of clarifications? > > Simon Anders wrote, On 04/05/2012 05:11 AM: >> You should make sure that each read is counted only once per gene. > > Once per gene - got it. > What about a case where a read matches multiple genes? (described as "ambiguous" in HTSeq-Count/GenomicRanges "modes") > Is it OK to count this read several times (once for each gene, multiple different genes), or would that invalidate the results? > > It seems "easyRNASeq" will count a read multiple times (once per gene) when using "geneModels" summarization mode (based on [1], page 10) - so can it be used? > That's only TRUE if the user does not pay attention to the warnings easyRNASeq is emitting. easyRNASeq creates geneModels per strand, so indeed genes overlapping on opposite strand my end up in double counting of reads. I do not want easyRNASeq to be a "black box", I want the user to understand what (s)he's doing and what (s)he's looking at. This is why I have many sanity checks that return warnings and guide the user. I can perfectly imagine that in a given tissue it is known that one gene is expressed whereas the gene on the other strand is not and that the user wants to keep the full geneModels for that first gene. What is still missing from the documentation (vignette) and that I need to add is how to rework the obtained gene model(s). I'm actually thinking of making the geneModel generation more flexible for the user, i.e. returning an annotated structure with synthetic exons (exonic regions as unique as possible, similar to what HTSeq does, i.e. including the complementary strand overlaps) and a set of methods that the user could apply to decide what (s)he's more interested in. Another word of caution here: I do not check the unicity of the reads in the bam file. I expect the user to know enough about the aligner (s)he is using for his/her application to decide on that. From the literature it seems that people only retains uniquely mapped reads (for most applications). I could add a sanity check for that. > >> If you want to stay in R: Valerie Obenchain has recently added functionality to GenomicRanges to perform counting in a way similar to my htseq-count script > > Related question: handling paired-end data correctly. > It seems only GenomicRanges does not handle paired-end reads at all (based on [2], page 2, section "3. counting mode") - so the only option is "htseq-count" - is that correct ? > I also couldn't find any mention of paired-end data in "easyRNASeq" PDF [1], so I don't know if it handles that or not. I thought I wrote it in the vignette (I'll double check that). It can't at the moment but will soon; I'm working on it. So at the moment you might be better set using htseq-count, but keep an eye open ;-) Cheers, Nico > > > Thanks, > -gordon > > > > [1] http://bioconductor.org/packages/release/bioc/vignettes/easyRNAS eq/inst/doc/easyRNASeq.pdf > [2] http://bioconductor.org/packages/release/bioc/vignettes/GenomicR anges/inst/doc/summarizeOverlaps-modes.pdf > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 12.1 years ago delhomme@embl.de ★ 1.2k

0

Entering edit mode

Hi Nico On 2012-04-07 12:36, Nicolas Delhomme wrote: > Another word of caution here: I do not check the unicity of the reads > in the bam file. I expect the user to know enough about the aligner > (s)he is using for his/her application to decide on that. From the > literature it seems that people only retains uniquely mapped reads > (for most applications). I could add a sanity check for that. I've added to htseq-count a check that a read gets skipped if it has tag "NH:i:n" with n>1. However, not all aligner use the NH tag properly, so this is not a full solution. Another possibility is to discard everything with mapping quality below 10, because non-uniquly mapped reads should have a MAPQ<3. How is life in Ume?? Simon

ADD REPLY • link 12.1 years ago Simon Anders ★ 3.7k

0

Entering edit mode

Hi Simon, These are good sanity checks indeed. I'll add that to my TODO. Nico --------------------------------------------------------------- Nicolas Delhomme Genome Biology Computational Support European Molecular Biology Laboratory Tel: +49 6221 387 8310 Email: nicolas.delhomme at embl.de Meyerhofstrasse 1 - Postfach 10.2209 69102 Heidelberg, Germany --------------------------------------------------------------- On 7 Apr 2012, at 16:30, Simon Anders wrote: > Hi Nico > > On 2012-04-07 12:36, Nicolas Delhomme wrote: >> Another word of caution here: I do not check the unicity of the reads >> in the bam file. I expect the user to know enough about the aligner >> (s)he is using for his/her application to decide on that. From the >> literature it seems that people only retains uniquely mapped reads >> (for most applications). I could add a sanity check for that. > > I've added to htseq-count a check that a read gets skipped if it has tag "NH:i:n" with n>1. However, not all aligner use the NH tag properly, so this is not a full solution. Another possibility is to discard everything with mapping quality below 10, because non-uniquly mapped reads should have a MAPQ<3. > > How is life in Ume?? > > Simon > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 12.1 years ago delhomme@embl.de ★ 1.2k

Login before adding your answer.