Search
Question: makeTranscriptDbFromGFF for Unstranded Transcripts
0
5.1 years ago by
Dario Strbenac1.4k
Australia
Dario Strbenac1.4k wrote:
Hello, I have assembled transcripts based on unstranded RNA-seq data. It's not the ideal experimental design, of course. Is it not possible to make a TranscriptDb object from this ? makeTranscriptDbFromGFF("transcripts.gtf", format = "gtf", exonRankAttributeName = "exon_number") extracting transcript information Estimating transcript ranges. Extracting gene IDs Processing splicing information for gtf file. Prepare the 'metadata' data frame ... metadata: OK Now generating chrominfo from available sequence names. No chromosome length information is available. Error in .normargTranscripts(transcripts) : values in 'transcripts$tx_strand' must be "+" or "-" R version 3.0.0 (2013-04-03) Platform: x86_64-w64-mingw32/x64 (64-bit) GenomicFeatures_1.12.1 -------------------------------------- Dario Strbenac PhD Student University of Sydney Camperdown NSW 2050 Australia ADD COMMENTlink modified 5.1 years ago by Hervé Pagès ♦♦ 13k • written 5.1 years ago by Dario Strbenac1.4k 0 5.1 years ago by Hervé Pagès ♦♦ 13k United States Hervé Pagès ♦♦ 13k wrote: Hi Dario, On 05/07/2013 10:00 PM, Dario Strbenac wrote: > Hello, > > I have assembled transcripts based on unstranded RNA-seq data. It's not the ideal experimental design, of course. Is it not possible to make a TranscriptDb object from this ? Right now, it's not possible. Note that in the TranscriptDb object returned by makeTranscriptDbFromGFF(), we really want to see stranded transcripts, exons, and CDSs. Because: (1) the purpose of a TranscriptDb is to describe a gene model, and the strand information (especially the exon strand) is an *essential* part of the gene model, and (2) there are too many tools/too much code around that expect this. The same could probably be said of GTF and GFF files, and I wonder if storing a set of unstranded mRNAs, exons, CDSs in those files is considered valid. Anyway, if we wanted makeTranscriptDbFromGFF() to support such GTF and GFF files, we would need to automatically replace all missing strands by a + or a -. This would need to be done in a blunt way e.g. all missing mRNA/exon/CDS strands would be replaced with a + with a warning. Trying to make makeTranscriptDbFromGFF() smarter by looking whether the exons within a given mRNA have their start/end going up or down when the exon rank is going up would be too problematic. > > makeTranscriptDbFromGFF("transcripts.gtf", format = "gtf", exonRankAttributeName = "exon_number") Ok, so you've managed to store the exon rank in your file. But that means that you must have *implicitly* chosen a strand for your exons right? Because, within a given mRNA, when the exon rank is going up, start/end should normally go up on the + strand and down on the - strand. Why not just set the strand for the mRNAs / exons / CDSs in your GTF file? There is too much going on here so, at this point, I'm not convinced makeTranscriptDbFromGFF() should decide the strand for you. Cheers, H. > extracting transcript information > Estimating transcript ranges. > Extracting gene IDs > Processing splicing information for gtf file. > Prepare the 'metadata' data frame ... metadata: OK > Now generating chrominfo from available sequence names. No chromosome length information is available. > Error in .normargTranscripts(transcripts) : > values in 'transcripts$tx_strand' must be "+" or "-" > > R version 3.0.0 (2013-04-03) > Platform: x86_64-w64-mingw32/x64 (64-bit) > GenomicFeatures_1.12.1 > > -------------------------------------- > Dario Strbenac > PhD Student > University of Sydney > Camperdown NSW 2050 > Australia > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
> The same could probably be said of GTF and GFF files, and I wonder if > storing a set of unstranded mRNAs, exons, CDSs in those files is > considered valid. >From the specification, it is. strand - Valid entries include '+', '-', or '.' (for don't know/don't care). > Anyway, if we wanted makeTranscriptDbFromGFF() to support such GTF and > GFF files, we would need to automatically replace all missing strands > by a + or a -. It is better if it retains the error result, so there is no ambiguity. Adding a sentence about this to the help file would be useful, since users will expect that it reads in all valid GTF and GFF files. > > makeTranscriptDbFromGFF("transcripts.gtf", format = "gtf", exonRankAttributeName = "exon_number") > Ok, so you've managed to store the exon rank in your file. But that > means that you must have *implicitly* chosen a strand for your exons > right? Cufflinks can infer the strand of the transcript for multi-exon transcripts by looking for the canonical GT-AG splice site in reads mapping across an intron, but not for single exon genes. So, it outputs a strand for some genes and not for others.