Search
Question: TranscriptDb of GENCODE Genes
0
5.2 years ago by
Dario Strbenac1.4k
Australia
Dario Strbenac1.4k wrote:
Hello, Who else uses the GENCODE annotation in their analyses ? I just found out that some transcripts are annotated as incomplete fragments. This is described in http://www.gencodegenes.org/gencode_tags.html but not in "GENCODE: the reference human genome annotation for The ENCODE Project." Genome Research, 2012. cds_end_NF : the coding region end could not be confirmed. cds_start_NF : the coding region start could not be confirmed. mRNA_end_NF : the mRNA end could not be confirmed. mRNA_start_NF : the mRNA start could not be confirmed. Over 10 % of transcripts are missing their RNA ends and almost as many are missing either a 5' UTR or a 3' UTR. /nb/dario/genes$egrep -c "(HAVANA|ENSEMBL) transcript" gencode.v17.annotation.gtf 194871 /nb/dario/genes$ egrep "(HAVANA|ENSEMBL) transcript" gencode.v17.annotation.gtf | grep -c mRNA_end_NF - 21699 /nb/dario/genes$egrep "(HAVANA|ENSEMBL) transcript" gencode.v17.annotation.gtf | grep -c cds_end_NF - 19788 Have you been using this gene annotation as-is for counting in windows around transcription start sites or transcription end sites ? Have you been using the functions fiveUTRsByTranscript or threeUTRsByTranscript ? If so, your results are incorrect, too. Also, can there be a way for the function makeTranscriptDbFromGFF to filter on elements of the attribute column ? This finding makes it unusable for reading into R the GENCODE annotation, as it now is. This can also be observed by noticing that some transcripts have a 3' UTR, but no 5' UTR, and vice-versa : genes<- makeTranscriptDbFromGFF("gencode.v17.annotation.gtf", format = "gtf", exonRankAttributeName = "exon_number") UTR5 <- fiveUTRsByTranscript(genes, use.names = TRUE) UTR3 <- threeUTRsByTranscript(genes, use.names = TRUE) whichNo3prime <- setdiff(names(UTR5), names(UTR3)) whichNo5prime <- setdiff(names(UTR3), names(UTR5)) > length(whichNo5prime) [1] 12217 > length(whichNo3prime) [1] 16675 So, 12217 have no 5' UTR, but a 3' UTR. 16675 transcripts have a 5' UTR, but no 3' UTR. Also, note that some transcripts don't have the expected attribute set. Have a look at ENST00000381469.2 in a genome browser and notice it's missing mRNA_start_NF. Or, is it possible to start translation from the very first 3 bases of a transcript ? -------------------------------------- Dario Strbenac PhD Student University of Sydney Camperdown NSW 2050 Australia ADD COMMENTlink modified 5.2 years ago by Marc Carlson7.2k • written 5.2 years ago by Dario Strbenac1.4k 0 5.2 years ago by Marc Carlson7.2k United States Marc Carlson7.2k wrote: Hi Dario, I think it's great to discuss this. It is not often enough that people point out just how unusual some of these gtf files can be. However, it is not intended job of the makeTranscriptDbFromGFF() function to do data cleanup or custom pre-filtering. That function is already stressed enough just handling all the peculiar ways that GFF and GTF files can essentially represent the same information. So much so that that my efforts to have one job per function is already being strained in this case. But if you wanted to make some functions that cleaned up unwanted features from gtf and gff files. It is possible that other people might find that useful too. Marc On 08/09/2013 12:00 AM, Dario Strbenac wrote: > Hello, > > Who else uses the GENCODE annotation in their analyses ? I just found out that some transcripts are annotated as incomplete fragments. This is described in http://www.gencodegenes.org/gencode_tags.html but not in "GENCODE: the reference human genome annotation for The ENCODE Project." Genome Research, 2012. > > cds_end_NF : the coding region end could not be confirmed. > cds_start_NF : the coding region start could not be confirmed. > mRNA_end_NF : the mRNA end could not be confirmed. > mRNA_start_NF : the mRNA start could not be confirmed. > > Over 10 % of transcripts are missing their RNA ends and almost as many are missing either a 5' UTR or a 3' UTR. > > /nb/dario/genes$ egrep -c "(HAVANA|ENSEMBL) transcript" gencode.v17.annotation.gtf > 194871 > /nb/dario/genes$egrep "(HAVANA|ENSEMBL) transcript" gencode.v17.annotation.gtf | grep -c mRNA_end_NF - > 21699 > /nb/dario/genes$ egrep "(HAVANA|ENSEMBL) transcript" gencode.v17.annotation.gtf | grep -c cds_end_NF - > 19788 > > Have you been using this gene annotation as-is for counting in windows around transcription start sites or transcription end sites ? Have you been using the functions fiveUTRsByTranscript or threeUTRsByTranscript ? If so, your results are incorrect, too. > > Also, can there be a way for the function makeTranscriptDbFromGFF to filter on elements of the attribute column ? This finding makes it unusable for reading into R the GENCODE annotation, as it now is. > > This can also be observed by noticing that some transcripts have a 3' UTR, but no 5' UTR, and vice-versa : > > genes<- makeTranscriptDbFromGFF("gencode.v17.annotation.gtf", format = "gtf", exonRankAttributeName = "exon_number") > UTR5 <- fiveUTRsByTranscript(genes, use.names = TRUE) > UTR3 <- threeUTRsByTranscript(genes, use.names = TRUE) > whichNo3prime <- setdiff(names(UTR5), names(UTR3)) > whichNo5prime <- setdiff(names(UTR3), names(UTR5)) > >> length(whichNo5prime) > [1] 12217 >> length(whichNo3prime) > [1] 16675 > > So, 12217 have no 5' UTR, but a 3' UTR. 16675 transcripts have a 5' UTR, but no 3' UTR. > > Also, note that some transcripts don't have the expected attribute set. Have a look at ENST00000381469.2 in a genome browser and notice it's missing mRNA_start_NF. Or, is it possible to start translation from the very first 3 bases of a transcript ? > > -------------------------------------- > Dario Strbenac > PhD Student > University of Sydney > Camperdown NSW 2050 > Australia > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor