TranscriptDb of GENCODE Genes
1
0
Entering edit mode
Dario Strbenac ★ 1.5k
@dario-strbenac-5916
Last seen 7 days ago
Australia
Hello, Who else uses the GENCODE annotation in their analyses ? I just found out that some transcripts are annotated as incomplete fragments. This is described in http://www.gencodegenes.org/gencode_tags.html but not in "GENCODE: the reference human genome annotation for The ENCODE Project." Genome Research, 2012. cds_end_NF : the coding region end could not be confirmed. cds_start_NF : the coding region start could not be confirmed. mRNA_end_NF : the mRNA end could not be confirmed. mRNA_start_NF : the mRNA start could not be confirmed. Over 10 % of transcripts are missing their RNA ends and almost as many are missing either a 5' UTR or a 3' UTR. /nb/dario/genes$ egrep -c "(HAVANA|ENSEMBL) transcript" gencode.v17.annotation.gtf 194871 /nb/dario/genes$ egrep "(HAVANA|ENSEMBL) transcript" gencode.v17.annotation.gtf | grep -c mRNA_end_NF - 21699 /nb/dario/genes$ egrep "(HAVANA|ENSEMBL) transcript" gencode.v17.annotation.gtf | grep -c cds_end_NF - 19788 Have you been using this gene annotation as-is for counting in windows around transcription start sites or transcription end sites ? Have you been using the functions fiveUTRsByTranscript or threeUTRsByTranscript ? If so, your results are incorrect, too. Also, can there be a way for the function makeTranscriptDbFromGFF to filter on elements of the attribute column ? This finding makes it unusable for reading into R the GENCODE annotation, as it now is. This can also be observed by noticing that some transcripts have a 3' UTR, but no 5' UTR, and vice-versa : genes<- makeTranscriptDbFromGFF("gencode.v17.annotation.gtf", format = "gtf", exonRankAttributeName = "exon_number") UTR5 <- fiveUTRsByTranscript(genes, use.names = TRUE) UTR3 <- threeUTRsByTranscript(genes, use.names = TRUE) whichNo3prime <- setdiff(names(UTR5), names(UTR3)) whichNo5prime <- setdiff(names(UTR3), names(UTR5)) > length(whichNo5prime) [1] 12217 > length(whichNo3prime) [1] 16675 So, 12217 have no 5' UTR, but a 3' UTR. 16675 transcripts have a 5' UTR, but no 3' UTR. Also, note that some transcripts don't have the expected attribute set. Have a look at ENST00000381469.2 in a genome browser and notice it's missing mRNA_start_NF. Or, is it possible to start translation from the very first 3 bases of a transcript ? -------------------------------------- Dario Strbenac PhD Student University of Sydney Camperdown NSW 2050 Australia
Transcription Annotation Transcription Annotation • 4.0k views
ADD COMMENT
0
Entering edit mode
Marc Carlson ★ 7.2k
@marc-carlson-2264
Last seen 8.4 years ago
United States
Hi Dario, I think it's great to discuss this. It is not often enough that people point out just how unusual some of these gtf files can be. However, it is not intended job of the makeTranscriptDbFromGFF() function to do data cleanup or custom pre-filtering. That function is already stressed enough just handling all the peculiar ways that GFF and GTF files can essentially represent the same information. So much so that that my efforts to have one job per function is already being strained in this case. But if you wanted to make some functions that cleaned up unwanted features from gtf and gff files. It is possible that other people might find that useful too. Marc On 08/09/2013 12:00 AM, Dario Strbenac wrote: > Hello, > > Who else uses the GENCODE annotation in their analyses ? I just found out that some transcripts are annotated as incomplete fragments. This is described in http://www.gencodegenes.org/gencode_tags.html but not in "GENCODE: the reference human genome annotation for The ENCODE Project." Genome Research, 2012. > > cds_end_NF : the coding region end could not be confirmed. > cds_start_NF : the coding region start could not be confirmed. > mRNA_end_NF : the mRNA end could not be confirmed. > mRNA_start_NF : the mRNA start could not be confirmed. > > Over 10 % of transcripts are missing their RNA ends and almost as many are missing either a 5' UTR or a 3' UTR. > > /nb/dario/genes$ egrep -c "(HAVANA|ENSEMBL) transcript" gencode.v17.annotation.gtf > 194871 > /nb/dario/genes$ egrep "(HAVANA|ENSEMBL) transcript" gencode.v17.annotation.gtf | grep -c mRNA_end_NF - > 21699 > /nb/dario/genes$ egrep "(HAVANA|ENSEMBL) transcript" gencode.v17.annotation.gtf | grep -c cds_end_NF - > 19788 > > Have you been using this gene annotation as-is for counting in windows around transcription start sites or transcription end sites ? Have you been using the functions fiveUTRsByTranscript or threeUTRsByTranscript ? If so, your results are incorrect, too. > > Also, can there be a way for the function makeTranscriptDbFromGFF to filter on elements of the attribute column ? This finding makes it unusable for reading into R the GENCODE annotation, as it now is. > > This can also be observed by noticing that some transcripts have a 3' UTR, but no 5' UTR, and vice-versa : > > genes<- makeTranscriptDbFromGFF("gencode.v17.annotation.gtf", format = "gtf", exonRankAttributeName = "exon_number") > UTR5 <- fiveUTRsByTranscript(genes, use.names = TRUE) > UTR3 <- threeUTRsByTranscript(genes, use.names = TRUE) > whichNo3prime <- setdiff(names(UTR5), names(UTR3)) > whichNo5prime <- setdiff(names(UTR3), names(UTR5)) > >> length(whichNo5prime) > [1] 12217 >> length(whichNo3prime) > [1] 16675 > > So, 12217 have no 5' UTR, but a 3' UTR. 16675 transcripts have a 5' UTR, but no 3' UTR. > > Also, note that some transcripts don't have the expected attribute set. Have a look at ENST00000381469.2 in a genome browser and notice it's missing mRNA_start_NF. Or, is it possible to start translation from the very first 3 bases of a transcript ? > > -------------------------------------- > Dario Strbenac > PhD Student > University of Sydney > Camperdown NSW 2050 > Australia > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD COMMENT
0
Entering edit mode

Since this is one of the top hits when searching for "gencode in Bioconductor" I add this observation regarding AnnotationHub

> query(ah, c("gencode", "v27"))
AnnotationHub with 6 records
# snapshotDate(): 2020-01-09
# $dataprovider: GENCODE
# $species: Homo sapiens
# $rdataclass: list, TxDb, GRanges
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH75158"]]' 

            title                                              
  AH75158 | TxDb for Gencode v27 on hg19 coordinates           
  AH75159 | Annotated genes for Gencode v27 on hg19 coordinates
  AH75160 | GenomicState for Gencode v27 on hg19 coordinates   
  AH75161 | TxDb for Gencode v27 on hg38 coordinates           
  AH75162 | Annotated genes for Gencode v27 on hg38 coordinates
  AH75163 | GenomicState for Gencode v27 on hg38 coordinates  
ADD REPLY

Login before adding your answer.

Traffic: 406 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6