TranscriptDb of GENCODE Genes

0

Entering edit mode

Dario Strbenac ★ 1.6k

@dario-strbenac-5916

Last seen 9 days ago

Australia

Hello, Who else uses the GENCODE annotation in their analyses ? I just found out that some transcripts are annotated as incomplete fragments. This is described in http://www.gencodegenes.org/gencode_tags.html but not in "GENCODE: the reference human genome annotation for The ENCODE Project." Genome Research, 2012. cds_end_NF : the coding region end could not be confirmed. cds_start_NF : the coding region start could not be confirmed. mRNA_end_NF : the mRNA end could not be confirmed. mRNA_start_NF : the mRNA start could not be confirmed. Over 10 % of transcripts are missing their RNA ends and almost as many are missing either a 5' UTR or a 3' UTR. /nb/dario/genes$ egrep -c "(HAVANA|ENSEMBL) transcript" gencode.v17.annotation.gtf 194871 /nb/dario/genes$ egrep "(HAVANA|ENSEMBL) transcript" gencode.v17.annotation.gtf | grep -c mRNA_end_NF - 21699 /nb/dario/genes$ egrep "(HAVANA|ENSEMBL) transcript" gencode.v17.annotation.gtf | grep -c cds_end_NF - 19788 Have you been using this gene annotation as-is for counting in windows around transcription start sites or transcription end sites ? Have you been using the functions fiveUTRsByTranscript or threeUTRsByTranscript ? If so, your results are incorrect, too. Also, can there be a way for the function makeTranscriptDbFromGFF to filter on elements of the attribute column ? This finding makes it unusable for reading into R the GENCODE annotation, as it now is. This can also be observed by noticing that some transcripts have a 3' UTR, but no 5' UTR, and vice-versa : genes<- makeTranscriptDbFromGFF("gencode.v17.annotation.gtf", format = "gtf", exonRankAttributeName = "exon_number") UTR5 <- fiveUTRsByTranscript(genes, use.names = TRUE) UTR3 <- threeUTRsByTranscript(genes, use.names = TRUE) whichNo3prime <- setdiff(names(UTR5), names(UTR3)) whichNo5prime <- setdiff(names(UTR3), names(UTR5)) > length(whichNo5prime) [1] 12217 > length(whichNo3prime) [1] 16675 So, 12217 have no 5' UTR, but a 3' UTR. 16675 transcripts have a 5' UTR, but no 3' UTR. Also, note that some transcripts don't have the expected attribute set. Have a look at ENST00000381469.2 in a genome browser and notice it's missing mRNA_start_NF. Or, is it possible to start translation from the very first 3 bases of a transcript ? -------------------------------------- Dario Strbenac PhD Student University of Sydney Camperdown NSW 2050 Australia

Transcription Annotation Transcription Annotation • 4.5k views

ADD COMMENT • link updated 12.4 years ago by Marc Carlson ★ 7.2k • written 12.4 years ago by Dario Strbenac ★ 1.6k

0

Entering edit mode

Marc Carlson ★ 7.2k

@marc-carlson-2264

Last seen 9.4 years ago

United States

Hi Dario, I think it's great to discuss this. It is not often enough that people point out just how unusual some of these gtf files can be. However, it is not intended job of the makeTranscriptDbFromGFF() function to do data cleanup or custom pre-filtering. That function is already stressed enough just handling all the peculiar ways that GFF and GTF files can essentially represent the same information. So much so that that my efforts to have one job per function is already being strained in this case. But if you wanted to make some functions that cleaned up unwanted features from gtf and gff files. It is possible that other people might find that useful too. Marc On 08/09/2013 12:00 AM, Dario Strbenac wrote: > Hello, > > Who else uses the GENCODE annotation in their analyses ? I just found out that some transcripts are annotated as incomplete fragments. This is described in http://www.gencodegenes.org/gencode_tags.html but not in "GENCODE: the reference human genome annotation for The ENCODE Project." Genome Research, 2012. > > cds_end_NF : the coding region end could not be confirmed. > cds_start_NF : the coding region start could not be confirmed. > mRNA_end_NF : the mRNA end could not be confirmed. > mRNA_start_NF : the mRNA start could not be confirmed. > > Over 10 % of transcripts are missing their RNA ends and almost as many are missing either a 5' UTR or a 3' UTR. > > /nb/dario/genes$ egrep -c "(HAVANA|ENSEMBL) transcript" gencode.v17.annotation.gtf > 194871 > /nb/dario/genes$ egrep "(HAVANA|ENSEMBL) transcript" gencode.v17.annotation.gtf | grep -c mRNA_end_NF - > 21699 > /nb/dario/genes$ egrep "(HAVANA|ENSEMBL) transcript" gencode.v17.annotation.gtf | grep -c cds_end_NF - > 19788 > > Have you been using this gene annotation as-is for counting in windows around transcription start sites or transcription end sites ? Have you been using the functions fiveUTRsByTranscript or threeUTRsByTranscript ? If so, your results are incorrect, too. > > Also, can there be a way for the function makeTranscriptDbFromGFF to filter on elements of the attribute column ? This finding makes it unusable for reading into R the GENCODE annotation, as it now is. > > This can also be observed by noticing that some transcripts have a 3' UTR, but no 5' UTR, and vice-versa : > > genes<- makeTranscriptDbFromGFF("gencode.v17.annotation.gtf", format = "gtf", exonRankAttributeName = "exon_number") > UTR5 <- fiveUTRsByTranscript(genes, use.names = TRUE) > UTR3 <- threeUTRsByTranscript(genes, use.names = TRUE) > whichNo3prime <- setdiff(names(UTR5), names(UTR3)) > whichNo5prime <- setdiff(names(UTR3), names(UTR5)) > >> length(whichNo5prime) > [1] 12217 >> length(whichNo3prime) > [1] 16675 > > So, 12217 have no 5' UTR, but a 3' UTR. 16675 transcripts have a 5' UTR, but no 3' UTR. > > Also, note that some transcripts don't have the expected attribute set. Have a look at ENST00000381469.2 in a genome browser and notice it's missing mRNA_start_NF. Or, is it possible to start translation from the very first 3 bases of a transcript ? > > -------------------------------------- > Dario Strbenac > PhD Student > University of Sydney > Camperdown NSW 2050 > Australia > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 12.4 years ago Marc Carlson ★ 7.2k

0

Entering edit mode

Since this is one of the top hits when searching for "gencode in Bioconductor" I add this observation regarding AnnotationHub

> query(ah, c("gencode", "v27"))
AnnotationHub with 6 records
# snapshotDate(): 2020-01-09
# $dataprovider: GENCODE
# $species: Homo sapiens
# $rdataclass: list, TxDb, GRanges
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH75158"]]' 

            title                                              
  AH75158 | TxDb for Gencode v27 on hg19 coordinates           
  AH75159 | Annotated genes for Gencode v27 on hg19 coordinates
  AH75160 | GenomicState for Gencode v27 on hg19 coordinates   
  AH75161 | TxDb for Gencode v27 on hg38 coordinates           
  AH75162 | Annotated genes for Gencode v27 on hg38 coordinates
  AH75163 | GenomicState for Gencode v27 on hg38 coordinates

ADD REPLY • link 5.9 years ago Vincent J. Carey, Jr. 6.7k

Login before adding your answer.