Hello,
Who else uses the GENCODE annotation in their analyses? I just found
out that some transcripts are annotated as incomplete fragments. This
is described at http://www.gencodegenes.org/gencode_tags.html but not
in "GENCODE: the reference human genome annotation for The ENCODE
Project", Genome Research, 2012. The relevant tags are:
cds_end_NF : the coding region end could not be confirmed.
cds_start_NF : the coding region start could not be confirmed.
mRNA_end_NF : the mRNA end could not be confirmed.
mRNA_start_NF : the mRNA start could not be confirmed.
Over 10% of transcripts are missing their mRNA ends, and almost as many
are missing either a 5' UTR or a 3' UTR:
/nb/dario/genes$ egrep -c "(HAVANA|ENSEMBL) transcript" gencode.v17.annotation.gtf
194871
/nb/dario/genes$ egrep "(HAVANA|ENSEMBL) transcript" gencode.v17.annotation.gtf | grep -c mRNA_end_NF -
21699
/nb/dario/genes$ egrep "(HAVANA|ENSEMBL) transcript" gencode.v17.annotation.gtf | grep -c cds_end_NF -
19788
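
The per-tag counts above can also be collected in a single pass. The following is only a sketch: count_nf_tags is a name made up for this post, and it assumes a tab-separated GENCODE-style GTF in which column 3 holds the feature type and tags appear in column 9 as tag "cds_end_NF";. It selects transcript lines by column 3 rather than by matching the source and feature text.

```shell
# Hypothetical helper: count transcript lines carrying each "not confirmed" tag.
# Assumes a tab-separated GTF with the feature type in column 3 and
# attributes such as: tag "cds_end_NF"; in column 9.
count_nf_tags() {
    awk -F'\t' '
        BEGIN {
            # The four tags listed on the GENCODE tags page.
            ntags = split("cds_start_NF cds_end_NF mRNA_start_NF mRNA_end_NF", tags, " ")
        }
        $3 == "transcript" {
            total++
            for (i = 1; i <= ntags; i++)
                if (index($9, "tag \"" tags[i] "\"")) n[tags[i]]++
        }
        END {
            # One line per count: name, tab, number.
            printf "transcripts\t%d\n", total
            for (i = 1; i <= ntags; i++)
                printf "%s\t%d\n", tags[i], n[tags[i]]
        }
    ' "$1"
}
```

Usage would be count_nf_tags gencode.v17.annotation.gtf, which prints the total number of transcript lines followed by one line per tag.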
Have you been using this gene annotation as-is for counting in windows
around transcription start sites or transcription end sites? Have you
been using the functions fiveUTRsByTranscript or threeUTRsByTranscript?
If so, your results include these incomplete transcripts and are
incorrect, too.
Also, could the function makeTranscriptDbFromGFF be given a way to
filter on elements of the attribute column? As it stands, this finding
makes the function unusable for reading the GENCODE annotation into R.
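
Until such a filter exists, one possible workaround is to pre-filter the GTF on the command line before importing it. This is only a sketch under assumptions: drop_nf_transcripts is a name made up for this post, the input must be a regular tab-separated file (it is read twice), and every line belonging to a transcript is assumed to carry that transcript's transcript_id in column 9. Whether dropping these transcripts is appropriate depends on the analysis.

```shell
# Hypothetical helper: remove every line belonging to a transcript that
# carries any of the four *_NF tags. Reads the file twice, so the
# argument must be a regular file, not a pipe.
drop_nf_transcripts() {
    awk -F'\t' '
        # Extract the transcript_id value from the attribute column.
        function tid(attr) {
            if (match(attr, /transcript_id "[^"]+"/))
                return substr(attr, RSTART + 15, RLENGTH - 16)
            return ""
        }
        # First pass: remember the IDs of tagged transcripts.
        NR == FNR {
            if ($9 ~ /tag "(cds|mRNA)_(start|end)_NF"/) {
                id = tid($9)
                if (id != "") bad[id] = 1
            }
            next
        }
        # Second pass: keep header comments and all untagged records.
        /^#/ { print; next }
        !(tid($9) in bad) { print }
    ' "$1" "$1"
}
```

For example, drop_nf_transcripts gencode.v17.annotation.gtf > filtered.gtf (filtered.gtf is an arbitrary output name) writes a file that can then be passed to makeTranscriptDbFromGFF in place of the original.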
The problem can also be seen from the fact that some transcripts have
a 3' UTR but no 5' UTR, and vice versa:
library(GenomicFeatures)
genes <- makeTranscriptDbFromGFF("gencode.v17.annotation.gtf", format = "gtf",
                                 exonRankAttributeName = "exon_number")
UTR5 <- fiveUTRsByTranscript(genes, use.names = TRUE)
UTR3 <- threeUTRsByTranscript(genes, use.names = TRUE)
# Transcripts with a 5' UTR but no 3' UTR, and the reverse.
whichNo3prime <- setdiff(names(UTR5), names(UTR3))
whichNo5prime <- setdiff(names(UTR3), names(UTR5))
> length(whichNo5prime)
[1] 12217
> length(whichNo3prime)
[1] 16675
So, 12217 transcripts have a 3' UTR but no 5' UTR, and 16675
transcripts have a 5' UTR but no 3' UTR.
Also, note that some transcripts don't have the expected tags set.
Have a look at ENST00000381469.2 in a genome browser and notice that
it is missing mRNA_start_NF. Or is it possible to start translation
from the very first 3 bases of a transcript?
--------------------------------------
Dario Strbenac
PhD Student
University of Sydney
Camperdown NSW 2050
Australia
Since this is one of the top hits when searching for "gencode in Bioconductor" I add this observation regarding AnnotationHub