Hi,I use GenomicFeatures to extract 5'UTR and 3'UTR and find that GenomicFeatures doesn't seem to regard stop codon as 3'UTR. I want to make sure about this and wonder how to change the raw R code to include stop codon. The following is my procedure.
I download gencode V19 gtf file from gencode https://www.gencodegenes.org/releases/19.html (Comprehensive gene annotation, CHG region) then I load it into R with GenomicFeature using this:
gencode_v19 = makeTxDbFromGFF(gtfFile, format = 'gtf', dataSource = 'gencode', organism = 'Homo sapiens', chrominfo = chromInfo)
chromInfo is like this :
chrom length 1 chr1 249250621 2 chr10 135534747 3 chr11 135006516 4 chr12 133851895 5 chr13 115169878 6 chr14 107349540 7 chr15 102531392 8 chr16 90354753 9 chr17 81195210 10 chr18 78077248 11 chr19 59128983 12 chr2 243199373 13 chr20 63025520 14 chr21 48129895 15 chr22 51304566 16 chr3 198022430 17 chr4 191154276 18 chr5 180915260 19 chr6 171115067 20 chr7 159138663 21 chr8 146364022 22 chr9 141213431 23 chrM 16569 24 chrX 155270560 25 chrY 59373566
I extract 5'UTR and 3'UTR using fiveUTRsByTranscript() and threeUTRsByTranscript() like this:
utr5 = fiveUTRsByTranscript(gencode_v19,use.names=TRUE)
utr3 = threeUTRsByTranscript(gencode_v19,use.names=TRUE)
But the UTR I get is slightly different from UTR annotation in gencode v19 gtf file. It seems that GenomicFeatures doesn't regard stop codon as part of 3'UTR. Here is an example: ENST00000370584. Is this right ? If it's right, where should I change in raw codes to include stop codon when calculate 3'UTR ?
Also total UTR regions (the sum of 5'UTR and 3'UTR) generated by GenomicFeatures is only 280400 while total UTR regions in gtf file is 284573.(One UTR may be composed of several exons so one UTR can have several regions.) I think it is still because genomicFeatures doen't treat stop codon as 3' UTR but I am not sure. I use the following steps to count UTR regions generated by GenomicFeatures. Let's regard utr5 and utr3 as the result of fiveUTRsByTranscript() and threeUTRsByTranscript() respectively. Then I transform utr5 and utr3 to dataframe with as.data.frame() and use nrow() to count UTR regions.
I count UTR regions in gencode gtf files simply using :
awk '$3~/UTR/{print}' gencode.v19.gtf | wc -l
I think this two number should be the same but apparently they are not. So why is there small difference ?
Thank you very much for helping me !
> sessionInfo() R version 3.4.3 (2017-11-30) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 14.04.5 LTS Matrix products: default BLAS: /usr/local/lib/libblas.so.3.2.1 LAPACK: /usr/local/lib/liblapack.so.3.2.1 locale: [1] LC_CTYPE=en_HK.UTF-8 LC_NUMERIC=C LC_TIME=en_HK.UTF-8 LC_COLLATE=en_HK.UTF-8 LC_MONETARY=en_HK.UTF-8 [6] LC_MESSAGES=en_HK.UTF-8 LC_PAPER=en_HK.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_HK.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 parallel grid stats graphics grDevices utils datasets methods base other attached packages: [1] GenomicFeatures_1.30.3 AnnotationDbi_1.40.0 Biobase_2.38.0 GenomicRanges_1.30.3 GenomeInfoDb_1.14.0 IRanges_2.12.0 [7] S4Vectors_0.16.0 BiocGenerics_0.24.0 BiocInstaller_1.28.0 reshape2_1.4.3 stringr_1.3.0 ggplot2_2.2.1 [13] VennDiagram_1.6.20 futile.logger_1.4.3 hash_2.2.6 loaded via a namespace (and not attached): [1] SummarizedExperiment_1.8.1 progress_1.2.0 lattice_0.20-33 colorspace_1.3-2 rtracklayer_1.38.3 [6] blob_1.1.1 XML_3.98-1.11 rlang_0.2.0 pillar_1.2.1 DBI_1.0.0 [11] BiocParallel_1.12.0 bit64_0.9-7 lambda.r_1.2 matrixStats_0.53.1 GenomeInfoDbData_1.0.0 [16] plyr_1.8.4 zlibbioc_1.24.0 Biostrings_2.46.0 munsell_0.4.3 gtable_0.2.0 [21] memoise_1.1.0 biomaRt_2.34.2 Rcpp_0.12.15 scales_0.5.0 DelayedArray_0.4.1 [26] XVector_0.18.0 bit_1.1-14 Rsamtools_1.30.0 RMySQL_0.10.15 hms_0.4.2 [31] digest_0.6.15 stringi_1.1.6 tools_3.4.3 bitops_1.0-6 magrittr_1.5 [36] tibble_1.4.2 lazyeval_0.2.1 RCurl_1.95-4.10 RSQLite_2.1.1 futile.options_1.0.0 [41] crayon_1.3.4 pkgconfig_2.0.1 Matrix_1.2-6 prettyunits_1.0.2 assertthat_0.2.0 [46] httr_1.3.1 rstudioapi_0.7 R6_2.2.2 GenomicAlignments_1.14.2 compiler_3.4.3