Question

How does GenomicFeature calculate 3'UTR?

0

Entering edit mode

hxlei613 • 0

@hxlei613-16299

Last seen 2.7 years ago

China

Hi,I use GenomicFeatures to extract 5'UTR and 3'UTR and find that GenomicFeatures doesn't seem to regard stop codon as 3'UTR. I want to make sure about this and wonder how to change the raw R code to include stop codon. The following is my procedure.

I download gencode V19 gtf file from gencode https://www.gencodegenes.org/releases/19.html (Comprehensive gene annotation, CHG region) then I load it into R with GenomicFeature using this:

gencode_v19 = makeTxDbFromGFF(gtfFile, format = 'gtf', dataSource = 'gencode', organism = 'Homo sapiens', chrominfo = chromInfo)

chromInfo is like this :

   chrom    length
1   chr1 249250621
2  chr10 135534747
3  chr11 135006516
4  chr12 133851895
5  chr13 115169878
6  chr14 107349540
7  chr15 102531392
8  chr16  90354753
9  chr17  81195210
10 chr18  78077248
11 chr19  59128983
12  chr2 243199373
13 chr20  63025520
14 chr21  48129895
15 chr22  51304566
16  chr3 198022430
17  chr4 191154276
18  chr5 180915260
19  chr6 171115067
20  chr7 159138663
21  chr8 146364022
22  chr9 141213431
23  chrM     16569
24  chrX 155270560
25  chrY  59373566

I extract 5'UTR and 3'UTR using fiveUTRsByTranscript() and threeUTRsByTranscript() like this:

utr5 = fiveUTRsByTranscript(gencode_v19,use.names=TRUE)

utr3 = threeUTRsByTranscript(gencode_v19,use.names=TRUE)

But the UTR I get is slightly different from UTR annotation in gencode v19 gtf file. It seems that GenomicFeatures doesn't regard stop codon as part of 3'UTR. Here is an example: ENST00000370584. Is this right ? If it's right, where should I change in raw codes to include stop codon when calculate 3'UTR ?

Also total UTR regions (the sum of 5'UTR and 3'UTR) generated by GenomicFeatures is only 280400 while total UTR regions in gtf file is 284573.(One UTR may be composed of several exons so one UTR can have several regions.) I think it is still because genomicFeatures doen't treat stop codon as 3' UTR but I am not sure. I use the following steps to count UTR regions generated by GenomicFeatures. Let's regard utr5 and utr3 as the result of fiveUTRsByTranscript() and threeUTRsByTranscript() respectively. Then I transform utr5 and utr3 to dataframe with as.data.frame() and use nrow() to count UTR regions.

I count UTR regions in gencode gtf files simply using :

awk '$3~/UTR/{print}' gencode.v19.gtf | wc -l

I think this two number should be the same but apparently they are not. So why is there small difference ?

Thank you very much for helping me !

> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS: /usr/local/lib/libblas.so.3.2.1
LAPACK: /usr/local/lib/liblapack.so.3.2.1

locale:
 [1] LC_CTYPE=en_HK.UTF-8       LC_NUMERIC=C               LC_TIME=en_HK.UTF-8        LC_COLLATE=en_HK.UTF-8     LC_MONETARY=en_HK.UTF-8   
 [6] LC_MESSAGES=en_HK.UTF-8    LC_PAPER=en_HK.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_HK.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] stats4    parallel  grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] GenomicFeatures_1.30.3 AnnotationDbi_1.40.0   Biobase_2.38.0         GenomicRanges_1.30.3   GenomeInfoDb_1.14.0    IRanges_2.12.0        
 [7] S4Vectors_0.16.0       BiocGenerics_0.24.0    BiocInstaller_1.28.0   reshape2_1.4.3         stringr_1.3.0          ggplot2_2.2.1         
[13] VennDiagram_1.6.20     futile.logger_1.4.3    hash_2.2.6            

loaded via a namespace (and not attached):
 [1] SummarizedExperiment_1.8.1 progress_1.2.0             lattice_0.20-33            colorspace_1.3-2           rtracklayer_1.38.3        
 [6] blob_1.1.1                 XML_3.98-1.11              rlang_0.2.0                pillar_1.2.1               DBI_1.0.0                 
[11] BiocParallel_1.12.0        bit64_0.9-7                lambda.r_1.2               matrixStats_0.53.1         GenomeInfoDbData_1.0.0    
[16] plyr_1.8.4                 zlibbioc_1.24.0            Biostrings_2.46.0          munsell_0.4.3              gtable_0.2.0              
[21] memoise_1.1.0              biomaRt_2.34.2             Rcpp_0.12.15               scales_0.5.0               DelayedArray_0.4.1        
[26] XVector_0.18.0             bit_1.1-14                 Rsamtools_1.30.0           RMySQL_0.10.15             hms_0.4.2                 
[31] digest_0.6.15              stringi_1.1.6              tools_3.4.3                bitops_1.0-6               magrittr_1.5              
[36] tibble_1.4.2               lazyeval_0.2.1             RCurl_1.95-4.10            RSQLite_2.1.1              futile.options_1.0.0      
[41] crayon_1.3.4               pkgconfig_2.0.1            Matrix_1.2-6               prettyunits_1.0.2          assertthat_0.2.0          
[46] httr_1.3.1                 rstudioapi_0.7             R6_2.2.2                   GenomicAlignments_1.14.2   compiler_3.4.3

genomicfeatures annotation utr • 2.3k views

ADD COMMENT • link updated 5.8 years ago by Hervé Pagès 16k • written 5.8 years ago by hxlei613 • 0

score 3 · Answer 1 · 2018-07-02

Hi,

UCSC and Ensembl always consider the stop codon to be part of the CDS. This means that it's not considered to be part of the 3' UTR. For consistency with UCSC and Ensembl, makeTxDbFromGFF() imports the stop codon as part of the CDS. The makeTxDbFromGRanges() function, which makeTxDbFromGFF() is based on, has a drop.stop.codons argument that you can set to TRUE to NOT import the stop codon as part of the CDS. Use it like this:

library(rtracklayer)
library(GenomicFeatures)
gtf <- import("gencode.v19.annotation.gtf.gz", format="gtf")
txdb1 <- makeTxDbFromGRanges(gtf, drop.stop.codons=FALSE)
txdb2 <- makeTxDbFromGRanges(gtf, drop.stop.codons=TRUE)

Then:

utr3 <- threeUTRsByTranscript(txdb1, use.names=TRUE)
utr3_with_stop_codon <- threeUTRsByTranscript(txdb2, use.names=TRUE)

> utr3[["ENST00000370584.3"]]
GRanges object with 1 range and 3 metadata columns:
      seqnames              ranges strand |   exon_id         exon_name
         <Rle>           <IRanges>  <Rle> | <integer>       <character>
  [1]    chr10 100003916-100004321      + |    328245 ENSE00002215285.1
      exon_rank
      <integer>
  [1]         9
  -------
  seqinfo: 25 sequences (1 circular) from an unspecified genome; no seqlengths

> utr3_with_stop_codon[["ENST00000370584.3"]]
GRanges object with 1 range and 3 metadata columns:
      seqnames              ranges strand |   exon_id         exon_name
         <Rle>           <IRanges>  <Rle> | <integer>       <character>
  [1]    chr10 100003913-100004321      + |    328245 ENSE00002215285.1
      exon_rank
      <integer>
  [1]         9
  -------
  seqinfo: 25 sequences from an unspecified genome; no seqlengths

Note that import() + makeTxDbFromGRanges() is basically what makeTxDbFromGFF() does. See ?makeTxDbFromGFF and ?makeTxDbFromGRanges for more information about these functions.

Hope this helps,

H.

PS: Note that gencode.v19.annotation.gtf,gz contains no ENST00000370584 transcript.