how to close the recognizes features of type miRNA in "makeTxDbFromGFF"
2
0
Entering edit mode
@shangguandong1996-21805
Last seen 2.1 years ago
China

The GFF can be downloa TAIR_download

I want to make my own Txdb from Araport11( Arabidopsis thaliana) using makeTxdbFromGFF. But I found a weird thing. Some miRNA appears in my gene result.

seq_length <- read.table("~/reference/genome/TAIR10/Athaliana.fa.fai")[,1:2]
Txdb_gff <- makeTxDbFromGFF("~/reference/annoation/Athaliana/Araport11/Araport11_GFF3_genes_transposons.201606.gff", chrominfo = Seqinfo(seqnames = seq_length$V1, seqlengths = seq_length$V2))

# Import genomic features from the file as a GRanges object ... OK
# Prepare the 'metadata' data frame ... OK
# Make the TxDb object ... OK
# Warning message:
#  In .get_cds_IDX(mcols0$type, mcols0$phase) :
#  The "phase" metadata column contains non-NA values for
#  features of type exon. This information was ignored.

> genes(Txdb_gff)
GRanges object with 38194 ranges and 1 metadata column:
            seqnames            ranges strand |     gene_id
               <Rle>         <IRanges>  <Rle> | <character>
  AT1G01010     Chr1         3631-5899      + |   AT1G01010
  AT1G01020     Chr1         6788-9130      - |   AT1G01020
  AT1G01030     Chr1       11649-13714      - |   AT1G01030
  AT1G01040     Chr1       23121-31227      + |   AT1G01040
  AT1G01046     Chr1       28500-28706      + |   AT1G01046
        ...      ...               ...    ... .         ...
    MIR854c     Chr5 11838096-11838316      + |     MIR854c
    MIR854d     Chr5 11689861-11690081      - |     MIR854d
    MIR854e     Chr5 11942467-11942687      + |     MIR854e
     MIR855     Chr2   4674427-4674698      + |      MIR855
    MIR858b     Chr1 26772633-26772935      + |     MIR858b
  -------
  seqinfo: 7 sequences from an unspecified genome

But when I check the MIRNA, it did has its own gene ID.

$ grep MIR855 Araport11_GFF3_genes_transposons.201606.gff 
Chr2    Araport11   gene    4674427 4674698 .   +   .   ID=AT2G05605;Name=MIR855;locus_type=mirna
Chr2    Araport11   miRNA_primary_transcript    4674427 4674698 .   +   .   ID=AT2G05605.1;Parent=AT2G05605;Alias=MI0005411;Name=ath-MIR855;symbol=ath-MIR855;Note=microRNA ath-MIR855 precursor

I am wondering why the makeTxdbFromGFF use the MIRNA name rather than AT2G05605.

But I noticed a info about that in GenomicFeatures/NEWS in line 241

makeTxDbFromGRanges() now recognizes features of type miRNA,
      miRNA_primary_transcript, SRP_RNA, RNase_P_RNA, RNase_MRP_RNA, misc_RNA,
      antisense_RNA, and antisense as transcripts. It also now recognizes
      features of type transposable_element_gene as genes.

So I am wondering whether I can change the MIR855 into its original geneID(AT2G05605).

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /opt/sysoft/R-3.6.1/lib64/R/lib/libRblas.so
LAPACK: /opt/sysoft/R-3.6.1/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils    
[7] datasets  methods   base     

other attached packages:
[1] GenomicFeatures_1.38.0 AnnotationDbi_1.48.0  
[3] Biobase_2.46.0         GenomicRanges_1.38.0  
[5] GenomeInfoDb_1.22.0    IRanges_2.20.0        
[7] S4Vectors_0.24.0       BiocGenerics_0.32.0   

loaded via a namespace (and not attached):
 [1] SummarizedExperiment_1.16.0 progress_1.2.2             
 [3] tidyselect_0.2.5            purrr_0.3.3                
 [5] lattice_0.20-38             vctrs_0.2.1                
 [7] BiocFileCache_1.10.0        rtracklayer_1.46.0         
 [9] blob_1.2.0                  XML_3.98-1.20              
[11] rlang_0.4.2                 pillar_1.4.3               
[13] glue_1.3.1                  DBI_1.1.0                  
[15] BiocParallel_1.19.6         rappdirs_0.3.1             
[17] bit64_0.9-7                 dbplyr_1.4.2               
[19] matrixStats_0.55.0          GenomeInfoDbData_1.2.2     
[21] stringr_1.4.0               zlibbioc_1.32.0            
[23] Biostrings_2.54.0           memoise_1.1.0              
[25] biomaRt_2.42.0              curl_4.3                   
[27] Rcpp_1.0.3                  openssl_1.4.1              
[29] backports_1.1.5             DelayedArray_0.12.0        
[31] XVector_0.26.0              bit_1.1-14                 
[33] Rsamtools_2.2.0             hms_0.5.2                  
[35] askpass_1.1                 packrat_0.5.0              
[37] digest_0.6.23               stringi_1.4.3              
[39] dplyr_0.8.3                 grid_3.6.1                 
[41] tools_3.6.1                 bitops_1.0-6               
[43] magrittr_1.5                RCurl_1.95-4.12            
[45] tibble_2.1.3                RSQLite_2.1.5              
[47] crayon_1.3.4                pkgconfig_2.0.3            
[49] zeallot_0.1.0               Matrix_1.2-18              
[51] prettyunits_1.0.2           assertthat_0.2.1           
[53] httr_1.4.1                  rstudioapi_0.10            
[55] R6_2.4.1                    GenomicAlignments_1.22.0   
[57] compiler_3.6.1             
GenomicFeatures genomicranges • 1.3k views
ADD COMMENT
0
Entering edit mode

Can somebody help :)

ADD REPLY
3
Entering edit mode
@herve-pages-1542
Last seen 3 days ago
Seattle, WA, United States

Hi,

makeTxDbFromGFF() imports the Name attribute (when present) instead of the ID attribute because the former tends to be more useful than the latter. At least that was my impression at the time I made that decision. If you want the ID attribute instead then it's easy but you have to go a little bit lower level. Here is how:

First, some background: makeTxDbFromGFF() proceeds in 2 steps: (1) it imports the GFF file in a GRanges object and (2) it calls makeTxDbFromGRanges() on that GRanges object to import it in a TxDb object. If we perform these 2 steps manually, we have an opportunity to do things between the 2 steps.

Step 1:

gr <- import("Araport11_GFF3_genes_transposons.201606.gff")

gr is a GRanges object with many metadata columns:

> colnames(mcols(gr))
 [1] "source"                    "type"                     
 [3] "score"                     "phase"                    
 [5] "ID"                        "Name"                     
 [7] "Note"                      "symbol"                   
 [9] "Alias"                     "full_name"                
[11] "computational_description" "Dbxref"                   
[13] "locus_type"                "Parent"                   
[15] "conf_class"                "conf_rating"              
[17] "Derives_from"              "curator_summary"          
[19] "description"               "index"                    
[21] "nochangenat-description"  

Except for the first 4 (source, type, score, and phase) these metadata columns are the attributes imported from the "attributes" column of the GFF file (the "attributes" column is the last column in a GFF file, see GFF3 specs here). In particular the ID and Name attributes are here.

Let's replace the Name metadata column with the ID metadata column:

mcols(gr)$Name <- mcols(gr)$ID

Step 2:

txdb <- makeTxDbFromGRanges(gr)

Now this TxDb object contains the ID atrribute instead of the Name attribute:

genes(txdb)
GRanges object with 38194 ranges and 1 metadata column:
            seqnames        ranges strand |     gene_id
               <Rle>     <IRanges>  <Rle> | <character>
  AT1G01010     Chr1     3631-5899      + |   AT1G01010
  AT1G01020     Chr1     6788-9130      - |   AT1G01020
  AT1G01030     Chr1   11649-13714      - |   AT1G01030
  AT1G01040     Chr1   23121-31227      + |   AT1G01040
  AT1G01046     Chr1   28500-28706      + |   AT1G01046
        ...      ...           ...    ... .         ...
  ATMG09800     ChrM 239952-240023      - |   ATMG09800
  ATMG09830     ChrM 262131-262209      - |   ATMG09830
  ATMG09950     ChrM 333651-333725      - |   ATMG09950
  ATMG09960     ChrM 347266-347348      + |   ATMG09960
  ATMG09980     ChrM 359666-359739      - |   ATMG09980
  -------
  seqinfo: 7 sequences from an unspecified genome; no seqlengths

Hope this helps,

H.

ADD COMMENT
0
Entering edit mode

Thanks! That's very helpful

ADD REPLY
0
Entering edit mode
@shangguandong1996-21805
Last seen 2.1 years ago
China

Can somebody help me :)

ADD COMMENT

Login before adding your answer.

Traffic: 434 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6