Question

Can GenomicFeatures include genes that are on multiple chromosomes?

hxlei613 • 0

Hi, I am working with macaque, whose annotation and reference genome are incomplete. I downloaded the RefSeq annotation from UCSC and loaded the GTF file as follows:

annotation_m = makeTxDbFromGFF(gtfFile,
                              format = 'gtf',
                              dataSource = 'ensemble',
                              organism = 'Macaca fascicularis',
                              chrominfo = chromInfo,
                              metadata = metaData)

Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning messages:
1: In makeTxDbFromGRanges(gr, metadata = metadata) :
  The following transcripts were dropped because their exon ranks could
  not be inferred (either because the exons are not on the same
  chromosome/strand or because they are not separated by introns):
  NM_001283298, NM_001283311, NM_001283379, NM_001283387, NM_001283401,
  NM_001283462, NM_001283504, NM_001283522, NM_001283551, NM_001283594,
  NM_001283671, NM_001283708, NM_001283746, NM_001283802, NM_001283855,
  NM_001283893, NM_001284027, NM_001284076, NM_001284114, NM_001284173,
  NM_001284607, NM_001284630, NM_001284689, NM_001284707, NM_001284756,
  NM_001284835, NM_001284840, NM_001284890, NM_001284912, NM_001285101,
  NM_001285216, NM_001285318, NM_001319465, NM_001319481, NM_001319512,
  NM_001319514, NM_001319538, NM_001319588, NM_001319591

These may be paralogous genes. I don't want to drop them. Is there any way to include them? I haven't found any answers. Thank you very much for helping me!

> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS: /usr/local/lib/libblas.so.3.2.1
LAPACK: /usr/local/lib/liblapack.so.3.2.1

locale:
 [1] LC_CTYPE=en_HK.UTF-8       LC_NUMERIC=C               LC_TIME=en_HK.UTF-8       
 [4] LC_COLLATE=en_HK.UTF-8     LC_MONETARY=en_HK.UTF-8    LC_MESSAGES=en_HK.UTF-8   
 [7] LC_PAPER=en_HK.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_HK.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] stats4    parallel  splines   stats     graphics  grDevices utils     datasets  methods  
[10] base     

other attached packages:
 [1] GenomicFeatures_1.30.3 AnnotationDbi_1.40.0   GenomicRanges_1.30.3   GenomeInfoDb_1.14.0   
 [5] IRanges_2.12.0         S4Vectors_0.16.0       bladderbatch_1.16.0    Biobase_2.38.0        
 [9] BiocGenerics_0.24.0    limma_3.34.9           sva_3.26.0             BiocParallel_1.12.0   
[13] genefilter_1.60.0      mgcv_1.8-12            nlme_3.1-128           survival_2.39-4       
[17] BiocInstaller_1.28.0   RColorBrewer_1.1-2     gplots_3.0.1           scales_0.5.0          
[21] cqn_1.24.0             quantreg_5.35          SparseM_1.77           preprocessCore_1.40.0 
[25] nor1mix_1.2-3          mclust_5.4             ggfortify_0.4.5        ggplot2_2.2.1         
[29] hash_2.2.6             pheatmap_1.0.8         stringr_1.3.0         

loaded via a namespace (and not attached):
 [1] bitops_1.0-6               matrixStats_0.53.1         bit64_0.9-7               
 [4] progress_1.2.0             httr_1.3.1                 tools_3.4.3               
 [7] R6_2.2.2                   KernSmooth_2.23-15         DBI_1.0.0                 
[10] lazyeval_0.2.1             colorspace_1.3-2           tidyselect_0.2.5          
[13] gridExtra_2.3              prettyunits_1.0.2          RMySQL_0.10.15            
[16] bit_1.1-14                 compiler_3.4.3             DelayedArray_0.4.1        
[19] rtracklayer_1.38.3         caTools_1.17.1             digest_0.6.15             
[22] Rsamtools_1.30.0           XVector_0.18.0             pkgconfig_2.0.1           
[25] rlang_0.2.2                rstudioapi_0.7             RSQLite_2.1.1             
[28] bindr_0.1.1                gtools_3.5.0               dplyr_0.7.6               
[31] RCurl_1.95-4.10            magrittr_1.5               GenomeInfoDbData_1.0.0    
[34] Matrix_1.2-6               Rcpp_0.12.19               munsell_0.4.3             
[37] stringi_1.1.6              SummarizedExperiment_1.8.1 zlibbioc_1.24.0           
[40] plyr_1.8.4                 grid_3.4.3                 blob_1.1.1                
[43] gdata_2.18.0               crayon_1.3.4               lattice_0.20-33           
[46] Biostrings_2.46.0          annotate_1.56.2            hms_0.4.2                 
[49] pillar_1.2.1               biomaRt_2.34.2             XML_3.98-1.11             
[52] glue_1.3.0                 MatrixModels_0.4-1         gtable_0.2.0              
[55] purrr_0.2.5                tidyr_0.8.1                assertthat_0.2.0          
[58] xtable_1.8-3               tibble_1.4.2               GenomicAlignments_1.14.2  
[61] memoise_1.1.0              bindrcpp_0.2.2


updated 5.5 years ago by James W. MacDonald 65k • written 5.5 years ago by hxlei613 • 0

score 0 · Answer 1 · 2018-10-30

You could use makeTxDbFromUCSC instead, although it has an issue of its own: it warns about transcripts whose cumulative CDS length is not a multiple of 3:

> z <- makeTxDbFromUCSC("macFas5", "refGene")
Download the refGene table ... OK
Download the hgFixed.refLink table ... OK
Extract the 'transcripts' data frame ... OK
Extract the 'splicings' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .extractCdsLocsFromUCSCTxTable(ucsc_txtable) :
  UCSC data anomaly in 202 transcript(s): the cds cumulative length is
  not a multiple of 3 for transcripts 'NM_001283848' 'NM_001285023'
  'NM_001283570' 'NM_001285318' 'NM_001283532' 'NM_001283699'
  'NM_001284065' 'NM_001284063' 'NM_001283962' 'NM_001284085'
  'NM_001283287' 'NM_001283322' 'NM_001285249' 'NM_001283901'
  'NM_001283331' 'NM_001283810' 'NM_001283551' 'NM_001284115'
  'NM_001284114' 'NM_001284919' 'NM_001283862' 'NM_001284758'
  'NM_001285021' 'NM_001284944' 'NM_001283699' 'NM_001284840'
  'NM_001291917' 'NM_001292005' 'NM_001291870' 'NM_001285136'
  'NM_001285032' 'NM_001285082' 'NM_001283449' 'NM_001285200'
  'NM_001284085' 'NM_001284637' 'NM_001283244' 'NM_001283657'
  'NM_001284707' 'NM_001283842' 'NM_001285191' 'NM_001285202'
  'NM_001319510' 'NM_001284092' 'NM_001284038' 'NM_001284161'
  'NM_001283240' 'NM_001284745' 'NM_001284986' 'NM_001285081'
  'NM_001284997' 'NM_001285097' 'NM_001285074' 'NM_001283802'
  'NM_001285211' 'NM_001283316' 'NM_001284054' 'NM_001283 [... truncated]

But it has no problem with transcripts placed on separate chromosomes. For example, NM_001283298 belongs to Entrez Gene ID 101926689:

> zz <- transcriptsBy(z)
> zz["101926689"]
GRangesList object of length 1:
$101926689
GRanges object with 2 ranges and 2 metadata columns:
      seqnames              ranges strand |     tx_id      tx_name
         <Rle>           <IRanges>  <Rle> | <integer>  <character>
  [1]     chr1 138243716-138244832      - |       183 NM_001283298
  [2]    chr11   40858952-40874389      - |      1243 NM_001283298
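
To list every transcript name that, like NM_001283298, is placed on more than one chromosome, one can count distinct chromosomes per transcript name. The following is a minimal base R sketch, not part of the original answer. The toy data frame stands in for `as.data.frame(transcripts(z))`; the `NM_000000001` row and the helper name `multi_chrom_tx` are made up for illustration.

```r
# Toy stand-in for as.data.frame(transcripts(z)). The first two rows copy the
# output above; the third is a hypothetical single-locus transcript.
tx <- data.frame(
  seqnames = c("chr1", "chr11", "chr2"),
  start    = c(138243716, 40858952, 1000),
  end      = c(138244832, 40874389, 2000),
  strand   = c("-", "-", "+"),
  tx_name  = c("NM_001283298", "NM_001283298", "NM_000000001"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: names of transcripts seen on more than one chromosome.
multi_chrom_tx <- function(tx) {
  # Number of distinct chromosomes for each transcript name
  n_chrom <- tapply(as.character(tx$seqnames), tx$tx_name,
                    function(x) length(unique(x)))
  names(n_chrom)[n_chrom > 1]
}

multi <- multi_chrom_tx(tx)
print(multi)
```

With the real TxDb, replacing the toy table with `as.data.frame(transcripts(z))` should give the same columns (`seqnames`, `start`, `end`, `strand`, `tx_name`).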

-------

I should note here that the only problem with makeTxDbFromUCSC is that you get some warnings; you don't lose transcripts as you do when using the GFF.
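
If the GTF route is still preferred, one possible workaround is to give each placement of a transcript its own ID before building the TxDb, so that exons on different chromosomes or strands no longer share one transcript. This is an assumption and was not tested in this thread. The sketch below shows the renaming on a toy exon table; `uniquify_tx_ids`, the `_dupN` suffix, and the toy coordinates are all hypothetical.

```r
# Toy exon table mimicking GTF columns; IDs and coordinates are made up.
exons <- data.frame(
  seqnames      = c("chr1", "chr1", "chr11", "chr11", "chr2"),
  strand        = c("-", "-", "-", "-", "+"),
  start         = c(100, 300, 500, 900, 50),
  transcript_id = c("NM_A", "NM_A", "NM_A", "NM_A", "NM_B"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the ID for the first chromosome/strand placement
# of a transcript and append "_dup1", "_dup2", ... for each further placement.
uniquify_tx_ids <- function(tx_id, seqnames, strand) {
  locus <- paste(seqnames, strand, sep = ":")
  # Rank of each placement within its transcript, in order of appearance
  idx <- ave(seq_along(tx_id), tx_id, FUN = function(i) {
    match(locus[i], unique(locus[i]))
  })
  # Rows without a transcript ID (NA) are left untouched
  ifelse(is.na(tx_id) | idx == 1, tx_id, paste0(tx_id, "_dup", idx - 1))
}

exons$transcript_id <- uniquify_tx_ids(exons$transcript_id,
                                       exons$seqnames, exons$strand)
print(exons)

# With the real file, the same renaming could be applied to the imported
# GRanges before building the TxDb (not run here):
# gr <- rtracklayer::import(gtfFile)
# gr$transcript_id <- uniquify_tx_ids(gr$transcript_id,
#                                     as.character(seqnames(gr)),
#                                     as.character(strand(gr)))
# annotation_m <- makeTxDbFromGRanges(gr)
```

This only addresses transcripts whose exons sit on different chromosomes or strands. Transcripts dropped because their exons are not separated by introns would still be dropped, and the renamed copies keep the same gene ID, so a gene can still span several chromosomes.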