Question

problem using makeTranscriptDbFromGFF parsing cyno GFF

0

Entering edit mode

chenh67 • 0

@chenh67-8510

Last seen 8.7 years ago

United States

Hi,

I am trying to build a TxDb object for the cyno transcriptome using the GFF3 file from NCBI (http://www.ncbi.nlm.nih.gov/genome/776)

I did

cyno <- makeTranscriptDbFromGFF('GCF_000364345.1_Macaca_fascicularis_5.0_genomic.gff', circ_seqs=c('NC_012670.1'))

The resulting TxDb object does not contain all chromosomes/contigs in the GFF. The GFF file contains 7601 contigs but the TxDb has only 430. Is there a limit to the number of chromosomes?

> cyno.genes<-genes(cyno)

> cyno.genes
GRanges object with 20704 ranges and 1 metadata column:
seqnames ranges strand | gene_id
<Rle> <IRanges> <Rle> | <character>
gene10 NC_022272.1 [ 246699, 257364] + | gene10
gene1000 NC_022272.1 [81551864, 81574527] + | gene1000
gene10001 NC_022276.1 [10507277, 10538293] - | gene10001
gene1001 NC_022272.1 [81578980, 81589228] + | gene1001
gene10010 NC_022276.1 [12449492, 12565425] - | gene10010
... ... ... ... ... ...
gene9986 NC_022276.1 [ 9163789, 9207809] - | gene9986
gene999 NC_022272.1 [81508817, 81552167] - | gene999
gene9991 NC_022276.1 [ 9535984, 9537009] + | gene9991
gene9992 NC_022276.1 [ 9568256, 9587517] - | gene9992
gene9994 NC_022276.1 [ 9616382, 9874580] - | gene9994
-------
seqinfo: 430 sequences from an unspecified genome; no seqlengths

Also, how can I specify which field in the GFF to use for the gene_id?

Thanks!
Haiyin

genomicfeatures • 834 views

ADD COMMENT • link updated 8.7 years ago by Hervé Pagès 16k • written 8.7 years ago by chenh67 • 0

score 0 · Answer 1 · 2015-07-30

Hi Haiyin,

Please show your sessionInfo() as requested in our posting guide. makeTranscriptDbFromGFF() is deprecated in the current release of BioC (BioC 3.1) and has been replaced with makeTxDbFromGFF(). So I suspect you are using an old (and unsupported) version of BioC. Please update your installation to use the current release of BioC.

Generally speaking, there is no reason to expect that the resulting TxDb object contains all the chromosomes/contigs that are in the GFF file. This is because a GFF file can contain many kinds of features however only the genes, transcripts, exons, and CDS are imported in the TxDb object. So only chromosomes/contigs that actually have these features will be reported by seqlevels(txdb).

As for specifying which field in the GFF to use for the gene_id there is no easy way at the moment. But note that the current version of makeTxDbFromGFF() is trying to do a better job at importing the gene_id from the right place:

cyno <- makeTxDbFromGFF("GCF_000364345.1_Macaca_fascicularis_5.0_genomic.gff.gz")
# Prepare the 'metadata' data frame ... metadata: OK
# Warning messages:
# 1: In matchCircularity(seqlevels(gr), circ_seqs) :
#   None of the strings in your circ_seqs argument match your seqnames.
# 2: In .extract_exons_from_GRanges(exon_IDX, gr, ID, Name, Parent, feature = "exon",  :
#   411 orphan exons were dropped

genes(cyno)
# GRanges object with 35836 ranges and 1 metadata column:
#              seqnames                 ranges strand   |     gene_id
#                 <Rle>              <IRanges>  <Rle>   | <character>
#      A1BG NC_022290.1 [ 58937557,  58951892]      -   |        A1BG
#      A1CF NC_022280.1 [ 85376927,  85454067]      +   |        A1CF
#       A2M NC_022282.1 [  9354433,   9401000]      -   |         A2M
#     A2ML1 NC_022282.1 [  9098500,   9153859]      +   |       A2ML1
#   A3GALT2 NC_022272.1 [194631589, 194645505]      +   |     A3GALT2
#       ...         ...                    ...    ... ...         ...
#      ZXDC NC_022273.1 [145848836, 145886645]      +   |        ZXDC
#    ZYG11A NC_022272.1 [174567376, 174637485]      -   |      ZYG11A
#    ZYG11B NC_022272.1 [174660062, 174713629]      -   |      ZYG11B
#       ZYX NC_022274.1 [176564664, 176574909]      +   |         ZYX
#      ZZZ3 NC_022272.1 [149532272, 149659867]      +   |        ZZZ3
#   -------
#   seqinfo: 670 sequences from an unspecified genome; no seqlengths

Please let us know if this is still not what you need.

Cheers,

H.

> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Ubuntu 14.04.2 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] XVector_0.8.0          GenomicFeatures_1.20.1 AnnotationDbi_1.30.1  
 [4] Biobase_2.28.0         GenomicRanges_1.20.5   GenomeInfoDb_1.4.1    
 [7] IRanges_2.2.5          S4Vectors_0.6.2        BiocGenerics_0.14.0   
[10] BiocInstaller_1.18.4  

loaded via a namespace (and not attached):
 [1] zlibbioc_1.14.0         GenomicAlignments_1.4.1 BiocParallel_1.2.9     
 [4] tools_3.2.0             DBI_0.3.1               lambda.r_1.1.7         
 [7] futile.logger_1.4.1     rtracklayer_1.28.6      futile.options_1.0.0   
[10] bitops_1.0-6            RCurl_1.95-4.7          biomaRt_2.24.0         
[13] RSQLite_1.0.0           Biostrings_2.36.1       Rsamtools_1.20.4       
[16] XML_3.98-1.3