problem using makeTranscriptDbFromGFF parsing cyno GFF
1
0
Entering edit mode
chenh67 • 0
@chenh67-8510
Last seen 8.7 years ago
United States

Hi,

I am trying to build a TxDb object for the cyno transcriptome using the GFF3 file from NCBI (http://www.ncbi.nlm.nih.gov/genome/776)

I did 

cyno <- makeTranscriptDbFromGFF('GCF_000364345.1_Macaca_fascicularis_5.0_genomic.gff', circ_seqs=c('NC_012670.1')) 

The resulting TxDb object does not contain all chromosomes/contigs in the GFF. The GFF file contains 7601 contigs but the TxDb has only 430. Is there a limit to the number of chromosomes?

> cyno.genes<-genes(cyno)

> cyno.genes
GRanges object with 20704 ranges and 1 metadata column:
               seqnames               ranges strand   |     gene_id
                  <Rle>            <IRanges>  <Rle>   | <character>
     gene10 NC_022272.1 [  246699,   257364]      +   |      gene10
   gene1000 NC_022272.1 [81551864, 81574527]      +   |    gene1000
  gene10001 NC_022276.1 [10507277, 10538293]      -   |   gene10001
   gene1001 NC_022272.1 [81578980, 81589228]      +   |    gene1001
  gene10010 NC_022276.1 [12449492, 12565425]      -   |   gene10010
        ...         ...                  ...    ... ...         ...
   gene9986 NC_022276.1 [ 9163789,  9207809]      -   |    gene9986
    gene999 NC_022272.1 [81508817, 81552167]      -   |     gene999
   gene9991 NC_022276.1 [ 9535984,  9537009]      +   |    gene9991
   gene9992 NC_022276.1 [ 9568256,  9587517]      -   |    gene9992
   gene9994 NC_022276.1 [ 9616382,  9874580]      -   |    gene9994
  -------
  seqinfo: 430 sequences from an unspecified genome; no seqlengths

Also, how can I specify which field in the GFF to use for the gene_id?

Thanks!
Haiyin

genomicfeatures • 834 views
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 12 hours ago
Seattle, WA, United States

Hi Haiyin,

Please show your sessionInfo() as requested in our posting guide. makeTranscriptDbFromGFF() is deprecated in the current release of BioC (BioC 3.1) and has been replaced with makeTxDbFromGFF(). So I suspect you are using an old (and unsupported) version of BioC. Please update your installation to use the current release of BioC.

Generally speaking, there is no reason to expect that the resulting TxDb object contains all the chromosomes/contigs that are in the GFF file. This is because a GFF file can contain many kinds of features however only the genes, transcripts, exons, and CDS are imported in the TxDb object. So only chromosomes/contigs that actually have these features will be reported by seqlevels(txdb).

As for specifying which field in the GFF to use for the gene_id there is no easy way at the moment. But note that the current version of makeTxDbFromGFF() is trying to do a better job at importing the gene_id from the right place:

cyno <- makeTxDbFromGFF("GCF_000364345.1_Macaca_fascicularis_5.0_genomic.gff.gz")
# Prepare the 'metadata' data frame ... metadata: OK
# Warning messages:
# 1: In matchCircularity(seqlevels(gr), circ_seqs) :
#   None of the strings in your circ_seqs argument match your seqnames.
# 2: In .extract_exons_from_GRanges(exon_IDX, gr, ID, Name, Parent, feature = "exon",  :
#   411 orphan exons were dropped

genes(cyno)
# GRanges object with 35836 ranges and 1 metadata column:
#              seqnames                 ranges strand   |     gene_id
#                 <Rle>              <IRanges>  <Rle>   | <character>
#      A1BG NC_022290.1 [ 58937557,  58951892]      -   |        A1BG
#      A1CF NC_022280.1 [ 85376927,  85454067]      +   |        A1CF
#       A2M NC_022282.1 [  9354433,   9401000]      -   |         A2M
#     A2ML1 NC_022282.1 [  9098500,   9153859]      +   |       A2ML1
#   A3GALT2 NC_022272.1 [194631589, 194645505]      +   |     A3GALT2
#       ...         ...                    ...    ... ...         ...
#      ZXDC NC_022273.1 [145848836, 145886645]      +   |        ZXDC
#    ZYG11A NC_022272.1 [174567376, 174637485]      -   |      ZYG11A
#    ZYG11B NC_022272.1 [174660062, 174713629]      -   |      ZYG11B
#       ZYX NC_022274.1 [176564664, 176574909]      +   |         ZYX
#      ZZZ3 NC_022272.1 [149532272, 149659867]      +   |        ZZZ3
#   -------
#   seqinfo: 670 sequences from an unspecified genome; no seqlengths

Please let us know if this is still not what you need.

Cheers,

H.

> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Ubuntu 14.04.2 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] XVector_0.8.0          GenomicFeatures_1.20.1 AnnotationDbi_1.30.1  
 [4] Biobase_2.28.0         GenomicRanges_1.20.5   GenomeInfoDb_1.4.1    
 [7] IRanges_2.2.5          S4Vectors_0.6.2        BiocGenerics_0.14.0   
[10] BiocInstaller_1.18.4  

loaded via a namespace (and not attached):
 [1] zlibbioc_1.14.0         GenomicAlignments_1.4.1 BiocParallel_1.2.9     
 [4] tools_3.2.0             DBI_0.3.1               lambda.r_1.1.7         
 [7] futile.logger_1.4.1     rtracklayer_1.28.6      futile.options_1.0.0   
[10] bitops_1.0-6            RCurl_1.95-4.7          biomaRt_2.24.0         
[13] RSQLite_1.0.0           Biostrings_2.36.1       Rsamtools_1.20.4       
[16] XML_3.98-1.3           
ADD COMMENT
0
Entering edit mode

It does look like things work fine in the devel version of Bioc, so we have a way around the bug, but the ID issue is still a pain. 

It looks like (in devel) it is using the Name attribute as the gene_id, which is strange, since Name is not guaranteed to be unique. Shouldn't it use Name for the gene_name, and ID for gene_id? Haiyin wants to use the "Dbxref" attribute to as the gene_id. That attribute is formalized, and the "GeneID" prefix, according to ftp://ftp.geneontology.org/pub/go/doc/GO.xrf_abbs, indicates that it is an Entrez gene ID, so it is a valid, unique identifier. It would be cool if GenomicFeatures could recognize that, and somehow integrate information from the Org packages when making an OrganismDb package. But at the very least, letting the user specify a custom gene ID attribute would be easy and worthwhile.

 

ADD REPLY
0
Entering edit mode

Hi Michael,

Yes, it seems that Dbxref would be a better place than Name to grab the gene_id. At least when it contains Entez gene or Ensembl IDs.

Just to clarify TxDb objects don't have a gene_name field. For transcripts and exons we have both tx_id/tx_name and exon_id/exon_name, but not for genes. The naming of the fields is confusing because tx_id and exon_id are internal ids that are not meaningful beyond the TxDb object. OTOH tx_name and exon_name are meaningful beyond the TxDb object but are not guaranteed to be unique (or even defined). For genes we don't use internal ids. So gene_id is more like the equivalent of tx_name and exon_name but for genes.

By default makeTxDbFromGFF() now grabs the gene_id from the Name attribute. Used to be ID but for most GFF3 files that was giving a meaningless id. According to the GFF3 official specs, ID is indeed internal to the file with no guarantees to be meaningful beyond it. After testing this on many different files from different annotation providers it turned out that grabbing the gene_id from Name instead of ID was almost always giving more satisfying results.

Dbxref sounds like a better choice than Name though. Anyway it's on our list to let the user specify what attribute to use for the gene_id.

Thanks for your input,

H.

ADD REPLY
0
Entering edit mode

One complication is that Dbxref can have multiple values. So one would need to indicate not just the attribute name, but also the desired Dbxref prefix (like GeneID).

ADD REPLY

Login before adding your answer.

Traffic: 839 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6