NCBI gff3 annotation file and read.gff()
Ugo Borello ▴ 340
@ugo-borello-5753
Last seen 5.7 years ago
France
Hi,

I am trying, without success, to read a GFF file downloaded from the NCBI genomes FTP directory.

library("genomes")
> annot <- read.gff('~/Desktop/NCBI_Anno/ref_Macaca_fascicularis_5.0_top_level.gff3')
Error in `$<-.data.frame`(`*tmp*`, "description", value = "") :
  replacement has 1 row, data has 0

Any hint/help?

Thank you
Ugo

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] genomes_2.10.0       Biostrings_2.32.1    XVector_0.4.0        GenomicRanges_1.16.3 GenomeInfoDb_1.0.2
 [6] IRanges_1.22.9       BiocGenerics_0.10.0  RCurl_1.95-4.1       bitops_1.0-6         XML_3.98-1.1

loaded via a namespace (and not attached):
[1] stats4_3.1.1    tools_3.1.1     zlibbioc_1.10.0
@michael-lawrence-3846
Last seen 2.3 years ago
United States
It takes a couple of minutes, but this seems to work:

gff <- rtracklayer::import("ref_Macaca_fascicularis_5.0_top_level.gff3.gz")

It is not necessary to gunzip the file first.
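The import call above can be tried on a small file before committing a couple of minutes to the full annotation. The following is a minimal sketch, assuming rtracklayer is installed; the three-feature GFF3 content and the file name `toy.gff3.gz` are made up for illustration and are not taken from the NCBI file.

```r
# Write a toy GFF3 to a gzipped temporary file (contents are invented).
lines <- c(
  "##gff-version 3",
  paste("chr1", "toy", "gene", 100, 900, ".", "+", ".", "ID=gene0;Name=G0", sep = "\t"),
  paste("chr1", "toy", "mRNA", 100, 900, ".", "+", ".", "ID=rna0;Parent=gene0", sep = "\t"),
  paste("chr1", "toy", "exon", 100, 300, ".", "+", ".", "ID=exon0;Parent=rna0", sep = "\t")
)
path <- file.path(tempdir(), "toy.gff3.gz")
con <- gzfile(path, "w")
writeLines(lines, con)
close(con)

# Import only if rtracklayer is available; the result is a GRanges object.
if (requireNamespace("rtracklayer", quietly = TRUE)) {
  gff <- rtracklayer::import(path)
  print(table(gff$type))            # feature counts by type
  genes <- gff[gff$type == "gene"]  # keep only gene features
}
```

For the real annotation, the same two lines inside the `if` block should apply with `path` pointing at the downloaded file.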

Thank you very much Michael and Martin, it worked.

Ugo
@martin-morgan-1513
Last seen 6 weeks ago
United States
I cc'd the packageMaintainer() so that they are more likely to see this post.

I don't know whether this helps in this particular case, but packages should use rtracklayer::import rather than creating their own readers. Then at least whatever deficiencies are identified and corrected benefit the entire project.

Martin
--
Computational Biology / Fred Hutchinson Cancer Research Center

I would also suggest using rtracklayer::import or creating a genome data package. At least for microbial genomes, you often just need to return features (CDS, pseudogenes, tRNAs, etc.) that have a parent with a locus_tag key, and assign that locus tag to the child (the read.gff default). That step is what is going wrong with your large file.

In future versions I will probably use the object returned by rtracklayer::import instead, and then join rows where locus_tag is NA to rows where it is not NA, matching the child's Parent to the parent's ID.

Chris
--
Chris Stubben
Los Alamos National Lab, Bioscience Division
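The Parent-to-ID join described above can be sketched in base R. This is an illustration only, not the code in the genomes package: the data frame `feat`, its column layout, and the locus tags are all hypothetical stand-ins for parsed GFF3 attributes.

```r
# Toy feature table standing in for parsed GFF3 attributes (values invented).
feat <- data.frame(
  ID        = c("gene0", "cds0", "gene1", "rna1"),
  Parent    = c(NA, "gene0", NA, "gene1"),
  type      = c("gene", "CDS", "gene", "tRNA"),
  locus_tag = c("ABC_0001", NA, "ABC_0002", NA),
  stringsAsFactors = FALSE
)

# Rows without a locus_tag look up the row whose ID equals their Parent.
needs_tag  <- is.na(feat$locus_tag) & !is.na(feat$Parent)
parent_row <- match(feat$Parent[needs_tag], feat$ID)
feat$locus_tag[needs_tag] <- feat$locus_tag[parent_row]

# Child features now carry the locus tag of their parent gene.
children <- feat[feat$type != "gene", c("ID", "type", "locus_tag")]
print(children)
```

The sketch resolves a single parent level. Where a CDS points to an mRNA that in turn points to a gene, the lookup would have to be repeated until no further tags can be filled in.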

Is there anything makeTranscriptDbFromGFF could do to help with this? It sounds like you typically want something like a TxDb, except perhaps with some special considerations. Following the NCBI conventions is probably worth it.

Yes, you can definitely use makeTranscriptDbFromGFF if you want a TranscriptDb object. The following works, for example:

library("GenomicFeatures")
txdb <- makeTranscriptDbFromGFF(
    file="ref_Macaca_fascicularis_5.0_top_level.gff3.gz",
    format="gff3",
    exonRankAttributeName=NA,
    gffGeneIdAttributeName=NA,
    chrominfo=NA,
    dataSource=NA,
    species=NA,
    circ_seqs=DEFAULT_CIRC_SEQS,
    miRBaseBuild=NA,
    useGenesAsTranscripts=FALSE)

But is massaging this into a transcriptome what we want here? Ugo hasn't told us what he wants to do with this data. I also didn't look closely at the data itself. It may be that you can specify a value for exonRankAttributeName (which is always what you should want to do if you can manage it).

Marc
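Whether a value for exonRankAttributeName can be specified depends on which attribute keys the exon rows of the file actually carry. A minimal base R sketch for listing those keys follows; the attribute strings, and in particular the key `exon_number`, are hypothetical and not taken from the NCBI file. Note also that later GenomicFeatures releases name the function makeTxDbFromGFF.

```r
# Toy column-9 attribute strings for exon rows (invented for illustration).
exon_attrs <- c(
  "ID=exon0;Parent=rna0;gbkey=mRNA",
  "ID=exon1;Parent=rna0;gbkey=mRNA",
  "ID=exon2;Parent=rna1;gbkey=mRNA;exon_number=1"
)

# Split each "key=value;key=value" string and keep only the keys.
pairs <- strsplit(exon_attrs, ";", fixed = TRUE)
keys  <- lapply(pairs, function(p) sub("=.*$", "", p))

# Count how many exon rows carry each key.
key_counts <- table(unlist(keys))
print(key_counts)
```

A key that appears on every exon row and holds the exon's position within its transcript would be a candidate for exonRankAttributeName. If no such key exists, leaving the argument as NA, as in the call above, is the remaining option.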

Yes, indeed, I would like to mine the fascicularis transcriptome, and using makeTranscriptDbFromGFF() with my GFF file is a very good suggestion. I tried this approach earlier, but I have to confess that I was unable to specify an exonRankAttributeName value. I shall try to figure it out more carefully later.

Thank you
Ugo