NCBI gff3 annotation file and read.gff()

Ugo Borello ▴ 340

@ugo-borello-5753

Last seen 7.6 years ago

France

Hi,

I am trying, without success, to read a GFF file downloaded from the NCBI genomes FTP directory.

    library("genomes")
    > annot <- read.gff('~/Desktop/NCBI_Anno/ref_Macaca_fascicularis_5.0_top_level.gff3')
    Error in `$<-.data.frame`(`*tmp*`, "description", value = "") :
      replacement has 1 row, data has 0

Any hint/help?

Thank you

Ugo

    > sessionInfo()
    R version 3.1.1 (2014-07-10)
    Platform: x86_64-apple-darwin10.8.0 (64-bit)

    locale:
    [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

    attached base packages:
    [1] parallel  stats  graphics  grDevices  utils  datasets  methods  base

    other attached packages:
    [1] genomes_2.10.0  Biostrings_2.32.1  XVector_0.4.0  GenomicRanges_1.16.3  GenomeInfoDb_1.0.2
    [6] IRanges_1.22.9  BiocGenerics_0.10.0  RCurl_1.95-4.1  bitops_1.0-6  XML_3.98-1.1

    loaded via a namespace (and not attached):
    [1] stats4_3.1.1  tools_3.1.1  zlibbioc_1.10.0

ADD COMMENT • link updated 11.5 years ago by Martin Morgan 25k • written 11.5 years ago by Ugo Borello ▴ 340

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 4.1 years ago

United States

Takes a couple of minutes but seems to work with:

    gff <- rtracklayer::import("ref_Macaca_fascicularis_5.0_top_level.gff3.gz")

And it's not necessary to ungzip it.

On Wed, Jul 16, 2014 at 5:05 AM, Ugo Borello <ugo.borello@inserm.fr> wrote:
> [...]
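For anyone who wants to try the rtracklayer::import call on something small first, here is a minimal sketch. It assumes rtracklayer is installed from Bioconductor; the three features, their IDs, and the "demo" source are made up for illustration and are not taken from the Macaca annotation.

```r
library(rtracklayer)

# Hypothetical three-feature GFF3 file: tab-separated, nine columns per record.
gff_lines <- c(
  "##gff-version 3",
  paste("chr1", "demo", "gene", "100", "500", ".", "+", ".",
        "ID=gene0;Name=geneA", sep = "\t"),
  paste("chr1", "demo", "mRNA", "100", "500", ".", "+", ".",
        "ID=rna0;Parent=gene0", sep = "\t"),
  paste("chr1", "demo", "exon", "100", "200", ".", "+", ".",
        "ID=exon0;Parent=rna0", sep = "\t")
)
gff_file <- tempfile(fileext = ".gff3")
writeLines(gff_lines, gff_file)

# import() returns a GRanges object; the GFF3 attributes (ID, Name, Parent)
# become metadata columns next to source, type, score and phase.
gff <- import(gff_file, format = "gff3")

# Subset by feature type, e.g. keep only the exons.
exons <- gff[gff$type == "exon"]
```

The same call should work on a gzipped file, as noted above, so the large NCBI file does not need to be unpacked before reading.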

ADD COMMENT • link 11.5 years ago Michael Lawrence ★ 11k

Thank you very much Michael and Martin, it worked.

Ugo

From: Michael Lawrence <lawrence.michael@gene.com>
Date: Wed, 16 Jul 2014 05:35:19 -0700
To: Ugo Borello <ugo.borello@inserm.fr>
Cc: "bioconductor@r-project.org" <bioconductor@r-project.org>
Subject: Re: [BioC] NCBI gff3 annotation file and read.gff()
> [...]

ADD REPLY • link 11.5 years ago Ugo Borello ▴ 340

Martin Morgan 25k

@martin-morgan-1513

Last seen 6 hours ago

United States

On 07/16/2014 05:05 AM, Ugo Borello wrote:
> library("genomes")
> > annot <- read.gff('~/Desktop/NCBI_Anno/ref_Macaca_fascicularis_5.0_top_level.gff3')
> Error in `$<-.data.frame`(`*tmp*`, "description", value = "") :
>   replacement has 1 row, data has 0

I cc'd the packageMaintainer(), so that they are more likely to see this post. I don't know whether this helps in this particular case, but packages should be using rtracklayer::import rather than creating their own readers. Then at least whatever deficiencies are identified and corrected benefit the entire project.

Martin

--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

ADD COMMENT • link 11.5 years ago Martin Morgan 25k

I would also suggest using rtracklayer import or creating a genome data package. At least for microbial genomes, you often just need to return features (CDS, pseudogenes, tRNAs, etc.) that have a parent with a locus_tag key and assign that locus tag to the child (the read.gff default), so that's what is getting messed up with your large file.

I'll probably use the rtracklayer import object in future versions instead and then join on Parent where locus_tag is NA to the ID where locus_tag is not NA.

Chris

--
Chris Stubben

Los Alamos National Lab
Bioscience Division
MS M888
Los Alamos, NM 87545

Phone: (505) 667-3295
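The join described above (copy a locus_tag from a parent feature to the children that lack one) can be sketched in base R. This is only an illustration on a made-up data frame: the column names mirror GFF3 attributes, the rows and locus tags are hypothetical, and it is not the genomes package's actual implementation.

```r
# Hypothetical feature table with GFF3-style ID / Parent / locus_tag columns.
feats <- data.frame(
  ID        = c("gene0", "cds0", "gene1", "rna1"),
  Parent    = c(NA, "gene0", NA, "gene1"),
  type      = c("gene", "CDS", "gene", "tRNA"),
  locus_tag = c("ABC_0001", NA, "ABC_0002", NA),
  stringsAsFactors = FALSE
)

# Features that carry a locus_tag act as the lookup table (the "ID" side).
parents <- feats[!is.na(feats$locus_tag), c("ID", "locus_tag")]

# Rows with no locus_tag but a known Parent are the "Parent" side of the join.
needs_tag <- is.na(feats$locus_tag) & !is.na(feats$Parent)

# Match each child's Parent against the parent IDs and copy the tag across.
idx <- match(feats$Parent[needs_tag], parents$ID)
feats$locus_tag[needs_tag] <- parents$locus_tag[idx]
```

Two caveats for real data: rtracklayer returns Parent as a list-like column because GFF3 allows several parents per feature, so it would need flattening first, and this sketch resolves only one level of the hierarchy, so grandchildren such as exons under an mRNA under a gene would need the step repeated.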

ADD REPLY • link 11.5 years ago stubben ▴ 80

Is there anything makeTranscriptDbFromGFF could do to help with this? Sounds like you typically want something like a TxDb, except perhaps with some special considerations. Following the NCBI conventions is probably worth it.

On Wed, Jul 16, 2014 at 8:58 AM, Chris Stubben <stubben@lanl.gov> wrote:
> [...]

ADD REPLY • link 11.5 years ago Michael Lawrence ★ 11k

Yes you definitely can use makeTranscriptDbFromGFF if you want a TranscriptDb object. The following works for example:

    library("GenomicFeatures")
    txdb <- makeTranscriptDbFromGFF(
        file="ref_Macaca_fascicularis_5.0_top_level.gff3.gz",
        format="gff3",
        exonRankAttributeName=NA,
        gffGeneIdAttributeName=NA,
        chrominfo=NA,
        dataSource=NA,
        species=NA,
        circ_seqs=DEFAULT_CIRC_SEQS,
        miRBaseBuild=NA,
        useGenesAsTranscripts=FALSE)

But is massaging this into a transcriptome what we want here? Ugo hasn't told us what he wants to do with this data. Also I didn't look closely at the data itself. It may be that you can specify a value for exonRankAttributeName (which is always what you should want to do if you can manage it).

Marc

On 07/16/2014 09:10 AM, Michael Lawrence wrote:
> [...]

ADD REPLY • link 11.5 years ago Marc Carlson ★ 7.2k

Yes, indeed, I would like to mine the Macaca fascicularis transcriptome, and using makeTranscriptDbFromGFF() with my GFF file is a very good suggestion. I tried this approach earlier, but I have to confess that I was unable to specify an exonRankAttributeName value. I shall try to figure it out more carefully later.

Thank you

Ugo

On 16-07-2014 10:58, Marc Carlson wrote:
> [...]

ADD REPLY • link 11.5 years ago Ugo Borello ▴ 340
