rtracklayer: import.gff seems to be very slow
@michael-dondrup-3849
Hi,

I am trying to read in a genome annotation from a GFF3 file from NCBI [1]. The file is about 7.5 MB and has ~17000 non-comment lines. While I can read the file with read.delim in less than a second, trying

bsub = import.gff("~/Downloads/bsubtilis.gff")

is very slow. I would rather use a standardized function from the package that understands various formats, but currently I cannot use it for whole-genome annotation. Could this be improved, or is the file format incorrect?

Best,
Michael

[1]: ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Bacillus_subtilis/AL009126.gff

> sessionInfo()
R version 2.11.1 (2010-05-31)
x86_64-apple-darwin9.8.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] rtracklayer_1.8.1 RCurl_1.4-2 bitops_1.0-4.1

loaded via a namespace (and not attached):
[1] Biobase_2.8.0 Biostrings_2.16.0 BSgenome_1.16.1
[4] GenomicRanges_1.0.9 IRanges_1.6.6 XML_3.1-0
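For readers who want to reproduce the read.delim baseline mentioned in the question, here is a minimal sketch in base R. The three-line GFF3 file, the `gff_cols` column names, and the temporary path are made up for illustration; they are not taken from the NCBI file, and the call is not how rtracklayer reads GFF internally.

```r
# Write a tiny, made-up GFF3 file so the example is self-contained.
gff_lines <- c(
  "##gff-version 3",
  "chr1\texample\tgene\t1\t100\t.\t+\t.\tID=gene1;Name=abcA",
  "chr1\texample\tCDS\t10\t90\t.\t+\t0\tID=cds1;Parent=gene1"
)
path <- tempfile(fileext = ".gff")
writeLines(gff_lines, path)

# The nine standard GFF3 columns (names chosen here for illustration).
gff_cols <- c("seqid", "source", "type", "start", "end",
              "score", "strand", "phase", "attributes")

# Time a plain tab-delimited read. Lines starting with '#' are skipped;
# note that comment.char = "#" would also truncate a data line that
# contains an unescaped '#', so this is only a rough baseline.
timing <- system.time(
  gff <- read.delim(path, header = FALSE, comment.char = "#",
                    col.names = gff_cols, quote = "",
                    stringsAsFactors = FALSE)
)

print(timing)
print(gff[, c("seqid", "type", "start", "end")])
```

This leaves the ninth column as a single unparsed string per feature, which is one reason a plain tabular read is so much cheaper than a full GFF import.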
@michael-lawrence-3846
Wow, thanks for a serious test file. There were some bugs and some rather interesting performance issues.

For example, I've discovered that gregexpr with fixed=TRUE takes quadratic time with respect to string length (it gets really bad with lengths up in the millions). I haven't been able to figure out why. This makes fixed=FALSE much quicker, which is counterintuitive. substring() is also surprisingly slow.

Anyway, try the latest SVN, or version 1.9.12.

It is still much slower than read.delim. It's the attributes in the last column (which are translated to columns in R) that are so costly, and this file has them in significant quantity. I guess I could add an option to disable that parsing (or, more generally, to select the desired columns, as suggested previously), but it should be much quicker for you now.

Thanks again,
Michael

On Fri, Oct 15, 2010 at 2:40 AM, Michael Dondrup wrote:
> [...]
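To make the cost of the attribute column concrete, the following is a rough base-R sketch of turning GFF3 `key=value;key=value` strings into one column per key. The function name `parse_gff3_attributes` and the sample strings are hypothetical, and this is not rtracklayer's actual implementation; it only illustrates the kind of work that a plain read.delim call skips.

```r
# Made-up attribute strings, as they would appear in GFF3 column 9.
attrs <- c("ID=gene1;Name=abcA;locus_tag=X_0001",
           "ID=cds1;Parent=gene1",
           "ID=gene2;Name=abcB")

# Hypothetical helper: split 'key=value' pairs and spread them into columns.
parse_gff3_attributes <- function(attrs) {
  pairs <- strsplit(attrs, ";", fixed = TRUE)
  row   <- rep(seq_along(pairs), lengths(pairs))  # source row of each pair
  flat  <- unlist(pairs)

  # Locate the first '=' in each pair; drop malformed pairs without one.
  eq   <- regexpr("=", flat, fixed = TRUE)
  ok   <- eq > 0
  row  <- row[ok]
  flat <- flat[ok]
  eq   <- eq[ok]

  key   <- substr(flat, 1, eq - 1)
  value <- substr(flat, eq + 1, nchar(flat))

  # One column per distinct key; features lacking a key get NA.
  keys <- unique(key)
  out  <- matrix(NA_character_, nrow = length(attrs), ncol = length(keys),
                 dimnames = list(NULL, keys))
  out[cbind(row, match(key, keys))] <- value
  as.data.frame(out, stringsAsFactors = FALSE)
}

parsed <- parse_gff3_attributes(attrs)
print(parsed)
```

The sketch does not decode URL-escaped values or handle multi-valued attributes such as comma-separated Parent lists, both of which a real GFF3 parser has to deal with and which add further cost.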
Hi,

I just installed R 2.12.0, Bioconductor 2.7, and rtracklayer 1.10, and I can confirm that there is a major improvement in the speed of import.gff. Thanks a lot for this fix.

Michael

On Oct 16, 2010, at 6:39 AM, Michael Lawrence wrote:
> [...]

Michael Dondrup
Post-doctoral researcher
Uni BCCS
Thormøhlensgate 55, N-5008 Bergen, Norway