rtracklayer: import.gff seems to be very slow

Michael Dondrup ▴ 550

@michael-dondrup-3849

Last seen 11.3 years ago

Hi,

I am trying to read in a genome annotation from a GFF3 file from NCBI [1]. The file is about 7.5 MB and has ~17000 non-comment lines. While I can read the file with read.delim in less than a second, trying

bsub = import.gff("~/Downloads/bsubtilis.gff")

is very slow. I would rather use a standardized function from the package that understands various formats, but currently I cannot use it for whole-genome annotation. Could this be improved, or is the file format incorrect?

Best,
Michael

[1]: ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Bacillus_subtilis/AL009126.gff

> sessionInfo()
R version 2.11.1 (2010-05-31)
x86_64-apple-darwin9.8.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] rtracklayer_1.8.1 RCurl_1.4-2 bitops_1.0-4.1

loaded via a namespace (and not attached):
[1] Biobase_2.8.0 Biostrings_2.16.0 BSgenome_1.16.1
[4] GenomicRanges_1.0.9 IRanges_1.6.6 XML_3.1-0
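For comparison, the read.delim baseline mentioned in the question could look roughly like the sketch below. It is base R only and is not what import.gff does internally. The sample records, the temporary file, and the column names in gff_cols are assumptions made for illustration (the names follow the nine tab-separated columns of the GFF3 layout). Note that it leaves the attributes column as one unparsed string per row.

```r
# Write a tiny GFF3 sample so the sketch is self-contained.
# These records are made up for illustration, not taken from the NCBI file.
gff_lines <- c(
  "##gff-version 3",
  "AL009126\tGenBank\tgene\t410\t1750\t.\t+\t.\tID=gene0;Name=geneA",
  "AL009126\tGenBank\tCDS\t410\t1750\t.\t+\t0\tID=cds0;Parent=gene0"
)
gff_file <- tempfile(fileext = ".gff")
writeLines(gff_lines, gff_file)

# Hypothetical names for the nine standard GFF3 columns.
gff_cols <- c("seqid", "source", "type", "start", "end",
              "score", "strand", "phase", "attributes")

# comment.char = "#" drops the "##" directive lines; be aware that it would
# also truncate any data line that contains a literal "#".
# quote = "" stops quote characters inside attributes from swallowing text.
# na.strings = "." turns the GFF3 placeholder "." into NA.
gff <- read.delim(gff_file, header = FALSE, comment.char = "#",
                  col.names = gff_cols, quote = "",
                  na.strings = ".", stringsAsFactors = FALSE)

str(gff)
```

To try this on a real file, replace gff_file with its path, for example "~/Downloads/bsubtilis.gff" from the question. Files that end in a ##FASTA section would need that part cut off first.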

Annotation • 1.2k views

ADD COMMENT • link updated 15.2 years ago by Michael Lawrence ★ 11k • written 15.2 years ago by Michael Dondrup ▴ 550

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 4.0 years ago

United States

Wow, thanks for a serious testing file. There were some bugs and some interesting performance issues.

For example, I've discovered that gregexpr with fixed=TRUE is quadratic time with respect to string length (gets real bad up in the millions). Haven't been able to figure out why. This makes fixed=FALSE much quicker. Counterintuitive. substring() is also surprisingly slow.

Anyway, try the latest SVN, or version 1.9.12.

Still much slower than read.delim. It's the attributes in the last column (being translated to columns in R) that are so costly, and that file has them in significant quantity. I guess I could add an option to disable that parsing (or, more generally, to select the desired columns, as suggested previously), but it should be much quicker for you now.

Thanks again,
Michael
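To make the cost of the attribute column concrete, here is a minimal base-R sketch of turning "key=value;key=value" strings into one column per key. It is not rtracklayer's actual implementation; the sample strings and all variable names (attrs, attr_table, and so on) are assumptions for illustration, and it ignores percent-encoded characters and multi-valued attributes.

```r
# Illustrative attribute strings in GFF3 "key=value;key=value" form (made up).
attrs <- c(
  "ID=gene0;Name=geneA;locus_tag=TAG0001",
  "ID=cds0;Parent=gene0;product=hypothetical protein"
)

# Split every record into its "key=value" pairs and remember which
# record each pair came from.
pairs  <- strsplit(attrs, ";", fixed = TRUE)
record <- rep(seq_along(pairs), lengths(pairs))
flat   <- unlist(pairs)

# Split each pair at its first "=" only, so values may contain "=".
eq     <- regexpr("=", flat, fixed = TRUE)
keys   <- substr(flat, 1, eq - 1)
values <- substr(flat, eq + 1, nchar(flat))

# Fill a records-by-keys character matrix; attributes a record lacks stay NA.
tags <- unique(keys)
attr_table <- matrix(NA_character_, nrow = length(attrs), ncol = length(tags),
                     dimnames = list(NULL, tags))
attr_table[cbind(record, match(keys, tags))] <- values

attr_df <- as.data.frame(attr_table, stringsAsFactors = FALSE)
print(attr_df)
```

Every distinct key becomes a column, so a file with many different attributes produces a wide and mostly NA table. That is one reason an option to skip this step, or to select only the wanted columns, could be attractive.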

ADD COMMENT • link 15.2 years ago Michael Lawrence ★ 11k

Hi,

I just installed R 2.12.0, Bioconductor 2.7, and rtracklayer 1.10, and I can confirm that there is a major improvement in the speed of import.gff. Thanks a lot for this fix.

Michael

ADD REPLY • link 15.2 years ago Michael Dondrup ▴ 550
