VariantAnnotation: Performance and memory issues in readVcf

0

Entering edit mode

UBod ▴ 290

@ubodenhofer-5425

Last seen 8 days ago

University of Applied Sciences Upper Au…

Hi, I am currently working on a package that takes input from VCF files (possibly large ones) and I wanted to use readVcf() from the VariantAnnotation package for reading the data. However, I have run into severe performance and memory issues. The following test report is based on a bgzipped and tabix-indexed VCF file with 100 samples/columns and approx 1,9 millions of variants/rows. I have previously failed to load data of this size entirely using readVcf(), so I read the data in chunks. I always had the impression that this was quite slow, but now I compared it with tabix on the command line and scanTabix() from the Rsamtools packages and the results were stunning. Here are some code chunks. As you can see, I am trying to read a 1,400,001bp region from chr1. This region actually contains 8,757 variants/rows: > tf <- TabixFile("testFile100.vcf.gz") > gr <- GRanges(seqnames="1", ranges=IRanges(start=100000, end=1500000)) > > system.time(res1 <- readVcf(tf, "hg19", param=gr)) user system elapsed 540.080 0.624 541.906 There were 50 or more warnings (use warnings() to see the first 50) The scanTabix() function from the Rsamtools package gives the following result: > system.time(res2 <- scanTabix(tf, param=gr)) user system elapsed 0.170 0.002 0.171 The tabix command line tool takes approximately the same amount of time. I admit that scanTabix() just reads the data and does no parsing or any other processing, so I digged deeper and saw that readVcf() calls scanTabix() inside the function .vcf_scan(). Interestingly, this is done in a way that all the parsing is done inside the scanTabix() function by supplying a function that does the parsing through the undocumented tbxsym argument. So I tried the same approach: > maps <- VariantAnnotation:::.vcf_scan_header_maps(tf, fixed=character(), info=NA, + geno="GT") Warning message: In .bcfHeaderAsSimpleList(header) : duplicate keys in header will be forced to unique rownames > tbxstate <- maps[c("samples", "fmap", "imap", "gmap")] > tbxsym <- getNativeSymbolInfo(".tabix_as_vcf", "VariantAnnotation") > > system.time(res3 <- scanTabix(tf, param=gr, tbxsym=tbxsym, tbxstate=tbxstate)) user system elapsed 0.511 0.006 0.517 So, even if I include the same way of parsing VCF data as readVcf(), the total time is still in the range of half a second, which is still approx. 1000 times faster than calling readVcf(). So where is all the performance lost? I can hardly imagine that 539.5 of 540 seconds are spent on data transformations. A similar situation can be observed in terms of memory consumption: after having loaded the VariantAnnotation package (and all packages it depends upon), my R process occupies about 185MB main memory. Reading and parsing the data with scanTabix() increases the memory consumption by about 40MB, whereas calling readVcf() increases the memory consumption by more than 400MB . I do not think that such an amount of memory is needed to accommodate the output object res3; object.size() says it's about 8MB, but I know that these figure need not be accurate. Actually, I do not need the whole flexibility of readVcf(). If necessary, I would be satisfied with a workaround like the one based on scanTabix() above. However, I do not like the idea too much to use undocumented internals of other packages in my package. If possible, I would rather prefer to rely on readVcf(). So, I would be extremely grateful if someone could either explain the situation or, in case there are bugs, fix them. Thanks and best regards, Ulrich ---------------------------------------------------------------------- -- *Dr. Ulrich Bodenhofer* Associate Professor Institute of Bioinformatics *Johannes Kepler University* Altenberger Str. 69 4040 Linz, Austria Tel. +43 732 2468 4526 Fax +43 732 2468 4539 bodenhofer at bioinf.jku.at <mailto:bodenhofer at="" bioinf.jku.at=""> http://www.bioinf.jku.at/ <http: www.bioinf.jku.at="">

VariantAnnotation PROcess Rsamtools VariantAnnotation VariantAnnotation PROcess Rsamtools • 2.8k views

ADD COMMENT • link updated 11.0 years ago by Vincent J. Carey, Jr. 6.7k • written 11.0 years ago by UBod ▴ 290

0

Entering edit mode

Vincent J. Carey, Jr. 6.7k

@vincent-j-carey-jr-4

Last seen 6 weeks ago

United States

While I agree that achievable performance improvements should be sought on their own terms, I have a feeling that you will not get very far without a bit more specification of what you want readVcf to produce. My sense is that you can get much better performance if you set the ScanVcfParam and control the capture of INFO fields. There may be other fields that you do not need. I'd propose that we take a public vcf file like a 1000 genomes vcf (i happen to have ALL.chr17.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz, maybe you can provide a better one) and define some target of the import task. > tf = TabixFile("ALL.chr17.phase1_release_v3.20101123.snps_indels_svs.genoty pes.vcf.gz") > mysvp = ScanVcfParam(which=GRanges("17", IRanges(1,100000)), info=NA, geno="GT", fixed="ALT") > unix.time(r1 <- readVcf(tf, "hg19", param=mysvp)) user system elapsed 1.468 0.520 1.994 Warning messages: 1: In .bcfHeaderAsSimpleList(header) : duplicate keys in header will be forced to unique rownames 2: In .bcfHeaderAsSimpleList(header) : duplicate keys in header will be forced to unique rownames > mysvp2 = ScanVcfParam(which=GRanges("17", IRanges(1,100000))) # no info/geno control > unix.time(r2 <- readVcf(tf, "hg19", param=mysvp2)) user system elapsed 36.704 1.959 38.860 have a look at r2 class: CollapsedVCF dim: 1680 1092 rowData(vcf): GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER info(vcf): DataFrame with 22 columns: LDAF, AVGPOST, RSQ, ERATE, THETA, CIEND, CIPOS,... info(header(vcf)): Number Type Description LDAF 1 Float MLE Allele Frequency Accounting for LD AVGPOST 1 Float Average posterior probability from MaCH/Thunder RSQ 1 Float Genotype imputation quality from MaCH/Thunder ERATE 1 Float Per-marker Mutation rate from MaCH/Thunder THETA 1 Float Per-marker Transition rate from MaCH/Thunder CIEND 2 Integer Confidence interval around END for imprecise va... CIPOS 2 Integer Confidence interval around POS for imprecise va... END 1 Integer End position of the variant described in this r... HOMLEN . Integer Length of base pair identical micro- homology at... HOMSEQ . String Sequence of base pair identical micro- homology ... SVLEN 1 Integer Difference in length between REF and ALT alleles SVTYPE 1 String Type of structural variant AC . Integer Alternate Allele Count AN 1 Integer Total Allele Count AA 1 String Ancestral Allele, ftp://ftp.1000genomes.ebi.ac. ... AF 1 Float Global Allele Frequency based on AC/AN AMR_AF 1 Float Allele Frequency for samples from AMR based on ... ASN_AF 1 Float Allele Frequency for samples from ASN based on ... AFR_AF 1 Float Allele Frequency for samples from AFR based on ... EUR_AF 1 Float Allele Frequency for samples from EUR based on ... VT 1 String indicates what type of variant the line represents SNPSOURCE . String indicates if a snp was called when analysing th... geno(vcf): SimpleList of length 3: GT, DS, GL geno(header(vcf)): Number Type Description GT 1 String Genotype DS 1 Float Genotype dosage from MaCH/Thunder GL . Float Genotype Likelihoods there's a ton of information in there, and the volume is not predictable on the basis of the knowledge that we are confronting VCF. so the fact that readVcf can manage this is good and maybe worth a performance hit. if setting the ScanVcfParam doesn't adequately beat the approach of a drop-in parser with scanTabix, we should try to build out that approach a bit. But it all depends on what you want to get out. On Tue, May 14, 2013 at 11:59 AM, Ulrich Bodenhofer < bodenhofer@bioinf.jku.at> wrote: > Hi, > > I am currently working on a package that takes input from VCF files > (possibly large ones) and I wanted to use readVcf() from the > VariantAnnotation package for reading the data. However, I have run into > severe performance and memory issues. The following test report is based on > a bgzipped and tabix-indexed VCF file with 100 samples/columns and approx > 1,9 millions of variants/rows. I have previously failed to load data of > this size entirely using readVcf(), so I read the data in chunks. I always > had the impression that this was quite slow, but now I compared it with > tabix on the command line and scanTabix() from the Rsamtools packages and > the results were stunning. Here are some code chunks. As you can see, I am > trying to read a 1,400,001bp region from chr1. This region actually > contains 8,757 variants/rows: > > > tf <- TabixFile("testFile100.vcf.gz"**) > > gr <- GRanges(seqnames="1", ranges=IRanges(start=100000, > end=1500000)) > > > > system.time(res1 <- readVcf(tf, "hg19", param=gr)) > user system elapsed > 540.080 0.624 541.906 > There were 50 or more warnings (use warnings() to see the first 50) > > The scanTabix() function from the Rsamtools package gives the following > result: > > > system.time(res2 <- scanTabix(tf, param=gr)) > user system elapsed > 0.170 0.002 0.171 > > The tabix command line tool takes approximately the same amount of time. I > admit that scanTabix() just reads the data and does no parsing or any other > processing, so I digged deeper and saw that readVcf() calls scanTabix() > inside the function .vcf_scan(). Interestingly, this is done in a way that > all the parsing is done inside the scanTabix() function by supplying a > function that does the parsing through the undocumented tbxsym argument. So > I tried the same approach: > > > maps <- VariantAnnotation:::.vcf_scan_**header_maps(tf, > fixed=character(), info=NA, > + geno="GT") > Warning message: > In .bcfHeaderAsSimpleList(header) : > duplicate keys in header will be forced to unique rownames > > tbxstate <- maps[c("samples", "fmap", "imap", "gmap")] > > tbxsym <- getNativeSymbolInfo(".tabix_**as_vcf", > "VariantAnnotation") > > > > system.time(res3 <- scanTabix(tf, param=gr, tbxsym=tbxsym, > tbxstate=tbxstate)) > user system elapsed > 0.511 0.006 0.517 > > So, even if I include the same way of parsing VCF data as readVcf(), the > total time is still in the range of half a second, which is still approx. > 1000 times faster than calling readVcf(). So where is all the performance > lost? I can hardly imagine that 539.5 of 540 seconds are spent on data > transformations. > > A similar situation can be observed in terms of memory consumption: after > having loaded the VariantAnnotation package (and all packages it depends > upon), my R process occupies about 185MB main memory. Reading and parsing > the data with scanTabix() increases the memory consumption by about 40MB, > whereas calling readVcf() increases the memory consumption by more than > 400MB . I do not think that such an amount of memory is needed to > accommodate the output object res3; object.size() says it's about 8MB, but > I know that these figure need not be accurate. > > Actually, I do not need the whole flexibility of readVcf(). If necessary, > I would be satisfied with a workaround like the one based on scanTabix() > above. However, I do not like the idea too much to use undocumented > internals of other packages in my package. If possible, I would rather > prefer to rely on readVcf(). So, I would be extremely grateful if someone > could either explain the situation or, in case there are bugs, fix them. > > Thanks and best regards, > Ulrich > > > ------------------------------**------------------------------** > ------------ > *Dr. Ulrich Bodenhofer* > Associate Professor > Institute of Bioinformatics > > *Johannes Kepler University* > Altenberger Str. 69 > 4040 Linz, Austria > > Tel. +43 732 2468 4526 > Fax +43 732 2468 4539 > bodenhofer@bioinf.jku.at <mailto:bodenhofer@bioinf.jku.**at<bodenhofer@bioinf.jku.at> > > > http://www.bioinf.jku.at/ <http: www.bioinf.jku.at=""> > > ______________________________**_________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.et="" hz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: http://news.gmane.org/gmane.** > science.biology.informatics.**conductor<http: news.gmane.org="" gmane.="" science.biology.informatics.conductor=""> > [[alternative HTML version deleted]]

ADD COMMENT • link 11.0 years ago Vincent J. Carey, Jr. 6.7k

0

Entering edit mode

Vincent, Thanks for your fast and enlightening reply! You are absolutely right that the key is to focus on the absolute minimum of necessary information, and I admit I should have tried that before posting the question. The results are the following: > gr <- GRanges(seqnames="1", ranges=IRanges(start=100000, end=1500000)) > sp <- ScanVcfParam(which=gr, info=NA, geno="GT", fixed="ALT") > > system.time(res4 <- readVcf(tf, "hg19", param=sp)) user system elapsed 5.655 0.054 5.722 Warning messages: 1: In .bcfHeaderAsSimpleList(header) : duplicate keys in header will be forced to unique rownames 2: In .bcfHeaderAsSimpleList(header) : duplicate keys in header will be forced to unique rownames So the whole thing speeds up by a factor of 100, which is indeed dramatic. Of the 5.7 seconds, roughly 0.5 seconds are spent on reading and parsing the file into a large nested list, whereas 5.2 seconds are spent on converting the large nested list into an S4 object. I could argue about whether this ratio is adequate, but I feel I don't need to, as the 5.7 seconds appear feasible to me. In the other case that readVcf() processes all data, the ratio is 0.5 to 539.5. I think this is, in any case, a weird ratio, no matter how complicated the information in a VCF file is, as all information is already contained in the list that is returned by the call to scanTabix() in .vcf_scan(), in a well-structured manner and mostly in the necessary data types. Different question: I have little idea what .makeMessage() does. Is it possible that a large proportion of the time I reported is spent by throwing warnings? The warnings produced by my original readVcf() call are the same as above, but many more of them are thrown. I suppose they are due to the fact that my VCF file is the result of merging multiple single-sample VCF files, and the header of the merged file probably contains a lot of redundant information. I do not know why one variant throws two such warnings, while the other variant throws 50+ (probably many more). I tried suppressWarnings(), but that just made the warnings disappear, yet did not accelarate readVcf(). Thanks and best regards, Ulrich On 05/15/2013 01:29 PM, Vincent Carey wrote: > While I agree that achievable performance improvements should be > sought on their own terms, I have a feeling that you will not get very > far without a bit more specification of what you want readVcf to > produce. My sense is that you can get much better performance if you > set the ScanVcfParam and control the capture of INFO fields. There > may be other fields that you do not need. I'd propose that we take a > public vcf file like a 1000 genomes vcf (i happen to > have ALL.chr17.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz, > maybe you can provide a better one) and define some target of the > import task. > > > tf = > TabixFile("ALL.chr17.phase1_release_v3.20101123.snps_indels_svs.geno types.vcf.gz") > > mysvp = ScanVcfParam(which=GRanges("17", IRanges(1,100000)), > info=NA, geno="GT", fixed="ALT") > > unix.time(r1 <- readVcf(tf, "hg19", param=mysvp)) > user system elapsed > 1.468 0.520 1.994 > Warning messages: > 1: In .bcfHeaderAsSimpleList(header) : > duplicate keys in header will be forced to unique rownames > 2: In .bcfHeaderAsSimpleList(header) : > duplicate keys in header will be forced to unique rownames > > mysvp2 = ScanVcfParam(which=GRanges("17", IRanges(1,100000))) # no > info/geno control > > unix.time(r2 <- readVcf(tf, "hg19", param=mysvp2)) > user system elapsed > 36.704 1.959 38.860 > > have a look at r2 > > class: CollapsedVCF > dim: 1680 1092 > rowData(vcf): > GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER > info(vcf): > DataFrame with 22 columns: LDAF, AVGPOST, RSQ, ERATE, THETA, CIEND, > CIPOS,... > info(header(vcf)): > Number Type Description > LDAF 1 Float MLE Allele Frequency Accounting for LD > AVGPOST 1 Float Average posterior probability from > MaCH/Thunder > RSQ 1 Float Genotype imputation quality from MaCH/Thunder > ERATE 1 Float Per-marker Mutation rate from MaCH/Thunder > THETA 1 Float Per-marker Transition rate from MaCH/Thunder > CIEND 2 Integer Confidence interval around END for > imprecise va... > CIPOS 2 Integer Confidence interval around POS for > imprecise va... > END 1 Integer End position of the variant described in > this r... > HOMLEN . Integer Length of base pair identical > micro-homology at... > HOMSEQ . String Sequence of base pair identical > micro-homology ... > SVLEN 1 Integer Difference in length between REF and ALT > alleles > SVTYPE 1 String Type of structural variant > AC . Integer Alternate Allele Count > AN 1 Integer Total Allele Count > AA 1 String Ancestral Allele, > ftp://ftp.1000genomes.ebi.ac.... > AF 1 Float Global Allele Frequency based on AC/AN > AMR_AF 1 Float Allele Frequency for samples from AMR > based on ... > ASN_AF 1 Float Allele Frequency for samples from ASN > based on ... > AFR_AF 1 Float Allele Frequency for samples from AFR > based on ... > EUR_AF 1 Float Allele Frequency for samples from EUR > based on ... > VT 1 String indicates what type of variant the line > represents > SNPSOURCE . String indicates if a snp was called when > analysing th... > geno(vcf): > SimpleList of length 3: GT, DS, GL > geno(header(vcf)): > Number Type Description > GT 1 String Genotype > DS 1 Float Genotype dosage from MaCH/Thunder > GL . Float Genotype Likelihoods > > there's a ton of information in there, and the volume is not > predictable on the basis of the knowledge that we are confronting > VCF. so the fact that readVcf can manage this is good and maybe worth > a performance hit. if setting the ScanVcfParam doesn't adequately > beat the approach of a drop-in parser with scanTabix, we should try to > build out that approach a bit. But it all depends on what you want to > get out. > > On Tue, May 14, 2013 at 11:59 AM, Ulrich Bodenhofer > <bodenhofer@bioinf.jku.at <mailto:bodenhofer@bioinf.jku.at="">> wrote: > > Hi, > > I am currently working on a package that takes input from VCF > files (possibly large ones) and I wanted to use readVcf() from the > VariantAnnotation package for reading the data. However, I have > run into severe performance and memory issues. The following test > report is based on a bgzipped and tabix-indexed VCF file with 100 > samples/columns and approx 1,9 millions of variants/rows. I have > previously failed to load data of this size entirely using > readVcf(), so I read the data in chunks. I always had the > impression that this was quite slow, but now I compared it with > tabix on the command line and scanTabix() from the Rsamtools > packages and the results were stunning. Here are some code chunks. > As you can see, I am trying to read a 1,400,001bp region from > chr1. This region actually contains 8,757 variants/rows: > > > tf <- TabixFile("testFile100.vcf.gz") > > gr <- GRanges(seqnames="1", ranges=IRanges(start=100000, > end=1500000)) > > > > system.time(res1 <- readVcf(tf, "hg19", param=gr)) > user system elapsed > 540.080 0.624 541.906 > There were 50 or more warnings (use warnings() to see the first 50) > > The scanTabix() function from the Rsamtools package gives the > following result: > > > system.time(res2 <- scanTabix(tf, param=gr)) > user system elapsed > 0.170 0.002 0.171 > > The tabix command line tool takes approximately the same amount of > time. I admit that scanTabix() just reads the data and does no > parsing or any other processing, so I digged deeper and saw that > readVcf() calls scanTabix() inside the function .vcf_scan(). > Interestingly, this is done in a way that all the parsing is done > inside the scanTabix() function by supplying a function that does > the parsing through the undocumented tbxsym argument. So I tried > the same approach: > > > maps <- VariantAnnotation:::.vcf_scan_header_maps(tf, > fixed=character(), info=NA, > + geno="GT") > Warning message: > In .bcfHeaderAsSimpleList(header) : > duplicate keys in header will be forced to unique rownames > > tbxstate <- maps[c("samples", "fmap", "imap", "gmap")] > > tbxsym <- getNativeSymbolInfo(".tabix_as_vcf", > "VariantAnnotation") > > > > system.time(res3 <- scanTabix(tf, param=gr, tbxsym=tbxsym, > tbxstate=tbxstate)) > user system elapsed > 0.511 0.006 0.517 > > So, even if I include the same way of parsing VCF data as > readVcf(), the total time is still in the range of half a second, > which is still approx. 1000 times faster than calling readVcf(). > So where is all the performance lost? I can hardly imagine that > 539.5 of 540 seconds are spent on data transformations. > > A similar situation can be observed in terms of memory > consumption: after having loaded the VariantAnnotation package > (and all packages it depends upon), my R process occupies about > 185MB main memory. Reading and parsing the data with scanTabix() > increases the memory consumption by about 40MB, whereas calling > readVcf() increases the memory consumption by more than 400MB . I > do not think that such an amount of memory is needed to > accommodate the output object res3; object.size() says it's about > 8MB, but I know that these figure need not be accurate. > > Actually, I do not need the whole flexibility of readVcf(). If > necessary, I would be satisfied with a workaround like the one > based on scanTabix() above. However, I do not like the idea too > much to use undocumented internals of other packages in my > package. If possible, I would rather prefer to rely on readVcf(). > So, I would be extremely grateful if someone could either explain > the situation or, in case there are bugs, fix them. > > Thanks and best regards, > Ulrich > > > ---------------------------------------------------------------- -------- > *Dr. Ulrich Bodenhofer* > Associate Professor > Institute of Bioinformatics > > *Johannes Kepler University* > Altenberger Str. 69 > 4040 Linz, Austria > > Tel. +43 732 2468 4526 <tel:%2b43%20732%202468%204526> > Fax +43 732 2468 4539 <tel:%2b43%20732%202468%204539> > bodenhofer@bioinf.jku.at <mailto:bodenhofer@bioinf.jku.at> > <mailto:bodenhofer@bioinf.jku.at <mailto:bodenhofer@bioinf.jku.at="">> > http://www.bioinf.jku.at/ <http: www.bioinf.jku.at=""> > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org <mailto:bioconductor@r-project.org> > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > [[alternative HTML version deleted]]

ADD REPLY • link 11.0 years ago UBod ▴ 290

0

Entering edit mode

Hi Ulrich, On 05/15/2013 08:54 AM, Ulrich Bodenhofer wrote: > Vincent, > > Thanks for your fast and enlightening reply! You are absolutely right > that the key is to focus on the absolute minimum of necessary > information, and I admit I should have tried that before posting the > question. The results are the following: > > > gr <- GRanges(seqnames="1", ranges=IRanges(start=100000, > end=1500000)) > > sp <- ScanVcfParam(which=gr, info=NA, geno="GT", fixed="ALT") > > > > system.time(res4 <- readVcf(tf, "hg19", param=sp)) > user system elapsed > 5.655 0.054 5.722 > Warning messages: > 1: In .bcfHeaderAsSimpleList(header) : > duplicate keys in header will be forced to unique rownames > 2: In .bcfHeaderAsSimpleList(header) : > duplicate keys in header will be forced to unique rownames > > So the whole thing speeds up by a factor of 100, which is indeed > dramatic. Of the 5.7 seconds, roughly 0.5 seconds are spent on reading > and parsing the file into a large nested list, whereas 5.2 seconds are > spent on converting the large nested list into an S4 object. I could > argue about whether this ratio is adequate, but I feel I don't need to, > as the 5.7 seconds appear feasible to me. In the other case that > readVcf() processes all data, the ratio is 0.5 to 539.5. I think this > is, in any case, a weird ratio, no matter how complicated the > information in a VCF file is, as all information is already contained in > the list that is returned by the call to scanTabix() in .vcf_scan(), in > a well-structured manner and mostly in the necessary data types. INFO variables with multiple values are parsed into CompressedLists. FORMAT variables with multiple values are parsed into arrays. We thought these parsed classes were more beneficial to the user for further exploration/analysis. If you only want the list form you can use scanBam() instead of readVcf(). scanBam() also takes a 'param' argument so you can subset on position, variables etc. > > Different question: I have little idea what .makeMessage() does. Is it > possible that a large proportion of the time I reported is spent by > throwing warnings? The warnings produced by my original readVcf() call > are the same as above, but many more of them are thrown. I suppose they > are due to the fact that my VCF file is the result of merging multiple > single-sample VCF files, and the header of the merged file promerged frombably > contains a lot of redundant information. I do not know why one variant > throws two such warnings, while the other variant throws 50+ (probably > many more). I tried suppressWarnings(), but that just made the warnings > disappear, yet did not accelarate readVcf(). The warning from .bcfHeaderAsSimpleList() is thrown when retrieving the header information from the file. You can replicate the warning with this, hdr <- scanVcfHeader(tf) This is not an expensive operation and is performed twice during readVcf(). Throwing two warnings is overkill and I will fix that. The warning itself is saying you have duplicate variables names in the header. As you mentioned, this is probably because the file is the results of a merge. You mention that one file gives you 2 warnings but another gives you 50. Are the other 50 warnings the same? Valerie > > Thanks and best regards, > Ulrich > > > On 05/15/2013 01:29 PM, Vincent Carey wrote: >> While I agree that achievable performance improvements should be >> sought on their own terms, I have a feeling that you will not get very >> far without a bit more specification of what you want readVcf to >> produce. My sense is that you can get much better performance if you >> set the ScanVcfParam and control the capture of INFO fields. There >> may be other fields that you do not need. I'd propose that we take a >> public vcf file like a 1000 genomes vcf (i happen to >> have ALL.chr17.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz, >> maybe you can provide a better one) and define some target of the >> import task. >> >>> tf = >> TabixFile("ALL.chr17.phase1_release_v3.20101123.snps_indels_svs.gen otypes.vcf.gz") >>> mysvp = ScanVcfParam(which=GRanges("17", IRanges(1,100000)), >> info=NA, geno="GT", fixed="ALT") >>> unix.time(r1 <- readVcf(tf, "hg19", param=mysvp)) >> user system elapsed >> 1.468 0.520 1.994 >> Warning messages: >> 1: In .bcfHeaderAsSimpleList(header) : >> duplicate keys in header will be forced to unique rownames >> 2: In .bcfHeaderAsSimpleList(header) : >> duplicate keys in header will be forced to unique rownames >>> mysvp2 = ScanVcfParam(which=GRanges("17", IRanges(1,100000))) # no >> info/geno control >>> unix.time(r2 <- readVcf(tf, "hg19", param=mysvp2)) >> user system elapsed >> 36.704 1.959 38.860 >> >> have a look at r2 >> >> class: CollapsedVCF >> dim: 1680 1092 >> rowData(vcf): >> GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER >> info(vcf): >> DataFrame with 22 columns: LDAF, AVGPOST, RSQ, ERATE, THETA, CIEND, >> CIPOS,... >> info(header(vcf)): >> Number Type Description >> LDAF 1 Float MLE Allele Frequency Accounting for LD >> AVGPOST 1 Float Average posterior probability from >> MaCH/Thunder >> RSQ 1 Float Genotype imputation quality from MaCH/Thunder >> ERATE 1 Float Per-marker Mutation rate from MaCH/Thunder >> THETA 1 Float Per-marker Transition rate from MaCH/Thunder >> CIEND 2 Integer Confidence interval around END for >> imprecise va... >> CIPOS 2 Integer Confidence interval around POS for >> imprecise va... >> END 1 Integer End position of the variant described in >> this r... >> HOMLEN . Integer Length of base pair identical >> micro-homology at... >> HOMSEQ . String Sequence of base pair identical >> micro-homology ... >> SVLEN 1 Integer Difference in length between REF and ALT >> alleles >> SVTYPE 1 String Type of structural variant >> AC . Integer Alternate Allele Count >> AN 1 Integer Total Allele Count >> AA 1 String Ancestral Allele, >> ftp://ftp.1000genomes.ebi.ac.... >> AF 1 Float Global Allele Frequency based on AC/AN >> AMR_AF 1 Float Allele Frequency for samples from AMR >> based on ... >> ASN_AF 1 Float Allele Frequency for samples from ASN >> based on ... >> AFR_AF 1 Float Allele Frequency for samples from AFR >> based on ... >> EUR_AF 1 Float Allele Frequency for samples from EUR >> based on ... >> VT 1 String indicates what type of variant the line >> represents >> SNPSOURCE . String indicates if a snp was called when >> analysing th... >> geno(vcf): >> SimpleList of length 3: GT, DS, GL >> geno(header(vcf)): >> Number Type Description >> GT 1 String Genotype >> DS 1 Float Genotype dosage from MaCH/Thunder >> GL . Float Genotype Likelihoods >> >> there's a ton of information in there, and the volume is not >> predictable on the basis of the knowledge that we are confronting >> VCF. so the fact that readVcf can manage this is good and maybe worth >> a performance hit. if setting the ScanVcfParam doesn't adequately >> beat the approach of a drop-in parser with scanTabix, we should try to >> build out that approach a bit. But it all depends on what you want to >> get out. >> >> On Tue, May 14, 2013 at 11:59 AM, Ulrich Bodenhofer >> <bodenhofer at="" bioinf.jku.at="" <mailto:bodenhofer="" at="" bioinf.jku.at="">> wrote: >> >> Hi, >> >> I am currently working on a package that takes input from VCF >> files (possibly large ones) and I wanted to use readVcf() from the >> VariantAnnotation package for reading the data. However, I have >> run into severe performance and memory issues. The following test >> report is based on a bgzipped and tabix-indexed VCF file with 100 >> samples/columns and approx 1,9 millions of variants/rows. I have >> previously failed to load data of this size entirely using >> readVcf(), so I read the data in chunks. I always had the >> impression that this was quite slow, but now I compared it with >> tabix on the command line and scanTabix() from the Rsamtools >> packages and the results were stunning. Here are some code chunks. >> As you can see, I am trying to read a 1,400,001bp region from >> chr1. This region actually contains 8,757 variants/rows: >> >> > tf <- TabixFile("testFile100.vcf.gz") >> > gr <- GRanges(seqnames="1", ranges=IRanges(start=100000, >> end=1500000)) >> > >> > system.time(res1 <- readVcf(tf, "hg19", param=gr)) >> user system elapsed >> 540.080 0.624 541.906 >> There were 50 or more warnings (use warnings() to see the first 50) >> >> The scanTabix() function from the Rsamtools package gives the >> following result: >> >> > system.time(res2 <- scanTabix(tf, param=gr)) >> user system elapsed >> 0.170 0.002 0.171 >> >> The tabix command line tool takes approximately the same amount of >> time. I admit that scanTabix() just reads the data and does no >> parsing or any other processing, so I digged deeper and saw that >> readVcf() calls scanTabix() inside the function .vcf_scan(). >> Interestingly, this is done in a way that all the parsing is done >> inside the scanTabix() function by supplying a function that does >> the parsing through the undocumented tbxsym argument. So I tried >> the same approach: >> >> > maps <- VariantAnnotation:::.vcf_scan_header_maps(tf, >> fixed=character(), info=NA, >> + geno="GT") >> Warning message: >> In .bcfHeaderAsSimpleList(header) : >> duplicate keys in header will be forced to unique rownames >> > tbxstate <- maps[c("samples", "fmap", "imap", "gmap")] >> > tbxsym <- getNativeSymbolInfo(".tabix_as_vcf", >> "VariantAnnotation") >> > >> > system.time(res3 <- scanTabix(tf, param=gr, tbxsym=tbxsym, >> tbxstate=tbxstate)) >> user system elapsed >> 0.511 0.006 0.517 >> >> So, even if I include the same way of parsing VCF data as >> readVcf(), the total time is still in the range of half a second, >> which is still approx. 1000 times faster than calling readVcf(). >> So where is all the performance lost? I can hardly imagine that >> 539.5 of 540 seconds are spent on data transformations. >> >> A similar situation can be observed in terms of memory >> consumption: after having loaded the VariantAnnotation package >> (and all packages it depends upon), my R process occupies about >> 185MB main memory. Reading and parsing the data with scanTabix() >> increases the memory consumption by about 40MB, whereas calling >> readVcf() increases the memory consumption by more than 400MB . I >> do not think that such an amount of memory is needed to >> accommodate the output object res3; object.size() says it's about >> 8MB, but I know that these figure need not be accurate. >> >> Actually, I do not need the whole flexibility of readVcf(). If >> necessary, I would be satisfied with a workaround like the one >> based on scanTabix() above. However, I do not like the idea too >> much to use undocumented internals of other packages in my >> package. If possible, I would rather prefer to rely on readVcf(). >> So, I would be extremely grateful if someone could either explain >> the situation or, in case there are bugs, fix them. >> >> Thanks and best regards, >> Ulrich >> >> >> -------------------------------------------------------------- ---------- >> *Dr. Ulrich Bodenhofer* >> Associate Professor >> Institute of Bioinformatics >> >> *Johannes Kepler University* >> Altenberger Str. 69 >> 4040 Linz, Austria >> >> Tel. +43 732 2468 4526 <tel:%2b43%20732%202468%204526> >> Fax +43 732 2468 4539 <tel:%2b43%20732%202468%204539> >> bodenhofer at bioinf.jku.at <mailto:bodenhofer at="" bioinf.jku.at=""> >> <mailto:bodenhofer at="" bioinf.jku.at="" <mailto:bodenhofer="" at="" bioinf.jku.at="">> >> http://www.bioinf.jku.at/ <http: www.bioinf.jku.at=""> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD REPLY • link 11.0 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

> INFO variables with multiple values are parsed into CompressedLists. > FORMAT variables with multiple values are parsed into arrays. We thought > these parsed classes were more beneficial to the user for further > exploration/analysis. If you only want the list form you can use > scanBam() instead of readVcf(). scanBam() also takes a 'param' argument > so you can subset on position, variables etc. Sorry, not scanBam() but scanTabix(), as you've already discovered. Valerie

ADD REPLY • link 11.0 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Thanks for your reply, Valerie! > [...] > You mention that one file gives you 2 warnings but another gives you 50. Are the other 50 warnings the same? I checked the warning messages again and it turned out that I was wrong: the "duplicate keys" message does not appear multiple times, but, consistently with the ScanVcfParam example I sent yesterday, it appears only twice. All other warning messages (at least the ones that I can see with warnings()) are the following: unpackVcf field 'AD': NAs introduced by coercion R just gives the first 50 warnings, so I do not know how often this one appears, but my estimate is that it appears as many times as the VCF sub-set has records (8,757 in my example). Do you think that this number of warnings could lead to the observed performance bottleneck? No matter whether this is the source of the problem or not: the lesson I learned is that I should always focus on the minimum necessary information when reading a VCF file. So thanks to you and Vincent for your great help! Best regards, Ulrich

ADD REPLY • link 11.0 years ago UBod ▴ 290

0

Entering edit mode

glad to see this improvement. your questions on ratios and warnings will have to be handled by Valerie. On Wed, May 15, 2013 at 11:54 AM, Ulrich Bodenhofer < bodenhofer@bioinf.jku.at> wrote: > Vincent, > > Thanks for your fast and enlightening reply! You are absolutely right that > the key is to focus on the absolute minimum of necessary information, and I > admit I should have tried that before posting the question. The results are > the following: > > > gr <- GRanges(seqnames="1", ranges=IRanges(start=100000, end=1500000)) > > sp <- ScanVcfParam(which=gr, info=NA, geno="GT", fixed="ALT") > > > > system.time(res4 <- readVcf(tf, "hg19", param=sp)) > user system elapsed > 5.655 0.054 5.722 > Warning messages: > 1: In .bcfHeaderAsSimpleList(header) : > duplicate keys in header will be forced to unique rownames > 2: In .bcfHeaderAsSimpleList(header) : > duplicate keys in header will be forced to unique rownames > > So the whole thing speeds up by a factor of 100, which is indeed dramatic. > Of the 5.7 seconds, roughly 0.5 seconds are spent on reading and parsing > the file into a large nested list, whereas 5.2 seconds are spent on > converting the large nested list into an S4 object. I could argue about > whether this ratio is adequate, but I feel I don't need to, as the 5.7 > seconds appear feasible to me. In the other case that readVcf() processes > all data, the ratio is 0.5 to 539.5. I think this is, in any case, a weird > ratio, no matter how complicated the information in a VCF file is, as all > information is already contained in the list that is returned by the call > to scanTabix() in .vcf_scan(), in a well-structured manner and mostly in > the necessary data types. > > Different question: I have little idea what .makeMessage() does. Is it > possible that a large proportion of the time I reported is spent by > throwing warnings? The warnings produced by my original readVcf() call are > the same as above, but many more of them are thrown. I suppose they are due > to the fact that my VCF file is the result of merging multiple > single-sample VCF files, and the header of the merged file probably > contains a lot of redundant information. I do not know why one variant > throws two such warnings, while the other variant throws 50+ (probably many > more). I tried suppressWarnings(), but that just made the warnings > disappear, yet did not accelarate readVcf(). > > > Thanks and best regards, > Ulrich > > > On 05/15/2013 01:29 PM, Vincent Carey wrote: > > While I agree that achievable performance improvements should be sought on > their own terms, I have a feeling that you will not get very far without a > bit more specification of what you want readVcf to produce. My sense is > that you can get much better performance if you set the ScanVcfParam and > control the capture of INFO fields. There may be other fields that you do > not need. I'd propose that we take a public vcf file like a 1000 genomes > vcf (i happen to > have ALL.chr17.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz, > maybe you can provide a better one) and define some target of the import > task. > > > tf = > TabixFile("ALL.chr17.phase1_release_v3.20101123.snps_indels_svs.geno types.vcf.gz") > > mysvp = ScanVcfParam(which=GRanges("17", IRanges(1,100000)), info=NA, > geno="GT", fixed="ALT") > > unix.time(r1 <- readVcf(tf, "hg19", param=mysvp)) > user system elapsed > 1.468 0.520 1.994 > Warning messages: > 1: In .bcfHeaderAsSimpleList(header) : > duplicate keys in header will be forced to unique rownames > 2: In .bcfHeaderAsSimpleList(header) : > duplicate keys in header will be forced to unique rownames > > mysvp2 = ScanVcfParam(which=GRanges("17", IRanges(1,100000))) # no > info/geno control > > unix.time(r2 <- readVcf(tf, "hg19", param=mysvp2)) > user system elapsed > 36.704 1.959 38.860 > > have a look at r2 > > class: CollapsedVCF > dim: 1680 1092 > rowData(vcf): > GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER > info(vcf): > DataFrame with 22 columns: LDAF, AVGPOST, RSQ, ERATE, THETA, CIEND, > CIPOS,... > info(header(vcf)): > Number Type Description > > LDAF 1 Float MLE Allele Frequency Accounting for LD > > AVGPOST 1 Float Average posterior probability from > MaCH/Thunder > RSQ 1 Float Genotype imputation quality from MaCH/Thunder > > ERATE 1 Float Per-marker Mutation rate from MaCH/Thunder > > THETA 1 Float Per-marker Transition rate from MaCH/Thunder > > CIEND 2 Integer Confidence interval around END for imprecise > va... > CIPOS 2 Integer Confidence interval around POS for imprecise > va... > END 1 Integer End position of the variant described in this > r... > HOMLEN . Integer Length of base pair identical micro- homology > at... > HOMSEQ . String Sequence of base pair identical micro- homology > ... > SVLEN 1 Integer Difference in length between REF and ALT > alleles > SVTYPE 1 String Type of structural variant > > AC . Integer Alternate Allele Count > > AN 1 Integer Total Allele Count > > AA 1 String Ancestral Allele, > ftp://ftp.1000genomes.ebi.ac.... > AF 1 Float Global Allele Frequency based on AC/AN > > AMR_AF 1 Float Allele Frequency for samples from AMR based on > ... > ASN_AF 1 Float Allele Frequency for samples from ASN based on > ... > AFR_AF 1 Float Allele Frequency for samples from AFR based on > ... > EUR_AF 1 Float Allele Frequency for samples from EUR based on > ... > VT 1 String indicates what type of variant the line > represents > SNPSOURCE . String indicates if a snp was called when analysing > th... > geno(vcf): > SimpleList of length 3: GT, DS, GL > geno(header(vcf)): > Number Type Description > GT 1 String Genotype > DS 1 Float Genotype dosage from MaCH/Thunder > GL . Float Genotype Likelihoods > > there's a ton of information in there, and the volume is not predictable > on the basis of the knowledge that we are confronting VCF. so the fact > that readVcf can manage this is good and maybe worth a performance hit. if > setting the ScanVcfParam doesn't adequately beat the approach of a drop-in > parser with scanTabix, we should try to build out that approach a bit. But > it all depends on what you want to get out. > > On Tue, May 14, 2013 at 11:59 AM, Ulrich Bodenhofer < > bodenhofer@bioinf.jku.at> wrote: > >> Hi, >> >> I am currently working on a package that takes input from VCF files >> (possibly large ones) and I wanted to use readVcf() from the >> VariantAnnotation package for reading the data. However, I have run into >> severe performance and memory issues. The following test report is based on >> a bgzipped and tabix-indexed VCF file with 100 samples/columns and approx >> 1,9 millions of variants/rows. I have previously failed to load data of >> this size entirely using readVcf(), so I read the data in chunks. I always >> had the impression that this was quite slow, but now I compared it with >> tabix on the command line and scanTabix() from the Rsamtools packages and >> the results were stunning. Here are some code chunks. As you can see, I am >> trying to read a 1,400,001bp region from chr1. This region actually >> contains 8,757 variants/rows: >> >> > tf <- TabixFile("testFile100.vcf.gz") >> > gr <- GRanges(seqnames="1", ranges=IRanges(start=100000, >> end=1500000)) >> > >> > system.time(res1 <- readVcf(tf, "hg19", param=gr)) >> user system elapsed >> 540.080 0.624 541.906 >> There were 50 or more warnings (use warnings() to see the first 50) >> >> The scanTabix() function from the Rsamtools package gives the following >> result: >> >> > system.time(res2 <- scanTabix(tf, param=gr)) >> user system elapsed >> 0.170 0.002 0.171 >> >> The tabix command line tool takes approximately the same amount of time. >> I admit that scanTabix() just reads the data and does no parsing or any >> other processing, so I digged deeper and saw that readVcf() calls >> scanTabix() inside the function .vcf_scan(). Interestingly, this is done in >> a way that all the parsing is done inside the scanTabix() function by >> supplying a function that does the parsing through the undocumented tbxsym >> argument. So I tried the same approach: >> >> > maps <- VariantAnnotation:::.vcf_scan_header_maps(tf, >> fixed=character(), info=NA, >> + geno="GT") >> Warning message: >> In .bcfHeaderAsSimpleList(header) : >> duplicate keys in header will be forced to unique rownames >> > tbxstate <- maps[c("samples", "fmap", "imap", "gmap")] >> > tbxsym <- getNativeSymbolInfo(".tabix_as_vcf", "VariantAnnotation") >> > >> > system.time(res3 <- scanTabix(tf, param=gr, tbxsym=tbxsym, >> tbxstate=tbxstate)) >> user system elapsed >> 0.511 0.006 0.517 >> >> So, even if I include the same way of parsing VCF data as readVcf(), the >> total time is still in the range of half a second, which is still approx. >> 1000 times faster than calling readVcf(). So where is all the performance >> lost? I can hardly imagine that 539.5 of 540 seconds are spent on data >> transformations. >> >> A similar situation can be observed in terms of memory consumption: after >> having loaded the VariantAnnotation package (and all packages it depends >> upon), my R process occupies about 185MB main memory. Reading and parsing >> the data with scanTabix() increases the memory consumption by about 40MB, >> whereas calling readVcf() increases the memory consumption by more than >> 400MB . I do not think that such an amount of memory is needed to >> accommodate the output object res3; object.size() says it's about 8MB, but >> I know that these figure need not be accurate. >> >> Actually, I do not need the whole flexibility of readVcf(). If necessary, >> I would be satisfied with a workaround like the one based on scanTabix() >> above. However, I do not like the idea too much to use undocumented >> internals of other packages in my package. If possible, I would rather >> prefer to rely on readVcf(). So, I would be extremely grateful if someone >> could either explain the situation or, in case there are bugs, fix them. >> >> Thanks and best regards, >> Ulrich >> >> >> ------------------------------------------------------------------- ----- >> *Dr. Ulrich Bodenhofer* >> Associate Professor >> Institute of Bioinformatics >> >> *Johannes Kepler University* >> Altenberger Str. 69 >> 4040 Linz, Austria >> >> Tel. +43 732 2468 4526 >> Fax +43 732 2468 4539 >> bodenhofer@bioinf.jku.at <mailto:bodenhofer@bioinf.jku.at> >> http://www.bioinf.jku.at/ <http: www.bioinf.jku.at=""> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > > [[alternative HTML version deleted]]

ADD REPLY • link 11.0 years ago Vincent J. Carey, Jr. 6.7k

0

Entering edit mode

UBod ▴ 290

@ubodenhofer-5425

Last seen 8 days ago

University of Applied Sciences Upper Au…

Sorry, I forgot to include some information about my versions of R and the mentioned packages: > sessionInfo() R version 3.0.0 (2013-04-03) Platform: x86_64-unknown-linux-gnu/x86_64 (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] VariantAnnotation_1.6.1 Rsamtools_1.12.0 Biostrings_2.28.0 [4] GenomicRanges_1.12.1 IRanges_1.18.0 BiocGenerics_0.6.0 loaded via a namespace (and not attached): [1] AnnotationDbi_1.22.1 Biobase_2.20.0 biomaRt_2.16.0 [4] bitops_1.0-5 BSgenome_1.28.0 DBI_0.2-5 [7] GenomicFeatures_1.12.0 RCurl_1.95-4.1 RSQLite_0.11.2 [10] rtracklayer_1.20.0 stats4_3.0.0 tools_3.0.0 [13] XML_3.96-1.1 zlibbioc_1.6.0 In the meantime, I profiled the readVcf() test to figure out where most of the time is spent: > Rprof(memory.profiling=TRUE, gc.profiling=TRUE) > system.time(res1 <- readVcf(tf, "hg19", param=gr)) user system elapsed 515.390 0.047 516.623 There were 50 or more warnings (use warnings() to see the first 50) > Rprof(NULL) > summaryRprof(memory="both") $by.self self.time self.pct total.time total.pct mem.total ".makeMessage" 128.02 24.72 145.66 28.12 22067.9 "structure" 96.72 18.67 140.00 27.03 20838.4 "makeRestartList" 33.12 6.39 126.76 24.47 18921.9 "match" 31.92 6.16 32.76 6.33 4899.5 "<gc>" 24.40 4.71 24.40 4.71 332.0 "$<-" 21.74 4.20 23.22 4.48 3312.1 "doWithOneRestart" 19.94 3.85 383.52 74.05 57610.2 "$" 13.26 2.56 13.30 2.57 2013.4 "conditionMessage" 13.10 2.53 24.04 4.64 3645.3 "lapply" 12.78 2.47 505.10 97.52 76013.9 "warning" 12.04 2.32 282.56 54.56 42405.6 "FUN" 11.34 2.19 500.86 96.71 75547.9 "withRestarts" 10.80 2.09 478.60 92.41 71807.5 "%in%" 7.00 1.35 24.56 4.74 3698.8 "paste" 6.74 1.30 6.74 1.30 1010.4 "[[" 6.00 1.16 6.00 1.16 906.0 "<anonymous>" 5.68 1.10 503.96 97.30 75073.3 "findRestart" 5.62 1.09 11.66 2.25 1760.5 "sprintf" 5.30 1.02 29.36 5.67 4458.9 "withOneRestart" 5.26 1.02 410.18 79.20 61603.1 "do.call" 5.22 1.01 13.12 2.53 1994.0 "docall" 5.10 0.98 23.28 4.49 3482.4 "makeRestart" 4.76 0.92 79.22 15.30 11743.5 "unlist" 4.12 0.80 6.66 1.29 810.2 "simpleWarning" 3.92 0.76 69.62 13.44 10428.5 "match.fun" 3.56 0.69 3.60 0.70 537.0 "invokeRestart" 3.54 0.68 17.62 3.40 2647.7 ".signalSimpleWarning" 3.14 0.61 480.36 92.75 72073.0 "conditionMessage.condition" 2.94 0.57 10.86 2.10 1649.9 "isRestart" 2.36 0.46 2.38 0.46 349.8 "environment" 1.48 0.29 1.48 0.29 226.0 ".Call" 1.12 0.22 1.14 0.22 63.2 "parent.frame" 1.04 0.20 1.08 0.21 158.8 "strsplit" 0.76 0.15 0.90 0.17 321.0 "as.character" 0.58 0.11 0.58 0.11 87.8 "mapply" 0.44 0.08 499.36 96.42 74815.4 "slot<-" 0.44 0.08 3.04 0.59 296.7 "deparse" 0.36 0.07 0.36 0.07 9.3 "fun" 0.30 0.06 0.30 0.06 42.8 "initialize" 0.20 0.04 9.64 1.86 1465.9 "endoapply" 0.20 0.04 4.16 0.80 602.5 "list" 0.20 0.04 0.26 0.05 194.8 ".collapseLists" 0.18 0.03 506.08 97.71 75986.2 "loadMethod" 0.10 0.02 0.14 0.03 59.6 "standardGeneric" 0.08 0.02 516.52 99.73 77637.0 "==" 0.08 0.02 0.08 0.02 10.0 "withCallingHandlers" 0.06 0.01 498.58 96.27 74587.6 "array" 0.06 0.01 2.86 0.55 227.4 "factor" 0.06 0.01 0.06 0.01 20.9 "eval" 0.04 0.01 499.44 96.43 74842.5 "tryCatch" 0.04 0.01 3.66 0.71 190.5 ".forceType" 0.04 0.01 2.02 0.39 120.3 "mode<-" 0.04 0.01 1.96 0.38 100.2 "validObject" 0.04 0.01 1.94 0.37 60.5 "assign" 0.04 0.01 0.22 0.04 187.2 "split.default" 0.04 0.01 0.10 0.02 27.7 "aperm.default" 0.04 0.01 0.06 0.01 10.0 "get" 0.04 0.01 0.04 0.01 11.1 "grep" 0.04 0.01 0.04 0.01 3.0 "ls" 0.04 0.01 0.04 0.01 12.7 "doTryCatch" 0.02 0.00 3.64 0.70 185.4 "tryCatchOne" 0.02 0.00 3.64 0.70 185.4 "as.numeric" 0.02 0.00 0.84 0.16 20.0 "length" 0.02 0.00 0.06 0.01 21.0 "elementMetadata" 0.02 0.00 0.04 0.01 22.8 "exists" 0.02 0.00 0.04 0.01 14.1 "getClass" 0.02 0.00 0.04 0.01 17.2 ".getClassFromCache" 0.02 0.00 0.02 0.00 11.0 "is.na" 0.02 0.00 0.02 0.00 19.5 "matrix" 0.02 0.00 0.02 0.00 7.4 "objects" 0.02 0.00 0.02 0.00 7.2 "parent.env" 0.02 0.00 0.02 0.00 6.1 "possibleExtends" 0.02 0.00 0.02 0.00 9.4 "rownames<-" 0.02 0.00 0.02 0.00 31.1 "setNames" 0.02 0.00 0.02 0.00 27.0 $by.total total.time total.pct mem.total self.time self.pct "system.time" 517.92 100.00 77637.0 0.00 0.00 "standardGeneric" 516.52 99.73 77637.0 0.08 0.02 ".readVcf" 516.52 99.73 77637.0 0.00 0.00 "readVcf" 516.52 99.73 77637.0 0.00 0.00 ".scanVcfToVCF" 516.52 99.73 77637.0 0.00 0.00 ".collapseLists" 506.08 97.71 75986.2 0.18 0.03 "sapply" 505.58 97.62 75866.1 0.00 0.00 "scanVcf" 505.20 97.54 75848.8 0.00 0.00 "lapply" 505.10 97.52 76013.9 12.78 2.47 ".vcf_scan" 504.20 97.35 75392.5 0.00 0.00 "<anonymous>" 503.96 97.30 75073.3 5.68 1.10 ".unpackVcf" 502.88 97.10 75278.6 0.00 0.00 "FUN" 500.86 96.71 75547.9 11.34 2.19 "Map" 499.46 96.44 74842.6 0.00 0.00 "eval" 499.44 96.43 74842.5 0.04 0.01 "mapply" 499.36 96.42 74815.4 0.44 0.08 ".Method" 499.36 96.42 74815.4 0.00 0.00 ".unpackVcfTag" 499.12 96.37 74795.4 0.00 0.00 "withCallingHandlers" 498.58 96.27 74587.6 0.06 0.01 ".unpackVcfField" 498.58 96.27 74587.6 0.00 0.00 ".signalSimpleWarning" 480.36 92.75 72073.0 3.14 0.61 "withRestarts" 478.60 92.41 71807.5 10.80 2.09 "withOneRestart" 410.18 79.20 61603.1 5.26 1.02 "doWithOneRestart" 383.52 74.05 57610.2 19.94 3.85 "warning" 282.56 54.56 42405.6 12.04 2.32 ".makeMessage" 145.66 28.12 22067.9 128.02 24.72 "structure" 140.00 27.03 20838.4 96.72 18.67 "makeRestartList" 126.76 24.47 18921.9 33.12 6.39 "makeRestart" 79.22 15.30 11743.5 4.76 0.92 "simpleWarning" 69.62 13.44 10428.5 3.92 0.76 "match" 32.76 6.33 4899.5 31.92 6.16 "sprintf" 29.36 5.67 4458.9 5.30 1.02 "%in%" 24.56 4.74 3698.8 7.00 1.35 "<gc>" 24.40 4.71 332.0 24.40 4.71 "conditionMessage" 24.04 4.64 3645.3 13.10 2.53 "docall" 23.28 4.49 3482.4 5.10 0.98 "$<-" 23.22 4.48 3312.1 21.74 4.20 "invokeRestart" 17.62 3.40 2647.7 3.54 0.68 "$" 13.30 2.57 2013.4 13.26 2.56 "do.call" 13.12 2.53 1994.0 5.22 1.01 "findRestart" 11.66 2.25 1760.5 5.62 1.09 "conditionMessage.condition" 10.86 2.10 1649.9 2.94 0.57 "initialize" 9.64 1.86 1465.9 0.20 0.04 "new" 9.64 1.86 1465.9 0.00 0.00 "VCF" 7.70 1.49 1443.5 0.00 0.00 "SummarizedExperiment" 7.66 1.48 1393.2 0.00 0.00 ".local" 6.82 1.32 944.4 0.00 0.00 "paste" 6.74 1.30 1010.4 6.74 1.30 "unlist" 6.66 1.29 810.2 4.12 0.80 "[[" 6.00 1.16 906.0 6.00 1.16 "endoapply" 4.16 0.80 602.5 0.20 0.04 "tryCatch" 3.66 0.71 190.5 0.04 0.01 "doTryCatch" 3.64 0.70 185.4 0.02 0.00 "tryCatchOne" 3.64 0.70 185.4 0.02 0.00 "tryCatchList" 3.64 0.70 185.4 0.00 0.00 "match.fun" 3.60 0.70 537.0 3.56 0.69 "slot<-" 3.04 0.59 296.7 0.44 0.08 "array" 2.86 0.55 227.4 0.06 0.01 "try" 2.40 0.46 99.3 0.00 0.00 "isRestart" 2.38 0.46 349.8 2.36 0.46 ".formatInfo" 2.36 0.46 155.4 0.00 0.00 ".forceType" 2.02 0.39 120.3 0.04 0.01 "mode<-" 1.96 0.38 100.2 0.04 0.01 "validObject" 1.94 0.37 60.5 0.04 0.01 "new2" 1.94 0.37 49.9 0.00 0.00 ".formatList" 1.88 0.36 43.5 0.00 0.00 ".coerceToList" 1.86 0.36 35.5 0.00 0.00 "NumericList" 1.82 0.35 18.2 0.00 0.00 "PartitioningByEnd" 1.80 0.35 7.2 0.00 0.00 "environment" 1.48 0.29 226.0 1.48 0.29 "gc" 1.40 0.27 0.0 0.00 0.00 ".Call" 1.14 0.22 63.2 1.12 0.22 "scanTabix" 1.14 0.22 80.2 0.00 0.00 ".tabix_scan" 1.14 0.22 80.2 0.00 0.00 "parent.frame" 1.08 0.21 158.8 1.04 0.20 "strsplit" 0.90 0.17 321.0 0.76 0.15 "SimpleList" 0.86 0.17 468.0 0.00 0.00 "as.numeric" 0.84 0.16 20.0 0.02 0.00 "as.character" 0.58 0.11 87.8 0.58 0.11 "as" 0.58 0.11 86.1 0.00 0.00 "asMethod" 0.56 0.11 79.4 0.00 0.00 "DataFrame" 0.54 0.10 91.6 0.00 0.00 "deparse" 0.36 0.07 9.3 0.36 0.07 "scanBcfHeader" 0.34 0.07 47.3 0.00 0.00 "scanVcfHeader" 0.34 0.07 47.3 0.00 0.00 ".bcfHeaderAsSimpleList" 0.32 0.06 47.2 0.00 0.00 "fun" 0.30 0.06 42.8 0.30 0.06 "list" 0.26 0.05 194.8 0.20 0.04 "assign" 0.22 0.04 187.2 0.04 0.01 "envRefSetField" 0.18 0.03 168.6 0.00 0.00 "tapply" 0.16 0.03 39.3 0.00 0.00 ".vcf_scan_header_maps" 0.16 0.03 23.6 0.00 0.00 "loadMethod" 0.14 0.03 59.6 0.10 0.02 ".unpackVcfInfo" 0.14 0.03 59.2 0.00 0.00 "split" 0.12 0.02 33.7 0.00 0.00 "which" 0.12 0.02 24.1 0.00 0.00 "split.default" 0.10 0.02 27.7 0.04 0.01 "==" 0.08 0.02 10.0 0.08 0.02 "as.factor" 0.08 0.02 33.7 0.00 0.00 "factor" 0.06 0.01 20.9 0.06 0.01 "aperm.default" 0.06 0.01 10.0 0.04 0.01 "length" 0.06 0.01 21.0 0.02 0.00 "aperm" 0.06 0.01 10.0 0.00 0.00 "as.list" 0.06 0.01 18.0 0.00 0.00 "get" 0.04 0.01 11.1 0.04 0.01 "grep" 0.04 0.01 3.0 0.04 0.01 "ls" 0.04 0.01 12.7 0.04 0.01 "elementMetadata" 0.04 0.01 22.8 0.02 0.00 "exists" 0.04 0.01 14.1 0.02 0.00 "getClass" 0.04 0.01 17.2 0.02 0.00 "anyStrings" 0.04 0.01 22.8 0.00 0.00 ".classEnv" 0.04 0.01 12.7 0.00 0.00 "getClassDef" 0.04 0.01 14.1 0.00 0.00 "IntegerList" 0.04 0.01 17.3 0.00 0.00 "is" 0.04 0.01 17.4 0.00 0.00 "listClassName" 0.04 0.01 14.1 0.00 0.00 "loadedNamespaces" 0.04 0.01 12.7 0.00 0.00 "mcols" 0.04 0.01 22.8 0.00 0.00 "ncol" 0.04 0.01 8.2 0.00 0.00 "relist" 0.04 0.01 13.4 0.00 0.00 ".requirePackage" 0.04 0.01 12.7 0.00 0.00 "seqnames" 0.04 0.01 28.6 0.00 0.00 "validityMethod" 0.04 0.01 22.8 0.00 0.00 ".getClassFromCache" 0.02 0.00 11.0 0.02 0.00 "is.na" 0.02 0.00 19.5 0.02 0.00 "matrix" 0.02 0.00 7.4 0.02 0.00 "objects" 0.02 0.00 7.2 0.02 0.00 "parent.env" 0.02 0.00 6.1 0.02 0.00 "possibleExtends" 0.02 0.00 9.4 0.02 0.00 "rownames<-" 0.02 0.00 31.1 0.02 0.00 "setNames" 0.02 0.00 27.0 0.02 0.00 "as.vector" 0.02 0.00 5.7 0.00 0.00 "CharacterList" 0.02 0.00 8.0 0.00 0.00 "checkAtAssignment" 0.02 0.00 11.0 0.00 0.00 "close" 0.02 0.00 10.0 0.00 0.00 "close.TabixFile" 0.02 0.00 10.0 0.00 0.00 "coercerToClass" 0.02 0.00 9.9 0.00 0.00 "colnames" 0.02 0.00 19.1 0.00 0.00 "dim" 0.02 0.00 19.1 0.00 0.00 "dimnames<-" 0.02 0.00 31.1 0.00 0.00 "elementLengths" 0.02 0.00 6.5 0.00 0.00 "elementMetadata<-" 0.02 0.00 19.1 0.00 0.00 ".findInheritedMethods" 0.02 0.00 7.2 0.00 0.00 ".findMethodInTable" 0.02 0.00 9.4 0.00 0.00 ".formatALT" 0.02 0.00 5.2 0.00 0.00 "getDanglingSeqlevels" 0.02 0.00 15.7 0.00 0.00 "is.factor" 0.02 0.00 12.9 0.00 0.00 "mcols<-" 0.02 0.00 19.1 0.00 0.00 "names<-" 0.02 0.00 11.0 0.00 0.00 "newCompressedList0" 0.02 0.00 7.4 0.00 0.00 "newGRanges" 0.02 0.00 3.7 0.00 0.00 "nrow" 0.02 0.00 6.6 0.00 0.00 "PartitioningByWidth" 0.02 0.00 5.2 0.00 0.00 ".quickCoerceSelect" 0.02 0.00 9.4 0.00 0.00 "rowData" 0.02 0.00 19.1 0.00 0.00 "ScanBcfParam" 0.02 0.00 6.1 0.00 0.00 "ScanVcfParam" 0.02 0.00 6.1 0.00 0.00 "selectMethod" 0.02 0.00 7.2 0.00 0.00 "seqinfo<-" 0.02 0.00 15.7 0.00 0.00 "seqlevels" 0.02 0.00 15.7 0.00 0.00 "splitAsList" 0.02 0.00 6.1 0.00 0.00 ".splitAsList_by_Rle" 0.02 0.00 6.1 0.00 0.00 "splitAsListReturnedClass" 0.02 0.00 6.1 0.00 0.00 ".toDNAStringSetList" 0.02 0.00 5.2 0.00 0.00 "topenv" 0.02 0.00 6.1 0.00 0.00 "unique" 0.02 0.00 0.1 0.00 0.00 "valid.func" 0.02 0.00 3.7 0.00 0.00 ".valid.GenomicRanges.mcols" 0.02 0.00 19.1 0.00 0.00 ".valid.VCF.fixed" 0.02 0.00 19.1 0.00 0.00 ".valid.Vector.mcols" 0.02 0.00 3.7 0.00 0.00 $sample.interval [1] 0.02 $sampling.time [1] 517.92 I hope this helps to clarify my questions. Thanks to everybody in advance, Ulrich On 05/14/2013 05:59 PM, Ulrich Bodenhofer wrote: > Hi, > > I am currently working on a package that takes input from VCF files > (possibly large ones) and I wanted to use readVcf() from the > VariantAnnotation package for reading the data. However, I have run > into severe performance and memory issues. The following test report > is based on a bgzipped and tabix-indexed VCF file with 100 > samples/columns and approx 1,9 millions of variants/rows. I have > previously failed to load data of this size entirely using readVcf(), > so I read the data in chunks. I always had the impression that this > was quite slow, but now I compared it with tabix on the command line > and scanTabix() from the Rsamtools packages and the results were > stunning. Here are some code chunks. As you can see, I am trying to > read a 1,400,001bp region from chr1. This region actually contains > 8,757 variants/rows: > > > tf <- TabixFile("testFile100.vcf.gz") > > gr <- GRanges(seqnames="1", ranges=IRanges(start=100000, > end=1500000)) > > > > system.time(res1 <- readVcf(tf, "hg19", param=gr)) > user system elapsed > 540.080 0.624 541.906 > There were 50 or more warnings (use warnings() to see the first 50) > > The scanTabix() function from the Rsamtools package gives the > following result: > > > system.time(res2 <- scanTabix(tf, param=gr)) > user system elapsed > 0.170 0.002 0.171 > > The tabix command line tool takes approximately the same amount of > time. I admit that scanTabix() just reads the data and does no parsing > or any other processing, so I digged deeper and saw that readVcf() > calls scanTabix() inside the function .vcf_scan(). Interestingly, this > is done in a way that all the parsing is done inside the scanTabix() > function by supplying a function that does the parsing through the > undocumented tbxsym argument. So I tried the same approach: > > > maps <- VariantAnnotation:::.vcf_scan_header_maps(tf, > fixed=character(), info=NA, > + geno="GT") > Warning message: > In .bcfHeaderAsSimpleList(header) : > duplicate keys in header will be forced to unique rownames > > tbxstate <- maps[c("samples", "fmap", "imap", "gmap")] > > tbxsym <- getNativeSymbolInfo(".tabix_as_vcf", "VariantAnnotation") > > > > system.time(res3 <- scanTabix(tf, param=gr, tbxsym=tbxsym, > tbxstate=tbxstate)) >> > > > -- > -------------------------------------------------------------------- ---- > *Dr. Ulrich Bodenhofer* > Associate Professor > Institute of Bioinformatics > > *Johannes Kepler University* > Altenberger Str. 69 > 4040 Linz, Austria > > Tel. +43 732 2468 4526 > Fax +43 732 2468 4539 > bodenhofer at bioinf.jku.at <mailto:bodenhofer at="" bioinf.jku.at=""> > http://www.bioinf.jku.at/ <http: www.bioinf.jku.at=""> > user system elapsed > 0.511 0.006 0.517 > > So, even if I include the same way of parsing VCF data as readVcf(), > the total time is still in the range of half a second, which is still > approx. 1000 times faster than calling readVcf(). So where is all the > performance lost? I can hardly imagine that 539.5 of 540 seconds are > spent on data transformations. > > A similar situation can be observed in terms of memory consumption: > after having loaded the VariantAnnotation package (and all packages it > depends upon), my R process occupies about 185MB main memory. Reading > and parsing the data with scanTabix() increases the memory consumption > by about 40MB, whereas calling readVcf() increases the memory > consumption by more than 400MB . I do not think that such an amount of > memory is needed to accommodate the output object res3; object.size() > says it's about 8MB, but I know that these figure need not be accurate. > > Actually, I do not need the whole flexibility of readVcf(). If > necessary, I would be satisfied with a workaround like the one based > on scanTabix() above. However, I do not like the idea too much to use > undocumented internals of other packages in my package. If possible, I > would rather prefer to rely on readVcf(). So, I would be extremely > grateful if someone could either explain the situation or, in case > there are bugs, fix them. > > Thanks and best regards, > Ulrich > > > -------------------------------------------------------------------- ---- > *Dr. Ulrich Bodenhofer* > Associate Professor > Institute of Bioinformatics > > *Johannes Kepler University* > Altenberger Str. 69 > 4040 Linz, Austria > > Tel. +43 732 2468 4526 > Fax +43 732 2468 4539 > bodenhofer at bioinf.jku.at <mailto:bodenhofer at="" bioinf.jku.at=""> > http://www.bioinf.jku.at/ <http: www.bioinf.jku.at="">

ADD COMMENT • link 11.0 years ago UBod ▴ 290

Login before adding your answer.