merging VCF files
2
1
Entering edit mode
@valerie-obenchain-4275
Last seen 2.3 years ago
United States
rbind and cbind are now implemented for SummarizedExperiment (GenomicRanges 1.11.29) and VCF (VariantAnnotation 1.5.34). rbind is appropriate for the case of different ranges (variants) and the same samples. cbind is appropriate for the same ranges and different samples. Let me know if you run into problems/bugs. Valerie On 01/11/2013 02:22 PM, Stephanie M. Gogarten wrote: > Hi all, > > Does VariantAnnotation currently have a method to merge VCF objects? > I've been looking through the documentation and code and haven't found > anything like this. If not, I think it would be a useful feature to add. > > My use case: I have two VCF files, with the same samples (but in > different order in each file). The two files have non-overlapping > variants. I would love to have an rbind(VCF, VCF) method; then I could > do something like: > > vcf2 <- vcf2[,colnames(vcf1)] > vcf <- rbind(vcf1, vcf2) > > cbind() would also be useful, for combining files with the same variants > but different samples. > > thanks, > Stephanie > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor
VariantAnnotation VariantAnnotation VariantAnnotation VariantAnnotation • 5.6k views
ADD COMMENT
0
Entering edit mode
@stephanie-m-gogarten-5121
Last seen 4 weeks ago
University of Washington
I don't think that this qualifies as a "problem," but I found that an extra step is needed (beyond my initial simple use case) to harmonize colData if the colnames of the VCFs have to be re-ordered. From the man page for "VCF-class" it is not clear that "colData(x) <- value" is possible, but happily it works. Here is what I did: svp <- ScanVcfParam(geno="GT") vcf1 <- readVcf(vcffile1, "hg19", svp) vcf2 <- readVcf(vcffile2, "hg19", svp) ## vcf1 and vcf2 have the same samples, but in different order vcf2 <- vcf2[,colnames(vcf1)] vcf <- rbind(vcf1, vcf2) ## Error in FUN("Samples"[[1L]], ...) : ## column(s) 'Samples' in ?colData? are duplicated and the data do not match ## colData is automatically created by readVcf with "Samples" ordered 1:N ## after re-ordering columns, have to change it in one of the VCFs to use rbind colData(vcf2) <- colData(vcf1) vcf <- rbind(vcf1, vcf2) seqnames(vcf) ## factor-Rle of length 226660 with 48 runs ## Lengths: 19000 7086 11905 9249 3070 ... 2050 1568 1855 824 5 ## Values : 1 10 11 12 13 ... 7 8 9 X Y ## Levels(24): 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 9 X Y ## re-order rows so chromosomes are in blocks vcf <- vcf[order(rownames(vcf)),] seqnames(vcf) ## factor-Rle of length 226660 with 24 runs ## Lengths: 23538 8847 14865 11463 3835 ... 10391 8244 9637 4695 33 ## Values : 1 10 11 12 13 ... 7 8 9 X Y ## Levels(24): 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 9 X Y ## we can also order by chromosome and position chrom <- as.character(seqnames(vcf)) chrom[chrom == "X"] <- 23 chrom[chrom == "Y"] <- 24 chrom <- as.integer(chrom) pos <- start(ranges(rowData(vcf))) vcf <- vcf[order(chrom, pos),] seqnames(vcf) ## factor-Rle of length 226660 with 24 runs ## Lengths: 23538 16118 13559 9365 10466 ... 6120 2642 4982 4695 33 ## Values : 1 2 3 4 5 ... 20 21 22 X Y ## Levels(24): 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 9 X Y Thanks for the helpful additions! Stephanie On 2/5/13 1:43 PM, Valerie Obenchain wrote: > rbind and cbind are now implemented for SummarizedExperiment > (GenomicRanges 1.11.29) and VCF (VariantAnnotation 1.5.34). > > rbind is appropriate for the case of different ranges (variants) and the > same samples. cbind is appropriate for the same ranges and different > samples. > > Let me know if you run into problems/bugs. > > Valerie > > > On 01/11/2013 02:22 PM, Stephanie M. Gogarten wrote: >> Hi all, >> >> Does VariantAnnotation currently have a method to merge VCF objects? >> I've been looking through the documentation and code and haven't found >> anything like this. If not, I think it would be a useful feature to add. >> >> My use case: I have two VCF files, with the same samples (but in >> different order in each file). The two files have non-overlapping >> variants. I would love to have an rbind(VCF, VCF) method; then I could >> do something like: >> >> vcf2 <- vcf2[,colnames(vcf1)] >> vcf <- rbind(vcf1, vcf2) >> >> cbind() would also be useful, for combining files with the same variants >> but different samples. >> >> thanks, >> Stephanie >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >
ADD COMMENT
0
Entering edit mode
Hi Stephanie, In the case of same samples, different order I've renamed 'Samples' to 'Samples.*' and kept them all. cbind() also keeps all columns so at least this is consistent between the two. An error will still be thrown for all other columns in colData with the same name but different data. Here is the behavior in 1.5.40. VCF file 1: > colData(vcf) DataFrame with 3 rows and 1 column Samples <integer> NA00001 1 NA00002 2 NA00003 3 VCF file 2 with samples in different order: > colData(vcf2) DataFrame with 3 rows and 1 column Samples <integer> NA00003 1 NA00002 2 NA00001 3 Reorder the second file so we can rbind with the first. > vcf2 <- vcf2[,colnames(vcf)] > colData(vcf2) DataFrame with 3 rows and 1 column Samples <integer> NA00001 3 NA00002 2 NA00003 1 The result keeps all columns. The .* extension corresponds to the order the files were input to rbind(). > res <- rbind(vcf, vcf2) > colData(res) DataFrame with 3 rows and 2 columns Samples.1 Samples.2 <integer> <integer> 1 1 3 2 2 2 3 3 1 I missed the point of reordering by chromosome blocks in your example below. Was there trouble with the ordering of other slots after rbind() or ... ? Thanks, Valerie On 02/15/2013 01:56 PM, Stephanie M. Gogarten wrote: > I don't think that this qualifies as a "problem," but I found that an > extra step is needed (beyond my initial simple use case) to harmonize > colData if the colnames of the VCFs have to be re-ordered. From the man > page for "VCF-class" it is not clear that "colData(x) <- value" is > possible, but happily it works. > > Here is what I did: > > svp <- ScanVcfParam(geno="GT") > vcf1 <- readVcf(vcffile1, "hg19", svp) > vcf2 <- readVcf(vcffile2, "hg19", svp) > > ## vcf1 and vcf2 have the same samples, but in different order > vcf2 <- vcf2[,colnames(vcf1)] > vcf <- rbind(vcf1, vcf2) > ## Error in FUN("Samples"[[1L]], ...) : > ## column(s) 'Samples' in ?colData? are duplicated and the data do not > match > > ## colData is automatically created by readVcf with "Samples" ordered 1:N > ## after re-ordering columns, have to change it in one of the VCFs to > use rbind > colData(vcf2) <- colData(vcf1) > vcf <- rbind(vcf1, vcf2) > > seqnames(vcf) > ## factor-Rle of length 226660 with 48 runs > ## Lengths: 19000 7086 11905 9249 3070 ... 2050 1568 1855 824 > 5 > ## Values : 1 10 11 12 13 ... 7 8 9 X > Y > ## Levels(24): 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 9 X Y > > ## re-order rows so chromosomes are in blocks > vcf <- vcf[order(rownames(vcf)),] > seqnames(vcf) > ## factor-Rle of length 226660 with 24 runs > ## Lengths: 23538 8847 14865 11463 3835 ... 10391 8244 9637 4695 > 33 > ## Values : 1 10 11 12 13 ... 7 8 9 X > Y > ## Levels(24): 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 9 X Y > > ## we can also order by chromosome and position > chrom <- as.character(seqnames(vcf)) > chrom[chrom == "X"] <- 23 > chrom[chrom == "Y"] <- 24 > chrom <- as.integer(chrom) > pos <- start(ranges(rowData(vcf))) > vcf <- vcf[order(chrom, pos),] > seqnames(vcf) > ## factor-Rle of length 226660 with 24 runs > ## Lengths: 23538 16118 13559 9365 10466 ... 6120 2642 4982 4695 > 33 > ## Values : 1 2 3 4 5 ... 20 21 22 X > Y > ## Levels(24): 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 9 X Y > > > Thanks for the helpful additions! > > Stephanie > > On 2/5/13 1:43 PM, Valerie Obenchain wrote: >> rbind and cbind are now implemented for SummarizedExperiment >> (GenomicRanges 1.11.29) and VCF (VariantAnnotation 1.5.34). >> >> rbind is appropriate for the case of different ranges (variants) and the >> same samples. cbind is appropriate for the same ranges and different >> samples. >> >> Let me know if you run into problems/bugs. >> >> Valerie >> >> >> On 01/11/2013 02:22 PM, Stephanie M. Gogarten wrote: >>> Hi all, >>> >>> Does VariantAnnotation currently have a method to merge VCF objects? >>> I've been looking through the documentation and code and haven't found >>> anything like this. If not, I think it would be a useful feature to add. >>> >>> My use case: I have two VCF files, with the same samples (but in >>> different order in each file). The two files have non-overlapping >>> variants. I would love to have an rbind(VCF, VCF) method; then I could >>> do something like: >>> >>> vcf2 <- vcf2[,colnames(vcf1)] >>> vcf <- rbind(vcf1, vcf2) >>> >>> cbind() would also be useful, for combining files with the same variants >>> but different samples. >>> >>> thanks, >>> Stephanie >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>
ADD REPLY
0
Entering edit mode
@stephanie-m-gogarten-5121
Last seen 4 weeks ago
University of Washington
No problem with ordering of rbind, just (perhaps) a common use case to put in an example. thanks, Stephanie On 2/25/13 1:27 PM, Valerie Obenchain wrote: > Hi Stephanie, > > In the case of same samples, different order I've renamed 'Samples' to > 'Samples.*' and kept them all. cbind() also keeps all columns so at > least this is consistent between the two. An error will still be thrown > for all other columns in colData with the same name but different data. > > Here is the behavior in 1.5.40. > > VCF file 1: > > colData(vcf) > DataFrame with 3 rows and 1 column > Samples > <integer> > NA00001 1 > NA00002 2 > NA00003 3 > > VCF file 2 with samples in different order: > > colData(vcf2) > DataFrame with 3 rows and 1 column > Samples > <integer> > NA00003 1 > NA00002 2 > NA00001 3 > > Reorder the second file so we can rbind with the first. > > vcf2 <- vcf2[,colnames(vcf)] > > colData(vcf2) > DataFrame with 3 rows and 1 column > Samples > <integer> > NA00001 3 > NA00002 2 > NA00003 1 > > The result keeps all columns. The .* extension corresponds to the order > the files were input to rbind(). > > > res <- rbind(vcf, vcf2) > > colData(res) > DataFrame with 3 rows and 2 columns > Samples.1 Samples.2 > <integer> <integer> > 1 1 3 > 2 2 2 > 3 3 1 > > I missed the point of reordering by chromosome blocks in your example > below. Was there trouble with the ordering of other slots after rbind() > or ... ? > > Thanks, > Valerie > > On 02/15/2013 01:56 PM, Stephanie M. Gogarten wrote: >> I don't think that this qualifies as a "problem," but I found that an >> extra step is needed (beyond my initial simple use case) to harmonize >> colData if the colnames of the VCFs have to be re-ordered. From the man >> page for "VCF-class" it is not clear that "colData(x) <- value" is >> possible, but happily it works. >> >> Here is what I did: >> >> svp <- ScanVcfParam(geno="GT") >> vcf1 <- readVcf(vcffile1, "hg19", svp) >> vcf2 <- readVcf(vcffile2, "hg19", svp) >> >> ## vcf1 and vcf2 have the same samples, but in different order >> vcf2 <- vcf2[,colnames(vcf1)] >> vcf <- rbind(vcf1, vcf2) >> ## Error in FUN("Samples"[[1L]], ...) : >> ## column(s) 'Samples' in ?colData? are duplicated and the data do not >> match >> >> ## colData is automatically created by readVcf with "Samples" ordered 1:N >> ## after re-ordering columns, have to change it in one of the VCFs to >> use rbind >> colData(vcf2) <- colData(vcf1) >> vcf <- rbind(vcf1, vcf2) >> >> seqnames(vcf) >> ## factor-Rle of length 226660 with 48 runs >> ## Lengths: 19000 7086 11905 9249 3070 ... 2050 1568 1855 824 >> 5 >> ## Values : 1 10 11 12 13 ... 7 8 9 X >> Y >> ## Levels(24): 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 >> 9 X Y >> >> ## re-order rows so chromosomes are in blocks >> vcf <- vcf[order(rownames(vcf)),] >> seqnames(vcf) >> ## factor-Rle of length 226660 with 24 runs >> ## Lengths: 23538 8847 14865 11463 3835 ... 10391 8244 9637 4695 >> 33 >> ## Values : 1 10 11 12 13 ... 7 8 9 X >> Y >> ## Levels(24): 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 >> 9 X Y >> >> ## we can also order by chromosome and position >> chrom <- as.character(seqnames(vcf)) >> chrom[chrom == "X"] <- 23 >> chrom[chrom == "Y"] <- 24 >> chrom <- as.integer(chrom) >> pos <- start(ranges(rowData(vcf))) >> vcf <- vcf[order(chrom, pos),] >> seqnames(vcf) >> ## factor-Rle of length 226660 with 24 runs >> ## Lengths: 23538 16118 13559 9365 10466 ... 6120 2642 4982 4695 >> 33 >> ## Values : 1 2 3 4 5 ... 20 21 22 X >> Y >> ## Levels(24): 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 >> 9 X Y >> >> >> Thanks for the helpful additions! >> >> Stephanie >> >> On 2/5/13 1:43 PM, Valerie Obenchain wrote: >>> rbind and cbind are now implemented for SummarizedExperiment >>> (GenomicRanges 1.11.29) and VCF (VariantAnnotation 1.5.34). >>> >>> rbind is appropriate for the case of different ranges (variants) and the >>> same samples. cbind is appropriate for the same ranges and different >>> samples. >>> >>> Let me know if you run into problems/bugs. >>> >>> Valerie >>> >>> >>> On 01/11/2013 02:22 PM, Stephanie M. Gogarten wrote: >>>> Hi all, >>>> >>>> Does VariantAnnotation currently have a method to merge VCF objects? >>>> I've been looking through the documentation and code and haven't found >>>> anything like this. If not, I think it would be a useful feature to >>>> add. >>>> >>>> My use case: I have two VCF files, with the same samples (but in >>>> different order in each file). The two files have non-overlapping >>>> variants. I would love to have an rbind(VCF, VCF) method; then I could >>>> do something like: >>>> >>>> vcf2 <- vcf2[,colnames(vcf1)] >>>> vcf <- rbind(vcf1, vcf2) >>>> >>>> cbind() would also be useful, for combining files with the same >>>> variants >>>> but different samples. >>>> >>>> thanks, >>>> Stephanie >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >
ADD COMMENT

Login before adding your answer.

Traffic: 610 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6