Search
Question: Best practices to find intersection among variants
0
4.2 years ago by
United States/Kansas City/Stowers Institute for Medical Research
Marco Blanchette210 wrote:
I am very very new to working with variants, maybe this question is very basic but I need to get kickstarted a bit? Just ran an analysis to find the common variation in a set of lab strains used in house in the haploid genomes of S. pombe. I used GATK best practices and I am at the stage where I have filtered variants and I would like to find the common ones. My first intuition for this was to turn to R (tired to running Java command line?) and I fumbled on VariantAnnotation which uses my favorite object, GRanges, under the hood. So I should be good. However, I can?t seem to figure out how to create intersect and I am a bit nervous as I want to find the common variants, not just the location (I can see that I could extract the Granges and do overlap operation on them, but then, I run into the danger of losing the variant information?). Could anyone provide a simple workflow to get me started? I could provide a basic starting code but it would only be restricted to loading the VariantAnnotation package and reading two vcf files? so I figured that I would not add it? Thanks a bunch and sorry for the very vcf newby question. -- Marco Blanchette, Ph.D. Genomic Scientist Stowers Institute for Medical Research 1000 East 50th Street Kansas City MO 64110 www.stowers.org Tel: 816-926-4071 Cell: 816-726-8419 Fax: 816-926-2018 [[alternative HTML version deleted]]
modified 4.2 years ago by Julian Gehring1.3k • written 4.2 years ago by Marco Blanchette210
0
4.2 years ago by
Julian Gehring1.3k
Julian Gehring1.3k wrote:
Hi Marco, Essentially, you want to find a intersection with both the locus and the ref/alt alleles matching. The 'VariantAnnotation' has the 'match' function which does this (matching the alt allele). Here is some of my code for getting the common variants from two 'VRanges' objects: intersectVRanges <- function(x, y) { m = match(x, y) s1 = x[!is.na(m)] s2 = y[m[!is.na(m)]] res = list(s1, s2) return(res) } Best Julian On 25.08.2014 16:35, Blanchette, Marco wrote: > I am very very new to working with variants, maybe this question is very basic but I need to get kickstarted a bit? > > Just ran an analysis to find the common variation in a set of lab strains used in house in the haploid genomes of S. pombe. I used GATK best practices and I am at the stage where I have filtered variants and I would like to find the common ones. My first intuition for this was to turn to R (tired to running Java command line?) and I fumbled on VariantAnnotation which uses my favorite object, GRanges, under the hood. So I should be good. > > However, I can?t seem to figure out how to create intersect and I am a bit nervous as I want to find the common variants, not just the location (I can see that I could extract the Granges and do overlap operation on them, but then, I run into the danger of losing the variant information?). > > Could anyone provide a simple workflow to get me started? I could provide a basic starting code but it would only be restricted to loading the VariantAnnotation package and reading two vcf files? so I figured that I would not add it? > > Thanks a bunch and sorry for the very vcf newby question. > > > -- Marco Blanchette, Ph.D. > Genomic Scientist > Stowers Institute for Medical Research > 1000 East 50th Street > Kansas City MO 64110 > www.stowers.org > > Tel: 816-926-4071 > Cell: 816-726-8419 > Fax: 816-926-2018 > > [[alternative HTML version deleted]] > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >
0
4.2 years ago by
Julian Gehring1.3k
Julian Gehring1.3k wrote:
Hi Marco, Essentially, you want to find a intersection with both the locus and the ref/alt alleles matching. The 'VariantAnnotation' has the 'match' function which does this (matching the alt allele). Here is some of my code for getting the common variants from two 'VRanges' objects: intersectVRanges <- function(x, y) { m = match(x, y) s1 = x[!is.na(m)] s2 = y[m[!is.na(m)]] res = list(s1, s2) return(res) } Best Julian On 25.08.2014 16:35, Blanchette, Marco wrote: > I am very very new to working with variants, maybe this question is very basic but I need to get kickstarted a bit? > > Just ran an analysis to find the common variation in a set of lab strains used in house in the haploid genomes of S. pombe. I used GATK best practices and I am at the stage where I have filtered variants and I would like to find the common ones. My first intuition for this was to turn to R (tired to running Java command line?) and I fumbled on VariantAnnotation which uses my favorite object, GRanges, under the hood. So I should be good. > > However, I can?t seem to figure out how to create intersect and I am a bit nervous as I want to find the common variants, not just the location (I can see that I could extract the Granges and do overlap operation on them, but then, I run into the danger of losing the variant information?). > > Could anyone provide a simple workflow to get me started? I could provide a basic starting code but it would only be restricted to loading the VariantAnnotation package and reading two vcf files? so I figured that I would not add it? > > Thanks a bunch and sorry for the very vcf newby question. > > > -- Marco Blanchette, Ph.D. > Genomic Scientist > Stowers Institute for Medical Research > 1000 East 50th Street > Kansas City MO 64110 > www.stowers.org > > Tel: 816-926-4071 > Cell: 816-726-8419 > Fax: 816-926-2018 > > [[alternative HTML version deleted]] > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >
Just to clarify, you'll need to get a VRanges via vr <- as(vcf, "VRanges") Michael On Mon, Aug 25, 2014 at 5:13 PM, Julian Gehring <julian.gehring at="" embl.de=""> wrote: > Hi Marco, > > Essentially, you want to find a intersection with both the locus and the > ref/alt alleles matching. The 'VariantAnnotation' has the 'match' function > which does this (matching the alt allele). Here is some of my code for > getting the common variants from two 'VRanges' objects: > > intersectVRanges <- function(x, y) { > m = match(x, y) > s1 = x[!is.na(m)] > s2 = y[m[!is.na(m)]] > res = list(s1, s2) > return(res) > } > > Best > Julian > > > On 25.08.2014 16:35, Blanchette, Marco wrote: > >> I am very very new to working with variants, maybe this question is very >> basic but I need to get kickstarted a bit? >> >> Just ran an analysis to find the common variation in a set of lab strains >> used in house in the haploid genomes of S. pombe. I used GATK best >> practices and I am at the stage where I have filtered variants and I would >> like to find the common ones. My first intuition for this was to turn to R >> (tired to running Java command line?) and I fumbled on VariantAnnotation >> which uses my favorite object, GRanges, under the hood. So I should be good. >> >> However, I can?t seem to figure out how to create intersect and I am a >> bit nervous as I want to find the common variants, not just the location (I >> can see that I could extract the Granges and do overlap operation on them, >> but then, I run into the danger of losing the variant information?). >> >> Could anyone provide a simple workflow to get me started? I could provide >> a basic starting code but it would only be restricted to loading the >> VariantAnnotation package and reading two vcf files? so I figured that I >> would not add it? >> >> Thanks a bunch and sorry for the very vcf newby question. >> >> >> -- Marco Blanchette, Ph.D. >> Genomic Scientist >> Stowers Institute for Medical Research >> 1000 East 50th Street >> Kansas City MO 64110 >> www.stowers.org >> >> Tel: 816-926-4071 >> Cell: 816-726-8419 >> Fax: 816-926-2018 >> >> [[alternative HTML version deleted]] >> >> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane. >> science.biology.informatics.conductor >> >> > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane. > science.biology.informatics.conductor > [[alternative HTML version deleted]]
Thanks a lot, make sense. So, just to confirm, o.l <- findOverlaps(vcf1,vcf2) Where vcf1 and vcf2 are VCF object would not work, it would operate on the Granges and would consider the alt. -- Marco Blanchette, Ph.D. Genomic Scientist Stowers Institute for Medical Research 1000 East 50th Street Kansas City MO 64110 www.stowers.org Tel: 816-926-4071 Cell: 816-726-8419 Fax: 816-926-2018 On 8/25/14, 8:01 PM, "Michael Lawrence" <lawrence.michael at="" gene.com=""> wrote: >Just to clarify, you'll need to get a VRanges via > >vr <- as(vcf, "VRanges") > >Michael > > >On Mon, Aug 25, 2014 at 5:13 PM, Julian Gehring <julian.gehring at="" embl.de=""> >wrote: > >> Hi Marco, >> >> Essentially, you want to find a intersection with both the locus and the >> ref/alt alleles matching. The 'VariantAnnotation' has the 'match' >>function >> which does this (matching the alt allele). Here is some of my code for >> getting the common variants from two 'VRanges' objects: >> >> intersectVRanges <- function(x, y) { >> m = match(x, y) >> s1 = x[!is.na(m)] >> s2 = y[m[!is.na(m)]] >> res = list(s1, s2) >> return(res) >> } >> >> Best >> Julian >> >> >> On 25.08.2014 16:35, Blanchette, Marco wrote: >> >>> I am very very new to working with variants, maybe this question is >>>very >>> basic but I need to get kickstarted a bit? >>> >>> Just ran an analysis to find the common variation in a set of lab >>>strains >>> used in house in the haploid genomes of S. pombe. I used GATK best >>> practices and I am at the stage where I have filtered variants and I >>>would >>> like to find the common ones. My first intuition for this was to turn >>>to R >>> (tired to running Java command line?) and I fumbled on >>>VariantAnnotation >>> which uses my favorite object, GRanges, under the hood. So I should be >>>good. >>> >>> However, I can?t seem to figure out how to create intersect and I am a >>> bit nervous as I want to find the common variants, not just the >>>location (I >>> can see that I could extract the Granges and do overlap operation on >>>them, >>> but then, I run into the danger of losing the variant information?). >>> >>> Could anyone provide a simple workflow to get me started? I could >>>provide >>> a basic starting code but it would only be restricted to loading the >>> VariantAnnotation package and reading two vcf files? so I figured that >>>I >>> would not add it? >>> >>> Thanks a bunch and sorry for the very vcf newby question. >>> >>> >>> -- Marco Blanchette, Ph.D. >>> Genomic Scientist >>> Stowers Institute for Medical Research >>> 1000 East 50th Street >>> Kansas City MO 64110 >>> www.stowers.org >>> >>> Tel: 816-926-4071 >>> Cell: 816-726-8419 >>> Fax: 816-926-2018 >>> >>> [[alternative HTML version deleted]] >>> >>> >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane. >>> science.biology.informatics.conductor >>> >>> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane. >> science.biology.informatics.conductor >> > > [[alternative HTML version deleted]] > >_______________________________________________ >Bioconductor mailing list >Bioconductor at r-project.org >https://stat.ethz.ch/mailman/listinfo/bioconductor >Search the archives: >http://news.gmane.org/gmane.science.biology.informatics.conductor
All right! Thank you guys. Here is my little pipeline 1) slurping in the vcf files in a directory 2) Filtering out the one that didn?t passed the GATK filtering step (This could be probably be done during reading from file, couldn?t figure out how yet?) 3) finding the common one among all the files 4) writing back the file to disk Feel free to comment if you think there would be better ways to do that (It?s already better than pairwise comparison of file using java command line :-}) library(VariantAnnotation) vcfDir <- "./filtSNPs" vcfFiles <- list.files(vcfDir,"\\.vcf$",full=TRUE) vcfs <- lapply(vcfFiles,readVcf,"spombe") filtered.vcfs <- lapply(vcfs,function(vcf) vcf[filt(vcf) == 'PASS']) intersectVCF <- function(v1,v2){ m <- match(as(v1,'VRanges'),as(v2,'VRanges')) v2[m[!is.na(m)]] } commonVCFs <- Reduce(intersectVCF,filtered.vcfs) writeVcf(commonVCFs,file.path(vcfDir,"commonSNP.vcf"),index=TRUE) Thanks -- Marco Blanchette, Ph.D. Genomic Scientist Stowers Institute for Medical Research 1000 East 50th Street Kansas City MO 64110 www.stowers.org Tel: 816-926-4071 Cell: 816-726-8419 Fax: 816-926-2018 On 8/25/14, 8:01 PM, "Michael Lawrence" <lawrence.michael at="" gene.com=""> wrote: >Just to clarify, you'll need to get a VRanges via > >vr <- as(vcf, "VRanges") > >Michael > > >On Mon, Aug 25, 2014 at 5:13 PM, Julian Gehring <julian.gehring at="" embl.de=""> >wrote: > >> Hi Marco, >> >> Essentially, you want to find a intersection with both the locus and the >> ref/alt alleles matching. The 'VariantAnnotation' has the 'match' >>function >> which does this (matching the alt allele). Here is some of my code for >> getting the common variants from two 'VRanges' objects: >> >> intersectVRanges <- function(x, y) { >> m = match(x, y) >> s1 = x[!is.na(m)] >> s2 = y[m[!is.na(m)]] >> res = list(s1, s2) >> return(res) >> } >> >> Best >> Julian >> >> >> On 25.08.2014 16:35, Blanchette, Marco wrote: >> >>> I am very very new to working with variants, maybe this question is >>>very >>> basic but I need to get kickstarted a bit? >>> >>> Just ran an analysis to find the common variation in a set of lab >>>strains >>> used in house in the haploid genomes of S. pombe. I used GATK best >>> practices and I am at the stage where I have filtered variants and I >>>would >>> like to find the common ones. My first intuition for this was to turn >>>to R >>> (tired to running Java command line?) and I fumbled on >>>VariantAnnotation >>> which uses my favorite object, GRanges, under the hood. So I should be >>>good. >>> >>> However, I can?t seem to figure out how to create intersect and I am a >>> bit nervous as I want to find the common variants, not just the >>>location (I >>> can see that I could extract the Granges and do overlap operation on >>>them, >>> but then, I run into the danger of losing the variant information?). >>> >>> Could anyone provide a simple workflow to get me started? I could >>>provide >>> a basic starting code but it would only be restricted to loading the >>> VariantAnnotation package and reading two vcf files? so I figured that >>>I >>> would not add it? >>> >>> Thanks a bunch and sorry for the very vcf newby question. >>> >>> >>> -- Marco Blanchette, Ph.D. >>> Genomic Scientist >>> Stowers Institute for Medical Research >>> 1000 East 50th Street >>> Kansas City MO 64110 >>> www.stowers.org >>> >>> Tel: 816-926-4071 >>> Cell: 816-726-8419 >>> Fax: 816-926-2018 >>> >>> [[alternative HTML version deleted]] >>> >>> >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane. >>> science.biology.informatics.conductor >>> >>> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane. >> science.biology.informatics.conductor >> > > [[alternative HTML version deleted]] > >_______________________________________________ >Bioconductor mailing list >Bioconductor at r-project.org >https://stat.ethz.ch/mailman/listinfo/bioconductor >Search the archives: >http://news.gmane.org/gmane.science.biology.informatics.conductor ADD REPLYlink written 4.2 years ago by Marco Blanchette210 On Tue, Aug 26, 2014 at 7:19 AM, Blanchette, Marco <mab at="" stowers.org=""> wrote: > All right! Thank you guys. > > Here is my little pipeline > 1) slurping in the vcf files in a directory > 2) Filtering out the one that didn?t passed the GATK filtering step (This > could be probably be done during reading from file, couldn?t figure out > how yet?) > 3) finding the common one among all the files > 4) writing back the file to disk > > Feel free to comment if you think there would be better ways to do that > (It?s already better than pairwise comparison of file using java command > line :-}) > > > library(VariantAnnotation) > > vcfDir <- "./filtSNPs" > > vcfFiles <- list.files(vcfDir,"\\.vcf$",full=TRUE) > > Consider tools::list_files_with_exts() instead of the above. > vcfs <- lapply(vcfFiles,readVcf,"spombe") > > filtered.vcfs <- lapply(vcfs,function(vcf) vcf[filt(vcf) == 'PASS']) > > intersectVCF <- function(v1,v2){ > m <- match(as(v1,'VRanges'),as(v2,'VRanges')) > v2[m[!is.na(m)]] > } > > Since you're just looking for v2 that are in v1, use: intersectVCF <- function(v1,v2) { v2[v2 %in% v1] } commonVCFs <- Reduce(intersectVCF,filtered.vcfs) > > writeVcf(commonVCFs,file.path(vcfDir,"commonSNP.vcf"),index=TRUE) > > > Thanks > > > > -- Marco Blanchette, Ph.D. > Genomic Scientist > Stowers Institute for Medical Research > 1000 East 50th Street > Kansas City MO 64110 > www.stowers.org > > > Tel: 816-926-4071 > Cell: 816-726-8419 > Fax: 816-926-2018 > > > > > > On 8/25/14, 8:01 PM, "Michael Lawrence" <lawrence.michael at="" gene.com=""> wrote: > > >Just to clarify, you'll need to get a VRanges via > > > >vr <- as(vcf, "VRanges") > > > >Michael > > > > > >On Mon, Aug 25, 2014 at 5:13 PM, Julian Gehring <julian.gehring at="" embl.de=""> > >wrote: > > > >> Hi Marco, > >> > >> Essentially, you want to find a intersection with both the locus and the > >> ref/alt alleles matching. The 'VariantAnnotation' has the 'match' > >>function > >> which does this (matching the alt allele). Here is some of my code for > >> getting the common variants from two 'VRanges' objects: > >> > >> intersectVRanges <- function(x, y) { > >> m = match(x, y) > >> s1 = x[!is.na(m)] > >> s2 = y[m[!is.na(m)]] > >> res = list(s1, s2) > >> return(res) > >> } > >> > >> Best > >> Julian > >> > >> > >> On 25.08.2014 16:35, Blanchette, Marco wrote: > >> > >>> I am very very new to working with variants, maybe this question is > >>>very > >>> basic but I need to get kickstarted a bit? > >>> > >>> Just ran an analysis to find the common variation in a set of lab > >>>strains > >>> used in house in the haploid genomes of S. pombe. I used GATK best > >>> practices and I am at the stage where I have filtered variants and I > >>>would > >>> like to find the common ones. My first intuition for this was to turn > >>>to R > >>> (tired to running Java command line?) and I fumbled on > >>>VariantAnnotation > >>> which uses my favorite object, GRanges, under the hood. So I should be > >>>good. > >>> > >>> However, I can?t seem to figure out how to create intersect and I am a > >>> bit nervous as I want to find the common variants, not just the > >>>location (I > >>> can see that I could extract the Granges and do overlap operation on > >>>them, > >>> but then, I run into the danger of losing the variant information?). > >>> > >>> Could anyone provide a simple workflow to get me started? I could > >>>provide > >>> a basic starting code but it would only be restricted to loading the > >>> VariantAnnotation package and reading two vcf files? so I figured that > >>>I > >>> would not add it? > >>> > >>> Thanks a bunch and sorry for the very vcf newby question. > >>> > >>> > >>> -- Marco Blanchette, Ph.D. > >>> Genomic Scientist > >>> Stowers Institute for Medical Research > >>> 1000 East 50th Street > >>> Kansas City MO 64110 > >>> www.stowers.org > >>> > >>> Tel: 816-926-4071 > >>> Cell: 816-726-8419 > >>> Fax: 816-926-2018 > >>> > >>> [[alternative HTML version deleted]] > >>> > >>> > >>> > >>> _______________________________________________ > >>> Bioconductor mailing list > >>> Bioconductor at r-project.org > >>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>> Search the archives: http://news.gmane.org/gmane. > >>> science.biology.informatics.conductor > >>> > >>> > >> _______________________________________________ > >> Bioconductor mailing list > >> Bioconductor at r-project.org > >> https://stat.ethz.ch/mailman/listinfo/bioconductor > >> Search the archives: http://news.gmane.org/gmane. > >> science.biology.informatics.conductor > >> > > > > [[alternative HTML version deleted]] > > > >_______________________________________________ > >Bioconductor mailing list > >Bioconductor at r-project.org > >https://stat.ethz.ch/mailman/listinfo/bioconductor > >Search the archives: > >http://news.gmane.org/gmane.science.biology.informatics.conductor > > [[alternative HTML version deleted]]
Thanks Michael, Just as a clarification, one needs to first re-cast the VCF as VRanges, right? I just tried it, only work if: intersectVCF <- function(v1,v2){ ## In the real life, I would make that a one operation but for the sake of clarity m <- as(v1,'VRanges?) %in% (v2,'VRanges') v1[[m]] } Also, I could not find where the %in% operator is described in the VariantAnnotation documentation. I understand that VRanges are derived from GRanges and one would expect most of the method to work but I guess that if the %in% operator also does comparison on the variant alleles to also be identical (didn?t test it though), it might be important to point it out. Thanks a bunch, again, great package! Easy to use. -- Marco Blanchette, Ph.D. Genomic Scientist Stowers Institute for Medical Research 1000 East 50th Street Kansas City MO 64110 www.stowers.org Tel: 816-926-4071 Cell: 816-726-8419 Fax: 816-926-2018 From: Michael Lawrence <lawrence.michael@gene.com> Date: Tuesday, August 26, 2014 at 9:58 AM To: "... ..." <mab at="" stowers.org=""> Cc: Michael Lawrence <lawrence.michael at="" gene.com="">, Julian Gehring <julian.gehring at="" embl.de="">, BioConductor <bioconductor at="" stat.math.ethz.ch=""> Subject: Re: [BioC] Best practices to find intersection among variants On Tue, Aug 26, 2014 at 7:19 AM, Blanchette, Marco <mab at="" stowers.org=""> wrote: All right! Thank you guys. Here is my little pipeline 1) slurping in the vcf files in a directory 2) Filtering out the one that didn?t passed the GATK filtering step (This could be probably be done during reading from file, couldn?t figure out how yet?) 3) finding the common one among all the files 4) writing back the file to disk Feel free to comment if you think there would be better ways to do that (It?s already better than pairwise comparison of file using java command line :-}) library(VariantAnnotation) vcfDir <- "./filtSNPs" vcfFiles <- list.files(vcfDir,"\\.vcf$",full=TRUE) Consider tools::list_files_with_exts() instead of the above. vcfs <- lapply(vcfFiles,readVcf,"spombe") filtered.vcfs <- lapply(vcfs,function(vcf) vcf[filt(vcf) == 'PASS']) intersectVCF <- function(v1,v2){ m <- match(as(v1,'VRanges'),as(v2,'VRanges')) v2[m[!is.na <http: is.na="">(m)]] } Since you're just looking for v2 that are in v1, use: intersectVCF <- function(v1,v2) { v2[v2 %in% v1] } commonVCFs <- Reduce(intersectVCF,filtered.vcfs) writeVcf(commonVCFs,file.path(vcfDir,"commonSNP.vcf"),index=TRUE) Thanks -- Marco Blanchette, Ph.D. Genomic Scientist Stowers Institute for Medical Research 1000 East 50th Street Kansas City MO 64110 www.stowers.org <http: www.stowers.org=""> Tel: 816-926-4071 <tel:816-926-4071> Cell: 816-726-8419 <tel:816-726-8419> Fax: 816-926-2018 <tel:816-926-2018> On 8/25/14, 8:01 PM, "Michael Lawrence" <lawrence.michael at="" gene.com=""> wrote: >Just to clarify, you'll need to get a VRanges via > >vr <- as(vcf, "VRanges") > >Michael > > >On Mon, Aug 25, 2014 at 5:13 PM, Julian Gehring <julian.gehring at="" embl.de=""> >wrote: > >> Hi Marco, >> >> Essentially, you want to find a intersection with both the locus and the >> ref/alt alleles matching. The 'VariantAnnotation' has the 'match' >>function >> which does this (matching the alt allele). Here is some of my code for >> getting the common variants from two 'VRanges' objects: >> >> intersectVRanges <- function(x, y) { >> m = match(x, y) >> s1 = x[!is.na <http: is.na="">(m)] >> s2 = y[m[!is.na <http: is.na="">(m)]] >> res = list(s1, s2) >> return(res) >> } >> >> Best >> Julian >> >> >> On 25.08.2014 16 <tel:25.08.2014%2016>:35, Blanchette, Marco wrote: >> >>> I am very very new to working with variants, maybe this question is >>>very >>> basic but I need to get kickstarted a bit? >>> >>> Just ran an analysis to find the common variation in a set of lab >>>strains >>> used in house in the haploid genomes of S. pombe. I used GATK best >>> practices and I am at the stage where I have filtered variants and I >>>would >>> like to find the common ones. My first intuition for this was to turn >>>to R >>> (tired to running Java command line?) and I fumbled on >>>VariantAnnotation >>> which uses my favorite object, GRanges, under the hood. So I should be >>>good. >>> >>> However, I can?t seem to figure out how to create intersect and I am a >>> bit nervous as I want to find the common variants, not just the >>>location (I >>> can see that I could extract the Granges and do overlap operation on >>>them, >>> but then, I run into the danger of losing the variant information?). >>> >>> Could anyone provide a simple workflow to get me started? I could >>>provide >>> a basic starting code but it would only be restricted to loading the >>> VariantAnnotation package and reading two vcf files? so I figured that >>>I >>> would not add it? >>> >>> Thanks a bunch and sorry for the very vcf newby question. >>> >>> >>> -- Marco Blanchette, Ph.D. >>> Genomic Scientist >>> Stowers Institute for Medical Research >>> 1000 East 50th Street >>> Kansas City MO 64110 >>> www.stowers.org <http: www.stowers.org=""> >>> >>> Tel: 816-926-4071 <tel:816-926-4071> >>> Cell: 816-726-8419 <tel:816-726-8419> >>> Fax: 816-926-2018 <tel:816-926-2018> >>> >>> [[alternative HTML version deleted]] >>> >>> >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >>> Search the archives: http://news.gmane.org/gmane. >>> science.biology.informatics.conductor >>> >>> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane. >> science.biology.informatics.conductor >> > > [[alternative HTML version deleted]] > >_______________________________________________ >Bioconductor mailing list >Bioconductor at r-project.org >https://stat.ethz.ch/mailman/listinfo/bioconductor >Search the archives: >http://news.gmane.org/gmane.science.biology.informatics.conductor ADD REPLYlink written 4.2 years ago by Marco Blanchette210 Hi Marco, Getting a bit away from Bioconductor: You could also genotype all samples together with the GATK UGT which will give you the information about variant alleles over in each samples (as described in the documentation and best practices). This is helpful for eliminating false positive/negative calls. Best Julian On 26.08.2014 07:19, Blanchette, Marco wrote: > All right! Thank you guys. > > Here is my little pipeline > 1) slurping in the vcf files in a directory > 2) Filtering out the one that didn?t passed the GATK filtering step (This > could be probably be done during reading from file, couldn?t figure out > how yet?) > 3) finding the common one among all the files > 4) writing back the file to disk > > Feel free to comment if you think there would be better ways to do that > (It?s already better than pairwise comparison of file using java command > line :-}) > > > library(VariantAnnotation) > > vcfDir <- "./filtSNPs" > > vcfFiles <- list.files(vcfDir,"\\.vcf$",full=TRUE) > > vcfs <- lapply(vcfFiles,readVcf,"spombe") > > filtered.vcfs <- lapply(vcfs,function(vcf) vcf[filt(vcf) == 'PASS']) > > intersectVCF <- function(v1,v2){ > m <- match(as(v1,'VRanges'),as(v2,'VRanges')) > v2[m[!is.na(m)]] > } > > commonVCFs <- Reduce(intersectVCF,filtered.vcfs) > > writeVcf(commonVCFs,file.path(vcfDir,"commonSNP.vcf"),index=TRUE) > > > Thanks >