readVcf - ALT allele as CompressedCharacterList instead of DNAStringSetList
1
0
Entering edit mode
@lescai-francesco-5078
Last seen 5.6 years ago
Denmark
Hi there, I was importing a new VCF file generated with the GATK v2 HaplotypeCaller, to compare several parameters with a previous one on the same samples generated with UnifiedGenotyper. "thanks" to an error in predictCoding, I discovered the new VCF file formatted the ALT allele field as a CompressedCharacterList instead of a DNAStringSetList. see here > head(fixed(haplo),n=2L) GRanges with 2 ranges and 5 metadata columns: seqnames ranges strand | paramRangeID REF <rle> <iranges> <rle> | <factor> <dnastringset> rs139570490 1 [865738, 865738] * | <na> A rs4372192 1 [876499, 876499] * | <na> A ALT QUAL FILTER <compressedcharacterlist> <numeric> <character> rs139570490 G 4280.02 PASS rs4372192 G 11733.02 PASS --- seqlengths: 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 9 X Y NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA > head(fixed(uni),n=2L) GRanges with 2 ranges and 5 metadata columns: seqnames ranges strand | paramRangeID REF <rle> <iranges> <rle> | <factor> <dnastringset> 1:865738 1 [865738, 865738] * | <na> A rs4372192 1 [876499, 876499] * | <na> A ALT QUAL FILTER <dnastringsetlist> <numeric> <character> 1:865738 ######## 5055.97 PASS rs4372192 ######## 13526.97 PASS --- seqlengths: 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 9 X Y NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA The only very important difference I could find between the two files, is that in the one generated with UnifiedGenotyper there is one single allele in the ALT field, while in the new VCF file, there are several variants with multiple alternative alleles [me@wilder test]$ grep -v "##" union.filtered.vcf | cut -f 1,2,3,4,5 | grep "," [me@wilder test]$ grep -v "##" haploC.filtered.both.vcf | cut -f 1,2,3,4,5 | grep "," 1 1247578 . T G,TGG 1 1920434 rs144487103 TAA TA,TAAA 1 2303347 rs60972860 CAA CA,CAAA 1 3753032 rs36051675 GTT GTTTTT,GTTTTTTTT 1 6291918 rs148639379 TA TAA,T 1 7913184 rs34962249 CAAAA CAAA,CAAAAA 1 10357206 rs58344165 ATTT ATT,ATTTT,ATTTTT 1 14057425 rs35643856 GA GAA,G 1 15654737 rs71000392 CT CTT,C [...cut...] but I thought a DNAStringSetList was meant to accommodate precisely the case where you have multiple alternative alleles, wasn't it? thanks for your help, Francesco > sessionInfo() R version 2.15.0 (2012-03-30) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915 [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] VariantAnnotation_1.3.32 Rsamtools_1.9.31 Biostrings_2.25.12 [4] GenomicRanges_1.9.66 IRanges_1.15.48 BiocGenerics_0.3.2 loaded via a namespace (and not attached): [1] AnnotationDbi_1.19.46 Biobase_2.17.8 biomaRt_2.13.2 [4] bitops_1.0-4.1 BSgenome_1.25.9 colorspace_1.1-1 [7] DBI_0.2-5 dichromat_1.2-4 digest_0.5.2 [10] GenomicFeatures_1.9.44 ggplot2_0.9.2.1 grid_2.15.0 [13] gtable_0.1.1 labeling_0.1 MASS_7.3-21 [16] memoise_0.1 munsell_0.4 parallel_2.15.0 [19] plyr_1.7.1 proto_0.3-9.2 RColorBrewer_1.0-5 [22] RCurl_1.91-1 reshape2_1.2.1 RSQLite_0.11.2 [25] rtracklayer_1.17.21 scales_0.2.2 stats4_2.15.0 [28] stringr_0.6.1 tools_2.15.0 XML_3.95-0.1 [31] zlibbioc_1.3.0 ---------------------------------------------------------------------- ----------- Francesco Lescai, PhD, EDBT Senior Research Associate in Genome Analysis University College London Faculty of Population Health Sciences Dept. Genes, Development & Disease ICH - Molecular Medicine Unit, GOSgene team 30 Guilford Street WC1N 1EH London UK email: f.lescai@ucl.ac.uk<mailto:f.lescai@ucl.ac.uk> phone: +44.(0)207.905.2274 [ext: 2274] ---------------------------------------------------------------------- ---------- [[alternative HTML version deleted]]
• 1.4k views
ADD COMMENT
0
Entering edit mode
@valerie-obenchain-4275
Last seen 2.3 years ago
United States
Hi Francesco, Yes, the DNAStringSetList is intended to handle multiple alternate alleles. ALT should only be made a CompressedList when structural variants are present. Try the following with your file, debug(VariantAnnotation:::.formatALT) fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation") readVcf(fl, "hg19") debug: { if (is.null(x)) return(NULL) structural <- grep("<", x, fixed = TRUE) if (!identical(integer(0), structural)) seqsplit(x, seq_len(length(x))) else .toDNAStringSetList(x) } Here 'x' is the alternate allele values, Browse[2]> x [1] "A" "A" "G,T" "." "G,GTCT" My guess is that if you inspect these you'll see at least one structural variant in there. Valerie On 10/03/2012 07:23 AM, Lescai, Francesco wrote: > Hi there, > I was importing a new VCF file generated with the GATK v2 HaplotypeCaller, to compare several parameters with a previous one on the same samples generated with UnifiedGenotyper. > > "thanks" to an error in predictCoding, I discovered the new VCF file formatted the ALT allele field as a CompressedCharacterList instead of a DNAStringSetList. > > see here > >> head(fixed(haplo),n=2L) > GRanges with 2 ranges and 5 metadata columns: > seqnames ranges strand | paramRangeID REF > <rle> <iranges> <rle> |<factor> <dnastringset> > rs139570490 1 [865738, 865738] * |<na> A > rs4372192 1 [876499, 876499] * |<na> A > ALT QUAL FILTER > <compressedcharacterlist> <numeric> <character> > rs139570490 G 4280.02 PASS > rs4372192 G 11733.02 PASS > --- > seqlengths: > 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 9 X Y > NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA > >> head(fixed(uni),n=2L) > GRanges with 2 ranges and 5 metadata columns: > seqnames ranges strand | paramRangeID REF > <rle> <iranges> <rle> |<factor> <dnastringset> > 1:865738 1 [865738, 865738] * |<na> A > rs4372192 1 [876499, 876499] * |<na> A > ALT QUAL FILTER > <dnastringsetlist> <numeric> <character> > 1:865738 ######## 5055.97 PASS > rs4372192 ######## 13526.97 PASS > --- > seqlengths: > 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 9 X Y > NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA > > The only very important difference I could find between the two files, is that in the one generated with UnifiedGenotyper there is one single allele in the ALT field, while in the new VCF file, there are several variants with multiple alternative alleles > > [me at wilder test]$ grep -v "##" union.filtered.vcf | cut -f 1,2,3,4,5 | grep "," > [me at wilder test]$ grep -v "##" haploC.filtered.both.vcf | cut -f 1,2,3,4,5 | grep "," > 1 1247578 . T G,TGG > 1 1920434 rs144487103 TAA TA,TAAA > 1 2303347 rs60972860 CAA CA,CAAA > 1 3753032 rs36051675 GTT GTTTTT,GTTTTTTTT > 1 6291918 rs148639379 TA TAA,T > 1 7913184 rs34962249 CAAAA CAAA,CAAAAA > 1 10357206 rs58344165 ATTT ATT,ATTTT,ATTTTT > 1 14057425 rs35643856 GA GAA,G > 1 15654737 rs71000392 CT CTT,C > [...cut...] > > but I thought a DNAStringSetList was meant to accommodate precisely the case where you have multiple alternative alleles, wasn't it? > > thanks for your help, > Francesco > > > >> sessionInfo() > R version 2.15.0 (2012-03-30) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C > [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915 > [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] VariantAnnotation_1.3.32 Rsamtools_1.9.31 Biostrings_2.25.12 > [4] GenomicRanges_1.9.66 IRanges_1.15.48 BiocGenerics_0.3.2 > > loaded via a namespace (and not attached): > [1] AnnotationDbi_1.19.46 Biobase_2.17.8 biomaRt_2.13.2 > [4] bitops_1.0-4.1 BSgenome_1.25.9 colorspace_1.1-1 > [7] DBI_0.2-5 dichromat_1.2-4 digest_0.5.2 > [10] GenomicFeatures_1.9.44 ggplot2_0.9.2.1 grid_2.15.0 > [13] gtable_0.1.1 labeling_0.1 MASS_7.3-21 > [16] memoise_0.1 munsell_0.4 parallel_2.15.0 > [19] plyr_1.7.1 proto_0.3-9.2 RColorBrewer_1.0-5 > [22] RCurl_1.91-1 reshape2_1.2.1 RSQLite_0.11.2 > [25] rtracklayer_1.17.21 scales_0.2.2 stats4_2.15.0 > [28] stringr_0.6.1 tools_2.15.0 XML_3.95-0.1 > [31] zlibbioc_1.3.0 > > > -------------------------------------------------------------------- ------------- > Francesco Lescai, PhD, EDBT > Senior Research Associate in Genome Analysis > University College London > Faculty of Population Health Sciences > Dept. Genes, Development& Disease > ICH - Molecular Medicine Unit, GOSgene team > 30 Guilford Street > WC1N 1EH London UK > > email: f.lescai at ucl.ac.uk<mailto:f.lescai at="" ucl.ac.uk=""> > phone: +44.(0)207.905.2274 > [ext: 2274] > -------------------------------------------------------------------- ------------ > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD COMMENT
0
Entering edit mode
Oh yes, [66991] "<unassembled_event>" [71698] "<unassembled_event>" [98080] "<unassembled_event>" what can I do in these cases to translate the variants? maybe excluding SVs and reverting the ALT to a DNAStringSet? in the next future will will have more and more SVs encoded into VCFs because that's one of the most important features of the new HaplotypeCaller within GATK. thanks very much, Francesco On 3 Oct 2012, at 17:54, Valerie Obenchain <vobencha@fhcrc.org<mailto:vobencha@fhcrc.org>> wrote: Hi Francesco, Yes, the DNAStringSetList is intended to handle multiple alternate alleles. ALT should only be made a CompressedList when structural variants are present. Try the following with your file, debug(VariantAnnotation:::.formatALT) fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation") readVcf(fl, "hg19") debug: { if (is.null(x)) return(NULL) structural <- grep("<", x, fixed = TRUE) if (!identical(integer(0), structural)) seqsplit(x, seq_len(length(x))) else .toDNAStringSetList(x) } Here 'x' is the alternate allele values, Browse[2]> x [1] "A" "A" "G,T" "." "G,GTCT" My guess is that if you inspect these you'll see at least one structural variant in there. Valerie On 10/03/2012 07:23 AM, Lescai, Francesco wrote: Hi there, I was importing a new VCF file generated with the GATK v2 HaplotypeCaller, to compare several parameters with a previous one on the same samples generated with UnifiedGenotyper. "thanks" to an error in predictCoding, I discovered the new VCF file formatted the ALT allele field as a CompressedCharacterList instead of a DNAStringSetList. see here head(fixed(haplo),n=2L) GRanges with 2 ranges and 5 metadata columns: seqnames ranges strand | paramRangeID REF <rle> <iranges> <rle> |<factor> <dnastringset> rs139570490 1 [865738, 865738] * |<na> A rs4372192 1 [876499, 876499] * |<na> A ALT QUAL FILTER <compressedcharacterlist> <numeric> <character> rs139570490 G 4280.02 PASS rs4372192 G 11733.02 PASS --- seqlengths: 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 9 X Y NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA head(fixed(uni),n=2L) GRanges with 2 ranges and 5 metadata columns: seqnames ranges strand | paramRangeID REF <rle> <iranges> <rle> |<factor> <dnastringset> 1:865738 1 [865738, 865738] * |<na> A rs4372192 1 [876499, 876499] * |<na> A ALT QUAL FILTER <dnastringsetlist> <numeric> <character> 1:865738 ######## 5055.97 PASS rs4372192 ######## 13526.97 PASS --- seqlengths: 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 9 X Y NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA The only very important difference I could find between the two files, is that in the one generated with UnifiedGenotyper there is one single allele in the ALT field, while in the new VCF file, there are several variants with multiple alternative alleles [me@wilder test]$ grep -v "##" union.filtered.vcf | cut -f 1,2,3,4,5 | grep "," [me@wilder test]$ grep -v "##" haploC.filtered.both.vcf | cut -f 1,2,3,4,5 | grep "," 1 1247578 . T G,TGG 1 1920434 rs144487103 TAA TA,TAAA 1 2303347 rs60972860 CAA CA,CAAA 1 3753032 rs36051675 GTT GTTTTT,GTTTTTTTT 1 6291918 rs148639379 TA TAA,T 1 7913184 rs34962249 CAAAA CAAA,CAAAAA 1 10357206 rs58344165 ATTT ATT,ATTTT,ATTTTT 1 14057425 rs35643856 GA GAA,G 1 15654737 rs71000392 CT CTT,C [...cut...] but I thought a DNAStringSetList was meant to accommodate precisely the case where you have multiple alternative alleles, wasn't it? thanks for your help, Francesco sessionInfo() R version 2.15.0 (2012-03-30) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915 [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] VariantAnnotation_1.3.32 Rsamtools_1.9.31 Biostrings_2.25.12 [4] GenomicRanges_1.9.66 IRanges_1.15.48 BiocGenerics_0.3.2 loaded via a namespace (and not attached): [1] AnnotationDbi_1.19.46 Biobase_2.17.8 biomaRt_2.13.2 [4] bitops_1.0-4.1 BSgenome_1.25.9 colorspace_1.1-1 [7] DBI_0.2-5 dichromat_1.2-4 digest_0.5.2 [10] GenomicFeatures_1.9.44 ggplot2_0.9.2.1 grid_2.15.0 [13] gtable_0.1.1 labeling_0.1 MASS_7.3-21 [16] memoise_0.1 munsell_0.4 parallel_2.15.0 [19] plyr_1.7.1 proto_0.3-9.2 RColorBrewer_1.0-5 [22] RCurl_1.91-1 reshape2_1.2.1 RSQLite_0.11.2 [25] rtracklayer_1.17.21 scales_0.2.2 stats4_2.15.0 [28] stringr_0.6.1 tools_2.15.0 XML_3.95-0.1 [31] zlibbioc_1.3.0 ---------------------------------------------------------------------- ----------- Francesco Lescai, PhD, EDBT Senior Research Associate in Genome Analysis University College London Faculty of Population Health Sciences Dept. Genes, Development& Disease ICH - Molecular Medicine Unit, GOSgene team 30 Guilford Street WC1N 1EH London UK email: f.lescai@ucl.ac.uk<mailto:f.lescai@ucl.ac.uk><mailto:f.lescai@u cl.ac.uk=""> phone: +44.(0)207.905.2274 [ext: 2274] ---------------------------------------------------------------------- ---------- [[alternative HTML version deleted]] _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org<mailto:bioconductor@r-project.org> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor ---------------------------------------------------------------------- ----------- Francesco Lescai, PhD, EDBT Senior Research Associate in Genome Analysis University College London Faculty of Population Health Sciences Dept. Genes, Development & Disease ICH - Molecular Medicine Unit, GOSgene team 30 Guilford Street WC1N 1EH London UK email: f.lescai@ucl.ac.uk<mailto:f.lescai@ucl.ac.uk> phone: +44.(0)207.905.2274 [ext: 2274] ---------------------------------------------------------------------- ---------- [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
We don't currently have a function that does this. Maybe this should be a feature request since files with a mixture of structural and non-structural variants are becoming more common. In the meantime you can use some internal functions in VariantAnnotation. This toy example isn't great because it only has one non-structural variant but the same idea would apply to a file with many. fl <- system.file("extdata", "structural.vcf", package="VariantAnnotation") vcf <- readVcf(fl, "hg19") > dim(vcf) [1] 7 1 > alt(vcf) CompressedCharacterList of length 7 [[1]] C [[2]] [[3]] [[4]] <del:me:alu> [[5]] <ins:me:l1> [[6]] <dup> [[7]] <dup:tandem> ## Identify the structural variants and remove them: structural <- logical(length(alt(vcf))) structural[grep("<", unlist(alt(vcf)), fixed=TRUE)] <- TRUE vcf2 <- vcf[!structural, ] > alt(vcf2) CompressedCharacterList of length 1 [[1]] C ## Convert the alleles to a DNAStringSetList using the internal function .toDNAStringSetList(). alt(vcf2) <- VariantAnnotation:::.toDNAStringSetList(unlist(alt(vcf2), use.names=FALSE)) > alt(vcf2) DNAStringSetList of length 1 [[1]] C ## Rename seqlevels and use the new vcf in predictCoding(). library(BSgenome.Hsapiens.UCSC.hg19) library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene newnames <- paste("chr", seqlevels(vcf2), sep="") names(newnames) <- seqlevels(vcf2) vcf2 <- renameSeqlevels(vcf2, newnames) > intersect(seqlevels(vcf2), seqlevels(txdb)) [1] "chr1" "chr2" "chr3" "chr4" ## Again, not an exciting example because this variant does not fall in a coding region so our result is empty. predictCoding(vcf2, txdb, Hsapiens) GRanges with 0 ranges and 1 metadata column: seqnames ranges strand | GENEID <rle> <iranges> <rle> | <character> --- seqlengths: If you think this would be a worthy feature/function request let me know. Valerie On 10/04/2012 03:36 AM, Lescai, Francesco wrote: > Oh yes, > [66991] "<unassembled_event>" > [71698] "<unassembled_event>" > [98080] "<unassembled_event>" > > what can I do in these cases to translate the variants? maybe excluding SVs and reverting the ALT to a DNAStringSet? > in the next future will will have more and more SVs encoded into VCFs because that's one of the most important features of the new HaplotypeCaller within GATK. > > thanks very much, > Francesco > > > On 3 Oct 2012, at 17:54, Valerie Obenchain<vobencha at="" fhcrc.org<mailto:vobencha="" at="" fhcrc.org="">> wrote: > > Hi Francesco, > > Yes, the DNAStringSetList is intended to handle multiple alternate alleles. ALT should only be made a CompressedList when structural variants are present. > > Try the following with your file, > > debug(VariantAnnotation:::.formatALT) > fl<- system.file("extdata", "ex2.vcf", package="VariantAnnotation") > readVcf(fl, "hg19") > debug: { > if (is.null(x)) > return(NULL) > structural<- grep("<", x, fixed = TRUE) > if (!identical(integer(0), structural)) > seqsplit(x, seq_len(length(x))) > else .toDNAStringSetList(x) > } > > Here 'x' is the alternate allele values, > > Browse[2]> x > [1] "A" "A" "G,T" "." "G,GTCT" > > My guess is that if you inspect these you'll see at least one structural variant in there. > > Valerie > > On 10/03/2012 07:23 AM, Lescai, Francesco wrote: > Hi there, > I was importing a new VCF file generated with the GATK v2 HaplotypeCaller, to compare several parameters with a previous one on the same samples generated with UnifiedGenotyper. > > "thanks" to an error in predictCoding, I discovered the new VCF file formatted the ALT allele field as a CompressedCharacterList instead of a DNAStringSetList. > > see here > > head(fixed(haplo),n=2L) > GRanges with 2 ranges and 5 metadata columns: > seqnames ranges strand | paramRangeID REF > <rle> <iranges> <rle> |<factor> <dnastringset> > rs139570490 1 [865738, 865738] * |<na> A > rs4372192 1 [876499, 876499] * |<na> A > ALT QUAL FILTER > <compressedcharacterlist> <numeric> <character> > rs139570490 G 4280.02 PASS > rs4372192 G 11733.02 PASS > --- > seqlengths: > 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 9 X Y > NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA > > head(fixed(uni),n=2L) > GRanges with 2 ranges and 5 metadata columns: > seqnames ranges strand | paramRangeID REF > <rle> <iranges> <rle> |<factor> <dnastringset> > 1:865738 1 [865738, 865738] * |<na> A > rs4372192 1 [876499, 876499] * |<na> A > ALT QUAL FILTER > <dnastringsetlist> <numeric> <character> > 1:865738 ######## 5055.97 PASS > rs4372192 ######## 13526.97 PASS > --- > seqlengths: > 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 9 X Y > NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA > > The only very important difference I could find between the two files, is that in the one generated with UnifiedGenotyper there is one single allele in the ALT field, while in the new VCF file, there are several variants with multiple alternative alleles > > [me at wilder test]$ grep -v "##" union.filtered.vcf | cut -f 1,2,3,4,5 | grep "," > [me at wilder test]$ grep -v "##" haploC.filtered.both.vcf | cut -f 1,2,3,4,5 | grep "," > 1 1247578 . T G,TGG > 1 1920434 rs144487103 TAA TA,TAAA > 1 2303347 rs60972860 CAA CA,CAAA > 1 3753032 rs36051675 GTT GTTTTT,GTTTTTTTT > 1 6291918 rs148639379 TA TAA,T > 1 7913184 rs34962249 CAAAA CAAA,CAAAAA > 1 10357206 rs58344165 ATTT ATT,ATTTT,ATTTTT > 1 14057425 rs35643856 GA GAA,G > 1 15654737 rs71000392 CT CTT,C > [...cut...] > > but I thought a DNAStringSetList was meant to accommodate precisely the case where you have multiple alternative alleles, wasn't it? > > thanks for your help, > Francesco > > > > sessionInfo() > R version 2.15.0 (2012-03-30) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C > [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915 > [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] VariantAnnotation_1.3.32 Rsamtools_1.9.31 Biostrings_2.25.12 > [4] GenomicRanges_1.9.66 IRanges_1.15.48 BiocGenerics_0.3.2 > > loaded via a namespace (and not attached): > [1] AnnotationDbi_1.19.46 Biobase_2.17.8 biomaRt_2.13.2 > [4] bitops_1.0-4.1 BSgenome_1.25.9 colorspace_1.1-1 > [7] DBI_0.2-5 dichromat_1.2-4 digest_0.5.2 > [10] GenomicFeatures_1.9.44 ggplot2_0.9.2.1 grid_2.15.0 > [13] gtable_0.1.1 labeling_0.1 MASS_7.3-21 > [16] memoise_0.1 munsell_0.4 parallel_2.15.0 > [19] plyr_1.7.1 proto_0.3-9.2 RColorBrewer_1.0-5 > [22] RCurl_1.91-1 reshape2_1.2.1 RSQLite_0.11.2 > [25] rtracklayer_1.17.21 scales_0.2.2 stats4_2.15.0 > [28] stringr_0.6.1 tools_2.15.0 XML_3.95-0.1 > [31] zlibbioc_1.3.0 > > > -------------------------------------------------------------------- ------------- > Francesco Lescai, PhD, EDBT > Senior Research Associate in Genome Analysis > University College London > Faculty of Population Health Sciences > Dept. Genes, Development& Disease > ICH - Molecular Medicine Unit, GOSgene team > 30 Guilford Street > WC1N 1EH London UK > > email: f.lescai at ucl.ac.uk<mailto:f.lescai at="" ucl.ac.uk=""><mailto:f.lescai at="" ucl.ac.uk=""> > phone: +44.(0)207.905.2274 > [ext: 2274] > -------------------------------------------------------------------- ------------ > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > -------------------------------------------------------------------- ------------- > Francesco Lescai, PhD, EDBT > Senior Research Associate in Genome Analysis > University College London > Faculty of Population Health Sciences > Dept. Genes, Development& Disease > ICH - Molecular Medicine Unit, GOSgene team > 30 Guilford Street > WC1N 1EH London UK > > email: f.lescai at ucl.ac.uk<mailto:f.lescai at="" ucl.ac.uk=""> > phone: +44.(0)207.905.2274 > [ext: 2274] > -------------------------------------------------------------------- ------------ > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLY

Login before adding your answer.

Traffic: 661 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6