VCF class: different length when unlisting INFO CompressedCharacterList

0

Entering edit mode

Lescai, Francesco ▴ 380

@lescai-francesco-5078

Last seen 7.4 years ago

Denmark

Hi all and Hi Valerie (I suppose), I was extracting a field of the INFO column from a VCF, but when I unlist it I get a different length compared the number of variants, so I don't know anymore which refers to each variant. Here's what I'm doing > vcf class: VCF dim: 50273 30 genome: hg19 exptData(1): header fixed(4): REF ALT QUAL FILTER info(28): AC AF ... culprit set geno(5): AD DP GQ GT PL rownames(50273): [.. cut for clarity ..] genotypes<-as.data.frame(geno(vcf)$GT) dim(genotypes) [1] 50273 30 list.va<-info(vcf)$VA > length(info(vcf)$VA) [1] 50273 > list.va CompressedCharacterList of length 50273 info.va<-unlist(info(vcf)$VA) > lengthinfo.va) [1] 53391 This is an annotation from Variant Annotation Tool, which modifies the VCF Info. But if I do the same for other more "standard" fields, some of them have the same length of the variants, others don't when unlisted > length(unlist(info(vcf)$HaplotypeScore)) [1] 50273 > length(unlist(info(vcf)$AC)) [1] 50489 > length(unlist(info(vcf)$AF)) [1] 50489 am I doing something wrong? or is the unlist method on the CompressedCharacterList splitting on some field delimiter? below my sessionInfo. thanks for any help you might provide, cheers, Francesco > sessionInfo() R version 2.15.1 (2012-06-22) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) locale: [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] reshape_0.8.4 plyr_1.8 ggbio_1.6.6 ggplot2_0.9.3.1 VariantAnnotation_1.4.12 Rsamtools_1.10.2 [7] Biostrings_2.26.3 GenomicRanges_1.10.7 IRanges_1.16.6 BiocGenerics_0.4.0 loaded via a namespace (and not attached): [1] AnnotationDbi_1.20.7 Biobase_2.18.0 biomaRt_2.14.0 biovizBase_1.6.2 bitops_1.0-4.2 BSgenome_1.26.1 cluster_1.14.4 [8] colorspace_1.2-1 DBI_0.2-5 dichromat_2.0-0 digest_0.6.3 GenomicFeatures_1.10.2 grid_2.15.1 gridExtra_0.9.1 [15] gtable_0.1.2 Hmisc_3.10-1 labeling_0.1 lattice_0.20-15 MASS_7.3-23 munsell_0.4 parallel_2.15.1 [22] proto_0.3-10 RColorBrewer_1.0-5 RCurl_1.95-4.1 reshape2_1.2.2 RSQLite_0.11.2 rtracklayer_1.18.2 scales_0.2.3 [29] stats4_2.15.1 stringr_0.6.2 tools_2.15.1 XML_3.96-1.1 zlibbioc_1.4.0

Annotation Annotation • 1.8k views

ADD COMMENT • link updated 12.7 years ago by Valerie Obenchain ★ 6.8k • written 12.7 years ago by Lescai, Francesco ▴ 380

0

Entering edit mode

Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 4.0 years ago

United States

Hi Francesco, The expand,VCF-method was written for this purpose. Using expand() on a VCF will produce an object that is 'flattened' in the sense that the variant rows are repeated to match the unlisted ALT column. expand() will unlist ALT and any INFO or FORMAT variables that have one value per alternate allele which is indicated by 'Number=A' in the header. For example, ##INFO=<id=af,number=a,type=float,description="allele frequency"=""> If you are working with a DataFrame, you can use expand() to specify exactly which columns you want 'flattened'. > DF <- DataFrame(one=IntegerList(1:3, 4, 5), two=letters[1:3], three=CharacterList("A", c("B", "C"), "D")) > expand(DF, colnames="three", keepEmptyRows=FALSE) DataFrame with 4 rows and 3 columns one two three <integerlist> <character> <character> 1 1,2,3 a A 2 4 b B 3 4 b C 4 5 c D Details and examples are at, ?'VCF-class' ## VCF method ?'expand' ## DataFrame method I think this is what you were after ... let me know if this doesn't answer your question. Valerie On 05/14/13 01:09, Francesco Lescai wrote: > Hi all and Hi Valerie (I suppose), > I was extracting a field of the INFO column from a VCF, but when I unlist it I get a different length compared the number of variants, so I don't know anymore which refers to each variant. > > Here's what I'm doing > >> vcf > class: VCF > dim: 50273 30 > genome: hg19 > exptData(1): header > fixed(4): REF ALT QUAL FILTER > info(28): AC AF ... culprit set > geno(5): AD DP GQ GT PL > rownames(50273): > [.. cut for clarity ..] > > genotypes<-as.data.frame(geno(vcf)$GT) > dim(genotypes) > [1] 50273 30 > > list.va<-info(vcf)$VA >> length(info(vcf)$VA) > [1] 50273 > >> list.va > CompressedCharacterList of length 50273 > > info.va<-unlist(info(vcf)$VA) >> lengthinfo.va) > [1] 53391 > > This is an annotation from Variant Annotation Tool, which modifies the VCF Info. > But if I do the same for other more "standard" fields, some of them have the same length of the variants, others don't when unlisted > >> length(unlist(info(vcf)$HaplotypeScore)) > [1] 50273 >> length(unlist(info(vcf)$AC)) > [1] 50489 >> length(unlist(info(vcf)$AF)) > [1] 50489 > > am I doing something wrong? or is the unlist method on the CompressedCharacterList splitting on some field delimiter? > > below my sessionInfo. > thanks for any help you might provide, > cheers, > Francesco > > >> sessionInfo() > R version 2.15.1 (2012-06-22) > Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) > > locale: > [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] reshape_0.8.4 plyr_1.8 ggbio_1.6.6 ggplot2_0.9.3.1 VariantAnnotation_1.4.12 Rsamtools_1.10.2 > [7] Biostrings_2.26.3 GenomicRanges_1.10.7 IRanges_1.16.6 BiocGenerics_0.4.0 > > loaded via a namespace (and not attached): > [1] AnnotationDbi_1.20.7 Biobase_2.18.0 biomaRt_2.14.0 biovizBase_1.6.2 bitops_1.0-4.2 BSgenome_1.26.1 cluster_1.14.4 > [8] colorspace_1.2-1 DBI_0.2-5 dichromat_2.0-0 digest_0.6.3 GenomicFeatures_1.10.2 grid_2.15.1 gridExtra_0.9.1 > [15] gtable_0.1.2 Hmisc_3.10-1 labeling_0.1 lattice_0.20-15 MASS_7.3-23 munsell_0.4 parallel_2.15.1 > [22] proto_0.3-10 RColorBrewer_1.0-5 RCurl_1.95-4.1 reshape2_1.2.2 RSQLite_0.11.2 rtracklayer_1.18.2 scales_0.2.3 > [29] stats4_2.15.1 stringr_0.6.2 tools_2.15.1 XML_3.96-1.1 zlibbioc_1.4.0 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD COMMENT • link 12.7 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Hi Francesco -- to add to what Valerie said... On 05/14/2013 08:20 AM, Valerie Obenchain wrote: > Hi Francesco, > > The expand,VCF-method was written for this purpose. Using expand() on a VCF will I'm not sure, but I think expand,VCF-method was introduced more recently than the version of R / Bioconductor you have installed; maybe it's time to update? ... > produce an object that is 'flattened' in the sense that the variant rows are > repeated to match the unlisted ALT column. expand() will unlist ALT and any INFO > or FORMAT variables that have one value per alternate allele which is indicated > by 'Number=A' in the header. For example, > > ##INFO=<id=af,number=a,type=float,description="allele frequency"=""> > > > If you are working with a DataFrame, you can use expand() to specify exactly > which columns you want 'flattened'. > > > DF <- DataFrame(one=IntegerList(1:3, 4, 5), > two=letters[1:3], > three=CharacterList("A", c("B", "C"), "D")) > > expand(DF, colnames="three", keepEmptyRows=FALSE) > DataFrame with 4 rows and 3 columns > one two three > <integerlist> <character> <character> > 1 1,2,3 a A > 2 4 b B > 3 4 b C > 4 5 c D > > Details and examples are at, > ?'VCF-class' ## VCF method > ?'expand' ## DataFrame method > > I think this is what you were after ... let me know if this doesn't answer your > question. > > Valerie > > > > On 05/14/13 01:09, Francesco Lescai wrote: >> Hi all and Hi Valerie (I suppose), >> I was extracting a field of the INFO column from a VCF, but when I unlist it I >> get a different length compared the number of variants, so I don't know >> anymore which refers to each variant. >> >> Here's what I'm doing >> >>> vcf >> class: VCF >> dim: 50273 30 >> genome: hg19 >> exptData(1): header >> fixed(4): REF ALT QUAL FILTER >> info(28): AC AF ... culprit set >> geno(5): AD DP GQ GT PL >> rownames(50273): >> [.. cut for clarity ..] >> >> genotypes<-as.data.frame(geno(vcf)$GT) Hmm, not sure that this is a good idea -- geno(vcf)$GT is a matrix, operations on a matrix can be much more efficient than on a data.frame, and the matrix conveys valuable information about data types, in particular that all elements are the same class. >> dim(genotypes) >> [1] 50273 30 >> >> list.va<-info(vcf)$VA >>> length(info(vcf)$VA) >> [1] 50273 >> >>> list.va >> CompressedCharacterList of length 50273 >> >> info.va<-unlist(info(vcf)$VA) >>> lengthinfo.va) >> [1] 53391 it's worth understanding why this column is represented as a CharacterList (the listed form) rather than a simple character vector (the unlisted form). According to the INFO line from the VCF header Val mentions, some members of this field have more than one entry, specifically info(vcf)$VA[elementLengths(info(vcf)$VA != 1] Here's a simple example > x = CharacterList(list(letters[1:2], LETTERS[1:3], character(0))) > x CharacterList of length 2 [[1]] a b [[2]] A B C [[3]] character(0) > unlist(x) [1] "a" "b" "A" "B" "C" I guess you are perhaps hoping for > sapply(x, paste, collapse=",") [1] "a,b" "A,B,C" "" but in some ways this isn't so convenient as it first appears -- now you'll have to split() these back out to work on them, and you've translated 'no data' (character(0)) to 'no characters' (""). One pattern that simplifies working with these CharacterList and similar objects is that they can be unlist'ed and relist'ed cheaply, so long as the total number of elements and their order don't change. > y = toupper(unlist(x)) > relist(y, x) CharacterList of length 2 [[1]] A B [[2]] A B C [[3]] character(0) (actually, toupper(x) works without unlisting so this isn't a perfect example ...) Martin >> >> This is an annotation from Variant Annotation Tool, which modifies the VCF Info. >> But if I do the same for other more "standard" fields, some of them have the >> same length of the variants, others don't when unlisted >> >>> length(unlist(info(vcf)$HaplotypeScore)) >> [1] 50273 >>> length(unlist(info(vcf)$AC)) >> [1] 50489 >>> length(unlist(info(vcf)$AF)) >> [1] 50489 >> >> am I doing something wrong? or is the unlist method on the >> CompressedCharacterList splitting on some field delimiter? >> >> below my sessionInfo. >> thanks for any help you might provide, >> cheers, >> Francesco >> >> >>> sessionInfo() >> R version 2.15.1 (2012-06-22) >> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) >> >> locale: >> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] reshape_0.8.4 plyr_1.8 >> ggbio_1.6.6 ggplot2_0.9.3.1 VariantAnnotation_1.4.12 >> Rsamtools_1.10.2 >> [7] Biostrings_2.26.3 GenomicRanges_1.10.7 >> IRanges_1.16.6 BiocGenerics_0.4.0 >> >> loaded via a namespace (and not attached): >> [1] AnnotationDbi_1.20.7 Biobase_2.18.0 biomaRt_2.14.0 >> biovizBase_1.6.2 bitops_1.0-4.2 BSgenome_1.26.1 >> cluster_1.14.4 >> [8] colorspace_1.2-1 DBI_0.2-5 dichromat_2.0-0 >> digest_0.6.3 GenomicFeatures_1.10.2 grid_2.15.1 >> gridExtra_0.9.1 >> [15] gtable_0.1.2 Hmisc_3.10-1 labeling_0.1 >> lattice_0.20-15 MASS_7.3-23 munsell_0.4 >> parallel_2.15.1 >> [22] proto_0.3-10 RColorBrewer_1.0-5 RCurl_1.95-4.1 >> reshape2_1.2.2 RSQLite_0.11.2 rtracklayer_1.18.2 scales_0.2.3 >> [29] stats4_2.15.1 stringr_0.6.2 tools_2.15.1 >> XML_3.96-1.1 zlibbioc_1.4.0 >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD REPLY • link 12.7 years ago Martin Morgan 25k

0

Entering edit mode

thanks Martin, indeed the expand on VCF doesn't work in my version: time for update :-) The point you both highlighted also reminded me that such annotation field is a CharacterList for a reason, i.e. doesn't list just the most severe consequence but all the different consequences on overlapping transcripts. I will extract the information differently. btw, I always learn something new from your examples on how to use more efficiently R :-) thanks Francesco On 14 May 2013, at 19:17, Martin Morgan <mtmorgan@fhcrc.org<mailto:mtmorgan@fhcrc.org>> wrote: Hi Francesco -- to add to what Valerie said... On 05/14/2013 08:20 AM, Valerie Obenchain wrote: Hi Francesco, The expand,VCF-method was written for this purpose. Using expand() on a VCF will I'm not sure, but I think expand,VCF-method was introduced more recently than the version of R / Bioconductor you have installed; maybe it's time to update? ... produce an object that is 'flattened' in the sense that the variant rows are repeated to match the unlisted ALT column. expand() will unlist ALT and any INFO or FORMAT variables that have one value per alternate allele which is indicated by 'Number=A' in the header. For example, ##INFO=<id=af,number=a,type=float,description="allele frequency"=""> If you are working with a DataFrame, you can use expand() to specify exactly which columns you want 'flattened'. > DF <- DataFrame(one=IntegerList(1:3, 4, 5), two=letters[1:3], three=CharacterList("A", c("B", "C"), "D")) > expand(DF, colnames="three", keepEmptyRows=FALSE) DataFrame with 4 rows and 3 columns one two three <integerlist> <character> <character> 1 1,2,3 a A 2 4 b B 3 4 b C 4 5 c D Details and examples are at, ?'VCF-class' ## VCF method ?'expand' ## DataFrame method I think this is what you were after ... let me know if this doesn't answer your question. Valerie On 05/14/13 01:09, Francesco Lescai wrote: Hi all and Hi Valerie (I suppose), I was extracting a field of the INFO column from a VCF, but when I unlist it I get a different length compared the number of variants, so I don't know anymore which refers to each variant. Here's what I'm doing vcf class: VCF dim: 50273 30 genome: hg19 exptData(1): header fixed(4): REF ALT QUAL FILTER info(28): AC AF ... culprit set geno(5): AD DP GQ GT PL rownames(50273): [.. cut for clarity ..] genotypes<-as.data.frame(geno(vcf)$GT) Hmm, not sure that this is a good idea -- geno(vcf)$GT is a matrix, operations on a matrix can be much more efficient than on a data.frame, and the matrix conveys valuable information about data types, in particular that all elements are the same class. dim(genotypes) [1] 50273 30 list.va<-info(vcf)$VA length(info(vcf)$VA) [1] 50273 list.va CompressedCharacterList of length 50273 info.va<-unlist(info(vcf)$VA) lengthinfo.va) [1] 53391 it's worth understanding why this column is represented as a CharacterList (the listed form) rather than a simple character vector (the unlisted form). According to the INFO line from the VCF header Val mentions, some members of this field have more than one entry, specifically info(vcf)$VA[elementLengths(info(vcf)$VA != 1] Here's a simple example > x = CharacterList(list(letters[1:2], LETTERS[1:3], character(0))) > x CharacterList of length 2 [[1]] a b [[2]] A B C [[3]] character(0) > unlist(x) [1] "a" "b" "A" "B" "C" I guess you are perhaps hoping for > sapply(x, paste, collapse=",") [1] "a,b" "A,B,C" "" but in some ways this isn't so convenient as it first appears -- now you'll have to split() these back out to work on them, and you've translated 'no data' (character(0)) to 'no characters' (""). One pattern that simplifies working with these CharacterList and similar objects is that they can be unlist'ed and relist'ed cheaply, so long as the total number of elements and their order don't change. > y = toupper(unlist(x)) > relist(y, x) CharacterList of length 2 [[1]] A B [[2]] A B C [[3]] character(0) (actually, toupper(x) works without unlisting so this isn't a perfect example ...) Martin This is an annotation from Variant Annotation Tool, which modifies the VCF Info. But if I do the same for other more "standard" fields, some of them have the same length of the variants, others don't when unlisted length(unlist(info(vcf)$HaplotypeScore)) [1] 50273 length(unlist(info(vcf)$AC)) [1] 50489 length(unlist(info(vcf)$AF)) [1] 50489 am I doing something wrong? or is the unlist method on the CompressedCharacterList splitting on some field delimiter? below my sessionInfo. thanks for any help you might provide, cheers, Francesco sessionInfo() R version 2.15.1 (2012-06-22) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) locale: [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] reshape_0.8.4 plyr_1.8 ggbio_1.6.6 ggplot2_0.9.3.1 VariantAnnotation_1.4.12 Rsamtools_1.10.2 [7] Biostrings_2.26.3 GenomicRanges_1.10.7 IRanges_1.16.6 BiocGenerics_0.4.0 loaded via a namespace (and not attached): [1] AnnotationDbi_1.20.7 Biobase_2.18.0 biomaRt_2.14.0 biovizBase_1.6.2 bitops_1.0-4.2 BSgenome_1.26.1 cluster_1.14.4 [8] colorspace_1.2-1 DBI_0.2-5 dichromat_2.0-0 digest_0.6.3 GenomicFeatures_1.10.2 grid_2.15.1 gridExtra_0.9.1 [15] gtable_0.1.2 Hmisc_3.10-1 labeling_0.1 lattice_0.20-15 MASS_7.3-23 munsell_0.4 parallel_2.15.1 [22] proto_0.3-10 RColorBrewer_1.0-5 RCurl_1.95-4.1 reshape2_1.2.2 RSQLite_0.11.2 rtracklayer_1.18.2 scales_0.2.3 [29] stats4_2.15.1 stringr_0.6.2 tools_2.15.1 XML_3.96-1.1 zlibbioc_1.4.0 _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org<mailto:bioconductor@r-project.org> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org<mailto:bioconductor@r-project.org> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793 ________________________________ [X] Francesco Lescai, PhD, EDBT Associate Professor Department of Biomedicine, Human Genetics Building 1243 - Bioinformatics Wilhelm Meyers Alle 4 8000 Aarhus C, Denmark Tel: +45 871 68496 francesco.lescai@hum-gen.au.dk<mailto:francesco.lescai@hum-gen.au.dk> iSEQ Centre for iSequencing iPSYCH Initiative for Integrative Psychiatric Research www.ipsych.dk<http: www.ipsych.dk=""> [[alternative HTML version deleted]]

ADD REPLY • link 12.7 years ago Lescai, Francesco ▴ 380

Login before adding your answer.