VCF class: different length when unlisting INFO CompressedCharacterList

0

Entering edit mode

Sigve Nakken ▴ 50

@sigve-nakken-6575

Last seen 18 months ago

Norway

Hi, Ive had similar challenges as Francesco, and have unsuccessfully tried to use the data structures and functions provided by VariantAnnotation. My experience is that I need the expand' functionality more often with respect to the samples rather than the annotation tags. And as my experience with R is somewhat limited, I tend to like to work with simple data.frames when I want to summarise and characterise the data. Here is how I generated a simple data frame with all sample variants (variants and samples are dummy encoded): ## read VCF (100 samples) > all_vcf <- readVcf(cancer.exome.project.vcf.gz','hg19') > seqinfo(all_vcf) <- seqinfo(BSgenome.Hsapiens.UCSC.hg19) > rowData(all_vcf) GRanges with 50383 ranges and 5 metadata columns: ## get variants with PASS filter > all_vcf_PASS <- all_vcf[str_detect(elementMetadata(all_vcf)$FILTER,"PASS"),] > rowData(all_vcf_PASS) GRanges with 4252 ranges and 5 metadata columns: ... ## get genotype information for all samples that have called genotypes (GT != .) ## From my VCF: ## FORMAT=<id=adt,number=.,type=integer,description="allelic depths="" for="" the="" ref="" and="" alt="" alleles="" in="" the="" order="" listed="" (tumor)"=""> ## FORMAT=<id=gt,number=1,type=string,description="genotype"> ## FORMAT=<id=adc,number=.,type=integer,description="allelic depths="" for="" the="" ref="" and="" alt="" alleles="" in="" the="" order="" listed="" (control)"=""> ## > tumor_ref_support <- t(as.data.frame(geno(all_vcf_PASS)$ADT[which(geno(all_vcf_PASS)$GT != '.', arr.ind=T)],row.names=NULL)[1,]) > normal_ref_support <- t(as.data.frame(geno(all_vcf_PASS)$ADC[which(geno(all_vcf_PASS)$GT != '.', arr.ind=T)],row.names=NULL)[1,]) > tumor_alt_support <- t(as.data.frame(geno(all_vcf_PASS)$ADT[which(geno(all_vcf_PASS)$GT != '.', arr.ind=T)],row.names=NULL)[2,]) > normal_alt_support <- t(as.data.frame(geno(all_vcf_PASS)$ADC[which(geno(all_vcf_PASS)$GT != '.', arr.ind=T)],row.names=NULL)[2,]) ## get sample ids and variant ids for the expanded set of variants' > b <- as.data.frame(which(geno(all_vcf_PASS)$GT != '.', arr.ind=T))$col > c <- as.data.frame(which(geno(all_vcf_PASS)$GT != '.', arr.ind=T))$row > sample_ids <- as.data.frame(rownames(colData(all_vcf_PASS))[b]) > variant_ids <- as.data.frame(rownames(geno(all_vcf_PASS)$GT)[c]) ## make simple data.frame of all samples with genotype information > row.names(tumor_ref_support) <- NULL > row.names(normal_ref_support) <- NULL > row.names(tumor_alt_support) <- NULL > row.names(external_passed) <- NULL > row.names(normal_alt_support) <- NULL > tmp <- as.data.frame(cbind(variant_ids,sample_ids,tumor_ref_support, tumor_alt_support,normal_ref_support,normal_alt_support)) > colnames(tmp) <- c('variant_id','sample_id','tumor_ref_support','tum or_alt_support','normal_ref_support','normal_alt_support') > tmp$allele_frac_tumor <- rep(0,nrow(tmp)) > tmp$tumor_depth <- tmp$tumor_ref_support + tmp$tumor_alt_support > tmp$normal_depth <- tmp$normal_ref_support + tmp$normal_alt_support > tmp[tmp$tumor_depth > 0,]$allele_frac_tumor <- round(tmp[tmp$tumor_depth > 0,]$tumor_alt_support / tmp[tmp$tumor_depth > 0,]$tumor_depth,2) ## > head(tmp) variant_id sample_id tumor_ref_support tumor_alt_support normal_ref_support normal_alt_support allele_frac_tumor tumor_depth normal_depth 1 chr5_10000000_T_C 001B_001T 17 7 21 0 0.29 24 21 2 chr11_10000000_C_T 001B_001T 31 4 33 0 0.11 35 33 3 chr18_10000000_C_T 001B_001T 21 7 37 1 0.25 28 38 4 chrY_1000000_A_G 001B_001T 10 2 11 0 0.17 12 11 5 chr1_1000000_G_C 011B_011T 17 3 48 0 0.15 20 48 6 chr1_1000000_A_G 011B_011T 77 13 114 0 0.14 90 114 > str(tmp) 'data.frame': 4418 obs. of 9 variables: $ variant_id : Factor w/ 4252 levels "chr1_1000000_G_T",..: 3254 620 1644 4251 359 381 391 401 246 1875 ... $ sample_id : Factor w/ 99 levels "001B_001T","011B_011T",..: 1 1 1 1 2 2 2 2 2 2 ... $ tumor_ref_support : int 17 31 21 10 17 77 12 40 75 53 ... $ tumor_alt_support : int 7 4 7 2 3 13 3 6 8 9 ... $ normal_ref_support: int 21 33 37 11 48 114 19 52 88 89 ... $ normal_alt_support: int 0 0 1 0 0 0 0 0 0 0 ... $ allele_frac_tumor : num 0.29 0.11 0.25 0.17 0.15 0.14 0.2 0.13 0.1 0.15 ... $ tumor_depth : int 24 35 28 12 20 90 15 46 83 62 ... $ normal_depth : int 21 33 38 11 48 114 19 52 88 89 ... Next, I plan to do a merge with my functional annotations (info) using the variant_id as the key, which I think would be straightforward. If there is a more convenient way to get here using the VariantAnnotation package, I would be grateful to hear about this. --- Sigve Nakken, PhD Postdoctoral Fellow, Dept. of Tumor Biology Institute for Cancer Research Oslo University Hospital, Norway phone: +4795753022 email: sigven@ifi.uio.no [[alternative HTML version deleted]]

Cancer Cancer • 1.0k views

ADD COMMENT • link updated 9.9 years ago by Michael Lawrence ★ 11k • written 9.9 years ago by Sigve Nakken ▴ 50

0

Entering edit mode

Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 2.3 years ago

United States

Hi, VariantAnnotation doesn't have a function that expands the data 'by sample' as you've described but there are several helpers that can make the process more straight forward. (1) filterVcf() This function filters records from a vcf and outputs another vcf on disk. You can use this to create a subset file of records with FILTER=PASS. The function can't be used to remove records with GT='.' because of the structure of the vcf file. The filtering removes variants (rows) that don't meet the criteria. In the case of the GT filtering you are removing select variants per sample so this is better done at a later stage. Filtering by the presence/absence of a value can be done with a pre- filter: ## sample data fl <- system.file("extdata", "chr7-sub.vcf.gz", package="VariantAnnotation") ## pre-filter for 'PASS': pass_filter = function(x) grepl("PASS", x, fixed=TRUE) PF <- FilterRules(list(pass_filter)) ## here the filtered vcf is sent to a temp file: filtered_vcf <- filterVcf(fl, "hg19", tempfile(), prefilters=PF) (2) ScanVcfParam You may have already discovered that a ScanVcfParam object specifies subsets of data to be read in. Using a 'param' can speed up readVcf() and makes for a lighter VCF object in R. ## read in specific genotype (FORMAT) fields and no INFO data: param <- ScanVcfParam(info=NA, geno=c("GT", "CGA_OAD", "CGA_ORDP")) vcf <- readVcf(filtered_vcf, "hg19", param=param) >> geno(vcf) > List of length 3 > names(3): GT CGA_OAD CGA_ORDP The file has 2 samples: >> geno(vcf)$GT[1:4,] > HCC1187-H-200-37-ASM-N1 HCC1187-H-200-37-ASM-T1 > 7:55003988_A/G "1/1" "1/1" > 7:55010562_GA/G "1/1" "1/1" > 7:55012097_T/C "1/0" "1/0" > 7:55015505_C/T "1/0" "1/0" More below. On 05/26/2014 06:15 AM, Sigve Nakken wrote: > Hi, > > I?ve had similar challenges as Francesco, and have unsuccessfully tried to use the data structures and functions provided by VariantAnnotation. My experience is that I need the ?expand' functionality more often with respect to the samples rather than the annotation tags. And as my experience with R is somewhat limited, I tend to like to work with simple data.frames when I want to summarise and characterise the data. > > Here is how I generated a simple data frame with all sample variants (variants and samples are dummy encoded): > > ## read VCF (100 samples) >> all_vcf <- readVcf(?cancer.exome.project.vcf.gz','hg19') >> seqinfo(all_vcf) <- seqinfo(BSgenome.Hsapiens.UCSC.hg19) >> rowData(all_vcf) > GRanges with 50383 ranges and 5 metadata columns: > ? > > ## get variants with ?PASS? filter >> all_vcf_PASS <- all_vcf[str_detect(elementMetadata(all_vcf)$FILTER,"PASS"),] >> rowData(all_vcf_PASS) > GRanges with 4252 ranges and 5 metadata columns: > ... > > > ## get genotype information for all samples that have called genotypes (GT != ?.?) > ## From my VCF: > ## FORMAT=<id=adt,number=.,type=integer,description="allelic depths="" for="" the="" ref="" and="" alt="" alleles="" in="" the="" order="" listed="" (tumor)"=""> > ## FORMAT=<id=gt,number=1,type=string,description="genotype"> > ## FORMAT=<id=adc,number=.,type=integer,description="allelic depths="" for="" the="" ref="" and="" alt="" alleles="" in="" the="" order="" listed="" (control)"=""> > ## >> tumor_ref_support <- t(as.data.frame(geno(all_vcf_PASS)$ADT[which(geno(all_vcf_PASS)$GT != '.', arr.ind=T)],row.names=NULL)[1,]) >> normal_ref_support <- t(as.data.frame(geno(all_vcf_PASS)$ADC[which(geno(all_vcf_PASS)$GT != '.', arr.ind=T)],row.names=NULL)[1,]) >> tumor_alt_support <- t(as.data.frame(geno(all_vcf_PASS)$ADT[which(geno(all_vcf_PASS)$GT != '.', arr.ind=T)],row.names=NULL)[2,]) >> normal_alt_support <- t(as.data.frame(geno(all_vcf_PASS)$ADC[which(geno(all_vcf_PASS)$GT != '.', arr.ind=T)],row.names=NULL)[2,]) I don't think it's necessary (or efficient) to subset the data on GT = '.'. The ADC and ADT values should be NA if GT was '.' and the subsequent math operations (i.e., '+') will produce NAs but no errors. If you really want to remove them you can use is.na(geno(all_vcf_PASS)$ADT. > > ## get sample ids and variant ids for the ?expanded set of variants' >> b <- as.data.frame(which(geno(all_vcf_PASS)$GT != '.', arr.ind=T))$col >> c <- as.data.frame(which(geno(all_vcf_PASS)$GT != '.', arr.ind=T))$row >> sample_ids <- as.data.frame(rownames(colData(all_vcf_PASS))[b]) >> variant_ids <- as.data.frame(rownames(geno(all_vcf_PASS)$GT)[c]) > > > ## make simple data.frame of all samples with genotype information >> row.names(tumor_ref_support) <- NULL >> row.names(normal_ref_support) <- NULL >> row.names(tumor_alt_support) <- NULL >> row.names(external_passed) <- NULL >> row.names(normal_alt_support) <- NULL > >> tmp <- as.data.frame(cbind(variant_ids,sample_ids,tumor_ref_support ,tumor_alt_support,normal_ref_support,normal_alt_support)) >> colnames(tmp) <- c('variant_id','sample_id','tumor_ref_support','tu mor_alt_support','normal_ref_support','normal_alt_support') >> tmp$allele_frac_tumor <- rep(0,nrow(tmp)) >> tmp$tumor_depth <- tmp$tumor_ref_support + tmp$tumor_alt_support >> tmp$normal_depth <- tmp$normal_ref_support + tmp$normal_alt_support >> tmp[tmp$tumor_depth > 0,]$allele_frac_tumor <- round(tmp[tmp$tumor_depth > 0,]$tumor_alt_support / tmp[tmp$tumor_depth > 0,]$tumor_depth,2) Because the geno variables are stored in matrices you can add them 'as is'. Underneath arrays are just vectors with a dimension attached. This operation on my sample variable doesn't make biological sense but it's the idea that you can just add the arrays of 'tumor_ref_support' and 'tumor_alt_support'. ## bogus new variable new_var <- geno(vcf)$CGA_OAD + geno(vcf)$CGA_OAD The result is another array with the same dimensions: >> dim(geno(vcf)$CGA_OAD) > [1] 1184 2 2 >> dim(new_var) > [1] 1184 2 2 Isolate the samples to separate array elements by rearranging the array with aperm(). ss <- aperm(new_var, c(1, 3, 2)) array_dims <- dim(ss) >> array_dims > [1] 1184 2 2 Create vectors of variant and sample names: variant_id <- rep.int(dimnames(vcf)[[1]], array_dims[2] * array_dims[3]) sample_id <- rep(dimnames(vcf)[[2]], each=array_dims[1] * array_dims[2]) Create the data.frame: df <- data.frame(variant_id, sample_id, new_var=as.vector(ss)) >> head(df) > variant_id sample_id new_var > 1 7:55003988_A/G HCC1187-H-200-37-ASM-N1 168 > 2 7:55010562_GA/G HCC1187-H-200-37-ASM-N1 100 > 3 7:55012097_T/C HCC1187-H-200-37-ASM-N1 64 > 4 7:55015505_C/T HCC1187-H-200-37-ASM-N1 38 > 5 7:55015541_A/. HCC1187-H-200-37-ASM-N1 NA > 6 7:55015699_C/T HCC1187-H-200-37-ASM-N1 20 You might also take a look at the VRanges class. It's a flat/light form of vcf data necessary for variant calling and filtering. It doesn't expand the genotype data as per this example but it may be useful to you in other applications. See ?VRanges for details. Valerie > > ## >> head(tmp) > variant_id sample_id tumor_ref_support tumor_alt_support normal_ref_support normal_alt_support allele_frac_tumor tumor_depth normal_depth > 1 chr5_10000000_T_C 001B_001T 17 7 21 0 0.29 24 21 > 2 chr11_10000000_C_T 001B_001T 31 4 33 0 0.11 35 33 > 3 chr18_10000000_C_T 001B_001T 21 7 37 1 0.25 28 38 > 4 chrY_1000000_A_G 001B_001T 10 2 11 0 0.17 12 11 > 5 chr1_1000000_G_C 011B_011T 17 3 48 0 0.15 20 48 > 6 chr1_1000000_A_G 011B_011T 77 13 114 0 0.14 90 114 > >> str(tmp) > 'data.frame': 4418 obs. of 9 variables: > $ variant_id : Factor w/ 4252 levels "chr1_1000000_G_T",..: 3254 620 1644 4251 359 381 391 401 246 1875 ... > $ sample_id : Factor w/ 99 levels "001B_001T","011B_011T",..: 1 1 1 1 2 2 2 2 2 2 ... > $ tumor_ref_support : int 17 31 21 10 17 77 12 40 75 53 ... > $ tumor_alt_support : int 7 4 7 2 3 13 3 6 8 9 ... > $ normal_ref_support: int 21 33 37 11 48 114 19 52 88 89 ... > $ normal_alt_support: int 0 0 1 0 0 0 0 0 0 0 ... > $ allele_frac_tumor : num 0.29 0.11 0.25 0.17 0.15 0.14 0.2 0.13 0.1 0.15 ... > $ tumor_depth : int 24 35 28 12 20 90 15 46 83 62 ... > $ normal_depth : int 21 33 38 11 48 114 19 52 88 89 ... > > Next, I plan to do a merge with my functional annotations (info) using the variant_id as the key, which I think would be straightforward. > > If there is a more convenient way to get here using the VariantAnnotation package, I would be grateful to hear about this. > > --- > Sigve Nakken, PhD > Postdoctoral Fellow, Dept. of Tumor Biology > Institute for Cancer Research > Oslo University Hospital, Norway > phone: +4795753022 > email: sigven at ifi.uio.no > > > > > > > [[alternative HTML version deleted]] > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Valerie Obenchain Program in Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, Seattle, WA 98109 Email: vobencha at fhcrc.org Phone: (206) 667-3158

ADD COMMENT • link 9.9 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 2.4 years ago

United States

I think you want to use VRanges. See ?VRanges. You can use readVcfAsVRanges to get one from a VCF. It expands samples, i.e., it is a long-form tabular representation of the VCF file. It has explicit columns for the read depths, but it takes them from the conventional "AD" geno field, while you have paired tumor/normal in ADC and ADT. So you won't be able to use those convenience fields out of the box, but ADC and ADT will at least land in the mcols, which is probably sufficient for your purposes. Please let me know how it goes, Michael On Mon, May 26, 2014 at 6:15 AM, Sigve Nakken <sigven@ifi.uio.no> wrote: > Hi, > > Iâve had similar challenges as Francesco, and have unsuccessfully tried to > use the data structures and functions provided by VariantAnnotation. My > experience is that I need the âexpand' functionality more often with > respect to the samples rather than the annotation tags. And as my > experience with R is somewhat limited, I tend to like to work with simple > data.frames when I want to summarise and characterise the data. > > Here is how I generated a simple data frame with all sample variants > (variants and samples are dummy encoded): > > ## read VCF (100 samples) > > all_vcf <- readVcf(âcancer.exome.project.vcf.gz','hg19') > > seqinfo(all_vcf) <- seqinfo(BSgenome.Hsapiens.UCSC.hg19) > > rowData(all_vcf) > GRanges with 50383 ranges and 5 metadata columns: > â¦ > > ## get variants with âPASSâ filter > > all_vcf_PASS <- > all_vcf[str_detect(elementMetadata(all_vcf)$FILTER,"PASS"),] > > rowData(all_vcf_PASS) > GRanges with 4252 ranges and 5 metadata columns: > ... > > > ## get genotype information for all samples that have called genotypes (GT > != â.â) > ## From my VCF: > ## FORMAT=<id=adt,number=.,type=integer,description="allelic depths="" for=""> the ref and alt alleles in the order listed (tumor)"> > ## FORMAT=<id=gt,number=1,type=string,description="genotype"> > ## FORMAT=<id=adc,number=.,type=integer,description="allelic depths="" for=""> the ref and alt alleles in the order listed (control)"> > ## > > tumor_ref_support <- > t(as.data.frame(geno(all_vcf_PASS)$ADT[which(geno(all_vcf_PASS)$GT != '.', > arr.ind=T)],row.names=NULL)[1,]) > > normal_ref_support <- > t(as.data.frame(geno(all_vcf_PASS)$ADC[which(geno(all_vcf_PASS)$GT != '.', > arr.ind=T)],row.names=NULL)[1,]) > > tumor_alt_support <- > t(as.data.frame(geno(all_vcf_PASS)$ADT[which(geno(all_vcf_PASS)$GT != '.', > arr.ind=T)],row.names=NULL)[2,]) > > normal_alt_support <- > t(as.data.frame(geno(all_vcf_PASS)$ADC[which(geno(all_vcf_PASS)$GT != '.', > arr.ind=T)],row.names=NULL)[2,]) > > ## get sample ids and variant ids for the âexpanded set of variants' > > b <- as.data.frame(which(geno(all_vcf_PASS)$GT != '.', arr.ind=T))$col > > c <- as.data.frame(which(geno(all_vcf_PASS)$GT != '.', arr.ind=T))$row > > sample_ids <- as.data.frame(rownames(colData(all_vcf_PASS))[b]) > > variant_ids <- as.data.frame(rownames(geno(all_vcf_PASS)$GT)[c]) > > > ## make simple data.frame of all samples with genotype information > > row.names(tumor_ref_support) <- NULL > > row.names(normal_ref_support) <- NULL > > row.names(tumor_alt_support) <- NULL > > row.names(external_passed) <- NULL > > row.names(normal_alt_support) <- NULL > > > tmp <- > as.data.frame(cbind(variant_ids,sample_ids,tumor_ref_support,tumor_a lt_support,normal_ref_support,normal_alt_support)) > > colnames(tmp) <- > c('variant_id','sample_id','tumor_ref_support','tumor_alt_support',' normal_ref_support','normal_alt_support') > > tmp$allele_frac_tumor <- rep(0,nrow(tmp)) > > tmp$tumor_depth <- tmp$tumor_ref_support + tmp$tumor_alt_support > > tmp$normal_depth <- tmp$normal_ref_support + tmp$normal_alt_support > > tmp[tmp$tumor_depth > 0,]$allele_frac_tumor <- round(tmp[tmp$tumor_depth > > 0,]$tumor_alt_support / tmp[tmp$tumor_depth > 0,]$tumor_depth,2) > > ## > > head(tmp) > variant_id sample_id tumor_ref_support tumor_alt_support > normal_ref_support normal_alt_support allele_frac_tumor tumor_depth > normal_depth > 1 chr5_10000000_T_C 001B_001T 17 7 > 21 0 0.29 24 21 > 2 chr11_10000000_C_T 001B_001T 31 4 > 33 0 0.11 35 33 > 3 chr18_10000000_C_T 001B_001T 21 7 > 37 1 0.25 28 38 > 4 chrY_1000000_A_G 001B_001T 10 2 > 11 0 0.17 12 11 > 5 chr1_1000000_G_C 011B_011T 17 3 > 48 0 0.15 20 48 > 6 chr1_1000000_A_G 011B_011T 77 13 > 114 0 0.14 90 114 > > > str(tmp) > 'data.frame': 4418 obs. of 9 variables: > $ variant_id : Factor w/ 4252 levels "chr1_1000000_G_T",..: 3254 > 620 1644 4251 359 381 391 401 246 1875 ... > $ sample_id : Factor w/ 99 levels "001B_001T","011B_011T",..: 1 1 > 1 1 2 2 2 2 2 2 ... > $ tumor_ref_support : int 17 31 21 10 17 77 12 40 75 53 ... > $ tumor_alt_support : int 7 4 7 2 3 13 3 6 8 9 ... > $ normal_ref_support: int 21 33 37 11 48 114 19 52 88 89 ... > $ normal_alt_support: int 0 0 1 0 0 0 0 0 0 0 ... > $ allele_frac_tumor : num 0.29 0.11 0.25 0.17 0.15 0.14 0.2 0.13 0.1 > 0.15 ... > $ tumor_depth : int 24 35 28 12 20 90 15 46 83 62 ... > $ normal_depth : int 21 33 38 11 48 114 19 52 88 89 ... > > Next, I plan to do a merge with my functional annotations (info) using the > variant_id as the key, which I think would be straightforward. > > If there is a more convenient way to get here using the VariantAnnotation > package, I would be grateful to hear about this. > > --- > Sigve Nakken, PhD > Postdoctoral Fellow, Dept. of Tumor Biology > Institute for Cancer Research > Oslo University Hospital, Norway > phone: +4795753022 > email: sigven@ifi.uio.no > > > > > > > [[alternative HTML version deleted]] > > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD COMMENT • link 9.9 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Hi, readVcfasVRanges works! Makes life much easier for me. Thanks so much for the clarification:) It cannot seem to find the FILTER column in the resulting DataFrame however, yet i specified this in my ScanVcfParam object, e.g.: info_tags <- rownames(info(scanVcfHeader(filtered_vcf))) vcfparam <- ScanVcfParam(fixed=c("ALT","FILTER"),geno=c("ADT","ADC","E SC","GT"),info=info_tags) Doing filtering on the various annotation columns (CharacterList, IntegerList etc) in the resulting DataFrame is my next challenge. Specifically, how could I convert these columns to simple character lists (and filling in NA in empty rows (character(0) and integer(0)) that could be used in a basic data.frame? best, Sigve On 28 May 2014, at 01:01, Michael Lawrence <lawrence.michael@gene.com> wrote: > I think you want to use VRanges. See ?VRanges. You can use readVcfAsVRanges to get one from a VCF. It expands samples, i.e., it is a long-form tabular representation of the VCF file. It has explicit columns for the read depths, but it takes them from the conventional "AD" geno field, while you have paired tumor/normal in ADC and ADT. So you won't be able to use those convenience fields out of the box, but ADC and ADT will at least land in the mcols, which is probably sufficient for your purposes. > > Please let me know how it goes, > Michael > > > > On Mon, May 26, 2014 at 6:15 AM, Sigve Nakken <sigven@ifi.uio.no> wrote: > Hi, > > Ive had similar challenges as Francesco, and have unsuccessfully tried to use the data structures and functions provided by VariantAnnotation. My experience is that I need the expand' functionality more often with respect to the samples rather than the annotation tags. And as my experience with R is somewhat limited, I tend to like to work with simple data.frames when I want to summarise and characterise the data. > > Here is how I generated a simple data frame with all sample variants (variants and samples are dummy encoded): > > ## read VCF (100 samples) > > all_vcf <- readVcf(cancer.exome.project.vcf.gz','hg19') > > seqinfo(all_vcf) <- seqinfo(BSgenome.Hsapiens.UCSC.hg19) > > rowData(all_vcf) > GRanges with 50383 ranges and 5 metadata columns: > > > ## get variants with PASS filter > > all_vcf_PASS <- all_vcf[str_detect(elementMetadata(all_vcf)$FILTER,"PASS"),] > > rowData(all_vcf_PASS) > GRanges with 4252 ranges and 5 metadata columns: > ... > > > ## get genotype information for all samples that have called genotypes (GT != .) > ## From my VCF: > ## FORMAT=<id=adt,number=.,type=integer,description="allelic depths="" for="" the="" ref="" and="" alt="" alleles="" in="" the="" order="" listed="" (tumor)"=""> > ## FORMAT=<id=gt,number=1,type=string,description="genotype"> > ## FORMAT=<id=adc,number=.,type=integer,description="allelic depths="" for="" the="" ref="" and="" alt="" alleles="" in="" the="" order="" listed="" (control)"=""> > ## > > tumor_ref_support <- t(as.data.frame(geno(all_vcf_PASS)$ADT[which(geno(all_vcf_PASS)$GT != '.', arr.ind=T)],row.names=NULL)[1,]) > > normal_ref_support <- t(as.data.frame(geno(all_vcf_PASS)$ADC[which(geno(all_vcf_PASS)$GT != '.', arr.ind=T)],row.names=NULL)[1,]) > > tumor_alt_support <- t(as.data.frame(geno(all_vcf_PASS)$ADT[which(geno(all_vcf_PASS)$GT != '.', arr.ind=T)],row.names=NULL)[2,]) > > normal_alt_support <- t(as.data.frame(geno(all_vcf_PASS)$ADC[which(geno(all_vcf_PASS)$GT != '.', arr.ind=T)],row.names=NULL)[2,]) > > ## get sample ids and variant ids for the expanded set of variants' > > b <- as.data.frame(which(geno(all_vcf_PASS)$GT != '.', arr.ind=T))$col > > c <- as.data.frame(which(geno(all_vcf_PASS)$GT != '.', arr.ind=T))$row > > sample_ids <- as.data.frame(rownames(colData(all_vcf_PASS))[b]) > > variant_ids <- as.data.frame(rownames(geno(all_vcf_PASS)$GT)[c]) > > > ## make simple data.frame of all samples with genotype information > > row.names(tumor_ref_support) <- NULL > > row.names(normal_ref_support) <- NULL > > row.names(tumor_alt_support) <- NULL > > row.names(external_passed) <- NULL > > row.names(normal_alt_support) <- NULL > > > tmp <- as.data.frame(cbind(variant_ids,sample_ids,tumor_ref_suppor t,tumor_alt_support,normal_ref_support,normal_alt_support)) > > colnames(tmp) <- c('variant_id','sample_id','tumor_ref_support','t umor_alt_support','normal_ref_support','normal_alt_support') > > tmp$allele_frac_tumor <- rep(0,nrow(tmp)) > > tmp$tumor_depth <- tmp$tumor_ref_support + tmp$tumor_alt_support > > tmp$normal_depth <- tmp$normal_ref_support + tmp$normal_alt_support > > tmp[tmp$tumor_depth > 0,]$allele_frac_tumor <- round(tmp[tmp$tumor_depth > 0,]$tumor_alt_support / tmp[tmp$tumor_depth > 0,]$tumor_depth,2) > > ## > > head(tmp) > variant_id sample_id tumor_ref_support tumor_alt_support normal_ref_support normal_alt_support allele_frac_tumor tumor_depth normal_depth > 1 chr5_10000000_T_C 001B_001T 17 7 21 0 0.29 24 21 > 2 chr11_10000000_C_T 001B_001T 31 4 33 0 0.11 35 33 > 3 chr18_10000000_C_T 001B_001T 21 7 37 1 0.25 28 38 > 4 chrY_1000000_A_G 001B_001T 10 2 11 0 0.17 12 11 > 5 chr1_1000000_G_C 011B_011T 17 3 48 0 0.15 20 48 > 6 chr1_1000000_A_G 011B_011T 77 13 114 0 0.14 90 114 > > > str(tmp) > 'data.frame': 4418 obs. of 9 variables: > $ variant_id : Factor w/ 4252 levels "chr1_1000000_G_T",..: 3254 620 1644 4251 359 381 391 401 246 1875 ... > $ sample_id : Factor w/ 99 levels "001B_001T","011B_011T",..: 1 1 1 1 2 2 2 2 2 2 ... > $ tumor_ref_support : int 17 31 21 10 17 77 12 40 75 53 ... > $ tumor_alt_support : int 7 4 7 2 3 13 3 6 8 9 ... > $ normal_ref_support: int 21 33 37 11 48 114 19 52 88 89 ... > $ normal_alt_support: int 0 0 1 0 0 0 0 0 0 0 ... > $ allele_frac_tumor : num 0.29 0.11 0.25 0.17 0.15 0.14 0.2 0.13 0.1 0.15 ... > $ tumor_depth : int 24 35 28 12 20 90 15 46 83 62 ... > $ normal_depth : int 21 33 38 11 48 114 19 52 88 89 ... > > Next, I plan to do a merge with my functional annotations (info) using the variant_id as the key, which I think would be straightforward. > > If there is a more convenient way to get here using the VariantAnnotation package, I would be grateful to hear about this. > > --- > Sigve Nakken, PhD > Postdoctoral Fellow, Dept. of Tumor Biology > Institute for Cancer Research > Oslo University Hospital, Norway > phone: +4795753022 > email: sigven@ifi.uio.no > > > > > > > [[alternative HTML version deleted]] > > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > --- Sigve Nakken, PhD Postdoctoral Fellow, Dept. of Tumor Biology Institute for Cancer Research Oslo University Hospital, Norway phone: +4795753022 email: sigven@ifi.uio.no [[alternative HTML version deleted]]

ADD REPLY • link 9.9 years ago Sigve Nakken ▴ 50

0

Entering edit mode

On Wed, May 28, 2014 at 2:42 AM, Sigve Nakken <sigven@ifi.uio.no> wrote: > Hi, > > readVcfasVRanges works! Makes life much easier for me. Thanks so much for > the clarification:) It cannot seem to find the âFILTERâ column in the > resulting DataFrame The FILTER column is parsed into the softFilterMatrix. There is a column for each filter, TRUE if passing. > however, yet i specified this in my ScanVcfParam object, e.g.: > > info_tags <- rownames(info(scanVcfHeader(filtered_vcf))) > vcfparam <- > ScanVcfParam(fixed=c("ALT","FILTER"),geno=c("ADT","ADC","ESC","GT"), info=info_tags) > > Doing filtering on the various annotation columns (CharacterList, > IntegerList etc) in the resulting DataFrame is my next challenge. > Specifically, how could I convert these columns to simple character lists > (and filling in NA in empty rows (character(0) and integer(0)) that could > be used in a basic data.frame? > > Just call drop() on each column. But I'm curious as to what is coming through as a List. I'm also curious as to why you want to use a data.frame. > best, > Sigve > > On 28 May 2014, at 01:01, Michael Lawrence <lawrence.michael@gene.com> > wrote: > > > I think you want to use VRanges. See ?VRanges. You can use > readVcfAsVRanges to get one from a VCF. It expands samples, i.e., it is a > long-form tabular representation of the VCF file. It has explicit columns > for the read depths, but it takes them from the conventional "AD" geno > field, while you have paired tumor/normal in ADC and ADT. So you won't be > able to use those convenience fields out of the box, but ADC and ADT will > at least land in the mcols, which is probably sufficient for your purposes. > > > > Please let me know how it goes, > > Michael > > > > > > > > On Mon, May 26, 2014 at 6:15 AM, Sigve Nakken <sigven@ifi.uio.no> wrote: > > Hi, > > > > Iâve had similar challenges as Francesco, and have unsuccessfully tried > to use the data structures and functions provided by VariantAnnotation. My > experience is that I need the âexpand' functionality more often with > respect to the samples rather than the annotation tags. And as my > experience with R is somewhat limited, I tend to like to work with simple > data.frames when I want to summarise and characterise the data. > > > > Here is how I generated a simple data frame with all sample variants > (variants and samples are dummy encoded): > > > > ## read VCF (100 samples) > > > all_vcf <- readVcf(âcancer.exome.project.vcf.gz','hg19') > > > seqinfo(all_vcf) <- seqinfo(BSgenome.Hsapiens.UCSC.hg19) > > > rowData(all_vcf) > > GRanges with 50383 ranges and 5 metadata columns: > > â¦ > > > > ## get variants with âPASSâ filter > > > all_vcf_PASS <- > all_vcf[str_detect(elementMetadata(all_vcf)$FILTER,"PASS"),] > > > rowData(all_vcf_PASS) > > GRanges with 4252 ranges and 5 metadata columns: > > ... > > > > > > ## get genotype information for all samples that have called genotypes > (GT != â.â) > > ## From my VCF: > > ## FORMAT=<id=adt,number=.,type=integer,description="allelic depths="" for=""> the ref and alt alleles in the order listed (tumor)"> > > ## FORMAT=<id=gt,number=1,type=string,description="genotype"> > > ## FORMAT=<id=adc,number=.,type=integer,description="allelic depths="" for=""> the ref and alt alleles in the order listed (control)"> > > ## > > > tumor_ref_support <- > t(as.data.frame(geno(all_vcf_PASS)$ADT[which(geno(all_vcf_PASS)$GT != '.', > arr.ind=T)],row.names=NULL)[1,]) > > > normal_ref_support <- > t(as.data.frame(geno(all_vcf_PASS)$ADC[which(geno(all_vcf_PASS)$GT != '.', > arr.ind=T)],row.names=NULL)[1,]) > > > tumor_alt_support <- > t(as.data.frame(geno(all_vcf_PASS)$ADT[which(geno(all_vcf_PASS)$GT != '.', > arr.ind=T)],row.names=NULL)[2,]) > > > normal_alt_support <- > t(as.data.frame(geno(all_vcf_PASS)$ADC[which(geno(all_vcf_PASS)$GT != '.', > arr.ind=T)],row.names=NULL)[2,]) > > > > ## get sample ids and variant ids for the âexpanded set of variants' > > > b <- as.data.frame(which(geno(all_vcf_PASS)$GT != '.', arr.ind=T))$col > > > c <- as.data.frame(which(geno(all_vcf_PASS)$GT != '.', arr.ind=T))$row > > > sample_ids <- as.data.frame(rownames(colData(all_vcf_PASS))[b]) > > > variant_ids <- as.data.frame(rownames(geno(all_vcf_PASS)$GT)[c]) > > > > > > ## make simple data.frame of all samples with genotype information > > > row.names(tumor_ref_support) <- NULL > > > row.names(normal_ref_support) <- NULL > > > row.names(tumor_alt_support) <- NULL > > > row.names(external_passed) <- NULL > > > row.names(normal_alt_support) <- NULL > > > > > tmp <- > as.data.frame(cbind(variant_ids,sample_ids,tumor_ref_support,tumor_a lt_support,normal_ref_support,normal_alt_support)) > > > colnames(tmp) <- > c('variant_id','sample_id','tumor_ref_support','tumor_alt_support',' normal_ref_support','normal_alt_support') > > > tmp$allele_frac_tumor <- rep(0,nrow(tmp)) > > > tmp$tumor_depth <- tmp$tumor_ref_support + tmp$tumor_alt_support > > > tmp$normal_depth <- tmp$normal_ref_support + tmp$normal_alt_support > > > tmp[tmp$tumor_depth > 0,]$allele_frac_tumor <- > round(tmp[tmp$tumor_depth > 0,]$tumor_alt_support / tmp[tmp$tumor_depth > > 0,]$tumor_depth,2) > > > > ## > > > head(tmp) > > variant_id sample_id tumor_ref_support tumor_alt_support > normal_ref_support normal_alt_support allele_frac_tumor tumor_depth > normal_depth > > 1 chr5_10000000_T_C 001B_001T 17 7 > 21 0 0.29 24 21 > > 2 chr11_10000000_C_T 001B_001T 31 4 > 33 0 0.11 35 33 > > 3 chr18_10000000_C_T 001B_001T 21 7 > 37 1 0.25 28 38 > > 4 chrY_1000000_A_G 001B_001T 10 2 > 11 0 0.17 12 11 > > 5 chr1_1000000_G_C 011B_011T 17 3 > 48 0 0.15 20 48 > > 6 chr1_1000000_A_G 011B_011T 77 13 > 114 0 0.14 90 114 > > > > > str(tmp) > > 'data.frame': 4418 obs. of 9 variables: > > $ variant_id : Factor w/ 4252 levels "chr1_1000000_G_T",..: 3254 > 620 1644 4251 359 381 391 401 246 1875 ... > > $ sample_id : Factor w/ 99 levels "001B_001T","011B_011T",..: 1 > 1 1 1 2 2 2 2 2 2 ... > > $ tumor_ref_support : int 17 31 21 10 17 77 12 40 75 53 ... > > $ tumor_alt_support : int 7 4 7 2 3 13 3 6 8 9 ... > > $ normal_ref_support: int 21 33 37 11 48 114 19 52 88 89 ... > > $ normal_alt_support: int 0 0 1 0 0 0 0 0 0 0 ... > > $ allele_frac_tumor : num 0.29 0.11 0.25 0.17 0.15 0.14 0.2 0.13 0.1 > 0.15 ... > > $ tumor_depth : int 24 35 28 12 20 90 15 46 83 62 ... > > $ normal_depth : int 21 33 38 11 48 114 19 52 88 89 ... > > > > Next, I plan to do a merge with my functional annotations (info) using > the variant_id as the key, which I think would be straightforward. > > > > If there is a more convenient way to get here using the > VariantAnnotation package, I would be grateful to hear about this. > > > > --- > > Sigve Nakken, PhD > > Postdoctoral Fellow, Dept. of Tumor Biology > > Institute for Cancer Research > > Oslo University Hospital, Norway > > phone: +4795753022 > > email: sigven@ifi.uio.no > > > > > > > > > > > > > > [[alternative HTML version deleted]] > > > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > --- > Sigve Nakken, PhD > Postdoctoral Fellow, Dept. of Tumor Biology > Institute for Cancer Research > Oslo University Hospital, Norway > phone: +4795753022 > email: sigven@ifi.uio.no > > > > > > [[alternative HTML version deleted]] > > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD REPLY • link 9.9 years ago Michael Lawrence ★ 11k

0

Entering edit mode

On 28 May 2014, at 14:26, Michael Lawrence <lawrence.michael@gene.com> wrote: > > > > On Wed, May 28, 2014 at 2:42 AM, Sigve Nakken <sigven@ifi.uio.no> wrote: > Hi, > > readVcfasVRanges works! Makes life much easier for me. Thanks so much for the clarification:) It cannot seem to find the FILTER column in the resulting DataFrame > > > The FILTER column is parsed into the softFilterMatrix. There is a column for each filter, TRUE if passing. OK. Got that. > however, yet i specified this in my ScanVcfParam object, e.g.: > > info_tags <- rownames(info(scanVcfHeader(filtered_vcf))) > vcfparam <- ScanVcfParam(fixed=c("ALT","FILTER"),geno=c("ADT","ADC", "ESC","GT"),info=info_tags) > > Doing filtering on the various annotation columns (CharacterList, IntegerList etc) in the resulting DataFrame is my next challenge. > Specifically, how could I convert these columns to simple character lists (and filling in NA in empty rows (character(0) and integer(0)) that could be used in a basic data.frame? > > > Just call drop() on each column. But I'm curious as to what is coming through as a List. I'm also curious as to why you want to use a data.frame. > All INFO columns which can contain multiple values, e.g. ##INFO=<id=tag1,number=.,type=integer" or="" ##info="<ID=TAG2," number=".," type="String)," are="" parsed="" into="" integerlists="" and="" characterlists,="" respectively.="" i="" am="" afraid="" drop()="" would="" not="" handle="" these.="" i="" prefer="" data.frame="" since="" that's="" beneficial="" for="" downstream="" analysis="" (joining,="" summarising,="" grouping="" etc.="" with="" dplyr="" and="" other="" data,="" as="" well="" as="" exploratory="" plotting="" with="" ggplot2="" etc.).=""> best, > Sigve > > On 28 May 2014, at 01:01, Michael Lawrence <lawrence.michael@gene.com> wrote: > > > I think you want to use VRanges. See ?VRanges. You can use readVcfAsVRanges to get one from a VCF. It expands samples, i.e., it is a long-form tabular representation of the VCF file. It has explicit columns for the read depths, but it takes them from the conventional "AD" geno field, while you have paired tumor/normal in ADC and ADT. So you won't be able to use those convenience fields out of the box, but ADC and ADT will at least land in the mcols, which is probably sufficient for your purposes. > > > > Please let me know how it goes, > > Michael > > > > > > > > On Mon, May 26, 2014 at 6:15 AM, Sigve Nakken <sigven@ifi.uio.no> wrote: > > Hi, > > > > Ive had similar challenges as Francesco, and have unsuccessfully tried to use the data structures and functions provided by VariantAnnotation. My experience is that I need the expand' functionality more often with respect to the samples rather than the annotation tags. And as my experience with R is somewhat limited, I tend to like to work with simple data.frames when I want to summarise and characterise the data. > > > > Here is how I generated a simple data frame with all sample variants (variants and samples are dummy encoded): > > > > ## read VCF (100 samples) > > > all_vcf <- readVcf(cancer.exome.project.vcf.gz','hg19') > > > seqinfo(all_vcf) <- seqinfo(BSgenome.Hsapiens.UCSC.hg19) > > > rowData(all_vcf) > > GRanges with 50383 ranges and 5 metadata columns: > > > > > > ## get variants with PASS filter > > > all_vcf_PASS <- all_vcf[str_detect(elementMetadata(all_vcf)$FILTER,"PASS"),] > > > rowData(all_vcf_PASS) > > GRanges with 4252 ranges and 5 metadata columns: > > ... > > > > > > ## get genotype information for all samples that have called genotypes (GT != .) > > ## From my VCF: > > ## FORMAT=<id=adt,number=.,type=integer,description="allelic depths="" for="" the="" ref="" and="" alt="" alleles="" in="" the="" order="" listed="" (tumor)"=""> > > ## FORMAT=<id=gt,number=1,type=string,description="genotype"> > > ## FORMAT=<id=adc,number=.,type=integer,description="allelic depths="" for="" the="" ref="" and="" alt="" alleles="" in="" the="" order="" listed="" (control)"=""> > > ## > > > tumor_ref_support <- t(as.data.frame(geno(all_vcf_PASS)$ADT[which(geno(all_vcf_PASS)$GT != '.', arr.ind=T)],row.names=NULL)[1,]) > > > normal_ref_support <- t(as.data.frame(geno(all_vcf_PASS)$ADC[which(geno(all_vcf_PASS)$GT != '.', arr.ind=T)],row.names=NULL)[1,]) > > > tumor_alt_support <- t(as.data.frame(geno(all_vcf_PASS)$ADT[which(geno(all_vcf_PASS)$GT != '.', arr.ind=T)],row.names=NULL)[2,]) > > > normal_alt_support <- t(as.data.frame(geno(all_vcf_PASS)$ADC[which(geno(all_vcf_PASS)$GT != '.', arr.ind=T)],row.names=NULL)[2,]) > > > > ## get sample ids and variant ids for the expanded set of variants' > > > b <- as.data.frame(which(geno(all_vcf_PASS)$GT != '.', arr.ind=T))$col > > > c <- as.data.frame(which(geno(all_vcf_PASS)$GT != '.', arr.ind=T))$row > > > sample_ids <- as.data.frame(rownames(colData(all_vcf_PASS))[b]) > > > variant_ids <- as.data.frame(rownames(geno(all_vcf_PASS)$GT)[c]) > > > > > > ## make simple data.frame of all samples with genotype information > > > row.names(tumor_ref_support) <- NULL > > > row.names(normal_ref_support) <- NULL > > > row.names(tumor_alt_support) <- NULL > > > row.names(external_passed) <- NULL > > > row.names(normal_alt_support) <- NULL > > > > > tmp <- as.data.frame(cbind(variant_ids,sample_ids,tumor_ref_supp ort,tumor_alt_support,normal_ref_support,normal_alt_support)) > > > colnames(tmp) <- c('variant_id','sample_id','tumor_ref_support', 'tumor_alt_support','normal_ref_support','normal_alt_support') > > > tmp$allele_frac_tumor <- rep(0,nrow(tmp)) > > > tmp$tumor_depth <- tmp$tumor_ref_support + tmp$tumor_alt_support > > > tmp$normal_depth <- tmp$normal_ref_support + tmp$normal_alt_support > > > tmp[tmp$tumor_depth > 0,]$allele_frac_tumor <- round(tmp[tmp$tumor_depth > 0,]$tumor_alt_support / tmp[tmp$tumor_depth > 0,]$tumor_depth,2) > > > > ## > > > head(tmp) > > variant_id sample_id tumor_ref_support tumor_alt_support normal_ref_support normal_alt_support allele_frac_tumor tumor_depth normal_depth > > 1 chr5_10000000_T_C 001B_001T 17 7 21 0 0.29 24 21 > > 2 chr11_10000000_C_T 001B_001T 31 4 33 0 0.11 35 33 > > 3 chr18_10000000_C_T 001B_001T 21 7 37 1 0.25 28 38 > > 4 chrY_1000000_A_G 001B_001T 10 2 11 0 0.17 12 11 > > 5 chr1_1000000_G_C 011B_011T 17 3 48 0 0.15 20 48 > > 6 chr1_1000000_A_G 011B_011T 77 13 114 0 0.14 90 114 > > > > > str(tmp) > > 'data.frame': 4418 obs. of 9 variables: > > $ variant_id : Factor w/ 4252 levels "chr1_1000000_G_T",..: 3254 620 1644 4251 359 381 391 401 246 1875 ... > > $ sample_id : Factor w/ 99 levels "001B_001T","011B_011T",..: 1 1 1 1 2 2 2 2 2 2 ... > > $ tumor_ref_support : int 17 31 21 10 17 77 12 40 75 53 ... > > $ tumor_alt_support : int 7 4 7 2 3 13 3 6 8 9 ... > > $ normal_ref_support: int 21 33 37 11 48 114 19 52 88 89 ... > > $ normal_alt_support: int 0 0 1 0 0 0 0 0 0 0 ... > > $ allele_frac_tumor : num 0.29 0.11 0.25 0.17 0.15 0.14 0.2 0.13 0.1 0.15 ... > > $ tumor_depth : int 24 35 28 12 20 90 15 46 83 62 ... > > $ normal_depth : int 21 33 38 11 48 114 19 52 88 89 ... > > > > Next, I plan to do a merge with my functional annotations (info) using the variant_id as the key, which I think would be straightforward. > > > > If there is a more convenient way to get here using the VariantAnnotation package, I would be grateful to hear about this. > > > > --- > > Sigve Nakken, PhD > > Postdoctoral Fellow, Dept. of Tumor Biology > > Institute for Cancer Research > > Oslo University Hospital, Norway > > phone: +4795753022 > > email: sigven@ifi.uio.no > > > > > > > > > > > > > > [[alternative HTML version deleted]] > > > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > --- > Sigve Nakken, PhD > Postdoctoral Fellow, Dept. of Tumor Biology > Institute for Cancer Research > Oslo University Hospital, Norway > phone: +4795753022 > email: sigven@ifi.uio.no > > > > > > [[alternative HTML version deleted]] > > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor --- Sigve Nakken, PhD Postdoctoral Fellow, Dept. of Tumor Biology Institute for Cancer Research Oslo University Hospital, Norway phone: +4795753022 email: sigven@ifi.uio.no [[alternative HTML version deleted]]

ADD REPLY • link 9.9 years ago Sigve Nakken ▴ 50

0

Entering edit mode

On Wed, May 28, 2014 at 6:09 AM, Sigve Nakken <sigven@ifi.uio.no> wrote: > > On 28 May 2014, at 14:26, Michael Lawrence <lawrence.michael@gene.com> > wrote: > > > > > On Wed, May 28, 2014 at 2:42 AM, Sigve Nakken <sigven@ifi.uio.no> wrote: > >> Hi, >> >> readVcfasVRanges works! Makes life much easier for me. Thanks so much for >> the clarification:) It cannot seem to find the âFILTERâ column in the >> resulting DataFrame > > > > The FILTER column is parsed into the softFilterMatrix. There is a column > for each filter, TRUE if passing. > > > OK. Got that. > > however, yet i specified this in my ScanVcfParam object, e.g.: >> >> info_tags <- rownames(info(scanVcfHeader(filtered_vcf))) >> vcfparam <- >> ScanVcfParam(fixed=c("ALT","FILTER"),geno=c("ADT","ADC","ESC","GT") ,info=info_tags) >> >> Doing filtering on the various annotation columns (CharacterList, >> IntegerList etc) in the resulting DataFrame is my next challenge. >> Specifically, how could I convert these columns to simple character lists >> (and filling in NA in empty rows (character(0) and integer(0)) that could >> be used in a basic data.frame? >> >> > Just call drop() on each column. But I'm curious as to what is coming > through as a List. I'm also curious as to why you want to use a data.frame. > > > > All INFO columns which can contain multiple values, e.g. â##INFO=<id=tag1,number=.,type=integer"> or â##INFO=<id=tag2, number=".," type="String)," are="" parsed="" into="" integerlists=""> and CharacterLists, respectively. I am afraid drop() would not handle these. > Don't be afraid, just try it. > I prefer data.frame since that's beneficial for downstream analysis > (joining, summarising, grouping etc. with dplyr and other data, as well as > exploratory plotting with ggplot2 etc.). > > I can see the LCD interoperability argument, but you might be surprised what you can do with IRanges while retaining the semantics of your data. There's a lot of buried treasure, at various depths. > > best, >> Sigve >> >> On 28 May 2014, at 01:01, Michael Lawrence <lawrence.michael@gene.com> >> wrote: >> >> > I think you want to use VRanges. See ?VRanges. You can use >> readVcfAsVRanges to get one from a VCF. It expands samples, i.e., it is a >> long-form tabular representation of the VCF file. It has explicit columns >> for the read depths, but it takes them from the conventional "AD" geno >> field, while you have paired tumor/normal in ADC and ADT. So you won't be >> able to use those convenience fields out of the box, but ADC and ADT will >> at least land in the mcols, which is probably sufficient for your purposes. >> > >> > Please let me know how it goes, >> > Michael >> > >> > >> > >> > On Mon, May 26, 2014 at 6:15 AM, Sigve Nakken <sigven@ifi.uio.no> >> wrote: >> > Hi, >> > >> > Iâve had similar challenges as Francesco, and have unsuccessfully tried >> to use the data structures and functions provided by VariantAnnotation. My >> experience is that I need the âexpand' functionality more often with >> respect to the samples rather than the annotation tags. And as my >> experience with R is somewhat limited, I tend to like to work with simple >> data.frames when I want to summarise and characterise the data. >> > >> > Here is how I generated a simple data frame with all sample variants >> (variants and samples are dummy encoded): >> > >> > ## read VCF (100 samples) >> > > all_vcf <- readVcf(âcancer.exome.project.vcf.gz','hg19') >> > > seqinfo(all_vcf) <- seqinfo(BSgenome.Hsapiens.UCSC.hg19) >> > > rowData(all_vcf) >> > GRanges with 50383 ranges and 5 metadata columns: >> > â¦ >> > >> > ## get variants with âPASSâ filter >> > > all_vcf_PASS <- >> all_vcf[str_detect(elementMetadata(all_vcf)$FILTER,"PASS"),] >> > > rowData(all_vcf_PASS) >> > GRanges with 4252 ranges and 5 metadata columns: >> > ... >> > >> > >> > ## get genotype information for all samples that have called genotypes >> (GT != â.â) >> > ## From my VCF: >> > ## FORMAT=<id=adt,number=.,type=integer,description="allelic depths="" for="">> the ref and alt alleles in the order listed (tumor)"> >> > ## FORMAT=<id=gt,number=1,type=string,description="genotype"> >> > ## FORMAT=<id=adc,number=.,type=integer,description="allelic depths="" for="">> the ref and alt alleles in the order listed (control)"> >> > ## >> > > tumor_ref_support <- >> t(as.data.frame(geno(all_vcf_PASS)$ADT[which(geno(all_vcf_PASS)$GT != '.', >> arr.ind=T)],row.names=NULL)[1,]) >> > > normal_ref_support <- >> t(as.data.frame(geno(all_vcf_PASS)$ADC[which(geno(all_vcf_PASS)$GT != '.', >> arr.ind=T)],row.names=NULL)[1,]) >> > > tumor_alt_support <- >> t(as.data.frame(geno(all_vcf_PASS)$ADT[which(geno(all_vcf_PASS)$GT != '.', >> arr.ind=T)],row.names=NULL)[2,]) >> > > normal_alt_support <- >> t(as.data.frame(geno(all_vcf_PASS)$ADC[which(geno(all_vcf_PASS)$GT != '.', >> arr.ind=T)],row.names=NULL)[2,]) >> > >> > ## get sample ids and variant ids for the âexpanded set of variants' >> > > b <- as.data.frame(which(geno(all_vcf_PASS)$GT != '.', arr.ind=T))$col >> > > c <- as.data.frame(which(geno(all_vcf_PASS)$GT != '.', arr.ind=T))$row >> > > sample_ids <- as.data.frame(rownames(colData(all_vcf_PASS))[b]) >> > > variant_ids <- as.data.frame(rownames(geno(all_vcf_PASS)$GT)[c]) >> > >> > >> > ## make simple data.frame of all samples with genotype information >> > > row.names(tumor_ref_support) <- NULL >> > > row.names(normal_ref_support) <- NULL >> > > row.names(tumor_alt_support) <- NULL >> > > row.names(external_passed) <- NULL >> > > row.names(normal_alt_support) <- NULL >> > >> > > tmp <- >> as.data.frame(cbind(variant_ids,sample_ids,tumor_ref_support,tumor_ alt_support,normal_ref_support,normal_alt_support)) >> > > colnames(tmp) <- >> c('variant_id','sample_id','tumor_ref_support','tumor_alt_support', 'normal_ref_support','normal_alt_support') >> > > tmp$allele_frac_tumor <- rep(0,nrow(tmp)) >> > > tmp$tumor_depth <- tmp$tumor_ref_support + tmp$tumor_alt_support >> > > tmp$normal_depth <- tmp$normal_ref_support + tmp$normal_alt_support >> > > tmp[tmp$tumor_depth > 0,]$allele_frac_tumor <- >> round(tmp[tmp$tumor_depth > 0,]$tumor_alt_support / tmp[tmp$tumor_depth > >> 0,]$tumor_depth,2) >> > >> > ## >> > > head(tmp) >> > variant_id sample_id tumor_ref_support tumor_alt_support >> normal_ref_support normal_alt_support allele_frac_tumor tumor_depth >> normal_depth >> > 1 chr5_10000000_T_C 001B_001T 17 7 >> 21 0 0.29 24 21 >> > 2 chr11_10000000_C_T 001B_001T 31 4 >> 33 0 0.11 35 33 >> > 3 chr18_10000000_C_T 001B_001T 21 7 >> 37 1 0.25 28 38 >> > 4 chrY_1000000_A_G 001B_001T 10 2 >> 11 0 0.17 12 11 >> > 5 chr1_1000000_G_C 011B_011T 17 3 >> 48 0 0.15 20 48 >> > 6 chr1_1000000_A_G 011B_011T 77 13 >> 114 0 0.14 90 114 >> > >> > > str(tmp) >> > 'data.frame': 4418 obs. of 9 variables: >> > $ variant_id : Factor w/ 4252 levels "chr1_1000000_G_T",..: >> 3254 620 1644 4251 359 381 391 401 246 1875 ... >> > $ sample_id : Factor w/ 99 levels "001B_001T","011B_011T",..: >> 1 1 1 1 2 2 2 2 2 2 ... >> > $ tumor_ref_support : int 17 31 21 10 17 77 12 40 75 53 ... >> > $ tumor_alt_support : int 7 4 7 2 3 13 3 6 8 9 ... >> > $ normal_ref_support: int 21 33 37 11 48 114 19 52 88 89 ... >> > $ normal_alt_support: int 0 0 1 0 0 0 0 0 0 0 ... >> > $ allele_frac_tumor : num 0.29 0.11 0.25 0.17 0.15 0.14 0.2 0.13 0.1 >> 0.15 ... >> > $ tumor_depth : int 24 35 28 12 20 90 15 46 83 62 ... >> > $ normal_depth : int 21 33 38 11 48 114 19 52 88 89 ... >> > >> > Next, I plan to do a merge with my functional annotations (info) using >> the variant_id as the key, which I think would be straightforward. >> > >> > If there is a more convenient way to get here using the >> VariantAnnotation package, I would be grateful to hear about this. >> > >> > --- >> > Sigve Nakken, PhD >> > Postdoctoral Fellow, Dept. of Tumor Biology >> > Institute for Cancer Research >> > Oslo University Hospital, Norway >> > phone: +4795753022 >> > email: sigven@ifi.uio.no >> > >> > >> > >> > >> > >> > >> > [[alternative HTML version deleted]] >> > >> > >> > _______________________________________________ >> > Bioconductor mailing list >> > Bioconductor@r-project.org >> > https://stat.ethz.ch/mailman/listinfo/bioconductor >> > Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > >> >> --- >> Sigve Nakken, PhD >> Postdoctoral Fellow, Dept. of Tumor Biology >> Institute for Cancer Research >> Oslo University Hospital, Norway >> phone: +4795753022 >> email: sigven@ifi.uio.no >> >> >> >> >> >> [[alternative HTML version deleted]] >> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > > --- > Sigve Nakken, PhD > Postdoctoral Fellow, Dept. of Tumor Biology > Institute for Cancer Research > Oslo University Hospital, Norway > phone: +4795753022 > email: sigven@ifi.uio.no > > > > > [[alternative HTML version deleted]]

ADD REPLY • link 9.9 years ago Michael Lawrence ★ 11k

Login before adding your answer.