Complementing alleles in a VCF
1
0
Entering edit mode
@alex-gutteridge-2935
Last seen 9.6 years ago
United States
Hi All, I'm having some difficulty with VCF handling and feeding 1000 genome VCFs into predictCoding from the VariantAnnotation package. Can anyone help? When I read 1000 genomes VCFs the SNPs come through with no strand specified e.g (where vcf.file and param are a 1000 genomes VCF file and a GRanges respectively): > vcf = readVcf(vcf.file, "hg19", param) > fixed(vcf) GRanges with 2610 ranges and 5 elementMetadata cols: seqnames ranges strand | paramRangeID <rle> <iranges> <rle> | <factor> rs186838828 chr2 [167051756, 167051756] * | SCN9A rs191667986 chr2 [167051865, 167051865] * | SCN9A rs139483482 chr2 [167051900, 167051900] * | SCN9A rs182687583 chr2 [167052080, 167052080] * | SCN9A rs115766730 chr2 [167052144, 167052144] * | SCN9A rs186613025 chr2 [167052168, 167052168] * | SCN9A rs73017538 chr2 [167052328, 167052328] * | SCN9A rs114327563 chr2 [167052375, 167052375] * | SCN9A rs191401619 chr2 [167052418, 167052418] * | SCN9A ... ... ... ... ... ... rs73025590 chr2 [167231750, 167231750] * | SCN9A rs141453198 chr2 [167231812, 167231812] * | SCN9A rs16852069 chr2 [167231890, 167231890] * | SCN9A rs181276399 chr2 [167231932, 167231932] * | SCN9A rs185839773 chr2 [167232251, 167232251] * | SCN9A rs191091185 chr2 [167232439, 167232439] * | SCN9A rs148362057 chr2 [167232446, 167232446] * | SCN9A rs141521157 chr2 [167232450, 167232450] * | SCN9A rs1881440 chr2 [167232463, 167232463] * | SCN9A REF ALT QUAL FILTER <dnastringset> <dnastringsetlist> <numeric> <character> rs186838828 T ######## 100 PASS rs191667986 T ######## 100 PASS rs139483482 T ######## 100 PASS rs182687583 G ######## 100 PASS rs115766730 G ######## 100 PASS rs186613025 A ######## 100 PASS rs73017538 A ######## 100 PASS rs114327563 A ######## 100 PASS rs191401619 C ######## 100 PASS ... ... ... ... ... rs73025590 A ######## 100 PASS rs141453198 C ######## 100 PASS rs16852069 A ######## 100 PASS rs181276399 G ######## 100 PASS rs185839773 A ######## 100 PASS rs191091185 C ######## 100 PASS rs148362057 A ######## 100 PASS rs141521157 A ######## 100 PASS rs1881440 C ######## 100 PASS --- seqlengths: chr2 NA When I feed these to predictCoding(), genes on the complement strand are not dealt with correctly - the ALT alleles given in the VCF are not complemented first, so the variant codons are not translated right. > txdb = TxDb.Hsapiens.UCSC.hg19.knownGene > aa = predictCoding(vcf.filt, txdb, Hsapiens) It seems like predictCoding should be able to deal with genes on the complement strand, but maybe it requires the strand to be set correctly in the VCF? I've tried and failed to find a way of editing the VCF - if nothing else to just complement the ALT alleles before running predictcoding, but it doesn't seem to work. Though the manual seems to suggest it is possible. I didn't find a way to actually set the strand directly either: ?alt(x)?, ?alt(x) <- value?: Returns or sets the alternate allele data from the ALT column of the VCF file. ?value? can be a ?DNAStringSet? or a ?CharacterList? (for a structural VCF file). > tmp.alt = complement(unlist(values(alt(vcf))[["ALT"]])) > tmp.alt A DNAStringSet instance of length 2610 width seq [1] 1 T [2] 1 G [3] 1 G [4] 1 T [5] 1 T [6] 1 C [7] 1 C [8] 1 C [9] 1 C ... ... ... [2602] 1 A [2603] 1 T [2604] 1 C [2605] 1 T [2606] 1 C [2607] 1 C [2608] 1 C [2609] 1 C [2610] 1 T > alt(vcf) = tmp.alt Error in function (classes, fdef, mtable) : unable to find an inherited method for function "alt<-", for signature "VCF", "DNAStringSet" > sessionInfo() R version 2.15.0 (2012-03-30) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] RColorBrewer_1.0-5 [2] caTools_1.12 [3] bitops_1.0-4.1 [4] TxDb.Hsapiens.UCSC.hg19.knownGene_2.7.1 [5] GenomicFeatures_1.8.1 [6] org.Hs.eg.db_2.7.1 [7] RSQLite_0.11.1 [8] DBI_0.2-5 [9] AnnotationDbi_1.18.0 [10] Biobase_2.16.0 [11] VariantAnnotation_1.2.5 [12] Rsamtools_1.8.3 [13] BSgenome.Hsapiens.UCSC.hg19_1.3.17 [14] BSgenome_1.24.0 [15] Biostrings_2.24.1 [16] GenomicRanges_1.8.3 [17] IRanges_1.14.2 [18] BiocGenerics_0.2.0 loaded via a namespace (and not attached): [1] biomaRt_2.12.0 grid_2.15.0 lattice_0.20-6 Matrix_1.0-6 [5] RCurl_1.91-1 rtracklayer_1.16.1 snpStats_1.6.0 splines_2.15.0 [9] stats4_2.15.0 survival_2.36-12 tools_2.15.0 XML_3.9-4 [13] zlibbioc_1.2.0 -- Alex Gutteridge
VariantAnnotation BSgenome BSgenome genomes VariantAnnotation VariantAnnotation BSgenome • 1.3k views
ADD COMMENT
0
Entering edit mode
@martin-morgan-1513
Last seen 2 days ago
United States
Hi Alex -- On 04/19/2012 02:39 AM, Alex Gutteridge wrote: > Hi All, > > I'm having some difficulty with VCF handling and feeding 1000 genome > VCFs into predictCoding from the VariantAnnotation package. Can anyone > help? > > When I read 1000 genomes VCFs the SNPs come through with no strand > specified e.g (where vcf.file and param are a 1000 genomes VCF file and > a GRanges respectively): > >> vcf = readVcf(vcf.file, "hg19", param) >> fixed(vcf) > GRanges with 2610 ranges and 5 elementMetadata cols: > seqnames ranges strand | paramRangeID > <rle> <iranges> <rle> | <factor> > rs186838828 chr2 [167051756, 167051756] * | SCN9A > rs191667986 chr2 [167051865, 167051865] * | SCN9A > rs139483482 chr2 [167051900, 167051900] * | SCN9A > rs182687583 chr2 [167052080, 167052080] * | SCN9A > rs115766730 chr2 [167052144, 167052144] * | SCN9A > rs186613025 chr2 [167052168, 167052168] * | SCN9A > rs73017538 chr2 [167052328, 167052328] * | SCN9A > rs114327563 chr2 [167052375, 167052375] * | SCN9A > rs191401619 chr2 [167052418, 167052418] * | SCN9A > ... ... ... ... ... ... > rs73025590 chr2 [167231750, 167231750] * | SCN9A > rs141453198 chr2 [167231812, 167231812] * | SCN9A > rs16852069 chr2 [167231890, 167231890] * | SCN9A > rs181276399 chr2 [167231932, 167231932] * | SCN9A > rs185839773 chr2 [167232251, 167232251] * | SCN9A > rs191091185 chr2 [167232439, 167232439] * | SCN9A > rs148362057 chr2 [167232446, 167232446] * | SCN9A > rs141521157 chr2 [167232450, 167232450] * | SCN9A > rs1881440 chr2 [167232463, 167232463] * | SCN9A > REF ALT QUAL FILTER > <dnastringset> <dnastringsetlist> <numeric> <character> > rs186838828 T ######## 100 PASS > rs191667986 T ######## 100 PASS > rs139483482 T ######## 100 PASS > rs182687583 G ######## 100 PASS > rs115766730 G ######## 100 PASS > rs186613025 A ######## 100 PASS > rs73017538 A ######## 100 PASS > rs114327563 A ######## 100 PASS > rs191401619 C ######## 100 PASS > ... ... ... ... ... > rs73025590 A ######## 100 PASS > rs141453198 C ######## 100 PASS > rs16852069 A ######## 100 PASS > rs181276399 G ######## 100 PASS > rs185839773 A ######## 100 PASS > rs191091185 C ######## 100 PASS > rs148362057 A ######## 100 PASS > rs141521157 A ######## 100 PASS > rs1881440 C ######## 100 PASS > --- > seqlengths: > chr2 > NA > > When I feed these to predictCoding(), genes on the complement strand are > not dealt with correctly - the ALT alleles given in the VCF are not > complemented first, so the variant codons are not translated right. A definitive answer won't be available until next week. Probably a more straight-forward way, but... > >> txdb = TxDb.Hsapiens.UCSC.hg19.knownGene >> aa = predictCoding(vcf.filt, txdb, Hsapiens) > > It seems like predictCoding should be able to deal with genes on the > complement strand, but maybe it requires the strand to be set correctly > in the VCF? I've tried and failed to find a way of editing the VCF - if > nothing else to just complement the ALT alleles before running > predictcoding, but it doesn't seem to work. Though the manual seems to > suggest it is possible. I didn't find a way to actually set the strand > directly either: > > ?alt(x)?, ?alt(x) <- value?: Returns or sets the alternate allele data > from the ALT column of the VCF file. ?value? can be a > ?DNAStringSet? or a ?CharacterList? (for a structural VCF > file). > >> tmp.alt = complement(unlist(values(alt(vcf))[["ALT"]])) >> tmp.alt > A DNAStringSet instance of length 2610 > width seq > [1] 1 T > [2] 1 G > [3] 1 G > [4] 1 T > [5] 1 T > [6] 1 C > [7] 1 C > [8] 1 C > [9] 1 C > ... ... ... > [2602] 1 A > [2603] 1 T > [2604] 1 C > [2605] 1 T > [2606] 1 C > [2607] 1 C > [2608] 1 C > [2609] 1 C > [2610] 1 T >> alt(vcf) = tmp.alt > Error in function (classes, fdef, mtable) : > unable to find an inherited method for function "alt<-", for signature > "VCF", "DNAStringSet" as an example fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation") vcf <- readVcf(fl, "hg19") you can set the strand on vcf with > strand(rowData(vcf)) 'factor' Rle of length 5 with 1 run Lengths: 5 Values : * Levels(3): + - * > strand(rowData(vcf)) <- "+" > strand(rowData(vcf)) 'factor' Rle of length 5 with 1 run Lengths: 5 Values : + Levels(3): + - * (probably there should be strand,VCF, and strand<-,VCF,ANY methods). For alt, it seems like DNAStringSetList needs a constructor DNAStringSetList <- function(...) { listData <- list(...) if (length(listData) == 1 && is.list(listData[[1L]])) listData <- listData[[1L]] ok <- sapply(listData, is, "DNAStringSet") if (!all(ok)) listData <- lapply(listData, as, "DNAStringSet") IRanges:::newCompressedList("DNAStringSetList", listData) } and then > orig <- values(alt(vcf))[["ALT"]] > dna <- complement(unlist(orig)) > alt(vcf) <- relist(dna, orig) (It's confusing, though convenient, how alt() returns a GRanges, but alt<- expects 'value' to be a DNAStringSetList; also as you note the docs say there should be a VCF,DNAStringSet method for alt<-, but there doesn't seem to be one (it wouldn't be appropriate in your case, because ALT is a DNAStringSetList). Martin > >> sessionInfo() > R version 2.15.0 (2012-03-30) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] RColorBrewer_1.0-5 > [2] caTools_1.12 > [3] bitops_1.0-4.1 > [4] TxDb.Hsapiens.UCSC.hg19.knownGene_2.7.1 > [5] GenomicFeatures_1.8.1 > [6] org.Hs.eg.db_2.7.1 > [7] RSQLite_0.11.1 > [8] DBI_0.2-5 > [9] AnnotationDbi_1.18.0 > [10] Biobase_2.16.0 > [11] VariantAnnotation_1.2.5 > [12] Rsamtools_1.8.3 > [13] BSgenome.Hsapiens.UCSC.hg19_1.3.17 > [14] BSgenome_1.24.0 > [15] Biostrings_2.24.1 > [16] GenomicRanges_1.8.3 > [17] IRanges_1.14.2 > [18] BiocGenerics_0.2.0 > > loaded via a namespace (and not attached): > [1] biomaRt_2.12.0 grid_2.15.0 lattice_0.20-6 Matrix_1.0-6 > [5] RCurl_1.91-1 rtracklayer_1.16.1 snpStats_1.6.0 splines_2.15.0 > [9] stats4_2.15.0 survival_2.36-12 tools_2.15.0 XML_3.9-4 > [13] zlibbioc_1.2.0 > -- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793
ADD COMMENT
0
Entering edit mode
On 19.04.2012 14:34, Martin Morgan wrote: > Hi Alex -- > > On 04/19/2012 02:39 AM, Alex Gutteridge wrote: >> Hi All, >> >> I'm having some difficulty with VCF handling and feeding 1000 genome >> VCFs into predictCoding from the VariantAnnotation package. Can >> anyone >> help? >> [...] > > as an example > > fl <- system.file("extdata", "ex2.vcf", > package="VariantAnnotation") > vcf <- readVcf(fl, "hg19") > > you can set the strand on vcf with > >> strand(rowData(vcf)) > 'factor' Rle of length 5 with 1 run > Lengths: 5 > Values : * > Levels(3): + - * >> strand(rowData(vcf)) <- "+" >> strand(rowData(vcf)) > 'factor' Rle of length 5 with 1 run > Lengths: 5 > Values : + > Levels(3): + - * > > (probably there should be strand,VCF, and strand<-,VCF,ANY methods). > > For alt, it seems like DNAStringSetList needs a constructor > > DNAStringSetList <- > function(...) > { > listData <- list(...) > if (length(listData) == 1 && is.list(listData[[1L]])) > listData <- listData[[1L]] > ok <- sapply(listData, is, "DNAStringSet") > if (!all(ok)) > listData <- lapply(listData, as, "DNAStringSet") > IRanges:::newCompressedList("DNAStringSetList", listData) > } > > and then > >> orig <- values(alt(vcf))[["ALT"]] >> dna <- complement(unlist(orig)) >> alt(vcf) <- relist(dna, orig) > > (It's confusing, though convenient, how alt() returns a GRanges, but > alt<- expects 'value' to be a DNAStringSetList; also as you note the > docs say there should be a VCF,DNAStringSet method for alt<-, but > there doesn't seem to be one (it wouldn't be appropriate in your > case, > because ALT is a DNAStringSetList). > > Martin Thanks Martin - very helpful. In the end I ran predictCoding after ripping the Granges and varAllele information out of the vcf separately. This works, but handling the VCF directly would seem more elegant: > predictCoding(rowData(vcf), txdb, Hsapiens, > complement(unlist(values(alt(vcf.filt))[["ALT"]]))) -- Alex Gutteridge
ADD REPLY
0
Entering edit mode
Hi Alex, Thanks for reporting the problem with predictCoding() and the handling of variants on different strands. You've probably seen this thread, https://stat.ethz.ch/pipermail/bioconductor/2012-April/045072.html from Jeremiah reporting the same problem. In that exchange Herve summarized the fix we will implement and it is next on my list. As for the other issues in this email, - strand,VCF, and strand<-,VCF methods have been added - DNAStringSetList() constructor has been added to Biostrings, see ?DNAStringSetList - The method for alt<-,VCF,DNAStringSet was removed leaving only alt<-,VCF,DNAStringSetList and alt<-,VCF,CharacterList. Documentation has been updated. These changes are available in VariantAnnotation 1.3.8 and Biostrings 2.25.3, both in devel. When I have the fix for predictCoding done I'll post to the thread Jeremiah started. Valerie On 04/19/2012 07:53 AM, Alex Gutteridge wrote: > On 19.04.2012 14:34, Martin Morgan wrote: >> Hi Alex -- >> >> On 04/19/2012 02:39 AM, Alex Gutteridge wrote: >>> Hi All, >>> >>> I'm having some difficulty with VCF handling and feeding 1000 genome >>> VCFs into predictCoding from the VariantAnnotation package. Can anyone >>> help? >>> > > [...] > >> >> as an example >> >> fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation") >> vcf <- readVcf(fl, "hg19") >> >> you can set the strand on vcf with >> >>> strand(rowData(vcf)) >> 'factor' Rle of length 5 with 1 run >> Lengths: 5 >> Values : * >> Levels(3): + - * >>> strand(rowData(vcf)) <- "+" >>> strand(rowData(vcf)) >> 'factor' Rle of length 5 with 1 run >> Lengths: 5 >> Values : + >> Levels(3): + - * >> >> (probably there should be strand,VCF, and strand<-,VCF,ANY methods). >> >> For alt, it seems like DNAStringSetList needs a constructor >> >> DNAStringSetList <- >> function(...) >> { >> listData <- list(...) >> if (length(listData) == 1 && is.list(listData[[1L]])) >> listData <- listData[[1L]] >> ok <- sapply(listData, is, "DNAStringSet") >> if (!all(ok)) >> listData <- lapply(listData, as, "DNAStringSet") >> IRanges:::newCompressedList("DNAStringSetList", listData) >> } >> >> and then >> >>> orig <- values(alt(vcf))[["ALT"]] >>> dna <- complement(unlist(orig)) >>> alt(vcf) <- relist(dna, orig) >> >> (It's confusing, though convenient, how alt() returns a GRanges, but >> alt<- expects 'value' to be a DNAStringSetList; also as you note the >> docs say there should be a VCF,DNAStringSet method for alt<-, but >> there doesn't seem to be one (it wouldn't be appropriate in your case, >> because ALT is a DNAStringSetList). >> >> Martin > > Thanks Martin - very helpful. In the end I ran predictCoding after > ripping the Granges and varAllele information out of the vcf > separately. This works, but handling the VCF directly would seem more > elegant: > >> predictCoding(rowData(vcf), txdb, Hsapiens, >> complement(unlist(values(alt(vcf.filt))[["ALT"]]))) >
ADD REPLY

Login before adding your answer.

Traffic: 594 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6