Converting a DNAStringSetList to characters quickly
4
0
Entering edit mode
dan.gatti • 0
@dangatti-7637
Last seen 7.5 years ago

I'm using read.vcf() in the VariantAnnotation package to get SNPs from a VCF file. It returns a CollapsedVCF object. I then extract the alternate allele calls using fixed(vcf)$ALT, which returns a DNAStringSetList. Each element contains a DNAStringSet with one or more characters (there are tri-morphic SNPs in this data set). I would then like to write this to a new file, concatenating the allele for each SNP with a comma. To be clear, I need to convert from this:

DNAStringSetList of length 6
[[1]] A
[[2]] C
[[3]] A G
[[4]] T
[[5]] A C T
[[6]] A

to this character vector:

[1] "A"     "C"     "A,G"   "T"     "A,C,T" "A"

and do it quickly for over a million SNPs. I have a slow method below, but I'm wondering if there is some slick trick that will do it more quickly.  I checked the DNAStringSetList and DNAStringSet documentation and don't see a quicker way to make this conversion.

Here is sample code for 6 SNPs.

library(VariantAnnotation)
alt = DNAStringSetList("A", "C", c("A", "G"), "T", c("A", "C", "T"), "A")
x = lapply(alt, as.character)
x = sapply(x, paste, collapse = ",")

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
[1] VariantAnnotation_1.12.9 Rsamtools_1.18.3         Biostrings_2.34.1       
[4] XVector_0.6.0            GenomicRanges_1.18.4     GenomeInfoDb_1.2.5      
[7] IRanges_2.0.1            S4Vectors_0.4.0          BiocGenerics_0.12.1     

loaded via a namespace (and not attached):
 [1] AnnotationDbi_1.28.2    base64enc_0.1-2         BatchJobs_1.6          
 [4] BBmisc_1.9              Biobase_2.26.0          BiocParallel_1.0.3     
 [7] biomaRt_2.22.0          bitops_1.0-6            brew_1.0-6             
[10] BSgenome_1.34.1         checkmate_1.5.2         codetools_0.2-11       
[13] DBI_0.3.1               digest_0.6.8            fail_1.2               
[16] foreach_1.4.2           GenomicAlignments_1.2.2 GenomicFeatures_1.18.7
[19] iterators_1.0.7         RCurl_1.95-4.5          RSQLite_1.0.0          
[22] rtracklayer_1.26.3      sendmailR_1.2-1         stringr_0.6.2          
[25] tools_3.1.1             XML_3.98-1.1            zlibbioc_1.12.0

Thanks in advance,

Daniel Gatti

The Jackson Laboratory

readVcf DNAStringSet DNAStringSetList variantannotation readvcf • 4.3k views
ADD COMMENT
1
Entering edit mode
@valerie-obenchain-4275
Last seen 2.3 years ago
United States

Hi,

These tips aren't on the man page but are just general strategies for working with long character lists/vectors. The CharacterList coercion is faster than the lapply:

clist <- CharacterList(alt)

Paste and replace only the elements with multiple alts:

mult <- elementLengths(clist) > 1L
clist[mult] <- lapply(clist[mult], paste0, collapse=",")
> unlist(clist)
[1] "A"     "C"     "A,G"   "T"     "A,C,T" "A"    

I'm guessing you're aware of writeVcf() but instead are interested in writing out just the alts?

Valerie

ADD COMMENT
0
Entering edit mode

Hi Dan, Val,

You can use unstrsplit() for concatenating the allele for each SNP with a comma. Should be much faster than lapply( , paste0).

Cheers,

H.

ADD REPLY
0
Entering edit mode
dan.gatti • 0
@dangatti-7637
Last seen 7.5 years ago

Thanks. The CharacterList is a new class for me. I need to write out a table with polymorphic SNPs in a subset of the strains from the VCF. And they need to be coded as 0 or 1. Otherwise, I'd use writeVcf().

Just to close out the question for others, I performed a few timing runs. The last two methods are barely distinguishable in terms of time. I decided to go with:

 x = CharacterList(alt)
 x = unstrsplit(x, sep = ",")

 

> library(VariantAnnotation)
> alt = replicate(1000, expr =
+          sample(x = c("A", "C", "G", "T"), size = sample(1:3, prob = c(0.9, 0.05, 0.05))))
> alt = DNAStringSetList(alt)
>
> system.time({
+   x = lapply(alt, as.character)
+   x = sapply(x, paste, collapse = ",")
+ })
   user  system elapsed
   3.76    0.00    3.76
>
> system.time({
+   x = CharacterList(alt)
+   x = sapply(x, paste, collapse = ",")
+ })
   user  system elapsed
   0.17    0.00    0.17
>
> system.time({
+   x = CharacterList(alt)
+   mult = elementLengths(x) > 1L
+   x[mult] = lapply(x[mult], paste0, collapse = ",")
+   x = unlist(x)
+ })
   user  system elapsed
   0.03    0.00    0.04
>
> system.time({
+   x = CharacterList(alt)
+   x = unstrsplit(x, sep = ",")
+ })
   user  system elapsed
   0.01    0.00    0.01

 

Thanks for your help!

ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 25 minutes ago
Seattle, WA, United States

"barely distinguishable" really?

CharacterList() is about 400x faster than lapply(alt, as.character):

system.time(x <- lapply(alt, as.character))
#   user  system elapsed 
#  1.848   0.000   1.850 
system.time(x <- CharacterList(alt))
#   user  system elapsed 
#  0.004   0.000   0.004 

And unstrsplit() is about 20x faster than the lapply() solution:

unstrsplit2 <- function(x)
{
  mult <- elementLengths(x) > 1L
  x[mult] <- lapply(x[mult], paste0, collapse = ",")
  unlist(x, use.names=FALSE)
}
library(microbenchmark)
microbenchmark(unstrsplit(x, sep=","), unstrsplit2(x))
# Unit: microseconds
#                      expr       min        lq       mean    median        uq
#  unstrsplit(x, sep = ",")   443.123   484.241   513.8259   517.505   528.236
#            unstrsplit2(x) 10229.687 10366.235 10759.4753 10464.023 10729.687
#        max neval
#    796.788   100
#  15143.757   100

Cheers,

H.

ADD COMMENT
0
Entering edit mode
dan.gatti • 0
@dangatti-7637
Last seen 7.5 years ago

Good point. I looked at the 0.04 and 0.01 and misinterpreted them. Thanks for the clarification.
 

ADD COMMENT

Login before adding your answer.

Traffic: 803 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6