Search
Question: Converting a DNAStringSetList to characters quickly
0
gravatar for dan.gatti
3.3 years ago by
dan.gatti0
dan.gatti0 wrote:

I'm using read.vcf() in the VariantAnnotation package to get SNPs from a VCF file. It returns a CollapsedVCF object. I then extract the alternate allele calls using fixed(vcf)$ALT, which returns a DNAStringSetList. Each element contains a DNAStringSet with one or more characters (there are tri-morphic SNPs in this data set). I would then like to write this to a new file, concatenating the allele for each SNP with a comma. To be clear, I need to convert from this:

DNAStringSetList of length 6
[[1]] A
[[2]] C
[[3]] A G
[[4]] T
[[5]] A C T
[[6]] A

to this character vector:

[1] "A"     "C"     "A,G"   "T"     "A,C,T" "A"

and do it quickly for over a million SNPs. I have a slow method below, but I'm wondering if there is some slick trick that will do it more quickly.  I checked the DNAStringSetList and DNAStringSet documentation and don't see a quicker way to make this conversion.

Here is sample code for 6 SNPs.

library(VariantAnnotation)
alt = DNAStringSetList("A", "C", c("A", "G"), "T", c("A", "C", "T"), "A")
x = lapply(alt, as.character)
x = sapply(x, paste, collapse = ",")

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
[1] VariantAnnotation_1.12.9 Rsamtools_1.18.3         Biostrings_2.34.1       
[4] XVector_0.6.0            GenomicRanges_1.18.4     GenomeInfoDb_1.2.5      
[7] IRanges_2.0.1            S4Vectors_0.4.0          BiocGenerics_0.12.1     

loaded via a namespace (and not attached):
 [1] AnnotationDbi_1.28.2    base64enc_0.1-2         BatchJobs_1.6          
 [4] BBmisc_1.9              Biobase_2.26.0          BiocParallel_1.0.3     
 [7] biomaRt_2.22.0          bitops_1.0-6            brew_1.0-6             
[10] BSgenome_1.34.1         checkmate_1.5.2         codetools_0.2-11       
[13] DBI_0.3.1               digest_0.6.8            fail_1.2               
[16] foreach_1.4.2           GenomicAlignments_1.2.2 GenomicFeatures_1.18.7
[19] iterators_1.0.7         RCurl_1.95-4.5          RSQLite_1.0.0          
[22] rtracklayer_1.26.3      sendmailR_1.2-1         stringr_0.6.2          
[25] tools_3.1.1             XML_3.98-1.1            zlibbioc_1.12.0

Thanks in advance,

Daniel Gatti

The Jackson Laboratory

ADD COMMENTlink modified 3.3 years ago • written 3.3 years ago by dan.gatti0
1
gravatar for Valerie Obenchain
3.3 years ago by
Valerie Obenchain ♦♦ 6.5k
United States
Valerie Obenchain ♦♦ 6.5k wrote:

Hi,

These tips aren't on the man page but are just general strategies for working with long character lists/vectors. The CharacterList coercion is faster than the lapply:

clist <- CharacterList(alt)

Paste and replace only the elements with multiple alts:

mult <- elementLengths(clist) > 1L
clist[mult] <- lapply(clist[mult], paste0, collapse=",")
> unlist(clist)
[1] "A"     "C"     "A,G"   "T"     "A,C,T" "A"    

I'm guessing you're aware of writeVcf() but instead are interested in writing out just the alts?

Valerie

ADD COMMENTlink modified 3.3 years ago • written 3.3 years ago by Valerie Obenchain ♦♦ 6.5k

Hi Dan, Val,

You can use unstrsplit() for concatenating the allele for each SNP with a comma. Should be much faster than lapply( , paste0).

Cheers,

H.

ADD REPLYlink written 3.3 years ago by Hervé Pagès ♦♦ 13k
0
gravatar for dan.gatti
3.3 years ago by
dan.gatti0
dan.gatti0 wrote:

Thanks. The CharacterList is a new class for me. I need to write out a table with polymorphic SNPs in a subset of the strains from the VCF. And they need to be coded as 0 or 1. Otherwise, I'd use writeVcf().

Just to close out the question for others, I performed a few timing runs. The last two methods are barely distinguishable in terms of time. I decided to go with:

 x = CharacterList(alt)
 x = unstrsplit(x, sep = ",")

 

> library(VariantAnnotation)
> alt = replicate(1000, expr =
+          sample(x = c("A", "C", "G", "T"), size = sample(1:3, prob = c(0.9, 0.05, 0.05))))
> alt = DNAStringSetList(alt)
>
> system.time({
+   x = lapply(alt, as.character)
+   x = sapply(x, paste, collapse = ",")
+ })
   user  system elapsed
   3.76    0.00    3.76
>
> system.time({
+   x = CharacterList(alt)
+   x = sapply(x, paste, collapse = ",")
+ })
   user  system elapsed
   0.17    0.00    0.17
>
> system.time({
+   x = CharacterList(alt)
+   mult = elementLengths(x) > 1L
+   x[mult] = lapply(x[mult], paste0, collapse = ",")
+   x = unlist(x)
+ })
   user  system elapsed
   0.03    0.00    0.04
>
> system.time({
+   x = CharacterList(alt)
+   x = unstrsplit(x, sep = ",")
+ })
   user  system elapsed
   0.01    0.00    0.01

 

Thanks for your help!

ADD COMMENTlink written 3.3 years ago by dan.gatti0
0
gravatar for Hervé Pagès
3.3 years ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:

"barely distinguishable" really?

CharacterList() is about 400x faster than lapply(alt, as.character):

system.time(x <- lapply(alt, as.character))
#   user  system elapsed 
#  1.848   0.000   1.850 
system.time(x <- CharacterList(alt))
#   user  system elapsed 
#  0.004   0.000   0.004 

And unstrsplit() is about 20x faster than the lapply() solution:

unstrsplit2 <- function(x)
{
  mult <- elementLengths(x) > 1L
  x[mult] <- lapply(x[mult], paste0, collapse = ",")
  unlist(x, use.names=FALSE)
}
library(microbenchmark)
microbenchmark(unstrsplit(x, sep=","), unstrsplit2(x))
# Unit: microseconds
#                      expr       min        lq       mean    median        uq
#  unstrsplit(x, sep = ",")   443.123   484.241   513.8259   517.505   528.236
#            unstrsplit2(x) 10229.687 10366.235 10759.4753 10464.023 10729.687
#        max neval
#    796.788   100
#  15143.757   100

Cheers,

H.

ADD COMMENTlink modified 3.3 years ago • written 3.3 years ago by Hervé Pagès ♦♦ 13k
0
gravatar for dan.gatti
3.3 years ago by
dan.gatti0
dan.gatti0 wrote:

Good point. I looked at the 0.04 and 0.01 and misinterpreted them. Thanks for the clarification.
 

ADD COMMENTlink written 3.3 years ago by dan.gatti0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 132 users visited in the last hour