Question

Converting a DNAStringSetList to characters quickly

0

Entering edit mode

dan.gatti • 0

@dangatti-7637

Last seen 7.5 years ago

I'm using read.vcf() in the VariantAnnotation package to get SNPs from a VCF file. It returns a CollapsedVCF object. I then extract the alternate allele calls using fixed(vcf)$ALT, which returns a DNAStringSetList. Each element contains a DNAStringSet with one or more characters (there are tri-morphic SNPs in this data set). I would then like to write this to a new file, concatenating the allele for each SNP with a comma. To be clear, I need to convert from this:

DNAStringSetList of length 6
[[1]] A
[[2]] C
[[3]] A G
[[4]] T
[[5]] A C T
[[6]] A

to this character vector:

[1] "A" "C" "A,G" "T" "A,C,T" "A"

and do it quickly for over a million SNPs. I have a slow method below, but I'm wondering if there is some slick trick that will do it more quickly. I checked the DNAStringSetList and DNAStringSet documentation and don't see a quicker way to make this conversion.

Here is sample code for 6 SNPs.

library(VariantAnnotation)
alt = DNAStringSetList("A", "C", c("A", "G"), "T", c("A", "C", "T"), "A")
x = lapply(alt, as.character)
x = sapply(x, paste, collapse = ",")

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base

other attached packages:
[1] VariantAnnotation_1.12.9 Rsamtools_1.18.3         Biostrings_2.34.1
[4] XVector_0.6.0            GenomicRanges_1.18.4     GenomeInfoDb_1.2.5
[7] IRanges_2.0.1            S4Vectors_0.4.0          BiocGenerics_0.12.1

loaded via a namespace (and not attached):
[1] AnnotationDbi_1.28.2    base64enc_0.1-2         BatchJobs_1.6
[4] BBmisc_1.9              Biobase_2.26.0          BiocParallel_1.0.3
[7] biomaRt_2.22.0          bitops_1.0-6            brew_1.0-6
[10] BSgenome_1.34.1         checkmate_1.5.2         codetools_0.2-11
[13] DBI_0.3.1               digest_0.6.8            fail_1.2
[16] foreach_1.4.2           GenomicAlignments_1.2.2 GenomicFeatures_1.18.7
[19] iterators_1.0.7         RCurl_1.95-4.5          RSQLite_1.0.0
[22] rtracklayer_1.26.3      sendmailR_1.2-1         stringr_0.6.2
[25] tools_3.1.1             XML_3.98-1.1            zlibbioc_1.12.0

Thanks in advance,

Daniel Gatti

The Jackson Laboratory

readVcf DNAStringSet DNAStringSetList variantannotation readvcf • 4.3k views

ADD COMMENT • link 9.0 years ago dan.gatti • 0

score 1 · Answer 1 · 2015-04-21

1

Entering edit mode

Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 2.3 years ago

United States

Hi,

These tips aren't on the man page but are just general strategies for working with long character lists/vectors. The CharacterList coercion is faster than the lapply:

clist <- CharacterList(alt)

Paste and replace only the elements with multiple alts:

mult <- elementLengths(clist) > 1L
clist[mult] <- lapply(clist[mult], paste0, collapse=",")
> unlist(clist)
[1] "A"     "C"     "A,G"   "T"     "A,C,T" "A"

I'm guessing you're aware of writeVcf() but instead are interested in writing out just the alts?

Valerie

ADD COMMENT • link 9.0 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Hi Dan, Val,

You can use unstrsplit() for concatenating the allele for each SNP with a comma. Should be much faster than lapply( , paste0).

Cheers,

H.

ADD REPLY • link 9.0 years ago Hervé Pagès 16k

score 0 · Answer 2 · 2015-04-21

Thanks. The CharacterList is a new class for me. I need to write out a table with polymorphic SNPs in a subset of the strains from the VCF. And they need to be coded as 0 or 1. Otherwise, I'd use writeVcf().

Just to close out the question for others, I performed a few timing runs. The last two methods are barely distinguishable in terms of time. I decided to go with:

x = CharacterList(alt)
x = unstrsplit(x, sep = ",")

> library(VariantAnnotation)
> alt = replicate(1000, expr =
+          sample(x = c("A", "C", "G", "T"), size = sample(1:3, prob = c(0.9, 0.05, 0.05))))
> alt = DNAStringSetList(alt)
>
> system.time({
+   x = lapply(alt, as.character)
+   x = sapply(x, paste, collapse = ",")
+ })
   user system elapsed
   3.76    0.00    3.76
>
> system.time({
+   x = CharacterList(alt)
+   x = sapply(x, paste, collapse = ",")
+ })
   user system elapsed
   0.17    0.00    0.17
>
> system.time({
+   x = CharacterList(alt)
+   mult = elementLengths(x) > 1L
+   x[mult] = lapply(x[mult], paste0, collapse = ",")
+   x = unlist(x)
+ })
   user system elapsed
   0.03    0.00    0.04
>
> system.time({
+   x = CharacterList(alt)
+   x = unstrsplit(x, sep = ",")
+ })
   user system elapsed
   0.01    0.00    0.01

Thanks for your help!

score 0 · Answer 3 · 2015-04-21

"barely distinguishable" really?

CharacterList() is about 400x faster than lapply(alt, as.character):

system.time(x <- lapply(alt, as.character))
#   user  system elapsed 
#  1.848   0.000   1.850 
system.time(x <- CharacterList(alt))
#   user  system elapsed 
#  0.004   0.000   0.004

And unstrsplit() is about 20x faster than the lapply() solution:

unstrsplit2 <- function(x)
{
  mult <- elementLengths(x) > 1L
  x[mult] <- lapply(x[mult], paste0, collapse = ",")
  unlist(x, use.names=FALSE)
}
library(microbenchmark)
microbenchmark(unstrsplit(x, sep=","), unstrsplit2(x))
# Unit: microseconds
#                      expr       min        lq       mean    median        uq
#  unstrsplit(x, sep = ",")   443.123   484.241   513.8259   517.505   528.236
#            unstrsplit2(x) 10229.687 10366.235 10759.4753 10464.023 10729.687
#        max neval
#    796.788   100
#  15143.757   100

Cheers,

H.

score 0 · Answer 4 · 2015-04-22

0

Entering edit mode

dan.gatti • 0

@dangatti-7637

Last seen 7.5 years ago

Good point. I looked at the 0.04 and 0.01 and misinterpreted them. Thanks for the clarification.

ADD COMMENT • link 9.0 years ago dan.gatti • 0