Entering edit mode
I am trying to get the sequence of 200bp upstream of all zebrafish transcripts.
I am using the following code:
library(GenomicFeatures)
library(BSgenome)
library(BSgenome.Drerio.UCSC.danRer10)
library(biomaRt)
txdb <- makeTxDbFromUCSC(genome="danRer10", tablename="refGene")
transcripts <- transcripts(txdb, columns = c("tx_id", "tx_name"))
# Get TSS
tss <- resize(transcripts, width = 1, fix = 'start')
# Get 200bp upstream of the TSS.
upstream <- flank(tss, 200)
# Trim any range that is out of chromosome boundaries
upstream <- trim(upstream)
upseq <- Views(Drerio, upstream)
This seems to work perfectly and I get:
> upseq
BSgenomeViews object with 16586 views and 2 metadata columns:
seqnames ranges strand dna | tx_id tx_name
<Rle> <IRanges> <Rle> <DNAStringSet> | <integer> <character>
[1] chr1 [ 11767, 11966] + [AGACGGGGAA...CGGTGACACA] | 1 NM_198371
[2] chr1 [ 27488, 27687] + [TGATGATGAT...TCGTGCAGTC] | 2 NM_001017766
[3] chr1 [ 36534, 36733] + [CTCTGTCTCT...AGTGTCTGCT] | 3 NM_001017577
[4] chr1 [109598, 109797] + [AGTAAGCGGG...GAGTTTGTGG] | 4 NM_001089558
[5] chr1 [119760, 119959] + [GATATATGCA...TTCAAACAGA] | 5 NM_131819
... ... ... ... ... . ... ...
[16582] chrUn_KN150696v1 [ 7666, 7865] + [GAGCCTTGAG...TTTGTGTTTT] | 16582 NM_131603
[16583] chrUn_KN150699v1 [ 8272, 8471] + [GTTCGGACAA...CTTGACTCGA] | 16583 NM_207058
[16584] chrUn_KN150699v1 [15187, 15386] + [AGCTGAGCTC...CTGATGTAGA] | 16584 NM_001320205
[16585] chrUn_KN150706v1 [ 4311, 4510] - [ATACTCCACT...NNNNNNNNNN] | 16585 NM_001199970
[16586] chrUn_KN150708v1 [31564, 31763] - [CGTTACTCTA...AGACACCACA] | 16586 NM_001256832
-------
seqinfo: 1061 sequences (1 circular) from danRer10 genome
Now, I'd like to get those sequences in a character vector. I thought I could do:
sequences <- lapply(upseq, as.character)
This works but takes several minutes, which makes me think I am doing something wrong. Could you suggest a specific way of extracting those sequences?
Thank you Martin, that was an obvious solution... and it is definitely much much faster. Any idea why `lapply` takes so long?
Approximately, lapply() is an iteration, whereas the above are vectorized; the iterations are expensive because they create a DNAStringSet for each element, then coerce to character -- creating length(upseq) DNAStringSets, instead of 1.