Extracting sequences from BSgenomeViews object
1
1
Entering edit mode
@nicolaromano-13629
Last seen 9 months ago

I am trying to get the sequence of 200bp upstream of all zebrafish transcripts.

I am using the following code:

library(GenomicFeatures)
library(BSgenome)
library(BSgenome.Drerio.UCSC.danRer10)
library(biomaRt)
txdb <- makeTxDbFromUCSC(genome="danRer10", tablename="refGene")
transcripts <- transcripts(txdb, columns = c("tx_id", "tx_name"))

# Get TSS
tss <- resize(transcripts, width = 1, fix = 'start')
# Get 200bp upstream of the TSS. 
upstream <- flank(tss, 200)
# Trim any range that is out of chromosome boundaries 
upstream <- trim(upstream)

upseq <- Views(Drerio, upstream)

This seems to work perfectly and I get:

> upseq
BSgenomeViews object with 16586 views and 2 metadata columns:
                  seqnames           ranges strand                       dna |     tx_id      tx_name
                     <Rle>        <IRanges>  <Rle>            <DNAStringSet> | <integer>  <character>
      [1]             chr1 [ 11767,  11966]      + [AGACGGGGAA...CGGTGACACA] |         1    NM_198371
      [2]             chr1 [ 27488,  27687]      + [TGATGATGAT...TCGTGCAGTC] |         2 NM_001017766
      [3]             chr1 [ 36534,  36733]      + [CTCTGTCTCT...AGTGTCTGCT] |         3 NM_001017577
      [4]             chr1 [109598, 109797]      + [AGTAAGCGGG...GAGTTTGTGG] |         4 NM_001089558
      [5]             chr1 [119760, 119959]      + [GATATATGCA...TTCAAACAGA] |         5    NM_131819
      ...              ...              ...    ...                       ... .       ...          ...
  [16582] chrUn_KN150696v1   [ 7666,  7865]      + [GAGCCTTGAG...TTTGTGTTTT] |     16582    NM_131603
  [16583] chrUn_KN150699v1   [ 8272,  8471]      + [GTTCGGACAA...CTTGACTCGA] |     16583    NM_207058
  [16584] chrUn_KN150699v1   [15187, 15386]      + [AGCTGAGCTC...CTGATGTAGA] |     16584 NM_001320205
  [16585] chrUn_KN150706v1   [ 4311,  4510]      - [ATACTCCACT...NNNNNNNNNN] |     16585 NM_001199970
  [16586] chrUn_KN150708v1   [31564, 31763]      - [CGTTACTCTA...AGACACCACA] |     16586 NM_001256832
  -------
  seqinfo: 1061 sequences (1 circular) from danRer10 genome

Now, I'd like to get those sequences in a character vector. I thought I could do:

sequences <- lapply(upseq, as.character)

This works but takes several minutes, which makes me think I am doing something wrong. Could you suggest a specific way of extracting those sequences?

sequence bsgenome • 942 views
ADD COMMENT
2
Entering edit mode
@martin-morgan-1513
Last seen 10 weeks ago
United States

You could do DNAStringSet(upseq) to stay in the Bioconductor universe, or as.character(upseq) to work with plain character vectors.

ADD COMMENT
0
Entering edit mode

Thank you Martin, that was an obvious solution... and it is definitely much much faster. Any idea why `lapply` takes so long?

ADD REPLY
1
Entering edit mode

Approximately, lapply() is an iteration, whereas the above are vectorized; the iterations are expensive because they create a DNAStringSet for each element, then coerce to character -- creating length(upseq) DNAStringSets, instead of 1.

ADD REPLY

Login before adding your answer.

Traffic: 323 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6