Rsamtools scanFa slow?
cmdcolin ▴ 10
@cmdcolin-12337
Last seen 7.1 years ago

Hi, I was looking to use Rsamtools to access some data from an indexed FASTA file, but I feel like it is a bit slow, especially the scanFa step.

system.time({
    idx = scanFaIndex('file.fa')
    myseq = as.character(getSeq(open(FaFile('file.fa')), idx[seqnames(idx) == 'seqid']))
})
   user  system elapsed
  2.286   0.175   2.460

For comparison, pulling the same sequence with samtools on the command line:

time samtools faidx file.fa seqid
0.35s user 0.05s system 98% cpu 0.403 total

I am just concerned because I may want to do this operation over many FASTA files to get orthologs from different species, and that could add up!

If there are any ideas for speeding it up let me know :)

Rsamtools • 1.2k views
@martin-morgan-1513
Last seen 4 days ago
United States

You could save on the overall cost by creating one instance of FaFile up front and by not coercing to character (DNAStringSet objects are actually very useful to work with).

fa = open(FaFile('file.fa'))
idx = scanFaIndex(fa)
mySeq = getSeq(fa, idx[seqnames(idx) == 'seqid'])
close(fa)

Remember also that getSeq() is vectorized, so it is easy and efficient to query many sequences in a single call.
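For instance, a minimal sketch of a single vectorized query (the file name and sequence ids here are placeholders, not from your data):

library(Rsamtools)

fa = open(FaFile('file.fa'))                           # placeholder indexed FASTA
idx = scanFaIndex(fa)
wanted = c('seqid1', 'seqid2', 'seqid3')               # placeholder sequence names
mySeqs = getSeq(fa, idx[seqnames(idx) %in% wanted])    # one vectorized call; returns a DNAStringSet
close(fa)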

I played around a bit using resources from AnnotationHub, for instance

> library(AnnotationHub)
> hub = AnnotationHub()
> query(hub, c("Homo_sapiens", "release-81", "dna"))
AnnotationHub with 4 records
# snapshotDate(): 2017-02-07 
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: FaFile
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH49183"]]' 

            title                                 
  AH49183 | Homo_sapiens.GRCh38.cdna.all.fa       
  AH49184 | Homo_sapiens.GRCh38.dna_rm.toplevel.fa
  AH49185 | Homo_sapiens.GRCh38.dna_sm.toplevel.fa
  AH49186 | Homo_sapiens.GRCh38.dna.toplevel.fa   
> fa = hub[["AH49186"]]
downloading from 'https://annotationhub.bioconductor.org/fetch/55651'
    'https://annotationhub.bioconductor.org/fetch/55652'
retrieving 2 resources
  |======================================================================| 100%
  |======================================================================| 100%
> file.size(path(fa), index(fa))
[1] 1099580445      20991
> system.time({ open(fa); idx = scanFaIndex(fa); getSeq(fa, idx[seqnames(idx) == "17"]) })
   user  system elapsed 
  1.884   0.016   1.901 

That's about 83M nucleotides from somewhere in the middle of the file. Or

> fa = hub[["AH49183"]]
loading from cache '/home/mtmorgan//.AnnotationHub/55645'
    '/home/mtmorgan//.AnnotationHub/55646'
> length(scanFaIndex(fa))
[1] 175372
> id = which.max(width(scanFaIndex(fa))); id
[1] 116918
> width(scanFaIndex(fa))[id]
[1] 109224
> system.time({ open(fa); idx = scanFaIndex(fa); getSeq(fa, idx[id]) })
   user  system elapsed 
  0.788   0.012   0.801 

That's a smaller sequence, but from towards the end of a more complicated file. And

> id = sample(length(idx), 100)
> system.time({ open(fa); idx = scanFaIndex(fa); res <- getSeq(fa, idx[id]) })
   user  system elapsed 
  0.732   0.000   0.733 

That's somehow faster (measurement error, I guess), even though it queries 100 sequences totalling (for the test above) about 213,000 nucleotides. The performance seems ok to me.

 


I think the issue with my particular dataset may be that the FASTA index is actually pretty large, with many, many scaffolds. I might have to bite the bullet and accept that scanFaIndex is a little slow, though only by a second or two. Thanks for the feedback!


To be more specific on the sizes of the FAI files: the smallest is 1.5 MB and the largest is 17 MB... darn ultra-fragmented genome assemblies.
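If the cost really is dominated by parsing those big .fai files, one possible workaround (just a sketch; the file paths and sequence ids below are placeholders) is to scan each index once per file, cache the resulting GRanges, and reuse it for every ortholog lookup against that file:

library(Rsamtools)

## cache of scanFaIndex() results, keyed by FASTA file path
faiCache = new.env(parent = emptyenv())

getOrtholog = function(fastaFile, seqid) {
    if (!exists(fastaFile, envir = faiCache, inherits = FALSE))
        assign(fastaFile, scanFaIndex(fastaFile), envir = faiCache)  # pay the .fai parsing cost once per file
    idx = get(fastaFile, envir = faiCache, inherits = FALSE)
    fa = open(FaFile(fastaFile))
    on.exit(close(fa))
    getSeq(fa, idx[seqnames(idx) == seqid])                          # returns a DNAStringSet
}

## hypothetical usage: repeated lookups against the same file reuse the cached index
# seqA = getOrtholog('species1.fa', 'scaffold_123')
# seqB = getOrtholog('species1.fa', 'scaffold_456')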

 

    


