Rsamtools scanFa slow?
cmdcolin ▴ 10
@cmdcolin-12337
Last seen 7.1 years ago

Hi, I was looking to use Rsamtools to access some data from an indexed FASTA file, but I feel like it is a bit slow, especially the scanFa step.

system.time({
    idx = scanFaIndex('file.fa')
    myseq = as.character(getSeq(open(FaFile('file.fa')), idx[seqnames(idx) == 'seqid']))
})
   user  system elapsed
  2.286   0.175   2.460

For comparison, pulling the same sequence with samtools on the command line:

time samtools faidx file.fa seqid
0.35s user 0.05s system 98% cpu 0.403 total

I am just concerned because I may want to do this operation over many FASTA files to get orthologs from different species, and that could add up!

If there are any ideas for speeding it up let me know :)

Rsamtools • 1.2k views
@martin-morgan-1513
Last seen 4 days ago
United States

You could save on the overall cost by creating one instance of FaFile up front and by not coercing to character (DNAStringSet objects are actually very useful to work with).

fa = open(FaFile('file.fa'))
idx = scanFaIndex(fa)
mySeq = getSeq(fa, idx[seqnames(idx) == 'seqid'])
close(fa)

Remember also that getSeq() is vectorized, so it is easy and efficient to query many sequences in a single call.
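For instance, a minimal sketch of a single vectorized query (the file name and sequence ids here are placeholders, not from your data):

library(Rsamtools)

fa = open(FaFile('file.fa'))                           # placeholder indexed FASTA
idx = scanFaIndex(fa)
wanted = c('seqid1', 'seqid2', 'seqid3')               # placeholder sequence names
mySeqs = getSeq(fa, idx[seqnames(idx) %in% wanted])    # one vectorized call; returns a DNAStringSet
close(fa)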

I played around a bit using resources from AnnotationHub, for instance

> library(AnnotationHub)
> hub = AnnotationHub()
> query(hub, c("Homo_sapiens", "release-81", "dna"))
AnnotationHub with 4 records
# snapshotDate(): 2017-02-07 
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: FaFile
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH49183"]]' 

            title                                 
  AH49183 | Homo_sapiens.GRCh38.cdna.all.fa       
  AH49184 | Homo_sapiens.GRCh38.dna_rm.toplevel.fa
  AH49185 | Homo_sapiens.GRCh38.dna_sm.toplevel.fa
  AH49186 | Homo_sapiens.GRCh38.dna.toplevel.fa   
> fa = hub[["AH49186"]]
downloading from 'https://annotationhub.bioconductor.org/fetch/55651'
    'https://annotationhub.bioconductor.org/fetch/55652'
retrieving 2 resources
  |======================================================================| 100%
  |======================================================================| 100%
> file.size(path(fa), index(fa))
[1] 1099580445      20991
> system.time({ open(fa); idx = scanFaIndex(fa); getSeq(fa, idx[seqnames(idx) == "17"]) })
   user  system elapsed 
  1.884   0.016   1.901 

That's about 83M nucleotides from somewhere in the middle of the file. Or

> fa = hub[["AH49183"]]
loading from cache '/home/mtmorgan//.AnnotationHub/55645'
    '/home/mtmorgan//.AnnotationHub/55646'
> length(scanFaIndex(fa))
[1] 175372
> id = which.max(width(scanFaIndex(fa))); id
[1] 116918
> width(scanFaIndex(fa))[id]
[1] 109224
> system.time({ open(fa); idx = scanFaIndex(fa); getSeq(fa, idx[id]) })
   user  system elapsed 
  0.788   0.012   0.801 

That's a smaller sequence, but from towards the end of a more complicated file. And

> id = sample(length(idx), 100)
> system.time({ open(fa); idx = scanFaIndex(fa); res <- getSeq(fa, idx[id]) })
   user  system elapsed 
  0.732   0.000   0.733 

That's somehow faster (measurement error, I guess), even though it queries 100 sequences totalling (for the test above) about 213,000 nucleotides. The performance seems ok to me.

 


I think the issue with my particular dataset may be that the FASTA index is actually pretty large, with many, many scaffolds. I might have to bite the bullet and accept that scanFaIndex is a little slow, though only by a second or two. Thanks for the feedback!


To be more specific on the sizes of the FAI files: the smallest is 1.5 MB and the largest is 17 MB... darn ultra-fragmented genome assemblies.
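If the cost really is dominated by parsing those big .fai files, one possible workaround (just a sketch; the file paths and sequence ids below are placeholders) is to scan each index once per file, cache the resulting GRanges, and reuse it for every ortholog lookup against that file:

library(Rsamtools)

## cache of scanFaIndex() results, keyed by FASTA file path
faiCache = new.env(parent = emptyenv())

getOrtholog = function(fastaFile, seqid) {
    if (!exists(fastaFile, envir = faiCache, inherits = FALSE))
        assign(fastaFile, scanFaIndex(fastaFile), envir = faiCache)  # pay the .fai parsing cost once per file
    idx = get(fastaFile, envir = faiCache, inherits = FALSE)
    fa = open(FaFile(fastaFile))
    on.exit(close(fa))
    getSeq(fa, idx[seqnames(idx) == seqid])                          # returns a DNAStringSet
}

## hypothetical usage: repeated lookups against the same file reuse the cached index
# seqA = getOrtholog('species1.fa', 'scaffold_123')
# seqB = getOrtholog('species1.fa', 'scaffold_456')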

 

    


