It sounds a little like you're working with a position weight matrix, so maybe ?Biostrings::PWM will help.
Alternatively, you could read in the fasta file using
library(Biostrings)
dna = readDNAStringSet("my/file.fasta")
wd = width(dna)
Suppose your scores are in a matrix
m = matrix(rnorm(4 * max(width(dna))), 4, dimnames = list(c("A", "C", "G", "T"), NULL))
Split the DNA into individual letters, and figure out the row and column index to find the score for each letter in m
letters = strsplit(as.character(unlist(dna)), "")[[1]]
ridx = match(letters, rownames(m))
cidx = seq_len(sum(wd)) - rep(c(0, head(cumsum(wd), -1)), wd)
The scores are then
scores = m[cbind(ridx, cidx)]
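As a small worked illustration of the indexing above, here is a sketch that uses plain character strings in place of a DNAStringSet, so it runs without Biostrings. The toy reads and the values in the score matrix are made up for the example.

```r
# Toy stand-ins (assumptions for illustration): two reads of widths 3 and 2,
# and a 4 x 3 score matrix filled with easily recognised values.
reads = c("ACG", "TA")
wd = nchar(reads)
m = matrix(1:12, 4, dimnames = list(c("A", "C", "G", "T"), NULL))

# One element per nucleotide, across all reads
letters = strsplit(paste(reads, collapse = ""), "")[[1]]

# Row index: which nucleotide; column index: position within its own read
ridx = match(letters, rownames(m))
cidx = seq_len(sum(wd)) - rep(c(0, head(cumsum(wd), -1)), wd)

# A two-column index matrix picks one (row, column) cell per nucleotide
scores = m[cbind(ridx, cidx)]
```

Here cidx restarts at 1 for each read, so every nucleotide is scored against the column for its position within its own read rather than its position in the concatenated sequence.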
These can be reshaped as a list-of-scores with
scores_list = relist(scores, dna)
If all reads are the same width, one could make a matrix
scores_matrix = matrix(scores, length(dna), byrow = TRUE)
Operate on these, e.g.,
score = sum(scores_list)
score = rowSums(scores_matrix)
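For comparison, the per-read sums can also be sketched in base R alone. This is an alternative illustration and not the relist() approach above; the scores are made up, and read_id is a hypothetical helper vector introduced for the example.

```r
# Made-up per-nucleotide scores for three reads, each of width 2
wd = c(2L, 2L, 2L)
scores = c(1, 2, 3, 4, 5, 6)

# Group label for every nucleotide: the index of the read it came from
read_id = rep(seq_along(wd), wd)

# rowsum() adds up the values within each group in one vectorized call
per_read = rowsum(scores, read_id)[, 1]

# Widths are equal, so the scores also fit a reads-by-positions matrix
scores_matrix = matrix(scores, length(wd), byrow = TRUE)
per_read_matrix = rowSums(scores_matrix)
```

Both routes give one total per read. Note that rowsum() only reports groups that actually occur, so a zero-width read would be missing from per_read.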
This uses some extra memory (to represent the sequences as a character vector, and then the individual nucleotides as letters). But the other calculations, especially computing cidx, looking up scores in m, sum(), and rowSums(), are vectorized (a single call to an R function) rather than iterative (for, apply, etc., with many calls to an R function), so they are fast.
Two small tweaks. Reversing the order of c() and head() makes the line robust to zero-length wd. Using as.integer() avoids the (relatively expensive) need to create a vector-of-characters, so is both more memory efficient and faster.
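The first tweak might look like the sketch below, written in base R so it runs without Biostrings. The function name col_index is hypothetical, introduced only to show the reordered call.

```r
# Column index within each read, with head() applied after c() so that an
# empty width vector gives an empty offset vector (col_index is a
# hypothetical helper name used only for this illustration)
col_index = function(wd) {
    offsets = head(c(0, cumsum(wd)), -1)   # start offset of each read
    seq_len(sum(wd)) - rep(offsets, wd)
}

with_reads = col_index(c(3L, 2L))   # positions restart at 1 for every read
no_reads = col_index(integer(0))    # zero reads: a zero-length result
```

With the original ordering the offsets vector always has at least one element (the leading 0), which no longer matches a zero-length wd; putting c() inside head() keeps the two vectors the same length.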