Question

Scoring a seq based on a matrix

0

Entering edit mode

ahmad.iut • 0

@ahmadiut-15858

Last seen 7.7 years ago

Dear Bioconductors,

Is there any Bioconductor package for scoring a fasta file based on a customized matrix? If not, could any body help to write a function for it?

I have a multi fasta file and want to score each seq based on a matrix like below:

The matrix:

base	position1	pos2	pos3	pos4
A	1	0	1	0
T	0	0	1	0
C	0	1	1	1
G	1	1	0	1

The seq:

>seq1

ATCG

The score:

1+0+1+1=3

Thank you in advance

matrix • 1.2k views

ADD COMMENT • link updated 7.7 years ago by Martin Morgan 25k • written 7.7 years ago by ahmad.iut • 0

score 3 · Accepted Answer · 2018-05-19

It sounds a little like you're working with a position weight matrix, so maybe ?Biostrings::PWM will help.

Alternatively, you could read in the fasta file using

library(Biostrings)
dna = readDNAStringSet("my/file.fastq")
wd  = width(dna)

Suppose your scores are in a matrix

m = matrix(rnorm(4 * max(width(dna))), 4, dimnames = list(c("A", "C", "G", "T"), NULL))

Split the DNA into individual letters, and figure out the row and column index to find the score for each letter in m

letters = strsplit(as.character(unlist(dna)), "")[[1]]
ridx = match(letters, rownames(m))
cidx = seq_len(sum(wd)) - rep(c(0, head(cumsum(wd), -1)), wd)

The scores are then

scores = m[cbind(ridx, cidx)]

These can be reshaped as a list-of-scores with

scores_list = relist(scores, dna)

If all reads are the same width, one could make a matrix

scores_matrix = matrix(scores, wd, byrow = TRUE)

Operate on these, e.g.,

score = sum(scores_list)
score = rowSums(scores_matrix)

This uses some extra memory (to represent the sequences as a character vector and then individual nucleotides as letters), but the other calculations, especially cidx, looking up scores in m, sum() and rowSums(), are vectorized (a single call to an R function) rather than iterations (for, apply, etc, with multiple calls to an R function) so fast.