Scoring a seq based on a matrix
1
0
Entering edit mode
ahmad.iut • 0
@ahmadiut-15858
Last seen 6.6 years ago

Dear Bioconductors,

Is there any Bioconductor package for scoring a fasta file based on a customized matrix? If not, could any body help to write a function for it?

I have a multi fasta file and want to score each seq based on a matrix like below:

The matrix:

base position1 pos2 pos3 pos4
A 1 0 1 0
T 0 0 1 0
C 0 1 1 1
G 1 1 0 1

 

The seq:

>seq1

ATCG

The score:

1+0+1+1=3

Thank you in advance

         
         
         
         
         

 

matrix • 922 views
ADD COMMENT
3
Entering edit mode
@martin-morgan-1513
Last seen 5 months ago
United States

It sounds a little like you're working with a position weight matrix, so maybe ?Biostrings::PWM will help.

Alternatively, you could read in the fasta file using

library(Biostrings)
dna = readDNAStringSet("my/file.fastq")
wd  = width(dna)

Suppose your scores are in a matrix

m = matrix(rnorm(4 * max(width(dna))), 4, dimnames = list(c("A", "C", "G", "T"), NULL))

Split the DNA into individual letters, and figure out the row and column index to find the score for each letter in m

letters = strsplit(as.character(unlist(dna)), "")[[1]]
ridx = match(letters, rownames(m))
cidx = seq_len(sum(wd)) - rep(c(0, head(cumsum(wd), -1)), wd)

The scores are then

scores = m[cbind(ridx, cidx)]

These can be reshaped as a list-of-scores with

scores_list = relist(scores, dna)

If all reads are the same width, one could make a matrix

scores_matrix = matrix(scores, wd, byrow = TRUE)

Operate on these, e.g.,

score = sum(scores_list)
score = rowSums(scores_matrix) 

This uses some extra memory (to represent the sequences as a character vector and then individual nucleotides as letters), but the other calculations, especially cidx, looking up scores in m, sum() and rowSums(), are vectorized (a single call to an R function) rather than iterations (for, apply, etc, with multiple calls to an R function) so fast.

ADD COMMENT
0
Entering edit mode

Two small tweaks. Reversing the order of c() and head()

cidx = seq_len(sum(wd)) - rep(head(c(0, cumsum(wd)), -1), wd)

makes the line robust to zero-length wd. Using as.integer()

ridx = match(as.integer(unlist(dna)), as.integer(unlist(DNAStringSet(rownames(m)))))

avoids the (relatively expensive) need to create a vector-of-characters so is both more memory efficient and faster

ADD REPLY

Login before adding your answer.

Traffic: 527 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6