ShortRead trim sequence at specific position
3
1
Entering edit mode
danova_fr ▴ 20
@danova_fr-7502
Last seen 7.5 years ago
Andorra

Hi,

I´m running some stats on fastq files as shown below

fq <- readFastq(file)

# get quality score list as PhredQuality
qual <- as(quality(fq), "matrix")

#Probability error per base
pe <- apply(qual,2, function(x){10^(-(x/10))}) # P=10^(Qscore/10)

#Cumulated error per read. The probabilty error is increased towards the end of the sequence
cum_pe <- apply(pe,1,cumsum)

cum_pe[,1]

 5.011872e-04 8.992944e-04 1.297402e-03 1.695509e-03 2.093616e-03 2.491723e-03
 3.286051e-03 3.485578e-03 3.685104e-03 4.083211e-03 4.334400e-03 4.533926e-03
 8.514998e-03 9.016185e-03 9.414292e-03 1.442616e-02 1.642143e-02 1.681953e-02
.....
 1.226770e+02 1.233080e+02 1.239390e+02 1.245699e+02 1.252009e+02 1.258318e+02

> dim(cum_pe)
   300 56761   So that i have a matrix of 56761 reads of 300nt length each.

How can i trim each read end when the cum_pe gets > 1.0 . For example if the cum_pe is 1 at position 150 of the read i would like to trim at position 150.

Thanks,

shortread fastq • 1.3k views
1
Entering edit mode
Mike Smith ★ 5.7k
@mike-smith
Last seen 18 hours ago
EMBL Heidelberg / de.NBI

You can do something like this to find the first point where each entry in cum_pe goes greater than 1.  If no entry does then the function will return the full length of the read.

trimPos <- apply(cum_pe, 2, function(x) { min(min(which(x > 1)), length(x)) } )

Then you can use narrow() to trim each read to the specified length.

fq2 <- narrow(x = fq, start = 1, end = trimPos)
0
Entering edit mode

The calculation of pe in the original question can be vectorized

pe = 10^(-as(quality(fq), "matrix") / 10)

I might write the pos-finding function as one of

rowSums(apply(pe, 1, cumsum) < 1)
apply(pe, 1, function(x) sum(cumsum(x) < 1))

(Mike's fails if none of x > 1). One problem might be that early low-quality bases lead to premature termination

0
Entering edit mode

I think mine will catch instances where none of  > 1 since it'll return min(Inf, length(x)) in that case, but yours is a much more elegant solution.

0
Entering edit mode
danova_fr ▴ 20
@danova_fr-7502
Last seen 7.5 years ago
Andorra

Great Martin,

Works well. As for the low-quality bases if they are located at the 5´end of the sequence  I will remove them before trimming ends. In fact the trimming 3´end is a second filtering in my case.

0
Entering edit mode
danova_fr ▴ 20
@danova_fr-7502
Last seen 7.5 years ago
Andorra

Hi,

following the example above there is something i don´t understand

trimPos <- apply(pe, 1, function(x) sum(cumsum(x) < 1))
# Trim when pe <1
# My matrix is transposed but it does the work nicely

> mean(trimPos)
277.3139

Ok so the mean trim position for all reads of my file  is 277. So far so good.

# Compute the mean probability at each position of the sequence
​> mean_pe <- apply(pe,2,mean)
# and then do the cumsum so that i get a vector with a mean probability at each position of the sequence
>cum_pe <- cumsum(mean_pe)
>cum_pe < 1 # get the cumulated probability error < 1 (this should
# normally be the same as the mean(trimPos)
147

What i don´t understand is that it should return 277. Since this is the mean probabilty error per position it should give the same results as the trimPos ?????

Can you help ??