Search
Question: ShortRead trim sequence at specific position
1
3.6 years ago by
danova_fr20
Andorra
danova_fr20 wrote:

Hi,

I´m running some stats on fastq files as shown below

fq <- readFastq(file)

# get quality score list as PhredQuality
qual <- as(quality(fq), "matrix")

#Probability error per base
pe <- apply(qual,2, function(x){10^(-(x/10))}) # P=10^(Qscore/10)

#Cumulated error per read. The probabilty error is increased towards the end of the sequence
cum_pe <- apply(pe,1,cumsum)

cum_pe[,1]

[1] 5.011872e-04 8.992944e-04 1.297402e-03 1.695509e-03 2.093616e-03 2.491723e-03
[7] 3.286051e-03 3.485578e-03 3.685104e-03 4.083211e-03 4.334400e-03 4.533926e-03
[13] 8.514998e-03 9.016185e-03 9.414292e-03 1.442616e-02 1.642143e-02 1.681953e-02
.....
[295] 1.226770e+02 1.233080e+02 1.239390e+02 1.245699e+02 1.252009e+02 1.258318e+02

> dim(cum_pe)
[1]   300 56761   So that i have a matrix of 56761 reads of 300nt length each.

How can i trim each read end when the cum_pe gets > 1.0 . For example if the cum_pe is 1 at position 150 of the read i would like to trim at position 150.

Thanks,

modified 3.5 years ago • written 3.6 years ago by danova_fr20
1
3.6 years ago by
Mike Smith3.1k
EMBL Heidelberg / de.NBI
Mike Smith3.1k wrote:

You can do something like this to find the first point where each entry in cum_pe goes greater than 1.  If no entry does then the function will return the full length of the read.

trimPos <- apply(cum_pe, 2, function(x) { min(min(which(x > 1)), length(x)) } )

Then you can use narrow() to trim each read to the specified length.

fq2 <- narrow(x = fq, start = 1, end = trimPos)

The calculation of pe in the original question can be vectorized

pe = 10^(-as(quality(fq), "matrix") / 10)

I might write the pos-finding function as one of

rowSums(apply(pe, 1, cumsum) < 1)
apply(pe, 1, function(x) sum(cumsum(x) < 1))

(Mike's fails if none of x > 1). One problem might be that early low-quality bases lead to premature termination

I think mine will catch instances where none of  > 1 since it'll return min(Inf, length(x)) in that case, but yours is a much more elegant solution.

0
3.6 years ago by
danova_fr20
Andorra
danova_fr20 wrote:

Great Martin,

Works well. As for the low-quality bases if they are located at the 5´end of the sequence  I will remove them before trimming ends. In fact the trimming 3´end is a second filtering in my case.

0
3.5 years ago by
danova_fr20
Andorra
danova_fr20 wrote:

Hi,

following the example above there is something i don´t understand

trimPos <- apply(pe, 1, function(x) sum(cumsum(x) < 1))
# Trim when pe <1
# My matrix is transposed but it does the work nicely

> mean(trimPos)
277.3139

Ok so the mean trim position for all reads of my file  is 277. So far so good.

# Compute the mean probability at each position of the sequence
​> mean_pe <- apply(pe,2,mean)
# and then do the cumsum so that i get a vector with a mean probability at each position of the sequence
>cum_pe <- cumsum(mean_pe)
>cum_pe < 1 # get the cumulated probability error < 1 (this should
# normally be the same as the mean(trimPos)
147

What i don´t understand is that it should return 277. Since this is the mean probabilty error per position it should give the same results as the trimPos ?????

Can you help ??

Content
Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.