ShortRead trim sequence at specific position
3
1
Entering edit mode
danova_fr ▴ 20
@danova_fr-7502
Last seen 7.5 years ago
Andorra

Hi,

I´m running some stats on fastq files as shown below

fq <- readFastq(file)

# get quality score list as PhredQuality
qual <- as(quality(fq), "matrix")

#Probability error per base
pe <- apply(qual,2, function(x){10^(-(x/10))}) # P=10^(Qscore/10)

#Cumulated error per read. The probabilty error is increased towards the end of the sequence
cum_pe <- apply(pe,1,cumsum)

cum_pe[,1]

[1] 5.011872e-04 8.992944e-04 1.297402e-03 1.695509e-03 2.093616e-03 2.491723e-03
  [7] 3.286051e-03 3.485578e-03 3.685104e-03 4.083211e-03 4.334400e-03 4.533926e-03
 [13] 8.514998e-03 9.016185e-03 9.414292e-03 1.442616e-02 1.642143e-02 1.681953e-02
.....
[295] 1.226770e+02 1.233080e+02 1.239390e+02 1.245699e+02 1.252009e+02 1.258318e+02

> dim(cum_pe)
[1]   300 56761   So that i have a matrix of 56761 reads of 300nt length each.

How can i trim each read end when the cum_pe gets > 1.0 . For example if the cum_pe is 1 at position 150 of the read i would like to trim at position 150.

 

Thanks,

 

 

 

shortread fastq • 1.3k views
ADD COMMENT
1
Entering edit mode
Mike Smith ★ 5.7k
@mike-smith
Last seen 18 hours ago
EMBL Heidelberg / de.NBI

You can do something like this to find the first point where each entry in cum_pe goes greater than 1.  If no entry does then the function will return the full length of the read.

trimPos <- apply(cum_pe, 2, function(x) { min(min(which(x > 1)), length(x)) } )

Then you can use narrow() to trim each read to the specified length.

fq2 <- narrow(x = fq, start = 1, end = trimPos)
ADD COMMENT
0
Entering edit mode

The calculation of pe in the original question can be vectorized

pe = 10^(-as(quality(fq), "matrix") / 10)

I might write the pos-finding function as one of

rowSums(apply(pe, 1, cumsum) < 1)
apply(pe, 1, function(x) sum(cumsum(x) < 1))

(Mike's fails if none of x > 1). One problem might be that early low-quality bases lead to premature termination

 

ADD REPLY
0
Entering edit mode

I think mine will catch instances where none of  > 1 since it'll return min(Inf, length(x)) in that case, but yours is a much more elegant solution.

ADD REPLY
0
Entering edit mode
danova_fr ▴ 20
@danova_fr-7502
Last seen 7.5 years ago
Andorra

Great Martin,

Works well. As for the low-quality bases if they are located at the 5´end of the sequence  I will remove them before trimming ends. In fact the trimming 3´end is a second filtering in my case.

 

 

 

ADD COMMENT
0
Entering edit mode
danova_fr ▴ 20
@danova_fr-7502
Last seen 7.5 years ago
Andorra

Hi,

following the example above there is something i don´t understand

trimPos <- apply(pe, 1, function(x) sum(cumsum(x) < 1))
    # Trim when pe <1
    # My matrix is transposed but it does the work nicely

> mean(trimPos)
277.3139

Ok so the mean trim position for all reads of my file  is 277. So far so good.

# Compute the mean probability at each position of the sequence
​> mean_pe <- apply(pe,2,mean)
# and then do the cumsum so that i get a vector with a mean probability at each position of the sequence
>cum_pe <- cumsum(mean_pe)
>cum_pe < 1 # get the cumulated probability error < 1 (this should 
            # normally be the same as the mean(trimPos)
147

What i don´t understand is that it should return 277. Since this is the mean probabilty error per position it should give the same results as the trimPos ?????

Can you help ??

 

ADD COMMENT

Login before adding your answer.

Traffic: 520 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6