Question: ShortRead trim sequence at specific position
danova_fr wrote:

Hi,

I'm running some stats on FASTQ files, as shown below:

library(ShortRead)
fq <- readFastq(file)

# quality scores as an integer (Phred) matrix: reads x cycles
qual <- as(quality(fq), "matrix")

# per-base error probability: P = 10^(-Q/10)
pe <- apply(qual, 2, function(x) 10^(-x/10))
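As a quick check of that relationship, Phred scores of 10, 20 and 30 correspond to error probabilities of 0.1, 0.01 and 0.001:

10^(-c(10, 20, 30)/10)   # 0.1 0.01 0.001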

# cumulative error per read; the error accumulates towards the end of the sequence
cum_pe <- apply(pe, 1, cumsum)

cum_pe[,1]

[1] 5.011872e-04 8.992944e-04 1.297402e-03 1.695509e-03 2.093616e-03 2.491723e-03
  [7] 3.286051e-03 3.485578e-03 3.685104e-03 4.083211e-03 4.334400e-03 4.533926e-03
 [13] 8.514998e-03 9.016185e-03 9.414292e-03 1.442616e-02 1.642143e-02 1.681953e-02
.....
[295] 1.226770e+02 1.233080e+02 1.239390e+02 1.245699e+02 1.252009e+02 1.258318e+02

> dim(cum_pe)
[1]   300 56761

So I have a matrix covering 56761 reads of 300 nt each (positions in rows, reads in columns).

How can I trim the end of each read where cum_pe exceeds 1.0? For example, if cum_pe reaches 1 at position 150 of a read, I would like to trim that read at position 150.

Thanks,
Mike Smith wrote:

You can do something like this to find the first point where each entry in cum_pe goes above 1. If no entry does, the function returns the full length of the read.

# first position where the cumulative error exceeds 1, or the full read length (one value per read)
trimPos <- apply(cum_pe, 2, function(x) min(min(which(x > 1)), length(x)))

Then you can use narrow() to trim each read to the specified length.

fq2 <- narrow(x = fq, start = 1, end = trimPos)
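If you then want to save the trimmed reads, something like this should work (the output filename is just an example):

writeFastq(fq2, "trimmed.fastq.gz")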

Martin Morgan commented:

The calculation of pe in the original question can be vectorized:

pe = 10^(-as(quality(fq), "matrix") / 10)

I might write the pos-finding function as one of

colSums(apply(pe, 1, cumsum) < 1)              # reads end up in the columns after apply()
apply(pe, 1, function(x) sum(cumsum(x) < 1))

(Mike's fails if none of x > 1.) One problem might be that early low-quality bases lead to premature termination.

Mike Smith replied:

I think mine will catch instances where none of x > 1, since it'll return min(Inf, length(x)) in that case, but yours is a much more elegant solution.
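For instance, a quick check with a toy vector:

x <- c(0.2, 0.3)                      # cumulative errors that never exceed 1
min(which(x > 1))                     # Inf (with a warning about an empty argument)
min(min(which(x > 1)), length(x))     # 2, i.e. the full read length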

danova_fr wrote:

Great, Martin,

That works well. As for low-quality bases located at the 5' end of the sequence, I will remove them before trimming the ends; in fact, the 3'-end trimming is a second filtering step in my case.
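For instance, a minimal sketch of that 5' clean-up before the cumulative trim, assuming a Phred cutoff of 20 (the cutoff is just an example) and using only functions already shown in this thread:

qual <- as(quality(fq), "matrix")
# first cycle per read with quality >= 20; NA if a read never reaches the cutoff
leadPos <- apply(qual, 1, function(q) which(q >= 20)[1])
leadPos[is.na(leadPos)] <- 1L                    # leave such reads untouched
fq_clean <- narrow(fq, start = leadPos, end = width(fq))

pe and trimPos would then be recomputed on fq_clean before the 3' trim.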

danova_fr wrote:

Hi,

Following the example above, there is something I don't understand.

# trim where the cumulative error probability is still < 1
# (my matrix is transposed, but it does the job nicely)
trimPos <- apply(pe, 1, function(x) sum(cumsum(x) < 1))

> mean(trimPos)
277.3139

OK, so the mean trim position across all the reads in my file is 277. So far so good.

# compute the mean error probability at each position of the sequence
> mean_pe <- apply(pe, 2, mean)
# then take the cumulative sum, giving the mean cumulative error at each position
> cum_pe <- cumsum(mean_pe)
# number of positions where the mean cumulative error is still < 1
# (I expected this to be roughly the same as mean(trimPos))
> sum(cum_pe < 1)
147

What I don't understand is why this doesn't return something close to 277. Since this is the mean error probability per position, shouldn't it give the same result as mean(trimPos)?

Can you help?
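A toy sketch with a hypothetical 2-read, 4-cycle error matrix suggests why the two summaries need not agree: a few very low-quality reads inflate the per-cycle mean error, so the averaged profile can cross 1 much earlier than a typical read does.

pe_toy <- rbind(bad = rep(0.90, 4), good = rep(0.01, 4))
apply(pe_toy, 1, function(x) sum(cumsum(x) < 1))   # 1 and 4, so the mean is 2.5
sum(cumsum(colMeans(pe_toy)) < 1)                  # 2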
