ShortRead with large fastq files
6
0
Entering edit mode
danova_fr ▴ 20
@danova_fr-7502
Last seen 7.5 years ago
Andorra

HI ,

I have a problem getting quality scores for the following large file.

 

My first step is to check the quality  to generate some stats for each fastq file, however it seems that i cannot allocate enough memory even though i have enough memory (10Gb).

> fq
class: ShortReadQ
length: 24377237 reads; width: 100 cycles

qual <- as(quality(fq), "matrix")

Error in asMethod(object) : long vectors not supported yet: memory.c:3324

 

How can i get around this to get the quality scores ?

 

shortread • 2.0k views
ADD COMMENT
0
Entering edit mode

What is the output of sessionInfo()?

ADD REPLY
1
Entering edit mode
danova_fr ▴ 20
@danova_fr-7502
Last seen 7.5 years ago
Andorra

Ok tried I have tried the following:

f <- FastqStreamer(file, 100000) #chunks of 100000
allqual <- list()
i=1
while (length(fq <- yield(f)) ) {
  chunkqual <- as(quality(fq), "matrix")
  allqual[[i]] <- chunkqual
  i = i+1
}
close(f)
qual <- do.call(rbind,allqual)

#quantile
qt <- apply(qual,2,quantile)

It crashes inside the loop. Maybe there is a better way to do more memory efficiently ?

ADD COMMENT
0
Entering edit mode

Do you really need to have quality scores of each read, at each cycle, or can you work on a summary of one sort or another, e.g., the abcq example on ?alphabetByCycle ? From your use of apply(), it seems like you can easily summarize each chunk, and then combine iteratively or at the end of the loop.

The 'copy and append' strategy allqual <- list() ... allqual[[i]] <- ... is very inefficient, you would rather pre-allocate and fill, maybe by over-estimating the length of the list, allqual <- vector("list", 10000).

It's not really helpful enough to say 'it crashes', because crashes can happen in many different ways and for correspondingly different reasons.

ADD REPLY
0
Entering edit mode
danova_fr ▴ 20
@danova_fr-7502
Last seen 7.5 years ago
Andorra
> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=ca_AD.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=ca_AD.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=ca_AD.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
 [1] ShortRead_1.24.0        GenomicAlignments_1.2.2 Rsamtools_1.18.3       
 [4] GenomicRanges_1.18.4    GenomeInfoDb_1.2.4      Biostrings_2.34.1      
 [7] XVector_0.6.0           IRanges_2.0.1           S4Vectors_0.4.0        
[10] BiocParallel_1.0.3      BiocGenerics_0.12.1    

loaded via a namespace (and not attached):
 [1] base64enc_0.1-2     BatchJobs_1.6       BBmisc_1.9         
 [4] Biobase_2.26.0      bitops_1.0-6        brew_1.0-6         
 [7] checkmate_1.5.2     codetools_0.2-11    DBI_0.3.1          
[10] digest_0.6.8        fail_1.2            foreach_1.4.2      
[13] grid_3.1.2          hwriter_1.3.2       iterators_1.0.7    
[16] lattice_0.20-30     latticeExtra_0.6-26 RColorBrewer_1.1-2
[19] RSQLite_1.0.0       sendmailR_1.2-1     stringr_0.6.2      
[22] tools_3.1.2         zlibbioc_1.12.0    
ADD COMMENT
0
Entering edit mode
Mike Smith ★ 5.7k
@mike-smith
Last seen 19 hours ago
EMBL Heidelberg / de.NBI

Depending upon what you're interested in doing with the quality scores, you could try using narrow() to look at a reduced number of cycles.  This code will get you the quality matrix for the first 20 bases from each read:

qual <- as( quality( narrow( fq, start = 1, width = 20 ) ), "matrix" )

You can then summarise each column or whatever, and then repeat the process with another set of cycles.

ADD COMMENT
0
Entering edit mode
@martin-morgan-1513
Last seen 13 days ago
United States

Use FastqStreamer() to iterate (process your quality assessment in chunks, say of 1M reads per chunk) or FastqSampler() (to read a random subset), as illustrated on the help pages,e.g., ?FastqStreamer.

ADD COMMENT
0
Entering edit mode
danova_fr ▴ 20
@danova_fr-7502
Last seen 7.5 years ago
Andorra

Thanks for the update.

What i´m trying to achieve here is ( so I might be wrong the way i do it) is to compute the quantile at each cycle across all reads. This is working pretty wall with bacterial genomes using 16s amplicons . Now working with higher datasets (metagenomics samples doing shotgun sequencing i get much more data)

So for read1 i will have the score for each cycle (i.g 1..100)

for read i will also get the score for each cycle (again 1..100)

at the end i want a matrix (basically what the as(quality(fq),"matrix") would return. With this matrix i compute the quantile per cycle and generate a boxplot that show the quality per cycle.

See attached plot. Obviously i could just use fastQC from Babraham bioinformatics but that would not be that funny

https://www.dropbox.com/s/h3v6c564gequxe3/Screen%20Shot%202015-06-02%20at%2022.46.42.png?dl=0

 

crashes in my case are problems allocating memory. SInce i don´t need the sequences but just the scores i thought i could just get the quality scores.

 

 

 

ADD COMMENT
0
Entering edit mode

Hi,

You cannot have more than 2^31-1 values in a matrix. This is a limitation of R. That means you won't be able to store the qualities for all the cycles of all your reads in a single matrix at once. So even if you didn't get a crash inside your while loop, you would end up with the same problem that you had originally when you try to combine all your "small" matrices into a single big one at the end with qual <- do.call(rbind,allqual).

As Mike suggested earlier, you can work around this by working with smaller sets of cycles. For example, working with 1 cycle at a time:

qfq <- quality(fq)
ncycles <- width(qfq)[1]
stopifnot(all(width(qfq) == ncycles))
qt <- sapply(seq_len(ncycles), function(i)
             quantile(as(narrow(qfq, start=i, width=1L), "matrix")))

The number of cycles is typically small so looping on the cycles should be quick.

Cheers,
H.

Edit: Martin pointed to me that matrices of more than 2^31-1 elements are officially supported. However, in practice it seems that the underlying code is still a little bit rough around the edges:

## Creating an integer matrix of the same dimensions as the one you
## would get if you were able to do as(quality(fq), "matrix"):
m <- matrix(nrow=24377237, ncol=100)
dim(m)
# [1] 24377237      100
object.size(m)
# 9750895000 bytes

## However, as(quality(fq), "matrix") would give you a character
## matrix instead of an integer matrix:
m2 <- matrix("A", nrow=24377237, ncol=100)  # takes a long time!
dim(m2)
# [1] 24377237      100
object.size(m2)
# Error in structure(.Call(C_objectSize, x), class = "object_size") :
#   long vectors not supported yet: /home/hpages/src/R-3.2.r68389/src/main/unique.c:1656

m3 <- matrix(nrow=24377237, ncol=100)
# Error in matrix(nrow = 24377237, ncol = 100) :
#   long vectors not supported yet: /home/hpages/src/R-3.2.r68389/src/main/memory.c:1648 

Anyway, this particular topic would probably be best discussed on the R-devel mailing list...

ADD REPLY
0
Entering edit mode

Maybe qrqc does the trick (also not very fun, I guess).

I would have summarized the data as

fl <- system.file(package="ShortRead", "extdata", "E-MTAB-1147",
                  "ERR127302_1_subset.fastq.gz")

strm <- FastqStreamer(fl, 10000)
abc <- NULL
repeat {
    fq <- yield(strm)
    if (length(fq) == 0)
        break
    abc <- if (is.null(abc)) {
        alphabetByCycle(quality(fq))
    } else abc + alphabetByCycle(quality(fq))
}

and then worked with abc for quantiles, either quick-and-dirty

ridx <- seq(1, which.max(cumsum(rowSums(abc) != 0)))
abc <- abc[ridx, ]

image(t(asinh(abc)), axes=FALSE, col=rev(heat.colors(12)))
axis(1, at=seq(10, ncol(abc), by=10) / ncol(abc) - 1 / ncol(abc),
     label=seq(10, ncol(abc), by=10))
axis(2, at=seq(5, nrow(abc), by=5) / nrow(abc) - 1 / nrow(abc),
     label=rownames(abc)[seq(5, nrow(abc), by=5)])

or a little inelegantly

abcq <- apply(abc, 2, function(x) quantile(rep(seq_along(x), x)))

(the mapping between character and encoding is given by encoding()). Since you mention fun, the pattern of iterating through a large file is a very common way to manage memory in genomics. So we have

library(GenomicFiles)
abc <- reduceByYield(FastqStreamer(fl), yield, function(x) {
    alphabetByCycle(quality(x))
})
ADD REPLY
0
Entering edit mode
danova_fr ▴ 20
@danova_fr-7502
Last seen 7.5 years ago
Andorra

Thanks for answers.

I will follow your advice and work with 1  cycle at a time.

 

Best,

david

 

ADD COMMENT

Login before adding your answer.

Traffic: 469 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6