Question: ShortRead with large fastq files
0
4.1 years ago by
danova_fr20
Andorra
danova_fr20 wrote:

Hi,

I have a problem getting quality scores for the following large file.

My first step is to check the quality and generate some stats for each fastq file; however, it seems that I cannot allocate enough memory, even though I have enough memory (10 GB).

> fq
length: 24377237 reads; width: 100 cycles

qual <- as(quality(fq), "matrix")

Error in asMethod(object) : long vectors not supported yet: memory.c:3324

How can I get around this to get the quality scores?

modified 4.1 years ago • written 4.1 years ago by danova_fr20

What is the output of sessionInfo()?

1
4.1 years ago by
danova_fr20
Andorra
danova_fr20 wrote:

OK, I have tried the following:

f <- FastqStreamer(file, 100000)  # chunks of 100,000 reads
allqual <- list()
i <- 1
while (length(fq <- yield(f))) {
    chunkqual <- as(quality(fq), "matrix")
    allqual[[i]] <- chunkqual
    i <- i + 1
}
close(f)
qual <- do.call(rbind, allqual)

# quantile per cycle
qt <- apply(qual, 2, quantile)

It crashes inside the loop. Maybe there is a better way to do this more memory-efficiently?

Do you really need the quality scores of each read at each cycle, or can you work with a summary of one sort or another, e.g., the abcq example on ?alphabetByCycle? From your use of apply(), it seems like you could easily summarize each chunk, and then combine iteratively or at the end of the loop.

The 'copy and append' strategy (allqual <- list() ... allqual[[i]] <- ...) is very inefficient; you would do better to pre-allocate and fill, perhaps by over-estimating the length of the list: allqual <- vector("list", 10000).

It's not really helpful enough to say 'it crashes', because crashes can happen in many different ways and for correspondingly different reasons.
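The pre-allocate-and-fill suggestion can be sketched with plain vectors standing in for the per-chunk quality matrices (the chunk contents below are made up so the snippet runs without ShortRead):

```r
## Pre-allocate the list once, fill it by index, then trim the unused
## slots; this avoids re-growing (and re-copying) the list on every
## iteration.
allqual <- vector("list", 10000)         # over-estimate the chunk count
i <- 1
for (chunk in list(1:3, 4:6, 7:9)) {     # stand-in for yield(f) chunks
    allqual[[i]] <- chunk
    i <- i + 1
}
allqual <- allqual[seq_len(i - 1)]       # drop the unused slots
result <- do.call(c, allqual)
```

In the real loop each element would be one chunk's quality matrix and the final combine would be do.call(rbind, allqual).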

0
4.1 years ago by
danova_fr20
Andorra
danova_fr20 wrote:
> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[4] GenomicRanges_1.18.4    GenomeInfoDb_1.2.4      Biostrings_2.34.1
[7] XVector_0.6.0           IRanges_2.0.1           S4Vectors_0.4.0
[10] BiocParallel_1.0.3      BiocGenerics_0.12.1

loaded via a namespace (and not attached):
[1] base64enc_0.1-2     BatchJobs_1.6       BBmisc_1.9
[4] Biobase_2.26.0      bitops_1.0-6        brew_1.0-6
[7] checkmate_1.5.2     codetools_0.2-11    DBI_0.3.1
[10] digest_0.6.8        fail_1.2            foreach_1.4.2
[13] grid_3.1.2          hwriter_1.3.2       iterators_1.0.7
[16] lattice_0.20-30     latticeExtra_0.6-26 RColorBrewer_1.1-2
[19] RSQLite_1.0.0       sendmailR_1.2-1     stringr_0.6.2
[22] tools_3.1.2         zlibbioc_1.12.0

0
4.1 years ago by
Mike Smith3.8k
EMBL Heidelberg / de.NBI
Mike Smith3.8k wrote:

Depending upon what you're interested in doing with the quality scores, you could try using narrow() to look at a reduced number of cycles.  This code will get you the quality matrix for the first 20 bases from each read:

qual <- as(quality(narrow(fq, start = 1, width = 20)), "matrix")

You can then summarise each column or whatever, and then repeat the process with another set of cycles.
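Extending this idea to cover all cycles, a sketch (assuming 100-cycle reads and an already-loaded fq, as in the question) that walks over the cycles in windows of 20 and keeps only the per-cycle quantiles, so no intermediate matrix approaches R's vector length limit:

```r
library(ShortRead)

## For ~24M reads, each 20-cycle window yields a ~490M-element matrix,
## comfortably under 2^31-1 elements; only the 5 quantiles per cycle
## are kept from each window.
win <- 20L
qt <- do.call(cbind, lapply(seq(1L, 100L, by = win), function(s) {
    m <- as(quality(narrow(fq, start = s, width = win)), "matrix")
    apply(m, 2, quantile)    # 5 x win block of quantiles
}))
## qt: 5 x 100 matrix, one column of quantiles per cycle
```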

0
4.1 years ago by
Martin Morgan ♦♦ 23k
United States
Martin Morgan ♦♦ 23k wrote:

Use FastqStreamer() to iterate (process your quality assessment in chunks, say of 1M reads per chunk) or FastqSampler() (to read a random subset), as illustrated on the help pages, e.g., ?FastqStreamer.
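A sketch of the FastqSampler() route ("reads.fastq.gz" and the sample size are placeholders): a random sample of 1M reads gives a quality matrix far below the size limit, and its per-cycle quantiles are normally a good estimate of the full-file values for QC purposes.

```r
library(ShortRead)

sampler <- FastqSampler("reads.fastq.gz", n = 1e6)  # placeholder path
fq <- yield(sampler)                # 1e6 randomly sampled reads
close(sampler)

qual <- as(quality(fq), "matrix")   # 1e6 x 100: easily fits in memory
qt <- apply(qual, 2, quantile)      # per-cycle quantiles
```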

0
4.1 years ago by
danova_fr20
Andorra
danova_fr20 wrote:

Thanks for the update.

What I'm trying to achieve here (so I might be going about it the wrong way) is to compute the quantile at each cycle across all reads. This works pretty well with bacterial genomes using 16S amplicons. Now that I'm working with larger datasets (shotgun-sequenced metagenomics samples), I get much more data.

So for read 1 I will have the score for each cycle (e.g. 1..100),

and for each subsequent read I will also get the score for each cycle (again 1..100).

At the end I want a matrix (basically what as(quality(fq), "matrix") would return). With this matrix I compute the quantile per cycle and generate a boxplot that shows the quality per cycle.

See attached plot. Obviously I could just use FastQC from Babraham Bioinformatics, but that would not be as much fun.

https://www.dropbox.com/s/h3v6c564gequxe3/Screen%20Shot%202015-06-02%20at%2022.46.42.png?dl=0

The crashes in my case are problems allocating memory. Since I don't need the sequences, just the scores, I thought I could get the quality scores alone.

Hi,

You cannot have more than 2^31-1 values in a matrix. This is a limitation of R. That means you won't be able to store the qualities for all the cycles of all your reads in a single matrix at once. So even if you didn't get a crash inside your while loop, you would end up with the same problem that you had originally when you try to combine all your "small" matrices into a single big one at the end with qual <- do.call(rbind,allqual).

As Mike suggested earlier, you can work around this by working with smaller sets of cycles. For example, working with 1 cycle at a time:

qfq <- quality(fq)
ncycles <- width(qfq)[1]
stopifnot(all(width(qfq) == ncycles))
qt <- sapply(seq_len(ncycles), function(i)
    quantile(as(narrow(qfq, start=i, width=1L), "matrix")))

The number of cycles is typically small, so looping over the cycles should be quick.

Cheers,
H.

Edit: Martin pointed out to me that matrices of more than 2^31-1 elements are officially supported. However, in practice the underlying code still seems a little rough around the edges:

## Creating an integer matrix of the same dimensions as the one you
## would get if you were able to do as(quality(fq), "matrix"):
m <- matrix(nrow=24377237, ncol=100)
dim(m)
# [1] 24377237      100
object.size(m)
# 9750895000 bytes

## However, as(quality(fq), "matrix") would give you a character
## matrix instead of an integer matrix:
m2 <- matrix("A", nrow=24377237, ncol=100)  # takes a long time!
dim(m2)
# [1] 24377237      100
object.size(m2)
# Error in structure(.Call(C_objectSize, x), class = "object_size") :
#   long vectors not supported yet: /home/hpages/src/R-3.2.r68389/src/main/unique.c:1656

m3 <- matrix(nrow=24377237, ncol=100)
# Error in matrix(nrow = 24377237, ncol = 100) :
#   long vectors not supported yet: /home/hpages/src/R-3.2.r68389/src/main/memory.c:1648


Anyway, this particular topic would probably be best discussed on the R-devel mailing list...

Maybe qrqc does the trick (also not very fun, I guess).

I would have summarized the data as

fl <- system.file(package="ShortRead", "extdata", "E-MTAB-1147",
                  "ERR127302_1_subset.fastq.gz")

strm <- FastqStreamer(fl, 10000)
abc <- NULL
repeat {
    fq <- yield(strm)
    if (length(fq) == 0)
        break
    abc <- if (is.null(abc)) {
        alphabetByCycle(quality(fq))
    } else {
        abc + alphabetByCycle(quality(fq))
    }
}
close(strm)

and then worked with abc for quantiles, either quick-and-dirty

ridx <- seq(1, which.max(cumsum(rowSums(abc) != 0)))
abc <- abc[ridx, ]

image(t(asinh(abc)), axes=FALSE, col=rev(heat.colors(12)))
axis(1, at=seq(10, ncol(abc), by=10) / ncol(abc) - 1 / ncol(abc),
     label=seq(10, ncol(abc), by=10))
axis(2, at=seq(5, nrow(abc), by=5) / nrow(abc) - 1 / nrow(abc),
     label=rownames(abc)[seq(5, nrow(abc), by=5)])

or a little inelegantly

abcq <- apply(abc, 2, function(x) quantile(rep(seq_along(x), x)))

(the mapping between quality character and score is given by encoding()). Since you mention fun: the pattern of iterating through a large file is a very common way to manage memory in genomics. So we have

library(GenomicFiles)
abc <- reduceByYield(FastqStreamer(fl), yield, function(x) {
    alphabetByCycle(quality(x))
})
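The counts-to-quantiles step used above can be checked in isolation with base R; counts below stands in for one column of abc (the number of reads at each quality level for a single cycle):

```r
## Expanding a column of counts back into the implied observations
## recovers the per-cycle quantiles without ever materialising the
## full read-by-cycle matrix.
counts <- c(2, 0, 5, 3)                  # stand-in column: levels 1..4
obs <- rep(seq_along(counts), counts)    # 1 1 3 3 3 3 3 4 4 4
qt <- quantile(obs)
```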
0
4.1 years ago by
danova_fr20
Andorra
danova_fr20 wrote:

Best,

david