Dear Bioconductors,
I am fairly new in NGS analysis using R so my question can seem quiet idiot. I have RNAseq data from one species in two different conditions and I would like to determine what are the ten differentially expressed according to the conditions. I have a transcriptome of this species partially annotated and I mapped reads from each condition against this transcriptome. Now I would like to use DESeq2 package and I saw it is necessary to get the read count table before. I tried to use summarizeOverlaps package but I didn't work. I would like to know if having off or gtf file as features argument is mandatory or if there is a way to use summarizeOverlaps without an annotated reference ?
Tank you in advance
Jean
Dear Valerie,
Thank you for your answer.
At this stage I only have the bam of aligned reads files imported in a bamlistfile object. Now, I would like to obtain read counts for each transcript. In the DESeq2 manual it is advised to use summarizeOverlaps, however, as I am working on de novo transcriptome, I do not have any annotation to provide as GRanges. My initial question was: is it possible to use summarizeOverlaps to obtain read count for each transcripts without providing 'features' ?
Sorry I was unclear,
Jean
No it's not possible. The 'features' are the annotation (ie, the transcripts). The 'counts' represent how many reads are in each annotation region (transcript). If you don't provide the transcript regions you can't do any counting.
Valerie
I think you've aligned to transcripts, rather than to the genome, so your BAM files have, say, 10's of thousands of
seqlevels(BamFile("your.bam"))
. So create abfl = BamFileList(vector_of_file_paths_to_bams)
and get the coordinates of the transcripts as a GRangesgr = as(seqinfo(bfl), "GRanges")
. Use these tose = summarizeOverlaps(gr ,bfl)
.Yes that is right. I created a Granges object using the command you gave me and the summarizeOverlaps function started. However, after hours of run, I got this error message :
Here is the script I used :
Do you know what means this error ?
Thank you in advance;
Jean
Try to evaluate sequentially rather than in parallel (by passing the argument BPPARAM=SerialParam() to
summarizeOverlaps()
. Report the output oftraceback()
immediately after the error occurs, and the output ofsessionInfo()
once all relevant packages have been loaded.I think 'hours of run' sounds too long (!) how many bam files and reads per bam file do you have? What does
parallel::detectCores()
report, and how much memory do you have?Thank you for your answer. I tried to add the BPPARAM=SerialParam() to summarizeOverlaps().
This time, the time of running and RAM usage have been greatly reduced but the error code changed : > se=summarizeOverlaps(gr,bamlst,mode=union,inter.feature=F,singleEnd=F,ignore.strand = T, BPPARAM=SerialParam())
Error in base::union(x, y, ...) :
arguments inutilisés (algorithm = "nclist", ignore.strand = TRUE, inter.feature = FALSE)
Here is the traceback() output :
> traceback()
13: base::union(x, y, ...)
12: mode(features, reads, algorithm = algorithm, ignore.strand = ignore.strand,
inter.feature = inter.feature)
11: mode(features, reads, algorithm = algorithm, ignore.strand = ignore.strand,
inter.feature = inter.feature)
10: .dispatchOverlaps(features, x, mode, algorithm, ignore.strand,
inter.feature, preprocess.reads, ...)
9: .countWithYieldSize(FUN, features, bf, mode, algorithm, ignore.strand,
inter.feature, param, preprocess.reads, ...)
8: FUN(X[[i]], ...)
7: lapply(X, FUN, ...)
6: bplapply(setNames(seq_along(reads), names(reads)), function(i,
FUN, reads, features, mode, algorithm, ignore.strand, inter.feature,
param, preprocess.reads) {
bf <- reads[[i]]
.countWithYieldSize(FUN, features, bf, mode, algorithm, ignore.strand,
inter.feature, param, preprocess.reads, ...)
}, FUN, reads, features, mode = match.fun(mode), algorithm = algorithm,
ignore.strand = ignore.strand, inter.feature = inter.feature,
param = param, preprocess.reads = preprocess.reads, ...)
5: bplapply(setNames(seq_along(reads), names(reads)), function(i,
FUN, reads, features, mode, algorithm, ignore.strand, inter.feature,
param, preprocess.reads) {
bf <- reads[[i]]
.countWithYieldSize(FUN, features, bf, mode, algorithm, ignore.strand,
inter.feature, param, preprocess.reads, ...)
}, FUN, reads, features, mode = match.fun(mode), algorithm = algorithm,
ignore.strand = ignore.strand, inter.feature = inter.feature,
param = param, preprocess.reads = preprocess.reads, ...)
4: .dispatchBamFiles(features, reads, mode, match.arg(algorithm),
ignore.strand, inter.feature = inter.feature, singleEnd = singleEnd,
fragments = fragments, param = param, preprocess.reads = preprocess.reads,
...)
3: .local(features, reads, mode, algorithm, ignore.strand, ...)
2: summarizeOverlaps(gr, bamlst, mode = union, inter.feature = F,
singleEnd = F, ignore.strand = T, BPPARAM = SerialParam())
1: summarizeOverlaps(gr, bamlst, mode = union, inter.feature = F,
singleEnd = F, ignore.strand = T, BPPARAM = SerialParam())
And the SessionInfo() output:
> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.3 (El Capitan)
locale:
[1] fr_FR.UTF-8/fr_FR.UTF-8/fr_FR.UTF-8/C/fr_FR.UTF-8/fr_FR.UTF-8
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] BiocParallel_1.4.3 GenomicAlignments_1.6.3 Rsamtools_1.22.0
[4] Biostrings_2.38.4 XVector_0.10.0 SummarizedExperiment_1.0.2
[7] Biobase_2.30.0 GenomicRanges_1.22.4 GenomeInfoDb_1.6.3
[10] IRanges_2.4.8 S4Vectors_0.8.11 BiocGenerics_0.16.1
loaded via a namespace (and not attached):
[1] bitops_1.0-6 futile.options_1.0.0 zlibbioc_1.16.0 futile.logger_1.4.1
[5] lambda.r_1.1.7 tools_3.2.3
I have 6 bam files with around 40000000 of paired-end reads in each. I am working on a IMac with 4 cores and 8Go of RAM (I also tried, with the same results on a PC with 8 cores and 20Go of RAM).
Best,
Jean
The error
takes a little scrutiny to figure out, but the problem is that
base::union
is being invoked, instead ofGenomicAlignments::Union
-- it is a typo in your command, usemode=Union
instead of mode=union.Also, add
yieldSize=1000000
to the BamFileList() command, so that the files are processed in chunks.Thank you very much for all the help you gave me ! Now it works !