Question

SummarizeOverlaps without reference genome ?

1

Entering edit mode

jean.keller ▴ 10

@jeankeller-9931

Last seen 7.8 years ago

Dear Bioconductors,

I am fairly new in NGS analysis using R so my question can seem quiet idiot. I have RNAseq data from one species in two different conditions and I would like to determine what are the ten differentially expressed according to the conditions. I have a transcriptome of this species partially annotated and I mapped reads from each condition against this transcriptome. Now I would like to use DESeq2 package and I saw it is necessary to get the read count table before. I tried to use summarizeOverlaps package but I didn't work. I would like to know if having off or gtf file as features argument is mandatory or if there is a way to use summarizeOverlaps without an annotated reference ?

Tank you in advance

Jean

summarizeoverlaps • 1.6k views

ADD COMMENT • link updated 8.1 years ago by Valerie Obenchain ★ 6.8k • written 8.1 years ago by jean.keller ▴ 10

Martin Morgan · Answer 1 · 2016-03-16

1

Entering edit mode

Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 2.3 years ago

United States

I'm not sure if "I mapped reads from each condition against this transcriptome" means you have the read counts in each annotation region or if that just means you have BAM files of aligned reads.

If you do have the counts and just need to put the data in a DESeq2 class, you may want DESeq2::DESeqDataSetFromMatrix. See the man page ?DESeqDataSetFromMatrix for details.

As far as summarizeOverlaps() 'not working' please show the code chunk you tried, the resulting error message and the output of your sessionInfo(). summarizeOverlaps() produces counts of the number of reads in specific annotation region so you do need an annotation to get the counts. The 'features' argument is usually a GRanges or GRangesList which comes from an extraction from a TxDb object (exonsBy(), transcripts(), etc.). I can explain more once I see the code you tried.

Valerie

ADD COMMENT • link 8.1 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Dear Valerie,

Thank you for your answer.

At this stage I only have the bam of aligned reads files imported in a bamlistfile object. Now, I would like to obtain read counts for each transcript. In the DESeq2 manual it is advised to use summarizeOverlaps, however, as I am working on de novo transcriptome, I do not have any annotation to provide as GRanges. My initial question was: is it possible to use summarizeOverlaps to obtain read count for each transcripts without providing 'features' ?

Sorry I was unclear,

Jean

ADD REPLY • link 8.1 years ago jean.keller ▴ 10

0

Entering edit mode

No it's not possible. The 'features' are the annotation (ie, the transcripts). The 'counts' represent how many reads are in each annotation region (transcript). If you don't provide the transcript regions you can't do any counting.

Valerie

ADD REPLY • link 8.1 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

I think you've aligned to transcripts, rather than to the genome, so your BAM files have, say, 10's of thousands of seqlevels(BamFile("your.bam")). So create a bfl = BamFileList(vector_of_file_paths_to_bams) and get the coordinates of the transcripts as a GRanges gr = as(seqinfo(bfl), "GRanges"). Use these to se = summarizeOverlaps(gr ,bfl).

ADD REPLY • link 8.1 years ago Martin Morgan 25k

0

Entering edit mode

Yes that is right. I created a Granges object using the command you gave me and the summarizeOverlaps function started. However, after hours of run, I got this error message :

Error in validObject(.Object) : 

  invalid class “SummarizedExperiment0” object: 'assays' nrow differs from 'mcols' nrow

Here is the script I used :

fls = list.files(recursive = T, pattern = "*.bam$",full=TRUE)
bamlst=BamFileList(fls)
seqlevels(bamlst)
gr=as(seqinfo(bamlst),"GRanges")
se=summarizeOverlaps(gr,bamlst,mode=union,inter.feature=F,singleEnd=F,ignore.strand = T)

Do you know what means this error ?
Thank you in advance;
Jean

ADD REPLY • link updated 8.1 years ago by Martin Morgan 25k • written 8.1 years ago by jean.keller ▴ 10

0

Entering edit mode

Try to evaluate sequentially rather than in parallel (by passing the argument BPPARAM=SerialParam() to summarizeOverlaps(). Report the output of traceback() immediately after the error occurs, and the output of sessionInfo() once all relevant packages have been loaded.

I think 'hours of run' sounds too long (!) how many bam files and reads per bam file do you have? What does parallel::detectCores() report, and how much memory do you have?

ADD REPLY • link 8.1 years ago Martin Morgan 25k

0

Entering edit mode

Thank you for your answer. I tried to add the BPPARAM=SerialParam() to summarizeOverlaps().

This time, the time of running and RAM usage have been greatly reduced but the error code changed : > se=summarizeOverlaps(gr,bamlst,mode=union,inter.feature=F,singleEnd=F,ignore.strand = T, BPPARAM=SerialParam())
Error in base::union(x, y, ...) :
arguments inutilisés (algorithm = "nclist", ignore.strand = TRUE, inter.feature = FALSE)

Here is the traceback() output :

> traceback()
13: base::union(x, y, ...)
12: mode(features, reads, algorithm = algorithm, ignore.strand = ignore.strand,
inter.feature = inter.feature)
11: mode(features, reads, algorithm = algorithm, ignore.strand = ignore.strand,
inter.feature = inter.feature)
10: .dispatchOverlaps(features, x, mode, algorithm, ignore.strand,
inter.feature, preprocess.reads, ...)
9: .countWithYieldSize(FUN, features, bf, mode, algorithm, ignore.strand,
inter.feature, param, preprocess.reads, ...)
8: FUN(X[[i]], ...)
7: lapply(X, FUN, ...)
6: bplapply(setNames(seq_along(reads), names(reads)), function(i,
FUN, reads, features, mode, algorithm, ignore.strand, inter.feature,
param, preprocess.reads) {
bf <- reads[[i]]
.countWithYieldSize(FUN, features, bf, mode, algorithm, ignore.strand,
inter.feature, param, preprocess.reads, ...)
}, FUN, reads, features, mode = match.fun(mode), algorithm = algorithm,
ignore.strand = ignore.strand, inter.feature = inter.feature,
param = param, preprocess.reads = preprocess.reads, ...)
5: bplapply(setNames(seq_along(reads), names(reads)), function(i,
FUN, reads, features, mode, algorithm, ignore.strand, inter.feature,
param, preprocess.reads) {
bf <- reads[[i]]
.countWithYieldSize(FUN, features, bf, mode, algorithm, ignore.strand,
inter.feature, param, preprocess.reads, ...)
}, FUN, reads, features, mode = match.fun(mode), algorithm = algorithm,
ignore.strand = ignore.strand, inter.feature = inter.feature,
param = param, preprocess.reads = preprocess.reads, ...)
4: .dispatchBamFiles(features, reads, mode, match.arg(algorithm),
ignore.strand, inter.feature = inter.feature, singleEnd = singleEnd,
fragments = fragments, param = param, preprocess.reads = preprocess.reads,
...)
3: .local(features, reads, mode, algorithm, ignore.strand, ...)
2: summarizeOverlaps(gr, bamlst, mode = union, inter.feature = F,
singleEnd = F, ignore.strand = T, BPPARAM = SerialParam())
1: summarizeOverlaps(gr, bamlst, mode = union, inter.feature = F,
singleEnd = F, ignore.strand = T, BPPARAM = SerialParam())

And the SessionInfo() output:

> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.3 (El Capitan)

locale:
[1] fr_FR.UTF-8/fr_FR.UTF-8/fr_FR.UTF-8/C/fr_FR.UTF-8/fr_FR.UTF-8

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] BiocParallel_1.4.3 GenomicAlignments_1.6.3 Rsamtools_1.22.0
[4] Biostrings_2.38.4 XVector_0.10.0 SummarizedExperiment_1.0.2
[7] Biobase_2.30.0 GenomicRanges_1.22.4 GenomeInfoDb_1.6.3
[10] IRanges_2.4.8 S4Vectors_0.8.11 BiocGenerics_0.16.1

loaded via a namespace (and not attached):
[1] bitops_1.0-6 futile.options_1.0.0 zlibbioc_1.16.0 futile.logger_1.4.1
[5] lambda.r_1.1.7 tools_3.2.3

I have 6 bam files with around 40000000 of paired-end reads in each. I am working on a IMac with 4 cores and 8Go of RAM (I also tried, with the same results on a PC with 8 cores and 20Go of RAM).

Best,

Jean

ADD REPLY • link 8.1 years ago jean.keller ▴ 10

0

Entering edit mode

The error

Error in base::union(x, y, ...) : 
  arguments inutilisés (algorithm = "nclist", ignore.strand = TRUE, inter.feature = FALSE)

takes a little scrutiny to figure out, but the problem is that base::union is being invoked, instead of GenomicAlignments::Union -- it is a typo in your command, use mode=Union instead of mode=union.

Also, add yieldSize=1000000 to the BamFileList() command, so that the files are processed in chunks.