Question: Enquiry about the summarizeOverlaps() function
0
3.8 years ago by
shania90.lk10
shania90.lk10 wrote:

Hi to the experts,

First of all, I would like to thank you for the efforts you put in to the Bioconductor team. Those efforts make the lives of struggling bioinformatics students like me much easy.

This is a very naive inquiry about an issue I face with counting features using summarizeOverlaps() function.

I have an analysis on stranded, paired end RNA-Seq data of transgenic Arabidopsis. A snippet of my code used is below.

> metadata$Code <- paste(metadata$Tissue,
+                                  metadata$Treatment, metadata$Genotype,
+                                  sep="_")
> metadata$ExpGroup <- paste(substr(metadata$Tissue, 1, 1),
+                                  metadata$Genotype, + substr(metadata$Treatment, 1, 1),
+                                  substr(metadata$Time, 1, 1), sep='-') > metadata$biolrep        <- c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,
+                              10,10,11,11,12,12,13,13,14,14,15,15,
+                              16,16,17,17,18,18,19,19,20,20,21,21,
+                              22,22,23,23,24,24,25,25,26,26,27,27,
+                              28,28,29,29,30,30,31,31)
> #extract out the names of the BAM files
> #to the bamFls variable (character variable)
> bamFls                  <- list.files(path="bamFiles/",
+                                       pattern="bam$", full=TRUE) > #ordering according to the order of metadata > bamFls_ordered <- bamFls[match(metadata$Filename,
+                                         (sub("\\..*", "", basename(bamFls))))]
> #Assigning names to each data in bamFls_ordred using a
> #section of its basename
> names(bamFls_ordered)   <- sub("\\..*", "",
+                                basename(bamFls_ordered))

> source("https://bioconductor.org/biocLite.R")
> biocLite(c("GenomicAlignments", “GenomicFeatures”))
> library("GenomicAlignments")
> library("GenomicFeatures")
> txdb                    <- makeTxDbFromGFF(
+                            file="TAIR10_GFF3_genes.gff",
+                            format="gff",
+                            dataSource="TAIR",
+                            organism="Arabidopsis thaliana")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK

This is where the issue begins. I managed to count the reads for each gene the first time without an issue using the following code.

> gnModel               <- transcriptsBy(txdb, "gene")
> options(mc.cores=1)
> ine_counts            <- summarizeOverlaps(gnModel,
+                          bamFls_ordered, ignore.strand=FALSE,
+                          singleEnd=FALSE,
+                          mode="Union")

This seem to function properly. However, the first attempt to create the count matrix gave me  an “acceptable” count matrix (completed in December 2015) with reads spanning up to ~1200 for the overexpressed transgene (attached the counts barplot as 1.png).

As I had a little time on my hand until the second dataset arrives, I tried to reproduce the data (in Jan 2016) using the same codes I used previously. However, now the count matrix looked very different compared to the first attempt (has very low numbers). I have attached the barplot for the counts of my overexpressed transgene from this count matrix as 2.png.

I tried the following to see whether it is a problem with my gff file, bam files, metadata file, etc.

> counter <- function(fl, gnModel) {
+     strand(aln)   <- "*" # for strand-blind sample prep protocol
+     hits          <- countOverlaps(aln, gnModel)
+     counts        <- countOverlaps(gnModel, aln[hits==1])
+     names(counts) <- names(gnModel)
+     counts
+ }
> counts            <- sapply(bamFls_ordered, counter, gnModel)

And the resulting count matrix was much similar to the initial count matrix I produced (attached the barplot for counts for the overexpressed transgene as 3.png). Of course there are higher numbers as the double counting is allowed.

Therefore, I concluded that (I might be wrong) it has got to do something with the summarizeOverlaps() function or the GenomicFetures() package.  I tried looking at the information from updates. However, I was unable to locate anything related to this. I prefer to use the summarizeOverlaps() function, as it lets me define whether my data is stranded/unstranded and paired/single end. And also the double counting is avoided.

If you find a free time, please advise me on where I might be going wrong, or please direct me to a place where I can find a solution.  I  pondered on it for nearly 2 weeks now, and cannot seem to find a way out. I’m sure it is a very naïve issue that has just slipped out of my sight. I would really appreciate your help.

Thanks you so much for your time.

Yours sincerely,

Shani A.

modified 3.8 years ago by Valerie Obenchain6.7k • written 3.8 years ago by shania90.lk10
0
3.8 years ago by
United States
Valerie Obenchain6.7k wrote:

Hi Shari,

Sorry for the delayed response - this one slipped by me.

The first thing I notice is that in the 'ine_counts' example ignore.strand is FALSE and in the counter() function you change strand to '*' which is equivalent to ignore.strand=TRUE. When ignore.strand=TRUE you will (generally) see higher counts.

Another potential issue is treatment of single vs paired end reads. If your bam files are paired end you can use summarizeOverlaps() with 'single.end=FALSE'. For a rough comparison readGAlignmentPairs() followed by countOverlaps(). readGAlignments() reads data in as single-end reads. For example, if you read in paired-end data with readGAlignements() you will get roughly double the number of records than if you read the data in with readGAlignmentPairs(). When performing overlap operations, the GAlignmentPairs container only counts one hit per pair not one hit for each read in the pair.

Also there's no need to set mc.cores=1; instead use BPPARAM=MulticoreParam(1). The BPPARAM argument is passed to bplapply() called from summarizedOverlaps(). I see this isn't well documented on the man page and I will fix that.

I would recommend testing with a single bam files for starters. See if you can't get your expected counts with the single file and then go ahead and process the list.

Valerie

Hi Valerie,

Thank you for your response. However, maybe I wasn't clear. I have no issue in getting different results for countOverlaps() and summarizeOverlaps() functions. You are correct on the fact that I wasn't using the same criteria for both. Even if I did, I believe they should not give the same count values for samples due to the difference in the underlying methodology.

My issue lies in not being able to reproduce the summarizeOverlaps() count matrix the second time I ran it. For an example, if you can see in the 1.png and  2.png, I took the counts from all the samples for the over-expressed gene in my experiment. Since we have double checked using RT-PCR, I know that over-expressed gene is expressed largely in transgenic plants. It is quite obvious in 1.png (this is the time I ran summarizeOverlaps() on my samples for the first time). But surprisingly the second time I tried to reproduced the data using the same script, and same samples, the reads for the over-expressed transgene have reduced from maximum ~1200 (from 1.png) to ~14 (this is evident in 2.png).

My question is, what could be really causing this non-reproducability, while all my BAM files, paths, gff file and the metadata file are correct. Given that I get maximum ~2000 counts for the over-expressed gene from countOverlaps() function, I believe it might have got to do something with the...umm, the summarizeOverlaps() function or the GenomicFetures package. Because when I checked, every other thing is a constant. I'm using GenomicFeatures 1.22.4 on R version 3.2.1 (2015-06-18) on x86_64-pc-linux-gnu platform.

I therefore, kindly would like to know whether an update for summarizeOverlaps() function or genomicFetures() package could cause this?

I'm sorry, if I am not clear again. Please let me know I can elaborate more. In the meantime, I will redo the summarizeOverlaps() and countOverlaps() with a single BAM file.

Thanks heaps.

Shani.

1

If the bam files and TxDb have not changed then results from summarizeOverlaps() should not have changed. No, there have not been changes to summarizeOverlaps() or makeTxDbFromGFF() that should cause this type of discrepancy. Both of the GenomicAlignments and GenomicFeatues packages are under active development but changes made should not cause this type of reduction in overlaps.

It is difficult to help when I can't reproduce the inconsistent results. You say you ran the same script with the same bam files with R 3.2.1.

- what were the versions of GenomicFeatures and GenomicAlignments for the two runs?

- did you save the GFF/TxDb from that first run or did you have to re-download the GFF and re-generate the TxDb for the second run?

If you know the different package versions (first run vs second) and have the TxDb(s) I can try to reproduce this on one of your bam files.

Valerie

Hi Valerie,

The versions I used for first run was GenomicAlignments 1.4.2 but for the second run 1.6.3. For GenomicFeatures first run was in 1.20.6 and for the next 1.22.12. New versions were used for countOverlaps() as well.

I saved the TxDb from the first run and loaded it back to use for the second run (this loaded TxDb was used for the countOverlaps() function as well). I have added the TxDb in sqlite format here. The BAM file can be found here.

Thanks a lot for your help.

Shani.

1

Thanks. I will take a look. (sorry for misspelling your name in that first post!).

Thanks Valerie.

No issue with the name. It is not quite a familiar name anyways... Looking forward to your update on the data.

Thanks,

Shani

I've done a run with the current release, sessionInfo() below.

There were some warnings related to differently named seqlevels. This may not be relevant to your gene of interest but I thought it worth mentioning.

> genes <- transcriptsBy(txdb, "gene")
> bf <- "B32_GTTTCG_L001_Aligned.sortedByCoord.out.bam"
> se <- summarizeOverlaps(genes, bf, ignore.strand=FALSE,
+                         singleEnd=FALSE, mode="Union")
There were 22 warnings (use warnings() to see them)
>
> warnings()
Warning messages:
1: In .Seqinfo.mergexy(x, y) :
Each of the 2 combined objects has sequence levels not in the other:
- in 'x': ChrC, ChrM
- in 'y': chloroplast, mitochondria
Make sure to always combine/compare objects based on the same reference
genome (use suppressWarnings() to suppress this warning).

You can take a closer look at the seqlevels with seqlevels(txdb) and seqlevels(BamFile(bf)).

I'm not sure which genes in the txdb (names start with AT*) correspond to "B32-R-Transgenic-S-3" in the figure so here I've included all genes with counts above 1000:

> cts <- assays(se)$counts > dim(cts) [1] 27416 1 > colSums(cts) B32_GTTTCG_L001_Aligned.sortedByCoord.out.bam 514967 > cts[cts > 1000,] AT1G08840 AT1G17744 AT1G17870 AT1G24822 AT1G24996 AT1G25097 AT1G77122 AT2G01021 3115 1467 1029 4632 3293 4634 1530 18473 AT2G07599 AT2G07671 AT2G07687 AT2G07698 AT2G07827 AT2G07835 AT2G23080 AT2G31380 1057 1365 1370 1230 1149 5459 1323 1027 AT2G47115 AT3G05870 AT3G09250 AT3G41762 AT3G52700 AT4G02970 AT4G37409 AT5G02370 4472 1228 4975 24321 1001 69157 6794 2954 AT5G10540 AT5G19150 AT5G19500 AT5G49440 1966 1465 2727 1837 Let me know which gene(s) in the txdb correspond to the over-expressed gene of interest and I'll re-run with the earlier package versions. Valerie editing to add sessionInfo(): > sessionInfo() R version 3.2.3 Patched (2016-01-04 r69860) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Fedora 22 (Twenty Two) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets [8] methods base other attached packages: [1] GenomicFeatures_1.22.13 AnnotationDbi_1.32.3 [3] GenomicAlignments_1.6.3 Rsamtools_1.22.0 [5] Biostrings_2.38.4 XVector_0.10.0 [7] SummarizedExperiment_1.0.2 Biobase_2.30.0 [9] GenomicRanges_1.22.4 GenomeInfoDb_1.6.3 [11] IRanges_2.4.7 S4Vectors_0.8.11 [13] BiocGenerics_0.16.1 loaded via a namespace (and not attached): [1] zlibbioc_1.16.0 BiocParallel_1.4.3 tools_3.2.3 [4] DBI_0.3.1 lambda.r_1.1.7 futile.logger_1.4.1 [7] rtracklayer_1.30.2 futile.options_1.0.0 bitops_1.0-6 [10] RCurl_1.95-4.7 biomaRt_2.26.1 RSQLite_1.0.0 [13] XML_3.98-1.3 ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by Valerie Obenchain6.7k Thanks a lot for this Valerie. The gene over-expressed would be AT2G25090. I checked the same with my data for the two runs of summerizeOverlaps(). ..._1 stands for the first run and ..._2 is the second run. > dim(ine_counts_1) [1] 27416 62 > dim(ine_counts_2) [1] 27416 62 > colSums(assay(ine_counts_1[, "B32_GTTTCG_L001_Aligned"])) B32_GTTTCG_L001_Aligned 15357637 > colSums(assay(ine_counts_2[, "B32_GTTTCG_L001_Aligned"])) B32_GTTTCG_L001_Aligned 514967 As you can see, you have reproduced the results from my second run which is very different to the first run for the same sample. This is still so confusing. Can I please know whether there is a new normalization step taking place within the summerizeOverlaps() function now that could cause this? Thanks a lot. Shani. ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by shania90.lk10 1 Yes, I understand the code above was equivalent to your second run. I wanted information about the gene before running the older versions so I could use a ScanBamParm to focus on the region of interest - you'll see this in the code below. The older package versions (release April-Oct 2015: GenomicFeatures 1.20.6, GenomicAlignments 1.4.2) produce the exact same results as the current release (Oct 2015 - April 2016: GenomicFeatures 1.22.13, GenomicAlignments 1.6.3). The difference I see is in the value of 'ignore.strand'. With ignore.strand=TRUE counts are 11 and with ignore.strand=FALSE counts are 1249. These numbers correspond to what I see in 1.png and 2.png. Valerie library(GenomicAlignments) library(GenomicFeatures) ## Isolate AT2G25090 txdb <- loadDb("at.sqlite") AT2G25090 <- transcriptsBy(txdb, "gene")["AT2G25090"] ## Index the bam so we can use ScanBamParam bf <- "B32_GTTTCG_L001_Aligned.sortedByCoord.out.bam" indexBam(bf) ## Read in Chr2 only chr2Length <- seqlengths(BamFile(bf))["Chr2"] param <- ScanBamParam(which=GRanges("Chr2", IRanges(0, width=chr2Length))) ## ----------------------------------------------------------------------- ## Current release (R-3.2; Bioconductor 3.2) ## ----------------------------------------------------------------------- ## ignore.strand=FALSE se <- summarizeOverlaps(AT2G25090, bf, param=param, ignore.strand=FALSE, singleEnd=FALSE, mode="Union") > assays(se)$counts
B32_GTTTCG_L001_Aligned.sortedByCoord.out.bam
AT2G25090                                            11

## ignore.strand=TRUE
se <- summarizeOverlaps(AT2G25090, bf, param=param, ignore.strand=TRUE,
singleEnd=FALSE, mode="Union")

>  assays(se)$counts B32_GTTTCG_L001_Aligned.sortedByCoord.out.bam AT2G25090 1249 ## sessionInfo() for this run was attached previously ## ----------------------------------------------------------------------- ## Old release April-Oct 2015 (R-3.2; Bioconductor 3.1) ## ----------------------------------------------------------------------- ## ignore.strand=FALSE se <- summarizeOverlaps(AT2G25090, bf, param=param, ignore.strand=FALSE, singleEnd=FALSE, mode="Union") > assays(se)$counts
B32_GTTTCG_L001_Aligned.sortedByCoord.out.bam
AT2G25090                                            11

## ignore.strand=TRUE
se <- summarizeOverlaps(AT2G25090, bf, param=param, ignore.strand=TRUE,
singleEnd=FALSE, mode="Union")

> assays(se)\$counts
B32_GTTTCG_L001_Aligned.sortedByCoord.out.bam
AT2G25090                                          1249

> sessionInfo()
R version 3.2.3 Patched (2016-02-15 r70177)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Fedora 22 (Twenty Two)

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] GenomicFeatures_1.20.6  AnnotationDbi_1.30.1    Biobase_2.28.0
[4] BiocInstaller_1.18.5    GenomicAlignments_1.4.2 Rsamtools_1.20.5
[7] Biostrings_2.36.4       XVector_0.8.0           GenomicRanges_1.20.8
[10] GenomeInfoDb_1.4.3      IRanges_2.2.9           S4Vectors_0.6.6
[13] BiocGenerics_0.14.0

loaded via a namespace (and not attached):
[1] zlibbioc_1.14.0      BiocParallel_1.2.22  tools_3.2.3
[4] DBI_0.3.1            lambda.r_1.1.7       futile.logger_1.4.1
[7] rtracklayer_1.28.10  futile.options_1.0.0 bitops_1.0-6
[10] RCurl_1.95-4.7       biomaRt_2.24.1       RSQLite_1.0.0
[13] XML_3.98-1.3

Yes Valerie, you are right. The values correspond with the results from the two runs as well.

colSums(assay(ine_counts_2["AT2G25090", "B32_GTTTCG_L001_Aligned"]))
B32_GTTTCG_L001_Aligned
11
colSums(assay(ine_counts_1["AT2G25090", "B32_GTTTCG_L001_Aligned"]))
B32_GTTTCG_L001_Aligned
1249

However, I used ignore.strand = FALSE in both instances. This is because my data are paired end reads.

Thank you,

Shani.

1

This is now a non-reproducible error. The sessionInfo() I show above demonstrates that ignore.strand=FALSE produces 11 counts with the old and new package versions. If you can download a fresh R 3.2 and install the appropriate Bioconductor 3.1 packages and show that using ignore.strand=FALSE produces 1249 counts (show your sessionInfo()) I will look into this further. Otherwise I'm considering this issue closed.

The best way to install the old packages is into a fresh R 3.2 install get BiocInstaller from

and install with R CMD INSTALL BiocInstaller. Then from within an R session with this fresh install use biocLite() to get GenomicFeatures and GenomicAlignments.

Valerie