Question

Enquiry about the summarizeOverlaps() function

0

Entering edit mode

shania90.lk ▴ 10

@shania90lk-9656

Last seen 8.0 years ago

Hi to the experts,

First of all, I would like to thank you for the efforts you put in to the Bioconductor team. Those efforts make the lives of struggling bioinformatics students like me much easy.

This is a very naive inquiry about an issue I face with counting features using summarizeOverlaps() function.

I have an analysis on stranded, paired end RNA-Seq data of transgenic Arabidopsis. A snippet of my code used is below.

> metadata                <- as.data.frame(read.table("bamFiles/metadata.txt",
+                                                     header=TRUE))
> metadata$Code           <- paste(metadata$Tissue,
+                                  metadata$Treatment, metadata$Genotype,
+                                  sep="_")
> metadata$ExpGroup       <- paste(substr(metadata$Tissue, 1, 1),
+                                  metadata$Genotype,
+                                  substr(metadata$Treatment, 1, 1),
+                                  substr(metadata$Time, 1, 1), sep='-')
> metadata$biolrep        <- c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,
+                              10,10,11,11,12,12,13,13,14,14,15,15,
+                              16,16,17,17,18,18,19,19,20,20,21,21,
+                              22,22,23,23,24,24,25,25,26,26,27,27,
+                              28,28,29,29,30,30,31,31)
> #extract out the names of the BAM files
> #to the bamFls variable (character variable)
> bamFls                  <- list.files(path="bamFiles/",
+                                       pattern="bam$", full=TRUE)
> #ordering according to the order of metadata
> bamFls_ordered          <- bamFls[match(metadata$Filename,
+                                         (sub("\\..*", "", basename(bamFls))))]
> #Assigning names to each data in bamFls_ordred using a
> #section of its basename
> names(bamFls_ordered)   <- sub("\\..*", "",
+                                basename(bamFls_ordered))

> source("https://bioconductor.org/biocLite.R")
> biocLite(c("GenomicAlignments", “GenomicFeatures”))
> library("GenomicAlignments")
> library("GenomicFeatures")

> txdb                    <- makeTxDbFromGFF(
+                            file="TAIR10_GFF3_genes.gff",
+                            format="gff",
+                            dataSource="TAIR",
+                            organism="Arabidopsis thaliana")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK

This is where the issue begins. I managed to count the reads for each gene the first time without an issue using the following code.

> gnModel               <- transcriptsBy(txdb, "gene")
> options(mc.cores=1)
> ine_counts            <- summarizeOverlaps(gnModel,
+                          bamFls_ordered, ignore.strand=FALSE,
+                          singleEnd=FALSE,
+                          mode="Union")

This seem to function properly. However, the first attempt to create the count matrix gave me an “acceptable” count matrix (completed in December 2015) with reads spanning up to ~1200 for the overexpressed transgene (attached the counts barplot as 1.png).

As I had a little time on my hand until the second dataset arrives, I tried to reproduce the data (in Jan 2016) using the same codes I used previously. However, now the count matrix looked very different compared to the first attempt (has very low numbers). I have attached the barplot for the counts of my overexpressed transgene from this count matrix as 2.png.

I tried the following to see whether it is a problem with my gff file, bam files, metadata file, etc.

> counter <- function(fl, gnModel) {
+     aln           <- readGAlignments(fl)
+     strand(aln)   <- "*" # for strand-blind sample prep protocol
+     hits          <- countOverlaps(aln, gnModel)
+     counts        <- countOverlaps(gnModel, aln[hits==1])
+     names(counts) <- names(gnModel)
+     counts
+ }
> counts            <- sapply(bamFls_ordered, counter, gnModel)

And the resulting count matrix was much similar to the initial count matrix I produced (attached the barplot for counts for the overexpressed transgene as 3.png). Of course there are higher numbers as the double counting is allowed.

Therefore, I concluded that (I might be wrong) it has got to do something with the summarizeOverlaps() function or the GenomicFetures() package. I tried looking at the information from updates. However, I was unable to locate anything related to this. I prefer to use the summarizeOverlaps() function, as it lets me define whether my data is stranded/unstranded and paired/single end. And also the double counting is avoided.

If you find a free time, please advise me on where I might be going wrong, or please direct me to a place where I can find a solution. I pondered on it for nearly 2 weeks now, and cannot seem to find a way out. I’m sure it is a very naïve issue that has just slipped out of my sight. I would really appreciate your help.

Thanks you so much for your time.

Yours sincerely,

Shani A.

summarizeoverlaps summarizedexperiment count matrix low counts rna-seq • 2.3k views

ADD COMMENT • link updated 8.2 years ago by Valerie Obenchain ★ 6.8k • written 8.3 years ago by shania90.lk ▴ 10

score 0 · Answer 1 · 2016-02-09

0

Entering edit mode

Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 2.3 years ago

United States

Hi Shari,

Sorry for the delayed response - this one slipped by me.

The first thing I notice is that in the 'ine_counts' example ignore.strand is FALSE and in the counter() function you change strand to '*' which is equivalent to ignore.strand=TRUE. When ignore.strand=TRUE you will (generally) see higher counts.

Another potential issue is treatment of single vs paired end reads. If your bam files are paired end you can use summarizeOverlaps() with 'single.end=FALSE'. For a rough comparison readGAlignmentPairs() followed by countOverlaps(). readGAlignments() reads data in as single-end reads. For example, if you read in paired-end data with readGAlignements() you will get roughly double the number of records than if you read the data in with readGAlignmentPairs(). When performing overlap operations, the GAlignmentPairs container only counts one hit per pair not one hit for each read in the pair.

Also there's no need to set mc.cores=1; instead use BPPARAM=MulticoreParam(1). The BPPARAM argument is passed to bplapply() called from summarizedOverlaps(). I see this isn't well documented on the man page and I will fix that.

I would recommend testing with a single bam files for starters. See if you can't get your expected counts with the single file and then go ahead and process the list.

Valerie

ADD COMMENT • link 8.2 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Hi Valerie,

Thank you for your response. However, maybe I wasn't clear. I have no issue in getting different results for countOverlaps() and summarizeOverlaps() functions. You are correct on the fact that I wasn't using the same criteria for both. Even if I did, I believe they should not give the same count values for samples due to the difference in the underlying methodology.

My issue lies in not being able to reproduce the summarizeOverlaps() count matrix the second time I ran it. For an example, if you can see in the 1.png and 2.png, I took the counts from all the samples for the over-expressed gene in my experiment. Since we have double checked using RT-PCR, I know that over-expressed gene is expressed largely in transgenic plants. It is quite obvious in 1.png (this is the time I ran summarizeOverlaps() on my samples for the first time). But surprisingly the second time I tried to reproduced the data using the same script, and same samples, the reads for the over-expressed transgene have reduced from maximum ~1200 (from 1.png) to ~14 (this is evident in 2.png).

My question is, what could be really causing this non-reproducability, while all my BAM files, paths, gff file and the metadata file are correct. Given that I get maximum ~2000 counts for the over-expressed gene from countOverlaps() function, I believe it might have got to do something with the...umm, the summarizeOverlaps() function or the GenomicFetures package. Because when I checked, every other thing is a constant. I'm using GenomicFeatures 1.22.4 on R version 3.2.1 (2015-06-18) on x86_64-pc-linux-gnu platform.

I therefore, kindly would like to know whether an update for summarizeOverlaps() function or genomicFetures() package could cause this?

I'm sorry, if I am not clear again. Please let me know I can elaborate more. In the meantime, I will redo the summarizeOverlaps() and countOverlaps() with a single BAM file.

Thanks heaps.

Shani.

ADD REPLY • link 8.2 years ago shania90.lk ▴ 10

1

Entering edit mode

If the bam files and TxDb have not changed then results from summarizeOverlaps() should not have changed. No, there have not been changes to summarizeOverlaps() or makeTxDbFromGFF() that should cause this type of discrepancy. Both of the GenomicAlignments and GenomicFeatues packages are under active development but changes made should not cause this type of reduction in overlaps.

It is difficult to help when I can't reproduce the inconsistent results. You say you ran the same script with the same bam files with R 3.2.1.

- what were the versions of GenomicFeatures and GenomicAlignments for the two runs?

- did you save the GFF/TxDb from that first run or did you have to re-download the GFF and re-generate the TxDb for the second run?

If you know the different package versions (first run vs second) and have the TxDb(s) I can try to reproduce this on one of your bam files.

Valerie

ADD REPLY • link 8.2 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Hi Valerie,

Thanks for your reply. .

The versions I used for first run was GenomicAlignments 1.4.2 but for the second run 1.6.3. For GenomicFeatures first run was in 1.20.6 and for the next 1.22.12. New versions were used for countOverlaps() as well.

I saved the TxDb from the first run and loaded it back to use for the second run (this loaded TxDb was used for the countOverlaps() function as well). I have added the TxDb in sqlite format here. The BAM file can be found here.

Thanks a lot for your help.

Shani.

ADD REPLY • link 8.2 years ago shania90.lk ▴ 10

1

Entering edit mode

Thanks. I will take a look. (sorry for misspelling your name in that first post!).

ADD REPLY • link 8.2 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Thanks Valerie.

No issue with the name. It is not quite a familiar name anyways... Looking forward to your update on the data.

Thanks,

Shani

ADD REPLY • link 8.2 years ago shania90.lk ▴ 10

0

Entering edit mode

I've done a run with the current release, sessionInfo() below.

There were some warnings related to differently named seqlevels. This may not be relevant to your gene of interest but I thought it worth mentioning.

> txdb <- loadDb("at.sqlite")
> genes <- transcriptsBy(txdb, "gene")
> bf <- "B32_GTTTCG_L001_Aligned.sortedByCoord.out.bam"
> se <- summarizeOverlaps(genes, bf, ignore.strand=FALSE,
+                         singleEnd=FALSE, mode="Union")
There were 22 warnings (use warnings() to see them)
> 
> warnings()
Warning messages:
1: In .Seqinfo.mergexy(x, y) :
  Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': ChrC, ChrM
  - in 'y': chloroplast, mitochondria
  Make sure to always combine/compare objects based on the same reference
  genome (use suppressWarnings() to suppress this warning).

You can take a closer look at the seqlevels with seqlevels(txdb) and seqlevels(BamFile(bf)).

I'm not sure which genes in the txdb (names start with AT*) correspond to "B32-R-Transgenic-S-3" in the figure so here I've included all genes with counts above 1000:

> cts <- assays(se)$counts
> dim(cts)
[1] 27416     1
> colSums(cts)
B32_GTTTCG_L001_Aligned.sortedByCoord.out.bam 
                                       514967 
> cts[cts > 1000,]
AT1G08840 AT1G17744 AT1G17870 AT1G24822 AT1G24996 AT1G25097 AT1G77122 AT2G01021 
     3115      1467      1029      4632      3293      4634      1530     18473 
AT2G07599 AT2G07671 AT2G07687 AT2G07698 AT2G07827 AT2G07835 AT2G23080 AT2G31380 
     1057      1365      1370      1230      1149      5459      1323      1027 
AT2G47115 AT3G05870 AT3G09250 AT3G41762 AT3G52700 AT4G02970 AT4G37409 AT5G02370 
     4472      1228      4975     24321      1001     69157      6794      2954 
AT5G10540 AT5G19150 AT5G19500 AT5G49440 
     1966      1465      2727      1837

Let me know which gene(s) in the txdb correspond to the over-expressed gene of interest and I'll re-run with the earlier package versions.

Valerie

editing to add sessionInfo():

> sessionInfo()
R version 3.2.3 Patched (2016-01-04 r69860)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Fedora 22 (Twenty Two)

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base

other attached packages:
[1] GenomicFeatures_1.22.13 AnnotationDbi_1.32.3
[3] GenomicAlignments_1.6.3 Rsamtools_1.22.0
[5] Biostrings_2.38.4 XVector_0.10.0
[7] SummarizedExperiment_1.0.2 Biobase_2.30.0
[9] GenomicRanges_1.22.4 GenomeInfoDb_1.6.3
[11] IRanges_2.4.7 S4Vectors_0.8.11
[13] BiocGenerics_0.16.1

loaded via a namespace (and not attached):
[1] zlibbioc_1.16.0 BiocParallel_1.4.3 tools_3.2.3
[4] DBI_0.3.1 lambda.r_1.1.7 futile.logger_1.4.1
[7] rtracklayer_1.30.2 futile.options_1.0.0 bitops_1.0-6
[10] RCurl_1.95-4.7 biomaRt_2.26.1 RSQLite_1.0.0
[13] XML_3.98-1.3

ADD REPLY • link 8.2 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Thanks a lot for this Valerie.

The gene over-expressed would be AT2G25090. I checked the same with my data for the two runs of summerizeOverlaps(). ..._1 stands for the first run and ..._2 is the second run.

> dim(ine_counts_1)
[1] 27416    62

> dim(ine_counts_2)
[1] 27416    62

> colSums(assay(ine_counts_1[, "B32_GTTTCG_L001_Aligned"]))
B32_GTTTCG_L001_Aligned
               15357637

> colSums(assay(ine_counts_2[, "B32_GTTTCG_L001_Aligned"]))
B32_GTTTCG_L001_Aligned
                 514967

As you can see, you have reproduced the results from my second run which is very different to the first run for the same sample. This is still so confusing. Can I please know whether there is a new normalization step taking place within the summerizeOverlaps() function now that could cause this?

Thanks a lot.

Shani.

ADD REPLY • link 8.2 years ago shania90.lk ▴ 10

1

Entering edit mode

Yes, I understand the code above was equivalent to your second run. I wanted information about the gene before running the older versions so I could use a ScanBamParm to focus on the region of interest - you'll see this in the code below.

The older package versions (release April-Oct 2015: GenomicFeatures 1.20.6, GenomicAlignments 1.4.2) produce the exact same results as the current release (Oct 2015 - April 2016: GenomicFeatures 1.22.13, GenomicAlignments 1.6.3). The difference I see is in the value of 'ignore.strand'. With ignore.strand=TRUE counts are 11 and with ignore.strand=FALSE counts are 1249. These numbers correspond to what I see in 1.png and 2.png.

Valerie

library(GenomicAlignments)
library(GenomicFeatures)

## Isolate AT2G25090
txdb <- loadDb("at.sqlite")
AT2G25090 <- transcriptsBy(txdb, "gene")["AT2G25090"]

## Index the bam so we can use ScanBamParam
bf <- "B32_GTTTCG_L001_Aligned.sortedByCoord.out.bam"
indexBam(bf)

## Read in Chr2 only
chr2Length <- seqlengths(BamFile(bf))["Chr2"]
param <- ScanBamParam(which=GRanges("Chr2", IRanges(0, width=chr2Length))) 

## -----------------------------------------------------------------------
## Current release (R-3.2; Bioconductor 3.2)
## -----------------------------------------------------------------------

## ignore.strand=FALSE
se <- summarizeOverlaps(AT2G25090, bf, param=param, ignore.strand=FALSE,
                        singleEnd=FALSE, mode="Union")

> assays(se)$counts
          B32_GTTTCG_L001_Aligned.sortedByCoord.out.bam
AT2G25090                                            11

## ignore.strand=TRUE
se <- summarizeOverlaps(AT2G25090, bf, param=param, ignore.strand=TRUE,
                        singleEnd=FALSE, mode="Union")

>  assays(se)$counts
          B32_GTTTCG_L001_Aligned.sortedByCoord.out.bam
AT2G25090                                          1249

## sessionInfo() for this run was attached previously

## -----------------------------------------------------------------------
## Old release April-Oct 2015 (R-3.2; Bioconductor 3.1)
## -----------------------------------------------------------------------

## ignore.strand=FALSE
se <- summarizeOverlaps(AT2G25090, bf, param=param, ignore.strand=FALSE,
                        singleEnd=FALSE, mode="Union")

> assays(se)$counts
          B32_GTTTCG_L001_Aligned.sortedByCoord.out.bam
AT2G25090                                            11

## ignore.strand=TRUE
se <- summarizeOverlaps(AT2G25090, bf, param=param, ignore.strand=TRUE,
                        singleEnd=FALSE, mode="Union")

> assays(se)$counts
          B32_GTTTCG_L001_Aligned.sortedByCoord.out.bam
AT2G25090                                          1249

> sessionInfo()
R version 3.2.3 Patched (2016-02-15 r70177)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Fedora 22 (Twenty Two)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] GenomicFeatures_1.20.6  AnnotationDbi_1.30.1    Biobase_2.28.0         
 [4] BiocInstaller_1.18.5    GenomicAlignments_1.4.2 Rsamtools_1.20.5       
 [7] Biostrings_2.36.4       XVector_0.8.0           GenomicRanges_1.20.8   
[10] GenomeInfoDb_1.4.3      IRanges_2.2.9           S4Vectors_0.6.6        
[13] BiocGenerics_0.14.0    

loaded via a namespace (and not attached):
 [1] zlibbioc_1.14.0      BiocParallel_1.2.22  tools_3.2.3         
 [4] DBI_0.3.1            lambda.r_1.1.7       futile.logger_1.4.1 
 [7] rtracklayer_1.28.10  futile.options_1.0.0 bitops_1.0-6        
[10] RCurl_1.95-4.7       biomaRt_2.24.1       RSQLite_1.0.0       
[13] XML_3.98-1.3

ADD REPLY • link 8.2 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Yes Valerie, you are right. The values correspond with the results from the two runs as well.

colSums(assay(ine_counts_2["AT2G25090", "B32_GTTTCG_L001_Aligned"]))
B32_GTTTCG_L001_Aligned 
                     11 
colSums(assay(ine_counts_1["AT2G25090", "B32_GTTTCG_L001_Aligned"]))
B32_GTTTCG_L001_Aligned 
                   1249

However, I used ignore.strand = FALSE in both instances. This is because my data are paired end reads.

Thank you,

Shani.

ADD REPLY • link 8.2 years ago shania90.lk ▴ 10

1

Entering edit mode

This is now a non-reproducible error. The sessionInfo() I show above demonstrates that ignore.strand=FALSE produces 11 counts with the old and new package versions. If you can download a fresh R 3.2 and install the appropriate Bioconductor 3.1 packages and show that using ignore.strand=FALSE produces 1249 counts (show your sessionInfo()) I will look into this further. Otherwise I'm considering this issue closed.

The best way to install the old packages is into a fresh R 3.2 install get BiocInstaller from

svn co https://hedgehog.fhcrc.org/bioconductor/branches/RELEASE_3_1/madman/Rpacks/BiocInstaller

and install with R CMD INSTALL BiocInstaller. Then from within an R session with this fresh install use biocLite() to get GenomicFeatures and GenomicAlignments.

Valerie

ADD REPLY • link 8.2 years ago Valerie Obenchain ★ 6.8k