Question

Using summarizeOverlaps for bacterial genomes

0

Entering edit mode

as9309 • 0

@as9309-9643

Last seen 8.2 years ago

Hi,

I have a set of bam files for my samples and I just want to align them to the e coli reference genome to get non-normalised unique reads for each gene. I am trying to do this using summarizeOverlaps (as I'm using R and I don't have a mac for featurecounts) and I have looked at the example on bioconductor as follows:

(ebg <- exonsBy(txdb, by="gene"))

se <- summarizeOverlaps(features=ebg, reads=bamfiles,
                        mode="Union",
                        singleEnd=FALSE,
                        ignore.strand=TRUE,
                        fragments=TRUE )

I don't really understand how my code will differ for the E coli genome compared to the example; what if I don't have introns of exons and just want to align to each gene? What do I write/create for "features="?

I assume I keep the "Union" mode as shown in Fig 1 of the vignette? Are there other optional modes?

As my data is single ended, I assume I also put true and do not put anything for ignore.strand or fragments if singleEnd=TRUE?

So far my code is:

(ebg <- exonsBy(txdb, by="gene"))

se <- summarizeOverlaps(features=ebg, reads=bamfiles,
                        mode="Union",
                        singleEnd=FALSE)

Is this correct or do I need to change ebg?

I've searched through the forum and still can't understand, sorry.

Thanks for your help.

summarizeoverlaps bacteria e coli unique reads • 1.5k views

ADD COMMENT • link updated 8.2 years ago by Valerie Obenchain ★ 6.8k • written 8.2 years ago by as9309 • 0

score 0 · Answer 1 · 2016-02-17

Hi,

The GenomicFeatures package has many extractors for isolating regions of interest, e.g., transcriptsBy(), genes(), etc. See the man pages ?genes and ?exonsBy. The genes(txdb) extractor returns the full range of the gene. Using this as 'features' will count reads that fall in any portion of the gene range (i.e., they aren't restricted to introns, exons, etc.). To count at the gene-level but enforce that the reads fall in the transcripts use transcriptsBy(txdb, "gene"). Similarly, exonsBy(txdb, "gene") counts at the gene-level but enforces that the reads fall in the exons, and so on.

The different counting modes for summarizeOverlaps() are all described on the man page ?summarizeOverlaps and in the 'Counting Reads with summarizeOverlaps' vignette, see browseVignettes("GenomicAlignments"). You can also provide your own count function as the 'mode' argument, this is shown in the examples.

If your data are single end, yes, you want singleEnd=TRUE. How you set ignore.strand depends on your data (how processed, what type) and what type of analysis you are planning to do.

Valerie