Hi,
Short answer
You can use something like this to invert the strand when doing strand-specific counting:
summarizeOverlaps(features, "path/to/your/BAM/file",
singleEnd=FALSE, preprocess.reads=invertStrand)
See the long answer below for the definition of the invertStrand()
function.
Long answer
1. With paired-end stranded RNA-seq, my understanding is that, depending on the protocol that was used, the strand of the original RNA is either that of the first read or its opposite.
Examples: protocols that lead to the strand of the first read being the opposite to that of the original RNA (1st family):
- dUTP, NSR, NNSR, Illumina stranded TruSeq PE protocol
and protocols that lead to the strand of the first read being that of the original RNA (2nd family):
- Directional Illumina (Ligation), Standard SOLiD
TopHat2 and Cufflinks have a --library-type
option to let the user specify which protocol was used: fr-firststrand
for the first family of protocols and fr-secondstrand
for the 2nd family.
Disclaimer: This is my understanding of the situation based on the information I gathered from various sources. This information was not always clear and was sometimes contradictory. I'd be happy to edit the above if it's not accurate.
2. Anyway, the bottom line is that the protocol that was used has a direct consequence on the strand of the first and second mates that are stored in a BAM file and in the GAlignmentPairs object obtained by calling readGAlignmentPairs()
on the file. In the BAM file, the 2 mates typically have opposite strands. In the GAlignmentPairs object, they always have opposite strands (readGAlignmentPairs()
will drop the pairs that don't satisfy this with a warning):
galp
# GAlignmentPairs object with 2 alignment pairs and 0 metadata columns:
# seqnames strand : ranges -- ranges
# <Rle> <Rle> : <IRanges> -- <IRanges>
# EAS54_61:4:143:69:578 chr1 + : [ 36, 70] -- [ 185, 219]
# EAS114_26:7:37:79:581 chr2 - : [1533, 1567] -- [1349, 1383]
# -------
# seqinfo: 2 sequences from an unspecified genome
first(galp)
# GAlignments object with 2 alignments and 0 metadata columns:
# seqnames strand cigar qwidth start
# <Rle> <Rle> <character> <integer> <integer>
# EAS54_61:4:143:69:578 chr1 + 35M 35 36
# EAS114_26:7:37:79:581 chr2 - 35M 35 1533
# end width njunc
# <integer> <integer> <integer>
# EAS54_61:4:143:69:578 70 35 0
# EAS114_26:7:37:79:581 1567 35 0
# -------
# seqinfo: 2 sequences from an unspecified genome
last(galp)
# GAlignments object with 2 alignments and 0 metadata columns:
# seqnames strand cigar qwidth start
# <Rle> <Rle> <character> <integer> <integer>
# EAS54_61:4:143:69:578 chr1 - 35M 35 185
# EAS114_26:7:37:79:581 chr2 + 35M 35 1349
# end width njunc
# <integer> <integer> <integer>
# EAS54_61:4:143:69:578 219 35 0
# EAS114_26:7:37:79:581 1383 35 0
# -------
# seqinfo: 2 sequences from an unspecified genome
Depending on what RNA-seq protocol was used, the meaningful strand is either that of the first or that of the second mate. But note that when galp
is displayed, the strand of the first mate is shown. Also when calling strand()
on a GAlignmentPairs, the strand of the first mate is returned. This reflects the fact that, when the GAlignmentPairs container was designed and implemented, the arbitrary choice was made to treat the strand of the first mate as the meaningful strand.
3. summarizeOverlaps()
calls findOverlaps()
internally to find overlaps between the features and the reads. When the reads
argument is a BamFile object, the alignments in the file must be loaded in a GAlignments or GAlignmentPairs object before findOverlaps()
can be called. The singleEnd
argument of summarizeOverlaps()
lets the user control how the reads in the file must be loaded: as a GAlignments object if singleEnd=TRUE
(the default), or as a GAlignmentPairs object if singleEnd=FALSE
. So make sure you used singleEnd=FALSE
.
4. When the GAlignmentPairs is passed to findOverlaps()
it's first turned into a GRangesList object with grglist()
. This operation also assumes that the meaningful strand is that of the first mate so it inverts the strand of the second mate. For example:
grglist(galp)
# GRangesList object of length 2:
# $EAS54_61:4:143:69:578
# GRanges object with 2 ranges and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] chr1 [ 36, 70] +
# [2] chr1 [185, 219] +
#
# $EAS114_26:7:37:79:581
# GRanges object with 2 ranges and 0 metadata columns:
# seqnames ranges strand
# [1] chr2 [1349, 1383] -
# [2] chr2 [1533, 1567] -
#
# -------
# seqinfo: 2 sequences from an unspecified genome
5. If the RNA-seq data was generated using a protocol where the meaningful strand is that of the second mate, then grglist()
is not doing the right thing. You can work around this by inverting the strand in the GAlignmentPairs object manually:
invertStrand <- function(galp)
{
## Using a non-exported helper function and direct slot access is
## bad practice and is strongly discouraged. Sorry for doing this here!
invertRleStrand <- GenomicAlignments:::invertRleStrand
galp@first <- invertRleStrand(galp@first)
galp@last <- invertRleStrand(galp@last)
galp
}
invertStrand(galp)
# GAlignmentPairs object with 2 alignment pairs and 0 metadata columns:
# seqnames strand : ranges -- ranges
# <Rle> <Rle> : <IRanges> -- <IRanges>
# EAS54_61:4:143:69:578 chr1 - : [ 36, 70] -- [ 185, 219]
# EAS114_26:7:37:79:581 chr2 + : [1533, 1567] -- [1349, 1383]
# -------
# seqinfo: 2 sequences from an unspecified genome
Then, grglist()
returns the right GRangesList object:
grglist(invertStrand(galp))
# GRangesList object of length 2:
# $EAS54_61:4:143:69:578
# GRanges object with 2 ranges and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] chr1 [185, 219] -
# [2] chr1 [ 36, 70] -
#
# $EAS114_26:7:37:79:581
# GRanges object with 2 ranges and 0 metadata columns:
# seqnames ranges strand
# [1] chr2 [1533, 1567] +
# [2] chr2 [1349, 1383] +
#
# -------
# seqinfo: 2 sequences from an unspecified genome
6. In the context of using summarizeOverlaps()
, you can either load all the reads in a GAlignmentPairs object and invert their strand manually with something like:
bamfile <- BamFile("path/to/your/BAM/file")
galp <- readGAlignmentPairs(bamfile)
summarizeOverlaps(features, invertStrand(galp))
or you can use the preprocess.reads
argument of summarizeOverlaps()
like this:
bamfile <- BamFile("path/to/your/BAM/file")
summarizeOverlaps(features, bamfile,
singleEnd=FALSE, preprocess.reads=invertStrand)
The advantage of using preprocess.reads
is that if the file is too big to have all the reads in memory at once, you can process it by chunks of constant size with something like:
bamfile <- BamFile("path/to/your/BAM/file", yieldSize=1000000)
open(bamfile)
summarizeOverlaps(features, bamfile,
singleEnd=FALSE, preprocess.reads=invertStrand)
Hope this helps,
H.
hi Herve,
Is there a plan to export invertStrand() in the next release?
It won't be done before the release but it's on the TODO for early in the next devel cycle.
Valerie
Hi Hervé, Michael, and Valerie,
Is the solution suggested still the way to go when using summarizeOverlaps() to count on RNA-seq data that is stranded and paired-end? It works for me, but since this thread is pretty old, I am wondering whether there is another way. I am using:
Thanks,
Patrick
Hi Patrick,
It depends on the stranded protocol that was used to generate your paired-end RNA-seq data.
But first (and before I get to the details): Since I posted my original answer to this thread 3.5 years ago, 3 important functionalities have been added to the GenomicAlignments package:
strandMode()
getter and setter. See?strandMode
for the meaning of the 3 supported modes and other important information.readGAlignmentPairs()
got a new argument,strandMode
, that is set to 1 by default.invertStrand()
. This provides a convenient way to switch the "strand mode" of an object back and forth between modes 1 and 2.Note that
readGAlignmentPairs()
sets the "strand mode" of the returned object to 1 by default. This can be changed via the newstrandMode
argument, or, after the object is loaded, by using thestrandMode()
setter on the returned GAlignmentPairs object. Here is the important thing to understand:readGAlignmentPairs()
has no way to know the real strand mode (i.e. how the stranded protocol treated the strand), at least that's my understanding (or is this information somehow stored somewhere in the BAM file?). So setting the "strand mode" to 1 by default was an arbitrary decision (not completely though because this choice kept backward compatibility with howreadGAlignmentPairs()
behaved before the introduction of the "strand mode"). This means that if your protocol was unstranded (strand mode 0), or stranded but following the strand mode 2 convention (e.g. dUTP, NSR, NNSR, Illumina stranded TruSeq PE protocol), then by defaultreadGAlignmentPairs()
will return an object with an incorrect "strand mode" (i.e. 1 instead of 0 or 2).So what you need to do really depends on what the real strand mode of your experiment is. Use
preprocess.reads=invertStrand
in your call tosummarizeOverlaps()
only if the real strand mode is 2. And that's assuming that you are callingsummarizeOverlaps()
directly on the BAM files. If you call it on a GAlignmentPairs object, no need to do this either: just set the strand mode of the object to 2 before callingsummarizeOverlaps()
on it.Hope this helps,
H.
Dear Herve,
I have the information that the RNA-Seq data I am working with is single-end strand-specific. I need to count reads on the reverse strand per gene, the gene names/position are in the GTF file and reads in the BAM file as shown below. Now, after researching many forums, tutorials and documentation, I came up with the following piece of code. We need to publish this data and I want to be sure if what I ran is completely correct before sending the paper. My confusion is that you have mentioned about using the option "preprocess.reads = invertStrand" for Paired End data, would this work with Single End data as well? I have been unable to find this in any of the documentation and if you could provide me some links to the right document in case I missed it, I will be grateful.
Your help and time is much appreciated. Thank you very much,
Ashu
Hi Ashu,
I don't know how your BAM file was generated. Each aligner has its own settings and these setting must be understood and used carefully to reflect the protocol that was used. In the case of single end RNA-seq data, the question boils down to this: is the strand reported in the BAM file the "real strand" or is it the opposite of the "real strand"? In other words, do you want to use the strand that's in the BAM file or its opposite?
More precisely, do you want this semantic (use the strand as-is):
or this one (use the opposite of the strand reported in the BAM file):
Note that my understanding is that the right semantic (i.e. the semantic that reflects the protocol that was used) will normally lead to more overlaps.
Also look at
?summarizeJunctions
in GenomicAlignments for a simple test for checking whether the RNA-seq protocol was stranded or not:Hope this helps,
H.
Did that help?