Question

GenomicAlignments: Inverting strand when counting using summarizeOverlaps

1

Entering edit mode

rxzlmn ▴ 10

@rxzlmn-7490

Last seen 9.1 years ago

Singapore

Hi,

I have RNA-seq data which has been mapped with STAR (paired-end). If I use summarizeOverlaps with ignore.strand=FALSE, the reads from these bam files get assigned to the opposite strand, rendering it impossible to count strand-specific. I am using a workaround by manually switching the strand information in the GRanges I count reads on, but that's clearly not optimal as I am changing the (correct) data and always need to make sure I change it back after counting.

Is there a better way to do that, i.e. is there a way to 'invert' the strand when passing on the BamFile?

Thanks!

summarizeoverlaps genomicalignments • 4.5k views

ADD COMMENT • link updated 9.1 years ago by Hervé Pagès 16k • written 9.1 years ago by rxzlmn ▴ 10

score 8 · Answer 1 · 2015-03-24

Hi,

Short answer

You can use something like this to invert the strand when doing strand-specific counting:

summarizeOverlaps(features, "path/to/your/BAM/file",
                  singleEnd=FALSE, preprocess.reads=invertStrand)

See the long answer below for the definition of the invertStrand() function.

Long answer

1. With paired-end stranded RNA-seq, my understanding is that, depending on the protocol that was used, the strand of the original RNA is either that of the first read or its opposite.

Examples: protocols that lead to the strand of the first read being the opposite to that of the original RNA (1st family):

dUTP, NSR, NNSR, Illumina stranded TruSeq PE protocol

and protocols that lead to the strand of the first read being that of the original RNA (2nd family):

Directional Illumina (Ligation), Standard SOLiD

TopHat2 and Cufflinks have a --library-type option to let the user specify which protocol was used: fr-firststrand for the first family of protocols and fr-secondstrand for the 2nd family.

Disclaimer: This is my understanding of the situation based on the information I gathered from various sources. This information was not always clear and was sometimes contradictory. I'd be happy to edit the above if it's not accurate.

2. Anyway, the bottom line is that the protocol that was used has a direct consequence on the strand of the first and second mates that are stored in a BAM file and in the GAlignmentPairs object obtained by calling readGAlignmentPairs() on the file. In the BAM file, the 2 mates typically have opposite strands. In the GAlignmentPairs object, they always have opposite strands (readGAlignmentPairs() will drop the pairs that don't satisfy this with a warning):

galp
# GAlignmentPairs object with 2 alignment pairs and 0 metadata columns:
#                         seqnames strand :       ranges --       ranges
#                            <Rle>  <Rle> :    <IRanges> --    <IRanges>
#   EAS54_61:4:143:69:578     chr1      + : [  36,   70] -- [ 185,  219]
#   EAS114_26:7:37:79:581     chr2      - : [1533, 1567] -- [1349, 1383]
#   -------
#   seqinfo: 2 sequences from an unspecified genome

first(galp)
# GAlignments object with 2 alignments and 0 metadata columns:
#                         seqnames strand       cigar    qwidth     start
#                            <Rle>  <Rle> <character> <integer> <integer>
#   EAS54_61:4:143:69:578     chr1      +         35M        35        36
#   EAS114_26:7:37:79:581     chr2      -         35M        35      1533
#                               end     width     njunc
#                         <integer> <integer> <integer>
#   EAS54_61:4:143:69:578        70        35         0
#   EAS114_26:7:37:79:581      1567        35         0
#   -------
#   seqinfo: 2 sequences from an unspecified genome

last(galp)
# GAlignments object with 2 alignments and 0 metadata columns:
#                         seqnames strand       cigar    qwidth     start
#                            <Rle>  <Rle> <character> <integer> <integer>
#   EAS54_61:4:143:69:578     chr1      -         35M        35       185
#   EAS114_26:7:37:79:581     chr2      +         35M        35      1349
#                               end     width     njunc
#                         <integer> <integer> <integer>
#   EAS54_61:4:143:69:578       219        35         0
#   EAS114_26:7:37:79:581      1383        35         0
#   -------
#   seqinfo: 2 sequences from an unspecified genome

Depending on what RNA-seq protocol was used, the meaningful strand is either that of the first or that of the second mate. But note that when galp is displayed, the strand of the first mate is shown. Also when calling strand() on a GAlignmentPairs, the strand of the first mate is returned. This reflects the fact that, when the GAlignmentPairs container was designed and implemented, the arbitrary choice was made to treat the strand of the first mate as the meaningful strand.

3. summarizeOverlaps() calls findOverlaps() internally to find overlaps between the features and the reads. When the reads argument is a BamFile object, the alignments in the file must be loaded in a GAlignments or GAlignmentPairs object before findOverlaps() can be called. The singleEnd argument of summarizeOverlaps() lets the user control how the reads in the file must be loaded: as a GAlignments object if singleEnd=TRUE (the default), or as a GAlignmentPairs object if singleEnd=FALSE. So make sure you used singleEnd=FALSE.

4. When the GAlignmentPairs is passed to findOverlaps() it's first turned into a GRangesList object with grglist(). This operation also assumes that the meaningful strand is that of the first mate so it inverts the strand of the second mate. For example:

grglist(galp)
# GRangesList object of length 2:
# $EAS54_61:4:143:69:578 
# GRanges object with 2 ranges and 0 metadata columns:
#       seqnames     ranges strand
#          <Rle>  <IRanges>  <Rle>
#   [1]     chr1 [ 36,  70]      +
#   [2]     chr1 [185, 219]      +
#
# $EAS114_26:7:37:79:581 
# GRanges object with 2 ranges and 0 metadata columns:
#       seqnames       ranges strand
#   [1]     chr2 [1349, 1383]      -
#   [2]     chr2 [1533, 1567]      -
#
# -------
# seqinfo: 2 sequences from an unspecified genome

5. If the RNA-seq data was generated using a protocol where the meaningful strand is that of the second mate, then grglist() is not doing the right thing. You can work around this by inverting the strand in the GAlignmentPairs object manually:

invertStrand <- function(galp)
{
    ## Using a non-exported helper function and direct slot access is
    ## bad practice and is strongly discouraged. Sorry for doing this here!
    invertRleStrand <- GenomicAlignments:::invertRleStrand
    galp@first <- invertRleStrand(galp@first)
    galp@last <- invertRleStrand(galp@last)
    galp
}

invertStrand(galp)
# GAlignmentPairs object with 2 alignment pairs and 0 metadata columns:
#                         seqnames strand :       ranges --       ranges
#                            <Rle>  <Rle> :    <IRanges> --    <IRanges>
#   EAS54_61:4:143:69:578     chr1      - : [  36,   70] -- [ 185,  219]
#   EAS114_26:7:37:79:581     chr2      + : [1533, 1567] -- [1349, 1383]
#   -------
#   seqinfo: 2 sequences from an unspecified genome

Then, grglist() returns the right GRangesList object:

grglist(invertStrand(galp))
# GRangesList object of length 2:
# $EAS54_61:4:143:69:578 
# GRanges object with 2 ranges and 0 metadata columns:
#       seqnames     ranges strand
#          <Rle>  <IRanges>  <Rle>
#   [1]     chr1 [185, 219]      -
#   [2]     chr1 [ 36,  70]      -
#
# $EAS114_26:7:37:79:581 
# GRanges object with 2 ranges and 0 metadata columns:
#       seqnames       ranges strand
#   [1]     chr2 [1533, 1567]      +
#   [2]     chr2 [1349, 1383]      +
#
# -------
# seqinfo: 2 sequences from an unspecified genome

6. In the context of using summarizeOverlaps(), you can either load all the reads in a GAlignmentPairs object and invert their strand manually with something like:

bamfile <- BamFile("path/to/your/BAM/file")
galp <- readGAlignmentPairs(bamfile)
summarizeOverlaps(features, invertStrand(galp))

or you can use the preprocess.reads argument of summarizeOverlaps() like this:

bamfile <- BamFile("path/to/your/BAM/file")
summarizeOverlaps(features, bamfile,
                  singleEnd=FALSE, preprocess.reads=invertStrand)

The advantage of using preprocess.reads is that if the file is too big to have all the reads in memory at once, you can process it by chunks of constant size with something like:

bamfile <- BamFile("path/to/your/BAM/file", yieldSize=1000000)
open(bamfile)
summarizeOverlaps(features, bamfile,
                  singleEnd=FALSE, preprocess.reads=invertStrand)

Hope this helps,
H.