Search
Question: GenomicAlignments: Inverting strand when counting using summarizeOverlaps
1
gravatar for rxzlmn
2.1 years ago by
rxzlmn10
Singapore
rxzlmn10 wrote:

Hi,

I have RNA-seq data which has been mapped with STAR (paired-end). If I use summarizeOverlaps with ignore.strand=FALSE, the reads from these bam files get assigned to the opposite strand, rendering it impossible to count strand-specific. I am using a workaround by manually switching the strand information in the GRanges I count reads on, but that's clearly not optimal as I am changing the (correct) data and always need to make sure I change it back after counting.

Is there a better way to do that, i.e. is there a way to 'invert' the strand when passing on the BamFile?

 

Thanks!

ADD COMMENTlink modified 2.1 years ago by Hervé Pagès ♦♦ 12k • written 2.1 years ago by rxzlmn10
6
gravatar for Hervé Pagès
2.1 years ago by
Hervé Pagès ♦♦ 12k
United States
Hervé Pagès ♦♦ 12k wrote:

Hi,

Short answer

You can use something like this to invert the strand when doing strand-specific counting:

summarizeOverlaps(features, "path/to/your/BAM/file",
                  singleEnd=FALSE, preprocess.reads=invertStrand)

See the long answer below for the definition of the invertStrand() function.

Long answer

1. With paired-end stranded RNA-seq, my understanding is that, depending on the protocol that was used, the strand of the original RNA is either that of the first read or its opposite.

Examples: protocols that lead to the strand of the first read being the opposite to that of the original RNA (1st family):

  • dUTP, NSR, NNSR, Illumina stranded TruSeq PE protocol

and protocols that lead to the strand of the first read being that of the original RNA (2nd family):

  • Directional Illumina (Ligation), Standard SOLiD

TopHat2 and Cufflinks have a --library-type option to let the user specify which protocol was used: fr-firststrand for the first family of protocols and fr-secondstrand for the 2nd family.

Disclaimer: This is my understanding of the situation based on the information I gathered from various sources. This information was not always clear and was sometimes contradictory. I'd be happy to edit the above if it's not accurate.

2. Anyway, the bottom line is that the protocol that was used has a direct consequence on the strand of the first and second mates that are stored in a BAM file and in the GAlignmentPairs object obtained by calling readGAlignmentPairs() on the file. In the BAM file, the 2 mates typically have opposite strands. In the GAlignmentPairs object, they always have opposite strands (readGAlignmentPairs() will drop the pairs that don't satisfy this with a warning):

galp
# GAlignmentPairs object with 2 alignment pairs and 0 metadata columns:
#                         seqnames strand :       ranges --       ranges
#                            <Rle>  <Rle> :    <IRanges> --    <IRanges>
#   EAS54_61:4:143:69:578     chr1      + : [  36,   70] -- [ 185,  219]
#   EAS114_26:7:37:79:581     chr2      - : [1533, 1567] -- [1349, 1383]
#   -------
#   seqinfo: 2 sequences from an unspecified genome

first(galp)
# GAlignments object with 2 alignments and 0 metadata columns:
#                         seqnames strand       cigar    qwidth     start
#                            <Rle>  <Rle> <character> <integer> <integer>
#   EAS54_61:4:143:69:578     chr1      +         35M        35        36
#   EAS114_26:7:37:79:581     chr2      -         35M        35      1533
#                               end     width     njunc
#                         <integer> <integer> <integer>
#   EAS54_61:4:143:69:578        70        35         0
#   EAS114_26:7:37:79:581      1567        35         0
#   -------
#   seqinfo: 2 sequences from an unspecified genome

last(galp)
# GAlignments object with 2 alignments and 0 metadata columns:
#                         seqnames strand       cigar    qwidth     start
#                            <Rle>  <Rle> <character> <integer> <integer>
#   EAS54_61:4:143:69:578     chr1      -         35M        35       185
#   EAS114_26:7:37:79:581     chr2      +         35M        35      1349
#                               end     width     njunc
#                         <integer> <integer> <integer>
#   EAS54_61:4:143:69:578       219        35         0
#   EAS114_26:7:37:79:581      1383        35         0
#   -------
#   seqinfo: 2 sequences from an unspecified genome

Depending on what RNA-seq protocol was used, the meaningful strand is either that of the first or that of the second mate. But note that when galp is displayed, the strand of the first mate is shown. Also when calling strand() on a GAlignmentPairs, the strand of the first mate is returned. This reflects the fact that, when the GAlignmentPairs container was designed and implemented, the arbitrary choice was made to treat the strand of the first mate as the meaningful strand.

3. summarizeOverlaps() calls findOverlaps() internally to find overlaps between the features and the reads. When the reads argument is a BamFile object, the alignments in the file must be loaded in a GAlignments or GAlignmentPairs object before findOverlaps() can be called. The singleEnd argument of summarizeOverlaps() lets the user control how the reads in the file must be loaded: as a GAlignments object if singleEnd=TRUE (the default), or as a GAlignmentPairs object if singleEnd=FALSE. So make sure you used singleEnd=FALSE.

4. When the GAlignmentPairs is passed to findOverlaps() it's first turned into a GRangesList object with grglist(). This operation also assumes that the meaningful strand is that of the first mate so it inverts the strand of the second mate. For example:

grglist(galp)
# GRangesList object of length 2:
# $EAS54_61:4:143:69:578 
# GRanges object with 2 ranges and 0 metadata columns:
#       seqnames     ranges strand
#          <Rle>  <IRanges>  <Rle>
#   [1]     chr1 [ 36,  70]      +
#   [2]     chr1 [185, 219]      +
#
# $EAS114_26:7:37:79:581 
# GRanges object with 2 ranges and 0 metadata columns:
#       seqnames       ranges strand
#   [1]     chr2 [1349, 1383]      -
#   [2]     chr2 [1533, 1567]      -
#
# -------
# seqinfo: 2 sequences from an unspecified genome

5. If the RNA-seq data was generated using a protocol where the meaningful strand is that of the second mate, then grglist() is not doing the right thing. You can work around this by inverting the strand in the GAlignmentPairs object manually:

invertStrand <- function(galp)
{
    ## Using a non-exported helper function and direct slot access is
    ## bad practice and is strongly discouraged. Sorry for doing this here!
    invertRleStrand <- GenomicAlignments:::invertRleStrand
    galp@first <- invertRleStrand(galp@first)
    galp@last <- invertRleStrand(galp@last)
    galp
}

invertStrand(galp)
# GAlignmentPairs object with 2 alignment pairs and 0 metadata columns:
#                         seqnames strand :       ranges --       ranges
#                            <Rle>  <Rle> :    <IRanges> --    <IRanges>
#   EAS54_61:4:143:69:578     chr1      - : [  36,   70] -- [ 185,  219]
#   EAS114_26:7:37:79:581     chr2      + : [1533, 1567] -- [1349, 1383]
#   -------
#   seqinfo: 2 sequences from an unspecified genome

Then, grglist() returns the right GRangesList object:

grglist(invertStrand(galp))
# GRangesList object of length 2:
# $EAS54_61:4:143:69:578 
# GRanges object with 2 ranges and 0 metadata columns:
#       seqnames     ranges strand
#          <Rle>  <IRanges>  <Rle>
#   [1]     chr1 [185, 219]      -
#   [2]     chr1 [ 36,  70]      -
#
# $EAS114_26:7:37:79:581 
# GRanges object with 2 ranges and 0 metadata columns:
#       seqnames       ranges strand
#   [1]     chr2 [1533, 1567]      +
#   [2]     chr2 [1349, 1383]      +
#
# -------
# seqinfo: 2 sequences from an unspecified genome

6. In the context of using summarizeOverlaps(), you can either load all the reads in a GAlignmentPairs object and invert their strand manually with something like:

bamfile <- BamFile("path/to/your/BAM/file")
galp <- readGAlignmentPairs(bamfile)
summarizeOverlaps(features, invertStrand(galp))

or you can use the preprocess.reads argument of summarizeOverlaps() like this:

bamfile <- BamFile("path/to/your/BAM/file")
summarizeOverlaps(features, bamfile,
                  singleEnd=FALSE, preprocess.reads=invertStrand)

The advantage of using preprocess.reads is that if the file is too big to have all the reads in memory at once, you can process it by chunks of constant size with something like:

bamfile <- BamFile("path/to/your/BAM/file", yieldSize=1000000)
open(bamfile)
summarizeOverlaps(features, bamfile,
                  singleEnd=FALSE, preprocess.reads=invertStrand)

Hope this helps,
H.

ADD COMMENTlink modified 2.1 years ago • written 2.1 years ago by Hervé Pagès ♦♦ 12k

hi Herve,

Is there a plan to export invertStrand() in the next release?

ADD REPLYlink modified 13 months ago • written 13 months ago by Michael Love11k
1

It won't be done before the release but it's on the TODO for early in the next devel cycle.

Valerie

ADD REPLYlink written 13 months ago by Valerie Obenchain ♦♦ 6.0k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 168 users visited in the last hour