GenomicAlignments: Inverting strand when counting using summarizeOverlaps
1
1
Entering edit mode
rxzlmn ▴ 10
@rxzlmn-7490
Last seen 9.7 years ago
Singapore

Hi,

I have RNA-seq data which has been mapped with STAR (paired-end). If I use summarizeOverlaps with ignore.strand=FALSE, the reads from these bam files get assigned to the opposite strand, rendering it impossible to count strand-specific. I am using a workaround by manually switching the strand information in the GRanges I count reads on, but that's clearly not optimal as I am changing the (correct) data and always need to make sure I change it back after counting.

Is there a better way to do that, i.e. is there a way to 'invert' the strand when passing on the BamFile?

 

Thanks!

summarizeoverlaps genomicalignments • 4.8k views
ADD COMMENT
8
Entering edit mode
@herve-pages-1542
Last seen 6 hours ago
Seattle, WA, United States

Hi,

Short answer

You can use something like this to invert the strand when doing strand-specific counting:

summarizeOverlaps(features, "path/to/your/BAM/file",
                  singleEnd=FALSE, preprocess.reads=invertStrand)

See the long answer below for the definition of the invertStrand() function.

Long answer

1. With paired-end stranded RNA-seq, my understanding is that, depending on the protocol that was used, the strand of the original RNA is either that of the first read or its opposite.

Examples: protocols that lead to the strand of the first read being the opposite to that of the original RNA (1st family):

  • dUTP, NSR, NNSR, Illumina stranded TruSeq PE protocol

and protocols that lead to the strand of the first read being that of the original RNA (2nd family):

  • Directional Illumina (Ligation), Standard SOLiD

TopHat2 and Cufflinks have a --library-type option to let the user specify which protocol was used: fr-firststrand for the first family of protocols and fr-secondstrand for the 2nd family.

Disclaimer: This is my understanding of the situation based on the information I gathered from various sources. This information was not always clear and was sometimes contradictory. I'd be happy to edit the above if it's not accurate.

2. Anyway, the bottom line is that the protocol that was used has a direct consequence on the strand of the first and second mates that are stored in a BAM file and in the GAlignmentPairs object obtained by calling readGAlignmentPairs() on the file. In the BAM file, the 2 mates typically have opposite strands. In the GAlignmentPairs object, they always have opposite strands (readGAlignmentPairs() will drop the pairs that don't satisfy this with a warning):

galp
# GAlignmentPairs object with 2 alignment pairs and 0 metadata columns:
#                         seqnames strand :       ranges --       ranges
#                            <Rle>  <Rle> :    <IRanges> --    <IRanges>
#   EAS54_61:4:143:69:578     chr1      + : [  36,   70] -- [ 185,  219]
#   EAS114_26:7:37:79:581     chr2      - : [1533, 1567] -- [1349, 1383]
#   -------
#   seqinfo: 2 sequences from an unspecified genome

first(galp)
# GAlignments object with 2 alignments and 0 metadata columns:
#                         seqnames strand       cigar    qwidth     start
#                            <Rle>  <Rle> <character> <integer> <integer>
#   EAS54_61:4:143:69:578     chr1      +         35M        35        36
#   EAS114_26:7:37:79:581     chr2      -         35M        35      1533
#                               end     width     njunc
#                         <integer> <integer> <integer>
#   EAS54_61:4:143:69:578        70        35         0
#   EAS114_26:7:37:79:581      1567        35         0
#   -------
#   seqinfo: 2 sequences from an unspecified genome

last(galp)
# GAlignments object with 2 alignments and 0 metadata columns:
#                         seqnames strand       cigar    qwidth     start
#                            <Rle>  <Rle> <character> <integer> <integer>
#   EAS54_61:4:143:69:578     chr1      -         35M        35       185
#   EAS114_26:7:37:79:581     chr2      +         35M        35      1349
#                               end     width     njunc
#                         <integer> <integer> <integer>
#   EAS54_61:4:143:69:578       219        35         0
#   EAS114_26:7:37:79:581      1383        35         0
#   -------
#   seqinfo: 2 sequences from an unspecified genome

Depending on what RNA-seq protocol was used, the meaningful strand is either that of the first or that of the second mate. But note that when galp is displayed, the strand of the first mate is shown. Also when calling strand() on a GAlignmentPairs, the strand of the first mate is returned. This reflects the fact that, when the GAlignmentPairs container was designed and implemented, the arbitrary choice was made to treat the strand of the first mate as the meaningful strand.

3. summarizeOverlaps() calls findOverlaps() internally to find overlaps between the features and the reads. When the reads argument is a BamFile object, the alignments in the file must be loaded in a GAlignments or GAlignmentPairs object before findOverlaps() can be called. The singleEnd argument of summarizeOverlaps() lets the user control how the reads in the file must be loaded: as a GAlignments object if singleEnd=TRUE (the default), or as a GAlignmentPairs object if singleEnd=FALSE. So make sure you used singleEnd=FALSE.

4. When the GAlignmentPairs is passed to findOverlaps() it's first turned into a GRangesList object with grglist(). This operation also assumes that the meaningful strand is that of the first mate so it inverts the strand of the second mate. For example:

grglist(galp)
# GRangesList object of length 2:
# $EAS54_61:4:143:69:578 
# GRanges object with 2 ranges and 0 metadata columns:
#       seqnames     ranges strand
#          <Rle>  <IRanges>  <Rle>
#   [1]     chr1 [ 36,  70]      +
#   [2]     chr1 [185, 219]      +
#
# $EAS114_26:7:37:79:581 
# GRanges object with 2 ranges and 0 metadata columns:
#       seqnames       ranges strand
#   [1]     chr2 [1349, 1383]      -
#   [2]     chr2 [1533, 1567]      -
#
# -------
# seqinfo: 2 sequences from an unspecified genome

5. If the RNA-seq data was generated using a protocol where the meaningful strand is that of the second mate, then grglist() is not doing the right thing. You can work around this by inverting the strand in the GAlignmentPairs object manually:

invertStrand <- function(galp)
{
    ## Using a non-exported helper function and direct slot access is
    ## bad practice and is strongly discouraged. Sorry for doing this here!
    invertRleStrand <- GenomicAlignments:::invertRleStrand
    galp@first <- invertRleStrand(galp@first)
    galp@last <- invertRleStrand(galp@last)
    galp
}

invertStrand(galp)
# GAlignmentPairs object with 2 alignment pairs and 0 metadata columns:
#                         seqnames strand :       ranges --       ranges
#                            <Rle>  <Rle> :    <IRanges> --    <IRanges>
#   EAS54_61:4:143:69:578     chr1      - : [  36,   70] -- [ 185,  219]
#   EAS114_26:7:37:79:581     chr2      + : [1533, 1567] -- [1349, 1383]
#   -------
#   seqinfo: 2 sequences from an unspecified genome

Then, grglist() returns the right GRangesList object:

grglist(invertStrand(galp))
# GRangesList object of length 2:
# $EAS54_61:4:143:69:578 
# GRanges object with 2 ranges and 0 metadata columns:
#       seqnames     ranges strand
#          <Rle>  <IRanges>  <Rle>
#   [1]     chr1 [185, 219]      -
#   [2]     chr1 [ 36,  70]      -
#
# $EAS114_26:7:37:79:581 
# GRanges object with 2 ranges and 0 metadata columns:
#       seqnames       ranges strand
#   [1]     chr2 [1533, 1567]      +
#   [2]     chr2 [1349, 1383]      +
#
# -------
# seqinfo: 2 sequences from an unspecified genome

6. In the context of using summarizeOverlaps(), you can either load all the reads in a GAlignmentPairs object and invert their strand manually with something like:

bamfile <- BamFile("path/to/your/BAM/file")
galp <- readGAlignmentPairs(bamfile)
summarizeOverlaps(features, invertStrand(galp))

or you can use the preprocess.reads argument of summarizeOverlaps() like this:

bamfile <- BamFile("path/to/your/BAM/file")
summarizeOverlaps(features, bamfile,
                  singleEnd=FALSE, preprocess.reads=invertStrand)

The advantage of using preprocess.reads is that if the file is too big to have all the reads in memory at once, you can process it by chunks of constant size with something like:

bamfile <- BamFile("path/to/your/BAM/file", yieldSize=1000000)
open(bamfile)
summarizeOverlaps(features, bamfile,
                  singleEnd=FALSE, preprocess.reads=invertStrand)

Hope this helps,
H.

ADD COMMENT
0
Entering edit mode

hi Herve,

Is there a plan to export invertStrand() in the next release?

ADD REPLY
1
Entering edit mode

It won't be done before the release but it's on the TODO for early in the next devel cycle.

Valerie

ADD REPLY
0
Entering edit mode

Hi Hervé, Michael, and Valerie,

Is the solution suggested still the way to go when using summarizeOverlaps() to count on RNA-seq data that is stranded and paired-end? It works for me, but since this thread is pretty old, I am wondering whether there is another way. I am using:

se <- summarizeOverlaps(features = ebg, reads = bamfiles,
                        mode = "Union",
                        singleEnd = FALSE,
                        ignore.strand = FALSE,
                        fragments = TRUE,
                        preprocess.reads = invertStrand)

 

Thanks,

Patrick

ADD REPLY
0
Entering edit mode

Hi Patrick,

It depends on the stranded protocol that was used to generate your paired-end RNA-seq data.

But first (and before I get to the details): Since I posted my original answer to this thread 3.5 years ago, 3 important functionalities have been added to the GenomicAlignments package:

  1. GAlignmentPairs objects now have a "strand mode" that can be set to 0, 1, or 2, and accessed (and modified) with the strandMode() getter and setter. See ?strandMode for the meaning of the 3 supported modes and other important information.
  2. readGAlignmentPairs() got a new argument, strandMode, that is set to 1 by default.
  3. GAlignmentPairs objects now support invertStrand(). This provides a convenient way to switch the "strand mode" of an object back and forth between modes 1 and 2.

Note that readGAlignmentPairs() sets the "strand mode" of the returned object to 1 by default. This can be changed via the new strandMode argument, or, after the object is loaded, by using the strandMode() setter on the returned GAlignmentPairs object. Here is the important thing to understand: readGAlignmentPairs() has no way to know the real strand mode (i.e. how the stranded protocol treated the strand), at least that's my understanding (or is this information somehow stored somewhere in the BAM file?). So setting the "strand mode" to 1 by default was an arbitrary decision (not completely though because this choice kept backward compatibility with how readGAlignmentPairs() behaved before the introduction of the "strand mode"). This means that if your protocol was unstranded (strand mode 0), or stranded but following the strand mode 2 convention (e.g. dUTP, NSR, NNSR, Illumina stranded TruSeq PE protocol), then by default readGAlignmentPairs() will return an object with an incorrect "strand mode" (i.e. 1 instead of 0 or 2).

So what you need to do really depends on what the real strand mode of your experiment is. Use preprocess.reads=invertStrand in your call to summarizeOverlaps() only if the real strand mode is 2. And that's assuming that you are calling summarizeOverlaps() directly on the BAM files. If you call it on a GAlignmentPairs object, no need to do this either: just set the strand mode of the object to 2 before calling summarizeOverlaps() on it.

Hope this helps,

H.

ADD REPLY
0
Entering edit mode

Dear Herve,

I have the information that the RNA-Seq data I am working with is single-end strand-specific. I need to count reads on the reverse strand per gene, the gene names/position are in the GTF file and reads in the BAM file as shown below. Now, after researching many forums, tutorials and documentation, I came up with the following piece of code. We need to publish this data and I want to be sure if what I ran is completely correct before sending the paper. My confusion is that you have mentioned about using the option "preprocess.reads = invertStrand" for Paired End data, would this work with Single End data as well? I have been unable to find this in any of the documentation and if you could provide me some links to the right document in case I missed it, I will be grateful.

bamFile <- "Aligned.sortedByCoord.out.bam"
gtf <- "gencode.v27.primary_assembly.annotation.gtf.ucsc.gtf"
gtf.gRanges <- rtracklayer::import(gtf)
idx <-  mcols(gtf.gRanges)$type == "exon" 
gtf.gRanges.exons <- gtf.gRanges[idx]
gtf.split.genes <- split(gtf.gRanges.exons, mcols(gtf.gRanges.exons)$gene_id)

# Run SummarizeOverlap
GenomicAlignments::summarizeOverlaps(gtf.split.genes, bamFile, mode="Union", ignore.strand=F, preprocess.reads = invertStrand, singleEnd = T)

Your help and time is much appreciated. Thank you very much,

Ashu

ADD REPLY
1
Entering edit mode

Hi Ashu,

I don't know how your BAM file was generated. Each aligner has its own settings and these setting must be understood and used carefully to reflect the protocol that was used. In the case of single end RNA-seq data, the question boils down to this: is the strand reported in the BAM file the "real strand" or is it the opposite of the "real strand"? In other words, do you want to use the strand that's in the BAM file or its opposite?

More precisely, do you want this semantic (use the strand as-is):

library(GenomicAlignments)

## 2 fake genes:
gtf.split.genes <- GRangesList(
    gene1=GRanges(c("chrA:11-20:+", "chrA:51-100:+")),
    gene2=GRanges(c("chrA:131-140:-", "chrA:181-200:-"))
)

## 5 fake reads:
reads <- GAlignments(
    Rle("chrA", 5),
    pos=c(10L, 11L, 55L, 55L, 135L),
    cigar=rep("10M", 5),
    strand=strand(c("-", "-", "+", "-", "+"))
)

se1 <- summarizeOverlaps(gtf.split.genes, reads, ignore.strand=FALSE)
assay(se1)
#       reads
# gene1     1
# gene2     0

or this one (use the opposite of the strand reported in the BAM file):

se2 <- summarizeOverlaps(gtf.split.genes, invertStrand(reads), ignore.strand=FALSE)
assay(se2)
#      reads
# gene1     3
# gene2     1

Note that my understanding is that the right semantic (i.e. the semantic that reflects the protocol that was used) will normally lead to more overlaps.

Also look at ?summarizeJunctions in GenomicAlignments for a simple test for checking whether the RNA-seq protocol was stranded or not:

Hope this helps,

H.

ADD REPLY
0
Entering edit mode

Did that help?

ADD REPLY

Login before adding your answer.

Traffic: 923 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6