How to use Rsamtool or GenomicAlignments to calculate RAN-seq read coverage curves
1
0
Entering edit mode
Jiping Wang ▴ 90
@jiping-wang-6687
Last seen 21 months ago
United States

We recently did some work on correction for RNA degradation bias for RNA-seq data. https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1682-7. This requires us to calculate the RNA-seq read coverage score on all exon regions of every genes. I wonder whether Rsamtools or GenomicAlignments can do this job. Here are two complexities:

  1. Does this package provide functionality of read coverage for each gene, only on the exon region? I understand like Samtools it can calculate the coverage score on the genome. But for RNA-seq analysis, we want to focus on reads count and coverage that completely aligned onto the legitimate exon regions. For example, some reads might partially align to intron due to incomplete splicing, those reads should be opted out before calculation of the read coverage score.

  2. Some reads could be aligned to multiple genes but at unique genomic location, if the genes are on different strands and overlap (which happens pretty often). To avoid ambiguity, we should drop such reads as well (this was similarly done in HTSeq for counting reads, but HTseq does not calculate the coverage). Does either package take this into consideration for constructing the coverage score for RNA-seq data specifically?

thanks for help.

Jiping Wang

Rsamtools GenomicAlignments RNA-seq coverage score • 2.7k views
ADD COMMENT
1
Entering edit mode
@herve-pages-1542
Last seen 2 days ago
Seattle, WA, United States

Hi Jiping,

No high level tool for doing exactly that with Rsamtools or GenomicAlignments, AFAIK, but you should be able to use some lower level functionalities to get exactly what you want. Roughly, the following steps should get you there:

1. Load you reads in a GRangesList object

library(GenomicAlignments)
reads <- readGAlignments(...)
reads <- as(reads, "GRangesList")

2. Get your exons grouped by gene in a GRangesList object

## Check what TxDb packages are available:
BiocManager::available("TxDb")

## If no one suits your need, construct your own TxDb object:
library(GenomicFeatures)
txdb <- makeTxDbFromGFF(...)  # or use makeTxDbFromUCSC(...),
                              # or makeTxDbFromEnsembl(...), etc...

## Then extract the exons grouped by gene from the TxDb object:
ex_by_gene <- exonsBy(txdb, by="gene")

3. Drop the reads you don't want

You will typically use subsetByOverlaps() for this. However, for more control, and depending on what "reads that completely aligned onto the legitimate exon regions" means, you might need to use findOverlaps(), directly on reads and ex_by_gene, or on unlist(reads) and unlist(ex_by_gene), and be prepared for some direct manipulation of Hits objects. This could get tricky but in the end you should be able to keep only the reads that you want.

4. Compute the coverage of the remaining reads

cvg <- coverage(reads)

cvg will be an RleList object with one list element per chromosome in the genome.

5. Finally extract the coverage of each exon

all_exons <- unlist(ex_by_gene)
cvg_by_ex <- cvg[all_exons]

cvg_by_ex will be an RleList object with one list element per exon.

Propagate the exon names:

names(cvg_by_ex) <- mcols(all_exons)$exon_name
## Or, alternatively, if the exon names are problematic, propagate the
## exon ids:
names(cvg_by_ex) <- mcols(all_exons)$exon_id

Hope this helps,

H.

ADD COMMENT
0
Entering edit mode

Hi Herve,

Thanks for your helpful comments. There is one issue here (If I misunderstand your comment, please correct me). In the genome there are genes that sit on the opposite strands, but in overlapping genomic region. Many of the existing RNA-seq data do not distinguish which strand they originated from. If we calculate the coverage first and then extract exons, in such regions we could have double counting issue. If you meant in step 3, that we should have dropped such reads, then it should work. We also wrote our own codes which have realized this. But it's slow and I am trying to utilize the existing bioconductor package/functions to speed it up.

I noticed Valerie Obenchain wrote a note on using SummarizeOverlaps method from GenomicAlignment package to get read count per gene (as HTseq does). The read filtering criterion is exactly what I want. But In addition to the read count per gene, we need calculate the coverage score in the entire transcript for each gene using the same rule as used in HTseq. The summarizeOverlaps could do this job if it also outputs which reads/or pairs were counted toward which gene (in additions to a only count of total # reads). I read carefully about this package, but it seems that no such output is available from this function. I just wonder whether there is any way to get around it to get the list of remaining reads after filtering.

Thanks so much for help again!

ADD REPLY
0
Entering edit mode

The mode argument controls how summarizeOverlaps() filters out reads. This argument takes a callback function that must have the following arguments: features, reads, ignore.strand, inter.feature, and must return a vector of counts parallel to the features argument (i.e. with 1 count per feature). There are 3 predefined callback functions: Union(), IntersectionStrict(), IntersectionNotEmpty() (see ?summarizeOverlaps() for more information).

For example Union is defined as follow:

> Union
function (features, reads, ignore.strand = FALSE, inter.feature = TRUE) 
{
    ov <- findOverlaps(features, reads, ignore.strand = ignore.strand)
    if (inter.feature) {
        reads_to_keep <- which(countSubjectHits(ov) == 1L)
        ov <- ov[subjectHits(ov) %in% reads_to_keep]
    }
    countQueryHits(ov)
}
<bytecode: 0x1640d388>
<environment: namespace:GenomicAlignments>

So if the filtering performed by these modes is what you are after but you don't want the last step (the counting one), just do what they do before that and skip the counting step. For example to filter your reads in the same way as Union(), use a modified version of Union():

Union2 <- function(features, reads, ignore.strand=FALSE, inter.feature=TRUE) 
{
    ov <- findOverlaps(features, reads, ignore.strand = ignore.strand)
    if (inter.feature) {
        reads_to_keep <- which(countSubjectHits(ov) == 1L)
        ov <- ov[subjectHits(ov) %in% reads_to_keep]
    }
    ov
}

Note that Union2() is to be used in the context of your transcript/exon coverage problem, NOT with summarizeOverlaps() where it would fail to work because it doesn't return a vector of counts.

Finally I should mention the coverageByTranscript() function defined in the GenomicFeatures package. It will compute the read coverage by transcript but hopefully it should not be too hard to extract the read coverage by exon from the RleList object returned by coverageByTranscript(). See ?coverageByTranscript (the man page has many examples).

H.

ADD REPLY
0
Entering edit mode

Did that help?

ADD REPLY
0
Entering edit mode

Hi, thanks again! I was away for a few weeks and didn't have access to this. I will try it out and let you know. Thanks so much for your help!!

ADD REPLY
0
Entering edit mode

Update:

I was able to use these functions to get the coverage calculation for single-end data. However I got new problems when dealing with paired-end read using readGAlignmentPairs function. I tested on the data the package comes with (i.e. extdata), no problem, this function returns the paired end data. However when I applied to my own paired-end data, I got no records for some reason. Could you provide any hints?

> library(Rsamtools)
> library("GenomicAlignments")
> setwd("~/degnorm")
> bamfile="SRR873822.bam"
> ##testing whether single or paired-end
> testPairedEndBam(bamfile)
[1] TRUE
> 
> ##subsetting the bam file for test
> t <- scanBamHeader(bamfile)[[1]][["targets"]]
> which <- GRanges(names(t), IRanges(1, unname(t)))
> 
> ##only look at chrI
> param <- ScanBamParam(which=which[1], what=character(), flag=scanBamFlag(isProperPair=TRUE,
+                                                                          isDuplicate=FALSE,
+                                                                          isSecondaryAlignment=FALSE))
> galp2 <- readGAlignmentPairs(bamfile, use.names=TRUE, param=param)
> galp2
GAlignmentPairs object with 0 pairs, strandMode=1, and 0 metadata columns:
   seqnames strand :    ranges --    ranges
      <Rle>  <Rle> : <IRanges> -- <IRanges>
  -------
  seqinfo: 25 sequences from an unspecified genome
ADD REPLY
0
Entering edit mode

Please format your code following the guidance here https://support.bioconductor.org/p/117436/ so that your 'code chunks are legible.

ADD REPLY
0
Entering edit mode

Hi,

For coverageByTranscript(), does it distinguish the three different methods, i.e., Union(), IntersectionStrict(), IntersectionNotEmpty(). What I want is to use IntersectionStrict() which only keeps the compatible reads with transcritome that only aligns to one unique gene. I saw the coverageByTrnacript method allows to keep compatible reads but it does not exclude reads that aligns to shared region of multiple genes. Is that right? Many thanks again!

jiping

ADD REPLY

Login before adding your answer.

Traffic: 524 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6