Question: How to use Rsamtool or GenomicAlignments to calculate RAN-seq read coverage curves
0
10 weeks ago by
Jiping Wang70
Jiping Wang70 wrote:

We recently did some work on correction for RNA degradation bias for RNA-seq data. https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1682-7. This requires us to calculate the RNA-seq read coverage score on all exon regions of every genes. I wonder whether Rsamtools or GenomicAlignments can do this job. Here are two complexities:

1. Does this package provide functionality of read coverage for each gene, only on the exon region? I understand like Samtools it can calculate the coverage score on the genome. But for RNA-seq analysis, we want to focus on reads count and coverage that completely aligned onto the legitimate exon regions. For example, some reads might partially align to intron due to incomplete splicing, those reads should be opted out before calculation of the read coverage score.

2. Some reads could be aligned to multiple genes but at unique genomic location, if the genes are on different strands and overlap (which happens pretty often). To avoid ambiguity, we should drop such reads as well (this was similarly done in HTSeq for counting reads, but HTseq does not calculate the coverage). Does either package take this into consideration for constructing the coverage score for RNA-seq data specifically?

thanks for help.

Jiping Wang

modified 10 weeks ago by Hervé Pagès ♦♦ 14k • written 10 weeks ago by Jiping Wang70
Answer: How to use Rsamtool or GenomicAlignments to calculate RAN-seq read coverage curv
1
10 weeks ago by
Hervé Pagès ♦♦ 14k
United States
Hervé Pagès ♦♦ 14k wrote:

Hi Jiping,

No high level tool for doing exactly that with Rsamtools or GenomicAlignments, AFAIK, but you should be able to use some lower level functionalities to get exactly what you want. Roughly, the following steps should get you there:

library(GenomicAlignments)


#### 2. Get your exons grouped by gene in a GRangesList object

## Check what TxDb packages are available:
BiocManager::available("TxDb")

## If no one suits your need, construct your own TxDb object:
library(GenomicFeatures)
txdb <- makeTxDbFromGFF(...)  # or use makeTxDbFromUCSC(...),
# or makeTxDbFromEnsembl(...), etc...

## Then extract the exons grouped by gene from the TxDb object:
ex_by_gene <- exonsBy(txdb, by="gene")


#### 3. Drop the reads you don't want

You will typically use subsetByOverlaps() for this. However, for more control, and depending on what "reads that completely aligned onto the legitimate exon regions" means, you might need to use findOverlaps(), directly on reads and ex_by_gene, or on unlist(reads) and unlist(ex_by_gene), and be prepared for some direct manipulation of Hits objects. This could get tricky but in the end you should be able to keep only the reads that you want.

#### 4. Compute the coverage of the remaining reads

cvg <- coverage(reads)


cvg will be an RleList object with one list element per chromosome in the genome.

#### 5. Finally extract the coverage of each exon

all_exons <- unlist(ex_by_gene)
cvg_by_ex <- cvg[all_exons]


cvg_by_ex will be an RleList object with one list element per exon.

Propagate the exon names:

names(cvg_by_ex) <- mcols(all_exons)$exon_name ## Or, alternatively, if the exon names are problematic, propagate the ## exon ids: names(cvg_by_ex) <- mcols(all_exons)$exon_id


Hope this helps,

H.

Hi Herve,

Thanks for your helpful comments. There is one issue here (If I misunderstand your comment, please correct me). In the genome there are genes that sit on the opposite strands, but in overlapping genomic region. Many of the existing RNA-seq data do not distinguish which strand they originated from. If we calculate the coverage first and then extract exons, in such regions we could have double counting issue. If you meant in step 3, that we should have dropped such reads, then it should work. We also wrote our own codes which have realized this. But it's slow and I am trying to utilize the existing bioconductor package/functions to speed it up.

Thanks so much for help again!

The mode argument controls how summarizeOverlaps() filters out reads. This argument takes a callback function that must have the following arguments: features, reads, ignore.strand, inter.feature, and must return a vector of counts parallel to the features argument (i.e. with 1 count per feature). There are 3 predefined callback functions: Union(), IntersectionStrict(), IntersectionNotEmpty() (see ?summarizeOverlaps() for more information).

For example Union is defined as follow:

> Union
function (features, reads, ignore.strand = FALSE, inter.feature = TRUE)
{
ov <- findOverlaps(features, reads, ignore.strand = ignore.strand)
if (inter.feature) {
}
countQueryHits(ov)
}
<bytecode: 0x1640d388>
<environment: namespace:GenomicAlignments>


So if the filtering performed by these modes is what you are after but you don't want the last step (the counting one), just do what they do before that and skip the counting step. For example to filter your reads in the same way as Union(), use a modified version of Union():

Union2 <- function(features, reads, ignore.strand=FALSE, inter.feature=TRUE)
{
ov <- findOverlaps(features, reads, ignore.strand = ignore.strand)
if (inter.feature) {
}
ov
}


Note that Union2() is to be used in the context of your transcript/exon coverage problem, NOT with summarizeOverlaps() where it would fail to work because it doesn't return a vector of counts.

Finally I should mention the coverageByTranscript() function defined in the GenomicFeatures package. It will compute the read coverage by transcript but hopefully it should not be too hard to extract the read coverage by exon from the RleList object returned by coverageByTranscript(). See ?coverageByTranscript (the man page has many examples).

H.

Did that help?

Hi, thanks again! I was away for a few weeks and didn't have access to this. I will try it out and let you know. Thanks so much for your help!!

Update:

I was able to use these functions to get the coverage calculation for single-end data. However I got new problems when dealing with paired-end read using readGAlignmentPairs function. I tested on the data the package comes with (i.e. extdata), no problem, this function returns the paired end data. However when I applied to my own paired-end data, I got no records for some reason. Could you provide any hints?

> library(Rsamtools)
> library("GenomicAlignments")
> setwd("~/degnorm")
> bamfile="SRR873822.bam"
> ##testing whether single or paired-end
> testPairedEndBam(bamfile)
[1] TRUE
>
> ##subsetting the bam file for test
> which <- GRanges(names(t), IRanges(1, unname(t)))
>
> ##only look at chrI
> param <- ScanBamParam(which=which[1], what=character(), flag=scanBamFlag(isProperPair=TRUE,
+                                                                          isDuplicate=FALSE,
+                                                                          isSecondaryAlignment=FALSE))
> galp2 <- readGAlignmentPairs(bamfile, use.names=TRUE, param=param)
> galp2
GAlignmentPairs object with 0 pairs, strandMode=1, and 0 metadata columns:
seqnames strand :    ranges --    ranges
<Rle>  <Rle> : <IRanges> -- <IRanges>
-------
seqinfo: 25 sequences from an unspecified genome


Hi,

For coverageByTranscript(), does it distinguish the three different methods, i.e., Union(), IntersectionStrict(), IntersectionNotEmpty(). What I want is to use IntersectionStrict() which only keeps the compatible reads with transcritome that only aligns to one unique gene. I saw the coverageByTrnacript method allows to keep compatible reads but it does not exclude reads that aligns to shared region of multiple genes. Is that right? Many thanks again!

jiping