Question: How to use Rsamtool or GenomicAlignments to calculate RAN-seq read coverage curves
0
gravatar for Jiping Wang
17 days ago by
Jiping Wang10
Jiping Wang10 wrote:

We recently did some work on correction for RNA degradation bias for RNA-seq data. https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1682-7. This requires us to calculate the RNA-seq read coverage score on all exon regions of every genes. I wonder whether Rsamtools or GenomicAlignments can do this job. Here are two complexities:

  1. Does this package provide functionality of read coverage for each gene, only on the exon region? I understand like Samtools it can calculate the coverage score on the genome. But for RNA-seq analysis, we want to focus on reads count and coverage that completely aligned onto the legitimate exon regions. For example, some reads might partially align to intron due to incomplete splicing, those reads should be opted out before calculation of the read coverage score.

  2. Some reads could be aligned to multiple genes but at unique genomic location, if the genes are on different strands and overlap (which happens pretty often). To avoid ambiguity, we should drop such reads as well (this was similarly done in HTSeq for counting reads, but HTseq does not calculate the coverage). Does either package take this into consideration for constructing the coverage score for RNA-seq data specifically?

thanks for help.

Jiping Wang

ADD COMMENTlink modified 12 days ago by Hervé Pagès ♦♦ 14k • written 17 days ago by Jiping Wang10
Answer: How to use Rsamtool or GenomicAlignments to calculate RAN-seq read coverage curv
1
gravatar for Hervé Pagès
12 days ago by
Hervé Pagès ♦♦ 14k
United States
Hervé Pagès ♦♦ 14k wrote:

Hi Jiping,

No high level tool for doing exactly that with Rsamtools or GenomicAlignments, AFAIK, but you should be able to use some lower level functionalities to get exactly what you want. Roughly, the following steps should get you there:

1. Load you reads in a GRangesList object

library(GenomicAlignments)
reads <- readGAlignments(...)
reads <- as(reads, "GRangesList")

2. Get your exons grouped by gene in a GRangesList object

## Check what TxDb packages are available:
BiocManager::available("TxDb")

## If no one suits your need, construct your own TxDb object:
library(GenomicFeatures)
txdb <- makeTxDbFromGFF(...)  # or use makeTxDbFromUCSC(...),
                              # or makeTxDbFromEnsembl(...), etc...

## Then extract the exons grouped by gene from the TxDb object:
ex_by_gene <- exonsBy(txdb, by="gene")

3. Drop the reads you don't want

You will typically use subsetByOverlaps() for this. However, for more control, and depending on what "reads that completely aligned onto the legitimate exon regions" means, you might need to use findOverlaps(), directly on reads and ex_by_gene, or on unlist(reads) and unlist(ex_by_gene), and be prepared for some direct manipulation of Hits objects. This could get tricky but in the end you should be able to keep only the reads that you want.

4. Compute the coverage of the remaining reads

cvg <- coverage(reads)

cvg will be an RleList object with one list element per chromosome in the genome.

5. Finally extract the coverage of each exon

all_exons <- unlist(ex_by_gene)
cvg_by_ex <- cvg[all_exons]

cvg_by_ex will be an RleList object with one list element per exon.

Propagate the exon names:

names(cvg_by_ex) <- mcols(all_exons)$exon_name
## Or, alternatively, if the exon names are problematic, propagate the
## exon ids:
names(cvg_by_ex) <- mcols(all_exons)$exon_id

Hope this helps,

H.

ADD COMMENTlink modified 12 days ago • written 12 days ago by Hervé Pagès ♦♦ 14k

Hi Herve,

Thanks for your helpful comments. There is one issue here (If I misunderstand your comment, please correct me). In the genome there are genes that sit on the opposite strands, but in overlapping genomic region. Many of the existing RNA-seq data do not distinguish which strand they originated from. If we calculate the coverage first and then extract exons, in such regions we could have double counting issue. If you meant in step 3, that we should have dropped such reads, then it should work. We also wrote our own codes which have realized this. But it's slow and I am trying to utilize the existing bioconductor package/functions to speed it up.

I noticed Valerie Obenchain wrote a note on using SummarizeOverlaps method from GenomicAlignment package to get read count per gene (as HTseq does). The read filtering criterion is exactly what I want. But In addition to the read count per gene, we need calculate the coverage score in the entire transcript for each gene using the same rule as used in HTseq. The summarizeOverlaps could do this job if it also outputs which reads/or pairs were counted toward which gene (in additions to a only count of total # reads). I read carefully about this package, but it seems that no such output is available from this function. I just wonder whether there is any way to get around it to get the list of remaining reads after filtering.

Thanks so much for help again!

ADD REPLYlink written 11 days ago by Jiping Wang10

The mode argument controls how summarizeOverlaps() filters out reads. This argument takes a callback function that must have the following arguments: features, reads, ignore.strand, inter.feature, and must return a vector of counts parallel to the features argument (i.e. with 1 count per feature). There are 3 predefined callback functions: Union(), IntersectionStrict(), IntersectionNotEmpty() (see ?summarizeOverlaps() for more information).

For example Union is defined as follow:

> Union
function (features, reads, ignore.strand = FALSE, inter.feature = TRUE) 
{
    ov <- findOverlaps(features, reads, ignore.strand = ignore.strand)
    if (inter.feature) {
        reads_to_keep <- which(countSubjectHits(ov) == 1L)
        ov <- ov[subjectHits(ov) %in% reads_to_keep]
    }
    countQueryHits(ov)
}
<bytecode: 0x1640d388>
<environment: namespace:GenomicAlignments>

So if the filtering performed by these modes is what you are after but you don't want the last step (the counting one), just do what they do before that and skip the counting step. For example to filter your reads in the same way as Union(), use a modified version of Union():

Union2 <- function(features, reads, ignore.strand=FALSE, inter.feature=TRUE) 
{
    ov <- findOverlaps(features, reads, ignore.strand = ignore.strand)
    if (inter.feature) {
        reads_to_keep <- which(countSubjectHits(ov) == 1L)
        ov <- ov[subjectHits(ov) %in% reads_to_keep]
    }
    ov
}

Note that Union2() is to be used in the context of your transcript/exon coverage problem, NOT with summarizeOverlaps() where it would fail to work because it doesn't return a vector of counts.

Finally I should mention the coverageByTranscript() function defined in the GenomicFeatures package. It will compute the read coverage by transcript but hopefully it should not be too hard to extract the read coverage by exon from the RleList object returned by coverageByTranscript(). See ?coverageByTranscript (the man page has many examples).

H.

ADD REPLYlink modified 5 days ago • written 7 days ago by Hervé Pagès ♦♦ 14k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 284 users visited in the last hour