We recently did some work on correction for RNA degradation bias for RNA-seq data. https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1682-7. This requires us to calculate the RNA-seq read coverage score on all exon regions of every genes. I wonder whether Rsamtools or GenomicAlignments can do this job. Here are two complexities:
Does this package provide functionality of read coverage for each gene, only on the exon region? I understand like Samtools it can calculate the coverage score on the genome. But for RNA-seq analysis, we want to focus on reads count and coverage that completely aligned onto the legitimate exon regions. For example, some reads might partially align to intron due to incomplete splicing, those reads should be opted out before calculation of the read coverage score.
Some reads could be aligned to multiple genes but at unique genomic location, if the genes are on different strands and overlap (which happens pretty often). To avoid ambiguity, we should drop such reads as well (this was similarly done in HTSeq for counting reads, but HTseq does not calculate the coverage). Does either package take this into consideration for constructing the coverage score for RNA-seq data specifically?
thanks for help.
Jiping Wang
Hi Herve,
Thanks for your helpful comments. There is one issue here (If I misunderstand your comment, please correct me). In the genome there are genes that sit on the opposite strands, but in overlapping genomic region. Many of the existing RNA-seq data do not distinguish which strand they originated from. If we calculate the coverage first and then extract exons, in such regions we could have double counting issue. If you meant in step 3, that we should have dropped such reads, then it should work. We also wrote our own codes which have realized this. But it's slow and I am trying to utilize the existing bioconductor package/functions to speed it up.
I noticed Valerie Obenchain wrote a note on using SummarizeOverlaps method from GenomicAlignment package to get read count per gene (as HTseq does). The read filtering criterion is exactly what I want. But In addition to the read count per gene, we need calculate the coverage score in the entire transcript for each gene using the same rule as used in HTseq. The summarizeOverlaps could do this job if it also outputs which reads/or pairs were counted toward which gene (in additions to a only count of total # reads). I read carefully about this package, but it seems that no such output is available from this function. I just wonder whether there is any way to get around it to get the list of remaining reads after filtering.
Thanks so much for help again!
The
mode
argument controls howsummarizeOverlaps()
filters out reads. This argument takes a callback function that must have the following arguments:features
,reads
,ignore.strand
,inter.feature
, and must return a vector of counts parallel to thefeatures
argument (i.e. with 1 count per feature). There are 3 predefined callback functions:Union()
,IntersectionStrict()
,IntersectionNotEmpty()
(see?summarizeOverlaps()
for more information).For example
Union
is defined as follow:So if the filtering performed by these modes is what you are after but you don't want the last step (the counting one), just do what they do before that and skip the counting step. For example to filter your reads in the same way as
Union()
, use a modified version ofUnion()
:Note that
Union2()
is to be used in the context of your transcript/exon coverage problem, NOT withsummarizeOverlaps()
where it would fail to work because it doesn't return a vector of counts.Finally I should mention the
coverageByTranscript()
function defined in the GenomicFeatures package. It will compute the read coverage by transcript but hopefully it should not be too hard to extract the read coverage by exon from the RleList object returned bycoverageByTranscript()
. See?coverageByTranscript
(the man page has many examples).H.
Did that help?
Hi, thanks again! I was away for a few weeks and didn't have access to this. I will try it out and let you know. Thanks so much for your help!!
Update:
I was able to use these functions to get the coverage calculation for single-end data. However I got new problems when dealing with paired-end read using readGAlignmentPairs function. I tested on the data the package comes with (i.e. extdata), no problem, this function returns the paired end data. However when I applied to my own paired-end data, I got no records for some reason. Could you provide any hints?
Please format your code following the guidance here https://support.bioconductor.org/p/117436/ so that your 'code chunks are legible.
Hi,
For coverageByTranscript(), does it distinguish the three different methods, i.e., Union(), IntersectionStrict(), IntersectionNotEmpty(). What I want is to use IntersectionStrict() which only keeps the compatible reads with transcritome that only aligns to one unique gene. I saw the coverageByTrnacript method allows to keep compatible reads but it does not exclude reads that aligns to shared region of multiple genes. Is that right? Many thanks again!
jiping