Search
Question: Peaks in coverageByTranscript {GenomicFeatures}
0
14 months ago by
manuel.goepferich0 wrote:

Hi,

for my project it would be very important to know how the function

coverageByTranscript {GenomicFeatures}

really works. For many coverage plots (coverage of a 3'UTR) I get no step functions but rather spikes.

Is the whole read considered as coverage or just the end or start (How many base pairs)?

Shall I include an example?

P.S.: I am using paired-end reads.

modified 14 months ago by Hervé Pagès ♦♦ 13k • written 14 months ago by manuel.goepferich0
1
14 months ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:

Hi Manuel,

The coverageByTranscript function has a man page with extensive examples. Did you check it? Only the parts of a read that overlap with a transcript participate to the coverage for this transcript. I just added a simple example to the man page to illustrate this (this is in GenomicFeatures 1.27.15). Repeating it here:

First, let's get a transcript

library(GenomicFeatures)
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
dm3_transcripts <- exonsBy(txdb, by="tx", use.names=TRUE)
my_transcript <- dm3_transcripts["FBtr0300689"]
my_transcript
# GRangesList object of length 1:
# FBtr0300689 # GRanges object with 2 ranges and 3 metadata columns: # seqnames ranges strand | exon_id exon_rank # <Rle> <IRanges> <Rle> | <integer> <integer> # [1] chr2L [7529, 8116] + | 1 1 # [2] chr2L [8193, 9484] + | 3 2 Then let's create 3 artificial aligned reads We represent them as a GRanges object of length 3 that contains the genomic positions of the 3 reads. Note that these reads are simple alignments i.e. each of them can be represented with a single range. This would not be the case if they were junction reads. my_reads <- GRanges(c("chr2L:7531-7630", "chr2L:8101-8200", "chr2L:8141-8240")) Compute the coverage on the reference genome coverage(my_reads) # RleList of length 1 #chr2L
# integer-Rle of length 8240 with 6 runs
#   Lengths: 7530  100  470   40   60   40
#   Values :    0    1    0    1    2    1

As you can see, all the genomic positions in the 3 ranges participate to the coverage. This can be confirmed by comparing:

> sum(coverage(my_reads))
chr2L
300

with:

> sum(width(my_reads))
[1] 300


They are the same.

Compute the coverage on the transcript

The 1st read is fully contained within the 1st exon:

coverageByTranscript(my_reads[1], my_transcript)
# RleList of length 1
# $FBtr0300689 # integer-Rle of length 1880 with 3 runs # Lengths: 2 100 1778 # Values : 0 1 0 Note that the length of the Rle (1880) is the length of the transcript. The 2nd and 3rd reads overlap the 2 exons and the intron. Only the parts that overlap the exons participate to coverage: coverageByTranscript(my_reads[2], my_transcript) # RleList of length 1 #$FBtr0300689
# integer-Rle of length 1880 with 3 runs
#   Lengths:  572   24 1284
#   Values :    0    1    0

Hope this helps,

H.

Thanks. It is definitely part of the answer. I have checked the explanations & the help pages. I will give an example for a decoded coverage vector (paired-end reads, star-mapping, mouse-genome, single cell, start of a 3'UTR):

62 62 62 59 53 53 53 53 52 50 50 50 50 45 45 45 45 45 42 42 42 43 49 49 50 50 49 49 49 48 48 48 50 50 50 46 46 ... 2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  1  0  0  0  0

What is the explanation of 'peaks' (fluctuation of coverage every 2 - 5 base pairs) in this coverage vectors? Two explanations: (I) The reads were mapped to another part of the gene (intron, exon etc.) and only a small part of the read (sometimes 3 positions) are contributing to the coverage (because the quality of the read is locally low?) (II) There was a problem with my decoding or workflow (I don not think so.)

Best regards

Manuel

1

Not sure that I can be of much help. But don't we see the same kind of pattern here with the dm3 transcript that receives the highest coverage from the untreated1_chr4 data set from the pasillaBamSubset package?

library(pasillaBamSubset)
library(GenomicAlignments)
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
dm3_transcripts <- exonsBy(txdb, by="tx", use.names=TRUE)
which.max(sum(tx_cvg))
# FBtr0308296
#       23775 

Coverage on this transcript:

tx_cvg$FBtr0308296 # integer-Rle of length 1349 with 1030 runs # Lengths: 1 1 2 1 1 1 ... 25 3 4 2 115 # Values : 105 110 111 116 117 118 ... 4 3 2 1 0 The coverage profile for this transcript can be plotted with: plot(as.vector(runmean(tx_cvg$FBtr0308296, 50)))

I don't have access to yours so I can't really compare but maybe you can.

H.

What I did was something like this (and now it looks similar to me coverage vectors):

plot(decode(tx_cvg\$FBtr0308296))

The story goes as follows: My supervisor asked me where this wiggly shapes of coverage vectors are coming from?

I see that with 'runmean' the coverage vectors will be smoothed.

Do you know someone who could explain this effect (of wiggly coverage vectors)?

I'm not a statistician but my understanding is that the distribution of reads along the reference genome has a noise component to it. There are various causes for this noise, I'm not an expert, but one of them is a per-position bias due to the nucleotide content. Could the wiggly aspect of the coverage profile just reflect the intrinsically noisy nature of HTS experiments?

H.

BTW a more efficient/accurate/meaningful way to obtain the coverage by transcript would be to align the reads to the transcriptome instead of the reference genome. Then you can just use coverage() instead of coverageByTranscript() on the resulting BAM file. I've never tried it but that's what I would do if I wanted reliable transcript coverage. It has the advantage of counting only the reads that are compatible with the transcript (i.e. reads that can possibly originate from that transcript).

Note that it's still possible to count transcript-compatible reads when working with reads aligned to the reference genome. For example it can be done with pcoverageByTranscript() (the man page shows how to do this). But aligning the reads to the transcriptome first is more straightforward and reliable. It doesn't even require a splice-aware aligner (and if you use a splice-aware aligner, you should turn that feature off) so the mapping process is simpler and more reliable.

H.

0
14 months ago by
United States
James W. MacDonald46k wrote:

This function is hidden in the namespace. You could hypothetically do

library(GenomicFeatures)

getAnywhere(coverageByTranscript)

To see the function body. An arguably better way to proceed would be to go to the GitHub mirror for this package and look at the source files. Or you could check out from subversion, using 'readonly' for the username and password. But do note that the repositories are going to be changing soon, so the subversion repo and the GitHub mirror will be going away. I suppose an alternative that will remain is to get the package source from the bottom of the page here and untar and unzip.