Extracting Coordinates of startcodon from Grangeslist
1
0
Entering edit mode
@walter-f-baumann-12439
Last seen 3.9 years ago

Hi,

I work with ribosome profiling data and I want to check the coverage of reads on the  start codon.

My start is with an GRAngesList object I obtained from cdsBy(txdb, by=„tx).

In the end I want to obtain a GrangesList object with transcript id as names, and start of startcodon and stop of startcodon (+3 nts), which I then can put in the summarizeOverlaps to count features per region.

My question now is what the most efficient way of proceeding is. I though about writing an lapply function which loops through all elements in the GRangesList, searches the start of the cds (depending on strand orientation), and then built a new GRangesList object.

However, as I know R there must be an easier way to proceed. biomaRt from what I saw does not provide the genomic coordinates of the start codon. Most Ribosome profiling tools like Riboprofiling or riboseqR seem not to provide this kind of function.

3
Entering edit mode
@herve-pages-1542
Last seen 2 days ago
Seattle, WA, United States

Hi Walter,

If no start codon crosses an intron, then you can extract their coordinates with:

cds_by_tx <- cdsBy(txdb, by="tx", use.names=TRUE)
start_codon <- promoters(first_cds, upstream=0, downstream=3)

start_codon is a GRanges object parallel to cds_by_tx i.e. with one range per transcript in cds_by_tx.

Unfortunately some start codons can cross an intron. For example, if txdb is TxDb.Hsapiens.UCSC.hg19.knownGene, 186 start codons cross an intron:

sum(width(first_cds) < 3)
# [1] 186

For these start codons, the coordinates we obtained in start_codon are incorrect. Getting them right involves some complications.

Note that a simpler way to get the coverage of reads on the start codons is to compute their coverage on the processed cds (i.e. on the cds sequences after removal of the introns). This can be done with something like:

cds_cvg <- coverageByTranscript(reads, cds_by_tx, ignore.strand=TRUE)

cds_cvg will be an RleList object parallel to cds_by_tx where each list element is an integer-Rle object representing the coverage on a processed cds. You can then extract the coverage on the start codons with phead(cds_cvg, 3). With this approach, you don't need to compute the coordinates of the start codons. Also it does the right thing whether a start codon crosses an intron or not. See ?coverageByTranscript in the GenomicFeatures package for more information.

Cheers,

H.