Question: Extracting Coordinates of startcodon from Grangeslist
gravatar for Walter F. Baumann
5 months ago by
Walter F. Baumann 10 wrote:


I work with ribosome profiling data and I want to check the coverage of reads on the  start codon. 

My start is with an GRAngesList object I obtained from cdsBy(txdb, by=„tx). 

In the end I want to obtain a GrangesList object with transcript id as names, and start of startcodon and stop of startcodon (+3 nts), which I then can put in the summarizeOverlaps to count features per region. 

My question now is what the most efficient way of proceeding is. I though about writing an lapply function which loops through all elements in the GRangesList, searches the start of the cds (depending on strand orientation), and then built a new GRangesList object. 

However, as I know R there must be an easier way to proceed. biomaRt from what I saw does not provide the genomic coordinates of the start codon. Most Ribosome profiling tools like Riboprofiling or riboseqR seem not to provide this kind of function. 

Thanks for any advice!

ADD COMMENTlink modified 5 months ago by Hervé Pagès ♦♦ 13k • written 5 months ago by Walter F. Baumann 10
gravatar for Hervé Pagès
5 months ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:

Hi Walter,

If no start codon crosses an intron, then you can extract their coordinates with:

cds_by_tx <- cdsBy(txdb, by="tx", use.names=TRUE)
first_cds <-  unlist(phead(cds_by_tx, 1))
start_codon <- promoters(first_cds, upstream=0, downstream=3)

start_codon is a GRanges object parallel to cds_by_tx i.e. with one range per transcript in cds_by_tx.

Unfortunately some start codons can cross an intron. For example, if txdb is TxDb.Hsapiens.UCSC.hg19.knownGene, 186 start codons cross an intron:

sum(width(first_cds) < 3)
# [1] 186

For these start codons, the coordinates we obtained in start_codon are incorrect. Getting them right involves some complications.

Note that a simpler way to get the coverage of reads on the start codons is to compute their coverage on the processed cds (i.e. on the cds sequences after removal of the introns). This can be done with something like:

cds_cvg <- coverageByTranscript(reads, cds_by_tx, ignore.strand=TRUE)

cds_cvg will be an RleList object parallel to cds_by_tx where each list element is an integer-Rle object representing the coverage on a processed cds. You can then extract the coverage on the start codons with phead(cds_cvg, 3). With this approach, you don't need to compute the coordinates of the start codons. Also it does the right thing whether a start codon crosses an intron or not. See ?coverageByTranscript in the GenomicFeatures package for more information.



ADD COMMENTlink modified 5 months ago • written 5 months ago by Hervé Pagès ♦♦ 13k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 297 users visited in the last hour