Answer by Hervé Pagès (Copied verbatim from Bioconductor slack channel)
idk how to calculate CDS start and End per transcript from GTF file.
Import the file as a TxDb object with GenomicFeatures::makeTxDbFromGFF(), then do (assuming txdb is your TxDb object):
tx <- transcripts(txdb)
## Add 'cds_start' and 'cds_end' metadata columns:
tx_lens <- transcriptLengths(txdb, with.cds_len=TRUE,
with.utr5_len=TRUE,
with.utr3_len=TRUE)
# sanity check
stopifnot(identical(tx_lens$tx_id, mcols(tx)$tx_id))
cds_start <- start(tx) +
ifelse(strand(tx) == "+", tx_lens$utr5_len, tx_lens$utr3_len)
cds_end <- end(tx) -
ifelse(strand(tx) == "+", tx_lens$utr3_len, tx_lens$utr5_len)
no_cds_idx <- which(tx_lens$cds_len == 0L)
cds_start[no_cds_idx] <- NA
cds_end[no_cds_idx] <- NA
mcols(tx)$cds_start <- cds_start
mcols(tx)$cds_end <- cds_end
This is what tx
looks like with the 2 additional metadata cols:
> tx
GRanges object with 29173 ranges and 4 metadata columns:
seqnames ranges strand | tx_id tx_name cds_start
<Rle> <IRanges> <Rle> | <integer> <character> <integer>
[1] chr2L 7529-9484 + | 1 FBtr0300689 7680
[2] chr2L 7529-9484 + | 2 FBtr0300690 7680
[3] chr2L 7529-9484 + | 3 FBtr0330654 7680
[4] chr2L 21952-24237 + | 4 FBtr0309810 <NA>
[5] chr2L 66584-71390 + | 5 FBtr0306539 67116
... ... ... ... . ... ... ...
[29169] chrYHet 319739-320997 - | 29169 FBtr0114244 <NA>
[29170] chrYHet 327052-328489 - | 29170 FBtr0114245 <NA>
[29171] chrUextra 523024-523048 - | 29171 FBtr0330363 <NA>
[29172] chrUextra 523024-523086 - | 29172 FBtr0330361 <NA>
[29173] chrUextra 523060-523086 - | 29173 FBtr0330362 <NA>
cds_end
<integer>
[1] 8610
[2] 9276
[3] 8610
[4] <NA>
[5] 70895
... ...
[29169] <NA>
[29170] <NA>
[29171] <NA>
I wish there was a simpler way though. I'll think about adding something to GenomicFeatures.