Question

How to calculate CDS start and End per transcript from GTF file

0

Entering edit mode

rohitsatyam102 ▴ 20

@rohitsatyam102-24390

Last seen 23 days ago

India

Hi Everyone

Posting this query for more visibility since Hervé Pagès already answered this.

Can anyone here tell if there is an R package for calculating dN/dS or pN/pS based on SNPEff annotated VCF and reference FASTA and GTF file. I am aware that there is a package called dNdScv but it is highly biased toward using annotation from Ensembl and idk how to calculate CDS start and End per transcript from GTF file.

dNdScv GenomicFeatures TxDb • 1.1k views

ADD COMMENT • link 20 months ago rohitsatyam102 ▴ 20

score 0 · Answer 1 · 2023-10-27

Answer by Hervé Pagès (Copied verbatim from Bioconductor slack channel)

idk how to calculate CDS start and End per transcript from GTF file.

Import the file as a TxDb object with GenomicFeatures::makeTxDbFromGFF(), then do (assuming txdb is your TxDb object):

tx <- transcripts(txdb)

## Add 'cds_start' and 'cds_end' metadata columns:

tx_lens <- transcriptLengths(txdb, with.cds_len=TRUE,
                                   with.utr5_len=TRUE,
                                   with.utr3_len=TRUE)
# sanity check
stopifnot(identical(tx_lens$tx_id, mcols(tx)$tx_id))  

cds_start <- start(tx) +
             ifelse(strand(tx) == "+", tx_lens$utr5_len, tx_lens$utr3_len)
cds_end   <- end(tx)   -
             ifelse(strand(tx) == "+", tx_lens$utr3_len, tx_lens$utr5_len)

no_cds_idx <- which(tx_lens$cds_len == 0L)
cds_start[no_cds_idx] <- NA
cds_end[no_cds_idx]   <- NA

mcols(tx)$cds_start <- cds_start
mcols(tx)$cds_end   <- cds_end

This is what tx looks like with the 2 additional metadata cols:

> tx
GRanges object with 29173 ranges and 4 metadata columns:
           seqnames        ranges strand |     tx_id     tx_name cds_start
              <Rle>     <IRanges>  <Rle> | <integer> <character> <integer>
      [1]     chr2L     7529-9484      + |         1 FBtr0300689      7680
      [2]     chr2L     7529-9484      + |         2 FBtr0300690      7680
      [3]     chr2L     7529-9484      + |         3 FBtr0330654      7680
      [4]     chr2L   21952-24237      + |         4 FBtr0309810      <NA>
      [5]     chr2L   66584-71390      + |         5 FBtr0306539     67116
      ...       ...           ...    ... .       ...         ...       ...
  [29169]   chrYHet 319739-320997      - |     29169 FBtr0114244      <NA>
  [29170]   chrYHet 327052-328489      - |     29170 FBtr0114245      <NA>
  [29171] chrUextra 523024-523048      - |     29171 FBtr0330363      <NA>
  [29172] chrUextra 523024-523086      - |     29172 FBtr0330361      <NA>
  [29173] chrUextra 523060-523086      - |     29173 FBtr0330362      <NA>
            cds_end
          <integer>
      [1]      8610
      [2]      9276
      [3]      8610
      [4]      <NA>
      [5]     70895
      ...       ...
  [29169]      <NA>
  [29170]      <NA>
  [29171]      <NA>

I wish there was a simpler way though. I'll think about adding something to GenomicFeatures.