How to calculate CDS start and End per transcript from GTF file
1
0
Entering edit mode
@rohitsatyam102-24390
Last seen 5 weeks ago
India

Hi Everyone

Posting this query for more visibility since Hervé Pagès already answered this.

Can anyone here tell if there is an R package for calculating dN/dS or pN/pS based on SNPEff annotated VCF and reference FASTA and GTF file. I am aware that there is a package called dNdScv but it is highly biased toward using annotation from Ensembl and idk how to calculate CDS start and End per transcript from GTF file.

dNdScv GenomicFeatures TxDb • 785 views
ADD COMMENT
0
Entering edit mode
@rohitsatyam102-24390
Last seen 5 weeks ago
India

Answer by Hervé Pagès (Copied verbatim from Bioconductor slack channel)

idk how to calculate CDS start and End per transcript from GTF file.

Import the file as a TxDb object with GenomicFeatures::makeTxDbFromGFF(), then do (assuming txdb is your TxDb object):

tx <- transcripts(txdb)

## Add 'cds_start' and 'cds_end' metadata columns:

tx_lens <- transcriptLengths(txdb, with.cds_len=TRUE,
                                   with.utr5_len=TRUE,
                                   with.utr3_len=TRUE)
# sanity check
stopifnot(identical(tx_lens$tx_id, mcols(tx)$tx_id))  

cds_start <- start(tx) +
             ifelse(strand(tx) == "+", tx_lens$utr5_len, tx_lens$utr3_len)
cds_end   <- end(tx)   -
             ifelse(strand(tx) == "+", tx_lens$utr3_len, tx_lens$utr5_len)

no_cds_idx <- which(tx_lens$cds_len == 0L)
cds_start[no_cds_idx] <- NA
cds_end[no_cds_idx]   <- NA

mcols(tx)$cds_start <- cds_start
mcols(tx)$cds_end   <- cds_end

This is what tx looks like with the 2 additional metadata cols:

> tx
GRanges object with 29173 ranges and 4 metadata columns:
           seqnames        ranges strand |     tx_id     tx_name cds_start
              <Rle>     <IRanges>  <Rle> | <integer> <character> <integer>
      [1]     chr2L     7529-9484      + |         1 FBtr0300689      7680
      [2]     chr2L     7529-9484      + |         2 FBtr0300690      7680
      [3]     chr2L     7529-9484      + |         3 FBtr0330654      7680
      [4]     chr2L   21952-24237      + |         4 FBtr0309810      <NA>
      [5]     chr2L   66584-71390      + |         5 FBtr0306539     67116
      ...       ...           ...    ... .       ...         ...       ...
  [29169]   chrYHet 319739-320997      - |     29169 FBtr0114244      <NA>
  [29170]   chrYHet 327052-328489      - |     29170 FBtr0114245      <NA>
  [29171] chrUextra 523024-523048      - |     29171 FBtr0330363      <NA>
  [29172] chrUextra 523024-523086      - |     29172 FBtr0330361      <NA>
  [29173] chrUextra 523060-523086      - |     29173 FBtr0330362      <NA>
            cds_end
          <integer>
      [1]      8610
      [2]      9276
      [3]      8610
      [4]      <NA>
      [5]     70895
      ...       ...
  [29169]      <NA>
  [29170]      <NA>
  [29171]      <NA>

I wish there was a simpler way though. I'll think about adding something to GenomicFeatures.

ADD COMMENT

Login before adding your answer.

Traffic: 533 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6