CDS phase information missing from GenomicFeatures
1
1
Entering edit mode
arendsee ▴ 10
@arendsee-14118
Last seen 7.1 years ago
Iowa State University

The GenomicFeatures library allows gene models to be stored in an SQLite database.

Here is the full SQLite schema

CREATE TABLE chrominfo (
    _chrom_id INTEGER PRIMARY KEY,
    chrom TEXT UNIQUE NOT NULL,
    length INTEGER NULL,
    is_circular INTEGER NULL
);
CREATE TABLE transcript (
    _tx_id INTEGER PRIMARY KEY,
    tx_name TEXT NULL,
    tx_type TEXT NULL,
    tx_chrom TEXT NOT NULL,
    tx_strand TEXT NOT NULL,
    tx_start INTEGER NOT NULL,
    tx_end INTEGER NOT NULL,
    FOREIGN KEY (tx_chrom) REFERENCES chrominfo (chrom)
);
CREATE TABLE exon (
    _exon_id INTEGER PRIMARY KEY,
    exon_name TEXT NULL,
    exon_chrom TEXT NOT NULL,
    exon_strand TEXT NOT NULL,
    exon_start INTEGER NOT NULL,
    exon_end INTEGER NOT NULL,
    FOREIGN KEY (exon_chrom) REFERENCES chrominfo (chrom)
);
CREATE TABLE cds (
    _cds_id INTEGER PRIMARY KEY,
    cds_name TEXT NULL,
    cds_chrom TEXT NOT NULL,
    cds_strand TEXT NOT NULL,
    cds_start INTEGER NOT NULL,
    cds_end INTEGER NOT NULL,
    FOREIGN KEY (cds_chrom) REFERENCES chrominfo (chrom)
);
CREATE TABLE splicing (
    _tx_id INTEGER NOT NULL,
    exon_rank INTEGER NOT NULL,
    _exon_id INTEGER NOT NULL,
    _cds_id INTEGER NULL,
    UNIQUE (_tx_id, exon_rank),
    FOREIGN KEY (_tx_id) REFERENCES transcript,
    FOREIGN KEY (_exon_id) REFERENCES exon,
    FOREIGN KEY (_cds_id) REFERENCES cds
);
CREATE TABLE gene (
    gene_id TEXT NOT NULL,
    _tx_id INTEGER NOT NULL,
    UNIQUE (gene_id, _tx_id),
    FOREIGN KEY (_tx_id) REFERENCES transcript
);
CREATE TABLE `metadata` (
    `name` TEXT,
    `value` TEXT
);

However, one important field seems to be missing: CDS phase. Phase specifies how many nucleotides must be removed from the beginning of a CDS to reach a complete codon (0, 1 or 2). If phase is missing, the CDS cannot be translated without making dangerous assumptions about the data. Specifically, you have to assume that the first CDS has a phase of 0. For poorly assembled genomes, this may not by the case.

Even if the assumption always holds, a CDS without phase data can be hard to work with since you cannot translate an individual CDS without considering the other members of the model.

According to the GFF specification, phase information is required for CDS entries and is stored in the 8th column. Here is an example of an incomplete gene model (from Saccharomyces uvarum):

# example.gff
AACA01001085.1  AUGUSTUS  CDS   1  289  .     +  1  Parent=g1.t1
AACA01001085.1  AUGUSTUS  exon  1  289  1.00  +  .  Parent=g1.t1
AACA01001085.1  AUGUSTUS  gene  1  289  1     +  .  g1
AACA01001085.1  AUGUSTUS  mRNA  1  289  .     +  .  ID=g1.t1;Other=g1

Where AACA01001085.1 is the short scaffold:

>AACA01001085.1
GTGGGGTTTCATGATTAGTTTCTTCACTCCTTTCATCACCGGCGCAATCAACTTTTACTA
TGGCTATGTCTTCATTGGCTGTCTATGTGTTGCCTACTTTTacgtctttttctttgtgcC
AGAAACGAAAGGCCTGACATTAGAAGAGGTCGACACTATGTGGCTGGAAGGCATTCCACC
ATGGAAATCGGCCTCATGGGTGCCACCGGACAGAAGAACCGCAGATTACGATGAAGACGC
TGTGGCCCATGACGAGAGGCCAGCTTACAAAAGATTCTTCTCCAACTGATCCTCCGACTC
ATCGATGAttcaaatgttttttttccctacCTCCGTAATATTTATgatgtttatttttgt
acTCCTTAATAATTTATGAACTTTTCCTTGTATTTCGCAGCAGCTGAACTCTAGTGCTGG
AGTGGACGCTACACAGCCCTCGGGACTAGCAGGTGCAGGAGCGAACGAAGCCGCAGGAAC
TCCACTAACAGACTACGCACAAAGACCCGGTGGACCCTGACTCAAAGCTGACGCAATACT
GCCACTACTTTGCAAACCTTAACTTTAACTTCGGCCCGACGCACCTAAGACGGACATAAG
AATGAGAAATGCAGTGAAGAGATCCTCCATCATATGCCAGCCGTAGGCCTTCGATGTTCT
TTGTGCAATGCTAC

Working with these partial models can result in serious problems:

library(GenomicFeatures)
library(Rsamtools)
library(Biostrings)

# Build SQLite gene model database
gff <- GenomicFeatures::makeTxDbFromGFF('example.gff')

# Build indexed fasta file and open a connection to it.
dna <- Rsamtools::FaFile(Rsamtools::indexFa("Saccharomyces_uvarum.fna"))

# Extract the CDS from the database. This will create a list of GRanges objects
# where each one contains an ordered set of coding sequences.
trans <- GenomicFeatures::cdsBy(gff, by='tx', use.names=TRUE)

trans
#> GRangesList object of length 1:
#> $g1.t1
#> GRanges object with 1 range and 3 metadata columns:
#>             seqnames    ranges strand |    cds_id    cds_name exon_rank
#>                <Rle> <IRanges>  <Rle> | <integer> <character> <integer>
#>   [1] AACA01001085.1  [1, 289]      + |         1        <NA>         1
#> 
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths

# Extract the nucleotide sequences 
cds <- GenomicFeatures::extractTranscriptSeqs(x=dna, transcripts=trans)

# Translate the CDS
aa_seq <- Biostrings::translate(cds) 
#> Warning message:
#> In .Call2("DNAStringSet_translate", x, skip_code, dna_codes[codon_alphabet],  :
#> last base was ignored

aa_seq
#>   A AAStringSet instance of length 1
#>     width seq                                               names
#> [1]    96 VGFHD*FLHSFHHRRNQLLLWLC...RRLR*RRCGP*REASLQKILLQL g1.t1

The final translated sequence is frameshifted. The translate function detects that something is wrong, but it doesn't have enough information to make the right decision. It decides to ignore that last base, when it should have ignored the first 2 bases.

One solution is to store the phase in the SQLite database and then write a phase-aware function for extracting CDS (e.g. extractCDSSeqs).

Here is my session info:

R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux

Matrix products: default
BLAS: /usr/lib/libblas.so.3.7.1
LAPACK: /usr/lib/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] Rsamtools_1.28.0       Biostrings_2.44.2      XVector_0.16.0
 [4] GenomicFeatures_1.28.4 AnnotationDbi_1.38.2   Biobase_2.36.2
 [7] GenomicRanges_1.28.5   GenomeInfoDb_1.12.2    IRanges_2.10.3
[10] S4Vectors_0.14.5       BiocGenerics_0.22.0    colorout_1.1-2
[13] pryr_0.1.2             devtools_1.13.3        magrittr_1.5

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.13               compiler_3.4.2
 [3] bitops_1.0-6               tools_3.4.2
 [5] zlibbioc_1.22.0            biomaRt_2.32.1
 [7] digest_0.6.12              bit_1.1-12
 [9] lattice_0.20-35            RSQLite_2.0
[11] memoise_1.1.0              tibble_1.3.4
[13] pkgconfig_2.0.1            rlang_0.1.2
[15] Matrix_1.2-11              DelayedArray_0.2.7
[17] DBI_0.7                    GenomeInfoDbData_0.99.0
[19] rtracklayer_1.36.4         withr_2.0.0
[21] stringr_1.2.0              grid_3.4.2
[23] bit64_0.9-7                XML_3.98-1.9
[25] BiocParallel_1.10.1        blob_1.1.0
[27] matrixStats_0.52.2         codetools_0.2-15
[29] GenomicAlignments_1.12.2   SummarizedExperiment_1.6.4
[31] stringi_1.1.5              RCurl_1.95-4.8
GenomicFeatures phase gene models GFF • 4.0k views
ADD COMMENT
1
Entering edit mode
@herve-pages-1542
Last seen 16 hours ago
Seattle, WA, United States

Hi,

To me this only reveals that the CDS reported in example.gff is incorrect. Why would the CDS be reported to start at position 1 if the start codon for transcript g1.t1 is at position 2-4? In other words, shouldn't the phase of the 1st CDS in a transcript always be 0?

H.

ADD COMMENT
0
Entering edit mode

The phase will be 0 if the models are complete, but this isn't always the case. Note that the first CDS begins on the very first nucleotide of the scaffold. The GFF example was produced by the AUGUSTUS gene model predictor, which will detect models even when they are broken across scaffolds. I have several low quality yeast assemblies and each of them has hundreds partial models.

It is also convenient to include phase with each CDS entry because it allows the CDS to be translated independently of the rest of the model. Also, the phase is of biological interest. So making it accessible would be nice.

Partially degraded transcripts could also have non-zero phase.

ADD REPLY
1
Entering edit mode

To add some historic background here:

TxDb objects were originally designed to support correct gene models from UCSC and Ensembl (via makeTxDbFromUCSC and makeTxDbFromEnsembl). So at the time it didn't feel necessary to include the phase information. Note that the UCSC tracks don't contain this information because it can easily be computed from the other fields in the track, assuming that the information in these other fields is correct of course. (The phase actually gets computed/added to the GTF files you can get via the Table Browser.) Support for GFF files was only added a couple of years later via makeTxDbFromGFF.

The phase information could probably be added to TxDb objects. However, I don't think it's a good idea nor realistic that TxDb objects support incorrect gene models. In this case, one should work directly from the GFF file after loading it in a data.frame with rtracklayer::readGFF() or in a GRanges object with rtracklayer::import(). This will import the phase.

Note that, assuming the gene model in a TxDb object is correct, the phase can easily be computed with something like:

## 'cds_by_tx' must be a GRangesList object obtained with cdsBy(txdb, by="tx")
addCdsPhase <- function(cds_by_tx)
{
    cds_phase <- pc(rep(IntegerList(0), length(cds_by_tx)),
                    phead((3L - (cumsum(width(cds_by_tx)) %% 3L)) %% 3L, n=-1L))
    unlisted_cds_by_tx <- unlist(cds_by_tx, use.names=FALSE)
    mcols(unlisted_cds_by_tx)$cds_phase <- unlist(cds_phase, use.names=FALSE)
    relist(unlisted_cds_by_tx, cds_by_tx)
}

Then:

library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
cds_by_tx <- cdsBy(txdb, by="tx")
addCdsPhase(cds_by_tx)
# GRangesList object of length 26950:
# $1 
# GRanges object with 2 ranges and 4 metadata columns:
#       seqnames       ranges strand |    cds_id    cds_name exon_rank cds_phase
#          <Rle>    <IRanges>  <Rle> | <integer> <character> <integer> <numeric>
#   [1]    chr2L [7680, 8116]      + |         1        <NA>         1         0
#   [2]    chr2L [8193, 8610]      + |         3        <NA>         2         1
#
# $2 
# GRanges object with 3 ranges and 4 metadata columns:
#       seqnames       ranges strand | cds_id cds_name exon_rank cds_phase
#   [1]    chr2L [7680, 8116]      + |      1     <NA>         1         0
#   [2]    chr2L [8193, 8589]      + |      2     <NA>         2         1
#   [3]    chr2L [8668, 9276]      + |      5     <NA>         3         0
#
# $3 
# GRanges object with 2 ranges and 4 metadata columns:
#       seqnames       ranges strand | cds_id cds_name exon_rank cds_phase
#   [1]    chr2L [7680, 8116]      + |      1     <NA>         1         0
#   [2]    chr2L [8229, 8610]      + |      4     <NA>         2         1
# 
# ...
# <26947 more elements>
# -------
# seqinfo: 15 sequences (1 circular) from dm3 genome

I'll modify cdsBy() so it returns the above (only when grouping by transcript though i.e. when called with by="tx"). The good news is that there is no need to touch the db schema.

Hope this helps,

H.

ADD REPLY
0
Entering edit mode

Thanks, the modification to `cdsBy` certainly help. I coded around the partial models in my project by loading the GFF into a GRanges object and adding the phase to the start position (which makes in-frame models). Then I encoded the the original GFF phase in the CDS 'name' field before passing the GRanges object into `makeTxDbFromGRanges`. It was ugly, but preserved the phase.

I'm being a little pedantic, perhaps, but I would not call these partial models "incorrect". The part of the model that is described may be perfectly correct, but due to an incomplete genome assembly, the initial part of the sequence is missing.

If phase is not supported, perhaps importing partial models should raise an immediate error or warning? Otherwise, proteins translated based on the models will be frameshifted. This is exactly what was happening in my project.

ADD REPLY
0
Entering edit mode

OK.

So we could add a warning in makeTxDbFromGFF() when the phases reported in the file for the CDS in a transcript don't agree with the phases inferred from the CDS cumwidths. Not a totally satisfying solution because only the person who calls makeTxDbFromGFF() to create the TxDb object gets to see the warning, but not other people using that TxDb downstream. Would probably be better to have the TxDb object itself "remember" that the CDS information cannot be trusted for translation e.g. by having this reported in the metadata table or, as you suggested, by having the phase imported from the file. Then cdsBy( , by="tx") could issue a warning. More generally, tools that take a TxDb and DNA sequences as input in order to do translation (e.g. predictCoding()) would be able to warn the user, or fail graciously, or maybe still manage to produce correct/meaningful output if they have access to the phase that was reported in the GFF file.

This will be a significant effort though and is not likely to happen before our next release.

Thanks for your feedback,

H.

ADD REPLY
0
Entering edit mode

FYI I made a first pass on this: in GenomicFeatures 1.29.14, makeTxDbFromGFF() and makeTxDbFromGRanges() now import the CDS phase:

https://github.com/Bioconductor/GenomicFeatures/commit/22b2ea0a81bc064b8554400ce8f7e13e2c833d38

Was actually not that bad. cdsBy(txdb, by="tx") still needs to be modified to return the phase info. I'll try to get to this later.

Cheers,

H.

ADD REPLY
0
Entering edit mode

Just what I was looking for!  Except, upon review, I think the logic is off.  I've commented one line as #OLD and replaced it with proposed new line.  

addCdsPhase <- function(cds_by_tx)
   ## 'cds_by_tx' must be a GRangesList object obtained with cdsBy(txdb, by="tx")
{
  cds_phase <- pc(rep(IntegerList(0), length(cds_by_tx)),
                  # OLD: heads((3L - (cumsum(width(cds_by_tx)) %% 3L)) %% 3L, n=-1L))
                  heads(cumsum(width(cds_by_tx)) %% 3L, n=-1L))
  unlisted_cds_by_tx <- unlist(cds_by_tx, use.names=FALSE)
  mcols(unlisted_cds_by_tx)$cds_phase <- unlist(cds_phase, use.names=FALSE)
  relist(unlisted_cds_by_tx, cds_by_tx)
}

Thumbs up?

Any chance of this hitting the mainstream?

 

ADD REPLY
0
Entering edit mode

So you are saying that the phase of the 2nd CDS on Fly transcript FBtr0300689 is 2 instead of 1?

library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
cds_by_tx <- cdsBy(txdb, by="tx", use.names=TRUE)["FBtr0300689"]

With my original version of addCdsPhase() (after replacing phead() with heads()):

addCdsPhase(cds_by_tx)
# GRangesList object of length 1:
# $FBtr0300689 
# GRanges object with 2 ranges and 4 metadata columns:
#       seqnames    ranges strand | ... exon_rank cds_phase
#          <Rle> <IRanges>  <Rle> | ... <integer> <numeric>
#   [1]    chr2L 7680-8116      + | ...         1         0
#   [2]    chr2L 8193-8610      + | ...         2         1
#
# -------
# seqinfo: 15 sequences (1 circular) from dm3 genome

With your modified version:

addCdsPhase2(cds_by_tx)
# GRangesList object of length 1:
# $FBtr0300689 
# GRanges object with 2 ranges and 4 metadata columns:
#       seqnames    ranges strand | ... exon_rank cds_phase
#          <Rle> <IRanges>  <Rle> | ... <integer> <numeric>
#   [1]    chr2L 7680-8116      + | ...         1         0
#   [2]    chr2L 8193-8610      + | ...         2         2
#
# -------
# seqinfo: 15 sequences (1 circular) from dm3 genome

I disagree with that.

H.

ADD REPLY
0
Entering edit mode

Yes, I am, Herve. Indeed.

Let's back up a sec, and, for completeness let's make sure we are talking about the same thing.

Here's my best reference - https://useast.ensembl.org/Help/Glossary where we find:

Phase Exon The position of an exon/intron boundary within a codon. A phase of zero means the boundary falls between codons, one means between the first and second base and two means between the second and third base. Exons have a start and end phase, whereas introns have just one phase. A boundary in a non-coding region has a phase of -1.

I like that definition.

I also like Ensemble's implementation, which also assigns 2 as the value of the "start phase" of the second exon.

see: http://useast.ensembl.org/Drosophila_melanogaster/Transcript/Exons?db=core;g=FBgn0031208;r=2L:7529-9484;t=FBtr0300689

FWIW: their implementation appears to be setExonPhases in FlyBaseToolbox.pm;

ADD REPLY
0
Entering edit mode

There seems to be some confusion between frame and phase. For the CDS phase I prefer to stick to the definition provided at:

  https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

which, unfortunately, is in disagreement with Ensembl's definition.

Note that Ensembl's definition is for the exon phase, not the CDS phase. I don't know if that could explain the different definitions.

Anyway the CDS phase as defined in the official GFF3 specs is the phase implemented by my original addCdsPhase(). It is consistent with the phase of the 2nd CDS of Fly transcript FBtr0300689 reported in the Drosophila_melanogaster.BDGP6.93.chromosome.2L.gff3.gz file located at ftp://ftp.ensembl.org/pub/release-93/gff3/drosophila_melanogaster/

H.

ADD REPLY

Login before adding your answer.

Traffic: 680 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6