Search
Question: CDS phase information missing from GenomicFeatures
1
9 months ago by
arendsee10
Iowa State University
arendsee10 wrote:

The GenomicFeatures library allows gene models to be stored in an SQLite database.

Here is the full SQLite schema

CREATE TABLE chrominfo (
_chrom_id INTEGER PRIMARY KEY,
chrom TEXT UNIQUE NOT NULL,
length INTEGER NULL,
is_circular INTEGER NULL
);
CREATE TABLE transcript (
_tx_id INTEGER PRIMARY KEY,
tx_name TEXT NULL,
tx_type TEXT NULL,
tx_chrom TEXT NOT NULL,
tx_strand TEXT NOT NULL,
tx_start INTEGER NOT NULL,
tx_end INTEGER NOT NULL,
FOREIGN KEY (tx_chrom) REFERENCES chrominfo (chrom)
);
CREATE TABLE exon (
_exon_id INTEGER PRIMARY KEY,
exon_name TEXT NULL,
exon_chrom TEXT NOT NULL,
exon_strand TEXT NOT NULL,
exon_start INTEGER NOT NULL,
exon_end INTEGER NOT NULL,
FOREIGN KEY (exon_chrom) REFERENCES chrominfo (chrom)
);
CREATE TABLE cds (
_cds_id INTEGER PRIMARY KEY,
cds_name TEXT NULL,
cds_chrom TEXT NOT NULL,
cds_strand TEXT NOT NULL,
cds_start INTEGER NOT NULL,
cds_end INTEGER NOT NULL,
FOREIGN KEY (cds_chrom) REFERENCES chrominfo (chrom)
);
CREATE TABLE splicing (
_tx_id INTEGER NOT NULL,
exon_rank INTEGER NOT NULL,
_exon_id INTEGER NOT NULL,
_cds_id INTEGER NULL,
UNIQUE (_tx_id, exon_rank),
FOREIGN KEY (_tx_id) REFERENCES transcript,
FOREIGN KEY (_exon_id) REFERENCES exon,
FOREIGN KEY (_cds_id) REFERENCES cds
);
CREATE TABLE gene (
gene_id TEXT NOT NULL,
_tx_id INTEGER NOT NULL,
UNIQUE (gene_id, _tx_id),
FOREIGN KEY (_tx_id) REFERENCES transcript
);
CREATE TABLE metadata (
name TEXT,
value TEXT
);


However, one important field seems to be missing: CDS phase. Phase specifies how many nucleotides must be removed from the beginning of a CDS to reach a complete codon (0, 1 or 2). If phase is missing, the CDS cannot be translated without making dangerous assumptions about the data. Specifically, you have to assume that the first CDS has a phase of 0. For poorly assembled genomes, this may not by the case.

Even if the assumption always holds, a CDS without phase data can be hard to work with since you cannot translate an individual CDS without considering the other members of the model.

According to the GFF specification, phase information is required for CDS entries and is stored in the 8th column. Here is an example of an incomplete gene model (from Saccharomyces uvarum):

# example.gff
AACA01001085.1  AUGUSTUS  CDS   1  289  .     +  1  Parent=g1.t1
AACA01001085.1  AUGUSTUS  exon  1  289  1.00  +  .  Parent=g1.t1
AACA01001085.1  AUGUSTUS  gene  1  289  1     +  .  g1
AACA01001085.1  AUGUSTUS  mRNA  1  289  .     +  .  ID=g1.t1;Other=g1


Where AACA01001085.1 is the short scaffold:

>AACA01001085.1
GTGGGGTTTCATGATTAGTTTCTTCACTCCTTTCATCACCGGCGCAATCAACTTTTACTA
TGGCTATGTCTTCATTGGCTGTCTATGTGTTGCCTACTTTTacgtctttttctttgtgcC
AGAAACGAAAGGCCTGACATTAGAAGAGGTCGACACTATGTGGCTGGAAGGCATTCCACC
ATGGAAATCGGCCTCATGGGTGCCACCGGACAGAAGAACCGCAGATTACGATGAAGACGC
TGTGGCCCATGACGAGAGGCCAGCTTACAAAAGATTCTTCTCCAACTGATCCTCCGACTC
ATCGATGAttcaaatgttttttttccctacCTCCGTAATATTTATgatgtttatttttgt
acTCCTTAATAATTTATGAACTTTTCCTTGTATTTCGCAGCAGCTGAACTCTAGTGCTGG
AGTGGACGCTACACAGCCCTCGGGACTAGCAGGTGCAGGAGCGAACGAAGCCGCAGGAAC
TCCACTAACAGACTACGCACAAAGACCCGGTGGACCCTGACTCAAAGCTGACGCAATACT
GCCACTACTTTGCAAACCTTAACTTTAACTTCGGCCCGACGCACCTAAGACGGACATAAG
AATGAGAAATGCAGTGAAGAGATCCTCCATCATATGCCAGCCGTAGGCCTTCGATGTTCT
TTGTGCAATGCTAC


Working with these partial models can result in serious problems:

library(GenomicFeatures)
library(Rsamtools)
library(Biostrings)

# Build SQLite gene model database
gff <- GenomicFeatures::makeTxDbFromGFF('example.gff')

# Build indexed fasta file and open a connection to it.
dna <- Rsamtools::FaFile(Rsamtools::indexFa("Saccharomyces_uvarum.fna"))

# Extract the CDS from the database. This will create a list of GRanges objects
# where each one contains an ordered set of coding sequences.
trans <- GenomicFeatures::cdsBy(gff, by='tx', use.names=TRUE)

trans
#> GRangesList object of length 1:
#> $g1.t1 #> GRanges object with 1 range and 3 metadata columns: #> seqnames ranges strand | cds_id cds_name exon_rank #> <Rle> <IRanges> <Rle> | <integer> <character> <integer> #> [1] AACA01001085.1 [1, 289] + | 1 <NA> 1 #> #> ------- #> seqinfo: 1 sequence from an unspecified genome; no seqlengths # Extract the nucleotide sequences cds <- GenomicFeatures::extractTranscriptSeqs(x=dna, transcripts=trans) # Translate the CDS aa_seq <- Biostrings::translate(cds) #> Warning message: #> In .Call2("DNAStringSet_translate", x, skip_code, dna_codes[codon_alphabet], : #> last base was ignored aa_seq #> A AAStringSet instance of length 1 #> width seq names #> [1] 96 VGFHD*FLHSFHHRRNQLLLWLC...RRLR*RRCGP*REASLQKILLQL g1.t1  The final translated sequence is frameshifted. The translate function detects that something is wrong, but it doesn't have enough information to make the right decision. It decides to ignore that last base, when it should have ignored the first 2 bases. One solution is to store the phase in the SQLite database and then write a phase-aware function for extracting CDS (e.g. extractCDSSeqs). Here is my session info: R version 3.4.2 (2017-09-28) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Arch Linux Matrix products: default BLAS: /usr/lib/libblas.so.3.7.1 LAPACK: /usr/lib/liblapack.so.3.7.1 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets [8] methods base other attached packages: [1] Rsamtools_1.28.0 Biostrings_2.44.2 XVector_0.16.0 [4] GenomicFeatures_1.28.4 AnnotationDbi_1.38.2 Biobase_2.36.2 [7] GenomicRanges_1.28.5 GenomeInfoDb_1.12.2 IRanges_2.10.3 [10] S4Vectors_0.14.5 BiocGenerics_0.22.0 colorout_1.1-2 [13] pryr_0.1.2 devtools_1.13.3 magrittr_1.5 loaded via a namespace (and not attached): [1] Rcpp_0.12.13 compiler_3.4.2 [3] bitops_1.0-6 tools_3.4.2 [5] zlibbioc_1.22.0 biomaRt_2.32.1 [7] digest_0.6.12 bit_1.1-12 [9] lattice_0.20-35 RSQLite_2.0 [11] memoise_1.1.0 tibble_1.3.4 [13] pkgconfig_2.0.1 rlang_0.1.2 [15] Matrix_1.2-11 DelayedArray_0.2.7 [17] DBI_0.7 GenomeInfoDbData_0.99.0 [19] rtracklayer_1.36.4 withr_2.0.0 [21] stringr_1.2.0 grid_3.4.2 [23] bit64_0.9-7 XML_3.98-1.9 [25] BiocParallel_1.10.1 blob_1.1.0 [27] matrixStats_0.52.2 codetools_0.2-15 [29] GenomicAlignments_1.12.2 SummarizedExperiment_1.6.4 [31] stringi_1.1.5 RCurl_1.95-4.8  ADD COMMENTlink modified 9 months ago by Hervé Pagès ♦♦ 13k • written 9 months ago by arendsee10 1 9 months ago by Hervé Pagès ♦♦ 13k United States Hervé Pagès ♦♦ 13k wrote: Hi, To me this only reveals that the CDS reported in example.gff is incorrect. Why would the CDS be reported to start at position 1 if the start codon for transcript g1.t1 is at position 2-4? In other words, shouldn't the phase of the 1st CDS in a transcript always be 0? H. ADD COMMENTlink written 9 months ago by Hervé Pagès ♦♦ 13k The phase will be 0 if the models are complete, but this isn't always the case. Note that the first CDS begins on the very first nucleotide of the scaffold. The GFF example was produced by the AUGUSTUS gene model predictor, which will detect models even when they are broken across scaffolds. I have several low quality yeast assemblies and each of them has hundreds partial models. It is also convenient to include phase with each CDS entry because it allows the CDS to be translated independently of the rest of the model. Also, the phase is of biological interest. So making it accessible would be nice. Partially degraded transcripts could also have non-zero phase. ADD REPLYlink modified 9 months ago • written 9 months ago by arendsee10 1 To add some historic background here: TxDb objects were originally designed to support correct gene models from UCSC and Ensembl (via makeTxDbFromUCSC and makeTxDbFromEnsembl). So at the time it didn't feel necessary to include the phase information. Note that the UCSC tracks don't contain this information because it can easily be computed from the other fields in the track, assuming that the information in these other fields is correct of course. (The phase actually gets computed/added to the GTF files you can get via the Table Browser.) Support for GFF files was only added a couple of years later via makeTxDbFromGFF. The phase information could probably be added to TxDb objects. However, I don't think it's a good idea nor realistic that TxDb objects support incorrect gene models. In this case, one should work directly from the GFF file after loading it in a data.frame with rtracklayer::readGFF() or in a GRanges object with rtracklayer::import(). This will import the phase. Note that, assuming the gene model in a TxDb object is correct, the phase can easily be computed with something like: ## 'cds_by_tx' must be a GRangesList object obtained with cdsBy(txdb, by="tx") addCdsPhase <- function(cds_by_tx) { cds_phase <- pc(rep(IntegerList(0), length(cds_by_tx)), phead((3L - (cumsum(width(cds_by_tx)) %% 3L)) %% 3L, n=-1L)) unlisted_cds_by_tx <- unlist(cds_by_tx, use.names=FALSE) mcols(unlisted_cds_by_tx)$cds_phase <- unlist(cds_phase, use.names=FALSE)
relist(unlisted_cds_by_tx, cds_by_tx)
}

Then:

library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
cds_by_tx <- cdsBy(txdb, by="tx")
# GRangesList object of length 26950:
# $1 # GRanges object with 2 ranges and 4 metadata columns: # seqnames ranges strand | cds_id cds_name exon_rank cds_phase # <Rle> <IRanges> <Rle> | <integer> <character> <integer> <numeric> # [1] chr2L [7680, 8116] + | 1 <NA> 1 0 # [2] chr2L [8193, 8610] + | 3 <NA> 2 1 # #$2
# GRanges object with 3 ranges and 4 metadata columns:
#       seqnames       ranges strand | cds_id cds_name exon_rank cds_phase
#   [1]    chr2L [7680, 8116]      + |      1     <NA>         1         0
#   [2]    chr2L [8193, 8589]      + |      2     <NA>         2         1
#   [3]    chr2L [8668, 9276]      + |      5     <NA>         3         0
#
# \$3
# GRanges object with 2 ranges and 4 metadata columns:
#       seqnames       ranges strand | cds_id cds_name exon_rank cds_phase
#   [1]    chr2L [7680, 8116]      + |      1     <NA>         1         0
#   [2]    chr2L [8229, 8610]      + |      4     <NA>         2         1
#
# ...
# <26947 more elements>
# -------
# seqinfo: 15 sequences (1 circular) from dm3 genome

I'll modify cdsBy() so it returns the above (only when grouping by transcript though i.e. when called with by="tx"). The good news is that there is no need to touch the db schema.

Hope this helps,

H.

Thanks, the modification to cdsBy certainly help. I coded around the partial models in my project by loading the GFF into a GRanges object and adding the phase to the start position (which makes in-frame models). Then I encoded the the original GFF phase in the CDS 'name' field before passing the GRanges object into makeTxDbFromGRanges. It was ugly, but preserved the phase.

I'm being a little pedantic, perhaps, but I would not call these partial models "incorrect". The part of the model that is described may be perfectly correct, but due to an incomplete genome assembly, the initial part of the sequence is missing.

If phase is not supported, perhaps importing partial models should raise an immediate error or warning? Otherwise, proteins translated based on the models will be frameshifted. This is exactly what was happening in my project.

OK.

So we could add a warning in makeTxDbFromGFF() when the phases reported in the file for the CDS in a transcript don't agree with the phases inferred from the CDS cumwidths. Not a totally satisfying solution because only the person who calls makeTxDbFromGFF() to create the TxDb object gets to see the warning, but not other people using that TxDb downstream. Would probably be better to have the TxDb object itself "remember" that the CDS information cannot be trusted for translation e.g. by having this reported in the metadata table or, as you suggested, by having the phase imported from the file. Then cdsBy( , by="tx") could issue a warning. More generally, tools that take a TxDb and DNA sequences as input in order to do translation (e.g. predictCoding()) would be able to warn the user, or fail graciously, or maybe still manage to produce correct/meaningful output if they have access to the phase that was reported in the GFF file.

This will be a significant effort though and is not likely to happen before our next release.

H.

FYI I made a first pass on this: in GenomicFeatures 1.29.14, makeTxDbFromGFF() and makeTxDbFromGRanges() now import the CDS phase:

https://github.com/Bioconductor/GenomicFeatures/commit/22b2ea0a81bc064b8554400ce8f7e13e2c833d38

Was actually not that bad. cdsBy(txdb, by="tx") still needs to be modified to return the phase info. I'll try to get to this later.

Cheers,

H.