How does GenomicFeatures cdsBy() account for the frame info in the GFF to get the CDS?
@mfarias-virgens
United States

How does GenomicFeatures cdsBy() account for the frame info in the GFF when extracting the CDS? I mean the info in the 8th GFF field, which is described here:

https://m.ensembl.org/info/website/upload/gff.html

frame - One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on..

Here is an example of what I see in my GFF:

NC_042565.1     Gnomon  CDS     41062   41423   .       -       0       ID=cds-XP_021394452.1;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XP_021394452.1;Name=XP_021394452.1;gbkey=CDS;gene=LCMT2;product=tRNA wybutosine-synthesizing protein 4;protein_id=XP_021394452.1
NC_042565.1     Gnomon  CDS     39337   39418   .       -       1       ID=cds-XP_021394452.1;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XP_021394452.1;Name=XP_021394452.1;gbkey=CDS;gene=LCMT2;product=tRNA wybutosine-synthesizing protein 4;protein_id=XP_021394452.1
NC_042565.1     Gnomon  CDS     38834   39014   .       -       0       ID=cds-XP_021394452.1;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XP_021394452.1;Name=XP_021394452.1;gbkey=CDS;gene=LCMT2;product=tRNA wybutosine-synthesizing protein 4;protein_id=XP_021394452.1
NC_042565.1     Gnomon  CDS     36546   36702   .       -       2       ID=cds-XP_021394452.1;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XP_021394452.1;Name=XP_021394452.1;gbkey=CDS;gene=LCMT2;product=tRNA wybutosine-synthesizing protein 4;protein_id=XP_021394452.1
NC_042565.1     Gnomon  CDS     35950   36139   .       -       1       ID=cds-XP_021394452.1;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XP_021394452.1;Name=XP_021394452.1;gbkey=CDS;gene=LCMT2;product=tRNA wybutosine-synthesizing protein 4;protein_id=XP_021394452.1
NC_042565.1     Gnomon  CDS     35437   35544   .       -       0       ID=cds-XP_021394452.1;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XP_021394452.1;Name=XP_021394452.1;gbkey=CDS;gene=LCMT2;product=tRNA wybutosine-synthesizing protein 4;protein_id=XP_021394452.1
NC_042565.1     Gnomon  CDS     33345   33435   .       -       0       ID=cds-XP_021394452.1;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XP_021394452.1;Name=XP_021394452.1;gbkey=CDS;gene=LCMT2;product=tRNA wybutosine-synthesizing protein 4;protein_id=XP_021394452.1
NC_042565.1     Gnomon  CDS     30949   31197   .       -       2       ID=cds-XP_021394452.1;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XP_021394452.1;Name=XP_021394452.1;gbkey=CDS;gene=LCMT2;product=tRNA wybutosine-synthesizing protein 4;protein_id=XP_021394452.1
NC_042565.1     Gnomon  CDS     28678   28908   .       -       2       ID=cds-XP_021394452.1;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XP_021394452.1;Name=XP_021394452.1;gbkey=CDS;gene=LCMT2;product=tRNA wybutosine-synthesizing protein 4;protein_id=XP_021394452.1
NC_042565.1     Gnomon  CDS     27570   27667   .       -       2       ID=cds-XP_021394452.1;Parent=rna-XM_

I'm using cdsBy() to get the CDS (see below). The CDS sequences will then serve as input for calculating dN/dS with orthologr dNdS() https://drostlab.github.io/orthologr/index.html

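library(GenomicFeatures)  # makeTxDbFromGFF() (in newer Bioconductor releases it is provided by txdbmaker), cdsBy()
library(Biostrings)       # readDNAStringSet()

# load the GFF3 annotation
txdb <- makeTxDbFromGFF("BFgenomic.gff", format="gff3")
# Import genomic features from the file as a GRanges object ... OK
# Prepare the 'metadata' data frame ... OK
# Make the TxDb object ... OK

# read the genomic DNA sequences
dna <- readDNAStringSet("/users/mfariasv/data/mfariasv/newBF20/BFgenomic.fa")

# extract the CDS ranges, grouped by transcript
txdb.cds_by_transcript <- cdsBy(txdb, by="tx", use.names = TRUE)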
@herve-pages-1542
Seattle, WA, United States

Hi Madza,

TLDR: The _phase_ (not _frame_, see IMPORTANT NOTE below) of a CDS doesn't affect its genomic location, so it does not need to be accounted for by GenomicFeatures::cdsBy().

Long answer:

The various extractor functions in GenomicFeatures (transcripts(), exons(), cds(), transcriptsBy(), exonsBy(), cdsBy(), etc...) return the _genomic ranges_ of the features. The _genomic range_ of a feature is described by its 1-based start and end positions w.r.t. the chromosome/sequence where it belongs. When the features are coming from a GTF/GFF3 file, the genomic ranges of all features are extracted from columns 1, 4, 5 (_seqid_, _start_, _end_) of the file. Column 8 (_phase_) is not needed.
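To illustrate why the ranges alone are enough, here is a minimal base-R sketch on a made-up plus-strand sequence. The toy chromosome, the coordinates, and the variable names are all assumptions for illustration and are not taken from the file in the question. Splicing the CDS only needs the start and end of each segment; the phase column is never consulted.

```r
# Hypothetical toy chromosome: two CDS segments (upper case, inside the
# flanking bases) separated by a lower-case "intron".
chrom <- "TTATGGgtagCCTAAGG"

# 1-based start/end of each CDS segment, as they would appear in
# columns 4 and 5 of a GFF3 file (made-up values).
starts <- c(3, 11)
ends   <- c(6, 15)

# Splice the segments together in 5'-to-3' order. No phase is needed.
cds_seq <- paste(substring(chrom, starts, ends), collapse = "")
cds_seq

# Split the spliced CDS into codons, reading from its first base.
codons <- substring(cds_seq,
                    seq(1, nchar(cds_seq), by = 3),
                    seq(3, nchar(cds_seq), by = 3))
codons
```

In a Bioconductor workflow, extractTranscriptSeqs() from GenomicFeatures does this kind of splicing on the ranges returned by cdsBy(), and it also handles minus-strand features, which this sketch does not.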

IMPORTANT NOTE: According to the Official GFF3 Specs, the 8th column in a GFF3 file is the _phase_, not the _frame_. The Ensembl folks got that wrong in the document that you're referring to above. FURTHERMORE: Even if in other documents they call it the "phase" (like in this document), they give an incorrect definition! This is very unfortunate and has already created confusion in the past. See this long thread from 3-4 years ago.
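For reference, the GFF3 specs define the phase of a CDS feature as the number of bases to remove from the beginning of the feature (its 5' end with respect to the strand) to reach the first base of the next codon. Under that definition the phase can be recomputed from the segment lengths, which is another way to see that it carries no extra positional information. The base-R sketch below checks this on the ten CDS rows shown in the question. It assumes the rows are listed in transcription order (decreasing start, since the feature is on the minus strand) and that the first segment starts on a codon boundary; the variable names are only illustrative.

```r
# start, end and phase (columns 4, 5 and 8) of the CDS rows from the question
cds <- data.frame(
  start = c(41062, 39337, 38834, 36546, 35950, 35437, 33345, 30949, 28678, 27570),
  end   = c(41423, 39418, 39014, 36702, 36139, 35544, 33435, 31197, 28908, 27667),
  phase = c(0, 1, 0, 2, 1, 0, 0, 2, 2, 2)
)

# Width of each segment, from the genomic coordinates only.
width <- cds$end - cds$start + 1

# Number of coding bases that precede each segment in the spliced CDS.
bases_before <- cumsum(c(0, head(width, -1)))

# Bases to skip at the start of each segment to reach the next codon.
derived_phase <- (3 - bases_before %% 3) %% 3

# Compare with the phase column of the file.
all(derived_phase == cds$phase)
```

The comparison returns TRUE for these rows, so the phases in the file agree with what the coordinates already imply.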

Hope this helps,

H.

Makes sense! Thank you -mdz
