Here is one way to extract the longest CDS for each gene:
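First, a minimal setup sketch so the code below is self-contained. I'm assuming the Drosophila dm3 ensGene TxDb here (that's what the FlyBase ids in the output suggest), but any TxDb object will work the same way:
library(GenomicFeatures)
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)  # assumed; substitute your own TxDb
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene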
tx_lens <- transcriptLengths(txdb, with.cds_len=TRUE)
## tx_lens is a data.frame:
head(tx_lens)
#   tx_id     tx_name     gene_id nexon tx_len cds_len
# 1     1 FBtr0300689 FBgn0031208     2   1880     855
# 2     2 FBtr0300690 FBgn0031208     3   1802    1443
# 3     3 FBtr0330654 FBgn0031208     2   1844     819
# 4     4 FBtr0309810 FBgn0263584     2   2230       0
# 5     5 FBtr0306539 FBgn0067779     7   4051    3024
# 6     6 FBtr0306536 FBgn0067779     5   3731    3024
cds_len_per_gene <- splitAsList(tx_lens$cds_len, tx_lens$gene_id)
cds_len_per_gene
# IntegerList of length 15682
# [["FBgn0000003"]] 0
# [["FBgn0000008"]] 3990 3990 3990
# [["FBgn0000014"]] 993 1773 993 993
# [["FBgn0000015"]] 813 972 1482 813 813 813
# [["FBgn0000017"]] 4863 4917 5118 4824 4770 5172 4569 4515 5001
# [["FBgn0000018"]] 1620
# [["FBgn0000022"]] 606
# [["FBgn0000024"]] 1950 1950
# [["FBgn0000028"]] 1185 1101 1191 1107 1128 1095 ... 1080 1113 1131 1152 1137
# [["FBgn0000032"]] 1368 1317
# ...
# <15672 more elements>
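A quick aside: splitAsList() comes from the S4Vectors package (attached along with GenomicFeatures) and is the List analog of base split(). The individual list elements are plain integer vectors:
cds_len_per_gene[["FBgn0000008"]]
# [1] 3990 3990 3990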
tx_id_per_gene <- splitAsList(tx_lens$tx_id, tx_lens$gene_id)
## which.max() on an IntegerList returns, for each list element, the index
## of its max value; subsetting by as.list(which_max) then extracts that
## value from each list element:
which_max <- which.max(cds_len_per_gene)
length_of_longest_cds <- unlist(cds_len_per_gene[as.list(which_max)])
tx_id_of_longest_cds <- unlist(tx_id_per_gene[as.list(which_max)])
## 'length_of_longest_cds' is a named integer vector that maps each gene id to the length
## of the longest CDS for that gene:
head(length_of_longest_cds)
# FBgn0000003 FBgn0000008 FBgn0000014 FBgn0000015 FBgn0000017 FBgn0000018
#           0        3990        1773        1482        5172        1620
## 'tx_id_of_longest_cds' is a named integer vector that maps each gene id to the transcript id
## of the longest CDS for that gene:
head(tx_id_of_longest_cds)
# FBgn0000003 FBgn0000008 FBgn0000014 FBgn0000015 FBgn0000017 FBgn0000018
#       17345        7681       21864       21876       16227        4279
## 'length_of_longest_cds' and 'tx_id_of_longest_cds' are parallel vectors with identical names:
identical(names(length_of_longest_cds), names(tx_id_of_longest_cds)) # TRUE
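By the way, an optional sanity check: max() on an IntegerList computes the maximum of each list element, so it should reproduce the values picked via which.max() (a sketch, using cds_len_per_gene from above):
stopifnot(all(max(cds_len_per_gene) == length_of_longest_cds))  # per-gene max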
## Let's summarize these results in a 3-column data.frame with 1 row per gene:
longest_cds <- data.frame(gene_id=names(length_of_longest_cds),
                          length_of_longest_cds=length_of_longest_cds,
                          tx_id_of_longest_cds=tx_id_of_longest_cds)
## Let's remove rows for which the longest CDS has length 0 (non-coding genes):
longest_cds <- subset(longest_cds, length_of_longest_cds != 0)
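A couple of cheap checks can't hurt at this point (a sketch; no output shown since the exact count depends on your txdb):
nrow(longest_cds)                               # number of coding genes kept
stopifnot(!anyDuplicated(longest_cds$gene_id))  # exactly 1 row per gene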
Now let's extract all the CDS fragments grouped by transcript and keep only the longest ones:
cds_by_tx <- cdsBy(txdb, by="tx")
cds_by_tx <- cds_by_tx[as.character(longest_cds$tx_id_of_longest_cds)]
cds_by_tx is a GRangesList object _parallel_ to longest_cds, i.e. it has 1 list element per row in longest_cds. The names on cds_by_tx are transcript ids. You can replace them with gene ids with:
names(cds_by_tx) <- longest_cds$gene_id
Use cds_by_tx as you've been using your txdb.cds_by_transcript object to extract all CDS sequences from your dna object. The difference is that now you're going to extract only the longest CDS for each gene. Once you have the longest CDS sequences in a DNAStringSet object (e.g. longest_cds_seqs), make sure that the names on the object are identical to longest_cds$gene_id, and that the sequence lengths (which you can extract with width(longest_cds_seqs)) are identical to longest_cds$length_of_longest_cds.
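For example, with extractTranscriptSeqs() from GenomicFeatures (a sketch; I'm assuming dna is your genome sequence, e.g. a BSgenome or DNAStringSet object, as in your own code):
longest_cds_seqs <- extractTranscriptSeqs(dna, cds_by_tx)  # 'dna' = your genome
## The checks suggested above:
stopifnot(identical(names(longest_cds_seqs), longest_cds$gene_id))
stopifnot(all(width(longest_cds_seqs) == longest_cds$length_of_longest_cds))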
Finally, note that there can be more than one longest CDS per gene, but the method described above will only pick up one (the first one for each gene in the tx_lens data.frame).
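If those ties matter to you, here is a small base-R sketch that counts, for each gene, how many transcripts reach the maximum CDS length:
## split() parallels the splitAsList() call above but returns an ordinary list:
cds_len_list <- split(tx_lens$cds_len, tx_lens$gene_id)
n_longest <- vapply(cds_len_list, function(x) sum(x == max(x)), integer(1))
table(n_longest)  # genes with a count > 1 have more than one longest CDS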
Hope this helps,
H.
Thank you so, so much! You saved me from a bad meeting tomorrow. Now maybe I'm just going to have a half-bad meeting. Will definitely credit you for it! Now, I guess the only part of my question that remains is how GenomicFeatures accounts for the frame info in the GFF to get the CDS. You know, the info in the 8th GFF field, described at https://m.ensembl.org/info/website/upload/gff.html
I ask because I'll translate these CDS into amino acids at some point.
Here is an example of what I see in my GFF:
Let me know if this is rather a separate, unrelated question, and I'll post it separately. Thanks again!
Yes, your question about the frame is unrelated. Please create a new post. Thanks!