I need to extract sequences for the longest isoforms.
I am able to extract the CDS using:
# Load the GFF3 annotation
txdb <- makeTxDbFromGFF("BFgenomic.gff", format="gff3")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
# Get dna seq
dna <- readDNAStringSet("/users/mfariasv/data/mfariasv/newBF20/BFgenomic.fa")
# extract CDS
txdb.cds_by_transcript <- cdsBy(txdb, by="tx", use.names = TRUE)
# Extract CDS sequences
cds_by_transcript <- getSeq(dna, txdb.cds_by_transcript, use.names = TRUE)
# Use lapply and unlist to join the CDS pieces of each transcript, then convert directly into a DNAStringSet
LargeDNAStringSet <- DNAStringSet(lapply(cds_by_transcript, function(x) {unlist(x)}))
# Write to fasta
writeXStringSet(LargeDNAStringSet, "my.fasta")
But I don't know how to retrieve the longest isoforms.
I initially got a warning when running:
txdb <- makeTxDbFromGFF("BFgenomic.gff", format="gff3")
...
1: In .find_exon_cds(exons, cds) :
The following transcripts have exons that contain more than one CDS (only the first CDS
was kept for each exon): rna-NM_001142785.2, rna-NM_001172764.1
but I manually deleted those entries from my GFF as they were uninteresting.
I also can't find any information on whether, or how, GenomicFeatures cdsBy() uses the frame info in the GFF to get the CDS.
Thank you
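One way to pick the longest isoform per gene is to compare the total CDS length of each transcript within its gene. Below is a minimal base R sketch of that selection step, not tested against your file. The helper name longest_isoform_per_gene and the toy transcript/gene IDs are made up for illustration; the function only assumes you have a named vector of CDS lengths and a transcript-to-gene mapping.

```r
# Pick the transcript with the longest CDS for each gene (base R only).
# cds_len: named integer vector, names = transcript IDs, values = total CDS length
# tx2gene: named character vector, names = transcript IDs, values = gene IDs
# Both inputs are assumptions about how you store the data.
longest_isoform_per_gene <- function(cds_len, tx2gene) {
  stopifnot(!is.null(names(cds_len)),
            all(names(cds_len) %in% names(tx2gene)))
  gene <- tx2gene[names(cds_len)]
  # Integer key per gene, in order of first appearance
  gene_idx <- match(gene, unique(gene))
  # Sort by gene, then by decreasing CDS length; ties keep input order
  ord <- order(gene_idx, -cds_len)
  sorted_tx <- names(cds_len)[ord]
  # The first transcript seen for each gene is its longest isoform
  sorted_tx[!duplicated(gene_idx[ord])]
}

# Toy example with hypothetical IDs
cds_len <- c(tx1 = 300L, tx2 = 450L, tx3 = 120L, tx4 = 120L)
tx2gene <- c(tx1 = "geneA", tx2 = "geneA", tx3 = "geneB", tx4 = "geneB")
longest <- longest_isoform_per_gene(cds_len, tx2gene)
```

With the objects from your code, the lengths should be obtainable as sum(width(txdb.cds_by_transcript)), and transcriptLengths(txdb, with.cds_len = TRUE) from GenomicFeatures returns tx_name, gene_id and cds_len columns you could build both inputs from. You could then subset with LargeDNAStringSet[longest] before writing the FASTA. This assumes transcript names are unique and every transcript has a gene ID, so check for duplicates and NAs first.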
Thank you so so much! You saved me from a bad meeting tomorrow. Now maybe I'm just gonna have a half-bad meeting. Will def credit you for it! Now, I guess the only part of my question that remains is how GenomicFeatures accounts for the frame info in the GFF to get the CDS? You know, the info in the 8th GFF field, which I was reading about at https://m.ensembl.org/info/website/upload/gff.html
I ask because I'll translate these CDS into amino acids at some point.
Here is an example of what I see in my GFF
Let me know if this is rather a different, unrelated question, and I'll post separately. Thanks again!
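For context on what that field means: in GFF3 the phase of a CDS feature (0, 1 or 2) is the number of bases to skip from the 5' end of that feature to reach the first base of the next complete codon. Here is a minimal base R sketch of what that implies before translating. trim_phase is a hypothetical helper and the sequence is made up; this illustrates the definition only and says nothing about what cdsBy() does internally.

```r
# Drop the leading bases indicated by a GFF3 phase (0, 1 or 2) so that
# the sequence starts on the first base of a complete codon.
# cds_seq is assumed to be a plain character string already in
# transcript orientation (5' to 3').
trim_phase <- function(cds_seq, phase) {
  stopifnot(phase %in% 0:2)
  substr(cds_seq, phase + 1L, nchar(cds_seq))
}

# Toy example: phase 1 means one base is skipped before the first codon
trimmed <- trim_phase("GATGGCCTAA", 1L)
```

For a complete CDS the first segment normally has phase 0, so nothing is trimmed; a non-zero phase on the first segment usually indicates a partial CDS.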
Yes, your question about the frame is unrelated. Please create a new post. Thanks!