How to get longest isoform
1
0
Entering edit mode
@mfarias-virgens
Last seen 2.2 years ago
United States

I need to extract sequences for the longest isoforms

I am able to extract CDS using

# load GTF
txdb <- makeTxDbFromGFF("BFgenomic.gff", format="gff3")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK

# Get dna seq
dna <- readDNAStringSet("/users/mfariasv/data/mfariasv/newBF20/BFgenomic.fa")

# extract CDS
txdb.cds_by_transcript <- cdsBy(txdb, by="tx", use.names = TRUE) 

# Extract CDS sequences
cds_by_trnascript <- getSeq(dna, txdb.cds_by_transcript, use.names = TRUE) 

# use lapply function and unlist function and diretcly convert into a DNAStringSet
LargeDNAStringSet <- DNAStringSet(lapply(cds_by_trnascript, function(x) {unlist(x)})) 

# Write to fasta
writeXStringSet(LargeDNAStringSet, "my.fasta")

But I don't know how to go about retrieving the longest isoforms

I initially got a warning when running

txdb <- makeTxDbFromGFF("BFgenomic.gff", format="gff3")
...
1: In .find_exon_cds(exons, cds) :
  The following transcripts have exons that contain more than one CDS (only the first CDS
  was kept for each exon): rna-NM_001142785.2, rna-NM_001172764.1

but I manually deleted those entries from my gff as they where uninteresting.

I also can't find the info for if/how GenomicFeatures cdsBy() uses the frame info in the gff to get the CDS.

Thank you

GenomicFeatures Biostrings • 2.4k views
ADD COMMENT
1
Entering edit mode
@herve-pages-1542
Last seen 1 day ago
Seattle, WA, United States

Here is one way to extract the longest CDS for each gene:

tx_lens <- transcriptLengths(txdb, with.cds_len=TRUE)

## tx_lens is a data.frame:
head(tx_lens)
#   tx_id     tx_name     gene_id nexon tx_len cds_len
# 1     1 FBtr0300689 FBgn0031208     2   1880     855
# 2     2 FBtr0300690 FBgn0031208     3   1802    1443
# 3     3 FBtr0330654 FBgn0031208     2   1844     819
# 4     4 FBtr0309810 FBgn0263584     2   2230       0
# 5     5 FBtr0306539 FBgn0067779     7   4051    3024
# 6     6 FBtr0306536 FBgn0067779     5   3731    3024

cds_len_per_gene <- splitAsList(tx_lens$cds_len, tx_lens$gene_id)
cds_len_per_gene
# IntegerList of length 15682
# [["FBgn0000003"]] 0
# [["FBgn0000008"]] 3990 3990 3990
# [["FBgn0000014"]] 993 1773 993 993
# [["FBgn0000015"]] 813 972 1482 813 813 813
# [["FBgn0000017"]] 4863 4917 5118 4824 4770 5172 4569 4515 5001
# [["FBgn0000018"]] 1620
# [["FBgn0000022"]] 606
# [["FBgn0000024"]] 1950 1950
# [["FBgn0000028"]] 1185 1101 1191 1107 1128 1095 ... 1080 1113 1131 1152 1137
# [["FBgn0000032"]] 1368 1317
# ...
# <15672 more elements>

tx_id_per_gene <- splitAsList(tx_lens$tx_id, tx_lens$gene_id)

which_max <- which.max(cds_len_per_gene)
length_of_longest_cds <- unlist(cds_len_per_gene[as.list(which_max)])
tx_id_of_longest_cds <- unlist(tx_id_per_gene[as.list(which_max)])

## 'length_of_longest_cds' is a named integer vector that maps each gene id to the length
## of the longest CDS for that gene:
head(length_of_longest_cds)
# FBgn0000003 FBgn0000008 FBgn0000014 FBgn0000015 FBgn0000017 FBgn0000018 
#           0        3990        1773        1482        5172        1620 

## 'tx_id_of_longest_cds' is a named integer vector that maps each gene id to the transcript id
## of the longest CDS for that gene:
head(tx_id_of_longest_cds)
# FBgn0000003 FBgn0000008 FBgn0000014 FBgn0000015 FBgn0000017 FBgn0000018 
#       17345        7681       21864       21876       16227        4279 

## 'length_of_longest_cds' and 'tx_id_of_longest_cds' are parallel vectors with identical names:
identical(names(length_of_longest_cds), names(tx_id_of_longest_cds))  # TRUE

## Let's summarize these results in a 3-column data.frame with 1 row per gene:
longest_cds <- data.frame(gene_id=names(length_of_longest_cds),
                          length_of_longest_cds=length_of_longest_cds,
                          tx_id_of_longest_cds=tx_id_of_longest_cds)

## Let's remove rows for which the longest CDS has length 0 (non-coding genes):
longest_cds <- subset(longest_cds, length_of_longest_cds != 0)

Now let's extract all the CDS fragments grouped by transcript and keep only the longest ones:

cds_by_tx <- cdsBy(txdb, by="tx")
cds_by_tx <- cds_by_tx[as.character(longest_cds$tx_id_of_longest_cds)]

cds_by_tx is a GRangesList object _parallel_ to longest_cds i.e. it has 1 list element per row in longest_cds.

The names on cds_by_tx are transcript ids. You can replace them with gene ids with:

names(cds_by_tx) <- longest_cds$gene_id

Use cds_by_tx as you've been using your txdb.cds_by_transcript object to extract all CDS sequences from your dna object. The difference being that now you're going to extract only the longest CDS for each gene. Once you have the longest CDS sequences in a DNAStringSet object (e.g. longest_cds_seqs), make sure that the names on the object are identical to longest_cds$gene_id and that the sequence lengths (which you can extract with width(longest_cds_seqs)) are identical to longest_cds$length_of_longest_cds.

Finally note that there can be more than one longest CDS per gene but the method described above will only pick-up one (the first one for each gene in the tx_lens data.frame).

Hope this helps,

H.

ADD COMMENT
0
Entering edit mode

Thank you so so munch! You saved me from a bad meeting tomorrow. Now maybe I'm jut gonna have a half-bad meeting. Will def credit you for it! Now, I guess the only part of my question that remains is how GenomicFeatures accounts for the frame info in the gff to get the CDS? You know, the info in the the 8th gff field, which I was reading https://m.ensembl.org/info/website/upload/gff.html

frame - One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on..

I ask bc I'll translate these cds into aa at some point.

Here is an ex of what I see in my gff

NC_042565.1     Gnomon  CDS     41062   41423   .       -       0       ID=cds-XP_021394452.1;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XP_021394452.1;Name=XP_021394452.1;gbkey=CDS;gene=LCMT2;product=tRNA wybutosine-synthesizing protein 4;protein_id=XP_021394452.1
NC_042565.1     Gnomon  CDS     39337   39418   .       -       1       ID=cds-XP_021394452.1;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XP_021394452.1;Name=XP_021394452.1;gbkey=CDS;gene=LCMT2;product=tRNA wybutosine-synthesizing protein 4;protein_id=XP_021394452.1
NC_042565.1     Gnomon  CDS     38834   39014   .       -       0       ID=cds-XP_021394452.1;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XP_021394452.1;Name=XP_021394452.1;gbkey=CDS;gene=LCMT2;product=tRNA wybutosine-synthesizing protein 4;protein_id=XP_021394452.1
NC_042565.1     Gnomon  CDS     36546   36702   .       -       2       ID=cds-XP_021394452.1;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XP_021394452.1;Name=XP_021394452.1;gbkey=CDS;gene=LCMT2;product=tRNA wybutosine-synthesizing protein 4;protein_id=XP_021394452.1
NC_042565.1     Gnomon  CDS     35950   36139   .       -       1       ID=cds-XP_021394452.1;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XP_021394452.1;Name=XP_021394452.1;gbkey=CDS;gene=LCMT2;product=tRNA wybutosine-synthesizing protein 4;protein_id=XP_021394452.1
NC_042565.1     Gnomon  CDS     35437   35544   .       -       0       ID=cds-XP_021394452.1;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XP_021394452.1;Name=XP_021394452.1;gbkey=CDS;gene=LCMT2;product=tRNA wybutosine-synthesizing protein 4;protein_id=XP_021394452.1
NC_042565.1     Gnomon  CDS     33345   33435   .       -       0       ID=cds-XP_021394452.1;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XP_021394452.1;Name=XP_021394452.1;gbkey=CDS;gene=LCMT2;product=tRNA wybutosine-synthesizing protein 4;protein_id=XP_021394452.1
NC_042565.1     Gnomon  CDS     30949   31197   .       -       2       ID=cds-XP_021394452.1;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XP_021394452.1;Name=XP_021394452.1;gbkey=CDS;gene=LCMT2;product=tRNA wybutosine-synthesizing protein 4;protein_id=XP_021394452.1
NC_042565.1     Gnomon  CDS     28678   28908   .       -       2       ID=cds-XP_021394452.1;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XP_021394452.1;Name=XP_021394452.1;gbkey=CDS;gene=LCMT2;product=tRNA wybutosine-synthesizing protein 4;protein_id=XP_021394452.1
NC_042565.1     Gnomon  CDS     27570   27667   .       -       2       ID=cds-XP_021394452.1;Parent=rna-XM_021538777.2;Dbxref=GeneID:110474964,Genbank:XP_021394452.1;Name=XP_021394452.1;gbkey=CDS;gene=LCMT2;product=tRNA wybutosine-synthesizing protein 4;protein_id=XP_021394452.1

let me know if this is rather a different non-related question, and I'll post separately. Thanks again!

ADD REPLY
1
Entering edit mode

Yes, your question about the frame is unrelated. Please create a new post. Thanks!

ADD REPLY

Login before adding your answer.

Traffic: 631 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6