Hi,
I am trying to get the concatenated GRanges of ORFs for each gene.
I thought I could just get a list of CDS from the txdb database and concatenate this list from the first to the last position of each CDS, but I got stuck.
What I have done so far is:
cds <- cds(txdb, columns=c("TXNAME","EXONRANK"))
cds_grl <- multisplit(cds, cds$TXNAME)
which returns a list of GRanges objects.
Do you know how I could concatenate each GRanges element of the list to get a single range from the first to the last CDS position for each gene?
Maybe there is another method?
Best, Quentin
Thinking about this further, I am not sure that either my answer or Michael's is 100% correct. For example, there is SAMD11, which has many transcripts:
And it's here:
And here:
So you don't have a single GRanges item for all the CDS for this gene. We can use reduce to combine transcripts:
Which is cool, but reduce will also take any overlapping, unrelated transcripts and then reduce them to a single range as well, which isn't cool. AND, this still won't give you a single GRanges item for a single gene if the CDS aren't overlapping, which is a thing.
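To make the difference concrete, here is a minimal sketch on made-up coordinates (the ranges below are hypothetical and are not the real SAMD11 CDS). It assumes only that the GenomicRanges package is installed, and contrasts reduce(), which merges overlapping ranges, with range(), which spans from the first start to the last end:

```r
library(GenomicRanges)

# Hypothetical toy CDS ranges for one gene (illustrative coordinates only)
cds_toy <- GRanges(
  seqnames = "chr1",
  ranges = IRanges(start = c(100, 150, 400), end = c(200, 250, 500)),
  strand = "+"
)

# reduce() merges the two overlapping ranges but keeps the gap,
# so two ranges remain
merged <- reduce(cds_toy)

# range() returns one range per sequence/strand combination,
# from the first start to the last end
span <- range(cds_toy)
```

Here merged still holds two ranges because the third CDS does not overlap the others, while span holds a single range covering everything on chr1, plus strand.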
I assumed by "gene" the OP meant "transcript". To do it by gene, you would need to use the gene IDs instead. Trans-splicing is another complication though, where there will be multiple ranges per transcript, on different sequences and/or strands. But it's easy to restrict to length one elements before doing the unlist().
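A rough sketch of that restriction, again on hypothetical toy data rather than a real TxDb. The gene names and coordinates are invented for illustration; with real annotation, the grouped object would presumably come from something like cdsBy(txdb, by = "gene") instead of split():

```r
library(GenomicRanges)

# Hypothetical CDS ranges; geneB sits on two sequences/strands,
# mimicking the trans-splicing case
gr <- GRanges(
  seqnames = c("chr1", "chr1", "chr1", "chr2"),
  ranges = IRanges(start = c(100, 400, 1000, 50),
                   end = c(200, 500, 1100, 80)),
  strand = c("+", "+", "+", "-"),
  gene_id = c("geneA", "geneA", "geneB", "geneB")
)

# Group by gene; with a real TxDb, cdsBy(txdb, by = "gene") gives a similar list
cds_by_gene <- split(gr, gr$gene_id)

# One range per sequence/strand within each gene
rng <- range(cds_by_gene)

# Keep only genes that collapse to exactly one range, then flatten
keep <- lengths(rng) == 1L
gene_span <- unlist(rng[keep])
```

In this toy case geneB is dropped because its CDS cannot be summarised as one range, and gene_span is a plain GRanges named by gene.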