Search
Question: Merging each element of a list of Genomic range
0
gravatar for q.thomas
3 months ago by
q.thomas0
q.thomas0 wrote:

Hi,

I am trying to get the concatenated GRanges of ORFs for each gene.

I thought I could just get a list of cds from the txdb database and concatenate this list from the first to the last position of each cds. But I got stuck

What i do so far is:

cds <- cds(txdb, columns=c("TXNAME","EXONRANK"))

cds_grl <- multisplit(cds, cds$TXNAME)

wich returns a list of GRange object

Do you know how I could concatenate each GR element of the list to get the first and last possible CDS GRranges for each gene?

Maybe there is another method ?

Best, Quentin

ADD COMMENTlink modified 3 months ago by James W. MacDonald47k • written 3 months ago by q.thomas0
0
gravatar for Michael Lawrence
3 months ago by
United States
Michael Lawrence10k wrote:

It's not exactly clear what you want. If you want a GRanges of the total span of the CDS blocks for each gene, you could do:

unlist(range(cds_grl))
ADD COMMENTlink written 3 months ago by Michael Lawrence10k
0
gravatar for James W. MacDonald
3 months ago by
United States
James W. MacDonald47k wrote:

It's not exactly clear what you mean by 'to get the first and last possible CDS GRanges'. That could be interpreted as wanting at most two GRanges per CDS, or what I assume you actually want, which would be to get a single GRanges item for each CDS that extends to the furthest extent of any underlying range.

I don't actually know how you would do the former. But the latter is simple:

reduce(unlist(range(cds_grl)))

which will give you a GRanges with just one range per gene, that extends to the furthest extent of all CDS from that gene.

ADD COMMENTlink written 3 months ago by James W. MacDonald47k

Thinking about this further, I am not sure my nor Michael's answer is 100% correct. For example, there is SAMD11, which has many transcripts:

> samd11 <- select(TxDb.Hsapiens.UCSC.hg19.knownGene, "148398", "TXNAME","GENEID")
'select()' returned 1:many mapping between keys and columns
> samd11
   GENEID     TXNAME
1  148398 uc001abv.1
2  148398 uc001abw.1
3  148398 uc031pjl.1
4  148398 uc031pjm.1
5  148398 uc031pjn.1
6  148398 uc001abx.2
7  148398 uc031pjo.1
8  148398 uc031pjp.1
9  148398 uc031pjq.1
10 148398 uc031pjr.1
11 148398 uc031pjs.1
12 148398 uc031pjt.1
13 148398 uc031pju.1
14 148398 uc031pjv.1
15 148398 uc031pjw.1
16 148398 uc031pjx.1
17 148398 uc031pjy.1
18 148398 uc031pjz.1
19 148398 uc031pka.1
20 148398 uc031pkb.1
21 148398 uc031pkc.1
22 148398 uc031pkd.1
23 148398 uc031pke.1
24 148398 uc031pkf.1
25 148398 uc031pkg.1
26 148398 uc031pkh.1
27 148398 uc031pki.1
28 148398 uc031pkj.1
29 148398 uc031pkk.1
30 148398 uc031pkl.1
31 148398 uc031pkm.1
32 148398 uc031pkn.1
33 148398 uc031pko.1
34 148398 uc031pkp.1

And it's here:

> cds_grl[5]
GRangesList object of length 1:
$uc001abw.1
GRanges object with 13 ranges and 2 metadata columns:
       seqnames        ranges strand |                               TXNAME
          <Rle>     <IRanges>  <Rle> |                      <CharacterList>
   [1]     chr1 861322-861393      + | uc001abv.1,uc001abw.1,uc031pjl.1,...
   [2]     chr1 865535-865716      + | uc001abv.1,uc001abw.1,uc031pjl.1,...
   [3]     chr1 866419-866469      + | uc001abv.1,uc001abw.1,uc031pjl.1,...
   [4]     chr1 871152-871276      + | uc001abw.1,uc031pjl.1,uc031pjm.1,...
   [5]     chr1 874420-874509      + | uc001abw.1,uc031pjl.1,uc031pjm.1,...
   ...      ...           ...    ... .                                  ...
   [9]     chr1 877790-877868      + | uc001abw.1,uc031pjl.1,uc031pjm.1,...
  [10]     chr1 877939-878438      + | uc001abw.1,uc031pjl.1,uc031pjm.1,...
  [11]     chr1 878633-878757      + | uc001abw.1,uc031pjl.1,uc031pjm.1,...
  [12]     chr1 879078-879188      + | uc001abw.1,uc031pjm.1,uc001abx.2,...
  [13]     chr1 879288-879533      + | uc001abw.1,uc001abx.2,uc031pjp.1,...

And here:

> cds_grl2 <- unlist(range(cds_grl))
> cds_grl2[names(cds_grl2) %in% samd11[,2],]
GRanges object with 26 ranges and 0 metadata columns:
             seqnames        ranges strand
                <Rle>     <IRanges>  <Rle>
  uc001abv.1     chr1 861322-871275      +
  uc001abw.1     chr1 861322-879533      +
  uc001abx.2     chr1 861322-879533      +
  uc031pjl.1     chr1 861322-879533      +
  uc031pjm.1     chr1 861322-879533      +
         ...      ...           ...    ...
  uc031pki.1     chr1 861322-879706      +
  uc031pkj.1     chr1 861322-879706      +
  uc031pkm.1     chr1 861322-879533      +
  uc031pko.1     chr1 874690-879533      +
  uc031pkp.1     chr1 877822-879533      +
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

So you don't have a single GRanges item for all the CDS for this gene. We can use reduce to combine transcripts:

> cds_grl3 <- reduce(cds_grl2)
> cds_grl3[cds_grl3 %over% gns["148398",],]
GRanges object with 1 range and 0 metadata columns:
      seqnames        ranges strand
         <Rle>     <IRanges>  <Rle>
  [1]     chr1 861322-879706      +
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

Which is cool, but reduce will also take any overlapping, unrelated transcripts and then reduce them to a single range as well, which isn't cool. AND, this still won't give you a single GRanges item for a single gene if the CDS aren't overlapping, which is a thing.

 

ADD REPLYlink written 3 months ago by James W. MacDonald47k

I assumed by "gene" the OP meant "transcript". To do it by gene, you would need to use the gene IDs instead. Trans-splicing is another complication though, where there will be multiple ranges per transcript, on different sequences and/or strands. But it's easy to restrict to length one elements before doing the unlist().

ADD REPLYlink written 3 months ago by Michael Lawrence10k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 224 users visited in the last hour