Question

Merging each element of a list of Genomic range

0

Entering edit mode

q.thomas • 0

@qthomas-16121

Last seen 7.1 years ago

Hi,

I am trying to get the concatenated GRanges of ORFs for each gene.

I thought I could just get a list of cds from the txdb database and concatenate this list from the first to the last position of each cds. But I got stuck

What i do so far is:

cds <- cds(txdb, columns=c("TXNAME","EXONRANK"))

cds_grl <- multisplit(cds, cds$TXNAME)

wich returns a list of GRange object

Do you know how I could concatenate each GR element of the list to get the first and last possible CDS GRranges for each gene?

Maybe there is another method ?

Best, Quentin

txdb granges cds orf R • 2.3k views

ADD COMMENT • link updated 7.1 years ago by James W. MacDonald 68k • written 7.1 years ago by q.thomas • 0

score 0 · Answer 1 · 2018-06-13

0

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 3.6 years ago

United States

It's not exactly clear what you want. If you want a GRanges of the total span of the CDS blocks for each gene, you could do:

unlist(range(cds_grl))

ADD COMMENT • link 7.1 years ago Michael Lawrence ★ 11k

score 0 · Answer 2 · 2018-06-13

0

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 6 days ago

United States

It's not exactly clear what you mean by 'to get the first and last possible CDS GRanges'. That could be interpreted as wanting at most two GRanges per CDS, or what I assume you actually want, which would be to get a single GRanges item for each CDS that extends to the furthest extent of any underlying range.

I don't actually know how you would do the former. But the latter is simple:

reduce(unlist(range(cds_grl)))

which will give you a GRanges with just one range per gene, that extends to the furthest extent of all CDS from that gene.

ADD COMMENT • link 7.1 years ago James W. MacDonald 68k

0

Entering edit mode

Thinking about this further, I am not sure my nor Michael's answer is 100% correct. For example, there is SAMD11, which has many transcripts:

> samd11 <- select(TxDb.Hsapiens.UCSC.hg19.knownGene, "148398", "TXNAME","GENEID")
'select()' returned 1:many mapping between keys and columns
> samd11
   GENEID     TXNAME
1  148398 uc001abv.1
2  148398 uc001abw.1
3  148398 uc031pjl.1
4  148398 uc031pjm.1
5  148398 uc031pjn.1
6  148398 uc001abx.2
7  148398 uc031pjo.1
8  148398 uc031pjp.1
9  148398 uc031pjq.1
10 148398 uc031pjr.1
11 148398 uc031pjs.1
12 148398 uc031pjt.1
13 148398 uc031pju.1
14 148398 uc031pjv.1
15 148398 uc031pjw.1
16 148398 uc031pjx.1
17 148398 uc031pjy.1
18 148398 uc031pjz.1
19 148398 uc031pka.1
20 148398 uc031pkb.1
21 148398 uc031pkc.1
22 148398 uc031pkd.1
23 148398 uc031pke.1
24 148398 uc031pkf.1
25 148398 uc031pkg.1
26 148398 uc031pkh.1
27 148398 uc031pki.1
28 148398 uc031pkj.1
29 148398 uc031pkk.1
30 148398 uc031pkl.1
31 148398 uc031pkm.1
32 148398 uc031pkn.1
33 148398 uc031pko.1
34 148398 uc031pkp.1

And it's here:

> cds_grl[5]
GRangesList object of length 1:
$uc001abw.1
GRanges object with 13 ranges and 2 metadata columns:
       seqnames        ranges strand |                               TXNAME
          <Rle>     <IRanges>  <Rle> |                      <CharacterList>
   [1]     chr1 861322-861393      + | uc001abv.1,uc001abw.1,uc031pjl.1,...
   [2]     chr1 865535-865716      + | uc001abv.1,uc001abw.1,uc031pjl.1,...
   [3]     chr1 866419-866469      + | uc001abv.1,uc001abw.1,uc031pjl.1,...
   [4]     chr1 871152-871276      + | uc001abw.1,uc031pjl.1,uc031pjm.1,...
   [5]     chr1 874420-874509      + | uc001abw.1,uc031pjl.1,uc031pjm.1,...
   ...      ...           ...    ... .                                  ...
   [9]     chr1 877790-877868      + | uc001abw.1,uc031pjl.1,uc031pjm.1,...
  [10]     chr1 877939-878438      + | uc001abw.1,uc031pjl.1,uc031pjm.1,...
  [11]     chr1 878633-878757      + | uc001abw.1,uc031pjl.1,uc031pjm.1,...
  [12]     chr1 879078-879188      + | uc001abw.1,uc031pjm.1,uc001abx.2,...
  [13]     chr1 879288-879533      + | uc001abw.1,uc001abx.2,uc031pjp.1,...

And here:

> cds_grl2 <- unlist(range(cds_grl))
> cds_grl2[names(cds_grl2) %in% samd11[,2],]
GRanges object with 26 ranges and 0 metadata columns:
             seqnames        ranges strand
                <Rle>     <IRanges>  <Rle>
  uc001abv.1     chr1 861322-871275      +
  uc001abw.1     chr1 861322-879533      +
  uc001abx.2     chr1 861322-879533      +
  uc031pjl.1     chr1 861322-879533      +
  uc031pjm.1     chr1 861322-879533      +
         ...      ...           ...    ...
  uc031pki.1     chr1 861322-879706      +
  uc031pkj.1     chr1 861322-879706      +
  uc031pkm.1     chr1 861322-879533      +
  uc031pko.1     chr1 874690-879533      +
  uc031pkp.1     chr1 877822-879533      +
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

So you don't have a single GRanges item for all the CDS for this gene. We can use reduce to combine transcripts:

> cds_grl3 <- reduce(cds_grl2)
> cds_grl3[cds_grl3 %over% gns["148398",],]
GRanges object with 1 range and 0 metadata columns:
      seqnames        ranges strand
         <Rle>     <IRanges>  <Rle>
  [1]     chr1 861322-879706      +
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

Which is cool, but reduce will also take any overlapping, unrelated transcripts and then reduce them to a single range as well, which isn't cool. AND, this still won't give you a single GRanges item for a single gene if the CDS aren't overlapping, which is a thing.

ADD REPLY • link 7.1 years ago James W. MacDonald 68k

0

Entering edit mode

I assumed by "gene" the OP meant "transcript". To do it by gene, you would need to use the gene IDs instead. Trans-splicing is another complication though, where there will be multiple ranges per transcript, on different sequences and/or strands. But it's easy to restrict to length one elements before doing the unlist().

ADD REPLY • link 7.1 years ago Michael Lawrence ★ 11k