getting the longest transcript by gene from Refseq
1
1
Entering edit mode
dalloliogm ▴ 50
@dalloliogm-7141
Last seen 3.4 years ago
European Union

The Refseq Gene table contains more than one transcript per gene:

> library(dplyr)
> library(AnnotationHub)
> ahub = AnnotationHub()
> refgene = ahub[['AH5040']]
> table(table(mcols(refgene)['name']))
    1     2     3     4     5     6     7     8     9    10    12    13    19
45178   568   102    27    74    52   171   129    13    19     1     1     5

The vast majority of genes has only one transcript, but 1,162 genes have more.

In order to simplify my analysis I would like to select only one transcript per gene, the longest. However I am not familiar with bioconductor methods and I am not sure how to select it. For the moment I am using dplyr to sort transcripts by length and get the longest:

> refgene.ranks = refgene %>%
    DataFrame(., start=start(.), end=end(.), chrom=seqnames(.), transcript.length=width(.) ) %>% 
    subset(., select=c(name, chrom, start, end, transcript.length)) %>%
    data.frame %>% 
    group_by(name) %>% 
    arrange(-transcript.length) %>% 
    mutate(rank=row_number())

The code is complicated because I am more familiar with dplyr, so I need to convert the data to a data frame in order to sort.

The refgene.ranks variable contains transcripts ranked by their length. For example NM_001080137 has 9 transcripts:

> refgene.ranks %>% subset(name == 'NM_001080137')
Source: local data frame [9 x 3]
Groups: name [1]

          name  chrom     start       end transcript.length  rank
         (chr) (fctr)     (int)     (int)             (int) (int)
1 NM_001080137   chrX 120011345 120066151             54807     1
2 NM_001080137   chrX 120067695 120071012              3318     2
3 NM_001080137   chrX 120111461 120114778              3318     3
4 NM_001080137   chrX 120096881 120100198              3318     4
5 NM_001080137   chrX 120101741 120105058              3318     5
6 NM_001080137   chrX 120106601 120109918              3318     6
7 NM_001080137   chrX 120072556 120075873              3318     7
8 NM_001080137   chrX 120077416 120080733              3318     8
9 NM_001080137   chrX 120082277 120085594              3318     9

At this point I can get a GRanges from the longest transcripts per gene:

> longest.transcripts.gr = refgene.ranks %>% 
    subset(rank==1) %>% 
    makeGRangesFromDataFrame(., keep.extra.columns=T)

Then I should be able to subset the original refgene GRanges by the coordinates of the longest transcripts. However this doesn't work very well, because after that I still get more than one transcript for some gene:

> refgene.longest = refgene %>% subsetByOverlaps(refgene.ranks.gr, type='equal')
> refgene.longest %>% mcols %>% subset(., select=name) %>% data.frame %>% count(name) %>% arrange(-n)
Source: local data frame [46,308 x 2]

           name     n
          (chr) (int)
1  NM_001130404     4
2  NM_001130405     4
3  NM_001130406     4
4  NM_001130407     4
5  NM_001242326     4
6  NM_001242327     4
7  NM_001242328     4
8  NM_001242329     4
9  NM_001242330     4
10 NM_001242331     4

 

What am I doing wrong in this intersection? Is there a simpler way to select the longest transcript by gene?

refseq annotationhub • 3.9k views
ADD COMMENT
3
Entering edit mode
@martin-morgan-1513
Last seen 6 weeks ago
United States

You can split the data structure ('UCSCData', which is a little idiosyncractic to rtracklayer) into a list, where each element belongs to a particular gene

rd = splitAsList(refgene, refgene$name)

Then find the longest (maybe this is the difference -- here ties go to the first transcript?)

maxwd = which.max(width(rd))
refgene.longest = rd[splitAsList(unname(maxwd), names(maxwd))]

Confirming that there is only one per gene, showing the first, and unlisting to retrieve something with the original geometry:

table(elementLengths(refgene.longest))
refgene.longest[[1]]
unlist(refgene.longest)

 

 

ADD COMMENT

Login before adding your answer.

Traffic: 602 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6