The Refseq Gene table contains more than one transcript per gene:
> library(dplyr)
> library(AnnotationHub)
> ahub = AnnotationHub()
> refgene = ahub[['AH5040']]
> table(table(mcols(refgene)['name']))
1 2 3 4 5 6 7 8 9 10 12 13 19
45178 568 102 27 74 52 171 129 13 19 1 1 5
The vast majority of genes has only one transcript, but 1,162 genes have more.
In order to simplify my analysis I would like to select only one transcript per gene, the longest. However I am not familiar with bioconductor methods and I am not sure how to select it. For the moment I am using dplyr to sort transcripts by length and get the longest:
> refgene.ranks = refgene %>%
DataFrame(., start=start(.), end=end(.), chrom=seqnames(.), transcript.length=width(.) ) %>%
subset(., select=c(name, chrom, start, end, transcript.length)) %>%
data.frame %>%
group_by(name) %>%
arrange(-transcript.length) %>%
mutate(rank=row_number())
The code is complicated because I am more familiar with dplyr, so I need to convert the data to a data frame in order to sort.
The refgene.ranks variable contains transcripts ranked by their length. For example NM_001080137 has 9 transcripts:
> refgene.ranks %>% subset(name == 'NM_001080137')
Source: local data frame [9 x 3]
Groups: name [1]
name chrom start end transcript.length rank
(chr) (fctr) (int) (int) (int) (int)
1 NM_001080137 chrX 120011345 120066151 54807 1
2 NM_001080137 chrX 120067695 120071012 3318 2
3 NM_001080137 chrX 120111461 120114778 3318 3
4 NM_001080137 chrX 120096881 120100198 3318 4
5 NM_001080137 chrX 120101741 120105058 3318 5
6 NM_001080137 chrX 120106601 120109918 3318 6
7 NM_001080137 chrX 120072556 120075873 3318 7
8 NM_001080137 chrX 120077416 120080733 3318 8
9 NM_001080137 chrX 120082277 120085594 3318 9
At this point I can get a GRanges from the longest transcripts per gene:
> longest.transcripts.gr = refgene.ranks %>%
subset(rank==1) %>%
makeGRangesFromDataFrame(., keep.extra.columns=T)
Then I should be able to subset the original refgene GRanges by the coordinates of the longest transcripts. However this doesn't work very well, because after that I still get more than one transcript for some gene:
> refgene.longest = refgene %>% subsetByOverlaps(refgene.ranks.gr, type='equal')
> refgene.longest %>% mcols %>% subset(., select=name) %>% data.frame %>% count(name) %>% arrange(-n)
Source: local data frame [46,308 x 2]
name n
(chr) (int)
1 NM_001130404 4
2 NM_001130405 4
3 NM_001130406 4
4 NM_001130407 4
5 NM_001242326 4
6 NM_001242327 4
7 NM_001242328 4
8 NM_001242329 4
9 NM_001242330 4
10 NM_001242331 4
What am I doing wrong in this intersection? Is there a simpler way to select the longest transcript by gene?
