The Refseq Gene table contains more than one transcript per gene:
> library(dplyr) > library(AnnotationHub) > ahub = AnnotationHub() > refgene = ahub[['AH5040']] > table(table(mcols(refgene)['name'])) 1 2 3 4 5 6 7 8 9 10 12 13 19 45178 568 102 27 74 52 171 129 13 19 1 1 5
The vast majority of genes has only one transcript, but 1,162 genes have more.
In order to simplify my analysis I would like to select only one transcript per gene, the longest. However I am not familiar with bioconductor methods and I am not sure how to select it. For the moment I am using dplyr to sort transcripts by length and get the longest:
> refgene.ranks = refgene %>% DataFrame(., start=start(.), end=end(.), chrom=seqnames(.), transcript.length=width(.) ) %>% subset(., select=c(name, chrom, start, end, transcript.length)) %>% data.frame %>% group_by(name) %>% arrange(-transcript.length) %>% mutate(rank=row_number())
The code is complicated because I am more familiar with dplyr, so I need to convert the data to a data frame in order to sort.
The refgene.ranks variable contains transcripts ranked by their length. For example NM_001080137 has 9 transcripts:
> refgene.ranks %>% subset(name == 'NM_001080137') Source: local data frame [9 x 3] Groups: name [1] name chrom start end transcript.length rank (chr) (fctr) (int) (int) (int) (int) 1 NM_001080137 chrX 120011345 120066151 54807 1 2 NM_001080137 chrX 120067695 120071012 3318 2 3 NM_001080137 chrX 120111461 120114778 3318 3 4 NM_001080137 chrX 120096881 120100198 3318 4 5 NM_001080137 chrX 120101741 120105058 3318 5 6 NM_001080137 chrX 120106601 120109918 3318 6 7 NM_001080137 chrX 120072556 120075873 3318 7 8 NM_001080137 chrX 120077416 120080733 3318 8 9 NM_001080137 chrX 120082277 120085594 3318 9
At this point I can get a GRanges from the longest transcripts per gene:
> longest.transcripts.gr = refgene.ranks %>% subset(rank==1) %>% makeGRangesFromDataFrame(., keep.extra.columns=T)
Then I should be able to subset the original refgene GRanges by the coordinates of the longest transcripts. However this doesn't work very well, because after that I still get more than one transcript for some gene:
> refgene.longest = refgene %>% subsetByOverlaps(refgene.ranks.gr, type='equal') > refgene.longest %>% mcols %>% subset(., select=name) %>% data.frame %>% count(name) %>% arrange(-n) Source: local data frame [46,308 x 2] name n (chr) (int) 1 NM_001130404 4 2 NM_001130405 4 3 NM_001130406 4 4 NM_001130407 4 5 NM_001242326 4 6 NM_001242327 4 7 NM_001242328 4 8 NM_001242329 4 9 NM_001242330 4 10 NM_001242331 4
What am I doing wrong in this intersection? Is there a simpler way to select the longest transcript by gene?