I can't advise on the biological advisability of this, but Bioconductor annotation packages do annotate pseudogene transcripts as well as gene transcripts:
> library(EnsDb.Hsapiens.v86)
> ts <- transcripts(EnsDb.Hsapiens.v86)
> grep("pseudo", unique(ts$tx_biotype), val=TRUE)
[1] "transcribed_unprocessed_pseudogene" "unprocessed_pseudogene"
[3] "processed_pseudogene" "transcribed_processed_pseudogene"
[5] "pseudogene" "unitary_pseudogene"
[7] "polymorphic_pseudogene" "transcribed_unitary_pseudogene"
[9] "IG_V_pseudogene" "TR_V_pseudogene"
[11] "TR_J_pseudogene" "IG_C_pseudogene"
[13] "IG_J_pseudogene" "IG_pseudogene"
So you could take the gene identifiers from these transcripts:
> gns <- unique(ts$gene_id)
and use them to group ranges by gene, whether the ranges are gene or pseudogene (I only do this for the first 10 unique gene IDs, because this is slow and the faster relist()
approach to constructing a GRangesList
escapes me in this case):
> gnranges <- lapply(gns[1:10], function(gn){
+ ts[ts$gene_id %in% gn]
+ })
> gnranges <- GRangesList(gnranges)
> names(gnranges) <- gns[1:10]
You can see that some of these genes have multiple transcripts:
> elementNROWS(gnranges)
ENSG00000223972 ENSG00000227232 ENSG00000278267 ENSG00000243485
2 1 1 3
ENSG00000237613 ENSG00000268020 ENSG00000240361 ENSG00000186092
2 1 1 1
ENSG00000238009 ENSG00000239945
5 1
And some of these are due to processed transcripts plus pseudogenes (and others are all lincRNAs etc, you might want to do some additional filtering):
> gnranges[[1]]
GRanges object with 2 ranges and 6 metadata columns:
seqnames ranges strand | tx_id
<Rle> <IRanges> <Rle> | <character>
ENST00000456328 1 11869-14409 + | ENST00000456328
ENST00000450305 1 12010-13670 + | ENST00000450305
tx_biotype tx_cds_seq_start
<character> <integer>
ENST00000456328 processed_transcript <NA>
ENST00000450305 transcribed_unprocessed_pseudogene <NA>
tx_cds_seq_end gene_id tx_name
<integer> <character> <character>
ENST00000456328 <NA> ENSG00000223972 ENST00000456328
ENST00000450305 <NA> ENSG00000223972 ENST00000450305
-------
seqinfo: 357 sequences from GRCh38 genome
>
Then you could use this gnranges
object for read-counting as described in the workflow package rnaseqGene.
Now I have no idea if this would be advisable, I'm just pointing out some capabilities of these amazing annotation and GenomicRanges
packages...
Thanks for the response. That's true, but I was hoping there might be some bioconductor solution for coalescing pseudogenes to their corresponding canonical gene post-mapping (i.e. with the counts table). I didn't know if anybody had experience with doing this. I included the mapping information just for context.