Handling pseudogenes in RNA-seq
Entering edit mode
Last seen 5.9 years ago

I recently performed an RNA-seq experiment that was mapped using STAR through a package called zUMIs. Typically, our reads are 66 bp (and in the past our experiments have been mapped to the human genome) but this time our data ended up being 50 bp reads to the mouse genome. Our final counts dataset seems to be dominated by pseudogenes; close to 25% of the UMIs are linked to pseudogenes.

Of course, while pseudogenes can be transcribed, in our case it seems more likely that we have had an issue from the mapping front (I'm guessing due to the shorter read length). From our biological context, we certainly don't expect a massive number of transcribed pseudogenes. My question is - is there any way to coalesce the counts between genes and their corresponding pseudogenes (without remapping)? And if there isn't a good way to handle this, what STAR settings should we try adjusting to map to further promote mapping to canonical genes over pseudogenes?


rna-seq pseudogenes STAR counts • 3.4k views
Entering edit mode
Last seen 56 minutes ago
WEHI, Melbourne, Australia

This forum is to help you with Bioconductor software, but neither STAR nor zUMIs are Bioconductor packages. You need to send your question to the zUMIs authors, for example on their github issues page.

Entering edit mode

Thanks for the response. That's true, but I was hoping there might be some bioconductor solution for coalescing pseudogenes to their corresponding canonical gene post-mapping (i.e. with the counts table). I didn't know if anybody had experience with doing this. I included the mapping information just for context. 

Entering edit mode
Levi Waldron ★ 1.1k
Last seen 12 days ago
CUNY Graduate School of Public Health a…

I can't advise on the biological advisability of this, but Bioconductor annotation packages do annotate pseudogene transcripts as well as gene transcripts:

> library(EnsDb.Hsapiens.v86)
> ts <- transcripts(EnsDb.Hsapiens.v86)
> grep("pseudo", unique(ts$tx_biotype), val=TRUE)
 [1] "transcribed_unprocessed_pseudogene" "unprocessed_pseudogene"            
 [3] "processed_pseudogene"               "transcribed_processed_pseudogene"  
 [5] "pseudogene"                         "unitary_pseudogene"                
 [7] "polymorphic_pseudogene"             "transcribed_unitary_pseudogene"    
 [9] "IG_V_pseudogene"                    "TR_V_pseudogene"                   
[11] "TR_J_pseudogene"                    "IG_C_pseudogene"                   
[13] "IG_J_pseudogene"                    "IG_pseudogene"

So you could take the gene identifiers from these transcripts:

> gns <- unique(ts$gene_id)

and use them to group ranges by gene, whether the ranges are gene or pseudogene (I only do this for the first 10 unique gene IDs, because this is slow and the faster relist() approach to constructing a GRangesList escapes me in this case):

> gnranges <- lapply(gns[1:10], function(gn){
+   ts[ts$gene_id %in% gn]
+ })
> gnranges <- GRangesList(gnranges)
> names(gnranges) <- gns[1:10]

You can see that some of these genes have multiple transcripts:

> elementNROWS(gnranges)
ENSG00000223972 ENSG00000227232 ENSG00000278267 ENSG00000243485 
              2               1               1               3 
ENSG00000237613 ENSG00000268020 ENSG00000240361 ENSG00000186092 
              2               1               1               1 
ENSG00000238009 ENSG00000239945 
              5               1

And some of these are due to processed transcripts plus pseudogenes (and others are all lincRNAs etc, you might want to do some additional filtering):

> gnranges[[1]]
GRanges object with 2 ranges and 6 metadata columns:
                  seqnames      ranges strand |           tx_id
                     <Rle>   <IRanges>  <Rle> |     <character>
  ENST00000456328        1 11869-14409      + | ENST00000456328
  ENST00000450305        1 12010-13670      + | ENST00000450305
                                          tx_biotype tx_cds_seq_start
                                         <character>        <integer>
  ENST00000456328               processed_transcript             <NA>
  ENST00000450305 transcribed_unprocessed_pseudogene             <NA>
                  tx_cds_seq_end         gene_id         tx_name
                       <integer>     <character>     <character>
  ENST00000456328           <NA> ENSG00000223972 ENST00000456328
  ENST00000450305           <NA> ENSG00000223972 ENST00000450305
  seqinfo: 357 sequences from GRCh38 genome

Then you could use this gnranges object for read-counting as described in the workflow package rnaseqGene.

Now I have no idea if this would be advisable, I'm just pointing out some capabilities of these amazing annotation and GenomicRanges packages...


Login before adding your answer.

Traffic: 989 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6