Question

Handling pseudogenes in RNA-seq

0

Entering edit mode

supremerulersuraj • 0

@supremerulersuraj-12999

Last seen 7.0 years ago

I recently performed an RNA-seq experiment that was mapped using STAR through a package called zUMIs. Typically, our reads are 66 bp (and in the past our experiments have been mapped to the human genome) but this time our data ended up being 50 bp reads to the mouse genome. Our final counts dataset seems to be dominated by pseudogenes; close to 25% of the UMIs are linked to pseudogenes.

Of course, while pseudogenes can be transcribed, in our case it seems more likely that we have had an issue from the mapping front (I'm guessing due to the shorter read length). From our biological context, we certainly don't expect a massive number of transcribed pseudogenes. My question is - is there any way to coalesce the counts between genes and their corresponding pseudogenes (without remapping)? And if there isn't a good way to handle this, what STAR settings should we try adjusting to map to further promote mapping to canonical genes over pseudogenes?

Thanks!

rna-seq pseudogenes STAR counts • 4.3k views

ADD COMMENT • link updated 7.0 years ago by Levi Waldron ★ 1.1k • written 7.0 years ago by supremerulersuraj • 0

score 0 · Answer 1 · 2018-06-14

0

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 23 minutes ago

WEHI, Melbourne, Australia

This forum is to help you with Bioconductor software, but neither STAR nor zUMIs are Bioconductor packages. You need to send your question to the zUMIs authors, for example on their github issues page.

ADD COMMENT • link 7.0 years ago Gordon Smyth 52k

0

Entering edit mode

Thanks for the response. That's true, but I was hoping there might be some bioconductor solution for coalescing pseudogenes to their corresponding canonical gene post-mapping (i.e. with the counts table). I didn't know if anybody had experience with doing this. I included the mapping information just for context.

ADD REPLY • link 7.0 years ago supremerulersuraj • 0

score 0 · Answer 2 · 2018-06-16

I can't advise on the biological advisability of this, but Bioconductor annotation packages do annotate pseudogene transcripts as well as gene transcripts:

> library(EnsDb.Hsapiens.v86)
> ts <- transcripts(EnsDb.Hsapiens.v86)
> grep("pseudo", unique(ts$tx_biotype), val=TRUE)
 [1] "transcribed_unprocessed_pseudogene" "unprocessed_pseudogene"            
 [3] "processed_pseudogene"               "transcribed_processed_pseudogene"  
 [5] "pseudogene"                         "unitary_pseudogene"                
 [7] "polymorphic_pseudogene"             "transcribed_unitary_pseudogene"    
 [9] "IG_V_pseudogene"                    "TR_V_pseudogene"                   
[11] "TR_J_pseudogene"                    "IG_C_pseudogene"                   
[13] "IG_J_pseudogene"                    "IG_pseudogene"

So you could take the gene identifiers from these transcripts:

> gns <- unique(ts$gene_id)

and use them to group ranges by gene, whether the ranges are gene or pseudogene (I only do this for the first 10 unique gene IDs, because this is slow and the faster relist() approach to constructing a GRangesList escapes me in this case):

> gnranges <- lapply(gns[1:10], function(gn){
+   ts[ts$gene_id %in% gn]
+ })
> gnranges <- GRangesList(gnranges)
> names(gnranges) <- gns[1:10]

You can see that some of these genes have multiple transcripts:

> elementNROWS(gnranges)
ENSG00000223972 ENSG00000227232 ENSG00000278267 ENSG00000243485 
              2               1               1               3 
ENSG00000237613 ENSG00000268020 ENSG00000240361 ENSG00000186092 
              2               1               1               1 
ENSG00000238009 ENSG00000239945 
              5               1

And some of these are due to processed transcripts plus pseudogenes (and others are all lincRNAs etc, you might want to do some additional filtering):

> gnranges[[1]]
GRanges object with 2 ranges and 6 metadata columns:
                  seqnames      ranges strand |           tx_id
                     <Rle>   <IRanges>  <Rle> |     <character>
  ENST00000456328        1 11869-14409      + | ENST00000456328
  ENST00000450305        1 12010-13670      + | ENST00000450305
                                          tx_biotype tx_cds_seq_start
                                         <character>        <integer>
  ENST00000456328               processed_transcript             <NA>
  ENST00000450305 transcribed_unprocessed_pseudogene             <NA>
                  tx_cds_seq_end         gene_id         tx_name
                       <integer>     <character>     <character>
  ENST00000456328           <NA> ENSG00000223972 ENST00000456328
  ENST00000450305           <NA> ENSG00000223972 ENST00000450305
  -------
  seqinfo: 357 sequences from GRCh38 genome
>

Then you could use this gnranges object for read-counting as described in the workflow package rnaseqGene.

Now I have no idea if this would be advisable, I'm just pointing out some capabilities of these amazing annotation and GenomicRanges packages...