Selecting Correct Set of Gene IDs from Mouse Gencode GFF3 for RNA Seq Analysis (Removing Genes without symbols)
Entering edit mode
pvd2107 ▴ 30
Last seen 4.9 years ago

Hi, I apologize if the answer to this question is obvious or located elsewhere, but I spent quite a bit of time searching online and posts on this site for an answer with no luck and I've been stuck on it for several days.

I am in the process of an RNA-seq analysis using the Salmon -> Tximport -> DESeq2 workflow ( and had a question about the appropriate list of genes to use from the latest Gencode mouse comprehensive gene annotation (CHR) GTF/GFF3 file. The goal of this project is to perform gene level differential expression analysis and to discover pathways that may be differentially active between several biological treatment groups (so fairly generic RNA-seq analysis). 

The Gencode comprehensive annotation has ~53,000 genes, ~22,000 of which are protein coding. When I look at differential expression between my two control groups, it appears to contain a large number of poorly annotated pseudogenes that return NA values when I attempt to annotate the file, and that I do not expect are particularly relevant to my analysis. These also have very large fold changes, which seems odd. The relevant parts of my workflow were to:

1. Run DESeq on my dds data set

> dds <- DESeqDataSetFromTximport(txi.salmon, sampleTable_filtered_3, ~condition)
> nrow(dds)
[1] 52325
> dds <- dds[ rowSums(counts(dds)) > 1, ]
> nrow(dds)
[1] 37249

> dds <- DESeq(dds)
> colData(dds)
DataFrame with 30 rows and 6 columns
              sampleid genotype      age treatment    group condition
              <factor> <factor> <factor>  <factor> <factor>  <factor>
1_S18_Quants     1_S18       B6    young       cnt     YCNT    YB6CNT
2_S19_Quants     2_S19       BJ    young       het     YHET    YHETBJ
4_S21_Quants     4_S21       B6      old      hetx    OHETX   OHETB6X
5_S22_Quants     5_S22       B6    young       cnt     YCNT    YB6CNT
7_S23_Quants     7_S23       BJ    young       iso     YISO    YISOBJ
...                ...      ...      ...       ...      ...       ...
30_S45_Quants   30_S45       B6      old       cnt     OCNT    OB6CNT
31_S46_Quants   31_S46       BJ    young       cnt     YCNT    YBJCNT
32_S47_Quants   32_S47       B6      old       het     OHET    OHETB6
33_S48_Quants   33_S48       B6    young       iso     YISO    YISOB6
34_S49_Quants   34_S49       B6      old       iso     OISO    OISOB6

2. Generate a results object for a comparison of interest

> YCNT.05 <- results(dds, contrast = c("condition", "YB6CNT", "YBJCNT"), alpha = 0.05, lfcThreshold = 1)
> summary(YCNT.05)

out of 37249 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)     : 9, 0.024% 
LFC < 0 (down)   : 15, 0.04% 
outliers [1]     : 174, 0.47% 
low counts [2]   : 0, 0% 
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

3. Subset my results object to remove padj NA values

> YCNT.05_Subset <- subset(YCNT.05, padj <= 0.05)
> YCNT.05_Subset <- YCNT.05_Subset[order(YCNT.05_Subset$log2FoldChange),]
> head(YCNT.05_Subset)
log2 fold change (MLE): condition YB6CNT vs YBJCNT 
Wald test p-value: condition YB6CNT vs YBJCNT 
DataFrame with 6 rows and 6 columns
                    baseMean log2FoldChange     lfcSE      stat       pvalue         padj
                   <numeric>      <numeric> <numeric> <numeric>    <numeric>    <numeric>
ENSMUSG00000110704 22.424026      -45.31601  5.115848 -8.662495 4.615481e-18 8.555948e-14
ENSMUSG00000082016 21.264101      -30.86202  6.149671 -4.855873 1.198576e-06 2.613953e-03
ENSMUSG00000094568  9.902831      -30.47642  6.151707 -4.791584 1.654700e-06 3.228843e-03
ENSMUSG00000103651 52.698629      -29.23280  6.149360 -4.591178 4.407518e-06 7.781368e-03
ENSMUSG00000059195 57.152734      -27.95013  5.706919 -4.722360 2.331231e-06 4.321520e-03
ENSMUSG00000086216 35.732336      -23.47339  4.956042 -4.534543 5.772839e-06 9.728545e-03

4. Use the library to annotate the gene names corresponding to the gene IDs from the gff3 file. I first removed the version numbers from the GENEID column I noted that in the TxImport vignette a gencode annotation file is also used in one example and contains these version numbers at the end of the Ensembl gene ID (I was assuming this was similarly done for annotation, but if there is a better way I would be curious).

 > library(
> columns(
 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"        "IPI"         
[14] "MGI"          "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UNIGENE"      "UNIPROT"     
> row.names(YCNT.05_Subset) = gsub("\\..*", "",row.names(YCNT.05_Subset))
> YCNT.05_Subset$symbol <- mapIds(,keys = row.names(YCNT.05_Subset), column = "SYMBOL", keytype = "ENSEMBL", multiVals = "first")
'select()' returned 1:1 mapping between keys and columns
> YCNT.05_Subset
log2 fold change (MLE): condition YB6CNT vs YBJCNT 
Wald test p-value: condition YB6CNT vs YBJCNT 
DataFrame with 24 rows and 7 columns
                    baseMean log2FoldChange     lfcSE      stat       pvalue         padj      symbol
                   <numeric>      <numeric> <numeric> <numeric>    <numeric>    <numeric> <character>
ENSMUSG00000110704 22.424026      -45.31601  5.115848 -8.662495 4.615481e-18 8.555948e-14          NA
ENSMUSG00000082016 21.264101      -30.86202  6.149671 -4.855873 1.198576e-06 2.613953e-03          NA
ENSMUSG00000094568  9.902831      -30.47642  6.151707 -4.791584 1.654700e-06 3.228843e-03          NA
ENSMUSG00000103651 52.698629      -29.23280  6.149360 -4.591178 4.407518e-06 7.781368e-03          NA
ENSMUSG00000059195 57.152734      -27.95013  5.706919 -4.722360 2.331231e-06 4.321520e-03          NA
...                      ...            ...       ...       ...          ...          ...         ...
ENSMUSG00000005800  21.84068       22.60947  4.501899  4.800077 1.586044e-06 3.228843e-03        Mmp8
ENSMUSG00000022026  14.05982       26.72935  6.134014  4.194537 2.734298e-05 4.223921e-02       Olfm4
ENSMUSG00000084936  55.66428       28.41237  3.742702  7.324221 2.402904e-13 1.781754e-09          NA
ENSMUSG00000093752  19.27306       33.77075  4.913983  6.668876 2.577693e-11 1.194600e-07          NA
ENSMUSG00000074555  11.50912       36.31998  5.086493  6.943876 3.814853e-12 2.357261e-08     Gm10714

After manually searching some of these  ENSMUSG IDs that return NA values for the symbol, I find that these are generally pseudogenes, predicted genes, or happen to not return any information by a google search. I understand that since is based on using entrez gene identifiers I shouldn't expect there to be perfect overlap.

My first question is whether I should remove these genes at all or if this analysis is valid and I should just note that these are the changes I'm observing between these two control groups. I've never seen log2 fold changes that high in RNAseq data before so it does seem a bit off to me. 

My assumption is that removing these ENSMUSG IDs at this point in the workflow is not appropriate because the incorporation of all these transcripts may be negatively affecting my FDR, but that I should remove them before running DESeq in the first place. 

I'm wondering if it is statistically appropriate to filter out these genes from my dds dataset before running DESeq2 or if there is some reason not to do that. If it is appropriate, would it make sense to simply remove all ENSMUSG IDs that return an NA symbol value. Or is there a way to pull out the 21,000 protein coding genes from the gff3 annotation to run my analysis on specifically? Is one approach better than the other? I don't have a good sense about how gene set / pathway enrichment analysis handles long and small ncRNAs that also seem to make up a significant fraction of the annotation file:  

I'm not too familiar with how I would go about adding this symbol information to the DESeqDataSet object or how to remove select ENSMUSG IDs from it. I guess it might make sense to do that at the level of the TxImport list object, but again not exactly sure the steps I would take to go about this. 

As a reference, I have a counts and log2fc csv file from an old mouse RNAseq experiment where the data was processed by a bioinformatics core facility that only includes ~18,000 genes (could be due to different counts filtering, but every gene is annotated with some symbol), including mitochondrial and pseudogenes.  It would be great if there was some sort of standardized list of genes to include for gene level analysis that I could use for this analysis. 

I also noted that Gencode has a basic annotation file, which I downloaded and tried to use in parallel, but it seems to contain practically the same number of features so I don't believe that is helpful to me.

> nrow(txi.salmon_basic$counts)

[1] 52298

> nrow(txi.salmon$counts)

[1] 52325

Thank you very much for the help!

deseq2 tximport gencode • 1.6k views
Entering edit mode

I've also noted that Gencode offers a protein coding transcript sequences fasta file. I could alternatively make an index from this file and run Salmon, which seems like it would get around my issue. I've seen this in the methods section of a few papers I've read, so maybe that is the best approach here given that this is the focus of our analysis.


Entering edit mode
Last seen 1 day ago
United States

The high LFCs are from one group having all zeros or a single count. We recommend in the vignette and workflow to moderate the LFC using lfcShrink (see vignette, search for "lfcShrink"), which will provide robust estimates of LFC.

"My first question is whether I should remove these genes at all or if this analysis is valid...If it is appropriate, would it make sense to simply remove all ENSMUSG IDs that return an NA symbol value." 

I would not remove these genes simply because they don't have IDs for a given annotation. I use the comprehensive annotation from Gencode (CHR), and do not filter out any genes based on annotation.

Entering edit mode

Thanks a lot Michael, great to know that this is common practice to include all the genes from the annotation.

lfcShrink definitely helps to moderate the LFC values, especially for these low count genes from certain groups. Seems like this will be more appropriate as an input to GSEA, etc. 


Login before adding your answer.

Traffic: 423 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6