Hi, I apologize if the answer to this question is obvious or located elsewhere, but I spent quite a bit of time searching online and posts on this site for an answer with no luck and I've been stuck on it for several days.
I am in the process of an RNA-seq analysis using the Salmon -> Tximport -> DESeq2 workflow (http://www.bioconductor.org/help/workflows/rnaseqGene/#annotating-and-exporting-results) and had a question about the appropriate list of genes to use from the latest Gencode mouse comprehensive gene annotation (CHR) GTF/GFF3 file. The goal of this project is to perform gene level differential expression analysis and to discover pathways that may be differentially active between several biological treatment groups (so fairly generic RNA-seq analysis).
The Gencode comprehensive annotation has ~53,000 genes, ~22,000 of which are protein coding. When I look at differential expression between my two control groups, it appears to contain a large number of poorly annotated pseudogenes that return NA values when I attempt to annotate the file, and that I do not expect are particularly relevant to my analysis. These also have very large fold changes, which seems odd. The relevant parts of my workflow were to:
1. Run DESeq on my dds data set
> dds <- DESeqDataSetFromTximport(txi.salmon, sampleTable_filtered_3, ~condition) > nrow(dds) [1] 52325 > dds <- dds[ rowSums(counts(dds)) > 1, ] > nrow(dds) [1] 37249 > dds <- DESeq(dds)
> colData(dds) DataFrame with 30 rows and 6 columns sampleid genotype age treatment group condition <factor> <factor> <factor> <factor> <factor> <factor> 1_S18_Quants 1_S18 B6 young cnt YCNT YB6CNT 2_S19_Quants 2_S19 BJ young het YHET YHETBJ 4_S21_Quants 4_S21 B6 old hetx OHETX OHETB6X 5_S22_Quants 5_S22 B6 young cnt YCNT YB6CNT 7_S23_Quants 7_S23 BJ young iso YISO YISOBJ ... ... ... ... ... ... ... 30_S45_Quants 30_S45 B6 old cnt OCNT OB6CNT 31_S46_Quants 31_S46 BJ young cnt YCNT YBJCNT 32_S47_Quants 32_S47 B6 old het OHET OHETB6 33_S48_Quants 33_S48 B6 young iso YISO YISOB6 34_S49_Quants 34_S49 B6 old iso OISO OISOB6
2. Generate a results object for a comparison of interest
> YCNT.05 <- results(dds, contrast = c("condition", "YB6CNT", "YBJCNT"), alpha = 0.05, lfcThreshold = 1) > summary(YCNT.05) out of 37249 with nonzero total read count adjusted p-value < 0.05 LFC > 0 (up) : 9, 0.024% LFC < 0 (down) : 15, 0.04% outliers [1] : 174, 0.47% low counts [2] : 0, 0% (mean count < 0) [1] see 'cooksCutoff' argument of ?results [2] see 'independentFiltering' argument of ?results
3. Subset my results object to remove padj NA values
> YCNT.05_Subset <- subset(YCNT.05, padj <= 0.05) > YCNT.05_Subset <- YCNT.05_Subset[order(YCNT.05_Subset$log2FoldChange),] > head(YCNT.05_Subset) log2 fold change (MLE): condition YB6CNT vs YBJCNT Wald test p-value: condition YB6CNT vs YBJCNT DataFrame with 6 rows and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> ENSMUSG00000110704 22.424026 -45.31601 5.115848 -8.662495 4.615481e-18 8.555948e-14 ENSMUSG00000082016 21.264101 -30.86202 6.149671 -4.855873 1.198576e-06 2.613953e-03 ENSMUSG00000094568 9.902831 -30.47642 6.151707 -4.791584 1.654700e-06 3.228843e-03 ENSMUSG00000103651 52.698629 -29.23280 6.149360 -4.591178 4.407518e-06 7.781368e-03 ENSMUSG00000059195 57.152734 -27.95013 5.706919 -4.722360 2.331231e-06 4.321520e-03 ENSMUSG00000086216 35.732336 -23.47339 4.956042 -4.534543 5.772839e-06 9.728545e-03
4. Use the org.Mm.eg.db library to annotate the gene names corresponding to the gene IDs from the gff3 file. I first removed the version numbers from the GENEID column I noted that in the TxImport vignette a gencode annotation file is also used in one example and contains these version numbers at the end of the Ensembl gene ID (I was assuming this was similarly done for annotation, but if there is a better way I would be curious).
> library(org.Mm.eg.db) > columns(org.Mm.eg.db) [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS" "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME" "GO" "GOALL" "IPI" [14] "MGI" "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM" "PMID" "PROSITE" "REFSEQ" "SYMBOL" "UNIGENE" "UNIPROT" > row.names(YCNT.05_Subset) = gsub("\\..*", "",row.names(YCNT.05_Subset)) > YCNT.05_Subset$symbol <- mapIds(org.Mm.eg.db,keys = row.names(YCNT.05_Subset), column = "SYMBOL", keytype = "ENSEMBL", multiVals = "first") 'select()' returned 1:1 mapping between keys and columns > YCNT.05_Subset log2 fold change (MLE): condition YB6CNT vs YBJCNT Wald test p-value: condition YB6CNT vs YBJCNT DataFrame with 24 rows and 7 columns baseMean log2FoldChange lfcSE stat pvalue padj symbol <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <character> ENSMUSG00000110704 22.424026 -45.31601 5.115848 -8.662495 4.615481e-18 8.555948e-14 NA ENSMUSG00000082016 21.264101 -30.86202 6.149671 -4.855873 1.198576e-06 2.613953e-03 NA ENSMUSG00000094568 9.902831 -30.47642 6.151707 -4.791584 1.654700e-06 3.228843e-03 NA ENSMUSG00000103651 52.698629 -29.23280 6.149360 -4.591178 4.407518e-06 7.781368e-03 NA ENSMUSG00000059195 57.152734 -27.95013 5.706919 -4.722360 2.331231e-06 4.321520e-03 NA ... ... ... ... ... ... ... ... ENSMUSG00000005800 21.84068 22.60947 4.501899 4.800077 1.586044e-06 3.228843e-03 Mmp8 ENSMUSG00000022026 14.05982 26.72935 6.134014 4.194537 2.734298e-05 4.223921e-02 Olfm4 ENSMUSG00000084936 55.66428 28.41237 3.742702 7.324221 2.402904e-13 1.781754e-09 NA ENSMUSG00000093752 19.27306 33.77075 4.913983 6.668876 2.577693e-11 1.194600e-07 NA ENSMUSG00000074555 11.50912 36.31998 5.086493 6.943876 3.814853e-12 2.357261e-08 Gm10714
After manually searching some of these ENSMUSG IDs that return NA values for the symbol, I find that these are generally pseudogenes, predicted genes, or happen to not return any information by a google search. I understand that since org.Mm.eg.db is based on using entrez gene identifiers I shouldn't expect there to be perfect overlap.
My first question is whether I should remove these genes at all or if this analysis is valid and I should just note that these are the changes I'm observing between these two control groups. I've never seen log2 fold changes that high in RNAseq data before so it does seem a bit off to me.
My assumption is that removing these ENSMUSG IDs at this point in the workflow is not appropriate because the incorporation of all these transcripts may be negatively affecting my FDR, but that I should remove them before running DESeq in the first place.
I'm wondering if it is statistically appropriate to filter out these genes from my dds dataset before running DESeq2 or if there is some reason not to do that. If it is appropriate, would it make sense to simply remove all ENSMUSG IDs that return an NA symbol value. Or is there a way to pull out the 21,000 protein coding genes from the gff3 annotation to run my analysis on specifically? Is one approach better than the other? I don't have a good sense about how gene set / pathway enrichment analysis handles long and small ncRNAs that also seem to make up a significant fraction of the annotation file: https://www.gencodegenes.org/mouse_stats/archive.html#aM16.
I'm not too familiar with how I would go about adding this symbol information to the DESeqDataSet object or how to remove select ENSMUSG IDs from it. I guess it might make sense to do that at the level of the TxImport list object, but again not exactly sure the steps I would take to go about this.
As a reference, I have a counts and log2fc csv file from an old mouse RNAseq experiment where the data was processed by a bioinformatics core facility that only includes ~18,000 genes (could be due to different counts filtering, but every gene is annotated with some symbol), including mitochondrial and pseudogenes. It would be great if there was some sort of standardized list of genes to include for gene level analysis that I could use for this analysis.
I also noted that Gencode has a basic annotation file, which I downloaded and tried to use in parallel, but it seems to contain practically the same number of features so I don't believe that is helpful to me.
> nrow(txi.salmon_basic$counts)
[1] 52298
> nrow(txi.salmon$counts)
[1] 52325
Thank you very much for the help!
I've also noted that Gencode offers a protein coding transcript sequences fasta file. I could alternatively make an index from this file and run Salmon, which seems like it would get around my issue. I've seen this in the methods section of a few papers I've read, so maybe that is the best approach here given that this is the focus of our analysis.