Hi there,
I am performing DEA on RNAseq data, to see which genes (both mRNAs and lncRNAs) are differentially expressed between consecutive age groups. We have about 10 age groups. I have utilised 2 methods to perform DEA, edgeR and DESeq2. My code works perfectly fine but what I am finding interesting is that: with DESeq2, I have in total 89 DEgenes from all 9 comparisons (between 10 age groups) whilst with edgeR, I have approx 220 genes from all 9 comparisons and if I was to compare both results, I only have 13 in common.
My question is, could my DESeq2 criteria be too strict? Although I used the same padj threshold (FDR) and the same preprocessing criteria with both methods. Below is my code
dds <- DESeqDataSetFromMatrix(countData = countData,
colData = metaData,
design = ~AGE )
dds <- DESeq(dds)
plotDispEsts(dds)
vst_data <- vst(dds, blind= F)
# Define the order of age groups
age_groups <- sort(unique(metaData$AGE))
# Create an empty list to store the results
results_list <- list()
# Perform DE for each adjacent age group
for (i in 2:length(age_groups)) {
group1 <- age_groups[i-1]
group2 <- age_groups[i]
comparison_name <- paste0("res_", group2, "vs", group1, sep="")
# Subset DESeqDataSet
sub_dds <- dds [ dds$AGE %in% c(group1, group2)]
# Subset metadata
sub_meta <- metaData[metaData$AGE %in% c(group1, group2), ]
contrast <- c("AGE", group2, group1)
# Extract results & assign a name to the result object
res <- results(sub_dds,
contrast = contrast,
alpha= 0.05,
pAdjustMethod = "fdr")
results_list[[comparison_name]] <- lfcShrink(sub_dds,
contrast = contrast,
res = res,
type = "ashr")
}
Any help or advice, would be really appreciated!
Thanks
HI ATpoint
Thank you, great comments! Subsetting of data you mentioned makes perfect sense. I will adjust the code accordingly.
My age groups are factorial (A,B,C,D,... J) where group age A refers to 20-24 years and B to 25-29 years so on and so forth.
Just to note, the above code is using the quasi-likelihood F-test. I also did DE using the likelihood ration test/
the reason I want to use both is because I read many papers about the use of different approaches of DEA, especially in the context of network analysis, which I am interested in doing.
My question basically was if there was anything wrong with my code/ approah/ parameters taken with DESeq2 code and if not, could it be my edgeR code. and if not any of them, why would my results differ that much?!
I truly appreciate your help and the time you're taking to reply to me.
Thanks
I do not see any normalization in the edgeR code, so this is one issue. Second, this code cannot run because there is no dispersion estimation. Third, you're running the entire edgeR analysis on the subsets while for DESeq2 you're running everything on the full data (
DESeq()
) function and then use contrasts. Running on full data is different in power than subsetting. As said above, the subsetting in DESeq2 is not doing anything, as you would need to subset beforeDESeq()
to make a difference.In a nutshell, to be comparable, be sure you follow the edgeR manual in terms of correct order of functions, and do not subset to make it comparable to DESeq2. The glmQLFTest via contrasts will do the subsetting / per-contrast extraction for you.