Question: order of factors
0

Hi Micheal,

I am trying to understand the directionality of the log2 fold change that is reported by DESeq2.

I see in the manual that it says that the first level of the factor will be taken as the dominator for calculations by DESeq2.  I also see that the order of the factor levels matters when results() is called.  Can you elaborate on how those two work together. For example, lets suppose I am testing between level A and level B, would the results be wrong if in the DESeq() call level A was first, but in the results() call level B was first?

Thanks,

deseq2 • 1.4k views  modified 3.3 years ago by Michael Love26k • written 3.3 years ago by adam0
0
Michael Love26k wrote:

In the vignette we have:

"By default, R will choose a reference level for factors based on alphabetical order. Then, if you never tell the DESeq2 functions which level you want to compare against (e.g. which level represents the control group), the comparisons will be based on the alphabetical order of the levels. There are two solutions: you can either explicitly tell results which comparison to make using the contrast argument (this will be shown later), or you can explicitly set the factors levels."

To restate:

You don't have to specify factor levels in the beginning, but instead you would have to specify which level should be the numerator and which the denominator of the fold change when you call results() by using the contrast argument, e.g.:

results(dds, contrast=("cond","B","A"))

Another option is that you can set the factor levels manually, before you run DESeq(), and then you don't have to use the contrast argument:

results(dds)

Either way, when you print results() in the console you will see which is the numerator and which is the denominator printed above the results table.

So...

results(dds, contrast=("cond","B","A"))

returns log2(B/A)?

Yes, this should be printed at the top of the table also when you print it in R.

Hi Michael,

I am honestly still confused regarding the comparison order of DESeq2. If I compare my samples with:

results(dds, contrast=c("condition", "non_treated", "treated"),  pAdjustMethod="BH")


I got a list of 1042 DEGs, but if I change the comparison order to:

 results(dds, contrast=c("condition", "treated", "non_treated"),  pAdjustMethod="BH")


I got 1122 DEGs. So around 80 more DEGs this time.

I thought I would get the same list only that the upregulated genes will be listed as downregulated genes and the other way around depending on the comparison order. But I supposed it is not the case. Should wildtyp or non treated samples always serves as baseline?

Can you post all your code and details about what you’re doing. That line of code usually just changes the sign of the test statistic and LFC. Are there more details?

Hi Michael,

thanks for the quick response. Sure I can post it here

library(DESeq2)
library(data.table)
library(vsn)
library(ggplot2)
library(gplots)
library(clusterProfiler)
library(ReactomePA)
library(VennDiagram)
library(plyr)
library(WGCNA)

first_condition <- "Het_Tr"
second_condition <- "Het_nonTr"
sample_number_first <- 14
sample_number_second <- 15
pvalueThreshold <- 0.05

#convert the output from featureCounts into R data.frames for DESeq2
row.names(read.counts) <- read.counts$Geneid #the gene Ids should be stored as row.names read.counts <- read.counts[ , -c(1:6)] #exclude all columns that do not contain read counts names(read.counts) <- DESeq_conditions #give meaningful sample names #sample info for batches# sample.info <- data.frame(condition=c(rep(first_condition,sample_number_first),rep(second_condition,sample_number_second)),sample=names(read.counts),batch=c(rep("first_batch",7),rep("second_batch",7),rep("first_batch",5),rep("second_batch",5),rep("third_batch",5))) DESeq.ds <- DESeqDataSetFromMatrix(countData=read.counts,colData=sample.info,design=~batch+condition) DESeq.ds <- DESeq.ds[rowSums(counts(DESeq.ds))>0,] #remove genes without any counts colSums(counts(DESeq.ds)) #investigate different library sizes DESeq.ds <- estimateSizeFactors(DESeq.ds) #calculate the size factors #pre-filtering with minimum of 10 reads in 14 samples DESeq.ds <- DESeq.ds[(rowSums(counts(DESeq.ds,normalized=TRUE)>=10)>=length(14)),] counts.sf_normalized <- counts(DESeq.ds, normalized=TRUE) #rlog-transformation of read counts rlog.DESeq.sumExp <- rlog(DESeq.ds, blind=TRUE) rlog.norm.counts <- assay(rlog.DESeq.sumExp) #read count correlations distance.m_rlog <- as.dist(1-cor(rlog.norm.counts, method="pearson")) tiff(paste0(outDir, "/DESeq2/rlogTransformed_pearsonCorrelation.tiff"), width = 6, height = 6, units = 'in', res = 300) plot(hclust(distance.m_rlog), label=colnames(rlog.norm.counts),main="rlog transformed read counts\ndistance: Pearson correlation") dev.off() #principal component analysis (PCA) determine wheather samples display greater variability between experimental conditions than between replicates of the same treatment P <- plotPCA(rlog.DESeq.sumExp) tiff(paste0(outDir, "/DESeq2/PCA_plot.tiff"), width = 6, height = 6, units = 'in', res = 300) P <- P + theme_bw() + ggtitle("Rlog transformed counts") + coord_equal(ratio =3.5) print(P) dev.off() #differential gene expression analysis #DESeq uses the levels of the condition to determine the order of the comparison DESeq.ds <- DESeq(DESeq.ds) con_list <- levels(DESeq.ds$condition) #list all conditions in the DESeq dataframe

#change the comparison matrix depending on the project
contrast_con <- matrix(c(first_condition, second_condition),nrow=1, ncol=2, byrow=TRUE)

for(i in 1:dim(contrast_con)){
con_dir <- paste0(contrast_con[i,1], "_vs_", contrast_con[i,2], '/')
system(paste("mkdir", paste0(outDir, "/DESeq2/", contrast_con[i,1], "_vs_", contrast_con[i,2])))
system(paste("mkdir", paste0(outDir, "/network_enrichment/", contrast_con[i,1], "_vs_", contrast_con[i,2])))

WT_vs_MT <- results(DESeq.ds, contrast= c("condition", contrast_con[i,1], contrast_con[i,2]), pAdjustMethod="BH")
WT_vs_MT["ENSEMBL"] <- gsub("\\..*","",rownames(WT_vs_MT)) #add ensembl number as new column
WT_vs_MT["gene_name"] <- "No name"
#convert to entrezID
WT_vs_MT_entrezID <- bitr(WT_vs_MT$ENSEMBL,fromType="ENSEMBL", toType=c("ENTREZID","SYMBOL","GENENAME"), OrgDb="org.Mm.eg.db") #add gene name into the results according to the ensembl number matched <- match(WT_vs_MT$ENSEMBL, WT_vs_MT_entrezID$ENSEMBL) #indexing all the matched ensembl number found WT_vs_MT[,8] <- WT_vs_MT_entrezID$SYMBOL[matched] #add the matched ensembl number back into the result table
WT_vs_MT <- WT_vs_MT[,c(7,8,1,2,3,4,5,6)] #reorder the table
write.csv2(WT_vs_MT, file=paste0(outDir,"/DESeq2/", con_dir, contrast_con[i,1], "_vs_", contrast_con[i,2],".csv"), row.names=FALSE)

DEG_WT_vs_MT <- subset(WT_vs_MT , padj < pvalueThreshold)
DEG_WT_vs_MT_sorted <- DEG_WT_vs_MT
DEG_WT_vs_MT_sorted <- DEG_WT_vs_MT_sorted[order(DEG_WT_vs_MT_sorted\$log2FoldChange),] #sort the DEG according to the log2 fold change
write.csv2(DEG_WT_vs_MT_sorted, file=paste0(outDir, "/DESeq2/", con_dir, contrast_con[i,1], "_vs_", contrast_con[i,2], "_log2FC_", pvalueThreshold ,".csv"), row.names=FALSE)
}


I pretty much followed the examples from the vignette but maybe you can give me a hint, where it went wrong. Thanks a lot!

Can you double check that the results() command on the same dds really gives two different lists in a clean R session?

The code in results() just flips the sign of the LFC and statistic so I’m wondering if there were other differences between your two runs that give different output.

For example, are the same number of genes included in both runs?

You are right, if I change the comparison order/level only in the results() I got the same list only with opposite values and I think this is due to different statistical values in between before results()?

But this also got me thinking, which condition should then always be used for the baseline?

anyway, thanks a lot for your help and I am a big fan of DESeq2 since 2014 began with my bachelor thesis! :)

You're welcome :)

My guess is that you're not putting the same genes into the dds. You can do some exploration to see if this is the case. But I think it's also possible to have small numerical differences when DESeq() is run with a different coding of the design matrix. But results() will be identical.

As per baseline, if you don't have a clear reference/control group, just pick one group and use that throughout.