Question: order of factors
0
gravatar for adam
3.3 years ago by
adam0
adam0 wrote:

Hi Micheal,

I am trying to understand the directionality of the log2 fold change that is reported by DESeq2.  

I see in the manual that it says that the first level of the factor will be taken as the dominator for calculations by DESeq2.  I also see that the order of the factor levels matters when results() is called.  Can you elaborate on how those two work together. For example, lets suppose I am testing between level A and level B, would the results be wrong if in the DESeq() call level A was first, but in the results() call level B was first?

 

Thanks,

Adam

deseq2 • 1.4k views
ADD COMMENTlink modified 3.3 years ago by Michael Love26k • written 3.3 years ago by adam0
Answer: order of factors
0
gravatar for Michael Love
3.3 years ago by
Michael Love26k
United States
Michael Love26k wrote:

In the vignette we have:

"By default, R will choose a reference level for factors based on alphabetical order. Then, if you never tell the DESeq2 functions which level you want to compare against (e.g. which level represents the control group), the comparisons will be based on the alphabetical order of the levels. There are two solutions: you can either explicitly tell results which comparison to make using the contrast argument (this will be shown later), or you can explicitly set the factors levels."

To restate:

You don't have to specify factor levels in the beginning, but instead you would have to specify which level should be the numerator and which the denominator of the fold change when you call results() by using the contrast argument, e.g.:

results(dds, contrast=("cond","B","A"))

Another option is that you can set the factor levels manually, before you run DESeq(), and then you don't have to use the contrast argument:

results(dds)

Either way, when you print results() in the console you will see which is the numerator and which is the denominator printed above the results table.

ADD COMMENTlink written 3.3 years ago by Michael Love26k

So...

results(dds, contrast=("cond","B","A"))

returns log2(B/A)?

ADD REPLYlink modified 18 months ago • written 18 months ago by mbk0asis0

Yes, this should be printed at the top of the table also when you print it in R.

ADD REPLYlink written 18 months ago by Michael Love26k

Hi Michael,

I am honestly still confused regarding the comparison order of DESeq2. If I compare my samples with:

results(dds, contrast=c("condition", "non_treated", "treated"),  pAdjustMethod="BH")

I got a list of 1042 DEGs, but if I change the comparison order to:

 results(dds, contrast=c("condition", "treated", "non_treated"),  pAdjustMethod="BH")

I got 1122 DEGs. So around 80 more DEGs this time.

I thought I would get the same list only that the upregulated genes will be listed as downregulated genes and the other way around depending on the comparison order. But I supposed it is not the case. Should wildtyp or non treated samples always serves as baseline?

thanks in advance for your help! Dewi

ADD REPLYlink written 7 weeks ago by Sri Dewi0

Can you post all your code and details about what you’re doing. That line of code usually just changes the sign of the test statistic and LFC. Are there more details?

ADD REPLYlink written 7 weeks ago by Michael Love26k

Hi Michael,

thanks for the quick response. Sure I can post it here

library(DESeq2)
library(data.table)
library(vsn)
library(ggplot2)
library(gplots)
library(clusterProfiler)
library(ReactomePA)
library(VennDiagram)
library(plyr)
library(WGCNA)

first_condition <- "Het_Tr"
second_condition <- "Het_nonTr"
sample_number_first <- 14
sample_number_second <- 15
pvalueThreshold <- 0.05

#convert the output from featureCounts into R data.frames for DESeq2
read.counts <- read.table(paste0(outDir, "/featureCounts/counts.txt"), header=TRUE) 
row.names(read.counts) <- read.counts$Geneid #the gene Ids should be stored as row.names
read.counts <- read.counts[ , -c(1:6)] #exclude all columns that do not contain read counts
names(read.counts) <- DESeq_conditions #give meaningful sample names

#sample info for batches#
sample.info <- data.frame(condition=c(rep(first_condition,sample_number_first),rep(second_condition,sample_number_second)),sample=names(read.counts),batch=c(rep("first_batch",7),rep("second_batch",7),rep("first_batch",5),rep("second_batch",5),rep("third_batch",5)))   
DESeq.ds <- DESeqDataSetFromMatrix(countData=read.counts,colData=sample.info,design=~batch+condition) 
DESeq.ds <- DESeq.ds[rowSums(counts(DESeq.ds))>0,]  #remove genes without any counts
colSums(counts(DESeq.ds))  #investigate different library sizes
DESeq.ds <- estimateSizeFactors(DESeq.ds)  #calculate the size factors
#pre-filtering with minimum of 10 reads in 14 samples
DESeq.ds <- DESeq.ds[(rowSums(counts(DESeq.ds,normalized=TRUE)>=10)>=length(14)),]
counts.sf_normalized <- counts(DESeq.ds, normalized=TRUE) 

#rlog-transformation of read counts
rlog.DESeq.sumExp <- rlog(DESeq.ds, blind=TRUE)
rlog.norm.counts <- assay(rlog.DESeq.sumExp)

#read count correlations
distance.m_rlog <- as.dist(1-cor(rlog.norm.counts, method="pearson")) 
tiff(paste0(outDir, "/DESeq2/rlogTransformed_pearsonCorrelation.tiff"), width = 6, height = 6, units = 'in', res = 300)  
plot(hclust(distance.m_rlog), label=colnames(rlog.norm.counts),main="rlog transformed read counts\ndistance: Pearson correlation") 
dev.off()

#principal component analysis (PCA) determine wheather samples display greater variability between experimental conditions than between replicates of the same treatment 
P <- plotPCA(rlog.DESeq.sumExp)
tiff(paste0(outDir, "/DESeq2/PCA_plot.tiff"),  width = 6, height = 6, units = 'in', res = 300)
P <- P + theme_bw() + ggtitle("Rlog transformed counts") + coord_equal(ratio =3.5)
print(P)
dev.off()

#differential gene expression analysis
#DESeq uses the levels of the condition to determine the order of the comparison 
DESeq.ds <- DESeq(DESeq.ds)  
con_list <- levels(DESeq.ds$condition) #list all conditions in the DESeq dataframe

#change the comparison matrix depending on the project
contrast_con <- matrix(c(first_condition, second_condition),nrow=1, ncol=2, byrow=TRUE)

for(i in 1:dim(contrast_con)[1]){ 
con_dir <- paste0(contrast_con[i,1], "_vs_", contrast_con[i,2], '/')
system(paste("mkdir", paste0(outDir, "/DESeq2/", contrast_con[i,1], "_vs_", contrast_con[i,2])))
system(paste("mkdir", paste0(outDir, "/network_enrichment/", contrast_con[i,1], "_vs_", contrast_con[i,2])))

WT_vs_MT <- results(DESeq.ds, contrast= c("condition", contrast_con[i,1], contrast_con[i,2]), pAdjustMethod="BH")
WT_vs_MT["ENSEMBL"] <- gsub("\\..*","",rownames(WT_vs_MT)) #add ensembl number as new column
WT_vs_MT["gene_name"] <- "No name"
#convert to entrezID 
WT_vs_MT_entrezID <- bitr(WT_vs_MT$ENSEMBL,fromType="ENSEMBL", toType=c("ENTREZID","SYMBOL","GENENAME"), OrgDb="org.Mm.eg.db")
#add gene name into the results according to the ensembl number
matched <- match(WT_vs_MT$ENSEMBL, WT_vs_MT_entrezID$ENSEMBL) #indexing all the matched ensembl number found
WT_vs_MT[,8] <- WT_vs_MT_entrezID$SYMBOL[matched] #add the matched ensembl number back into the result table
WT_vs_MT <- WT_vs_MT[,c(7,8,1,2,3,4,5,6)] #reorder the table
write.csv2(WT_vs_MT, file=paste0(outDir,"/DESeq2/", con_dir, contrast_con[i,1], "_vs_", contrast_con[i,2],".csv"), row.names=FALSE)  

DEG_WT_vs_MT <- subset(WT_vs_MT , padj < pvalueThreshold)  
DEG_WT_vs_MT_sorted <- DEG_WT_vs_MT
DEG_WT_vs_MT_sorted <- DEG_WT_vs_MT_sorted[order(DEG_WT_vs_MT_sorted$log2FoldChange),] #sort the DEG according to the log2 fold change
write.csv2(DEG_WT_vs_MT_sorted, file=paste0(outDir, "/DESeq2/", con_dir, contrast_con[i,1], "_vs_", contrast_con[i,2], "_log2FC_", pvalueThreshold ,".csv"), row.names=FALSE)  
 }

I pretty much followed the examples from the vignette but maybe you can give me a hint, where it went wrong. Thanks a lot!

ADD REPLYlink written 7 weeks ago by Sri Dewi0

Can you double check that the results() command on the same dds really gives two different lists in a clean R session?

The code in results() just flips the sign of the LFC and statistic so I’m wondering if there were other differences between your two runs that give different output.

For example, are the same number of genes included in both runs?

ADD REPLYlink written 7 weeks ago by Michael Love26k

You are right, if I change the comparison order/level only in the results() I got the same list only with opposite values and I think this is due to different statistical values in between before results()?

But this also got me thinking, which condition should then always be used for the baseline?

anyway, thanks a lot for your help and I am a big fan of DESeq2 since 2014 began with my bachelor thesis! :)

ADD REPLYlink modified 7 weeks ago • written 7 weeks ago by Sri Dewi0

You're welcome :)

My guess is that you're not putting the same genes into the dds. You can do some exploration to see if this is the case. But I think it's also possible to have small numerical differences when DESeq() is run with a different coding of the design matrix. But results() will be identical.

As per baseline, if you don't have a clear reference/control group, just pick one group and use that throughout.

ADD REPLYlink written 7 weeks ago by Michael Love26k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 335 users visited in the last hour