Question

order of factors

0

Entering edit mode

adam • 0

@adam-10826

Last seen 9.5 years ago

Hi Micheal,

I am trying to understand the directionality of the log2 fold change that is reported by DESeq2.

I see in the manual that it says that the first level of the factor will be taken as the dominator for calculations by DESeq2. I also see that the order of the factor levels matters when results() is called. Can you elaborate on how those two work together. For example, lets suppose I am testing between level A and level B, would the results be wrong if in the DESeq() call level A was first, but in the results() call level B was first?

Thanks,

Adam

deseq2 • 11k views

ADD COMMENT • link updated 9.5 years ago by Michael Love 43k • written 9.5 years ago by adam • 0

score 0 · Answer 1 · 2016-08-02

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 3 days ago

United States

In the vignette we have:

"By default, R will choose a reference level for factors based on alphabetical order. Then, if you never tell the DESeq2 functions which level you want to compare against (e.g. which level represents the control group), the comparisons will be based on the alphabetical order of the levels. There are two solutions: you can either explicitly tell results which comparison to make using the contrast argument (this will be shown later), or you can explicitly set the factors levels."

To restate:

You don't have to specify factor levels in the beginning, but instead you would have to specify which level should be the numerator and which the denominator of the fold change when you call results() by using the contrast argument, e.g.:

results(dds, contrast=("cond","B","A"))

Another option is that you can set the factor levels manually, before you run DESeq(), and then you don't have to use the contrast argument:

results(dds)

Either way, when you print results() in the console you will see which is the numerator and which is the denominator printed above the results table.

ADD COMMENT • link 9.5 years ago Michael Love 43k

0

Entering edit mode

So...

results(dds, contrast=("cond","B","A"))

returns log2(B/A)?

ADD REPLY • link 7.7 years ago mbk0asis • 0

0

Entering edit mode

Yes, this should be printed at the top of the table also when you print it in R.

ADD REPLY • link 7.7 years ago Michael Love 43k

0

Entering edit mode

Hi Michael,

I am honestly still confused regarding the comparison order of DESeq2. If I compare my samples with:

results(dds, contrast=c("condition", "non_treated", "treated"),  pAdjustMethod="BH")

I got a list of 1042 DEGs, but if I change the comparison order to:

 results(dds, contrast=c("condition", "treated", "non_treated"),  pAdjustMethod="BH")

I got 1122 DEGs. So around 80 more DEGs this time.

I thought I would get the same list only that the upregulated genes will be listed as downregulated genes and the other way around depending on the comparison order. But I supposed it is not the case. Should wildtyp or non treated samples always serves as baseline?

thanks in advance for your help! Dewi

ADD REPLY • link 6.3 years ago Sri Dewi • 0

0

Entering edit mode

Can you post all your code and details about what you’re doing. That line of code usually just changes the sign of the test statistic and LFC. Are there more details?

ADD REPLY • link 6.3 years ago Michael Love 43k

0

Entering edit mode

Hi Michael,

thanks for the quick response. Sure I can post it here

library(DESeq2)
library(data.table)
library(vsn)
library(ggplot2)
library(gplots)
library(clusterProfiler)
library(ReactomePA)
library(VennDiagram)
library(plyr)
library(WGCNA)

first_condition <- "Het_Tr"
second_condition <- "Het_nonTr"
sample_number_first <- 14
sample_number_second <- 15
pvalueThreshold <- 0.05

#convert the output from featureCounts into R data.frames for DESeq2
read.counts <- read.table(paste0(outDir, "/featureCounts/counts.txt"), header=TRUE) 
row.names(read.counts) <- read.counts$Geneid #the gene Ids should be stored as row.names
read.counts <- read.counts[ , -c(1:6)] #exclude all columns that do not contain read counts
names(read.counts) <- DESeq_conditions #give meaningful sample names

#sample info for batches#
sample.info <- data.frame(condition=c(rep(first_condition,sample_number_first),rep(second_condition,sample_number_second)),sample=names(read.counts),batch=c(rep("first_batch",7),rep("second_batch",7),rep("first_batch",5),rep("second_batch",5),rep("third_batch",5)))   
DESeq.ds <- DESeqDataSetFromMatrix(countData=read.counts,colData=sample.info,design=~batch+condition) 
DESeq.ds <- DESeq.ds[rowSums(counts(DESeq.ds))>0,]  #remove genes without any counts
colSums(counts(DESeq.ds))  #investigate different library sizes
DESeq.ds <- estimateSizeFactors(DESeq.ds)  #calculate the size factors
#pre-filtering with minimum of 10 reads in 14 samples
DESeq.ds <- DESeq.ds[(rowSums(counts(DESeq.ds,normalized=TRUE)>=10)>=length(14)),]
counts.sf_normalized <- counts(DESeq.ds, normalized=TRUE) 

#rlog-transformation of read counts
rlog.DESeq.sumExp <- rlog(DESeq.ds, blind=TRUE)
rlog.norm.counts <- assay(rlog.DESeq.sumExp)

#read count correlations
distance.m_rlog <- as.dist(1-cor(rlog.norm.counts, method="pearson")) 
tiff(paste0(outDir, "/DESeq2/rlogTransformed_pearsonCorrelation.tiff"), width = 6, height = 6, units = 'in', res = 300)  
plot(hclust(distance.m_rlog), label=colnames(rlog.norm.counts),main="rlog transformed read counts\ndistance: Pearson correlation") 
dev.off()

#principal component analysis (PCA) determine wheather samples display greater variability between experimental conditions than between replicates of the same treatment 
P <- plotPCA(rlog.DESeq.sumExp)
tiff(paste0(outDir, "/DESeq2/PCA_plot.tiff"),  width = 6, height = 6, units = 'in', res = 300)
P <- P + theme_bw() + ggtitle("Rlog transformed counts") + coord_equal(ratio =3.5)
print(P)
dev.off()

#differential gene expression analysis
#DESeq uses the levels of the condition to determine the order of the comparison 
DESeq.ds <- DESeq(DESeq.ds)  
con_list <- levels(DESeq.ds$condition) #list all conditions in the DESeq dataframe

#change the comparison matrix depending on the project
contrast_con <- matrix(c(first_condition, second_condition),nrow=1, ncol=2, byrow=TRUE)

for(i in 1:dim(contrast_con)[1]){ 
con_dir <- paste0(contrast_con[i,1], "_vs_", contrast_con[i,2], '/')
system(paste("mkdir", paste0(outDir, "/DESeq2/", contrast_con[i,1], "_vs_", contrast_con[i,2])))
system(paste("mkdir", paste0(outDir, "/network_enrichment/", contrast_con[i,1], "_vs_", contrast_con[i,2])))

WT_vs_MT <- results(DESeq.ds, contrast= c("condition", contrast_con[i,1], contrast_con[i,2]), pAdjustMethod="BH")
WT_vs_MT["ENSEMBL"] <- gsub("\\..*","",rownames(WT_vs_MT)) #add ensembl number as new column
WT_vs_MT["gene_name"] <- "No name"
#convert to entrezID 
WT_vs_MT_entrezID <- bitr(WT_vs_MT$ENSEMBL,fromType="ENSEMBL", toType=c("ENTREZID","SYMBOL","GENENAME"), OrgDb="org.Mm.eg.db")
#add gene name into the results according to the ensembl number
matched <- match(WT_vs_MT$ENSEMBL, WT_vs_MT_entrezID$ENSEMBL) #indexing all the matched ensembl number found
WT_vs_MT[,8] <- WT_vs_MT_entrezID$SYMBOL[matched] #add the matched ensembl number back into the result table
WT_vs_MT <- WT_vs_MT[,c(7,8,1,2,3,4,5,6)] #reorder the table
write.csv2(WT_vs_MT, file=paste0(outDir,"/DESeq2/", con_dir, contrast_con[i,1], "_vs_", contrast_con[i,2],".csv"), row.names=FALSE)  

DEG_WT_vs_MT <- subset(WT_vs_MT , padj < pvalueThreshold)  
DEG_WT_vs_MT_sorted <- DEG_WT_vs_MT
DEG_WT_vs_MT_sorted <- DEG_WT_vs_MT_sorted[order(DEG_WT_vs_MT_sorted$log2FoldChange),] #sort the DEG according to the log2 fold change
write.csv2(DEG_WT_vs_MT_sorted, file=paste0(outDir, "/DESeq2/", con_dir, contrast_con[i,1], "_vs_", contrast_con[i,2], "_log2FC_", pvalueThreshold ,".csv"), row.names=FALSE)  
 }

I pretty much followed the examples from the vignette but maybe you can give me a hint, where it went wrong. Thanks a lot!

ADD REPLY • link 6.3 years ago Sri Dewi • 0

0

Entering edit mode

Can you double check that the results() command on the same dds really gives two different lists in a clean R session?

The code in results() just flips the sign of the LFC and statistic so I’m wondering if there were other differences between your two runs that give different output.

For example, are the same number of genes included in both runs?

ADD REPLY • link 6.3 years ago Michael Love 43k

0

Entering edit mode

You are right, if I change the comparison order/level only in the results() I got the same list only with opposite values and I think this is due to different statistical values in between before results()?

But this also got me thinking, which condition should then always be used for the baseline?

anyway, thanks a lot for your help and I am a big fan of DESeq2 since 2014 began with my bachelor thesis! :)

ADD REPLY • link 6.3 years ago Sri Dewi • 0

0

Entering edit mode

You're welcome :)

My guess is that you're not putting the same genes into the dds. You can do some exploration to see if this is the case. But I think it's also possible to have small numerical differences when DESeq() is run with a different coding of the design matrix. But results() will be identical.

As per baseline, if you don't have a clear reference/control group, just pick one group and use that throughout.

ADD REPLY • link 6.3 years ago Michael Love 43k