Question

Meaning of "Contrast" in "DESeq2" & how to compare the results of the same cellline under different conditions?

1

Entering edit mode

JunLVI ▴ 40

@junlvi-8996

Last seen 5.7 years ago

Japan

Hi I am trying to analyze a dataset with a design quite like the dataset in the "airway" package.

but I am not sure I understood the differential analysis given as an example.

dds <- DESeq(dds)by doing this, i guess we could get all the differentially expressed genes (DEG) list of all the cells under different conditions

differentially expressed genes in N61311 (untrt vs trt) +

differentially expressed genes in other cell lines (untrt vs trt)

but how could we get the DEG of one cell lines under different conditions, i.g. N61311 (untrt vs txt) only?

I guess I need to use the "contrast" argument, but somehow I felt confused when reading the example,

results(dds, contrast=c("cell", "N061011", "N61311"))

it seems that "contrast" argument tried to compare two cell lines without specify which conditions (each cell lines have two conditions untrt vs trt,as far as I could tell. ) so what is "contrast" argument trying to compare, anyway?

Back to my own purpose, if I want to get DEGs between N61311 (untrt vs trt), what should I do?

TKS!

deseq2 • 8.3k views

ADD COMMENT • link updated 7.2 years ago by Michael Love 41k • written 7.2 years ago by JunLVI ▴ 40

score 0 · Answer 1 · 2017-01-23

0

Entering edit mode

Michael Love 41k

@mikelove

Last seen 1 day ago

United States

In the workflow we have:

"For the airway experiment, we will specify ~ cell + dex meaning that we want to test for the effect of dexamethasone (dex) controlling for the effect of different cell line (cell)."

To explain this a bit more, we have a coefficient in the model for each cell line's base level (so 4 of these), and then a single fold change for the treatment effect, the same for all cell lines. We have 8 samples, and 5 coefficients in the model to estimate, so 3 residual degrees of freedom.

You ask:

"how could we get the DEG of one cell lines under different conditions, i.g. N61311 (untrt vs txt) only?"

We cannot, because there are no replicates in this experiment to fit a treatment effect for each cell line and still assess the variability. There are only 8 samples, and such a design would imply 8 coefficients: 4 base levels and 4 treatment effects. There would be no residual degrees of freedom.

ADD COMMENT • link 7.2 years ago Michael Love 41k

0

Entering edit mode

Hi Michael,
Thank you for your kind reply and patient explanation.

I have to say I might not fully understand your explanation concerning "degree of freedom" .
After reading it and some homework with google and wikipedia, here is my understanding at this stage:

1. Degree of freedom numerically equals total sample volume n.
2. To fit an effect of a condition (here “treatment”) with confidence that the observed effect is not attributed to variability, residual degrees of freedom is needed or should be > 0.
it seems that residual degrees of freedom (RDF) could be numerically calculated as "RDF =Degree of freedom - coefficients to estimate”

Back to the “airway” sample,
"To explain this a bit more, we have a coefficient in the model for each cell line's base level (so 4 of these), and then a single fold change for the treatment effect, the same for all cell lines. We have 8 samples, and 5 coefficients in the model to estimate, so 3 residual degrees of freedom."

my biological translation of that is,

in this specific case,

dds <- DESeqDataSet(se, design = ~ cell + dex)
dds <- DESeq(dds)

DESeq() is to detect the common (homogenous) treatment effect across all the 4 cell lines assuming such a common effect exist and the differences among cell lines is restricted to their base level.

But DESeq() could not tell us whether each cell line have unique response upon treatment in addition to their different base levels, due to the fact that there are no biological replicates.

But when I apply DESeq to perform differential gene expression analysis to my dataset,

(the structure of the dataset is 4 cell lines of 2 genotype with biological duplicates (4*2*2 samples), the purpose of the analysis to assess the effect of different genotypes (KO vs WT) on the gene expression in 4 cell lines)
(each column in the figure represents a sample, each row represents a gene)

the first thing I got notice was that the top DEG is dominated by the genes differentially expressed at base levels, rather than gene differentially expressed as a result of KO effect.

How should I interpret the results? what I could think of are:

1. there are few gene differentially expressed as a result of KO effect.
2. I did not use the DESeq() to analyze properly. Instead of analyzing all the sample at the same time, I should compare the difference within a certain cell type which I have biological duplicates under each condition.

This might be too much, but any thought or suggestions would be highly appreciated.

ADD REPLY • link 7.2 years ago JunLVI ▴ 40

0

Entering edit mode

It's easier and faster to provide help if you state from the beginning what the experiment is designed to test and how your experimental design looks.

DESeq() just estimates parameters based entirely on what design you specify and based on information in the variables in column data. It doesn't do any guessing about what to do.

"How should I interpret the results?"

I don't know what code you used to produce a results table.

Before interpreting results for a design that might not be the one you are interested in, let's start from the beginning, now that I know what your experimental design looks like. Do you want to make comparisons of each cell line separately?

ADD REPLY • link 7.2 years ago Michael Love 41k

0

Entering edit mode

"Do you want to make comparisons of each cell line separately?"

I think so, otherwise the top gene list will be dominated by the genes differentially expressed at base levels.

ADD REPLY • link 7.2 years ago JunLVI ▴ 40

0

Entering edit mode

It looks like from the other thread, you saw how to do the per-cell-line analysis using a variable 'group' which combines the cell line and treatment, that would be what I recommend.

The top gene list should not be dominated by gene with different base levels unless you are extracting the wrong coefficient. I still don't know what code you used to produce the results table, or what your column data looks like, so I can't help you there.

ADD REPLY • link 7.2 years ago Michael Love 41k

0

Entering edit mode



sampleTable <- read.csv("samples.csv",row.names=1)
filenames <- paste0(sampleTable$samplename,".bam")
library("Rsamtools")
bamfiles <- BamFileList(filenames, yieldSize=2000000)
gtffile <- ("mm10genes.gtf")
library("GenomicFeatures")
(txdb <- makeTxDbFromGFF(gtffile, format="gtf", circ_seqs=character()))
(ebg <- exonsBy(txdb, by="gene"))
library("GenomicAlignments")
library("BiocParallel")
register(MulticoreParam())
se <- summarizeOverlaps(features=ebg, reads=bamfiles,
                        mode="Union",
                        singleEnd=TRUE,
                        ignore.strand=TRUE,
                        fragments=FALSE )
library("DESeq2")
(colData(se) <- DataFrame(sampleTable))
dds <- DESeqDataSet(se, design = ~ compartment + genotype)
dds <- dds[rowSums(counts(dds)) > 1,]
dds <- DESeq(dds)
rld <- rlog(dds, blind=FALSE)
library("genefilter")
topVarGenes <- head(order(rowVars(assay(rld)),decreasing=TRUE),40)
mat <- assay(rld)[ topVarGenes, ]
mat <- mat - rowMeans(mat)
df <- as.data.frame(colData(rld)[,c("genotype","compartment")])
pheatmap(mat, annotation_col=df,fontsize=9,fontsize_number = 0.4 * fontsize)

This is what I did to get the heatmap attached in which the top gene list is dominated by the genes differentially expressed at base levels. any idea about what went wrong?

this is what the inside of the .csv file looks like:

"samples.csv"
"ids","samplename","genotype","compartment"
"457CellA","457CellA","WT","CellA"
"457CellB","457CellB","WT","CellB"
"457CellC","457CellC","WT","CellC"
"457CellD","457CellD","WT","CellD"
"458CellA","458CellA","cKO","CellA"
"458CellB","458CellB","cKO","CellB"
"458CellC","458CellC","cKO","CellC"
"458CellD","458CellD","cKO","CellD"
"459CellA","459CellA","WT","CellA"
"459CellB","459CellB","WT","CellB"
"459CellC","459CellC","WT","CellC"
"459CellD","459CellD","WT","CellD"
"460CellA","460CellA","cKO","CellA"
"460CellB","460CellB","cKO","CellB"
"460CellC","460CellC","cKO","CellC"
"460CellD","460CellD","cKO","CellD"

ADD REPLY • link 7.2 years ago JunLVI ▴ 40

0

Entering edit mode

In the other thread you posted, I pointed out that if you want to plot the genes that are differentially expressed, you need to specify them.

Because you posted your code finally, I can see that you aren't building a results table at all. You've never called results() to perform a statistical test of one or more coefficients.

In your code above, you are plotting the genes with the highest variance (topVarGenes). But this has nothing to do with a statistical test.

Take a look at the entire vignette or workflow, because it seems you missed the importance of the results() function. Think critically about what each function is returning: if you haven't done any tests, how can you pick the genes that are differentially expressed?

ADD REPLY • link 7.2 years ago Michael Love 41k

0

Entering edit mode

Hi Michael, Thank you very much for your feedback. Yes, today when I tried to understand the meaning of every function, I realized that, as you pointed out, I just plotted the genes with highest variance, which inevitably led to the results that top gene list was dominated by the genes differentially expressed at base levels.

I made that stupid mistake because I just tried to follow the RNA-seq workflow without fully understanding of what I have done. You are right, I need to do more homework with the vignette.

Now I guess I could plot the genes, I have to check the genes in detail later though.

res <- results(dds)
sigGenes <- subset(res,padj <0.05)
sigGenes <- head(order(sigGenes$padj),decreasing=TRUE,40)
mat <- assay(rld)[ sigGenes, ]
mat <- mat - rowMeans(mat)
df <- as.data.frame(colData(rld)[,c("compartment","genotype")])
pheatmap(mat, annotation_col=df,fontsize=9,fontsize_number = 0.4 * fontsize)
dev.off()

another question I am having now, is when I used "Interaction" like:

resCellA<- results(dds, contrast=c("group", "CellAWT", "CellAcKO"))
sigGenes <- head(order(resCellA$padj,decreasing=FALSE),40)
mat <- assay(rld)[ sigGenes, ]
mat <- mat - rowMeans(mat)
df <- as.data.frame(colData(rld)[,c("compartment","genotype")])
pheatmap(mat, annotation_col=df,fontsize=9,fontsize_number = 0.4 * fontsize)
dev.off()

I would get a gene clustering heatmap with all the samples, is there any way to plot the heatmap with CellA samples only? I guess one solution could be, performing an analysis with CellA samples only, accordingly, the plot will only include the CellA samples. But in the FAQ of the vignette, somehow running samples from all groups together is recommended, that is why I was trying to find out way to plot the results of a single group after running samples from all groups. I hope my question does sounds too crazy.

TKS!

ADD REPLY • link 7.2 years ago JunLVI ▴ 40

0

Entering edit mode

I think you will have to partner with someone with experience coding with R, or take an online class in using R before attempting this analysis. There is a critical problem with your code above.

Above, you subset the res object, so it is no longer linked 1-1 with rows of rld. Then you get the order (a numeric index) from the subsetted object and use that to pull out rows of rld. But because these are not linked, and you are using numeric indexing, you are specifying the wrong genes. The safer way would be to use the gene ID's which are rownames of both res and mat.

I won't necessarily be able to follow up with further details very much.

ADD REPLY • link 7.2 years ago Michael Love 41k

0

Entering edit mode

Thank you very much, Michael. You let me know where my problem lies, both technical one in this specific case and overall one, plus the solution, that is much more that I should expected from some one who has developed the package. Really appreciate!

ADD REPLY • link 7.2 years ago JunLVI ▴ 40

0

Entering edit mode

"The top gene list should not be dominated by gene with different base levels unless you are extracting the wrong coefficient"

Based on your comments here, I wondered that I might messed up with my design. After checking the code I used as I pasted in my previous comment, I realized that there seems to be only two (maybe one) lines of codes which could change the design of DESeq() analysis.

dds <- DESeqDataSet(se, design = ~ compartment + genotype)

df <- as.data.frame(colData(rld)[,c("genotype","compartment")]) # this might not affect the design)

so I tried all the possible permutation:

design = ~ compartment + genotype

c("genotype","compartment")

design = ~ compartment + genotype

c("compartment","genotype")

design = ~ genotype + compartment

c("compartment","genotype")

design = ~ genotype + compartment

c("genotype","compartment")

Basically I got the same results. if you notice some other obvious mistakes, please kindly let me know.

Or the results are supposed to be like that. Actually biologically the basel levels of 4 cell lines are quite different.

TKS!

ADD REPLY • link 7.2 years ago JunLVI ▴ 40