Question

setup condition table in deseq2 for experiment with multiple conditions

0

Entering edit mode

annalisa.giampetruzzi • 0

@annalisagiampetruzzi-8103

Last seen 8.7 years ago

Italy

Hi all,

I need to analyze my RNA seq data of my project consisting of:
1 pathogen (2 condition Healthy vs Infected)
2 plant cultivars
2 plant tissues
3 replicates for infected
2 replicates for Healthy

Now I have the raw count data (after Bowtie alignment and Seqmonk
quantitation) and I need to setup a condition table for running DESeq2
in RStudio.

Could you please take a look to this my idea of condition table (above)?
I have thought to do two independent analysis or you believe is it
possible to do analysis of all condition in only one step? if yes in
which way?
Could you suggest me other specific tutorial to run DESeq for my purpose?
I will appreciate your help,Thanks in advance,

Annalisa

condition    type tissue
infected    inf    xylem Cultivar1
infected    inf    xylem Cultivar1
infected    inf    xylem Cultivar1
infected    inf    xylem Cultivar2
infected    inf    xylem Cultivar2
infected    inf    xylem Cultivar2
Healthy    Hty    xylem Cultivar1
Healthy    Hty    xylem Cultivar1
Healthy    Hty    xylem Cultivar2
Healthy    Hty    xylem Cultivar2

     condition    type tissue
infected    inf    phloem Cultivar1
infected    inf    phloem Cultivar1
infected    inf    phloem Cultivar1
infected    inf    phloem Cultivar2
infected    inf    phloem Cultivar2
infected    inf    phloem Cultivar2
Healthy    Hty    phloem Cultivar1
Healthy    Hty    phloem Cultivar1
Healthy    Hty    phloem Cultivar2
Healthy    Hty    phloem Cultivar2

deseq2 multiple comparisons • 3.4k views

ADD COMMENT • link 8.9 years ago annalisa.giampetruzzi • 0

0

Entering edit mode

It would be helpful if you told us what hypotheses you want to test. In general it can be more powerful to analyze all your data at one time, as you get better dispersion estimates. But that assumes that the variability in gene expression in xylem and phloem are relatively similar.

If you care to test for interactions (e.g., as an example, genes that react differently to infection depending on the tissue), then you need to analyze all together.

ADD REPLY • link 8.9 years ago James W. MacDonald 65k

0

Entering edit mode

Stats Master Jim,

Fur us "stats posers", are there formal ways to test for "how similar is similar enough" regarding the variance in these situations to help decide which subsets of data to include data during the analysis, ie. the range would be (1) only data from the replicates "used" to assess the contrast of interest; to (3) all of the data -- where (2) is some number of samples between (1) and (3).

No long response required, but a pointer or two to either "well known methods" or previous posts, or whatever, would be wunderbär

ADD REPLY • link 8.9 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

I don't know of a test, but Gordon or Aaron (or Mike Love) might know of something. Anyway, Gordon had this to say about the subject: A: EdgeR condition-specific dispersion

I suppose if I were worried about a given data set, I would estimate dispersions separately and then plot to see if e.g., the trended dispersions were sufficiently different, eyeballometrically, to make me want to do anything but the usual.

ADD REPLY • link 8.9 years ago James W. MacDonald 65k

0

Entering edit mode

I recommend that users make a PCA plot of transformed data to see if the groups have similar within-group variance. The transformation just learns the general trend of dispersion, if there are differences in the within-group variances, these can still be observed from the PCA plot of all samples.

ADD REPLY • link 8.9 years ago Michael Love 41k

0

Entering edit mode

Hi James,

you are right, the aim of the study is to demonstrate that cultivar 1 is less susceptible than the cultivar 2, and possibly to identify which gene are responsible of this different behaviour against the pathogen.

I'm not sure that the variability in gene expression in xylem and phloem are relatively similar,for this reason I decided to keep separate the four transcriptome (2 tissues of 2 cultivar). What do you think about it?

ADD REPLY • link 8.9 years ago annalisa.giampetruzzi • 0

0

Entering edit mode

If you don't care to compare xylem to phloem, then it isn't strictly necessary to combine. However, if you are looking for tissue specific differences you don't have a choice.

But there is a tradeoff to be made. The more data you have (to a point), the better you can estimate gene-wise dispersions, contingent upon the two tissue types being relatively similar. In statistics it isn't often possible to know what the 'right' thing to do is. Instead, you do what is defensible, and you defend your choice by saying why you made that choice. So it is better to e.g., make some plots that you can refer to when you say why you did what you did.

ADD REPLY • link 8.9 years ago James W. MacDonald 65k

score 0 · Answer 1 · 2015-06-09

0

Entering edit mode

Michael Love 41k

@mikelove

Last seen 2 hours ago

United States

You can use a design with an interaction term, to test the difference in susceptibility across cultivar. (I can't easily see from your colData which variable has the cultivar information, so I just renamed the variable in the design below). For example:

design(dds) <- ~ condition + cultivar + condition:cultivar
dds <- DESeq(dds, test="LRT", reduced=~condition + cultivar)
res <- results(dds)

If you are not very familiar with designs with interaction, I would recommend just running the above code separately for two DESeqDataSets (xylem and phloem) to generate to separate results tables.

ADD COMMENT • link 8.9 years ago Michael Love 41k

0

Entering edit mode

Hi Michael,

Thank for the code, I want to share with you a screen shoot of R after running the command line you suggested.

Before the running your code I have write these:

> count.data.set <- DESeqDataSetFromMatrix(countData=just.raw.counts, colData=column.data,design= ~ cultivar)
converting counts to integer mode
> count.data.set <- DESeq(count.data.set)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing

Then I put your command but I'm not sure about the result in yellow:

> res

log2 fold change (MLE): conditionXfInfect.cultivarOG

LRT p-value: '~ condition + cultivar + condition:cultivar' vs '~ condition + cultivar'
DataFrame with 440720 rows and 6 columns

The aim of my study is to compare the two cultivar (LC, OG) about their susceptibility vs the pathogen.

Could you suggest me how can I check the results? I tried to make a PCA plot on the result and it seems it make sense but I'm not sure to do it correctly, what kind of command do you suggest to do after the code you indicated?

Sorry for the many question and thanks for your help.

Annalisa

ADD REPLY • link 8.9 years ago annalisa.giampetruzzi • 0

0

Entering edit mode

hi Annalisa,

Note the following paragraph from the Details section of ?results:

"For analyses using the likelihood ratio test (using nbinomLRT), the p-values are determined solely by the difference in deviance between the full and reduced model formula. A log2 fold change is included, which can be controlled using the name argument, or by default this will be the estimated coefficient for the last element of resultsNames(object)."

ADD REPLY • link 8.9 years ago Michael Love 41k

0

Entering edit mode

Hi Michael,

thanks for your help, and I'm sorry for the delay in giving you my feedback about your suggestion.

Since I want to be sure that I'm doing stat analysis in a correct way I'm just reporting you the sequence of command for DESeq2 analysis:

> library (DESeq2)

> count.data.set <- DESeqDataSetFromMatrix(countData=just.raw.counts, colData=column.data,design= ~ cultivar)

> design(count.data.set) <- ~ condition + cultivar + condition:cultivar

> count.data.set <- DESeq(count.data.set, test="LRT", reduced=~condition + cultivar)

estimating size factors

estimating dispersions

gene-wise dispersion estimates

mean-dispersion relationship

final dispersion estimates

fitting model and testing

> res <- results(count.data.set)

> res

log2 fold change (MLE): conditionXfInfect.cultivarOG

LRT p-value: '~ condition + cultivar + condition:cultivar' vs '~ condition + cultivar'

DataFrame with 370495 rows and 6 columns

baseMean log2FoldChange lfcSE

LCxylem_1 1.6309831 -5.4641154 3.456783

LCxylem_2 1.6521952 -1.0980881 2.610223

LCxylem_3 1.7317710 1.5575173 2.586786

LCxylem_4 0.6258317 -3.4078363 5.300495

LCxylem_6 1.4656769 0.1299767 3.353129

... ... ... ...

OG_xylem_354412 1874.443 0.1252115 0.3988262

OG_xylem_354413 3016.101 -1.7225056 0.7029631

OG_xylem_354417 3253.581 -0.7609053 0.3493495

OG_xylem_354419 18300.413 -0.1461710 0.7355976

OG_xylem_354420 3354.743 -0.5229080 0.3715953

stat pvalue padj

LCxylem_1 2.720939853 0.09903968 NA

LCxylem_2 0.178695531 0.67249669 NA

LCxylem_3 0.362620677 0.54705437 NA

LCxylem_4 0.471006694 0.49252441 NA

LCxylem_6 0.001487213 0.96923768 NA

... ... ... ...

OG_xylem_354412 0.09855314 0.75357288 0.9979041

OG_xylem_354413 5.90594432 0.01508987 0.2942873

OG_xylem_354417 4.72888815 0.02966005 0.4211882

OG_xylem_354419 0.03947982 0.84250110 0.9993861

OG_xylem_354420 1.97731981 0.15967304 0.8181710

> resultsNames(count.data.set)

[1] "Intercept"

[2] "condition_XfInfect_vs_Hty"

[3] "cultivar_OG_vs_LC"

[4] "conditionXfInfect.cultivarOG"

> attr(count.data.set, "modelMatrixType")

[1] "standard"

> attr(count.data.set, "modelMatrix")

Intercept /condition_XfInfect_vs_Hty /(cultivar_OG_vs_LC /conditionXfInfect.cultivarOG

LC1_X 1 1 0 0

LC2_X 1 1 0 0

LC3_X 1 1 0 0

LC5_X 1 0 0 0

LC6_X 1 0 0 0

OG1_X 1 1 1 1

OG2_X 1 1 1 1

OG3_X 1 1 1 1

OG5_X 1 0 1 0

OG6_X 1 0 1 0

attr(,"assign")

[1] 0 1 2 3

attr(,"contrasts")

attr(,"contrasts")$condition

[1] "contr.treatment"

attr(,"contrasts")$cultivar

[1] "contr.treatment"

> cultivar_OG_vs_LC <- results(count.data.set, name="cultivar_OG_vs_LC")

> cultivar_OG_vs_LC

log2 fold change (MLE): cultivar OG vs LC

LRT p-value: '~ condition + cultivar + condition:cultivar' vs '~ condition + cultivar'

DataFrame with 370495 rows and 6 columns

baseMean log2FoldChange lfcSE stat

LCxylem_1 1.6309831 -0.69979092 1.764431 2.720939853

LCxylem_2 1.6521952 0.89750898 2.230079 0.178695531

LCxylem_3 1.7317710 -0.13644458 2.085695 0.362620677

LCxylem_4 0.6258317 3.61605246 4.577625 0.471006694

LCxylem_6 1.4656769 -0.06768277 2.826552 0.001487213

... ... ... ... ...

OG_xylem_354412 1874.443 -0.41241729 0.3089393 0.09855314

OG_xylem_354413 3016.101 -0.24446671 0.5438788 5.90594432

OG_xylem_354417 3253.581 -0.10213181 0.2703613 4.72888815

OG_xylem_354419 18300.413 -1.14945073 0.5697906 0.03947982

OG_xylem_354420 3354.743 -0.05013024 0.2876679 1.97731981

pvalue padj

LCxylem_1 0.09903968 NA

LCxylem_2 0.67249669 NA

LCxylem_3 0.54705437 NA

LCxylem_4 0.49252441 NA

LCxylem_6 0.96923768 NA

... ... ...

OG_xylem_354412 0.75357288 0.9979041

OG_xylem_354413 0.01508987 0.2942873

OG_xylem_354417 0.02966005 0.4211882

OG_xylem_354419 0.84250110 0.9993861

OG_xylem_354420 0.15967304 0.8181710

So at this point I understand that changing the name, in result command for example from "conditionXfInfect.cultivarOG" (default) in "cultivar_OG_vs_LC", I can see a different value of Log2 fold change, because in "conditionXfInfect.cultivarOG" is related to a comparison between OG1_X OG2_X OG3X vs the other samples:

conditionXfInfect.cultivarOG

LC1_X 0

LC2_X 0

LC3_X 0

LC5_X 0

LC6_X 0

OG1_X 1

OG2_X 1

OG3_X 1

OG5_X 0

OG6_X 0

while in the "cultivar_OG_vs_LC" the comarison is between OG1_X OG2_X OG3_X OG5_X OG6_X vs the other cultivar samples:

cultivar_OG_vs_LC

LC1_X 0

LC2_X 0

LC3_X 0

LC5_X 0

LC6_X 0

OG1_X 1

OG2_X 1

OG3_X 1

OG5_X 1

OG6_X 1

Is it correct my sequence of command and relative results interpretation? I found 1138 transcripts DE from LRT statistics, my question now is : it will be useful to construct an heatmap on this transcript, and how can i do this on this selected list of transcript? Because until now I'm able to construct an heatmap on top var genes after rlog transformation of dds.

Could you suggest me some RNAseq paper reporting LRT statistics on their data?

Sorry for the long message, but I'm new in statistical analysis.

Best,

Annalisa

ADD REPLY • link 8.7 years ago annalisa.giampetruzzi • 0

0

Entering edit mode

hi Annalisa,

In these tables the p-values are testing whether the interaction term equals zero, i.e. whether the susceptibility is equal across cultivar.

The 'name' term just changes the different LFCs from the model. If you are stuck on the interpretation of the terms in your model, I would recommend you speak to a local statistician at your institute. These models are best explained in person rather than over the forum in my opinion.

You can create a heatmap of any set of vectors of LFC's by pulling these out of the results table and then using the heatmap code in our vignette and workflow.

You can subset the 'res' object with standard R row/column indexing:

res[ tx.idx, "log2FoldChange"]

I don't have any concrete recommendations for what kind of heatmap you should include though.

ADD REPLY • link 8.7 years ago Michael Love 41k

score 0 · Answer 2 · 2015-06-15

0

Entering edit mode

annalisa.giampetruzzi • 0

@annalisagiampetruzzi-8103

Last seen 8.7 years ago

Italy

Hi all,

thank for all comment on my question,

ADD COMMENT • link 8.9 years ago annalisa.giampetruzzi • 0