Search
Question: setup condition table in deseq2 for experiment with multiple conditions
0
gravatar for annalisa.giampetruzzi
3.4 years ago by
Italy
annalisa.giampetruzzi0 wrote:

Hi all,

need to analyze my RNA seq data of my project consisting of:
1 pathogen (2 condition Healthy vs Infected)
2 plant cultivars
2 plant tissues
3 replicates for infected
2 replicates for Healthy

Now I have the raw count data (after Bowtie alignment and Seqmonk
quantitation) and I need to setup a condition table for running DESeq2
in RStudio.

Could you please take a look to this my idea of condition table (above)?
I have thought to do two independent analysis or you believe is it
possible to do analysis of all condition in only one step? if yes in
which way?
Could you suggest me other specific tutorial to run DESeq for my purpose?
I will appreciate your help,Thanks in advance,

Annalisa

 condition    type tissue
infected    inf    xylem Cultivar1
infected    inf    xylem Cultivar1
infected    inf    xylem Cultivar1
infected    inf    xylem Cultivar2
infected    inf    xylem Cultivar2
infected    inf    xylem Cultivar2
Healthy    Hty    xylem Cultivar1
Healthy    Hty    xylem Cultivar1
Healthy    Hty    xylem Cultivar2
Healthy    Hty    xylem Cultivar2

     condition    type tissue
infected    inf    phloem Cultivar1
infected    inf    phloem Cultivar1
infected    inf    phloem Cultivar1
infected    inf    phloem Cultivar2
infected    inf    phloem Cultivar2
infected    inf    phloem Cultivar2
Healthy    Hty    phloem Cultivar1
Healthy    Hty    phloem Cultivar1
Healthy    Hty    phloem Cultivar2
Healthy    Hty    phloem Cultivar2

 

ADD COMMENTlink modified 3.4 years ago • written 3.4 years ago by annalisa.giampetruzzi0

It would be helpful if you told us what hypotheses you want to test. In general it can be more powerful to analyze all your data at one time, as you get better dispersion estimates. But that assumes that the variability in gene expression in xylem and phloem are relatively similar.

If you care to test for interactions (e.g., as an example, genes that react differently to infection depending on the tissue), then you need to analyze all together.

ADD REPLYlink written 3.4 years ago by James W. MacDonald48k

Stats Master Jim,

Fur us "stats posers", are there formal ways to test for "how similar is similar enough" regarding the variance in these situations to help decide which subsets of data to include data during the analysis, ie. the range would be (1) only data from the replicates "used" to assess the contrast of interest; to (3) all of the data -- where (2) is some number of samples between (1) and (3).

No long response required, but a pointer or two to either "well known methods" or previous posts, or whatever, would be wunderbär

ADD REPLYlink written 3.4 years ago by Steve Lianoglou12k

I don't know of a test, but Gordon or Aaron (or Mike Love) might know of something. Anyway, Gordon had this to say about the subject: A: EdgeR condition-specific dispersion

I suppose if I were worried about a given data set, I would estimate dispersions separately and then plot to see if e.g., the trended dispersions were sufficiently different, eyeballometrically, to make me want to do anything but the usual.

ADD REPLYlink written 3.4 years ago by James W. MacDonald48k

I recommend that users make a PCA plot of transformed data to see if the groups have similar within-group variance. The transformation just learns the general trend of dispersion, if there are differences in the within-group variances, these can still be observed from the PCA plot of all samples.

ADD REPLYlink written 3.4 years ago by Michael Love19k

Hi James,

you are right,  the aim of the study is to demonstrate that cultivar 1 is less susceptible than the cultivar 2, and possibly to identify which gene are responsible of this different behaviour against the pathogen.

I'm not sure that the variability in gene expression in xylem and phloem are relatively similar,for this reason I decided to keep separate the four transcriptome (2 tissues of 2 cultivar). What do you think about it?

ADD REPLYlink written 3.4 years ago by annalisa.giampetruzzi0

If you don't care to compare xylem to phloem, then it isn't strictly necessary to combine. However, if you are looking for tissue specific differences you don't have a choice.

But there is a tradeoff to be made. The more data you have (to a point), the better you can estimate gene-wise dispersions, contingent upon the two tissue types being relatively similar. In statistics it isn't often possible to know what the 'right' thing to do is. Instead, you do what is defensible, and you defend your choice by saying why you made that choice. So it is better to e.g., make some plots that you can refer to when you say why you did what you did.
 

ADD REPLYlink written 3.4 years ago by James W. MacDonald48k
0
gravatar for Michael Love
3.4 years ago by
Michael Love19k
United States
Michael Love19k wrote:

You can use a design with an interaction term, to test the difference in susceptibility across cultivar. (I can't easily see from your colData which variable has the cultivar information, so I just renamed the variable in the design below). For example:

design(dds) <- ~ condition + cultivar + condition:cultivar
dds <- DESeq(dds, test="LRT", reduced=~condition + cultivar)
res <- results(dds)

If you are not very familiar with designs with interaction, I would recommend just running the above code separately for two DESeqDataSets (xylem and phloem) to generate to separate results tables.

ADD COMMENTlink modified 3.4 years ago • written 3.4 years ago by Michael Love19k

rHi Michael,

Thank for the code, I want to share with you a screen shoot of R after running the command line you suggested.

Before the running your code I have write these:

> count.data.set <- DESeqDataSetFromMatrix(countData=just.raw.counts, colData=column.data,design= ~ cultivar)
converting counts to integer mode
> count.data.set <- DESeq(count.data.set)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing

Then I put your command but I'm not sure about the result in yellow:

> res

log2 fold change (MLE): conditionXfInfect.cultivarOG 

LRT p-value: '~ condition + cultivar + condition:cultivar' vs '~ condition + cultivar' 
DataFrame with 440720 rows and 6 columns

 

The aim of my study is to compare the two cultivar (LC, OG) about their susceptibility vs the pathogen.

Could you suggest me how can I check the results? I tried to make a PCA plot on the result and it seems it make sense but I'm not sure to do it correctly, what kind of command do you suggest to do after the code you indicated?

Sorry for the many question and thanks for your help.

Annalisa

 

 

ADD REPLYlink modified 3.4 years ago • written 3.4 years ago by annalisa.giampetruzzi0

hi Annalisa,

Note the following paragraph from the Details section of ?results:

"For analyses using the likelihood ratio test (using nbinomLRT), the p-values are determined solely by the difference in deviance between the full and reduced model formula. A log2 fold change is included, which can be controlled using the name argument, or by default this will be the estimated coefficient for the last element of resultsNames(object)."

ADD REPLYlink written 3.4 years ago by Michael Love19k

Hi Michael,

thanks for your help, and I'm sorry for the delay in giving you my feedback about your suggestion. 

Since I want to be sure that I'm doing stat analysis in a correct way I'm just reporting you the sequence of command for DESeq2 analysis:

> library (DESeq2)

> count.data.set <- DESeqDataSetFromMatrix(countData=just.raw.counts, colData=column.data,design= ~ cultivar)

> design(count.data.set) <- ~ condition + cultivar + condition:cultivar

> count.data.set <- DESeq(count.data.set, test="LRT", reduced=~condition + cultivar)

estimating size factors

estimating dispersions

gene-wise dispersion estimates

mean-dispersion relationship

final dispersion estimates

fitting model and testing

> res <- results(count.data.set)

> res

log2 fold change (MLE): conditionXfInfect.cultivarOG

LRT p-value: '~ condition + cultivar + condition:cultivar' vs '~ condition + cultivar'

DataFrame with 370495 rows and 6 columns

                 baseMean log2FoldChange     lfcSE

                <numeric>      <numeric> <numeric>

LCxylem_1       1.6309831     -5.4641154  3.456783

LCxylem_2       1.6521952     -1.0980881  2.610223

LCxylem_3       1.7317710      1.5575173  2.586786

LCxylem_4       0.6258317     -3.4078363  5.300495

LCxylem_6       1.4656769      0.1299767  3.353129

...                   ...            ...       ...

OG_xylem_354412  1874.443      0.1252115 0.3988262

OG_xylem_354413  3016.101     -1.7225056 0.7029631

OG_xylem_354417  3253.581     -0.7609053 0.3493495

OG_xylem_354419 18300.413     -0.1461710 0.7355976

OG_xylem_354420  3354.743     -0.5229080 0.3715953

                       stat     pvalue      padj

                  <numeric>  <numeric> <numeric>

LCxylem_1       2.720939853 0.09903968        NA

LCxylem_2       0.178695531 0.67249669        NA

LCxylem_3       0.362620677 0.54705437        NA

LCxylem_4       0.471006694 0.49252441        NA

LCxylem_6       0.001487213 0.96923768        NA

...                     ...        ...       ...

OG_xylem_354412  0.09855314 0.75357288 0.9979041

OG_xylem_354413  5.90594432 0.01508987 0.2942873

OG_xylem_354417  4.72888815 0.02966005 0.4211882

OG_xylem_354419  0.03947982 0.84250110 0.9993861

OG_xylem_354420  1.97731981 0.15967304 0.8181710

 

> resultsNames(count.data.set)

[1] "Intercept"                  

[2] "condition_XfInfect_vs_Hty"  

[3] "cultivar_OG_vs_LC"          

[4] "conditionXfInfect.cultivarOG"

 

> attr(count.data.set, "modelMatrixType")

[1] "standard"

> attr(count.data.set, "modelMatrix")

      Intercept /condition_XfInfect_vs_Hty /(cultivar_OG_vs_LC /conditionXfInfect.cultivarOG

LC1_X         1                         1                 0                            0

LC2_X         1                         1                 0                            0

LC3_X         1                         1                 0                            0

LC5_X         1                         0                 0                            0

LC6_X         1                         0                 0                            0

OG1_X         1                         1                 1                            1

OG2_X         1                         1                 1                            1

OG3_X         1                         1                 1                            1

OG5_X         1                         0                 1                            0

OG6_X         1                         0                 1                            0

attr(,"assign")

[1] 0 1 2 3

attr(,"contrasts")

attr(,"contrasts")$condition

[1] "contr.treatment"

 

attr(,"contrasts")$cultivar

[1] "contr.treatment"

> cultivar_OG_vs_LC <- results(count.data.set, name="cultivar_OG_vs_LC")

> cultivar_OG_vs_LC

log2 fold change (MLE): cultivar OG vs LC

LRT p-value: '~ condition + cultivar + condition:cultivar' vs '~ condition + cultivar'

DataFrame with 370495 rows and 6 columns

                 baseMean log2FoldChange     lfcSE        stat

                <numeric>      <numeric> <numeric>   <numeric>

LCxylem_1       1.6309831    -0.69979092  1.764431 2.720939853

LCxylem_2       1.6521952     0.89750898  2.230079 0.178695531

LCxylem_3       1.7317710    -0.13644458  2.085695 0.362620677

LCxylem_4       0.6258317     3.61605246  4.577625 0.471006694

LCxylem_6       1.4656769    -0.06768277  2.826552 0.001487213

...                   ...            ...       ...         ...

OG_xylem_354412  1874.443    -0.41241729 0.3089393  0.09855314

OG_xylem_354413  3016.101    -0.24446671 0.5438788  5.90594432

OG_xylem_354417  3253.581    -0.10213181 0.2703613  4.72888815

OG_xylem_354419 18300.413    -1.14945073 0.5697906  0.03947982

OG_xylem_354420  3354.743    -0.05013024 0.2876679  1.97731981

                    pvalue      padj

                 <numeric> <numeric>

LCxylem_1       0.09903968        NA

LCxylem_2       0.67249669        NA

LCxylem_3       0.54705437        NA

LCxylem_4       0.49252441        NA

LCxylem_6       0.96923768        NA

...                    ...       ...

OG_xylem_354412 0.75357288 0.9979041

OG_xylem_354413 0.01508987 0.2942873

OG_xylem_354417 0.02966005 0.4211882

OG_xylem_354419 0.84250110 0.9993861

OG_xylem_354420 0.15967304 0.8181710

 

So at this point I understand that changing the name, in result command  for example from "conditionXfInfect.cultivarOG" (default) in "cultivar_OG_vs_LC", I can see a different value of Log2 fold change, because in "conditionXfInfect.cultivarOG" is related to a comparison between OG1_X OG2_X OG3X vs the other samples:

conditionXfInfect.cultivarOG

LC1_X                              0

LC2_X                              0

LC3_X                              0

LC5_X                              0

LC6_X                              0

OG1_X                             1

OG2_X                             1

OG3_X                             1

OG5_X                             0

OG6_X                             0

while in the "cultivar_OG_vs_LC" the comarison is between  OG1_X  OG2_X OG3_X OG5_X OG6_X  vs the other cultivar samples:

cultivar_OG_vs_LC 

LC1_X                   0                            

LC2_X                   0                            

LC3_X                   0                           

LC5_X                   0                           

LC6_X                   0                            

OG1_X                  1                         

OG2_X                  1                            

OG3_X                  1                            

OG5_X                  1                            

OG6_X                  1    

Is it correct my sequence of command and relative results  interpretation? I found 1138 transcripts DE from LRT statistics, my question now is : it will be useful to construct an heatmap on this transcript, and how can i do this on this selected list of transcript? Because until now I'm able to construct an heatmap on top var genes after rlog transformation of dds.    

Could you suggest me some RNAseq paper reporting LRT statistics on their data?

Sorry for the long message, but I'm new in statistical analysis.

Best, 

Annalisa               

 
 
ADD REPLYlink written 3.2 years ago by annalisa.giampetruzzi0

hi Annalisa,

In these tables the p-values are testing whether the interaction term equals zero, i.e. whether the susceptibility is equal across cultivar. 

The 'name' term just changes the different LFCs from the model. If you are stuck on the interpretation of the terms in your model, I would recommend you speak to a local statistician at your institute. These models are best explained in person rather than over the forum in my opinion. 

You can create a heatmap of any set of vectors of LFC's by pulling these out of the results table and then using the heatmap code in our vignette and workflow.

You can subset the 'res' object with standard R row/column indexing: 

res[ tx.idx, "log2FoldChange"]

I don't have any concrete recommendations for what kind of heatmap you should include though.

ADD REPLYlink written 3.2 years ago by Michael Love19k
0
gravatar for annalisa.giampetruzzi
3.4 years ago by
Italy
annalisa.giampetruzzi0 wrote:

Hi all,

thank for all comment on my question, 

ADD COMMENTlink written 3.4 years ago by annalisa.giampetruzzi0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 245 users visited in the last hour