Search
Question: DESeq2 model - How to test case/control time course model when taking account within sample variability?
0
gravatar for heikki.sarin
6 weeks ago by
heikki.sarin0 wrote:

Hi,

I have a CASE/CONTROL RNA-seq dataset with three time point measurements (PRE/MID/POST) for both groups. We're trying to test whether there's a difference in expression between the two groups in any of the the time points. From physiological point of view we hypothesize that there's no big difference between PRE and POST measurements between the groups but we should see a difference in the MID timepoint. In total we have little over 100 RNA-lib samples so theres quite a few biological replicates but no technical replicates. We have eliminated possible outliers based on PCA and heatmaps and are using RUVseq to eliminate overestimated variance in the data.

There's a good example of the model in DESeq2 vignette but we also want to account for the within sample variability (I think).  In our coldata we have CASE/CONTROL status, PRE_MID_POST status and SUBJECT_ID (how we can see that to whom three RNA- library samples belong to).

I think the right model to test time course case/control setup should be,  full: ~ CASE_CONTROL + PRE_MID_POST + CASE/CONTROL and reduced: ~ CASE_CONTROL + PRE_MID_POST, using the LRT-test in DESeq2. However, the p-values aren't calculated correctly and the results aren't quite what you would expect because it seem that DESeq2 doesn't calculate for the outliers or low count genes at all etc. I'm not quite sure if this is because we should also account for the SUBJECT_ID in the model so the within subject variability would be accounted for. I tried this by modelling and also by creating a separate model matrix according to the vignette (nested within sample variability) but the full=design() argument doesn't work:Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘design’ for signature ‘"matrix"’.

I'm quite confused by the modelling that we should use because when we subset only cases in the analysis then the results seem plausible and DESeq2 seems to calculate thing correctly But when we include also the control group in analysis for some reason there's big problems.

So my questions are:

1) Should we account for the SUBJECT_ID in the model?

2) Is the LRT-test the right approach ?

3) What sort of model would be the right solutions for our case?

Help would be really appreciated, especially from Mike Love if possible.

ADD COMMENTlink modified 6 weeks ago by Michael Love13k • written 6 weeks ago by heikki.sarin0
1
gravatar for Michael Love
6 weeks ago by
Michael Love13k
United States
Michael Love13k wrote:

Can you describe exactly the subject ID structure across condition and time? How many subjects for case and for control? Each subject has exactly three time points? I would recommend using column names that aren't the same as the factor levels, to be less confusing, for example call the columns: condition, time and subject. And the levels are case, control, etc. Also, you'll need to make sure to set the reference level for all the factors (see vignette).

ADD COMMENTlink written 6 weeks ago by Michael Love13k

 

Subject ID is numerical that is coded as factor. So for subject that has all three time point measurement samples there is three same numbers in the subject_id column. There is some incomplete subjects that has only one or two timepoint measurements. Case/control status (1/2) and timepoint status (1/2/3) are both also numerical and coded as factors in the coldata in R. That was also a factor that I was wondering that how does DESeq2 interpret want variables to be coded. Is it suitable to code all the variables presented in coldata as factors? And the SAMPLE_IDs are as row names that match the countdatas colnames. Head of the coldata presented below:

SAMPLE_ID CASE_CONTROL PRE_MID_POST SUBJECT_ID
Sample_0390293029 1 1 6
Sample_3028403300 1 2 6
Sample_3049585023 1 3 6
Sample_9348019238 2 1 95
Sample_3209482934 2 2 95
Sample_93405920930 2 3 95

 

In total there's samples from 25 cases and 15 controls. And as mentioned earlier some of the subjects have only one or two samples. 

Thank you for the quick response Michael, really appreciate the help!

ADD REPLYlink written 6 weeks ago by heikki.sarin0
1

Yes it's appropriate that here all the variables are factors. This means DESeq2 will add coefficients to explain differences between factor levels. 

With DESeq2, and in the case you want to control for subject-level differences in this experimental design, you'll have to remove the samples that don't have complete time series. It sounds like this is just a few, and those samples don't inform you about the way expression changes over time and over condition within subject. Note that it is possible to keep these samples in the analysis, by using duplicateCorrelation() in the limma package, instead of using a term in the design for the subject effects, as we do here.

For modeling with DESeq2, you can follow this section in the vignette, in that you will need to create a new column subject.nested, which takes levels 1-25 for the cases, and 1-15 for the controls.

You would use a design of ~condition + time + condition:subject.nested + condition:time. If you want to find genes with condition-specific changes over time, you could use a reduced design of ~condition + time + condition:subject.nested. I'd suggest using plotCounts() to look at the top genes from this LRT, to see how gene expression is changing over condition and time. 

Because the cases and controls have different numbers of subjects, you will have to use the model.matrix function to produce matrices for the 'full' and 'reduced' argument of DESeq(), and for both matrices you will have to delete those columns in the model matrix that have all zeros. There is description and example code in the vignette section I referenced.

ADD REPLYlink written 6 weeks ago by Michael Love13k

Still having some issues with model.matrices. I managed to create the additional column to the coldata and also model.matrices for the full and reduced model (zero columns removed).

However, when I try to assign the model.matrices to the DESeq2 function it  doesn't seem to work.The code I'm trying to implement is:

ddsFullCountTable <- DESeqDataSetFromMatrix(countData = countdata,colData = coldata, design = ~1)
ddsFullCountTable <- ddsFullCountTable[ rowSums(counts(ddsFullCountTable)) > 1, ]
dds <- DESeq(ddsFullCountTable, test="LRT", full=design(full1), reduced=reduced1)

Where the full1 is model.matrix created for the full model and as instructed in the vignette and reduced1 for the reduced model. I get the error: Error in (function (classes, fdef, mtable)  : unable to find an inherited method for function ‘design’ for signature ‘"matrix"’. On the note, is it necessary that the model.matrices are "unnamed"

What do think about RUVseq in removing overestimated variance from the dataset? Is it good solution in reducing the background noise in a dataset or is there a risk that it will greatly decrease the chances of finding true DE genes?

 

 

 

 

ADD REPLYlink written 6 weeks ago by heikki.sarin0
1
You should provide the matrices directly to 'full' and to 'reduced'. So drop the design(...)
ADD REPLYlink modified 5 weeks ago • written 6 weeks ago by Michael Love13k

That little correction made the trick. DESeq2 manages to run the pipeline. Still however it giver an error: 30 rows did not converge in beta, labelled in mcols(object)$fullBetaConv:Use larger maxit argument with nbinomLRT. Is this something that should be dealt with? Went through some documents and previous topics but didn't find how I could apply the nbinomLRT and larger maxit argument with the model matrices we have made.

The results seem somewhat we would expect. However there seems to be going on something strange. P-values are calculated incorrectly as the histogram of the values we can see a peak in the "1" end of the histogram. Also in the dispersion plot one can see a cluster of values that has clearly deviated in the bottom left corner of the plot.

When plotting the significant genes with plotCounts we can see that for some genes there is one/few/numerous samples that clearly separate from the rest of the group. 

Thanks again for the help!

ADD REPLYlink written 5 weeks ago by heikki.sarin0
1

I'm thinking, you might benefit from doing some pre-filtering before running DESeq(), for example:

keep <- rowSums(counts(dds) >= 10) >= x
​dds <- dds[keep,]

i.e. at least 'x' samples need to have a count of 10 or more. Do you think that this would help deal with the fact that you have significant genes in which your DE signal is coming from some samples but not all?

ADD REPLYlink written 5 weeks ago by Michael Love13k

Coming back to this..

Well, that certainly solved the problem with the "maxit" argument and also now the p-values are calculated correctly. When continuing on some pathway analysis with the results the end outcomes  seem somewhat more biologically logical.

However, I'm a bit worried that doing the pre-filtering I might lose some important data. Is there another way I could reliably detect if there really is  a couple of samples that contribute relatively much compared to others on the DE genes detection?

ADD REPLYlink written 4 weeks ago by heikki.sarin0
1

There is little chance for a gene to have power to detect differential expression if the counts are single digit. So a safe choice for 'x' is the number of samples per group.

DESeq2 doesn't technically need prefiltering for most datasets, but these rows are extra work for the algorithm, and hard to reach convergence sometimes.

ADD REPLYlink written 4 weeks ago by Michael Love13k

That clears it, thank you!

The contrast are covered in quite detail in previous posts and vignette but I still find them quite ambiguous to interpret especially when implementing on this model which is quite complex (at least in my mind with the model.matrix) . So the resultNames gives me:

 "Intercept"                      "CASE_CONTROL2"                  "PRE_MID_POST2"                 
 [4] "PRE_MID_POST3"                  "CASE_CONTROL1.subject.nested2"  "CASE_CONTROL2.subject.nested2" 
 [7] "CASE_CONTROL1.subject.nested3"  "CASE_CONTROL2.subject.nested3"  "CASE_CONTROL1.subject.nested4" 
[10] "CASE_CONTROL2.subject.nested4"  "CASE_CONTROL1.subject.nested5"  "CASE_CONTROL2.subject.nested5" 
[13] "CASE_CONTROL1.subject.nested6"  "CASE_CONTROL2.subject.nested6"  "CASE_CONTROL1.subject.nested7" 
[16] "CASE_CONTROL2.subject.nested7"  "CASE_CONTROL1.subject.nested8"  "CASE_CONTROL2.subject.nested8" 
[19] "CASE_CONTROL1.subject.nested9"  "CASE_CONTROL2.subject.nested9"  "CASE_CONTROL1.subject.nested10"
[22] "CASE_CONTROL2.subject.nested10" "CASE_CONTROL1.subject.nested11" "CASE_CONTROL2.subject.nested11"
[25] "CASE_CONTROL1.subject.nested12" "CASE_CONTROL2.subject.nested12" "CASE_CONTROL1.subject.nested13"
[28] "CASE_CONTROL2.subject.nested13" "CASE_CONTROL2.subject.nested14" "CASE_CONTROL2.subject.nested15"
[31] "CASE_CONTROL2.subject.nested16" "CASE_CONTROL2.subject.nested17" "CASE_CONTROL2.subject.nested18"
[34] "CASE_CONTROL2.subject.nested19" "CASE_CONTROL2.PRE_MID_POST2"    "CASE_CONTROL2.PRE_MID_POST3"  

I'm a bit confused on the matter if I want to examine only one time point between conditions how to construct the contrast. Eg. Let's say I want to examine the timepoint 2 between condition groups. Is it right if I assign in the contrast the factor  "CASE_CONTROL2.PRE_MID_POST2" only or do I need to assign something else also ?

Or how to do it if I want to examine how many DE genes there's within one group between 1-2, 2-3 or all 1-3 timepoints? Do I need to assign some of the "subject.nested" resultNames in the contrast to account for the within subject variability?

With the LRT I get sig. p-values for hundreds of genes but when applying the contrasts and Wald-test as a result there is thousands of significant genes which seems that the contrast really can't be right.

Thank you for your help. Really getting analysis forward thanks to this!

ADD REPLYlink written 4 weeks ago by heikki.sarin0
Yes that term, CASE_CONTROL2.PRE_MID_POST2, is the difference due to case status at time 2 relative to the difference at baseline. And you can test with results() using 'name' and test="Wald"
ADD REPLYlink modified 7 days ago • written 4 weeks ago by Michael Love13k

Great. The results seem to add up, when comparing with the full LRT results.

One question regarding pathway analysis. Is there one certain package you would recommend for downstream pathway analysis of the results? 

 

ADD REPLYlink written 24 days ago by heikki.sarin0

Most users I think use goseq. If you search the support posts there should be some about this.

ADD REPLYlink written 24 days ago by Michael Love13k

Yeah, it was quite straight forward.

We have also the exon data from the experiment. Is it possible to use DESeq2 for exon data? Been looking at DEXseq(which to my understanding uses DESeq2 as inferential engine) for the analysis but was just wondering as we have done the analysis already to gene-level data so would it be a lot easier to implement the current workflow to fit exon-level data on DESeq2?

 

 

ADD REPLYlink written 21 days ago by heikki.sarin0
For exon level you definitely need DEXSeq. The model is specifically designed for differential exon usage, controlling for gene expression. Best to read over the DEXSeq paper or vignette.
ADD REPLYlink modified 7 days ago • written 21 days ago by Michael Love13k

Been checking that over. Just wondering the same issue as with gene-level data on the model. As in DEXSeq user must specify on the coldata the exon column that has to be taken into account. How do I have to implement it/add on the model that I've used with DESeq2?

Full: ~condition + time + condition:subject.nested + condition:time

Reduced: ~condition + time + condition:subject.nested

Otherwise DEXseq seem somewhat self-explanatory on how to conduct the workflow.

ADD REPLYlink written 21 days ago by heikki.sarin0
Good Q for a new post with a DEXSeq tag.
ADD REPLYlink modified 7 days ago • written 21 days ago by Michael Love13k

Follow-up on this post. If I'd like to perform just case-only analysis with LRT between time-points would the correct models be:

~ subject.nested + time:subject.nested

~ subject.nested

So that the between subject variability is taken into account.

ADD REPLYlink written 7 days ago by heikki.sarin0

This will find differences in the profiles across time while controlling for differences in subjects.

ADD REPLYlink written 7 days ago by Michael Love13k

That's exactly what I was looking for. However, DESeq2 produces the following error :

Warning message:
In checkForExperimentalReplicates(object, modelMatrix) :
  same number of samples and coefficients to fit,
  estimating dispersion by treating samples as replicates.
  read the ?DESeq section on 'Experiments without replicates'

ADD REPLYlink written 7 days ago by heikki.sarin0

Oh, you only have individual time points per sample then you can't find this model. Just use 'time' then in place of the interaction of time and sample. You don't have another choice for this as you would need replicates for model you proposed.

ADD REPLYlink written 7 days ago by Michael Love13k

Actually I have three timepoints for every subject and almost twenty biological replicates. Shouldn't it work?

ADD REPLYlink written 7 days ago by heikki.sarin0

No, you'd need replicates per sample and per timepoint.

ADD REPLYlink written 6 days ago by Michael Love13k

Can you explain why is that compared to the model where condition is also included? Why there isn't need for replicates per sample in that model?

~condition + time + condition:subject.nested + condition:time

 

ADD REPLYlink written 6 days ago by heikki.sarin0

I don't have time to dive deep into explanations of models on the support site. Remember I'm just providing software support not statistical design. I often recommend partnering with a local statistician if you want explanations of various designs (and then you should consider bringing that person onto the project).

ADD REPLYlink written 5 days ago by Michael Love13k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 228 users visited in the last hour