Question

DEseq2 replicate normalization

0

Entering edit mode

michael.morash • 0

@6662f6a7

Last seen 4.2 years ago

Canada

We have an RNAseq experiment with 6 conditions (5 serial dilutions of a single drug and 1 untreated control) with 5 biological replicates for each condition. Our study uses low drug concentrations and so we expect to see only a moderate number of DEGs due to condition. In fact most of the variation comes from replicate (generally global gene expression increases or decreases). Our coldata has Gene ID, Count, Condition and Replicate, and the experimental design is ~ replicate + condition.

We would like to normalize or correct to remove any replicate differences before determining the effects of the condition. Is there a way to explicitly normalize the samples based on replicate, before assessing the effect of condition? As an example, if this were a qPCR experiment, I would normalize to a few housekeeping genes.

DEseq2 replicate DESeq2 normalization • 2.6k views

ADD COMMENT • link 4.2 years ago michael.morash • 0

score 0 · Answer 1 · 2021-12-02

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 9 hours ago

United States

Your experimental design is the way to what you ask, there's nothing else to do.

ADD COMMENT • link 4.2 years ago Michael Love 43k

0

Entering edit mode

So is our design correct? Does the design "~ replicate + condition" mean that when we do a contrast for condition "control" vs condition "treated" that the replicate differences will be normalized? I'm also a little unclear, after reading through the DEseq2 literature, and the R script explanations for design variables, what the " + " sign is telling DEseq2 to do with regards to normalization.

ADD REPLY • link 4.2 years ago michael.morash • 0

0

Entering edit mode

You should consult with a statistician to better understand what is happening.

Yes that is what the design you have does.

ADD REPLY • link 4.2 years ago Michael Love 43k

score 0 · Answer 2 · 2021-12-02

0

Entering edit mode

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 11 weeks ago

United States

There are maybe two questions going on here.

Normalizing by "replicate"

Unless there is something common among the different replicates in your experiment, I'm not sure that including the replicate term in your model as you are doing and trying to normalize is possible/reasonable. Allow me to outline a few different scenarios that would enable you to use something like a replicate term it sounds like you are thinking about.

You mention that you have 5 replicates for each of your 6 serial dilutions including control, ie. 1:1, 1:2, 1:3, 1:4, 1:5, and control.

You may have done one complete round of serial dilutions + controls and performed RNA-extraction over the course of five days: if so, adding a "replicate" term with values d1, d2, ..., d5 can make sense, and your experimental colData() might look something like this:

dilution    replicate
1:1         d1
1:2         d1
1:3         d1
1:4         d1
ctrl        d1
1:1         d2
1:2         d2
...
1:5         d5
ctrl        d5

Or perhaps you did full rounds of 1:1, ..., 1:5 at different times during the same day? Like, once before your first coffee, then another before your second coffee, then a third before your third coffee, then you had breakfast, then you figured "I don't have time for all this" and bypassed the whole "brewing the coffee" step and managed to fit a few more rounds of dilution and rna extraction between spoons of ground coffee you were shoveling.

dilution    replicate
1:1         coffee1
1:2         coffee1
1:3         coffee1
1:4         coffee1
ctrl        coffee1
1:1         coffee2
1:2         coffee2
...
1:5         spoon2
ctrl        spoon2

Another scenario might have been that you prepped three rounds of dilution + control samples, then you said "jeez, I could really use a coffee" so you went ahead and drank some, after which you prepared the final two dilutions rounds and two control samples. You might encode your "replicate" covariate with BC (before coffee) and AC (after coffee):

dilution    replicate
1:1         BC
1:2         BC
1:3         BC
1:4         BC
ctrl        BC
1:1         BC
1:2         BC
...
1:5         AC
ctrl        AC

Or, perhaps, you had 5 different people perform a full round of dilutions ... then in this case, the replicate would correspond to the "operator" who processed a complete set of samples.

Is there something along those lines you can try to control for?

Normalization against housekeeping genes

This most likely isn't going to be the way to go ...

ADD COMMENT • link 4.2 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

All replicates were done completely independently, in full on subject groups from different parents. Dilution series A done one week, dilution series B the next week etc. All of the chemicals were prepared separately. All samples were then frozen @ -80C, and the RNA isolated on each complete replicate at separate times. So I believe scenario 1 is the most accurate. Our coldata looks like your example, except it includes a Gene ID and count column as well.

ADD REPLY • link 4.2 years ago michael.morash • 0

0

Entering edit mode

Well, we're getting into the weeds a bit here, and having a chat with a local stats person would be helpful so they can help you figure out how to properly model the experiment as this type of questions isn't RNA-seq specific, but a more broad question about using linear models for statistical analysis.

That having been said, it sounds like you have complete serial dilutions and controls from 5 different individuals? So instead of a "replicate" covariate, it might aptly named "subject_id", perhaps? Does a PCA plot over your sample show sample-level clustering more closely with "subject_id" perhaps?

You could set up a model like ~ dilution + subject_id can be appropriate. I'm not sure what you want to test though, perhaps finding genes that have a dose-response relationship w/ the dilution? You could imagine analyzing this like a time series experiment, perhaps, where "dilution" takes the place of time.

You'll find some info on that for DESeq2 in this tutorial. There are also examples of analyzing time-series data in the edgeR user guide (section 4.8) as well as the limma user's guide.

Try to find the time to read through DESeq2, edgeR, and limma user guides. You'll find many examples of how to analyze expression data under different scenarios, and I suspect you'll find something close enough to your situation that would help you get a better handle of things you can try to make some progress with your specific questions.

ADD REPLY • link 4.2 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

I really appreciate the support from both of you for my question. Essentially we are hoping to identify all DEGs when comparing each dilution to the control. We have different software for modelling variable dose response.

Other than the EdgeR guide (which I am familiar with but haven't studied in depth) I have read through the others you suggested already. It seems that whichever nomenclature I use (dilution or condition, replicate or subjet id), I would be using the same experimental design. We have run the design in both orientations (~ rep + cond, ~ cond + rep) and it makes essentially no difference (LFCs are the same over the first 6-8 decimal places and identify the same DEGs). The major question I have is what does "~replicate + condition" tell DEseq2 to do to the data? When I use only ~ condition, a much smaller (~5-10%) set of transcripts is identified.

ADD REPLY • link 4.2 years ago michael.morash • 0

0

Entering edit mode

Let's use subject_id here.

You would define a model like ~ condition + subject_id if you suspect that individuals have different baseline expression for each gene, and you want to control for that so you can then test for differences in the residuals which are do to condition effects. If this is the data you have, then this is likey what you want to do.

One thing that might help with your intuition is to find a gene that you detect DGE when you include subject_id but lose it when that term is not included.

You can use the plotCounts function (I think that's what it's called in DESeq2) to visualize the expression of that gene across condition, then color the points by subject_id. Do you see trends in the data where, say, the blue dots are systematically lower than the red dots across your condition(s)? If so, that's what you are correcting.

Perhaps another way to think about it is that when you just use ~ condition, you are doing a t-test for each gene, but using ~condition + subject_id, you are doing a paired t-test.

Hopefully that helps a bit.

ADD REPLY • link 4.2 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Yes, thank you. This has helped me a great deal. I was having a hard time wrapping my head around what the different designs were telling DEseq2 to do, but you have cleared it up.

Many thanks!

ADD REPLY • link 4.2 years ago michael.morash • 0