Question

weird model design for DE analysis

1

Entering edit mode

Lorena Pantano ▴ 140

@lorena-pantano-6001

Last seen 19 months ago

Boston

Hi,

I have some doubts about my data. I would like to do DE analysis of RNAseq data.

I have two set of experiments done in two different time points, so there is probably batch effect: let's say batch A and B.

For A, I have 7 controls and 7 cases (clinical identified). For B, I have the same 7 controls and other 7 pre-cases (preclinical identified)

I wanted to know if I can analyze all together, but I have questions like:

1-7 cases from A are only in batch A, and the others 7 are only in batch B, is it correct to setup a blocked variable indicating the batch effect although I only have one entire group for one batch, and another entire group for another batch? 2-and 7 control for A and B are same people, so it is not quite right to merge all together because it would be more like technical replicates.

My goal is:

1-get DE genes that follow additive effect. Something like increase a little in pre-cases, and increase more in cases.

Thanks to all

RNASeq • 1.9k views

ADD COMMENT • link updated 5 months ago by Gordon Smyth 53k • written 12.5 years ago by Lorena Pantano ▴ 140

Gordon Smyth · Answer 1 · 2013-07-28

Dear Lorena,

You are right to recognise that this design needs special treatment.

You can analyse all the data together, but you will need a statistical procedure that can account for the fact that the control samples in the two batches are biologically paired, whereas all the other samples are biologically independent. This requires a method that can fit random effects for the individuals or can estimate within individual correlations.

You cannot do this in edgeR, DESeq, DESeq2 or in any of the other RNA-seq packages based on the negative binomial distribution. These packages assume that all samples are statistically independent.

The only valid approach that I know of would be to use voom and the duplicate correlation approach of the limma package. First, read in your data (for both batches) and normalize your data as in the edgeR package. If y is your DGEList object, then

   y <- calcNormFactors(y)

Then setup you design matrix as you would do for limma or edgeR. This includes effects for the batch and treatment, but not effects for patients. Something like

   design <- model.matrix(~batch+condition)

where batch takes two levels (A and B) and condition takes three levels (control, preclinical, clinical).

Then transform to logCPM using the voom function.

   v <- voom(y, design, plot=TRUE)

Then estimate the correlation between the repeat samples for the same individual:

   dupcor <- duplicateCorrelation(v, design, block=individual)

Here 'individual' is a vector or factor indicating which individual each sample comes from. It has length equal to the number of samples and takes 21 levels, one for each individual in your data (7 controls, 7 preclinic patients, 7 clinical patients).

Now it it best to repeat the two previous steps so that the voom weights and correlation converge:

   v <- voom(y, design, block=individual, correlation=dupcor$consensus)
   dupcor <- duplicateCorrelation(v, design, block=individual)

Then conduct a differential expression analysis:

  fit <- lmFit(v, design, block=individual, correlation=dupcor$consensus)
  fit <- eBayes(fit)

Be sure to check that the value of dupcor$consensus is positive. This analysis keeps track of the fact that the two samples from each control individual are correlated.

Then you can use the topTable() function in the limma package to extract genes that follow any pattern of differential expression that you are interested in.

Best wishes
Gordon

Gordon Smyth · Answer 2 · 2013-07-26

hi Lorena Can you clarify on a few points below: On Fri, Jul 26, 2013 at 1:29 PM, Lorena Pantano <lorena.pantano at="" gmail.com=""> wrote: > Hi, > > I have some doubts about my data. > I would like to do DE analysis of RNAseq data. > > I have two set of experiments done in two different time points, so there > is probably batch effect: let's say batch A and B. > > For A, I have 7 controls and 7 cases (clinical identified) > For B, I have the same 7 controls and other 7 pre-cases (preclinical > identified) > > I wanted to know if I can analyze all together, but I have questions like: > > > 1-7 cases from A are only in batch A, and the others 7 are only in batch B, > is it correct to setup a blocked variable indicating the batch effect > although I only have one entire group for one batch, and another entire > group for another batch? You can still use blocking. Because the control group spans batch A and batch B, you can still estimate the batch effect (the model matrix should still be full rank). > 2-and 7 control for A and B are same people, so it is not quite right to > merge all together because it would be more like technical replicates. > I don't understand this last sentence. Do you mean that some, but not all, of individuals in the control group from batch A are also in batch B? I wouldn't say they are technical replicates if the samples are drawn from the same individual. RNA-Seq experiments of the same individual will contain biological variability. > My goal is: > > 1-get DE genes that follow additive effect. Something like increase a > little in pre-cases, and increase more in cases. > It sounds to me like you might want to treat pre-cases and cases as separate levels of a factor variable, rather than assume that the trend will necessarily be: control < pre-case < case. But to receive more specific advice, you should probably explain more about the share of individuals across the 28 samples. Mike