Question: limma: within-array duplicates correlation
0
3.7 years ago by
roser.navarro0 wrote:

Dear all,

I would like to use limma package to find differential expressed genes.

I'm working on miRNA microarray data.

From the technical part:

We have used agilent v3.0 8x15K microarray.

The chip contains different probes for 2006 miRNAs.

 PROBES without CONTROLS:         60180
---------------------------------- 
  (Non-CTRL) Unique Probe:  4774
(Non-CTRL) Unique Genes:  2006
----------------------------------
DISTRIBUTION OF REPLICATED NonControl Probes
reps
7    8   10   15   30
630  630  414 3094    6
------------------------------------------------------
Replication at Probe level- MEDIAN  CV 

Each gene has 30 probes (one or more than 2 different probes).

As you can see most of probes are replicated 15 times across the chip.

In summary: Each gene has one or more probes and a total of 30.

The distribution of the replicates on the chip is not side-by-side.

From the biological part:

We took 2 samples from each patient (day 2 and 7) from control and obese patients. We have a total of 16 patients, so we have 32 samples.

So we have 2 levels in the analysis: day and disease.

OK! Now the question is...

We don't know if we could/should apply within-array replicate spots.

Following these recomendations:

we have treated

Person <- factor(Woman)  ## It took values from 1 to 16

Treatmetn = Group  ## It took values obesity_2, obesity_7, control_2, control_7

design <- model.matrix(~0+Treatment)

colnames(design) <- levels(Treatment)

corfit <- duplicateCorrelation(normalized_matrix,design,block=Person)   ## normalized with background and betweenArray corrections.

corfit$consensus.correlation And we've used the correlation to fit the model fit <- lmFit(valores.pca_all, design, block=Person,correlation=corfit$consensus)

And then we've made the contrasts:


contrastes <- makeContrasts (

c1       =  control_7 - control_2,

c2       =  obesity_2 - control_2,

c3       =  obesity_7 - obesity_2,

c4       =  obesity_7 - control_7,

levels = design)

fit2=contrasts.fit(fit,contrastes)

fit2=eBayes(fit2)

However, we don't know if we should use or not within-array replicate spots. And if it is the case, how could we apply it.

I've read your paper, but I'm not statistician so...

Is the following code correct ?

NDUPS = 30 ## we have 30 probes for each gene

corfit <- duplicateCorrelation (expr, design = des.mat, ndups = NDUPS, spacing = 1)

fit <- lmFit (expr, design = des.mat, ndups = NDUPS, spacing=1, correlation = corfit$consensus) Thanks in advance mirna microarray limma • 1.3k views ADD COMMENTlink modified 3.7 years ago by Aaron Lun25k • written 3.7 years ago by roser.navarro0 Answer: limma: within-array duplicates correlation 2 3.7 years ago by Aaron Lun25k Cambridge, United Kingdom Aaron Lun25k wrote: If you're already using duplicateCorrelation to estimate correlations due to the patient factor, I don't think it's possible to use it again (i.e., in the same analysis) to account for correlations between duplicate spots. An alternative might be to average across the duplicate spots within each array, and then perform the DE analysis on those averages, e.g., by using avereps or avedups. This will probably have some benefit as the average should be more stable than the signal from any single probe. Admittedly, you won't be able to account for the variability of the signal across duplicate spots, which is what duplicateCorrelation would do if you used it here; but this also means that you're free to use duplicateCorrelation on the patients, which is more practically important as it allows for comparisons between obese and normal individuals. ADD COMMENTlink modified 3.7 years ago • written 3.7 years ago by Aaron Lun25k Thanks a lot Aaron Lun. I'm using avereps before using duplicateCorrelation on patients. But I don't know which of both approaches would be better. Results using spots or patients are very different. That's why I would like to know if the following code is correct: DUPS = 30 ## we have 30 probes for each gene (they can be different) corfit <- duplicateCorrelation (expr, design = des.mat, ndups = NDUPS, spacing = 1) fit <- lmFit (expr, design = des.mat, ndups = NDUPS, spacing=1, correlation = corfit$consensus)

ADD REPLYlink written 3.7 years ago by roser.navarro0
1

I assume you've set it up so that the duplicate spots are side-by-side in exprs, otherwise spacing=1 would be incorrect. What exactly do you mean by "they can be different"? Because if any genes have a different number of spots, then many of the assignments will probably be wrong if you assume that they've all got 30.

Anyway, modelling the correlation due to the patient factor seems more important than modelling the correlation between spots, given that patient-to-patient (biological) variability should be much larger than spot-to-spot (technical) variability. You haven't shown what des.mat is, but I assume that you've used the same ~Treatment formula. Presumably you haven't included a patient blocking factor; not that it would help, because that precludes an obese/normal comparison.

ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by Aaron Lun25k

Thanks a lot for your help. I hope this will be my last question :-)

This is my code (for the 2 approaches)

When I started the analysis I used the following code. But I didn't get any DEG.

x <-read.maimages(targets_all$FileName, source="agilent", green.only = TRUE) colnames(x$E) <- targets_all$Code x <- backgroundCorrect(x, method="normexp", offset = 0) y <- normalizeBetweenArrays(x, method = "quantile") y <- y[y$genes$ControlType == 0,] ################ pipeline 1: duplicatedCorrelation Between Women y.ave <- avereps(y, ID=y$genes$SystematicName) dim(y.ave) ### 2006 miRNAs and 32 samples Person <- factor(targets_all$Woman)
Treat = targets_all$Group design <- model.matrix(~0+Treat) colnames(design) <- levels(Treat) corfit <- duplicateCorrelation(y.ave$E,design,block=Person)
corfit$consensus.correlation contrastes <- makeContrasts ( c1 = control_lh7 - control_lh2, c2 = obesity_lh2 - control_lh2, c3 = obesity_lh7 - obesity_lh2, c4 = obesity_lh7 - control_lh7, levels = design) ## fit model fit <- lmFit(y.ave$E,design,block=Person,correlation=corfit$consensus.correlation) ## contrastes fit2=contrasts.fit(fit,contrastes) fit2=eBayes(fit2) res_limma_m1_c1 <- topTable(fit2, coef = 1, number = 2006) ## 0 deg res_limma_m1_c2 <- topTable(fit2, coef = 2, number = 2006) ## 0 deg res_limma_m1_c3 <- topTable(fit2, coef = 3, number = 2006) ## 0 deg res_limma_m1_c4 <- topTable(fit2, coef = 4, number = 2006) ## 0 deg So, I decided to use correlation between spots and I got a lot of DEG ############### pipeline 2: Correlation within array: spots expr <- y$E    ## i'm using background corrected and normalized data (is it ok?) at probe level
rownames (expr) <- y$genes$SystematicName
table(table(rownames(expr)))  ## each gene 30 spots: but probe can be different
ord <- order (rownames (expr))
expr <- expr[ord,]
expr[1:3,]

## Estimate the correlation structure for the genes / miRNAs
NDUPS <- unique (table (rownames (expr)))  ## 30
if (length (NDUPS) > 1) stop ("NOT THE SAME NUMBER OF DUPLICATED SPOTS")

dupcor <- duplicateCorrelation (expr, design = design, ndups = NDUPS, spacing = 1)
dupcor$consensus ## fit the limma model with correlation fit <- lmFit (expr, design = design, ndups = NDUPS, spacing=1, correlation = dupcor$consensus)
fit <- contrasts.fit (fit, contrastes)
fit2=eBayes(fit)

res_limma_m2_c1 <- topTable(fit2, coef = 1, number = 2006)  ## 175 deg
res_limma_m2_c2 <- topTable(fit2, coef = 2, number = 2006)  ## 27 deg
res_limma_m2_c3 <- topTable(fit2, coef = 3, number = 2006)  ## 246 deg
res_limma_m2_c4 <- topTable(fit2, coef = 4, number = 2006)  ## 193 deg

Do you think the both approches (and code) are correct?

In this case, why do you think the results are so different?

Huge thanks in advance :-D

ADD REPLYlink written 3.7 years ago by roser.navarro0
1

Getting more DE genes doesn't make the analysis correct. Because your second approach doesn't consider the correlations due to the patient factor, it will likely be liberal as it overstates the amount of information that is present across dependent samples. This is why you get more DE genes at a nominal FDR threshold of (presumably) 5%, compared to the first analysis. Many of these are probably false positives, at least at this FDR threshold.

In short, the first approach seems a lot more valid to me. If you don't get any DE genes, then so be it. It's better to get nothing from a correct analysis than to get something from an incorrect one. In any case, you can just relax the FDR threshold in the first analysis to pick up some DE genes. Sure, the evidence for DE will be weaker, but at least you know it's weaker, whereas the second analysis will give you a false sense of security about the reliability of your results.

ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by Aaron Lun25k

Thank you :-)

ADD REPLYlink written 3.7 years ago by roser.navarro0
Please log in to add an answer.

Content
Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 230 users visited in the last hour