limma: within-array duplicates correlation
1
0
Entering edit mode
@rosernavarro-9660
Last seen 7.3 years ago

Dear all,

I would like to use limma package to find differential expressed genes.

I'm working on miRNA microarray data. 

 

From the technical part:

We have used agilent v3.0 8x15K microarray.

The chip contains different probes for 2006 miRNAs.

 

 PROBES without CONTROLS:         60180 
---------------------------------- 
  (Non-CTRL) Unique Probe:  4774 
  (Non-CTRL) Unique Genes:  2006 
---------------------------------- 
DISTRIBUTION OF REPLICATED NonControl Probes 
reps
   7    8   10   15   30 
 630  630  414 3094    6 
------------------------------------------------------ 
Replication at Probe level- MEDIAN  CV 

 

Each gene has 30 probes (one or more than 2 different probes).

As you can see most of probes are replicated 15 times across the chip. 

In summary: Each gene has one or more probes and a total of 30.

The distribution of the replicates on the chip is not side-by-side.

 

 

From the biological part:

We took 2 samples from each patient (day 2 and 7) from control and obese patients. We have a total of 16 patients, so we have 32 samples.

So we have 2 levels in the analysis: day and disease.

 

OK! Now the question is...

 

We don't know if we could/should apply within-array replicate spots.

 

Following these recomendations:

Limma Design for Paired Samples Question http://www.statsci.org/smyth/pubs/2003to2014.html

we have treated

Person <- factor(Woman)  ## It took values from 1 to 16

Treatmetn = Group  ## It took values obesity_2, obesity_7, control_2, control_7

design <- model.matrix(~0+Treatment)

colnames(design) <- levels(Treatment)

corfit <- duplicateCorrelation(normalized_matrix,design,block=Person)   ## normalized with background and betweenArray corrections.

corfit$consensus.correlation

 

And we've used the correlation to fit the model 

 

fit <- lmFit(valores.pca_all, design, block=Person,correlation=corfit$consensus)

 

And then we've made the contrasts:
 

contrastes <- makeContrasts (

  c1       =  control_7 - control_2,     

  c2       =  obesity_2 - control_2, 

  c3       =  obesity_7 - obesity_2, 

  c4       =  obesity_7 - control_7,

  levels = design)

fit2=contrasts.fit(fit,contrastes)

fit2=eBayes(fit2)

 

However, we don't know if we should use or not within-array replicate spots. And if it is the case, how could we apply it.

http://www.statsci.org/smyth/pubs/dupcor.pdf

I've read your paper, but I'm not statistician so...

 

Is the following code correct ?

 

NDUPS = 30 ## we have 30 probes for each gene

corfit <- duplicateCorrelation (expr, design = des.mat, ndups = NDUPS, spacing = 1)

fit <- lmFit (expr, design = des.mat, ndups = NDUPS, spacing=1, correlation = corfit$consensus)

 

Thanks in advance

limma mirna microarray • 3.0k views
ADD COMMENT
2
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 3 hours ago
The city by the bay

If you're already using duplicateCorrelation to estimate correlations due to the patient factor, I don't think it's possible to use it again (i.e., in the same analysis) to account for correlations between duplicate spots. An alternative might be to average across the duplicate spots within each array, and then perform the DE analysis on those averages, e.g., by using avereps or avedups. This will probably have some benefit as the average should be more stable than the signal from any single probe. Admittedly, you won't be able to account for the variability of the signal across duplicate spots, which is what duplicateCorrelation would do if you used it here; but this also means that you're free to use duplicateCorrelation on the patients, which is more practically important as it allows for comparisons between obese and normal individuals.

ADD COMMENT
0
Entering edit mode

Thanks a lot Aaron Lun.

I'm using avereps before using duplicateCorrelation on patients.

But I don't know which of both approaches would be better. 

Results using spots or patients are very different. That's why I would like to know if the following code is correct:

DUPS = 30 ## we have 30 probes for each gene (they can be different)

corfit <- duplicateCorrelation (expr, design = des.mat, ndups = NDUPS, spacing = 1)

fit <- lmFit (expr, design = des.mat, ndups = NDUPS, spacing=1, correlation = corfit$consensus)

ADD REPLY
1
Entering edit mode

I assume you've set it up so that the duplicate spots are side-by-side in exprs, otherwise spacing=1 would be incorrect. What exactly do you mean by "they can be different"? Because if any genes have a different number of spots, then many of the assignments will probably be wrong if you assume that they've all got 30.

Anyway, modelling the correlation due to the patient factor seems more important than modelling the correlation between spots, given that patient-to-patient (biological) variability should be much larger than spot-to-spot (technical) variability. You haven't shown what des.mat is, but I assume that you've used the same ~Treatment formula. Presumably you haven't included a patient blocking factor; not that it would help, because that precludes an obese/normal comparison.

ADD REPLY
0
Entering edit mode

Thanks a lot for your help. I hope this will be my last question :-)

This is my code (for the 2 approaches)

When I started the analysis I used the following code. But I didn't get any DEG.

 

x <-read.maimages(targets_all$FileName, source="agilent", green.only = TRUE)
colnames(x$E) <- targets_all$Code
x <- backgroundCorrect(x, method="normexp", offset = 0)
y <- normalizeBetweenArrays(x, method = "quantile")
y <- y[y$genes$ControlType == 0,]

################ pipeline 1: duplicatedCorrelation Between Women

y.ave <- avereps(y, ID=y$genes$SystematicName)

dim(y.ave)  ### 2006 miRNAs and 32 samples

Person <- factor(targets_all$Woman)
Treat = targets_all$Group
design <- model.matrix(~0+Treat)
colnames(design) <- levels(Treat)
corfit <- duplicateCorrelation(y.ave$E,design,block=Person)
corfit$consensus.correlation


contrastes <- makeContrasts (
  c1       =  control_lh7 - control_lh2,     
  c2       =  obesity_lh2 - control_lh2, 
  c3       =  obesity_lh7 - obesity_lh2, 
  c4       =  obesity_lh7 - control_lh7,
  levels = design)


## fit model

fit <- lmFit(y.ave$E,design,block=Person,correlation=corfit$consensus.correlation)

## contrastes 

fit2=contrasts.fit(fit,contrastes)
fit2=eBayes(fit2)

res_limma_m1_c1 <- topTable(fit2, coef = 1, number = 2006)  ## 0 deg
res_limma_m1_c2 <- topTable(fit2, coef = 2, number = 2006)  ## 0 deg
res_limma_m1_c3 <- topTable(fit2, coef = 3, number = 2006)  ## 0 deg
res_limma_m1_c4 <- topTable(fit2, coef = 4, number = 2006)  ## 0 deg

 

So, I decided to use correlation between spots and I got a lot of DEG


############### pipeline 2: Correlation within array: spots

expr <- y$E    ## i'm using background corrected and normalized data (is it ok?) at probe level
rownames (expr) <- y$genes$SystematicName
table(table(rownames(expr)))  ## each gene 30 spots: but probe can be different
ord <- order (rownames (expr))
expr <- expr[ord,]
expr[1:3,]


## Estimate the correlation structure for the genes / miRNAs
NDUPS <- unique (table (rownames (expr)))  ## 30
if (length (NDUPS) > 1) stop ("NOT THE SAME NUMBER OF DUPLICATED SPOTS")


dupcor <- duplicateCorrelation (expr, design = design, ndups = NDUPS, spacing = 1)
dupcor$consensus

## fit the limma model with correlation
fit <- lmFit (expr, design = design, ndups = NDUPS, spacing=1, correlation = dupcor$consensus)
fit <- contrasts.fit (fit, contrastes)
fit2=eBayes(fit)


res_limma_m2_c1 <- topTable(fit2, coef = 1, number = 2006)  ## 175 deg
res_limma_m2_c2 <- topTable(fit2, coef = 2, number = 2006)  ## 27 deg
res_limma_m2_c3 <- topTable(fit2, coef = 3, number = 2006)  ## 246 deg
res_limma_m2_c4 <- topTable(fit2, coef = 4, number = 2006)  ## 193 deg

Do you think the both approches (and code) are correct? 

In this case, why do you think the results are so different?

 

Huge thanks in advance :-D

ADD REPLY
1
Entering edit mode

Getting more DE genes doesn't make the analysis correct. Because your second approach doesn't consider the correlations due to the patient factor, it will likely be liberal as it overstates the amount of information that is present across dependent samples. This is why you get more DE genes at a nominal FDR threshold of (presumably) 5%, compared to the first analysis. Many of these are probably false positives, at least at this FDR threshold.

In short, the first approach seems a lot more valid to me. If you don't get any DE genes, then so be it. It's better to get nothing from a correct analysis than to get something from an incorrect one. In any case, you can just relax the FDR threshold in the first analysis to pick up some DE genes. Sure, the evidence for DE will be weaker, but at least you know it's weaker, whereas the second analysis will give you a false sense of security about the reliability of your results.

ADD REPLY
0
Entering edit mode

Thank you :-)

ADD REPLY

Login before adding your answer.

Traffic: 590 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6