Question

limma for 450K methylation array data

0

Entering edit mode

Andreas Paul • 0

@andreas-paul-11140

Last seen 7.4 years ago

Dear all,

we want to use the limma package for the analysis of methylation data from the 450K Illumina
array for continuous traits.
We have 50 samples which were processed on the 450k array. The design only consist cases
without any controls. We have more than 5 covariates we want to include into the analysis apart
from age and ethnicity.
Could you give a broad idea/ R commands how to build a model and adjust for the above covariates using limma? We will use normalized and filtered M-values for the statistical analysis.

ID	GroupAge	Ethnicity	GroupSurgery	GroupSeizures	GroupDrug	GroupTreatment	Group5	Group6	Group7
1	1	CEU	1	2	1	3	..	..	..
2	1	CEU	2	3	3	3	..	..	..
3	2	CEU	1	2	2	3	..	..	..
4	1	CEU	1	2	1	3	..	..	..
5	3	AFR	3	1	1	2	..	..	..
6	1	ASN	2	1	2	3	..	..	..
7	1	ASN	1	2	2	3	..	..	..
8	1	CEU	3	3	2	2	..	..	..
9	2	MEX	1	1	1	1	..	..	..

This is how I would do it in the first place. Would be happy if you can have a look at. Im very thankful for any corrections and suggestions for designing the analysis.

samplesheet <- read.table("samplesheet.txt",row.names="Sample_ID")
age <- as.numeric(samplesheet[colnames(m),]$GroupAge)
ethnicity <- as.factor(samplesheet[colnames(m),]$Ethnicity)
group1 <- as.numeric(samplesheet[colnames(m),]$GroupSurgery)
group2 <- as.numeric(samplesheet[colnames(m),]$GroupSeizures)
group3 <- as.numeric(samplesheet[colnames(m),]$GroupDrug)
group4 <- as.numeric(samplesheet[colnames(m),]$GroupTreatment)

library(limma)
design <- model.matrix(~age+ethnicity+group1+group2+group3+group4)
fit <- lmFit(m.filtered,design)
fit <- eBayes(fit)
topTable(fit, coef="age")

topTable(fit, coef="ethnicity")

toTable(fit, coef="group1")

...

Thank you very much in advance.

limma minfi • 1.3k views

ADD COMMENT • link 7.7 years ago Andreas Paul • 0

0

Entering edit mode

Andreas Paul • 0

@andreas-paul-11140

Last seen 7.4 years ago

Thank you very much! This helps as it is really well explained! Thank you again!

ADD COMMENT • link 7.7 years ago Andreas Paul • 0

score 2 · Accepted Answer · 2016-07-20

All that as.numeric business will make R think that those things are numeric rather than factor. So unless a surgery of 3 is like 3 times more surgerylicious than a surgery of 1, you should probably not be doing that. The same goes for the seizures, drug, and treatment factors.

Also note that you won't have a coefficient called 'group1' if it is in fact some sort of surgery level. It would probably be easier to just leave it as Surgery, because then you will have a Surgery2 and Surgery3, which will estimate the difference between those surgery 'levels' and the 1 surgery level. If you stick with 'group1', then those coefficients will be 'group12' and 'group13', which are just confusing for no real gain. As a simple example:

> d.f <- data.frame(age = sample(25:75, 10), ethnicity = sample(c("CEU", "AFR","ASN","MEX"), 10, TRUE), Surgery = factor(sample(1:3, 10, TRUE)))
> model.matrix(~age+ethnicity+Surgery, d.f)
   (Intercept) age ethnicityASN ethnicityCEU ethnicityMEX Surgery2 Surgery3
1            1  56            0            0            1        1        0
2            1  65            0            1            0        0        0
3            1  28            1            0            0        0        1
4            1  44            1            0            0        0        0
5            1  33            1            0            0        0        1
6            1  69            0            0            1        0        1
7            1  71            0            0            1        0        0
8            1  61            0            0            0        1        0
9            1  60            0            1            0        1        0
10           1  68            1            0            0        0        1
attr(,"assign")
[1] 0 1 2 2 2 3 3
attr(,"contrasts")
attr(,"contrasts")$ethnicity
[1] "contr.treatment"

attr(,"contrasts")$Surgery
[1] "contr.treatment"

> names(d.f)[3] <- "group1"
> model.matrix(~age+ethnicity+ group1, d.f)
   (Intercept) age ethnicityASN ethnicityCEU ethnicityMEX group12 group13
1            1  56            0            0            1       1       0
2            1  65            0            1            0       0       0
3            1  28            1            0            0       0       1
4            1  44            1            0            0       0       0
5            1  33            1            0            0       0       1
6            1  69            0            0            1       0       1
7            1  71            0            0            1       0       0
8            1  61            0            0            0       1       0
9            1  60            0            1            0       1       0
10           1  68            1            0            0       0       1
attr(,"assign")
[1] 0 1 2 2 2 3 3
attr(,"contrasts")
attr(,"contrasts")$ethnicity
[1] "contr.treatment"

attr(,"contrasts")$group1
[1] "contr.treatment"

In addition, there is this idea that adjacent CpGs should be correlated, based on the idea that methylation is a regional phenomenon. This is a powerful idea, as you can then borrow information from adjacent CpG probes, to reduce the inherent variability of individual measurements. The minfi package will allow you to run bumphunter on your data, which puts a smooth curve on adjacent CpGs, and then you can use those smoothed estimates to look for differences rather than the individual CpGs. This also reduces multiplicity, so it's probably a win-win.