Question

Handling missing data (NA) for methylation analysis in limma

mheydarpour ▴ 10

@mheydarpour-9430

I have final-report methylation data with ~800k rows (CpG sites) for 50 samples. The dataset contains many missing values (NA), which cause problems when I try to normalize the data and run a differential methylation analysis in limma. Here is a small sample of my data (each cell is a Beta-value):

CpG-sites	sample1	sample2	sample3	sample4	sample5
cg01017367	0.6735	0.7229	0.6696	0.6561	0.6043
cg01485780	NA	0.7923	0.7458	NA	0.7526
cg02276259	0.4328	0.4618	0.4860	0.4493	0.3947
cg04315069	0.7968	NA	0.7816	0.8490	0.7797
cg06291348	0.3715	0.3593	NA	0.3172	0.2958
cg07495256	0.8986	0.9079	0.9192	0.9116	0.8012
cg07920074	0.7049	0.7388	0.7777	0.7039	NA

My question is: how should I handle these missing values (NA) in such a large dataset (~800k rows, 50 columns)? Is there an R package that accounts for missing data? Is there a fast way to impute missing data in R? Thanks in advance for any advice.

limma missing data methylation normalization • 3.5k views

ADD COMMENT • link updated 8.2 years ago by Aaron Lun ★ 29k • written 8.2 years ago by mheydarpour ▴ 10

limma handles missing values naturally in lmFit. You'll have to be more precise about the nature of your problem.

ADD REPLY • link 8.2 years ago Aaron Lun ★ 29k

But when I run "lmFit", I get the following error in limma:

fit=lmFit(CpG, design)
Error in rowMeans(y$exprs, na.rm = TRUE) : 'x' must be numeric

I also have a problem normalizing these Beta-values.

ADD REPLY • link 8.2 years ago mheydarpour ▴ 10

score 0 · Answer 1 · 2017-10-25

Aaron Lun ★ 29k

@alun

Well, is your matrix numeric? This works fine for me:

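library(limma)
CpG <- matrix(rnorm(1000), ncol=10, nrow=100) # 100 sites x 10 samples of simulated values.
CpG[sample(length(CpG), 100)] <- NA # Adding some missing values.
design <- model.matrix(~gl(2,5)) # Two groups of five samples each.
fit <- lmFit(CpG, design) # Fits despite the NAs.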

You'll have to be more precise about the problems you're having with normalization. As far as I am aware, the normalization methods in limma can deal with missing values. (Assuming, of course, that it makes sense to apply them to methylation data - I'm not familiar enough with this to be sure.) In any case, you shouldn't be using beta values for linear modelling, see https://dx.doi.org/10.1186/1471-2105-11-587.

ADD COMMENT • link 8.2 years ago Aaron Lun ★ 29k
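
For readers following the advice above about not modelling beta values directly: the linked paper recommends M-values, which are the logit (base 2) of the beta values. A minimal sketch of that conversion in base R is below. The small offset `eps` and the toy matrix are assumptions for illustration, not something specified in this thread; the offset only guards against beta values of exactly 0 or 1.

```r
# Toy beta-value matrix with missing values (values taken from the question).
beta <- matrix(
  c(0.6735, NA,     0.4328,
    0.7229, 0.7923, 0.4618),
  nrow = 3,
  dimnames = list(c("cg01017367", "cg01485780", "cg02276259"),
                  c("sample1", "sample2"))
)

# Assumption: clamp betas away from 0 and 1 so the logit stays finite.
eps <- 1e-6
beta_clamped <- pmin(pmax(beta, eps), 1 - eps)

# M-value = log2(beta / (1 - beta)); NA entries remain NA.
M <- log2(beta_clamped / (1 - beta_clamped))
M
```

The resulting `M` is still a numeric matrix with NA in the same positions, so it can be passed to lmFit in place of the beta values.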

Thanks, Aaron, for your explanation.

My CpG data matrix is numeric with some "NA" values; however, I still get the same error when I run limma with this design matrix:

design=model.matrix(~0+age+pheno)

where age is a continuous variable and pheno is a categorical outcome (0, 1).

What could be causing the following error?

Error in rowMeans(y$exprs, na.rm = TRUE) : 'x' must be numeric

ADD REPLY • link 8.2 years ago mheydarpour ▴ 10

Well, does running typeof(CpG) give you "double"?

ADD REPLY • link 8.2 years ago Aaron Lun ★ 29k

No, when I run "typeof(CpG)", it gave me "list"

ADD REPLY • link 8.2 years ago mheydarpour ▴ 10

Well, there you go. Coerce it into a numeric matrix with as.matrix before running lmFit.

ADD REPLY • link 8.2 years ago Aaron Lun ★ 29k

I tried this: CpG2 <- as.matrix(CpG)

run: fit <- lmFit(CpG2, design)

Error: Error in rowMeans(y$exprs, na.rm = TRUE) : 'x' must be numeric

ADD REPLY • link 8.2 years ago mheydarpour ▴ 10

Does running typeof(CpG2) give you "double"?

ADD REPLY • link 8.2 years ago Aaron Lun ★ 29k

typeof(CpG2) gave me "character"

ADD REPLY • link 8.2 years ago mheydarpour ▴ 10

Obviously, then, CpG2 is not a numeric matrix. There is one column in your original data frame (CpG) that is clearly non-numeric, and this is causing as.matrix to produce a character matrix. The identity of the offending column is left as an exercise for the reader.

ADD REPLY • link 8.2 years ago Aaron Lun ★ 29k

In the CpG2 matrix, the first column is "CpG-sites-Name" and the first row is "Sample-Id-Name".

ADD REPLY • link 8.2 years ago mheydarpour ▴ 10

Correct! That is indeed the offending column. Now, try removing it from your data frame before you run as.matrix.

ADD REPLY • link 8.2 years ago Aaron Lun ★ 29k
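
To summarize the fix worked out in this thread, here is a minimal base R sketch. It assumes the data were read into a data frame whose first column holds the CpG site IDs; the column name `CpG.sites` and the toy values are illustrative, not the poster's actual object.

```r
# Toy data frame shaped like the question's table: one ID column plus numeric samples.
CpG <- data.frame(
  CpG.sites = c("cg01017367", "cg01485780", "cg02276259"),
  sample1   = c(0.6735, NA,     0.4328),
  sample2   = c(0.7229, 0.7923, 0.4618),
  sample3   = c(0.6696, 0.7458, 0.4860),
  stringsAsFactors = FALSE
)

typeof(CpG)             # "list": a data frame, not a matrix
typeof(as.matrix(CpG))  # "character": the ID column forces every cell to text

# Find the non-numeric column(s) responsible.
is_num <- vapply(CpG, is.numeric, logical(1))
names(CpG)[!is_num]

# Keep only the numeric columns and move the IDs into the row names.
beta <- as.matrix(CpG[, is_num, drop = FALSE])
rownames(beta) <- CpG$CpG.sites

typeof(beta)            # "double": now suitable as input to lmFit
```

If columns other than the ID column also show up as non-numeric, one possible cause (an assumption, not confirmed in the thread) is that the header row was read in as data, in which case re-reading the file with the header recognized would be the first thing to check.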