Question

Normalization starting from beta values in minfi


Asma rabe ▴ 290


Hi,

I have a matrix of beta values from 450k array methylation data, and I am using minfi to analyze it. I have the following questions:

1- How can I normalize the data starting from a RatioSet? All the normalization functions take either an RGChannelSet or a MethylSet.

2- I tried to identify DMRs using bumphunter as follows:

designMatrix <- model.matrix(~ status)

# status is a factor with levels C (control) and D (disease)

dmrs <- bumphunter(GRset, design = designMatrix, cutoff = 0.2, B = 0, type = "Beta")

head(dmrs$table,n=3)

I got a table with the following columns:

chr start end value area cluster indexStart indexEnd L clusterL

What does the column value mean?

Also, dmrs$pvaluesMarginal is NA, and I do not know why.

3- How do I determine hypo- and hypermethylated genes?

Thank you very much in advance

methylation minfi • 3.3k views

ADD COMMENT • link written 7.9 years ago by Asma rabe ▴ 290

score 1 · Accepted Answer · 2016-06-01


James W. MacDonald 65k


1.) I don't think there is a good way to normalize the ratios. All reasonable normalization methods take into account the two different types of probes, as well as dye bias. If all you have are ratios, you have lost all that information. Have you looked at plots to see if the data have already been normalized?

2.) This is the mean (by default, you might have changed it) of the estimated coefficients (the regression betas) for that region. Without a reproducible example, I don't know why pvaluesMarginal is NA either.

3.) The sign of value tells you the sign of the coefficient. Depending on how you set up the model, that indicates whether a region is hyper- or hypomethylated.
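As a minimal sketch of point 3, assuming the design from the question (a two-level factor with C as the reference level), the coefficient is disease minus control, so a positive value would mean hypermethylation in disease. The table below is a hypothetical stand-in for dmrs$table, not real output.

```r
# Make the reference level explicit so the sign of the coefficient is unambiguous.
status <- factor(c("D", "D", "D", "C", "C", "C"), levels = c("C", "D"))
designMatrix <- model.matrix(~ status)
colnames(designMatrix)  # the second column, "statusD", is the D - C contrast

# Hypothetical stand-in for dmrs$table (values invented for illustration).
tab <- data.frame(chr   = c("chr1", "chr1", "chr7"),
                  start = c(1000, 52000, 30500),
                  end   = c(1400, 52600, 30900),
                  value = c(0.31, -0.25, 0.22))

# With C as the reference, value > 0 means higher methylation in D.
tab$direction <- ifelse(tab$value > 0, "hyper in disease", "hypo in disease")
tab
```

If D were the reference level instead, the interpretation of the sign would flip.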

ADD COMMENT • link 7.9 years ago James W. MacDonald 65k


Hi James,

Thank you very much for the helpful answer.

1- Yes, I have checked the density plot, and the data look fine, with a bimodal distribution.

Are there any further checks I should do to confirm that the data have been normalized?

2- I was able to get pvaluesMarginal. Thank you.

3- Here is the example I used to run bumphunter:

phenotype <- c(rep("D", 3), rep("C", 3))

names(phenotype) <- c("disease1", "disease2", "disease3", "control1", "control2", "control3")

designMatrix <- model.matrix(~ phenotype)

dmrs <- bumphunter(GRset, design = designMatrix, cutoff = 0.2, B = 1000, type = "Beta")

Here are the column names of the table in the dmrs object:

colnames(dmrs$table)

[1] "chr" "start" "end" "value" "area"

[6] "cluster" "indexStart" "indexEnd" "L" "clusterL"

[11] "p.value" "fwer" "p.valueArea" "fwerArea"

Sorry to disturb you, but I have the following questions:

i- Should I consider only p.value to select significant DMRs, or both p.value and fwer?

ii- What do the columns "cluster" "indexStart" "indexEnd" "L" "clusterL" mean?

iii- The value column has positive and negative values. According to the model above, DMRs with negative values are hypomethylated in disease relative to control, while DMRs with positive values are hypermethylated in disease. I would like to plot a heatmap of the DMRs (genes) across samples. The only way I can think of is to use the beta values, but I have beta values for probes rather than for DMRs. Any ideas?

Thank you very much.

ADD REPLY • link 7.9 years ago Asma rabe ▴ 290


1.) If all the density plots are very similar, then I think you can assume they have been normalized. You might also contact whoever you got the data from and ask them how they normalized.
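A rough way to compare the per-sample distributions, sketched here in base R on simulated bimodal beta values (the matrix and sample names are made up for illustration). As far as I know, minfi's densityPlot also accepts a plain matrix of beta values, which would be the more direct route with real data.

```r
set.seed(7)
n_probes <- 2000
samples <- c("disease1", "disease2", "disease3", "control1", "control2", "control3")

# Simulated beta values: a mixture of an unmethylated and a methylated peak.
beta <- sapply(samples, function(s) {
  c(rbeta(n_probes / 2, 2, 20), rbeta(n_probes / 2, 20, 2))
})

# One density estimate per sample, overlaid on the same axes.
dens <- lapply(seq_len(ncol(beta)), function(j) density(beta[, j], from = 0, to = 1))
ymax <- max(sapply(dens, function(d) max(d$y)))
plot(dens[[1]], ylim = c(0, ymax), main = "Beta value densities", xlab = "Beta")
for (j in seq_along(dens)[-1]) lines(dens[[j]], col = j)

# A numeric companion to the plot: per-sample quantiles should be close
# to each other if the samples are on a comparable scale.
q <- apply(beta, 2, quantile, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))
round(q, 3)
```

Similar curves and quantiles are consistent with normalized data, but they do not prove which method was used.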

2.) Cool.

3.)

i) FWER is obviously the safer way to go, as it has been adjusted for multiple comparisons.

ii.) These columns come from regionFinder and clusterMaker, which are both internal functions. And in normal usage you don't need to know what they mean. Basically what happens is that all the CpG positions are first indexed in order, from 1 to N, where N is the total number of CpGs. Then clusterMaker defines clusters along each chromosome, based on CpGs that are not too far from each other (i.e., in a cluster). regionFinder then looks at each cluster and tries to find contiguous CpGs that are all up or down, based on criteria that you supply. But regionFinder reports the regions in terms of the cluster it was found in and the index position, from the original indexing of the CpG positions.

So cluster tells you which cluster the DMR is in, the indexStart and indexEnd are the index positions for the start and end of the DMR, L is the length of the DMR (in terms of CpGs), and clusterL is the length of the cluster (again in terms of number of CpGs, not genomic distance). So like I said, these are useful for internal calculations and stuff, but are not particularly informative for the end user.
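To make those columns concrete, here is a simplified base R illustration of the two steps on one toy chromosome. It is not the bumphunter implementation; the positions, estimates, and max_gap value are hypothetical, and only the general idea (gap-based clusters, then contiguous runs beyond the cutoff) is taken from the description above.

```r
# Toy CpG positions (sorted) and per-CpG coefficient estimates.
pos <- c(100, 150, 220, 5000, 5100, 5150, 5230, 20000)
est <- c(0.05, 0.30, 0.25, -0.10, -0.35, -0.28, -0.22, 0.40)
max_gap <- 500  # assumed maximum distance between neighbours in a cluster
cutoff  <- 0.2  # plays the role of the cutoff argument

# Step 1: start a new cluster whenever the gap to the previous CpG is too large.
cluster <- cumsum(c(TRUE, diff(pos) > max_gap))

# Step 2: mark each CpG as up (1), down (-1), or neither (0).
direction <- ifelse(est > cutoff, 1, ifelse(est < -cutoff, -1, 0))

# Step 3: a region is a run of neighbouring CpGs in the same cluster
# that all point in the same non-zero direction.
run_id <- cumsum(c(TRUE, diff(cluster) != 0 | diff(direction) != 0))
keep <- direction != 0
regions <- do.call(rbind, lapply(split(which(keep), run_id[keep]), function(idx) {
  data.frame(start = pos[min(idx)], end = pos[max(idx)],
             value = mean(est[idx]),
             cluster = cluster[idx[1]],
             indexStart = min(idx), indexEnd = max(idx),
             L = length(idx),
             clusterL = sum(cluster == cluster[idx[1]]))
}))
regions
```

In this toy case the second region covers CpGs 5 to 7 (L = 3) inside a cluster of four CpGs (clusterL = 4).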

ADD REPLY • link 7.9 years ago James W. MacDonald 65k


Oh, I forgot

3.)

iii) I don't really like the whole heatmap idea, because in my experience it isn't very enlightening. Instead I prefer to use Gviz to plot the genomic region as well as the beta values, along with a smoothed line for each group. The Gviz user's guide is really awesome, and should help you figure out how to do that.
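If a region-level heatmap is still wanted, one possible approach is to average the probe-level beta values that fall inside each DMR and plot those means. The sketch below uses base R only, matches probes to regions by genomic coordinates, and runs on simulated data; probe_chr, probe_pos, and the regions table are hypothetical stand-ins for the probe annotation and dmrs$table.

```r
set.seed(1)
samples <- c("disease1", "disease2", "disease3", "control1", "control2", "control3")

# Hypothetical probe annotation and simulated probe-level beta values.
probe_chr <- rep(c("chr1", "chr2"), each = 5)
probe_pos <- rep(c(100, 200, 300, 400, 500), times = 2)
beta <- matrix(runif(10 * 6), nrow = 10,
               dimnames = list(paste0("cg", 1:10), samples))

# Hypothetical stand-in for the chr/start/end columns of dmrs$table.
regions <- data.frame(chr = c("chr1", "chr2"),
                      start = c(200, 100),
                      end = c(400, 300))

# Mean beta per region and sample, using the probes inside each region.
region_means <- t(sapply(seq_len(nrow(regions)), function(i) {
  in_region <- probe_chr == regions$chr[i] &
    probe_pos >= regions$start[i] & probe_pos <= regions$end[i]
  colMeans(beta[in_region, , drop = FALSE])
}))
rownames(region_means) <- paste0(regions$chr, ":", regions$start, "-", regions$end)

# Regions in rows, samples in columns; no scaling so the colours stay on the beta scale.
heatmap(region_means, scale = "none")
```

Matching by coordinates avoids relying on indexStart and indexEnd lining up with the row order of the beta matrix.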

ADD REPLY • link 7.9 years ago James W. MacDonald 65k


Thank you very much for the clear explanation.

I wonder why many regions are identified as DMRs although their fwer is not statistically significant.

Thanks for introducing Gviz. I can use it to investigate a few genes of interest, but I have hundreds of DMRs and would like to summarize them in a visually clear way. How do people generally do this in downstream analysis of methylation data?

ADD REPLY • link 7.9 years ago Asma rabe ▴ 290


If you are saying that a significant p-value indicates a region is a DMR, and fwer > 0.05 means not significant, then this is simply the difference between a p-value for a single test and a multiplicity-adjusted p-value.

I don't know how people generally do much of anything. I only know what random sorts of things I do personally. And what I normally do is use a package I wrote that uses a combination of ReportingTools or DT and Gviz, the idea being that I generate plots for each region using Gviz, then generate an HTML table that has in each row the chromosomal position of the DMR, the nearest gene, some of the stats from the bumps table, and links to the Gviz plots.
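A toy sketch of that difference, assuming a simplified area-based version of the permutation scheme: the per-region p-value compares an observed region against all null regions pooled over permutations, while the FWER asks in what fraction of permutations at least one null region is as extreme. The numbers are invented, and bumphunter's own calculation may differ in detail.

```r
# Areas of the observed candidate regions (hypothetical).
obs_area <- c(2.4, 0.9, 0.5)

# Areas of the regions found in each of four permutations (hypothetical);
# the last permutation produced no candidate region at all.
null_areas <- list(c(0.3, 0.6), c(1.0), c(0.2, 0.4, 2.5), numeric(0))
all_null <- unlist(null_areas)

# Per-region p-value: fraction of pooled null regions at least as large.
p_area <- sapply(obs_area, function(a) mean(all_null >= a))

# FWER: fraction of permutations with at least one null region as large.
fwer_area <- sapply(obs_area, function(a) {
  mean(sapply(null_areas, function(x) any(x >= a)))
})

data.frame(area = obs_area, p.valueArea = p_area, fwerArea = fwer_area)
```

Candidate regions are chosen by the cutoff on the estimate before any permutation is run, so a region can appear in the table and still be matched by many null regions.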

That way my collaborators just have a table that says the closest gene name and where the DMR is, and they can look through that to see if any pique their interest. If so, they can click on the link and see the genomic region and the methylation status.

My package is on GitHub, and you are welcome to use it, or just get ideas if you like, but with the clear understanding that this is at your own risk, and I'm not going to support you (e.g., don't email me with questions). This package is rather ad hoc and I have never done anything to make it particularly user-friendly or generalized to arbitrary experiments. So like I said, you are welcome to use it, but understand that it's on you to figure things out if it goes boom.

ADD REPLY • link 7.9 years ago James W. MacDonald 65k


Thank you very much for the helpful answers.

Regarding your answer:

>>If you are saying that a significant p-value indicates a region is a DMR, and fwer > 0.05 means not significant, then this is simply the difference between a p-value for a single test and a multiplicity-adjusted p-value.

What I meant was not the correction for multiple testing. I found regions returned by bumphunter with p.value > 0.1, and some with p.value = 1, so I wonder why regions that are not statistically significant are picked by bumphunter.

ADD REPLY • link 7.8 years ago Asma rabe ▴ 290