EPIC array single replicate pipeline?
1
0
Entering edit mode
chay • 0
@chay16
Last seen 3.0 years ago
Elgin

Hello, I’m new to this so sorry if any of this is obvious but I would really appreciate any help. I have 6 cancer cell lines, two of each of the following subtypes: mesenchymal, proneural, classical.

I want to find differentially methylated positions/regions, and have tried using limma to compare each subtypes in a contrast matrix but this generates no significant results. This is not totally unexpected as, from inspecting MDS plots/PCA, there probably isn’t enough power to detect changes due to the low sample size. However, I would like to compare the 2 cell lines that are most likely to have significant results from looking at the MDS plot. Can anyone suggest a single replicate analysis pipeline that would allow me to do this? I cannot do it with limma as it cannot do the statistical analysis so just comes up with an error.

However, prior to setting up my contrast matrix, my design matrix (without intercept) generates significant probes for each of the subtypes. I used this design matrix to calculate DMPs and have found a large number of significant probes. However, as this is not the way it’s done in the pipeline, I have no idea if these results are valid? Does anyone know please? From this I can’t calculate DMRs as DMRcate requires the use on an intercept and that alters the whole matrix, hence why I don’t know if using the matrix without intercept is okay for DMPs, but I am guessing not. I also tried using ChAMP but that generated no significant DMPs.

Hopefully this makes sense and someone can help me out, thanks!

fit <- lmFit(mVals, design)
> summary(decideTests(fit))
       Mesenchymal Proneural Classical
Down        212395    208081    208239
NotSig      278049    298912    284535
Up          321598    305049    319268

# Contrast matrix for specific comparisons
> contMatrix <- makeContrasts(Mesenchymal-Proneural,
                             Mesenchymal-Classical,
                             Proneural-Classical,
                             levels=design)
> contMatrix
             Contrasts
Levels        Mesenchymal - Proneural Mesenchymal - Classical Proneural - Classical
  Mesenchymal                       1                       1                     0
  Proneural                        -1                       0                     1
  Classical                         0                      -1                    -1

> fit2 <- contrasts.fit(fit, contMatrix)
> fit2 <- eBayes(fit2)
> summary(decideTests(fit2))

       Mesenchymal - Proneural Mesenchymal - Classical Proneural - Classical
Down                         0                       0                     0
NotSig                  812042                  812042                812042
Up                           0                       0                     0
minfi methylationArrayAnalysis • 1.4k views
ADD COMMENT
1
Entering edit mode
@james-w-macdonald-5106
Last seen 8 minutes ago
United States

If you set up your design matrix without an intercept you are calculating the mean of each group, and then if you test for significance without using a contrasts matrix all you are asking is if the mean M-values (I hope you are using M-values instead of the Beta proportions) is different from zero. Which is trivial and uninteresting. And as you can see, once you use a contrasts matrix you have no significant differences.

ADD COMMENT
0
Entering edit mode

Ok thanks so much, that helps a lot because I couldn't really make sense of what the different matrices meant. I'll disregard that result then. Do you know of a method that would allow me to compare two cell lines, without taking the phenotype into comparison?

ADD REPLY
0
Entering edit mode

I am not sure what you mean by 'without taking the phenotype into comparison'. You should probably take phenotype into account when making comparisons, even if there isn't a big difference in the MDS plot. Here's an example

> df <- data.frame(celltype = factor(rep(1:6, each = 6)), type = factor(rep(c("mesenchymal","proneural","classical"), each = 2, times = 6)))
> df
   celltype        type
1         1 mesenchymal
2         1 mesenchymal
3         1   proneural
4         1   proneural
5         1   classical
6         1   classical
7         2 mesenchymal
8         2 mesenchymal
9         2   proneural
10        2   proneural
11        2   classical
12        2   classical
13        3 mesenchymal
14        3 mesenchymal
15        3   proneural
16        3   proneural
17        3   classical
18        3   classical
19        4 mesenchymal
20        4 mesenchymal
21        4   proneural
22        4   proneural
23        4   classical
24        4   classical
25        5 mesenchymal
26        5 mesenchymal
27        5   proneural
28        5   proneural
29        5   classical
30        5   classical
31        6 mesenchymal
32        6 mesenchymal
33        6   proneural
34        6   proneural
35        6   classical
36        6   classical
> design <- model.matrix(~celltype + type, df)
> colnames(design)
[1] "(Intercept)"     "celltype2"       "celltype3"       "celltype4"      
[5] "celltype5"       "celltype6"       "typemesenchymal" "typeproneural"

So the intercept is the mean of the first celltype for the classical phenotype. Columns 2-6 estimate the difference between each celltype and the baseline celltype, for the classical phenotype. The last two columns compute the difference between the mesenchymal and proneural samples and the baseline.

Assuming that the differences between mesenchymal and proneural and classical phenotypes are simply a shift in the mean expression (e.g., there isn't an interaction between the phenotypes and the cell types), then heuristically you can think of those two columns adjusting the mesenchymal and proneural phenotypes to be equivalent to the classical phenotype, and then the within-cell type comparisons use all data to estimate the differences between the cell types.

But this does take phenotype into account, because you adjust for differences in the phenotype as part of the model.

This ignores any possible interaction (where for example a gene's expression goes up between cell type 1 and 2 for mesenchymal, and goes down between those cell types for classical). If there is an interaction, then ignoring that fact is as problematic as ignoring the phenotype.

ADD REPLY
0
Entering edit mode

Thanks so much for your reply, I really appreciate you taking the time to help me, but | have to admit that I didn't understand most of that (I'm a university student and this is the first time I've done anything like this).

Sorry I wasn't clear, I meant that I just want to see if there are any differences between the 2 cell lines (1 mesenchymal sample and 1 proneural) that are furthest apart in the MDS plot, irrespective of their subtype, as I have been told that if I can't pick up any significance for those then the power just isn't strong enough due to small sample size and patient variability. So let's say any difference between the subtype is negated, which I think is what you are saying in the second paragraph, then I would set up a contrast matrix comparing columns 2-6? But even if I did it that way, the results wouldn't really be valid right, as it's problematic?

Also, if the intercept is the mean, going back to my original matrix, it gives the below result. Would the intercept in this case be the mean of both mesenchymal cell lines? If it is, then for say DMRcate, can I set the coefficient as 1? Also when I use bumphunter with this matrix, I get 81211 bumps, but I guess I'm just really confused at what the intercept represents in this case and why it's the only column with significance and whether that is how it's meant to be.

I'm sorry for all the questions, but would be really grateful if you have the time to respond, thanks!

subtype <- factor(rep(c("Mesenchymal","Proneural","Classical"),2), levels = c("Mesenchymal","Proneural","Classical"))
       (Intercept) subtypeProneural subtypeClassical
design <- model.matrix(~ + subtype)
design

1           1                0                0
2           1                1                0
3           1                0                1
4           1                0                0
5           1                1                0
6           1                0                1

fit <- lmFit(mVals, design)
summary(decideTests(fit))

(Intercept) subtypeProneural subtypeClassical
Down        212395                0                0
NotSig      278049           812042           812042
Up          321598                0                0
ADD REPLY
0
Entering edit mode

In general you don't use model.matrix like that. I mean you can, and it obviously worked, but not having anything between the tilde and the plus sign relies on R figuring out that the plus sign shouldn't be there and ignoring it.

Computers are usually not that forgiving, so you should make sure you read and understand the help page for every function you use (boring, I know, but you are a self-professed newbie, and the only way to learn is by learning. If you plan to use Open Source software to any degree, you have to understand that it's not really free, and the cost is in your time to learn how to use it.)

Anyway, you should peruse the limma User's Guide, by which I mean read it through several times, because it covers all the stuff about fitting models and whatnot.

And yes, the intercept is the mean of the mesenchymal cell lines. Which is an uninteresting thing to measure. You have to remember that these measurements aren't really something with a meaningful interpretation. You have measured the fluorescent intensity of some probes that bound to a particular genomic region, and used those intensities to compute (I certainly hope) the log ratio of the methylated and unmethylated probes. Which you can compare to other such measurements for that same CpG, but which you cannot interpret alone.

Significance for the intercept is saying that there is some evidence that the M-value (the log ratio mentioned above) is different from zero, which is not the same as saying that there is evidence the methylation is different between any of the groups. The second and third columns tell you that, and as you can see it's still no.

ADD REPLY
0
Entering edit mode

Okay, thanks so much, that definitely helps a lot. I certainly underestimated how complicated this would be (my fault for being a naive student), but as my dissertation is due in a month I'm limited for how much time I can dedicate to fully understanding the functions, but I'll definitely try to get more familiar with the functions.

So basically, I can conclude that there isn't any difference between methylation between the subtypes, or at least the power isn't strong enough to detect them, or is it too early for that? I'm just worried that I haven't used enough methods to avoid potentially reporting a false negative (ChAMP using DMRcate and Probe Lasso, DMRcate separately, limma, minfi/bumphunter, but I know a lot of these methods overlap).

I'm just a bit confused as well because when I followed the minfi workflow it generated no significant DMPs (small p-values but large q-values) but bumphunter does return bumps. I know that DMRs don't necessarily have to contain DMPs but it doesn't seem right to have no significant DMPs but approx 70,000-80,000 bumps. So that means the bumps returned by bumphunter can't be significant either since I've used the same matrix with intercept? Although when I change to coefficient to 2 or 3 I do get a different number of bumps found. So maybe in this case bumphunter has better detection power than limma, but I'm unsure what to make of this cause I read the bumphunter vignette and I know for the statistical model I should really have replicates which I don't have, and I know that you set a cutoff when running the function but the output doesn't give me any p- or q-values or anything so I can check whether it's actually significant (or at least from what I gather).

Any advice on that? Once again I really appreciate it so thank you!

ADD REPLY
0
Entering edit mode

Bumphunter looks for things that look like bumps. That's all it does, and there will always be bumps! What you want to know is if a bump is in some sense larger/longer/higher than expected under the null distribution, and without any replication you cannot estimate the null distribution so it's impossible to say if any of your bumps are unlikely under the null (which you need to know in order to infer that they are a significant result).

Think of it this way. Say you wanted to know if people in Ohio are taller than people in Indiana. You go out and measure one person in Ohio and one person in Indiana, and the person in Ohio is 4" taller. What do you now know about the heights of people in those two states? Next to nothing. What if one person is a male and the other is a female? What if one grew up hungry and the other had all the best food, plus a personal trainer?

You are trying to do the same thing, except you are doing it over and over and over again, for like 850,000 CpGs. Does that sound like a super reasonable thing to do? Would you expect some (maybe lots) of those CpGs to have differences? If so, how would you ever decide if those differences are due to some underlying phenotype you care about rather than some other thing that you cannot account for? And how would you ever tell if the observed differences you see are larger in some sense than the random differences you expect when comparing two groups that may actually not be different at all?

ADD REPLY
0
Entering edit mode

Thanks so much for explaining that, it makes a lot more sense now! In that case, are those limitations things I could mention in my write up or should I just totally exclude the bumphunter results since I'm not using it with replicates so therefore it isn't really relevant to me? I'm guessing it's not worth putting in meaningless data right?

I guess I'll just have to concede to the fact that my results are not significant, as I cannot find any other packages that look suitable and user-friendly for the limited time I have (except RnBeads which is refusing to work for me), and I know that having no significant result is typical in science, but it's gonna be tricky to write up a whole scientific report without any actual data, whether significant or not (besides that from dmpFinder), but I guess I'll have to figure that part out on my own :)

ADD REPLY

Login before adding your answer.

Traffic: 935 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6