Hi, I want to pre-face that this might be a long read.
I would like to re-create data from a dataset to learn how to perform differential analysis with limma. My main goal is to re-create Fig. 2 from the cited study (https://onlinelibrary.wiley.com/doi/10.1002/cam4.1187). I'm extremely new to this but have compiled code from tutorials, etc. that will read the files and perform rma normalization, then use limma to look for differential expression. At-least that's what I understand to be happening.
(btw I used GSM2572161 through GSM2572170 for .cel files (HSC vs CML))
I will put the code below:
library(oligo)
library(dplyr)
setwd("C:fake-example/GSE/GSE97562")
celFiles <- list.celfiles(full.names = TRUE)
affyExpressionFS <- read.celfiles(celFiles)
#RMA
eset <- oligo::rma(affyExpressionFS)
data.matrix = exprs(eset)
library(limma)
#Make design
design = model.matrix(~0+factor(c(1,1,1,1,1,2,2,2,2,2)))
design[,2] = c("cmlS","cmlS","cmlS","cmlS","cmlS","cmlP","cmlP","cmlP","cmlP","cmlP")
colnames(design)[2]="source"
design = as.data.frame(design)
groups = design$source
#Factor
f = factor(groups,levels=c("cmlS","cmlP"))
design2 = model.matrix(~ 0 + f)
colnames(design2) = c("cmlS","cmlP")
#Fit to model
data.fit = lmFit(data.matrix,design2)
contrast.matrix = makeContrasts(cmlP- cmlS,levels=design2)
data.fit.con = contrasts.fit(data.fit,contrast.matrix)
data.fit.eb = eBayes(data.fit.con)
#Make table with expression data
options(digits=2)
tab = topTable(data.fit.eb,coef=1,number=Inf,adjust.method="BH")
First, want to apologize for the ugly code. From what I understand this is a very basic run-through of differential analysis. The output of tab is 33,927 genes/probes (since I never annotated). The data I get back after keeping "genes" that follow the study's filters (fold change greater than 2 and p-value less than .05) is 50 genes (they get 584). What I believe has not been done is removal of mappings to the same gene ID. I also realize that I have not cutoff "low counts". I'm not sure what else could account for such differences. I have attempted annotating these probes but have had very little luck with that. I realize that it's much to ask for a tutorial and I feel that I might not be completely understanding what I'm doing, but was hoping someone could answer these questions:
Is it possible to convert .cel files immediately to .csv or .txt? (Something I'm able to look at and visualize would be extremely helpful) (Such as in RNA-seq data)
How would you go about removing "low counts"? Meaning, how would you find the sample cutoff and then set that as the threshold for the "counts".
When should annotation be done? (Many questions I found seem to do it after rma normalization)
Could you point me to a tutorial for annotating the probeID's? ( The affyExpressionFS outputs this : Annotation: pd.hugene.1.0.st.v1 )
-I tried using annotateEset which seemed to work, genuinely don't know how to check the mappings after this function though.
How do I remove the NA values after annotating the probes?
I appreciate any help/advice! Thank you!
I forgot this step
Which allows you to also make other comparisons if you care to, by adding them to the call to
makeContrasts
.Thank you! This helps so much I really appreciate it. One quick question I had about the data though.
When I run your analysis with cutting out completely the files from samples I don't want, (so only reading in GSM2572161 through GSM2572170) I return with 95 DEG's. This might sound silly, but do I need to be normalizing with full datasets, then reducing my sample size to the ones I want? Not sure why, but I thought this would skew my results.
Here is the only part I change about the code besides limiting the files to samples I want. I only change the design matrix to fit the design of my 10 samples
Thank you for the help!
It's not the normalization that produces the difference, it's the degrees of freedom. By fitting a model with all the data, I have 32 degrees of freedom whereas your model only has 8. This increases power because the standard error will be smaller. As a fake example:
The first model uses all 40 observations, and the second model uses only the first 10.
Note that the first two rows have identical estimates (the second row being CML_stem - NBM_stem, which is the comparison you are interested in), but the standard error is smaller for the larger model, so the t-statistic is larger. And in addition, the p-value is based on 32 df for the full model and only 8 for the reduced model.
You will also gain power because of the
eBayes
step, which will estimate a more accurate variance prior with all 40 samples instead of 10.