I’m trying to recreate the results from this Alzheimer’s paper for further analysis (https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-016-0355-3)
Specifically, the supplementary table 4A reports genes significantly differentially expressed between brain donors that have Alzheimer’s vs. those that do not. The authors have corrected for covariates including sex, race, PMI, and pH for ~44,000 different probes (they did not remove half of the lowly expressed genes as is frequently suggested, as far as I’m aware).
I am having trouble recreating their results. Specifically, I am not getting any significant result when I look at the Inferior Frontal Gyrus differentially expressed genes when I look at low vs high Alzheimer’s clinical dementia rating (CDR). The problem is they are reporting 468 differentially expressed genes in the inferior frontal gyrus when they partition the data in the same way. I am hoping someone can help me pick up exactly what I am doing wrong.
I have two files, one is called patient and includes all of the patient covariates
The other file is the microarray expression data, called alzheimers
#The paper divided up patients into 'normal', 'low', and 'high' alzheimers (supplementary table 2 of paper) and I am dividing up the patients in the same way
CDR_low=patient%>%dplyr::filter(CDR%in%c(0.5, 1, 2))%>%mutate(alzheimers='low')
CDR_high=patient%>%dplyr::filter(CDR%in%c(3, 4, 5))%>%mutate(alzheimers='high')
#bind patient tables together
CDR_low_high=rbind(CDR_low, CDR_high)
#filter out alzheimers table results that don't match the patient data
CDR_low_high_alzheimers=c()
CDR_low_high_alzheimers_colnames=c()
for (i in 1:ncol(alzheimers)){
if (colnames(alzheimers)[i]%in%CDR_low_high$individualIdentifier){
CDR_low_high_alzheimers=cbind(CDR_low_high_alzheimers, alzheimers[,i])
CDR_low_high_alzheimers_colnames=rbind(CDR_low_high_alzheimers_colnames, colnames(alzheimers)[i])
}
}
colnames(CDR_low_high_alzheimers)=CDR_low_high_alzheimers_colnames
rownames(CDR_low_high_alzheimers)=alzheimers$ID
#filter out patient table results that don't match the alzheimers data
CDR_low_high_present=c()
for (i in 1:nrow(CDR_low_high)){
if (CDR_low_high$individualIdentifier[i]%in%colnames(CDR_low_high_alzheimers)){
CDR_low_high_present=rbind(CDR_low_high_present, CDR_low_high[i,])
}
}
CDR_low_high=CDR_low_high_present
rownames(CDR_low_high)=c()
rownames(CDR_low_high)=CDR_low_high$individualIdentifier
CDR_low_high=CDR_low_high%>%dplyr::select(PMI, pH, Sex, Race, Age, alzheimers)
# take log2 since the data is not normally distributed
CDR_low_high_alzheimers=log2(CDR_low_high_alzheimers)
rownames(CDR_low_high_alzheimers)=alzheimers$ID
#Design matrix
CDR_low_high_design <- model.matrix(~0+CDR_low_high$alzheimers+CDR_low_high$pH+CDR_low_high$Sex+CDR_low_high$Race+CDR_low_high$PMI)
colnames(CDR_low_high_design)=c('CDR_alzheimers_high', 'CDR_alzheimers_low', 'pH', 'sex_male', 'race_hispanic', 'race_white', 'PMI')
CDR_low_high_expression_fit <- lmFit(CDR_low_high_alzheimers, CDR_low_high_design)
CDR_low_high_expression_contr.matrix <- makeContrasts(
CDR_alzheimers_low_CDR_alzheimers_normal = CDR_alzheimers_high-CDR_alzheimers_low,
levels = CDR_low_high_design)
CDR_low_high_expression_fit2 <- contrasts.fit(CDR_low_high_expression_fit, CDR_low_high_expression_contr.matrix)
CDR_low_high_expression_fit2 <- eBayes(CDR_low_high_expression_fit2)
topTable(CDR_low_high_expression_fit2, number=nrow(CDR_low_high_expression_fit2))
This is what my results look like:
this is what their results look like:
as you can see, they have a lot more significant results whereas I have none. Furthermore, even their log2FC values are different and I don't know what I am doing wrong or how I can replicate their results.
1) Did you contact the authors? They should provide code that you could study/use/improve.
2) You do not describe how you retrieve the data to run your script. Your script is long and hard to digest. I used GEOquery to obtain some of the data from GEO, and was able to do a sanity check of the assertion on TXNIP, which I think is the gene asserted to exhibit largest difference in the subset you are looking at.
1) I contacted four of the authors but unfortunately they all directed me to one of the four authors I contacted who hasn't gotten back to me over a few weeks now. I don't know if he is on holiday or busy but I am at a standstill until I can figure out how they got their results.
2) I got the data from the data portal on synapse, which you need an account to access. Specifically I am using this file (https://www.synapse.org/#!Synapse:syn16779991). the patient file with all the covariates is here (https://www.synapse.org/#!Synapse:syn3205399). The data itself has already been processed from the CEL files.
I will not be looking at synapse, so cannot help with that. You might want to work from the GEO deposit, checking that it is consistent with the synapse content. If the datasets are consistent, you would be able to post a fully reproducible example based on the GEO data and then you might get more assistance here. This is a bit of how I started out.
I've added a fully reproducible version here https://support.bioconductor.org/p/118615/.
Still no luck with the original question unfortunately