Question

I am having trouble recreating the linear regression results published in an Alzheimer's paper

0

Entering edit mode

ravelarvargas • 0

@ravelarvargas-17116

Last seen 4.8 years ago

I’m trying to recreate the results from this Alzheimer’s paper for further analysis (https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-016-0355-3)

Specifically, the supplementary table 4A reports genes significantly differentially expressed between brain donors that have Alzheimer’s vs. those that do not. The authors have corrected for covariates including sex, race, PMI, and pH for ~44,000 different probes (they did not remove half of the lowly expressed genes as is frequently suggested, as far as I’m aware).

I am having trouble recreating their results. Specifically, I am not getting any significant result when I look at the Inferior Frontal Gyrus differentially expressed genes when I look at low vs high Alzheimer’s clinical dementia rating (CDR). The problem is they are reporting 468 differentially expressed genes in the inferior frontal gyrus when they partition the data in the same way. I am hoping someone can help me pick up exactly what I am doing wrong.

I have two files, one is called patient and includes all of the patient covariates

The other file is the microarray expression data, called alzheimers Alzheimer's expression data

#The paper divided up patients into 'normal', 'low', and 'high' alzheimers (supplementary table 2 of paper) and I am dividing up the patients in the same way
CDR_low=patient%>%dplyr::filter(CDR%in%c(0.5, 1, 2))%>%mutate(alzheimers='low') 
CDR_high=patient%>%dplyr::filter(CDR%in%c(3, 4, 5))%>%mutate(alzheimers='high')

#bind patient tables together
CDR_low_high=rbind(CDR_low, CDR_high)

#filter out alzheimers table results that don't match the patient data
CDR_low_high_alzheimers=c()
CDR_low_high_alzheimers_colnames=c()
for (i in 1:ncol(alzheimers)){
  if (colnames(alzheimers)[i]%in%CDR_low_high$individualIdentifier){
    CDR_low_high_alzheimers=cbind(CDR_low_high_alzheimers, alzheimers[,i])
    CDR_low_high_alzheimers_colnames=rbind(CDR_low_high_alzheimers_colnames, colnames(alzheimers)[i])

  }
}

colnames(CDR_low_high_alzheimers)=CDR_low_high_alzheimers_colnames
rownames(CDR_low_high_alzheimers)=alzheimers$ID

#filter out patient table results that don't match the alzheimers data
CDR_low_high_present=c()
for (i in 1:nrow(CDR_low_high)){
  if (CDR_low_high$individualIdentifier[i]%in%colnames(CDR_low_high_alzheimers)){
    CDR_low_high_present=rbind(CDR_low_high_present, CDR_low_high[i,])
  }
}
CDR_low_high=CDR_low_high_present

rownames(CDR_low_high)=c()

rownames(CDR_low_high)=CDR_low_high$individualIdentifier
CDR_low_high=CDR_low_high%>%dplyr::select(PMI, pH, Sex, Race, Age, alzheimers)

# take log2 since the data is not normally distributed
CDR_low_high_alzheimers=log2(CDR_low_high_alzheimers)
rownames(CDR_low_high_alzheimers)=alzheimers$ID

#Design matrix
CDR_low_high_design <- model.matrix(~0+CDR_low_high$alzheimers+CDR_low_high$pH+CDR_low_high$Sex+CDR_low_high$Race+CDR_low_high$PMI)

colnames(CDR_low_high_design)=c('CDR_alzheimers_high', 'CDR_alzheimers_low', 'pH', 'sex_male', 'race_hispanic', 'race_white', 'PMI')

CDR_low_high_expression_fit <- lmFit(CDR_low_high_alzheimers, CDR_low_high_design)

CDR_low_high_expression_contr.matrix <- makeContrasts(
  CDR_alzheimers_low_CDR_alzheimers_normal = CDR_alzheimers_high-CDR_alzheimers_low, 
  levels = CDR_low_high_design)

CDR_low_high_expression_fit2 <- contrasts.fit(CDR_low_high_expression_fit, CDR_low_high_expression_contr.matrix)
CDR_low_high_expression_fit2 <- eBayes(CDR_low_high_expression_fit2)
topTable(CDR_low_high_expression_fit2, number=nrow(CDR_low_high_expression_fit2))

This is what my results look like: my differential gene expression results

this is what their results look like: their differential gene expression results

as you can see, they have a lot more significant results whereas I have none. Furthermore, even their log2FC values are different and I don't know what I am doing wrong or how I can replicate their results.

limma edger linear regression microarray differential gene expression • 1.2k views

ADD COMMENT • link 5.2 years ago ravelarvargas • 0

0

Entering edit mode

1) Did you contact the authors? They should provide code that you could study/use/improve.

2) You do not describe how you retrieve the data to run your script. Your script is long and hard to digest. I used GEOquery to obtain some of the data from GEO, and was able to do a sanity check of the assertion on TXNIP, which I think is the gene asserted to exhibit largest difference in the subset you are looking at.

ADD REPLY • link 5.2 years ago Vincent J. Carey, Jr. 6.7k

0

Entering edit mode

1) I contacted four of the authors but unfortunately they all directed me to one of the four authors I contacted who hasn't gotten back to me over a few weeks now. I don't know if he is on holiday or busy but I am at a standstill until I can figure out how they got their results.

2) I got the data from the data portal on synapse, which you need an account to access. Specifically I am using this file (https://www.synapse.org/#!Synapse:syn16779991). the patient file with all the covariates is here (https://www.synapse.org/#!Synapse:syn3205399). The data itself has already been processed from the CEL files.

ADD REPLY • link 5.2 years ago ravelarvargas • 0

1

Entering edit mode

I will not be looking at synapse, so cannot help with that. You might want to work from the GEO deposit, checking that it is consistent with the synapse content. If the datasets are consistent, you would be able to post a fully reproducible example based on the GEO data and then you might get more assistance here. This is a bit of how I started out.

library(GEOquery)
gg = getGEO("GSE84422")
table(gg[[2]][["brain region:ch1"]])

            Anterior Cingulate                Caudate Nucleus 
                            59                             52 
Dorsolateral Prefrontal Cortex                   Frontal Pole 
                            57                             63 
                   Hippocampus         Inferior Frontal Gyrus 
                            55                             53 
       Inferior Temporal Gyrus          Middle Temporal Gyrus 
                            58                             58 
       Occipital Visual Cortex          Parahippocampal Gyrus 
                            53                             60 
    Posterior Cingulate Cortex               Precentral Gyrus 
                            58                             49 
             Prefrontal Cortex                        Putamen 
                            56                             52 
      Superior Parietal Lobule        Superior Temporal Gyrus 
                            50                             60 
                 Temporal Pole 
                            58

ADD REPLY • link 5.2 years ago Vincent J. Carey, Jr. 6.7k

0

Entering edit mode

I've added a fully reproducible version here https://support.bioconductor.org/p/118615/.

Still no luck with the original question unfortunately

ADD REPLY • link 5.1 years ago ravelarvargas • 0