Question

Limma continuous design (association polygenic risk score with gene expression)

0

Entering edit mode

e.vanzanten.1 • 0

@237afa63

Last seen 24 days ago

Netherlands

Hi all,

I've been using limma with Illumina BeadChip files, already normalized by the quantile method. I have data from 677 cell lines and the aim is to determine the association between their polygenic risk scores and gene expression. In the end, I want to identify genes of which the expression associates with polygenic risk scores for several (8) traits I selected. In the model, I correct for several covariates, amongst other things the age of the individuals the cell lines are derived from. However, age is not available for all samples, which is why I did a sensitivity analysis for age for the samples (n=442) that do have age available. So in the end I have 3 analyses:

The analysis I want to use for my paper (n=677 cell lines), not correcting for age
The sensitivity analysis model 1 (n=442 cell lines), correcting for age
The sensitivity analysis model 2 (n=442 cell lines), not correcting for age (to allow direct comparison with sensitivity analysis model 1).

I got some unexpected results, in that the sensitivity analysis model 2 resulted in way more DEGs than my original analysis, even though this involves less samples and both are not corrected for age. I already checked for the influence of outlier samples by running the method="robust" in the lmFit, but still the number of DEGs in the sensitivity analysis model 2 is in the thousands, while the number of DEGs in my original model is hundreds, so there's a very large difference. This can just be that the datasets inherently differ as one of them has more samples than the other, but I just want to make sure my code makes sense so I can rule that out. I would therefore greatly appreciate your input!

PS. I'm now just checking the results when running the model without any covariates, so just with the PRS, and the lmFit step takes way longer with just the PRS in the design when compared with PRS + covariates, which also seems weird to me.


###Take input arguments from bash command line (all disorder names) 
args <- commandArgs(trailingOnly=TRUE)
disorder <- args[1]

###Paths (removed literal names) 
wd  <-  "working directory "
d_out <- "output directory"
p_info <- "info file containing PRS per disorder, filenames of samples to read in, covariate information and sample replicate information"



# Create directories
if (!dir.exists(d_out)) dir.create(d_out)
setwd(wd)

## Read and clean info file and add "d" column to later block for technical replicate correlation
info <- read.table(p_info, header = TRUE, sep = " ") %>%
  mutate(d = as.numeric(as.factor(sub("\\d+(\\D*)$", "\\1", Name))))

# Read in expression files based on filename listed in column named "all"
y <- limma::read.ilmn(info$all, 
                      probeid = "PROBE_ID",
                      annotation = c("TargetID", "SYMBOL"),
                      expr = "AVG_Signal",
                      other.columns = "Detection",
                      sep = "\t")

# Filter for "truly expressed" genes based on detection above background (in my case, higher detection p-value means higher likelihood to be truly detected above background)
expressed <- rowSums(y$other$Detection > 0.95) >= ncol(y)/2
y <- y[expressed,]

#Log2 transformation + adding gene names as rownames (created new EList because my samples were normalized already, but not log transformed)
      E <- new("EList")
      E$E <- log2(y$E)
      E$genes <- y$genes
      E$genes$EntrezID <- AnnotationDbi::mapIds(org.Hs.eg.db, 
                                                y$genes$SYMBOL, 
                                                keytype = "SYMBOL", 
                                                column = "ENTREZID")
      rownames(E$E) <- E$genes$TargetID 


      #Make formula (this is without age as covariate, the other covariates are related to measures of pluripotency of my cell lines (they're iPSC lines) and potential batch effects + PC's) 
      covariates <- c("Sex","Method_of_derivation","Pluripotency_Score","Novelty_Score", "Open.access.data",paste0("PC",1:4))
      determinant <- paste0(disorder,"_PRS")
      base_formula <- paste("~", determinant, "+", paste(covariates, collapse= "+"))                     

      #Design matrix
      design  <-  model.matrix(as.formula(base_formula),data=info)

      #Run lmFit, hardcoded the average replicate correlation.

      fit <- limma::lmFit(E, design, block = info$d, correlation = 0.6320562, method="robust")
      fit2 <- limma::eBayes(fit)

      #Post-processing 
      sfit <- summary(decideTests(fit2,method="global"))
      z <- limma::topTable(fit2, coef = determinant, n = nrow(fit2$coefficients))

limma • 808 views

ADD COMMENT • link updated 5 months ago by Gordon Smyth 50k • written 6 months ago by e.vanzanten.1 • 0

score 0 · Answer 1 · 2023-10-23

0

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 56 minutes ago

WEHI, Melbourne, Australia

I can't check R code for complex data analyses, especially when I am not familiar with the datasets, but there are some non-standard aspects to your code that don't seem to follow.

It is not possible to combine robust regression (method="robust") with blocking and correlation. If you set method="robust", then block and correlation will be ignored, which will artificially inflate the number of DE genes because the technical replicates will be treated as full replicates.
Robust regression is intended to allow for individual outlier observations rather than outlier samples. Outlier samples are much better handled by arrayWeights(). While limma does offer robust regression as an option, I don't recommend or use it myself. Much better usually would be to set trend=TRUE and robust=TRUE at the eBayes() step.
There is a disconnect between the first half of your code (reading the data) and the second half of the code (annotation and DE analysis). After running read.ilmn() the recommended next step would be neqc(), after which the data would be background corrected, log-transformed and normalized. The second half of the code however assumes that the data is normalized but not log-transformed. That seems very strange and the code used to undertake this normalization is not shown.
The mapIds code is assumes all the SYMBOL values are still current gene symbols rather than aliases. I would prefer to use code that can map gene aliases as well as official symbols to Entrez Ids.
The expression filtering is far more stringent than we recommend.
You are hardcoding a block correlation, but there is no indication of how you estimated that correlation.
If lmFit takes longer with just the PRS in the design when compared with PRS + covariates, that is a sure sign that you have changed other settings as well, for example block or method.

ADD COMMENT • link 6 months ago Gordon Smyth 50k

0

Entering edit mode

Hi Gordon,

Thank you for your reply, it is very helpful. Some comments to clarify:

About the log transformation/background correction and normalization: so the files are generated by a consortium, and already pre-processed to some extent. On their website, it says they are "normalized by the quantile method", but it was also mentioned that they performed multiple loess for normalization, so it's a bit confusing. I was assuming that that also involved a background correction, but they have not mentioned that specifically. I could tell from the expression values (which were in the order of hundreds) that they were not log-transformed, which is why I did that myself. So I am now thinking of downloading their unnormalized files (although they do not have any control probe files, so not sure how that would work out). In the end, this is the reason I did not use neqc(), because I feel like that would perform redundant normalization. I can still do backgroundCorrect(method="normexp"), but I'm not sure if background correction can be performed after normalization...
I will add trend=TRUE & robust=TRUE to the eBayes command to correct for outliers
I adjusted the expression filtering to a detection p-value > 0.95 for more than 25% of samples (instead of 50%), which is in line with the limma usersguide.
About the block correlation: I estimated this with the duplicateCorrelation(y, design=design, block=info$d) command, for 10.000 randomly selected probes (otherwise this would take very long). I now realize that this correlation may vary because of the design matrix? I don't fully understand that, I thought it would render the average expression correlation between technical replicates, which to me is irrespective of how the design matrix looks... But I may be wrong.

Thank you again, your tips are very much appreciated!

ADD REPLY • link 6 months ago e.vanzanten.1 • 0

0

Entering edit mode

the files are generated by a consortium, and already pre-processed to some extent

It seems unusual that they would normalize the data and then put it back into Illumina format files that can be read by read.ilmn(). But I guess anything is possible.

I am now thinking of downloading their unnormalized files (although they do not have any control probe files, so not sure how that would work out)

neqc will work fine on raw files with or without the control probe files. In the absence of the control probes, neqc will infer the control probe distribution from the detection p-values.

I thought it would render the average expression correlation between technical replicates, which to me is irrespective of how the design matrix looks...

No, that isn't how duplicate correlation works. Within-block correlation is relative to regular replicate variability. duplicateCorrelation estimates whether the technical replicates are more similar than regular replicates, so it has to measure the variance of regular replicates, and it can only do that by fitting a linear model with a design matrix and obtaining the residuals. If you omit the design matrix, then the residual variance will be too high, so the ratio of technical variance to regular variance will be too small and the within-technical-replicate correlation will be over-estimated.

ADD REPLY • link 6 months ago Gordon Smyth 50k

0

Entering edit mode

Hi Gordon,

Thank you for all your help. I updated my code and re-ran it, but I now have a different question related to permutation testing. From the limma analysis, I got a few DEGs, and I now want to test whether these could have occurred under the null distribution. Given that I have replicates, I performed a clustered shuffling of my expression values (e.g. say we have sample A, sample A, sample A (3 replicates), sample B, sample B (2 replicates), sample C, sample C, sample C (3 replicates) and sample D, sample D (2 replicates), then the expression values of sample A would be shuffled with sample C, and the values of sample B would be shuffled with the values of sample D. I kept the metadata the same so that we preserve the original information, but the association with different expression array data compared to the original is tested (to me that would then be the null distribution of no association).

I then do use the shuffled metadata to make sure that the blocking is done correctly for the shuffled samples. I now performed this permutation analysis by shuffling 1000 times, and oftentimes I get 0 DEGs (as expected) but also sometimes suddenly thousands of DEGs, which is why I am a bit confused about my approach.

I did read your reply on this post, but I still do not see how the approach above would be invalid since I just shuffle my samples in a clustered way and therefore account for independence of samples, and I also feel like this approach tests the correct null hypothesis. Could you give me some insights? Thanks again!

ADD REPLY • link 5 months ago e.vanzanten.1 • 0

0

Entering edit mode

I don't know what you expect me to say. You apparently don't trust the limma result and instead implement a permutation method that you already know I don't agree with. Permutation produces a null distribution only on average, not for each permute, and even then the null distribution that it generates is not the one relevant to limma. It is to be expected that some of the permuted datasets will generate DE results. Anyway, the permutation testing is your method so it is your responsiblity to solve any problems you have with it. I can only help with limma questions.

ADD REPLY • link 5 months ago Gordon Smyth 50k