Question: HTA 2.0 Differential expression analysis - design question
0
4 months ago by
RV10
RV10 wrote:

Hi, I have 1500 HTA 2.0 affy human microarray CEL files. These file are a combination of treatment, placebo and healthy controls at 4 different times (weeks). They are not paired samples. Hence I started out doing the following - 1- Perform batch QC using Affy power tools 2- Perform sample wise QC 3- Batch processing of arrays and normalization using rma.

eset_norm <- oligo::rma(raw_data, target = "core")


4- Combine all the batches of normalized eset and annotate using pd.hta.2.0

eset_norm_anno <- annotateEset(eset_norm ,pd.hta.2.0, columns = c("PROBEID", "ENTREZID", "SYMBOL", "GENENAME"), multivals = "list")


5- Filter probe id that are related to main gene expression

eset_norm_anno_main <- getMainProbes(eset_norm_anno )


6- Filtered probe ids using manual threshold in order to remove all the low expression probes across samples. 7- Filtered probe ids that were not annotated, i.e which had gene symbol /ID as "NA". Ended up with 34886 probe ids (Started with 70k probes after core rma norm). 8- set up design matrix - I would like to look at a list of differential expressed gene based on treatment and time effect. In order to figure out genes that are expressed deferentially per dosage at a given time point, I set up the following design matrix and develop contrast matrix that will compare DrugA.Dose1.WEEK0 and PLACEBO.WEEK0 and so forth.

design = model.matrix(~0+groups)
colnames(design)
# colnames(design)
#[1] "DrugA.Dose1.WEEK0"  "DrugA.Dose1.WEEK6" "DrugA.Dose1.WEEK1"  "DrugA.Dose1.WEEK8"
#[5] "DrugA.Dose1.WEEK2"  "DrugA.Dose2.WEEK0"  "DrugA.Dose2.WEEK6" "DrugA.Dose2.WEEK1"
#[9] "DrugA.Dose2.WEEK8" "DrugA.Dose2.WEEK2"  "CONTROL.WEEK0.HEALTHY"  "PLACEBO.WEEK0"
#[13] "PLACEBO.WEEK6"         "PLACEBO.WEEK1"          "PLACEBO.WEEK8"         "PLACEBO.WEEK2"


9- Setting up analysis using Limma -

contrast = makeContrasts( Dose1week0=DrugA.Dose1.WEEK0-PLACEBO.WEEK0)
data.fit = lmFit(eset_norm_anno_main_final,design)
data.fit.con = contrasts.fit(fit = data.fit,contrasts = contrast)
data.fit.eb = eBayes(fit = data.fit.con)
results <- decideTests(data.fit.eb, adjust.method = "BH", lfc = 1)


Can you please tell me if the design and my process is right ? My ultimate list represents a list of probeids that are mapped to a gene id with statistics and p values associated. Do I have to aggregate the result based on Gene symbols to understand what Genes have been regulated or the result with transcripts just fine to present and compare ?

A quick check will help me very much.

Thanks, RV

annotation affy limma hta • 206 views
modified 4 months ago by James W. MacDonald50k • written 4 months ago by RV10
Answer: HTA 2.0 Differential expression analysis - design question
3
4 months ago by
United States
James W. MacDonald50k wrote:

Assuming your call to makeContrasts is just a stylized version of what you actually did (you presumably have more than one contrast of interest, but obviously the entire call to makeContrasts would be boring), it looks OK. BUT you probably don't want to use the lfc argument to either topTable nor decideTests. Those arguments are doing a post hoc filtering of your results, which is not really copacetic for a couple of reasons, and still exist mostly for backward compatibility.

If you want to require that the fold change be larger than a certain value, you should use treat instead of eBayes, which will incorporate the fold change into the inference, rather than as a post hoc test that you stapled on at the end. You almost surely want to use something smaller than 1 for treat, however, as you are asking for evidence that the underlying population fold change exceeds a constant, rather than asking that about the sample fold change (where the former incorporates information about how accurately you might be estimating population parameters, and the latter does not).

Also, it's not clear what you mean by 3- Batch processing of arrays and normalization using rma. That looks to me like you might have done something in separate batches (like running rma separately on different groups?), which isn't really what you should be doing. Ideally you are processing all the samples at once, although you may not have sufficient RAM to do that. If you do have to run rma in batches, you should ensure that those batches are orthogonal to the questions at hand (e.g., make sure that each batch has a relatively equal number of all the treatment/time combinations), and also to do some PCA plots, etc at the end to ensure that there aren't any obvious technical effects being introduced.

Hi James,

Thank you! I have a couple more follow up questions and it will great to know your thoughts.

Assuming your call to makeContrasts is just a stylized version of what you actually did (you presumably have more than one contrast of interest, but obviously the entire call to makeContrasts would be boring), it looks OK. BUT you probably don't want to use the lfc argument to either topTable nor decideTests. Those arguments are doing a post hoc filtering of your results, which is not really copacetic for a couple of reasons, and still exist mostly for backward compatibility.

That is right, I did not incorporate all the contrasts. I loop over a list of contrasts and calculate contrast.fit and ebayes. As far as looking at genes deferentially expressed between doses, I should be doing something like this right - contrast = makeContrasts( Dose1vDose2_week0=(DrugA.Dose1.WEEK0-PLACEBO.WEEK0)-(DrugA.Dose2.WEEK0-PLACEBO.WEEK0)) ? I do not think it is a straight up comparison DrugA.Dose1.WEEK0-DrugA.Dose2.WEEK0 .

Secondly, I am merely trying to filter the list of genes that are significant based on adj p-value <= 0.05 and lfc >=2. SInce DecideTests already had a defulat filter of 0.05, I included the lfc. From your comment, I need not use it but later filter the result I get from topTable separately. Am I right in understanding that ?

Finally, I did want to normalize it all and I used a linux cluster with 250GB RAM and also tried it in one of the x1 ec2 instance with 1TB RAM and I was getting NaNs when I tried analyzing the entire data set, Hence I randomly divided the samples into 7 buckets making sure I pick equal amounts from each of the label and normalize. I did two more iterations of the same thing and I plot PCA to see that there was no significant batch effects based on processing dates of these samples. Thank you for pointing that out.

The contrast

(DrugA.Dose1.WEEK0-PLACEBO.WEEK0)-(DrugA.Dose2.WEEK0-PLACEBO.WEEK0)



reduces to

DrugA.Dose1.WEEK0-DrugA.Dose2.WEEK0



because you are subtracting the same thing in both parentheses. Remember that you are just doing (linear) algebra here, and if you can reduce the formula to something simpler, well, it's already the simpler thing.

For your second question, you don't want to filter using the lfc argument, like ever. Or by hand. Use the treat function instead, and use something smaller than 1 as the fold change cutoff. That's the short story.

The long story is that here are a couple of reasons for this. The first reason has to do with what a p-value really is. The p-value actually has very little to do with your data. You use your data to estimate the differences between the two groups and to get an estimate of the variance, and then generate the t-statistic. You then use the sample variance to define the null distribution, which is the set of t-statistics you would expect to see, given the number of samples and the estimated variance under the assumption that there are no differences between the two groups.

If you have a p-value of 0.05, then you expect to see a t-statistic that large even when there are no differences between two groups of that size, about 5% of the time. So you have either observed a pretty rare occurrence, or there really are differences between the two groups. But this is a probabilistic argument - you don't know the truth - where you (should) say that you have some evidence that the groups are different, and you can say exactly what that statement is based on.

If you add in an additional fold change criterion, then you can't really interpret the p-value the same way any longer, because it has now been conditioned on the sample fold change. It's also sort of weird, because the p-value is based on a population estimate, and the sample fold change is based on a subset of that population. So that's problem #1.

Given the definition of the p-value (the expected proportion of statistics, under the null, that exceed a certain cutoff), if you do say 10,000 comparisons and none are really different, you expect about 500 to have a p < 0.05. Because that is the definition of the p-value. So you probably want to adjust for the fact that you have made so many simultaneous comparisons, in order to 'get rid of' all the false positives. One way to do that is to use FDR, which adjusts the p-value in such a way that you can then make statements about a set of results, rather than one at a time. So if you select genes based on an FDR < 0.05, the interpretation is that you expect the maximum percentage of false positives in that set to be about 5%. But if you add in an extra fold change criterion, then the FDR is no longer interpretable either, because you have removed some of the genes that were in that set, so the maximum false positive rate is somewhat less than 5%, but actually unknowable.

Using the treat function fixes all these issues, because it modifies the underlying test from 'how often should I expect to see a t-statistic as large or larger than my observed statistic if there really are no differences between the two groups' to 'how often should I expect to see a t-statistic as large or larger than my observed statistic if the differences between the two groups is less than some constant?' By incorporating the fold change into your test, the p-values remain interpretable, as do the FDRs.

But if you add in an extra fold change criterion, then the FDR is no longer interpretable either, because you have removed some of the genes that were in that set, so the maximum false positive rate is somewhat less than 5%, but actually unknowable.

It's actually worse than this. To give an extreme case:

set.seed(192837)
y1 <- matrix(rnorm(4000, mean=0, sd=2), ncol=4) # low mean, high var
y2 <- matrix(rnorm(4000, mean=rep(c(0,1), each=2), sd=0.1),
ncol=4, byrow=TRUE) # DE with higher mean, low var

# Simple two-group design.
design <- model.matrix(~gl(2, 2))

# limma analysis.
library(limma)
fit <- lmFit(rbind(y1, y2), design)
fit <- eBayes(fit, trend=TRUE)
res <- topTable(fit, n=Inf, sort.by="none")

# Observed FDR without logFC filtering:
is.sig <- res$adj.P.Val <= 0.05 sum(head(is.sig, 1000))/sum(is.sig) # first 1000 are non-DE. ## [1] 0.0458891 # With logFC filtering: is.lfc <- abs(res$logFC) > 1
is.lfc.sig <- is.lfc & is.sig
## [1] 0.08348135

is.lfc <- abs(res\$logFC) > 1.5
is.lfc.sig <- is.lfc & is.sig
## [1] 1


So you can see that the actual FDR after filtering on the log-fold change is higher than the specified FDR threshold! This is because, in the presence of many highly variable but non-DE genes, the log-fold change filter actually enriches for false positives. Highly variable genes are more likely to achieve large log-fold changes by chance - a fact that is considered during calculation of the p-value but not if you apply a filter to the log-fold change estimate.