Search
Question: Non-used samples changing number of significant genes in limma
0
9 months ago by
Jake60
United States
Jake60 wrote:

Hi,

I performed polysome profiling to measure translation rate in wildtype and knockout cells. For the polysome samples I collected a monosome fraction, a low polysome fraction, and a high polysome fraction. I also collected mRNA for both wildtype and knockout. I'm interested in comparing translation changes between knockout and wildtype (and most interested in the change in high polysome relative to changes in mRNA).

I'm using limma to find differences in the ratio of (ko high / ko mRNA) / (wt high / wt mRNA). I initially input all of my samples (mRNA, monosome, low, high) into limma and only pulled out the ratio that I was interested in. However, I noticed that the samples I put in changes the number of significant genes a lot even when I only pull out the ko vs wt high vs mRNA ratio:

 input samples: # significant changes (ko high / ko mRNA) / (wt high / wt mRNA): mRNA/mono/low/high 643 mRNA/mono/high 1176 mRNA/low/high 1186 mRNA/high 1228

I know that voom takes the variation across samples into account is that the reason that I'm losing significance the more sample types that I put in? Looking at the samples I don't think there are any outliers that are increasing overall variance.

This is the code that I'm using:

layout <- data.frame(row.names = colnames(count_mat),
type = colnames(count_mat),
condition = colnames(count_mat),
stringsAsFactors = F)
layout[grep('high', colnames(count_mat)), ]$type <- 'high' layout[grep('low', colnames(count_mat)), ]$type <- 'low'
layout[grep('mono', colnames(count_mat)), ]$type <- 'mono' layout[grep('mRNA', colnames(count_mat)), ]$type <- 'mRNA'
layout[grep('ko', colnames(count_mat)), ]$condition <- 'ko' layout[grep('wt', colnames(count_mat)), ]$condition <- 'wt'
limmaModel <- paste(layout$condition, layout$type, sep = '.')
limmaModel <- factor(limmaModel)
print(layout)

design <- model.matrix(~0 + limmaModel)
colnames(design) <- levels(limmaModel)
rownames(design) <- rownames(layout)

count_mat_filtered <- count_mat[rowSums(cpm(count_mat) > 5) >= 3, ]

# Normalize to log2(CPM + 0.5) with voom
d <- DGEList(count_mat_filtered, genes = rownames(count_mat_filtered))
d <- calcNormFactors(d)
v <- voom(d, design, plot = T)

# Differential expression
fit <- lmFit(v, design)
cont.matrix <- makeContrasts(
interaction_high_mrna = (ko.high - ko.mRNA) - (wt.high - wt.mRNA),
levels = design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2)

interaction_high_mrna <- topTable(fit2, coef = 'interaction_high_mrna', number = nrow(count_mat_filtered))

sum(interaction_high_mrna\$adj.P.Val < 0.05)

R version 3.4 edgeR_3.18.1 limma_3.32.9

modified 9 months ago • written 9 months ago by Jake60
0
9 months ago by
The Scripps Research Institute, La Jolla, CA
Ryan C. Thompson6.8k wrote:

Remember that limma is estimating a single value of the variance for each gene based on all the samples that are provided to it. This same variance value is then used for all tests involving that fitted model. Of course adding more samples also increases the precision of the estimated variance, which should improve your statistical power, all else being equal. If adding in more samples is causing you to get fewer significant genes, the most likely explanation is that the samples you added in have a higher variance than the samples that were already in the model. This should show up clearly on the voomWithQualityWeights plot. To make this more obvious, you could force voomWithQualityWeights to estimate a single weight for each group by passing your design as the var.design argument (in addition to the regular design). Keep in mind that a higher variance should result in a lower weight, so you should be looking for the mono and low groups to have low weights.

I tried it with both voomWithQualityWeights and voomWithQualityWeights(var.design = design). This is what the plot looks like for the second one: https://imgur.com/HDhOUhQ. The order is:

1. ko 1 high
2. ko 1 low
3. ko 1 mono
4. ko 2 high
5. ko 2 low
6. ko 2 mono
7. wt 1 high
8. wt 1 low
9. wt 1 mono
10. wt 2 high
11. wt 2 low
12. wt 2 mono
13. ko 1 mRNA
14. ko 2 mRNA
15. ko 3 mRNA
16. ko 4 mRNA
17. wt 1 mRNA
18. wt 2 mRNA
19. wt 3 mRNA

In this case would you recommend just inputing the samples that I am interested in directly comparing?

Thanks

This plot doesn't seem to show the low or mono samples as having any worse quality than the high samples, although the mRNA samples seem to have a higher average quality than almost anything else. And even if they did, I would expect the quality weights to account for that. I'm afraid I don't know enough to diagnose the problem based on this plot.

However, taking a closer look at the counts of significant genes in the table you gave, the only real difference is in the top row. The difference between 1176 genes and 1228 is negligible. So adding in either the mono or the low samples has a minimal effect, but adding both of them in the same model has a drastic effect. This seems very counterintuitive, and if I saw this effect in my data, I would suspect that I made a mistake coding my model for the first row and would double-check my code. For example, you could have accidentally treated low and mono as a single group instead of 2 separate groups in the design, thus inflating the variance estimates. I doubt that's the exact answer, though, since your quality weight plot clearly shows different weights for those two groups.