Question: Non-used samples changing number of significant genes in limma
gravatar for Jake
12 months ago by
United States
Jake60 wrote:


I performed polysome profiling to measure translation rate in wildtype and knockout cells. For the polysome samples I collected a monosome fraction, a low polysome fraction, and a high polysome fraction. I also collected mRNA for both wildtype and knockout. I'm interested in comparing translation changes between knockout and wildtype (and most interested in the change in high polysome relative to changes in mRNA). 

I'm using limma to find differences in the ratio of (ko high / ko mRNA) / (wt high / wt mRNA). I initially input all of my samples (mRNA, monosome, low, high) into limma and only pulled out the ratio that I was interested in. However, I noticed that the samples I put in changes the number of significant genes a lot even when I only pull out the ko vs wt high vs mRNA ratio:

input samples: # significant changes (ko high / ko mRNA) / (wt high / wt mRNA):
mRNA/mono/low/high 643
mRNA/mono/high 1176
mRNA/low/high 1186
mRNA/high 1228

I know that voom takes the variation across samples into account is that the reason that I'm losing significance the more sample types that I put in? Looking at the samples I don't think there are any outliers that are increasing overall variance.

This is the code that I'm using:

layout <- data.frame(row.names = colnames(count_mat),
                     type = colnames(count_mat),
                     condition = colnames(count_mat),
                     stringsAsFactors = F)
layout[grep('high', colnames(count_mat)), ]$type <- 'high'
layout[grep('low', colnames(count_mat)), ]$type <- 'low'
layout[grep('mono', colnames(count_mat)), ]$type <- 'mono'
layout[grep('mRNA', colnames(count_mat)), ]$type <- 'mRNA'
layout[grep('ko', colnames(count_mat)), ]$condition <- 'ko'
layout[grep('wt', colnames(count_mat)), ]$condition <- 'wt'
limmaModel <- paste(layout$condition, layout$type, sep = '.')
limmaModel <- factor(limmaModel)

design <- model.matrix(~0 + limmaModel)
colnames(design) <- levels(limmaModel)
rownames(design) <- rownames(layout)

# Filter low read counts
count_mat_filtered <- count_mat[rowSums(cpm(count_mat) > 5) >= 3, ]

# Normalize to log2(CPM + 0.5) with voom
d <- DGEList(count_mat_filtered, genes = rownames(count_mat_filtered))
d <- calcNormFactors(d)
v <- voom(d, design, plot = T)

# Differential expression
fit <- lmFit(v, design)
cont.matrix <- makeContrasts(
  interaction_high_mrna = (ko.high - ko.mRNA) - (wt.high - wt.mRNA),
  levels = design)
fit2 <-, cont.matrix)
fit2 <- eBayes(fit2)

interaction_high_mrna <- topTable(fit2, coef = 'interaction_high_mrna', number = nrow(count_mat_filtered))

sum(interaction_high_mrna$adj.P.Val < 0.05)


R version 3.4 edgeR_3.18.1 limma_3.32.9


ADD COMMENTlink modified 12 months ago • written 12 months ago by Jake60
gravatar for Ryan C. Thompson
12 months ago by
The Scripps Research Institute, La Jolla, CA
Ryan C. Thompson6.9k wrote:

Remember that limma is estimating a single value of the variance for each gene based on all the samples that are provided to it. This same variance value is then used for all tests involving that fitted model. Of course adding more samples also increases the precision of the estimated variance, which should improve your statistical power, all else being equal. If adding in more samples is causing you to get fewer significant genes, the most likely explanation is that the samples you added in have a higher variance than the samples that were already in the model. This should show up clearly on the voomWithQualityWeights plot. To make this more obvious, you could force voomWithQualityWeights to estimate a single weight for each group by passing your design as the argument (in addition to the regular design). Keep in mind that a higher variance should result in a lower weight, so you should be looking for the mono and low groups to have low weights.

ADD COMMENTlink written 12 months ago by Ryan C. Thompson6.9k

I tried it with both voomWithQualityWeights and voomWithQualityWeights( = design). This is what the plot looks like for the second one: The order is:

  1. ko 1 high
  2. ko 1 low
  3. ko 1 mono
  4. ko 2 high
  5. ko 2 low
  6. ko 2 mono
  7. wt 1 high
  8. wt 1 low
  9. wt 1 mono
  10. wt 2 high
  11. wt 2 low
  12. wt 2 mono
  13. ko 1 mRNA
  14. ko 2 mRNA
  15. ko 3 mRNA
  16. ko 4 mRNA
  17. wt 1 mRNA
  18. wt 2 mRNA
  19. wt 3 mRNA

In this case would you recommend just inputing the samples that I am interested in directly comparing?


ADD REPLYlink modified 12 months ago • written 12 months ago by Jake60

This plot doesn't seem to show the low or mono samples as having any worse quality than the high samples, although the mRNA samples seem to have a higher average quality than almost anything else. And even if they did, I would expect the quality weights to account for that. I'm afraid I don't know enough to diagnose the problem based on this plot.

However, taking a closer look at the counts of significant genes in the table you gave, the only real difference is in the top row. The difference between 1176 genes and 1228 is negligible. So adding in either the mono or the low samples has a minimal effect, but adding both of them in the same model has a drastic effect. This seems very counterintuitive, and if I saw this effect in my data, I would suspect that I made a mistake coding my model for the first row and would double-check my code. For example, you could have accidentally treated low and mono as a single group instead of 2 separate groups in the design, thus inflating the variance estimates. I doubt that's the exact answer, though, since your quality weight plot clearly shows different weights for those two groups.

ADD REPLYlink modified 12 months ago • written 12 months ago by Ryan C. Thompson6.9k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 159 users visited in the last hour