Question

Questionable limma results for differential expression on RNA-seq

0

Entering edit mode

gld-bioinf • 0

@64a5a589

Last seen 8 months ago

United Kingdom

Hi,

I'm analysing RNA-seq data and have an experimental set up similar to this:

colData <- data.frame(sample=c(1,2,3,4,5,6,7,8),
                      group=c("A", "B", "C", "A", "B", "A", "C", "C"),
                      patient=c(1,1,1,2,2,3,3,4),
                      age=c(13.2, 15, 16.8, 13.4, 15.3, 12.8, 17, 16.5))

I'd like to do all pairwise group comparisons (AvB, AvC, BvC), whilst taking into account the ages of the patients and matched patient samples. However the issue is not all my patient samples are matched and in all groups.

I've decided to use limma, currently I understand duplicateCorrelation() can be used to account for missing matched patient samples?

I've generated the code below following the limma manual and other similar questions I've found online:

# reads is an expression matrix with columns as samples and rows as genes
dge <- DGEList(counts=reads)
# remove none expressed genes
cutoff <- 1
drop <- which(apply(cpm(dge), 1, max) < cutoff)
d <- dge[-drop,]
d <- calcNormFactors(d)

# fit model 
design = model.matrix(~0+colData$group+colData$age)
colnames(design) <- c("A", "B", "C", "age")

# normalise reads using voom
v <- voom(d, design=design)
# here the patient is treated as a random effect due to performing between and within patient comparisons
corfit <- duplicateCorrelation(v, design, block=colData$patient)

# repeat voom and duplicate correlation (https://support.bioconductor.org/p/59700/)
v <- voom(d, design=design, block=colData$patient, correlation=corfit$consensus)
corfit <- duplicateCorrelation(v, design, block=colData$patient)

fit <- lmFit(v, design=design, block=colData$patient, correlation=corfit$consensus)

contr <- makeContrasts(A - B, levels=design)

fit2 <- contrasts.fit(fit, contr)
fit2 <- eBayes(fit2)
top.table <- topTable(fit2, sort.by = "p", n = Inf)
head(top.table, 20)

The code runs fine, but I'm doubting my model is correct as the lowest adjusted p-value is 0.97 for any of the comparisons so no genes are coming up as significantly differentially expressed.

This may just be the data, but want to double check there is no issue with my model/code as I'm new to using limma. Any help appreciated, thanks!

RNASeqData DifferentialExpression limma • 762 views

ADD COMMENT • link updated 9 months ago by Gordon Smyth 50k • written 9 months ago by gld-bioinf • 0

score 0 · Answer 1 · 2023-07-24

0

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 7 hours ago

WEHI, Melbourne, Australia

The code seems fine but the analysis you are conducting has almost no statistical power to detect DE because the three patient groups are strongly correlated with age and you are adjusting for age. Groups A, B and C respectively consist of patients of increasing age, and therefore any gene that is DE between A and C and intermediate in B will be impossible to detect after adjusting for age effects.

I do not understand how you can have three different ages for the same patient, but that is another mattter.

ADD COMMENT • link 9 months ago Gordon Smyth 50k

0

Entering edit mode

Ok thanks for your feedback! I'll try to rerun without including age in the model. As for the three ages, they are samples collected from the patient over a period of several years.

ADD REPLY • link 9 months ago gld-bioinf • 0

0

Entering edit mode

If these are human patients, then larger sample sizes would usually be required to get reliable results. It would be a common experience to find no statistically significant DE in an analysis with just three human subjects.

ADD REPLY • link 9 months ago Gordon Smyth 50k

0

Entering edit mode

The colData I included was just an example so we have almost 20 patients. I removed the age variable and this still resulted in very high adjusted p-values. I happened to try without setting intercept as 0 in the model and the results change a lot.

top 3 genes ranked by adjusted p-val with design = model.matrix(~0+colData$group)

                                   logFC     AveExpr         t      P.Value adj.P.Val         B
ENSG00000164597_COG5          -0.4467702  4.97840450 -3.905644 0.0005218332 0.9999292 -3.123975
ENSG00000164114_MAP9           0.8135768  2.56065281  3.581045 0.0012408505 0.9999292 -3.975650
ENSG00000163312_HELQ           0.4339874  4.30014688  3.544689 0.0013656245 0.9999292 -3.473365

top 3 genes ranked by adjusted p-val with design = model.matrix(~colData$group)

                            logFC   AveExpr        t      P.Value    adj.P.Val        B
ENSG00000167978_SRRM2   11.367109 11.397061 87.02216 2.109628e-36 3.673494e-32 68.18470
ENSG00000105063_PPP6R1   7.978524  8.088218 80.23223 2.164661e-35 1.291034e-31 66.35750
ENSG00000156508_EEF1A1  13.855348 13.840293 79.77207 2.552459e-35 1.291034e-31 66.56524

I've read on a few posts that setting/not setting 0 intercept shouldn't make too much of a difference. Is there a way to determine whether to set it as 0 or not is more appropriate for my data? Sorry if this is a simplistic question, my background is not stats and I appreciate any help you can give!

ADD REPLY • link 9 months ago gld-bioinf • 0

0

Entering edit mode

Your second analysis is wrong. Please read the previous posts or the limma User's Guide to see why. If you change the design matrix you naturally have to make the corresponding adjustment to the contrasts.

ADD REPLY • link 9 months ago Gordon Smyth 50k

0

Entering edit mode

Ahh I see, as my set up is group-means parametrization and without ~0 would require treatment-contrasts parametrization. Thanks for pointing me in the right direction!

ADD REPLY • link 9 months ago gld-bioinf • 0