How to use limma to analyze complex temporal RNAseq data
1
0
Entering edit mode
@b9a7fe93
Last seen 27 days ago
Netherlands

Hello everyone, I am a super fan of limma. Recently, I came across some chanllengings when working a complex time course RNAseq project. The metadata includes three groups, including unstained control, high, and low. Each group has 7 time series. However, they only had one biological replicate, which means I only have one sample per time point / group. The data provider wants to know which genes are differentialy regulated in time between grouped High versus Unstained (as well as high_vs_low, low_vs_unstained). According to the guideline in Chapter 9.6, I suppose this is more in line with the process of Many time points (9.6.2) since it can handle the situation without biological replication very well.

What confused me is, the example in guideline only includes control and treatment groups. I'm unsure if this approach can be directly applied to my data. Should I reassemble the data, construct linear models for High_vs_Unstained, high_vs_low, and low_vs_unstained separately, and then run contrasts.fit()? Because I suppose the adjust p value obatined from my current codes is the trend of all pairwise comparisons among three groups. Here are my codes and data.

library(splines)
D <- readRDS("D.filt.anno.dedup.norm.rds")
targets <- D$targets
targets$group <- factor(targets$group, levels = c("unstained", "low", "high"))
targets$time <- gsub("t","",targets$time) %>% as.numeric()
targets$time
# [1] 1 1 1 2 2 2 3 3 4 4 4 5 5 5 6 6 6 7 7 7

X <- ns(targets$time, df=5)
Group <- factor(targets$group)
design <- model.matrix(~Group * X)
colnames(design) <- gsub(":","_", colnames(design))

colnames(design)
# [1] "(Intercept)"  "Grouplow"     "Grouphigh"    "X1"           "X2"           "X3"          
# [7] "X4"           "X5"           "Grouplow_X1"  "Grouphigh_X1" "Grouplow_X2"  "Grouphigh_X2"
# [13] "Grouplow_X3"  "Grouphigh_X3" "Grouplow_X4"  "Grouphigh_X4" "Grouplow_X5"  "Grouphigh_X5"

logCPM <- cpm(D, log = TRUE, prior.count = 3)
fit <- lmFit(logCPM, design)
fit.eb <- eBayes(fit, trend = TRUE)

my.contrasts <- makeContrasts(
  # High vs low
  HighvsLow_X2 = (Grouphigh_X2 - Grouphigh_X1) - (Grouplow_X2 - Grouplow_X1),
  HighvsLow_X3 = (Grouphigh_X3 - Grouphigh_X1) - (Grouplow_X3 - Grouplow_X1),
  HighvsLow_X4 = (Grouphigh_X4 - Grouphigh_X1) - (Grouplow_X4 - Grouplow_X1),
  HighvsLow_X5 = (Grouphigh_X5 - Grouphigh_X1) - (Grouplow_X5 - Grouplow_X1),
  # High vs unstained
  HighvsUnstained_X2 = Grouphigh_X2 - Grouphigh_X1,
  HighvsUnstained_X3 = Grouphigh_X3 - Grouphigh_X1,
  HighvsUnstained_X4 = Grouphigh_X4 - Grouphigh_X1,
  HighvsUnstained_X5 = Grouphigh_X5 - Grouphigh_X1,
  # Low vs unstained
  LowvsUnstained_X2 = Grouplow_X2 - Grouplow_X1,
  LowvsUnstained_X3 = Grouplow_X3 - Grouplow_X1,
  LowvsUnstained_X4 = Grouplow_X4 - Grouplow_X1,
  LowvsUnstained_X5 = Grouplow_X5 - Grouplow_X1,
  levels = design)

fit2 <- contrasts.fit(fit, my.contrasts)
fit2.eb <- eBayes(fit2, trend = TRUE, robust= FALSE)

tt   = topTable(fit2.eb,n=Inf,sort.by="none")
tt <- tt[,-c(1:12)]
gene_anno <- D$idTOsymbol
tt <- tt[gene_anno$entrezgene_id,]
tt <- cbind(gene_anno,tt)
for (i in 1:ncol(my.contrasts)){
  tt.i = topTable(fit2.eb,coef=i,n=Inf,sort.by="none")[,c(1,3:6)]
  names(tt.i) = paste0(names(tt.i)," (",colnames(my.contrasts)[i],")")
  tt = cbind(tt,tt.i)}
str(tt)
#'data.frame':  20555 obs. of  66 variables:
# $ entrezgene_id                 : chr  "1" "10" "100" "10000" ...
# $ hgnc_symbol                   : chr  "A1BG" "NAT2" "ADA" "AKT3" ...
# $ AveExpr                       : num  -2.85 -2.45 4.22 5.68 2.58 ...
# $ F                             : num  1.542 0.987 2.46 1.076 0.565 ...
# $ P.Value                       : num  0.298 0.517 0.134 0.472 0.779 ...
# $ adj.P.Val                     : num  0.987 0.987 0.987 0.987 0.987 ...
# $ logFC (HighvsLow_X2)          : num  0.522 -2.0541 -0.195 -1.3448 -0.0839 ...
# $ t (HighvsLow_X2)              : num  0.2136 -0.6023 -0.1544 -1.3685 -0.0527 ...
# $ P.Value (HighvsLow_X2)        : num  0.837 0.567 0.882 0.216 0.96 ...
# $ adj.P.Val (HighvsLow_X2)      : num  1 1 1 1 1 ...
# $ B (HighvsLow_X2)              : num  -4.6 -4.6 -4.6 -4.59 -4.6 ...
timecourse RNASeq limma • 538 views
ADD COMMENT
0
Entering edit mode

Secondly, I also tried to directly apply the experimental setting time series instead of 'pseudotime series' in the design. Do you think it could introduce too many false positive results due to inability to estimate within-group variability? Here are my codes.

design <- model.matrix(~ group * time, data = targets)
logCPM <- cpm(D, log = TRUE, prior.count = 3)
fit <- lmFit(logCPM, design)
colnames(design) <- gsub(":", "_", colnames(design))

mycontrast <- makeContrasts(
  high_vs_low_time = grouphigh_time - grouplow_time,
  high_vs_unstained_time = grouphigh_time,
  low_vs_unstained_time = grouplow_time,
  levels = design)

fit <- contrasts.fit(fit.eb, mycontrast)
fit.eb <- eBayes(fit)

tt_high_vs_low_time <- topTable(fit.eb, coef="high_vs_low_time", n=Inf)
tt_high_vs_unstained_time <- topTable(fit.eb, coef="high_vs_unstained_time", n=Inf)
tt_low_vs_unstained_time <- topTable(fit.eb, coef="low_vs_unstained_time", n=Inf)
ADD REPLY
0
Entering edit mode
@ca3a779c
Last seen 7 weeks ago
Australia

Hi Jiahao,

If you only want to compare the difference between groups, you can consider time effect by adding time to the design matrix. See below my R codes for your aimed analysis, and you may have a try.

If you are also interested in DE genes along time course for each group, you can consider reading our time course paper: https://f1000research.com/articles/12-684/v2 .

library(limma)
library(edgeR)

# (not quite sure what your data D is)
y <- DGEList(counts = raw_counts, samples = D$targets)

keep <- filterByExpr(y)
y <- y[keep, , keep.lib.sizes=FALSE]

y <- calcNormFactors(y)

# design (add time effect to the design)
design <- model.matrix(~ 0 + group + time, data = y$samples)
colnames(design) <- gsub("group", "", colnames(design))
design

# levels of group: c("high", "low", "unstained")
contrasts <- makeContrasts(
  high_vs_low = high - low,
  high_vs_unstained = high - unstained,
  low_vs_unstained = low - unstained,
  levels = design)

y <- estimateDisp(y, design, robust = TRUE)

# set sample.weights to TRUE if library size (sequencing depth) varies much among samples
vfit <- voomLmFit(y, design, sample.weights = TRUE) 
vfit <- contrasts.fit(vfit, contrasts=contrasts)

efit <- eBayes(vfit, robust = TRUE)
de <- decideTests(efit, p.value = 0.05)
summary(de)

DE_genes <- topTable(efit, coef="high_vs_low", n=Inf, p.value = 0.05)
ADD COMMENT
0
Entering edit mode

Hi Jinming, Great thanks for your comments. I did want to try your group design at the first begining. However, this method needs biological replicate. Otherwise this will generate too many NA values or Inf values in the linear model and cause eBayes to fail, at least when directly using lmFit to build the linear model. So your opinion is to use voomLmFit to build the linear model?

ADD REPLY
0
Entering edit mode

Hi Jinming, I want to make some supplement to my question. The D object was built by using DEGList, so it is the same as your y object. According to codes, it looks like it only compared the high, low, and unstained, but the time factor was not included in the makeConstrasts step.

head(targets[,-c(1:4)])
#group      Sample type     SamplesID   time     timeGroup
#unstained  no-EV control   S001    t1   unstained_t1
#low        low EV uptake   S002    t1   low_t1
#high       high EV uptake  S003    t1   high_t1
#unstained  no-EV control   S004        t2   unstained_t2
#low        low EV uptake   S005        t2   low_t2
#high       high EV uptake  S006    t2   high_t2
> design
     unstained low high timet2 timet3 timet4 timet5 timet6 timet7
S001         1   0    0      0      0      0      0      0      0
S002         0   1    0      0      0      0      0      0      0
S003         0   0    1      0      0      0      0      0      0
S004         1   0    0      1      0      0      0      0      0
S005         0   1    0      1      0      0      0      0      0
ADD REPLY
0
Entering edit mode

Hi Jiahao,

I generated a dataset to make it similar to yours, and I tried my R codes (voomLmFit + eBayes) on this dataset, which worked well. Both voomLmFit and lmFit are good, while the number of DE genes might be different. I would recommend using voomLmFit for your dataset as it is more powerful and has additional parameters to refine the model.

# My dataset
> table(y$samples$group,y$samples$time)

            0 1 2 3 4 5
  high      1 1 1 1 1 1
  low       1 1 1 1 1 1
  unstained 1 1 1 1 1 1

As there are no biological replicates in your dataset, you may consider generating additional data to have sufficient biological replicates (e.g., 3 for each time point in each group). Then, either voomLmFit or lmFit should be good to perform the DE analysis.

Thanks for the details of the D object. The time effect has been added to the design matrix, which will be used to fit the linear model. So, only groups are needed in makeContrasts if you are interested in comparing the difference between groups.

design <- model.matrix(~ 0 + group + time, data = y$samples)
vfit <- voomLmFit(y, design, sample.weights = TRUE)

If the voomLmFit pipeline does not work on your dataset, you could probably try to perform time course analysis for each group (e.g., high) to obtain the Up/Down DE genes along the time course. Then, you can compare the DE genes between groups (e.g., DE_genes_of_high vs DE_genes_of_ unstained) to see if there are any genes of your interests. (It is not the best solution but worth a try.)

# An example analysis for group "high"
y1 <- y[,y$samples$group=="high"]

t1 <- y1$samples$time
X <- splines::ns(as.numeric(t1),df = 3) 
A <- cbind(1,t1,X)
QR <- qr(A)
r <- QR$rank
R_rank <- QR$qr[1:r,1:r]
Z <- t(backsolve(R_rank,t(A),transpose=TRUE))
Z <- Z[,-1]
design <- model.matrix(~ Z)
design

y1 <- estimateDisp(y1, design)

# edgeR quasi-likelihood pipeline
fit <- glmQLFit(y1, design, robust=TRUE)
res <- glmQLFTest(fit, coef=2:4)
summary(decideTests(res))
tab <- topTags(res, n=Inf, p.value = 0.05)$table
tab$trend <- ifelse(tab$logFC.Z1 > 0, "Up", "Down")
ADD REPLY
0
Entering edit mode

Hi Jinming,

Great thanks for your reply! I have no doubt that voomLmFit handles complex temporal variations. However, I suspect that your group design does not include confounding factors or interactions between time and group. This is difficult to include it later by using makeConstrast. Please note that I included this in my initial design (design <- model.matrix(~group* time)). I have tried another way to create constrast as well (like chapter 9.6.1) and applied it in voomLmFit. Apparently, due to insufficient sample size, this method failed to account for variability in the subsequent analysis, resulting in errors.

treatment = factor(targets$timeGroup, level=c("unstained_t1","low_t1","high_t1",
                                           "unstained_t2","low_t2","high_t2",
                                           "low_t3","high_t3", # no "unstained_t3" sample
                                           "unstained_t4","low_t4","high_t4",
                                           "unstained_t5","low_t5","high_t5",
                                           "unstained_t6","low_t6","high_t6",
                                           "unstained_t7","low_t7","high_t7"))

design <- model.matrix(~ 0+treatment)
#D <- estimateDisp(D, design, robust = TRUE)
vfit <- voomLmFit(D, design, sample.weights = TRUE) 
# Which genes respond differently over time in the high group relative to the unstained?
mycontrast <- makeContrasts(
  HighVSUnstained_t2 = (high_t2 - high_t1) - (unstained_t2 - unstained_t1),
  HighVSUnstained_t3 = (high_t3 - high_t1) - (unstained_t3 - unstained_t1),
  HighVSUnstained_t4 = (high_t4 - high_t1) - (unstained_t4 - unstained_t1),
  HighVSUnstained_t5 = (high_t5 - high_t1) - (unstained_t5 - unstained_t1),
  HighVSUnstained_t6 = (high_t6 - high_t1) - (unstained_t6 - unstained_t1),
  HighVSUnstained_t7 = (high_t7 - high_t1) - (unstained_t7 - unstained_t1),
  levels = design)
vfit <- contrasts.fit(vfit, contrasts=mycontrast)
efit <- eBayes(vfit, robust = TRUE)
#  Error in .ebayes(fit = fit, proportion = proportion, stdev.coef.lim = stdev.coef.lim,  : 
#  No residual degrees of freedom in linear model fits

Your another advise is to the natural spline, which is exactly the better approach in chapter 9.6.2. I have stickly followed it and also believe it is the best solution for my data.

What confused me is, I am not sure whether my makeContrasts was right. Did it actually compare the high and unstained group over time? Could I get which genes respond differently over time in the high group relative to the unstained according to this makeContrasts building.

my.contrasts <- makeContrasts(
  # High vs unstained
  HighvsUnstained_X2 = Grouphigh_X2 - Grouphigh_X1,
  HighvsUnstained_X3 = Grouphigh_X3 - Grouphigh_X1,
  HighvsUnstained_X4 = Grouphigh_X4 - Grouphigh_X1,
  HighvsUnstained_X5 = Grouphigh_X5 - Grouphigh_X1,
  # Low vs unstained
  LowvsUnstained_X2 = Grouplow_X2 - Grouplow_X1,
  LowvsUnstained_X3 = Grouplow_X3 - Grouplow_X1,
  LowvsUnstained_X4 = Grouplow_X4 - Grouplow_X1,
  LowvsUnstained_X5 = Grouplow_X5 - Grouplow_X1,
  levels = design)
ADD REPLY
0
Entering edit mode

Hi Jiahao,

Thanks for more details about your analysis! For my design (~ 0 + group + time), confounding factors or interactions were not considered.

Part1:

In your analysis following guide in chapter 9.6.1, the eBayes test failed because of no residual degrees of freedom (which is caused by insufficient samples as you already know).

Contrasts used for "Which genes respond differently over time in the high group relative to the unstained?"

To obtain the difference, each time point should be compared to its immediate preceding time point. (e.g., Dif_t7 = t7 - t6)

cont.dif <- makeContrasts(
  Dif_t2_high_vs_unstained = (high_t2 - high_t1) - (unstained_t2 - unstained_t1),
  Dif_t3_high_vs_unstained = (high_t3 - high_t2) - (unstained_t3 - unstained_t2),
  ...,
  Dif_t7_high_vs_unstained = (high_t7 - high_t6) - (unstained_t7 - unstained_t6),
  levels=design)

Part2:

For the spline models in chapter 9.6.2, splines terms (X1, X2, X3, ...) should be taken as a whole because they are generated from a time vector. X2 - X1 is not meaningful, so your contrasts are incorrect.

To detect genes with different time trends between two groups, specify column numbers of all the interaction parameters in colnames(design):

# e.g.
topTable(fit, coef=8:12) # (last 5 columns), when df = 5 
topTable(fit, coef=6:8)  # (last 3 columns), when df = 3

I would recommend comparing two groups at a time (e.g., high vs low) using a cubic spline (df=3) for your data with 7 time points.

y2 <- y[,y$samples$group != "unstained"]
X <- splines::ns(as.numeric(y2$samples$time), df=3)
Group <- factor(y2$samples$group)
design <- model.matrix(~Group*X)
colnames(design)
fit <- voomLmFit(y2, design)
fit <- eBayes(fit)
topTable(fit, coef=6:8)

# voomLmFit() can be replaced with lmFit() using log-cpm as input if you prefer lmFit()
# lcpm <- cpm(y2, log = TRUE)
# fit <- lmFit(lcpm, design)
ADD REPLY
0
Entering edit mode

Thank you so much! I would compare every two groups separately.

ADD REPLY

Login before adding your answer.

Traffic: 404 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6