Question

Using edgeR::voomLmFit for the analysis of paired clinical RNA-seq samples with multiple cell types

1

Entering edit mode

Devin ▴ 10

@507495b5

Last seen 14 hours ago

United States

Dear all,

I am looking for guidance using edgeR::voomLmFit to analyze clinical RNA-seq datasets from matched remission & disease samples. Briefly, the study design is:

20 patients, each patient has a remission and disease time point (biospecimen). Each biospecimen is sorted into two cell populations. The (ideal) objectives in layperson terms are to:

A) Identify transcriptomic changes between disease states and remission shared by all cell types and patients
B) Identify transcriptomic changes between disease states and remission for individual cell types, shared by all patients
C) Identify transcriptomic changes between disease and remission for individual patients and cell types

Code:

library(edgeR)

# Prepare metadata
set.seed(1)
metadata <- data.frame(
  BIOSPECIMEN=factor(paste0('S',1:20)),
  INDIVIDUAL=factor(paste0('Patient',c(rep(1,4),rep(2,4),rep(3,4),rep(4,4),rep(5,4)))),
  CELLTYPE=factor(paste0('Cell',c(rep(c(1,2),5)))),
  DISEASESTATE=factor(c('Remission','Remission','D1','D1',
                        'Remission','Remission','D1','D1',
                        'Remission','Remission','D2','D2',
                        'Remission','Remission','D3','D3',
                        'Remission','Remission','D1','D1'),
                      levels=c('Remission','D1','D2','D3')),
  DISEASESTATEBIN=factor(rep(c('Remission','Remission','D','D'),5)),
  RIN=rnorm(20,mean=6,sd=1.5),
  AGE=unlist(lapply(rnorm(5,mean=45,sd=8),rep,4))
)

# Center and scale RIN score
metadata$RINS <- scale(metadata$RIN)

# Generate random counts
counts <- sapply(1:nrow(metadata),function(i) {
  rnbinom(n=1000,size=sample(seq(0.1,1,by=0.1),1),prob=0.1)
})
colnames(counts) <- metadata$BIOSPECIMEN
row.names(counts) <- paste0('gene',1:nrow(counts))

Specific Question 1: Is there a preferred paradigm for voomLmFit for paired RNA-seq between blocking on individual and including individual in the model design (Option 1 vs Option 2 below):

### Analysis A) Option 1
y1 <- DGEList(counts=counts,samples=metadata)
design1 <- model.matrix(~0 + DISEASESTATE + RINS,
                        data=y1$samples)
keep1 <- filterByExpr(y1,design1)
y1 <- y1[keep1,,keep.lib.sizes=FALSE]
y1 <- normLibSizes(y1)
fit1 <- voomLmFit(y1,design1,block=y1$samples$INDIVIDUAL,sample.weights=TRUE)
contrasts1 <- makeContrasts('DISEASESTATED1 - DISEASESTATERemission',
                            levels=design1)
fit1 <- contrasts.fit(fit1,contrasts1)
fit1 <- eBayes(fit1)
res1 <- topTable(fit1,sort.by='P',n=Inf)

### Analysis A) Option 2
y2 <- DGEList(counts=counts,samples=metadata)
design2 <- model.matrix(~0 + DISEASESTATE + INDIVIDUAL + RINS,
                        data=y2$samples)
keep2 <- filterByExpr(y2,design2)
y2 <- y2[keep2,,keep.lib.sizes=FALSE]
y2 <- normLibSizes(y2)
fit2 <- voomLmFit(y2,design2,sample.weights=TRUE)
contrasts2 <- makeContrasts('DISEASESTATED1 - DISEASESTATERemission',
                            levels=design2)
fit2 <- contrasts.fit(fit2,contrasts2)
fit2 <- eBayes(fit2)
res2 <- topTable(fit2,sort.by='P',n=Inf)

# Comparison
comp <- merge(res1,res2,by='row.names')
plot(comp$logFC.x,comp$logFC.y)
message(cor(comp$logFC.x,comp$logFC.y))

Specific Question 2: For analysis B), is the best approach to create a combined group as below, or should I instead split the counts into individual cell types first? The differences in cell types (as assessed by MDS/PCA) are much greater than differences between patients or disease status:

# Analysis B)
metadata$CELLTYPE_DISEASESTATE <- paste(metadata$CELLTYPE,metadata$DISEASESTATE,sep='_')
y3 <- DGEList(counts=counts,samples=metadata)
design3 <- model.matrix(~0 + CELLTYPE_DISEASESTATE + INDIVIDUAL + RINS,
                        data=y3$samples)
keep3 <- filterByExpr(y3,design3)
y3 <- y3[keep3,,keep.lib.sizes=FALSE]
y3 <- normLibSizes(y3)
fit3 <- voomLmFit(y3,design3,sample.weights=TRUE)
contrasts3 <- makeContrasts('CELLTYPE_DISEASESTATECell1_D1 - CELLTYPE_DISEASESTATECell1_Remission',
                            levels=design3)
fit3 <- contrasts.fit(fit3,contrasts3)
fit3 <- eBayes(fit3)
res3 <- topTable(fit3,sort.by='P',n=Inf)

General questions:

Is it appropriate to account for differences in RNA integrity (RIN) score using the centered/scaled RIN values in the model design as shown above?
It seems reasonable to default to sample.weights=TRUE for clinical samples. Are there downsides to this approach?

This is my first post. Please forgive me for any issues with this post, and thank you in advance for any help.

edgeR limma voomLmFit • 530 views

ADD COMMENT • link updated 19 hours ago by Gordon Smyth 52k • written 1 day ago by Devin ▴ 10

score 3 · Accepted Answer · 2025-03-19

3

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 9 hours ago

WEHI, Melbourne, Australia

Specific Question 1: Is there a preferred paradigm for voomLmFit for paired RNA-seq between blocking on individual and including individual in the model design (Option 1 vs Option 2 below):

Since the experimental design is balanced (every individual has 4 samples), I recommend including individual in the model matrix.

Specific Question 2: For analysis B), is the best approach to create a combined group as below, or should I instead split the counts into individual cell types first?

It makes no difference, provided that you end up fitting all four groups for each individual. I recommended the combined group for simplicity.

Is it appropriate to account for differences in RNA integrity (RIN) score using the centered/scaled RIN values in the model design as shown above?

There are some publications that recommend doing so, but I don't find it very useful. It is up to you.

It seems reasonable to default to sample.weights=TRUE for clinical samples. Are there downsides to this approach?

No, there are no real downsides. I generally recommend sample weights for human clinical samples.

ADD COMMENT • link 1 day ago Gordon Smyth 52k

0

Entering edit mode

Thank you, Gordon, for your responses to all of my questions. For analysis A), in the case when a patient has either a missing timepoint or missing cell type from one or both time points, the approach would be to block on individual, correct? Do you have any general guidelines on whether it is preferred to restrict to complete pairs for a balanced study design (using ~ DISEASE + INDIVIDUAL) versus using incomplete pairs (with ~ DISEASE, followed by blocking on INDIVIDUAL)?

ADD REPLY • link 21 hours ago Devin ▴ 10

1

Entering edit mode

I generally recommend block=individual for unbalanced designs, to recover information from incomplete blocks. I would only be disuaded from that when there are large outlier effects for particular individuals, in which case the design matrix approach is more robust and conservative.

ADD REPLY • link 19 hours ago Gordon Smyth 52k