Dear all,
I am looking for guidance using edgeR::voomLmFit
to analyze clinical RNA-seq datasets from matched remission & disease samples. Briefly, the study design is:
20 patients, each patient has a remission and disease time point (biospecimen). Each biospecimen is sorted into two cell populations. The (ideal) objectives in layperson terms are to:
- A) Identify transcriptomic changes between disease states and remission shared by all cell types and patients
- B) Identify transcriptomic changes between disease states and remission for individual cell types, shared by all patients
- C) Identify transcriptomic changes between disease and remission for individual patients and cell types
Code:
library(edgeR)
# Prepare metadata
set.seed(1)
metadata <- data.frame(
BIOSPECIMEN=factor(paste0('S',1:20)),
INDIVIDUAL=factor(paste0('Patient',c(rep(1,4),rep(2,4),rep(3,4),rep(4,4),rep(5,4)))),
CELLTYPE=factor(paste0('Cell',c(rep(c(1,2),5)))),
DISEASESTATE=factor(c('Remission','Remission','D1','D1',
'Remission','Remission','D1','D1',
'Remission','Remission','D2','D2',
'Remission','Remission','D3','D3',
'Remission','Remission','D1','D1'),
levels=c('Remission','D1','D2','D3')),
DISEASESTATEBIN=factor(rep(c('Remission','Remission','D','D'),5)),
RIN=rnorm(20,mean=6,sd=1.5),
AGE=unlist(lapply(rnorm(5,mean=45,sd=8),rep,4))
)
# Center and scale RIN score
metadata$RINS <- scale(metadata$RIN)
# Generate random counts
counts <- sapply(1:nrow(metadata),function(i) {
rnbinom(n=1000,size=sample(seq(0.1,1,by=0.1),1),prob=0.1)
})
colnames(counts) <- metadata$BIOSPECIMEN
row.names(counts) <- paste0('gene',1:nrow(counts))
Specific Question 1: Is there a preferred paradigm for voomLmFit for paired RNA-seq between blocking on individual and including individual in the model design (Option 1 vs Option 2 below):
### Analysis A) Option 1
y1 <- DGEList(counts=counts,samples=metadata)
design1 <- model.matrix(~0 + DISEASESTATE + RINS,
data=y1$samples)
keep1 <- filterByExpr(y1,design1)
y1 <- y1[keep1,,keep.lib.sizes=FALSE]
y1 <- normLibSizes(y1)
fit1 <- voomLmFit(y1,design1,block=y1$samples$INDIVIDUAL,sample.weights=TRUE)
contrasts1 <- makeContrasts('DISEASESTATED1 - DISEASESTATERemission',
levels=design1)
fit1 <- contrasts.fit(fit1,contrasts1)
fit1 <- eBayes(fit1)
res1 <- topTable(fit1,sort.by='P',n=Inf)
### Analysis A) Option 2
y2 <- DGEList(counts=counts,samples=metadata)
design2 <- model.matrix(~0 + DISEASESTATE + INDIVIDUAL + RINS,
data=y2$samples)
keep2 <- filterByExpr(y2,design2)
y2 <- y2[keep2,,keep.lib.sizes=FALSE]
y2 <- normLibSizes(y2)
fit2 <- voomLmFit(y2,design2,sample.weights=TRUE)
contrasts2 <- makeContrasts('DISEASESTATED1 - DISEASESTATERemission',
levels=design2)
fit2 <- contrasts.fit(fit2,contrasts2)
fit2 <- eBayes(fit2)
res2 <- topTable(fit2,sort.by='P',n=Inf)
# Comparison
comp <- merge(res1,res2,by='row.names')
plot(comp$logFC.x,comp$logFC.y)
message(cor(comp$logFC.x,comp$logFC.y))
Specific Question 2: For analysis B), is the best approach to create a combined group as below, or should I instead split the counts into individual cell types first? The differences in cell types (as assessed by MDS/PCA) are much greater than differences between patients or disease status:
# Analysis B)
metadata$CELLTYPE_DISEASESTATE <- paste(metadata$CELLTYPE,metadata$DISEASESTATE,sep='_')
y3 <- DGEList(counts=counts,samples=metadata)
design3 <- model.matrix(~0 + CELLTYPE_DISEASESTATE + INDIVIDUAL + RINS,
data=y3$samples)
keep3 <- filterByExpr(y3,design3)
y3 <- y3[keep3,,keep.lib.sizes=FALSE]
y3 <- normLibSizes(y3)
fit3 <- voomLmFit(y3,design3,sample.weights=TRUE)
contrasts3 <- makeContrasts('CELLTYPE_DISEASESTATECell1_D1 - CELLTYPE_DISEASESTATECell1_Remission',
levels=design3)
fit3 <- contrasts.fit(fit3,contrasts3)
fit3 <- eBayes(fit3)
res3 <- topTable(fit3,sort.by='P',n=Inf)
General questions:
- Is it appropriate to account for differences in RNA integrity (RIN) score using the centered/scaled RIN values in the model design as shown above?
- It seems reasonable to default to
sample.weights=TRUE
for clinical samples. Are there downsides to this approach?
This is my first post. Please forgive me for any issues with this post, and thank you in advance for any help.
Thank you, Gordon, for your responses to all of my questions. For analysis A), in the case when a patient has either a missing timepoint or missing cell type from one or both time points, the approach would be to block on individual, correct? Do you have any general guidelines on whether it is preferred to restrict to complete pairs for a balanced study design (using ~ DISEASE + INDIVIDUAL) versus using incomplete pairs (with ~ DISEASE, followed by blocking on INDIVIDUAL)?
I generally recommend
block=individual
for unbalanced designs, to recover information from incomplete blocks. I would only be disuaded from that when there are large outlier effects for particular individuals, in which case the design matrix approach is more robust and conservative.