Question

Limma analysis combining twins pairs and unpaired controls

0

Entering edit mode

kalletrontti • 0

@kalletrontti-17337

Last seen 6.8 years ago

Hi all,

I have RNA-seq data from twins of which one of the pair is sick (n=5) and the other one is healthy (n=5). In addition, I have six healthy controls that are not related to the twins.

I wish to look for differential expression (a) between sick twins compared to their healthy twin pairs, and (b) between sick twins and healthy controls including the healthy twins and unrelated healthy controls.

For part (a) I take that blocking might be most suitable as most twin pairs cluster tightly in an MDS plot. The code I wrote based on limma's manual is the following, please correct if you point out something peculiar:

pairs <- factor(expDesign$Pair) #Levels "P1".."P5" or "unpaired"
disease <- factor(expDesign$Disease) #Levels "Sick" and "Healthy"
des <- model.matrix(~pairs+disease) #Pair info to block

y <- DGEList(counts=data)
keep<- rowSums(cpm(y)>1) >= 6
y <- y[keep, ]
y <- calcNormFactors(y, method="TMM")
v <- voom(y, des, plot=TRUE)

topTable(fit, coef=2, sort.by = "p")

For aim (b) I guess limma's duplicateCorrelation() could be appropriate. However I have doubts in my skills writing the code (below) and would appreciate any advice on it. Also all comments for what would be best practise to analyse this kind of data structure are warmly welcome!

dge <- DGEList(counts=data)
keep<- rowSums(cpm(dge)>1) >= 6
dge <- dge[keep, ]
dge$samples$lib.size <- colSums(dge$counts)
dge <- calcNormFactors(dge)

design <- model.matrix(~expDesign$Disease)
v <- voom(dge, design, plot=TRUE)
corfit <- duplicateCorrelation(v, design, block=expDesign$Pair)
corfit$consensus #Gives 0.2208816
fit <- lmFit(v, design, block=expDesign$Pair, correlation=corfit$consensus)
fit <- eBayes(fit)
topTable(fit, coef=2, sort.by = "p")

Thanks a lot!

limma twin analysis rnaseq • 1.3k views

ADD COMMENT • link updated 6.8 years ago by Aaron Lun ★ 28k • written 6.8 years ago by kalletrontti • 0

score 0 · Answer 1 · 2018-09-14

Your code looks fine; just a few comments:

I would consider removing the unpaired samples in your first analysis. For the paired samples, the variance is driven by differences in the disease effect across twins - you don't have to worry about genotypic effects (or other effects that are common between twins) because that's absorbed by the pairs terms. In contrast, the variance in the unpaired samples will be driven by differences in genotype and other individual-specific effects. By analyzing everything together, you are unnecessarily introducing two distinct modes of variation, which violates assumptions of equal variance in the linear model. Admittedly, limma is quite robust to this, but why take the risk?
I assume you're missing some code between voom and topTable in your first analysis.
In your second analysis, try one round of iteration for voom and duplicateCorrelation (i.e., re-estimating the weights with the consensus correlation, and then re-estimating the correlation with the new weights). This should improve both estimates by ensuring that the weights account for the correlations and vice versa. See Gordon's comments at using duplicateCorrelation with limma+voom for RNA-seq data.
I usually like setting robust=TRUE in eBayes to protect against genes with outlier variances.
There's no need to recompute the library sizes if you use keep.lib.sizes=FALSE, see ?"[.DGEList".