Search
Question: Limma analysis combining twins pairs and unpaired controls
0
8 days ago by
kalletrontti0 wrote:

Hi all,

I have RNA-seq data from twins of which one of the pair is sick (n=5) and the other one is healthy (n=5). In addition, I have six healthy controls that are not related to the twins.

I wish to look for differential expression (a) between sick twins compared to their healthy twin pairs, and (b) between sick twins and healthy controls including the healthy twins and unrelated healthy controls.

For part (a) I take that blocking might be most suitable as most twin pairs cluster tightly in an MDS plot. The code I wrote based on limma's manual is the following, please correct if you point out something peculiar:

pairs <- factor(expDesign$Pair) #Levels "P1".."P5" or "unpaired" disease <- factor(expDesign$Disease)    #Levels "Sick" and "Healthy"
des <- model.matrix(~pairs+disease)      #Pair info to block

y <- DGEList(counts=data)
keep<- rowSums(cpm(y)>1) >= 6
y <- y[keep, ]
y <- calcNormFactors(y, method="TMM")
v <- voom(y, des, plot=TRUE)

topTable(fit, coef=2, sort.by = "p")

For aim (b) I guess limma's duplicateCorrelation() could be appropriate. However I have doubts in my skills writing the code (below) and would appreciate any advice on it. Also all comments for what would be best practise to analyse this kind of data structure are warmly welcome!

dge <- DGEList(counts=data)
keep<- rowSums(cpm(dge)>1) >= 6
dge <- dge[keep, ]
dge$samples$lib.size <- colSums(dge$counts) dge <- calcNormFactors(dge) design <- model.matrix(~expDesign$Disease)
v <- voom(dge, design, plot=TRUE)
corfit <- duplicateCorrelation(v, design, block=expDesign$Pair) corfit$consensus     #Gives 0.2208816
fit <- lmFit(v, design, block=expDesign$Pair, correlation=corfit$consensus)
fit <- eBayes(fit)
topTable(fit, coef=2, sort.by = "p")

Thanks a lot!

modified 7 days ago by Aaron Lun21k • written 8 days ago by kalletrontti0
0
7 days ago by
Aaron Lun21k
Cambridge, United Kingdom
Aaron Lun21k wrote:

• I would consider removing the unpaired samples in your first analysis. For the paired samples, the variance is driven by differences in the disease effect across twins - you don't have to worry about genotypic effects (or other effects that are common between twins) because that's absorbed by the pairs terms. In contrast, the variance in the unpaired samples will be driven by differences in genotype and other individual-specific effects. By analyzing everything together, you are unnecessarily introducing two distinct modes of variation, which violates assumptions of equal variance in the linear model. Admittedly, limma is quite robust to this, but why take the risk?
• I assume you're missing some code between voom and topTable in your first analysis.
• In your second analysis, try one round of iteration for voom and duplicateCorrelation (i.e., re-estimating the weights with the consensus correlation, and then re-estimating the correlation with the new weights). This should improve both estimates by ensuring that the weights account for the correlations and vice versa. See Gordon's comments at using duplicateCorrelation with limma+voom for RNA-seq data.
• I usually like setting robust=TRUE in eBayes to protect against genes with outlier variances.
• There's no need to recompute the library sizes if you use keep.lib.sizes=FALSE, see ?"[.DGEList".

Many thanks Aaron. Indeed there's a part of code missing between voom and topTable, copy-paste ...

For the first analysis I forgot to mention that actually the unpaired samples were removed prior analysis, sorry for that.

I'll try the iteration for the second part asap, thanks for the tip!

Kalle