Question

How to create a design matrix for differential analysis within the same cell line?

Beginner ▴ 60

@beginner-15939

Last seen 2.8 years ago

Switzerland

I have a total of 8 samples: 4 controls and 4 samples over-expressing the Foxcut gene. I have a data frame, data, of counts with genes as rows and samples as columns.

The column data for all 8 samples, with replicate and cell-line information, looks like this:

Samples             TYPE                 Replicate   Cell-lines
Cell1_HA1         Control                  1             1
Cell1_HA2         Control                  2             1
Cell1_foxcut11  FOXCUT_OverExpression      1             1
Cell1_foxcut12  FOXCUT_OverExpression      2             1
Cell2_HA1         Control                  3             2
Cell2_HA2         Control                  4             2
Cell2_foxcut11  FOXCUT_OverExpression      3             2
Cell2_foxcut12  FOXCUT_OverExpression      4             2

I have count data for all 8 samples after STAR alignment, and I'm using the edgeR package for differential analysis. This is the first time I'm doing a differential analysis on cell-line data with replicate information, and I don't know how to create the design matrix and contrast.matrix for comparisons within the same cell line.

I want to compare the following samples:

Cell1_foxcut samples vs Cell1_HA samples
Cell2_foxcut samples vs Cell2_HA samples

I tried the following, but I'm not sure whether it is right.

colnames(data) %in% coldata$Samples
coldata <- coldata[match(colnames(data), coldata$Samples),]
table(coldata$TYPE)

library(edgeR)
group <- factor(paste0(coldata$TYPE))
y <- DGEList(data,group = group)
y$samples 

## Filtering 
keep <- rowSums(cpm(y) > 0.5) >= 1

y <- y[keep, , keep.lib.sizes=FALSE]
y <- calcNormFactors(y,method = "TMM") ##Normalization

## Create design matrix
design2 <- model.matrix(~ 0 + group + coldata$Replicate + coldata$Cell-lines)

And how to give coef in contrast.matrix for differential analysis between different samples?

If the above design.matrix is not right could you please help me how to do this. I have seen tutorials and many other questions, but couldn't come to a conclusion, because I'm confused in this type of analysis.

thanks a lot

edger r differentialanalysis rnaseq designmatrix • 1.4k views

ADD COMMENT • link updated 6.7 years ago by Aaron Lun ★ 29k • written 6.7 years ago by Beginner ▴ 60

@Gordon Could you please help me with my post? Thank you.

ADD REPLY • link 6.7 years ago Beginner ▴ 60

score 0 · Answer 1 · 2019-05-03

You are missing some fundamentals of (i) how the R language works and (ii) statistics. I'll point out the issues here but would recommend you consult with a local bioinformatician to do your analysis.

design2 <- model.matrix(~ 0 + group + coldata$Replicate + coldata$Cell-lines)

coldata$Cell-lines is not a variable. It actually gets parsed into coldata$Cell - lines. If you want to refer to the column name, you should do coldata[["Cell-lines"]] or its back-ticked equivalent.
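To see the parsing problem concretely, here is a minimal base R sketch. The two-row coldata is a toy stand-in built for illustration (not your real table), created with check.names = FALSE so that the column keeps its hyphenated name.

```r
# Toy stand-in for coldata; check.names = FALSE keeps "Cell-lines" as-is.
coldata <- data.frame(
  Samples = c("Cell1_HA1", "Cell2_HA1"),
  `Cell-lines` = c(1, 2),
  check.names = FALSE
)

# quote() shows how R parses the expression without evaluating it:
# the top-level call is a subtraction, coldata$Cell minus an object named lines.
parsed <- quote(coldata$Cell-lines)
op <- as.character(parsed[[1]])
print(op)

# Two equivalent ways to reach the hyphenated column.
by_name <- coldata[["Cell-lines"]]
by_backtick <- coldata$`Cell-lines`
print(identical(by_name, by_backtick))
```

Renaming the column to something syntactic (for example CellLine) when the table is first read in would avoid the issue altogether.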

If the above design.matrix is not right could you please help me how to do this.

Putting aside your syntactic errors, the design matrix is incorrect for various reasons:

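Replicate is nested within Cell-lines (replicates 1 and 2 occur only in cell line 1, replicates 3 and 4 only in cell line 2), so the Cell-lines effect cannot be estimated separately from the replicate effects.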
You are assuming the same effect of TYPE for each cell line, which does not seem to be what you want to do in your comparisons.
Replicate is probably an integer but needs to be treated as a factor.
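The nesting problem can be checked numerically. The sketch below rebuilds the sample table from the question by hand and fits what I assume the intended design was (syntax fixed, with Replicate and Cell-lines both treated as factors), then compares the number of columns with the matrix rank. The object names naive and cell are mine, chosen for illustration.

```r
# Sample annotation re-typed from the table in the question.
coldata <- data.frame(
  TYPE = rep(rep(c("Control", "FOXCUT_OverExpression"), each = 2), times = 2),
  Replicate = c(1, 2, 1, 2, 3, 4, 3, 4),
  `Cell-lines` = rep(1:2, each = 4),
  check.names = FALSE
)

group <- factor(coldata$TYPE)
rep.num <- factor(coldata$Replicate)
cell <- factor(coldata[["Cell-lines"]])

# Assumed "intended" design: additive TYPE + replicate + cell line.
naive <- model.matrix(~ 0 + group + rep.num + cell)

n_coef <- ncol(naive)        # number of coefficients requested
n_rank <- qr(naive)$rank     # number that can actually be estimated
print(c(columns = n_coef, rank = n_rank))

# The cell-line column is exactly the sum of the replicate 3 and 4 columns.
print(all(naive[, "cell2"] == naive[, "rep.num3"] + naive[, "rep.num4"]))
```

A rank smaller than the number of columns means at least one coefficient is redundant, which is why the cell-line term cannot be estimated alongside the replicate terms.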

The correct approach would be:

group <- paste0(coldata$TYPE, ".", coldata[["Cell-lines"]])
rep.num <- factor(coldata$Replicate)
design <- model.matrix(~ 0 + group + rep.num)
design <- design[,-6] # get to full rank

The last line drops a redundant coefficient to achieve full column rank. model.matrix is smart, but it's not that smart, and sometimes the output matrix is not quite what you want. This leaves 6 coefficients:

[1] "groupControl.1"               "groupControl.2"
[3] "groupFOXCUT_OverExpression.1" "groupFOXCUT_OverExpression.2"
[5] "rep.num2"                     "rep.num4"
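If you want to convince yourself that dropping that column is the right fix, a rank check before and after is a quick sanity test. This is a sketch on a hand-typed copy of the sample table from the question; it drops the column by name rather than by position, which is my own variation and gives the same result as design[,-6] here.

```r
# Sample annotation re-typed from the table in the question.
coldata <- data.frame(
  TYPE = rep(rep(c("Control", "FOXCUT_OverExpression"), each = 2), times = 2),
  Replicate = c(1, 2, 1, 2, 3, 4, 3, 4),
  `Cell-lines` = rep(1:2, each = 4),
  check.names = FALSE
)

group <- paste0(coldata$TYPE, ".", coldata[["Cell-lines"]])
rep.num <- factor(coldata$Replicate)

full <- model.matrix(~ 0 + group + rep.num)
rank_before <- qr(full)$rank   # fewer than ncol(full): one column is redundant

# Drop the redundant replicate column by name (column 6 of the full matrix).
design <- full[, colnames(full) != "rep.num3"]
rank_after <- qr(design)$rank  # now equal to ncol(design)

print(c(cols_before = ncol(full), rank_before = rank_before,
        cols_after = ncol(design), rank_after = rank_after))
print(colnames(design))
```

Dropping by name is a little more robust if the factor levels or their ordering ever change, since the position of the redundant column would then move.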

The first 4 coefficients represent the average log-expression in the group defined by each TYPE/cell line combination(*), while the last two represent blocking factors for the replicates.

And how to give coef in contrast.matrix for differential analysis between different samples?

makeContrasts(groupFOXCUT_OverExpression.1 - groupControl.1, levels=design)
makeContrasts(groupFOXCUT_OverExpression.2 - groupControl.2, levels=design)

It is absolutely critical that you do not compare between cell lines directly. They implicitly have different Replicate numbers, so it is impossible to distinguish between the replicate effect and the cell line effect. It is, however, permissible to compare them indirectly, e.g., to see if the effect of over-expression in cell line 1 is different from the effect in cell line 2:

makeContrasts((groupFOXCUT_OverExpression.1 - groupControl.1)
     - (groupFOXCUT_OverExpression.2 - groupControl.2), levels=design)
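To make it clearer what these contrasts extract, here is a base R sketch that writes the same three contrasts out by hand as numeric vectors and applies them to a noise-free toy "gene". The toy values (baselines of 5 and 7, over-expression effects of 2 and 0.5, replicate offsets) are invented purely for illustration, and lm.fit stands in for the edgeR fit so that the example runs without any packages. In a real analysis, a column of this matrix is what you would pass as the contrast argument of glmQLFTest or glmLRT.

```r
# Sample annotation re-typed from the table in the question.
coldata <- data.frame(
  TYPE = rep(rep(c("Control", "FOXCUT_OverExpression"), each = 2), times = 2),
  Replicate = c(1, 2, 1, 2, 3, 4, 3, 4),
  `Cell-lines` = rep(1:2, each = 4),
  check.names = FALSE
)
cl <- coldata[["Cell-lines"]]

group <- paste0(coldata$TYPE, ".", cl)
rep.num <- factor(coldata$Replicate)
design <- model.matrix(~ 0 + group + rep.num)
design <- design[, colnames(design) != "rep.num3"]

# Hand-written contrasts; rows follow the column order of the design.
con <- cbind(
  Cell1 = c(-1, 0, 1, 0, 0, 0),  # FOXCUT.1 - Control.1
  Cell2 = c(0, -1, 0, 1, 0, 0)   # FOXCUT.2 - Control.2
)
con <- cbind(con, Interaction = con[, "Cell1"] - con[, "Cell2"])
rownames(con) <- colnames(design)

# Invented, noise-free log-expression values for a single toy gene.
is_fox <- coldata$TYPE == "FOXCUT_OverExpression"
toy <- c(5, 7)[cl] + c(2, 0.5)[cl] * is_fox + c(0, 0.3, 0, -0.2)[coldata$Replicate]

# Fit the linear model and apply each contrast to the coefficients.
fit <- lm.fit(design, toy)
est <- drop(crossprod(con, fit$coefficients))
print(round(est, 6))
```

Each within-cell-line contrast recovers the over-expression effect for that cell line, and the third recovers their difference, regardless of the replicate offsets.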

(*): Technically, the *.1 coefficients represent the log-expression in replicate 1 of cell line 1, and the *.2 coefficients represent the log-expression in replicate 3 of cell line 2. This is why it makes little sense to directly compare between cell lines using these coefficients, as you'd just be testing for differences in individual replicates; you wouldn't be able to account for replicate-to-replicate variability, and thus your results would not be relevant to the wider "population" of replicates. If you do need to compare between cell lines, I would subset the data to only include one sample for each Replicate, or use voom with duplicateCorrelation.