Question: Experimental design RNAseq differential expression
0
4 weeks ago by
Pietro20
Pietro20 wrote:

Dear users,

I am struggling to understand if my design is correct. I found edgeR section 3.3.1 similar to my situation but I am not that confident.

Here is my experimental design:

samples_table

sampleId    cellLine    treatment   time    IC
s1                 a      vehicle   0        S
s2                 a         drug   48       S
s3                 a         drug   168      S
s4                 b      vehicle   0        S
s5                 b         drug   48       S
s6                 b         drug   168      S
s7                 c      vehicle   0        S
s8                 c         drug   48       S
s9                 c         drug   168      S
s10                d      vehicle   0        S
s11                d         drug   48       S
s12                d         drug   168      S
s13                e      vehicle   0        R
s14                e         drug   48       R
s15                e         drug   168      R
s16                f      vehicle   0        R
s17                f         drug   48       R
s18                f         drug   168      R


I have 6 cell lines treated with a drug and RNA sequenced after 48 and 168 hours of treatment. Last column indicates if the cell line is susceptible or resistant to another compound.

I would like to find how resistant cell lines differentially respond to the drug at 48 and/or at 168 hours compared to resistant ones.

Here is my approach:

group <- factor(paste(samples_table$IC, samples_table$time, sep="."))
y <- DGEList(counts.keep, group=group)
y <- calcNormFactors(y)
design <- model.matrix(~0 + group)
colnames(design) <- levels(group)
y <- estimateDisp(y, design)

# Quasi-likelihood test
fit <- glmQLFit(y, design)

design

R.0 R.168 R.48 S.0 S.168 S.48
1    0     0    0   1     0    0
2    0     0    0   0     0    1
3    0     0    0   0     1    0
4    0     0    0   1     0    0
5    0     0    0   0     0    1
6    0     0    0   0     1    0
7    0     0    0   1     0    0
8    0     0    0   0     0    1
9    0     0    0   0     1    0
10   1     0    0   0     0    0
11   0     0    1   0     0    0
12   0     1    0   0     0    0
13   1     0    0   0     0    0
14   0     0    1   0     0    0
15   0     1    0   0     0    0
attr(,"assign")
[1] 1 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")\$group
[1] "contr.treatment"

my.contrasts <- makeContrasts(
S48 = S.48 - S.0,
S168 = S.168 - S.0,
S168vs48 = S.168 - S.48,
R48 = R.48 - R.0,
R168 = R.168 - R.0,
R168vs48 = R.168 - R.48,
RvsS.0 = R.0 - S.0,
RvsS.48 = (R.48 - R.0) - (S.48 - S.0),
RvsS.168 = (R.168 - R.0) - (S.168 - S.0),
all = (R.48 + R.168 - R.0) - (S.48 + S.168 - S.0),
levels = design
)

# to find genes that differentially respond at 48h between resistant and susceptible cell lines

qlf <- glmQLFTest(fit,
contrast = my.contrasts[ , "RvsS.48"])

topTags(qlf)


What do you think? Shouldn't I account the fact that I have different cell lines as a sort of batch effect?

Thanks a lot

Pietro

PS: cross posted to https://www.biostars.org/p/370900/

limma edger design batch contrasts • 150 views
modified 4 weeks ago by Aaron Lun23k • written 4 weeks ago by Pietro20
Answer: Experimental design RNAseq differential expression
3
4 weeks ago by
Aaron Lun23k
Cambridge, United Kingdom
Aaron Lun23k wrote:

You'll have to treat cell lines as "replicates" of each other, in some sense. Whether or not this is reasonable depends on the nature of the cell line. Assuming your table is in df, I would do:

design <- model.matrix(~cellLine + time:IC, data=df)
colnames(design) <- make.names(colnames(design)) # make valid.


The first 6 columns are the cell line-specific blocking factors and not of particular scientific interest (treating cell lines as replicates). The last two coefficients are the resistant and sensitive response to time, respectively. If you want to find genes with differences in the response, just do:

con <- makeContrasts(time.ICR - time.ICS, levels=design)


If you want to test 48 and 168 hours specifically, just convert time into a factor before you do the above. Note this requires manual removal of the time0:IC* coefficients to get the design matrix to full rank. The remaining coefficients will then represent the log-fold change over time zero in either resistant or sensitive lines.

P.S. Don't cross-post without giving people a chance to answer.

Interesting, thank you Aaron.

Unfortunately I get no significant DE genes, probably because of the high variability of the cell lines.

The high variability of the response of each cell line, to be precise.

If you can't treat the cell lines as replicates, an alternative would be to do something like this:

design <- model.matrix(~0 + cellLine + cellLine:time, df=data)
colnames(design) <- make.names(colnames(design)) # make valid.


Here, you get one intercept and slope term per cell line. You might then do:

con <- makeContrasts((cellLinea.time + cellLineb.time + cellLinec.time + cellLined.time)/4
- (cellLinee.time + cellLinef.time)/2, levels=design)


i.e., is there any difference between the average response of the resistant and sensitive cell lines. This will avoid problems with variability across cell lines, but the resulting DE genes are less generalizable (e.g., to new cell lines). There's also an assumption of linearity with respect to time.

Thanks Aaron,

It's amazing how much we learn from guys like you.

Still all FDR == 1, though.

Here is what I meant... poor design?

The strong separation between cells lines is not, in and of itself, poor design, because that just represents the biological truth. The "poorness" of the design stems from the absence of replicates within each cell line, requiring you to treat cell lines as replicates of each other. This is probably iffy, not least because the resistant cell lines are unlikely to be randomly and independently sampled from the "distribution" of all possible resistant cell lines (whatever that means); same for the sensitive cell lines.

If one had replicates within each cell line, there would be more statistical power from more samples. You wouldn't have to make the linearity assumption, which would probably improve the model fit and allow you to test differences at specific time points. Testing the effects of specific cell lines is also more concrete with regards to reproducibility, as anyone attempting to replicate the study would only need to choose the same cell lines (rather than trying to figure out how to sample from the "distribution" of cell lines).