Question

Help for defining design and contrasts of shRNA screen data

0

Entering edit mode

Jane Merlevede ▴ 90

@jane-merlevede-5019

Last seen 7.2 years ago

Dear all,

I am analyzing 34 samples of a shRNA screen using edgeR. There are 7450 shRNAs targeting 668 genes.
The experiment is composed of 7 cell lines including 3 controls, 4 treatments and 2 controls & 2 time points. The 4 cell lines which are not controls can be mutated (Mut1 or Mut2).
There are at least 2 replicates for each condition.
Below is the description of the 34 samples:

SampleID    CellLines    Time    Drug    Mut
Sample1    Ctrl1_FH    T1    None    None
Sample2    Ctrl1_FH    T2    None    None
Sample3    Ctrl2_MF    T1    None    None
Sample4    Ctrl2_MF    T2    None    None
Sample5    Ctrl3_SC    T1    None    None
Sample6    Ctrl3_SC    T2    None    None
Sample7    NE4    T1    None    Mut2
Sample8    NE4    T2    None    Mut2
Sample9    NE4    T2    Drug1    Mut2
Sample10    NE4    T2    Drug2    Mut2
Sample11    NE4    T2    Drug3    Mut2
Sample12    NE4    T2    Non-irradiated    Mut2
Sample13    NE4    T2    Irrad    Mut2
Sample14    NE5    T1    None    Mut2
Sample15    NE5    T2    None    Mut2
Sample16    NE5    T2    Drug1    Mut2
Sample17    NE5    T2    Drug2    Mut2
Sample18    NE5    T2    Drug3    Mut2
Sample19    NE5    T2    Non-irradiated    Mut2
Sample20    NE5    T2    Irrad    Mut2
Sample21    NE6    T1    None    Mut1
Sample22    NE6    T2    None    Mut1
Sample23    NE6    T2    Drug1    Mut1
Sample24    NE6    T2    Drug2    Mut1
Sample25    NE6    T2    Drug3    Mut1
Sample26    NE6    T2    Non-irradiated    Mut1
Sample27    NE6    T2    Irrad    Mut1
Sample28    NE7    T1    None    Mut1
Sample29    NE7    T2    None    Mut1
Sample30    NE7    T2    Drug1    Mut1
Sample31    NE7    T2    Drug2    Mut1
Sample32    NE7    T2    Drug3    Mut1
Sample33    NE7    T2    Non-irradiated    Mut1
Sample34    NE7    T2    Irrad    Mut1

For simplicity, I do not use the Mut variable for the beginning.
I try to answer 5 questions:
1) time effect
2) Drug1 effect
3) Drug2 effect
4) Drug3 effect
5) Irrad effect
6) "congelation" effect/ non irradiated vs untreated at T2: for irradiation study, cells were frozen. Controls cells were frozen at the same time points as well to served as controls for irradiated cells. I would like to check the effect of congelation, it is not a biological question, just for information.

I am asking help for defining the contrasts that will allow answering these questions.

Here is the R code I use:

library("edgeR")

rawdata <- read.delim("data_34Samples",header=TRUE, stringsAsFactors=FALSE) 

y <- DGEList(counts=rawdata[,5:38], ,genes=rawdata[,1:4])
selr <- rowSums(cpm(y$counts)>10)>=4

y <- y[selr,]
y <- calcNormFactors(y)

# Set up design matrix for GLM
Time= factor(c("T1","T2","T1","T2","T1","T2" ,
               "T1","T2","T2", "T2", "T2", "T2", "T2",
               "T1","T2","T2", "T2", "T2", "T2", "T2",
               "T1","T2","T2", "T2", "T2", "T2", "T2",
               "T1","T2","T2", "T2", "T2", "T2", "T2"))
CellLines=factor(c("Ctrl1_FH","Ctrl1_FH", "Ctrl2_MF","Ctrl2_MF", "Ctrl3_SC","Ctrl3_SC",
                   "NE4", "NE4", "NE4","NE4", "NE4", "NE4", "NE4",
                   "NE5", "NE5", "NE5", "NE5", "NE5", "NE5", "NE5",
                   "NE6", "NE6", "NE6", "NE6", "NE6", "NE6", "NE6",
                   "NE7", "NE7", "NE7", "NE7", "NE7", "NE7", "NE7"))
Drug= factor(c( "None",  "None", "None", "None", "None", "None",
                "None", "None", "Drug1", "Drug2", "Drug3", "NI", "Irradiated",
                "None", "None", "Drug1", "Drug2", "Drug3", "NI", "Irradiated",
                "None", "None", "Drug1", "Drug2", "Drug3", "NI", "Irradiated",
                "None", "None", "Drug1", "Drug2", "Drug3", "NI", "Irradiated"))
Drug <- relevel(Drug, "None")

design <- model.matrix(~0+CellLines+Time+Drug)
rownames(design) <- colnames(rawdata[,5:38])

# Estimate dispersions
yglm <- estimateDisp(y, design)
sqrt(yglm$common.disp)

# Fit negative bionomial GLM
fit <- glmFit(yglm, design)
colnames(design)

[1] "CellLinesCtrl1_FH" "CellLinesCtrl2_MF"   "CellLinesNE4"         "CellLinesNE5"         "CellLinesNE6"         "CellLinesNE7"        
[7] "CellLinesCtrl3_SC"     "TimeT2"                  "DrugDrug1"           "DrugDrug2"          "DrugIrradiated"          "DrugNI"                 
[13] "DrugDrug3

# Carry out Likelihood ratio tests

#1: time effect
lrt1 <- glmLRT(fit, contrast=c(0,0,0,0,0,0,0,1,0,0,0,0,0))
# Show top ranked sgRNAs
summary(de <- decideTestsDGE(lrt1,p.value=0.01, lfc=0))
[,1]
-1 1559
0  3973
1  1239

#2: Drug1 effect
lrt2 <- glmLRT(fit, contrast=c(0,0,0,0,0,0,0,-1,1,0,0,0,0))
# Show top ranked sgRNAs
summary(de <- decideTestsDGE(lrt2,p.value=0.01, lfc=0))

#3: Drug2 effect
lrt <- glmLRT(fit, contrast=c(0,0,0,0,0,0,0,1,0,0,0,0,0))
# Show top ranked sgRNAs
summary(de <- decideTestsDGE(lrt,p.value=0.01, lfc=0))

#4: Drug3 effect
lrt <- glmLRT(fit, contrast=c(0,0,0,0,0,0,0,1,0,0,0,0,0))
# Show top ranked sgRNAs
summary(de <- decideTestsDGE(lrt,p.value=0.01, lfc=0))

#5: Irrad effect
lrt5 <- glmLRT(fit, contrast=c(0,0,0,0,0,0,0,0,0,0,1,-1,0))
# Show top ranked sgRNAs
summary(de <- decideTestsDGE(lrt5,p.value=0.01, lfc=0))
   [,1]
-1    0
0  6771
1     0

#6: congelation effect
lrt <- glmLRT(fit, contrast=c(0,0,0,0,0,0,0,1,0,0,0,0,0))
# Show top ranked sgRNAs
summary(de <- decideTestsDGE(lrt,p.value=0.01, lfc=0))

1) For the first contrast, I think I can compare T2 vs T1 using contrast=c(0,0,0,0,0,0,0,1,0,0,0,0,0). Am I right?Are the information of paired samples between each T1/T2 used?

2-4) For the study of drug effects, I can test the effect of each drug.
I want to compare each cell line treated by one drug with the same cell line at the same time point with no treatment. For example, for drug 1: Samples 9, 16, 23, 30 vs Samples 8, 15, 22, 29.
I don't know how to precise the "controls" 8, 15, 22, 29 with my model.

5) For the study of irrad effect, I guess my contrast is wrong, but I don't know why.

6) For the study of congelation effect, I don't know how to precise that I want to compare "DrugNI" to (Time=T2 & Drug=none & for the pairs only (without usinfg control cellLines) )

Since it seems difficult to write the contrats, my design is probably incorrect.
Any help would be greatly appreciated.
Thank you in advance.

edgeR design and contrast matrix • 2.2k views

ADD COMMENT • link updated 9.3 years ago by Aaron Lun ★ 29k • written 9.3 years ago by Jane Merlevede ▴ 90

score 1 · Answer 1 · 2016-08-18

1

Entering edit mode

Aaron Lun ★ 29k

@alun

Last seen 3 hours ago

The city by the bay

For your first question, yes, that contrast will test for a non-zero average effect of time across all cell lines. You've blocked on the cell line in the model so the pairing will be automatically considered.

For your second to fourth questions, you should be setting up contrasts like below. This will compare the treated time-2 sample of each cell line to the corresponding untreated time-2 sample, and will average the effects across all cell lines. (The coefficient for each drug represents the log-fold change over untreated time-2.)

con.drug1 <- c(0,0,0,0,0,0,0,0,1,0,0,0,0) # drop DrugDrug1
con.drug2 <- c(0,0,0,0,0,0,0,0,0,1,0,0,0) # drop DrugDrug2
con.drug3 <- c(0,0,0,0,0,0,0,0,0,0,1,0,0) # drop DrugDrug3

I don't know how the Drug3 coefficient managed to come after the irradiated/NI coefficients in your code example, I guess it's something to do with different character orders. Just adjust it appropriately in your setting, or use makeContrasts.

For your fifth question, your irradiation contrast looks fine. Because the coefficient ordering is different for me, it would look like:

# compare DrugIrradiated to DrugNI 
con.irrad <- c(0,0,0,0,0,0,0,0,0,0,0,1,-1)

Finally, if you want to compare non-irradiated time-2 samples to the time-2 untreated sample, you could just do:

con.nonir <- c(0,0,0,0,0,0,0,0,0,0,0,0,1) # Drop DrugNI

I should point out that there's no point blocking on mutation status, because the cell line factor is nested within it. This means that if you added Mut as an additive term, you'd end up with linear dependencies in the design matrix. Instead, what you can do is describe mutation-specific responses to time or drug or irradiation, using something like this:

design <- model.matrix(~0+CellLines+Time:Mut+Drug:Mut)

This will account for the possibility that the control cell lines respond differently from the mutated lines when treated with various things. However, the design matrix will be a lot more complicated, so be warned.

ADD COMMENT • link 9.3 years ago Aaron Lun ★ 29k

0

Entering edit mode

Thank you a lot Aaron for your precise answers.

[2-4] When I test drug, irrad and congelation effects, I have none deregulation.

#2: Drug1 effect
lrt2 <- glmLRT(fit, contrast=c(0,0,0,0,0,0,0,0,1,0,0,0,0))
# Show top ranked sgRNAs
summary(de <- decideTestsDGE(lrt2,p.value=0.01, lfc=0))
[,1]
-1    0
0  6671
1     0

#3: Drug2 effect
lrt3 <- glmLRT(fit, contrast=c(0,0,0,0,0,0,0,0,0,1,0,0,0))
# Show top ranked sgRNAs
summary(de <- decideTestsDGE(lrt3,p.value=0.01, lfc=0))
[,1]
-1    0
0  6771
1     0

#4: Drug3 effect
lrt4 <- glmLRT(fit, contrast=c(0,0,0,0,0,0,0,0,0,0,0,0,1))
# Show top ranked sgRNAs
summary(de <- decideTestsDGE(lrt4,p.value=0.01, lfc=0))
-1    0
0  6771
1     0

#5: Irrad effect
lrt5 <- glmLRT(fit, contrast=c(0,0,0,0,0,0,0,0,0,0,1,-1,0))
# Show top ranked sgRNAs
summary(de <- decideTestsDGE(lrt5,p.value=0.01, lfc=0))
   [,1]
-1    0
0  6671
1     0

#6: congelation effect
lrt6 <- glmLRT(fit, contrast=c(0,0,0,0,0,0,0,0,0,0,0,1,0))
# Show top ranked sgRNAs
summary(de <- decideTestsDGE(lrt6,p.value=0.01, lfc=0))
[,1]
-1    1
0  6670
1     0

It is very surprising to get no effect at all, especially that for most tests, there is no FDR near 0.05 or even 0.1...

I do not understand why when using these contrasts, the samples that received drug1 are compared to the same samples that did not receive any treatment at T2. For each non-controls cell lines, there are 7 replicates, including 2 that are non treated (one at T1 and one at T2). I wonder if the treated samples are not compared to both T1 & T2 untreated cell lines...

[1] The effect of time is studied with the contrast: contrast=c(0,0,0,0,0,0,0,1,0,0,0,0,0).

In fact, I would like to test the effect of time on control cell lines (3 replicates: samples 2,4,6 vs samples 1,3,5) and on non-control cell lines (4 replicates: samples 8, 15, 22, 29 vs samples 7, 14, 21, 28). I tried the following:

#1: time effect for controls
lrt1 <- glmLRT(fit, contrast=c(1/3,1/3,0,0,0,0,1/3,1,0,0,0,0,0))
# Show top ranked sgRNAs
summary(de <- decideTestsDGE(lrt1,p.value=0.01, lfc=0))
[,1]
-1 6671
0     0
1     0

#1bis: time effect for non-control cell lines
lrt1bis <- glmLRT(fit, contrast=c(0,0,1/4,1/4,1/4,1/4,0,1,0,0,0,0,0))
# Show top ranked sgRNAs
summary(de <- decideTestsDGE(lrt1bis,p.value=0.01, lfc=0))
[,1]
-1 6671
0     0
1     0

Do I have to add a variable specifying for each sample if it is a control or not?

Thank you also for your advise concerning the mutational status. I won't add it for now, I will answer all the other questions first, but I will definitively try it at the end.

ADD REPLY • link 9.3 years ago Jane Merlevede ▴ 90

0

Entering edit mode

Each drugged, NI or irradiated sample is done at T2, so it makes sense that those samples will only be compared to the untreated T2 sample for any given cell line. Otherwise, your treatment effect would be confounded with the time effect if you tried to compare T2 drug-treated samples to T1 untreated samples (or to the average of T1 and T2).

Now, as for why you don't detect any DE genes; perhaps you should do some data exploration to see what's happening in your analysis. Do the treatments separate on a MDS plot? How large are the dispersion estimates, and how big are the log-fold changes? It may not be that you don't have any effect, it's just that the replicate-to-replicate variability is too large to detect it. This is not surprising for shRNA screens, which seem to be a lot more variable than standard RNA-seq experiments.

Finally, if you want to test for control-specific time effects, you'll have to do something similar to what I described with Mut. The current design assumes that the effect of time/drug/irradiation is the same for each cell line, so there's no way you can perform mutation-specific contrasts. Switching the design may also help with reducing the dispersion estimates, because any mutation-specific behaviour will not be fitted correctly (and will inflate the apparent variability) with the current design.

ADD REPLY • link 9.3 years ago Aaron Lun ★ 29k

0

Entering edit mode

Thank you again.

On the MDS plot, samples do not cluster by treatment, but by cell type mainly. One highly homogeneous cluster is composed of the samples at T1, whatever the cell type. sqrt(yglm$common.disp)=0.37.

To test control vs tumor cell types, I added the variable Type and rewrote the model:

Type=factor(c("Control","Control","Control","Control","Control","Control",  "Tumor","Tumor","Tumor","Tumor","Tumor","Tumor","Tumor","Tumor","Tumor","Tumor","Tumor","Tumor","Tumor","Tumor","Tumor","Tumor","Tumor","Tumor","Tumor","Tumor","Tumor","Tumor","Tumor","Tumor","Tumor","Tumor","Tumor","Tumor"))
Type <- relevel(Type, "Control")

design <- model.matrix(~0+CellLines+Time:Type+Drug)
rownames(design) <- colnames(rawdata[,5:38])

colnames(design)
[1] "CellLinesCtrl1_FH" "CellLinesCtrl2_MF"   "CellLinesNE4"         "CellLinesNE5"         "CellLinesNE6"         "CellLinesNE7"        
[7] "CellLinesCtrl3_SC"     "DrugDrug1"           "DrugDrug2"          "DrugIrradiated"          "DrugNI"                  "DrugDrug3"       
[13] "TimeT1:TypeControl"          "TimeT2:TypeControl"          "TimeT1:TypeTumor"          "TimeT2:TypeTumor"        

yglm <- estimateDisp(y, design)
Error in glmFit.default(y$counts[sel, ], design, offset = offset[sel,  :
Design matrix not of full rank.  The following coefficients not estimable:
TimeT2:TypeControl TimeT2:TypeTumor

But now, the design matrix is not of full rank because 'Type' factor is redundant (we can find the type from the cell lines). Any suggestion?

Thank you in advance

ADD REPLY • link 9.3 years ago Jane Merlevede ▴ 90

0

Entering edit mode

To address your first point; that's a fairly big dispersion, so if the log2-fold changes are small (i.e., less than 1), it's entirely possible that you won't have enough power to detect DE. Obviously, you have the multiple testing correction to consider as well, so even if there's one or two genes with strong log-fold changes, their p-values may end up being bumped above the desired threshold by the BH correction.

As for your design matrix, I would suggest dropping TimeT1:TypeControl and TimeT1:TypeTumor. This gets you back to full rank and means that TimeT2:TypeControl and TimeT2:TypeTumor can be directly interpreted as the time effect (of T2 - T1) in control and tumor samples, respectively. It should also decrease your dispersion estimate, though that may or may not make a difference.

ADD REPLY • link 9.3 years ago Aaron Lun ★ 29k

0

Entering edit mode

Thank you a lot for your help.

I did not manage to remove TimeT1:TypeControl and TimeT1:TypeTumor from my design in a proper model.

I tried several models, including ~Type+Time:Type+CellLines+Drug that remove "TypeControl:TimeT2" and "TypeTumor:TimeT2" but for which the design matrix is not full rank. Can you please tell me how I can do it?

Using the design model.matrix(~0+CellLines+Time+Drug), some log2FC are above 1, but not very high:

For drug1 effect:

lrt2 <- glmLRT(fit, contrast=c(0,0,0,0,0,0,0,0,1,0,0,0,0))
> topTags(lrt2)
Coefficient:  1*DrugDrug1
logFC   logCPM        LR       PValue       FDR
2.9799978 7.297090 15.272228 9.307492e-05 0.6209028
3.4263815 2.695302 13.747392 2.091109e-04 0.6974894
0.8956251 6.283026 11.145643 8.422927e-04 1.0000000
1.1346983 4.389094 10.833519 9.967901e-04 1.0000000
2.2961060 2.144074  9.527960 2.023649e-03 1.0000000
-1.1633082 4.750350  8.761706 3.076211e-03 1.0000000
0.7539900 6.439005  7.859837 5.054495e-03 1.0000000
-1.0267952 4.987337  7.812037 5.189938e-03 1.0000000
-1.3345994 5.120204  7.569400 5.936753e-03 1.0000000
0.9123838 4.303168  7.235598 7.147201e-03 1.0000000

For drug2 effect:

lrt3 <- glmLRT(fit, contrast=c(0,0,0,0,0,0,0,0,0,1,0,0,0))
> topTags(lrt3)
Coefficient:  1*DrugDrug2
logFC   logCPM       LR      PValue       FDR
2.3684345 7.427169 8.478908 0.003592877 0.9995883
1.4046833 4.209204 8.303472 0.003956936 0.9995883
-1.2903085 4.074975 7.723884 0.005449505 0.9995883
0.7789655 5.996795 7.653827 0.005665202 0.9995883
-0.5527061 7.526940 7.103694 0.007692524 0.9995883
0.5650339 6.729865 7.090303 0.007750216 0.9995883
-1.1043724 7.693558 6.908663 0.008577910 0.9995883
-1.1370034 6.009219 6.608990 0.010146519 0.9995883
-0.6363230 7.262186 6.179702 0.012922401 0.9995883
-0.6099724 7.329452 5.950450 0.014713529 0.9995883

For drug3 effect:

Coefficient:  1*DrugDrug3
logFC   logCPM        LR       PValue       FDR
4.2244466 9.157926 18.334159 1.853538e-05 0.1236495
-1.4571474 4.435609  8.392686 3.767337e-03 0.9998432
-0.8364629 6.194259  7.343956 6.728859e-03 0.9998432
0.8411572 5.928609  6.959348 8.338239e-03 0.9998432
-0.7433334 7.906262  6.785312 9.191096e-03 0.9998432
0.9282393 7.954196  6.657916 9.871627e-03 0.9998432
-1.1169799 4.786234  6.508956 1.073325e-02 0.9998432
1.5525921 2.510491  6.331517 1.186108e-02 0.9998432
-0.8986939 6.903729  6.317812 1.195311e-02 0.9998432
1.0954241 7.204220  6.206467 1.272844e-02 0.9998432

For irrad effect:

Coefficient:  1*DrugIrradiated -1*DrugNI
logFC   logCPM        LR       PValue       FDR
-3.917052 7.041146 16.845979 4.053913e-05 0.1811841
-1.642399 4.889577 16.291050 5.431993e-05 0.1811841
-2.006384 5.585286 12.229156 4.704849e-04 0.9996997
-2.845546 6.233897 11.158008 8.366973e-04 0.9996997
1.147908 6.522431 11.084916 8.703275e-04 0.9996997
1.104593 4.684845 10.532542 1.172908e-03 0.9996997
-1.651363 3.669673 10.442020 1.231805e-03 0.9996997
1.892145 6.726671  9.979417 1.582998e-03 0.9996997
-3.255104 2.695302  9.539765 2.010674e-03 0.9996997
1.866870 2.988207  9.296391 2.296057e-03 0.9996997

For congelation effect:

lrt6 <- glmLRT(fit, contrast=c(0,0,0,0,0,0,0,0,0,0,0,1,0))
topTags(lrt6)
Coefficient:  1*DrugNI
TlogFC   logCPM       LR       PValue         FDR
-2.656426 6.029387 25.76135 3.863478e-07 0.002577326
-2.389441 4.593385 19.61792 9.457773e-06 0.031546401
-2.207612 5.489149 17.50708 2.862397e-05 0.036153887
1.104606 8.537751 17.45063 2.948659e-05 0.036153887
-1.148784 6.942375 17.41478 3.004797e-05 0.036153887
-1.703085 5.322701 17.26469 3.251736e-05 0.036153887
-2.177608 4.038652 16.55934 4.715112e-05 0.044935020
1.005283 9.097557 16.07430 6.090516e-05 0.045905174
-1.955200 5.147749 16.04264 6.193173e-05 0.045905174
1.415395 9.567192 15.58317 7.895406e-05 0.052670256

The FDR are smaller in congelation effect, which is not a biological effect in which we are interested. Too bad...

ADD REPLY • link 9.3 years ago Jane Merlevede ▴ 90

0

Entering edit mode

To remove the coefficients, you literally remove them, i.e.:

design <- model.matrix(~0+CellLines+Time:Type+Drug)
design <- design[,-grep("^TimeT1", colnames(design))]

The model.matrix function is not all-knowing, so some manual adjustment is required for complex designs.

As for your lack of DE results, I suspect that the responses to the drug in your experiment are simply too variable to get enough power. I would suggest having a look at the paired CPMs (i.e., for each drug and the untreated sample from the same cell line) and comparing these across cell lines for one of the top genes. I dare say that the drug/untreated fold change would probably not be very consistent across the cell lines.

Now, even if this is the case, you can still salvage something. The drug 3 and irradiation results give you a handful of genes at a FDR of 20%. These are candidates for further validation - which are admittedly less reliable than what you might have gotten at 5%, but better than nothing. It just depends on how much money and time you or your collaborators are prepared to spend/waste on trying to validate false positives.

ADD REPLY • link 9.3 years ago Aaron Lun ★ 29k

0

Entering edit mode

Thank you, the model is correct now.

I can test the effect of time in both control and untreated tumor cell types:

colnames(design)
[1] "CellLinesCtrl1_FH" "CellLinesCtrl2_MF"   "CellLinesNE4"         "CellLinesNE5"         "CellLinesNE6"         "CellLinesNE7"       
[7] "CellLinesCtrl3_SC"     "DrugDrug1"           "DrugDrug2"          "DrugIrradiated"          "DrugNI"                  "DrugDrug3"      
[13] "TimeT2:TypeControl"          "TimeT2:TypeTumor"

lrt1 <- glmLRT(fit, contrast=c(0,0,0,0,0,0,0,0,0,0,0,0,1,0))
summary(de <- decideTestsDGE(lrt1,p.value=0.01, lfc=0))
[,1]
-1  968
0  4752
1   951

lrt1bis <- glmLRT(fit, contrast=c(0,0,0,0,0,0,0,0,0,0,0,0,0,1))
# Show top ranked sgRNAs
summary(de <- decideTestsDGE(lrt1bis,p.value=0.01, lfc=0))
[,1]
-1 1018
0  4859
1   794

Would you recommend to use several models to do the different tests separately? To see if the dispersion can decrease?

Concerning the correction for multiple testing, I will use 20%, since there is nothing below 10%. We have in average 10 sh per gene. Initially our aim was to study genes for which >=50% of sh were deregulated.

If we have >=50% sh for a same gene deregulated with FDR<=20%, it could be a candidate. From the results, there won't be several deregulated sh.

ADD REPLY • link 9.3 years ago Jane Merlevede ▴ 90

0

Entering edit mode

Contrasts look correct to me. One possibility is that, in the model with no type-specific time coefficient, the T2-T1 effect in the tumor samples is underestimated. This results in a lower fitted value for the T2 untreated tumor sample, and (incorrectly) a higher log-fold change for the drug effect. However, in a model with type-specific time coefficients, this is avoided which gives you the true, smaller log-fold change for the drug effect and a higher FDR. You'll have to dig around your data set and analysis to find out.

I wouldn't use multiple models to do different tests. That the DE is not robust to a relatively modest reparametrization should already be a warning signal about the reliability of the results. Changing the model to push the FDR below a threshold would be disingenuous if you know that it doesn't show up elsewhere. Your conclusion - that there won't be a lot of DE shRNAs - is probably the right one for the drug contrasts. The time contrasts seem better, but I don't know how biologically interesting that would be.

ADD REPLY • link 9.3 years ago Aaron Lun ★ 29k

0

Entering edit mode

Thank you Aaron for your time and explanation.

I tried using estimateGLMRobustDisp() instead of estimateDisp() but unsurprisingly, it did not change a lot the results. I will try DESeq2, dispersion estimation is a bit different, but I guess it won't change dramatically.

Indeed, the time effect is not the most interesting effect in our experiment, but at least, there is something to work on. Since I would like to see which shRNA are deregulated across time in tumor cell types, and are not deregulated in normal cell types, I compared my two previous contrasts. Can you please confirm that it is correct? I guess it is better to do it so, rather than comparing the 2 lists.

#1: time effect for normal cells
lrt1 <- glmLRT(fit, contrast=c(0,0,0,0,0,0,0,0,0,0,0,0,1,0))
summary(de <- decideTestsDGE(lrt1,p.value=0.01, lfc=0))
[,1]
-1  968
0  4752
1   951

#1bis: time effect for tumor cells
lrt1bis <- glmLRT(fit, contrast=c(0,0,0,0,0,0,0,0,0,0,0,0,0,1))
summary(de <- decideTestsDGE(lrt1bis,p.value=0.01, lfc=0))
[,1]
-1 1018
0  4859
1   794

#1ter: time effect between tumor and normal cells
lrt1ter <- glmLRT(fit, contrast=c(0,0,0,0,0,0,0,0,0,0,0,0,0,1)- c(0,0,0,0,0,0,0,0,0,0,0,0,1,0) )
summary(de <- decideTestsDGE(lrt1ter,p.value=0.01, lfc=0))
[,1]
-1  104
0  6499
1    68

Finally, here are my last questions:

- can I be sure that edgeR(/DESeq) models are well adapted to studied a subset of genes (here ~700 genes for ~7000 sh)?

- since shRNAseq data show more dispersion that RNAseq experiment, how many biological replicates do you recommend per condition in general?

ADD REPLY • link 9.3 years ago Jane Merlevede ▴ 90

0

Entering edit mode

The 1ter contrast will test for differential responses to time between tumor and normal cells. You can't interpret these as genes that are not DE with respect to time in normal cells - they may well be DE in normal cells, it's just that the tumor log-fold change with respect to time is not the same as the log-fold change in normal samples. Mind you, this is still interesting.

For your second question, edgeR, DESeq and empirical Bayes methods in general should still be fine for several hundred features. This is because there's still information to share across genes.

For your last question, I have no idea. If you read http://f1000research.com/articles/3-95/, you'll see that they use 4 replicates in each group. However, these look like they're "proper" replicates, i.e., same cell line and treatment. Your situation is more complicated because you have same treatments across different cell lines, and you assume that different cell lines will behave similarly.

ADD REPLY • link 9.3 years ago Aaron Lun ★ 29k

0

Entering edit mode

Ok, thank you for these answers.

Thank you a lot for your patience and great help!

ADD REPLY • link 9.3 years ago Jane Merlevede ▴ 90