Dear all,
I am analyzing 34 samples of a shRNA screen using edgeR. There are 7450 shRNAs targeting 668 genes.
The experiment is composed of 7 cell lines including 3 controls, 4 treatments and 2 controls & 2 time points. The 4 cell lines which are not controls can be mutated (Mut1 or Mut2).
There are at least 2 replicates for each condition.
Below is the description of the 34 samples:
SampleID CellLines Time Drug Mut
Sample1 Ctrl1_FH T1 None None
Sample2 Ctrl1_FH T2 None None
Sample3 Ctrl2_MF T1 None None
Sample4 Ctrl2_MF T2 None None
Sample5 Ctrl3_SC T1 None None
Sample6 Ctrl3_SC T2 None None
Sample7 NE4 T1 None Mut2
Sample8 NE4 T2 None Mut2
Sample9 NE4 T2 Drug1 Mut2
Sample10 NE4 T2 Drug2 Mut2
Sample11 NE4 T2 Drug3 Mut2
Sample12 NE4 T2 Non-irradiated Mut2
Sample13 NE4 T2 Irrad Mut2
Sample14 NE5 T1 None Mut2
Sample15 NE5 T2 None Mut2
Sample16 NE5 T2 Drug1 Mut2
Sample17 NE5 T2 Drug2 Mut2
Sample18 NE5 T2 Drug3 Mut2
Sample19 NE5 T2 Non-irradiated Mut2
Sample20 NE5 T2 Irrad Mut2
Sample21 NE6 T1 None Mut1
Sample22 NE6 T2 None Mut1
Sample23 NE6 T2 Drug1 Mut1
Sample24 NE6 T2 Drug2 Mut1
Sample25 NE6 T2 Drug3 Mut1
Sample26 NE6 T2 Non-irradiated Mut1
Sample27 NE6 T2 Irrad Mut1
Sample28 NE7 T1 None Mut1
Sample29 NE7 T2 None Mut1
Sample30 NE7 T2 Drug1 Mut1
Sample31 NE7 T2 Drug2 Mut1
Sample32 NE7 T2 Drug3 Mut1
Sample33 NE7 T2 Non-irradiated Mut1
Sample34 NE7 T2 Irrad Mut1
For simplicity, I do not use the Mut variable for the beginning.
I try to answer 5 questions:
1) time effect
2) Drug1 effect
3) Drug2 effect
4) Drug3 effect
5) Irrad effect
6) "congelation" effect/ non irradiated vs untreated at T2: for irradiation study, cells were frozen. Controls cells were frozen at the same time points as well to served as controls for irradiated cells. I would like to check the effect of congelation, it is not a biological question, just for information.
I am asking help for defining the contrasts that will allow answering these questions.
Here is the R code I use:
library("edgeR")
rawdata <- read.delim("data_34Samples",header=TRUE, stringsAsFactors=FALSE)
y <- DGEList(counts=rawdata[,5:38], ,genes=rawdata[,1:4]) selr <- rowSums(cpm(y$counts)>10)>=4
y <- y[selr,]
y <- calcNormFactors(y)
# Set up design matrix for GLM Time= factor(c("T1","T2","T1","T2","T1","T2" , "T1","T2","T2", "T2", "T2", "T2", "T2", "T1","T2","T2", "T2", "T2", "T2", "T2", "T1","T2","T2", "T2", "T2", "T2", "T2", "T1","T2","T2", "T2", "T2", "T2", "T2")) CellLines=factor(c("Ctrl1_FH","Ctrl1_FH", "Ctrl2_MF","Ctrl2_MF", "Ctrl3_SC","Ctrl3_SC", "NE4", "NE4", "NE4","NE4", "NE4", "NE4", "NE4", "NE5", "NE5", "NE5", "NE5", "NE5", "NE5", "NE5", "NE6", "NE6", "NE6", "NE6", "NE6", "NE6", "NE6", "NE7", "NE7", "NE7", "NE7", "NE7", "NE7", "NE7")) Drug= factor(c( "None", "None", "None", "None", "None", "None", "None", "None", "Drug1", "Drug2", "Drug3", "NI", "Irradiated", "None", "None", "Drug1", "Drug2", "Drug3", "NI", "Irradiated", "None", "None", "Drug1", "Drug2", "Drug3", "NI", "Irradiated", "None", "None", "Drug1", "Drug2", "Drug3", "NI", "Irradiated")) Drug <- relevel(Drug, "None")
design <- model.matrix(~0+CellLines+Time+Drug) rownames(design) <- colnames(rawdata[,5:38])
# Estimate dispersions yglm <- estimateDisp(y, design) sqrt(yglm$common.disp)
# Fit negative bionomial GLM fit <- glmFit(yglm, design) colnames(design)
[1] "CellLinesCtrl1_FH" "CellLinesCtrl2_MF" "CellLinesNE4" "CellLinesNE5" "CellLinesNE6" "CellLinesNE7" [7] "CellLinesCtrl3_SC" "TimeT2" "DrugDrug1" "DrugDrug2" "DrugIrradiated" "DrugNI" [13] "DrugDrug3
# Carry out Likelihood ratio tests
#1: time effect lrt1 <- glmLRT(fit, contrast=c(0,0,0,0,0,0,0,1,0,0,0,0,0)) # Show top ranked sgRNAs summary(de <- decideTestsDGE(lrt1,p.value=0.01, lfc=0)) [,1] -1 1559 0 3973 1 1239
#2: Drug1 effect lrt2 <- glmLRT(fit, contrast=c(0,0,0,0,0,0,0,-1,1,0,0,0,0)) # Show top ranked sgRNAs summary(de <- decideTestsDGE(lrt2,p.value=0.01, lfc=0))
#3: Drug2 effect lrt <- glmLRT(fit, contrast=c(0,0,0,0,0,0,0,1,0,0,0,0,0)) # Show top ranked sgRNAs summary(de <- decideTestsDGE(lrt,p.value=0.01, lfc=0))
#4: Drug3 effect lrt <- glmLRT(fit, contrast=c(0,0,0,0,0,0,0,1,0,0,0,0,0)) # Show top ranked sgRNAs summary(de <- decideTestsDGE(lrt,p.value=0.01, lfc=0))
#5: Irrad effect lrt5 <- glmLRT(fit, contrast=c(0,0,0,0,0,0,0,0,0,0,1,-1,0)) # Show top ranked sgRNAs summary(de <- decideTestsDGE(lrt5,p.value=0.01, lfc=0)) [,1] -1 0 0 6771 1 0
#6: congelation effect lrt <- glmLRT(fit, contrast=c(0,0,0,0,0,0,0,1,0,0,0,0,0)) # Show top ranked sgRNAs summary(de <- decideTestsDGE(lrt,p.value=0.01, lfc=0))
1) For the first contrast, I think I can compare T2 vs T1 using contrast=c(0,0,0,0,0,0,0,1,0,0,0,0,0). Am I right?Are the information of paired samples between each T1/T2 used?
2-4) For the study of drug effects, I can test the effect of each drug.
I want to compare each cell line treated by one drug with the same cell line at the same time point with no treatment. For example, for drug 1: Samples 9, 16, 23, 30 vs Samples 8, 15, 22, 29.
I don't know how to precise the "controls" 8, 15, 22, 29 with my model.
5) For the study of irrad effect, I guess my contrast is wrong, but I don't know why.
6) For the study of congelation effect, I don't know how to precise that I want to compare "DrugNI" to (Time=T2 & Drug=none & for the pairs only (without usinfg control cellLines) )
Since it seems difficult to write the contrats, my design is probably incorrect.
Any help would be greatly appreciated.
Thank you in advance.
Thank you a lot Aaron for your precise answers.
[2-4] When I test drug, irrad and congelation effects, I have none deregulation.
It is very surprising to get no effect at all, especially that for most tests, there is no FDR near 0.05 or even 0.1...
I do not understand why when using these contrasts, the samples that received drug1 are compared to the same samples that did not receive any treatment at T2. For each non-controls cell lines, there are 7 replicates, including 2 that are non treated (one at T1 and one at T2). I wonder if the treated samples are not compared to both T1 & T2 untreated cell lines...
[1] The effect of time is studied with the contrast:
contrast=c(0,0,0,0,0,0,0,1,0,0,0,0,0).
In fact, I would like to test the effect of time on control cell lines (3 replicates: samples 2,4,6 vs samples 1,3,5) and on non-control cell lines (4 replicates: samples 8, 15, 22, 29 vs samples 7, 14, 21, 28). I tried the following:
Do I have to add a variable specifying for each sample if it is a control or not?
Thank you also for your advise concerning the mutational status. I won't add it for now, I will answer all the other questions first, but I will definitively try it at the end.
Each drugged, NI or irradiated sample is done at T2, so it makes sense that those samples will only be compared to the untreated T2 sample for any given cell line. Otherwise, your treatment effect would be confounded with the time effect if you tried to compare T2 drug-treated samples to T1 untreated samples (or to the average of T1 and T2).
Now, as for why you don't detect any DE genes; perhaps you should do some data exploration to see what's happening in your analysis. Do the treatments separate on a MDS plot? How large are the dispersion estimates, and how big are the log-fold changes? It may not be that you don't have any effect, it's just that the replicate-to-replicate variability is too large to detect it. This is not surprising for shRNA screens, which seem to be a lot more variable than standard RNA-seq experiments.
Finally, if you want to test for control-specific time effects, you'll have to do something similar to what I described with
Mut
. The current design assumes that the effect of time/drug/irradiation is the same for each cell line, so there's no way you can perform mutation-specific contrasts. Switching the design may also help with reducing the dispersion estimates, because any mutation-specific behaviour will not be fitted correctly (and will inflate the apparent variability) with the current design.Thank you again.
On the MDS plot, samples do not cluster by treatment, but by cell type mainly. One highly homogeneous cluster is composed of the samples at T1, whatever the cell type. sqrt(yglm$common.disp)=0.37.
To test control vs tumor cell types, I added the variable Type and rewrote the model:
But now, the design matrix is not of full rank because 'Type' factor is redundant (we can find the type from the cell lines). Any suggestion?
Thank you in advance
To address your first point; that's a fairly big dispersion, so if the log2-fold changes are small (i.e., less than 1), it's entirely possible that you won't have enough power to detect DE. Obviously, you have the multiple testing correction to consider as well, so even if there's one or two genes with strong log-fold changes, their p-values may end up being bumped above the desired threshold by the BH correction.
As for your design matrix, I would suggest dropping
TimeT1:TypeControl
andTimeT1:TypeTumor
. This gets you back to full rank and means thatTimeT2:TypeControl
andTimeT2:TypeTumor
can be directly interpreted as the time effect (of T2 - T1) in control and tumor samples, respectively. It should also decrease your dispersion estimate, though that may or may not make a difference.Thank you a lot for your help.
I did not manage to remove
TimeT1:TypeControl
andTimeT1:TypeTumor from my design in a proper model.
I tried several models, including
~Type+Time:Type+CellLines+Drug that remove "TypeControl
:TimeT2" and "TypeTumor
:TimeT2" but for which the design matrix is not full rank. Can you please tell me how I can do it?Using the design
model.matrix(~0+CellLines+Time+Drug),
some log2FC are above 1, but not very high:For drug1 effect:
For drug2 effect:
For drug3 effect:
For irrad effect:
For congelation effect:
The FDR are smaller in congelation effect, which is not a biological effect in which we are interested. Too bad...
To remove the coefficients, you literally remove them, i.e.:
The
model.matrix
function is not all-knowing, so some manual adjustment is required for complex designs.As for your lack of DE results, I suspect that the responses to the drug in your experiment are simply too variable to get enough power. I would suggest having a look at the paired CPMs (i.e., for each drug and the untreated sample from the same cell line) and comparing these across cell lines for one of the top genes. I dare say that the drug/untreated fold change would probably not be very consistent across the cell lines.
Now, even if this is the case, you can still salvage something. The drug 3 and irradiation results give you a handful of genes at a FDR of 20%. These are candidates for further validation - which are admittedly less reliable than what you might have gotten at 5%, but better than nothing. It just depends on how much money and time you or your collaborators are prepared to spend/waste on trying to validate false positives.
Thank you, the model is correct now.
I can test the effect of time in both control and untreated tumor cell types:
Would you recommend to use several models to do the different tests separately? To see if the dispersion can decrease?
Concerning the correction for multiple testing, I will use 20%, since there is nothing below 10%. We have in average 10 sh per gene. Initially our aim was to study genes for which >=50% of sh were deregulated.
If we have >=50% sh for a same gene deregulated with FDR<=20%, it could be a candidate. From the results, there won't be several deregulated sh.
Contrasts look correct to me. One possibility is that, in the model with no type-specific time coefficient, the T2-T1 effect in the tumor samples is underestimated. This results in a lower fitted value for the T2 untreated tumor sample, and (incorrectly) a higher log-fold change for the drug effect. However, in a model with type-specific time coefficients, this is avoided which gives you the true, smaller log-fold change for the drug effect and a higher FDR. You'll have to dig around your data set and analysis to find out.
I wouldn't use multiple models to do different tests. That the DE is not robust to a relatively modest reparametrization should already be a warning signal about the reliability of the results. Changing the model to push the FDR below a threshold would be disingenuous if you know that it doesn't show up elsewhere. Your conclusion - that there won't be a lot of DE shRNAs - is probably the right one for the drug contrasts. The time contrasts seem better, but I don't know how biologically interesting that would be.
Thank you Aaron for your time and explanation.
I tried using estimateGLMRobustDisp() instead of estimateDisp() but unsurprisingly, it did not change a lot the results. I will try DESeq2, dispersion estimation is a bit different, but I guess it won't change dramatically.
Indeed, the time effect is not the most interesting effect in our experiment, but at least, there is something to work on. Since I would like to see which shRNA are deregulated across time in tumor cell types, and are not deregulated in normal cell types, I compared my two previous contrasts. Can you please confirm that it is correct? I guess it is better to do it so, rather than comparing the 2 lists.
Finally, here are my last questions:
- can I be sure that edgeR(/DESeq) models are well adapted to studied a subset of genes (here ~700 genes for ~7000 sh)?
- since shRNAseq data show more dispersion that RNAseq experiment, how many biological replicates do you recommend per condition in general?
The
1ter
contrast will test for differential responses to time between tumor and normal cells. You can't interpret these as genes that are not DE with respect to time in normal cells - they may well be DE in normal cells, it's just that the tumor log-fold change with respect to time is not the same as the log-fold change in normal samples. Mind you, this is still interesting.For your second question, edgeR, DESeq and empirical Bayes methods in general should still be fine for several hundred features. This is because there's still information to share across genes.
For your last question, I have no idea. If you read http://f1000research.com/articles/3-95/, you'll see that they use 4 replicates in each group. However, these look like they're "proper" replicates, i.e., same cell line and treatment. Your situation is more complicated because you have same treatments across different cell lines, and you assume that different cell lines will behave similarly.
Ok, thank you for these answers.
Thank you a lot for your patience and great help!