We are doing a loss-of-function shRNA-Seq experiment with 185 shRNA for 37 different genes. We have two conditions (WT and MUT) with each two groups (before and after selection). For each we have five replicate (one only with four0.
we have the following sample data:
sample condition group A1 WT Input A2 WT Input A3 WT Input A4 WT Input A5 WT Input A6 WT after sel. A7 WT after sel. A8 WT after sel. A9 WT after sel. A10 WT after sel. B1 MUT Input B2 MUT Input B3 MUT Input B4 MUT Input B5 MUT Input B6 MUT after sel. B8 MUT after sel. B9 MUT after sel. B10 MUT after sel.
after reading the paper/tutorial I was thinking on how to apply it to my data and would like to ask for advice.
We are interested in two things
1. first we would like to know genes/shRNAs which are changed within each of the conditions
2. but mainly we would like to know which genes are showing a dropout in the MUT compared to the WT, as to be able to amrk them as gene targets.
I have created the DGE object
$ x.dge
An object of class "DGEList"
$counts
A1_R1 A2_R1 A3_R1 A4_R1 A5_R1 A6_R1 A7_R1 A8_R1 A9_R1 A10_R1 B1_R1 B2_R1 B3_R1
Abcf1_TRCN0000113400 1538 2064 1817 3197 2600 876 1057 1645 808 1165 1348 1741 1968
Abcf1_TRCN0000113401 1742 1982 2450 3221 4532 1023 1187 1411 1152 1361 2206 3111 3368
B4_R1 B5_R1 B6_R1 B8_R1 B9_R1 B10_R1
Abcf1_TRCN0000113400 1257 2743 1140 1202 1619 367
Abcf1_TRCN0000113401 3579 1573 1325 1781 2620 1204
167 more rows ...
$samples
sample condition group lib.size norm.factors
1 A1 WT Input 226541 1
2 A2 WT Input 246536 1
3 A3 WT Input 246198 1
4 A4 WT Input 356525 1
5 A5 WT Input 317193 1
14 more rows ...
$genes
gene clone
[1,] "Abcf1" "TRCN0000113400"
[2,] "Abcf1" "TRCN0000113401"
[3,] "Abcf1" "TRCN0000113402"
[4,] "Abcf1" "TRCN0000113403"
[5,] "Abcf1" "TRCN0000113404"
174 more rows ...
For my first question I was wondering if it is possible to do it from within this experimental design, such as like
des = model.matrix(~0 + condition + group)
Would this gives me the pair-wise comparison for WT and MUT each separately? or should they better be separated?
For my second (main) question, I am not sure which way to go.
I was thinking about using the following design
des <- model.matrix(~condition+group)
or even
des <- model.matrix(~0 + condition + condition:group)
I would appreciate it, if you can suggest me the (more) correct experimental design for the two questions i have here.
thanks in advance.
Assa

Thanks for the reply and sorry for the late response.
when using the makeContrast function for the genotype-specific columns I took
Does this mean that shRNAs with positive values are those which will be expressed stronger in the mutant. Is that true?
To get the
topTagsI than need to (following your tutorial)estimateDisp, fit the negative GLM and calculate the LRT. Is that correct?I am not quite sure what you did with setting the design matrix to full rank by deleting the two columns of
genotypeMut.selectionInputandgenotypeWT.selectionInput.Can you please explain me why it was necessary?
Firstly, your
contmatcontrast won't work because the names aren't syntactically valid. You should replace:with., which was the whole point of usingmake.namesin my original answer.The interpretation of the log-fold changes from your
contmatcontrast is somewhat complex. (I'll call these interaction log-fold changes to distinguish them from the selection log-fold changes within each genotype.) Positive interaction log-fold changes mean that selection log-fold change in mutants is greater than the selection log-fold change in the wild-type, in a literal mathematical sense. The biological interpretation depends on the signs of the selection log-fold changes:All of these scenarios yield a positive interaction log-fold change, but can have different interpretations regarding the nature of the differences in the selection effect between genotypes. (Flip around the logic for negative interaction log-fold changes.) That's why it's important to also report the individual selection log-fold changes in your results for this comparison.
Regarding your code: I don't know why you're switching between the QL and non-QL functions. Just use the QL methods throughout, they are more accurate than the non-QL methods for most (well-behaved) data sets.
Regarding the rank: if you don't discard those two columns, you'll notice that the vector obtained by taking the row sums of the
paircolumns is the same as the vector obtained by taking the row sums of the remaining columns. This means that the columns of the design matrix are linearly dependent such that there is no unique solution for the coefficients, i.e., it is impossible to have a single best fit of the GLM. For example, let's say that I found a "best" fit where allpaircoefficients were equal to 1 and all other coefficients were equal to zero. But in this case, I could obtain the exact same fit by setting allpaircoefficients to zero and all other coefficients to 1; thus, there is no unique solution.By removing some columns, I break the linear dependencies and allow the GLM to be fitted. I removed the
Inputcolumns so that the remainingSelectedcolumns represent the selection effect (over input). If I removed theSelectedcolumns, the remainingInputcolumns would represent the log-fold change of input over selection, which is somewhat more difficult to interpret. One could also conceivably remove some of thepairscolumns instead, though again, this is difficult to interpret.The contrasts matrix I have seen the problem already and changed it beforehand. the QL-function I haven't so thanks for that remark.
But i am not really happy with the results of the differential analysis.
I don't get really good results neither for the MUT comparison before against after
> topTags(lrtMUT) Coefficient: genotypeMut.selectionSelected gene clone logFC logCPM F Actg1_TRCN0000090838 Actg1 TRCN0000090838 1.0544953 13.21519 21.076957 Tec_TRCN0000023677 Stat5b TRCN0000012555 0.7259754 13.26116 10.572919 Eif1a_TRCN0000127032 Eif1a TRCN0000127032 -0.9475732 12.07639 10.297922 Ppp2cb_TRCN0000012414 Ppp2ca TRCN0000081364 -0.9073209 11.98588 9.113047 Cct5_TRCN0000120435 Cct5 TRCN0000120435 -1.0152988 11.19151 7.786791 Ddx3y_TRCN0000103635 Ddx3y TRCN0000103635 0.6715271 12.54775 6.821690 Elp3_TRCN0000039310 Elp3 TRCN0000039310 0.6833334 12.66146 6.821188 Ppp2ca_TRCN0000081367 Pls3 TRCN0000090327 -0.6298224 12.36050 5.454700 Vti1b_TRCN0000070345 Tnk2 TRCN0000023750 0.5169813 13.76148 5.122278 PValue FDR Actg1_TRCN0000090838 4.874608e-06 0.0008384325 Tec_TRCN0000023677 1.179363e-03 0.0783710614 Eif1a_TRCN0000127032 1.366937e-03 0.0783710614 Ppp2cb_TRCN0000012414 2.591325e-03 0.1114269770 Cct5_TRCN0000120435 5.346040e-03 0.1839037928 Ddx3y_TRCN0000103635 9.117758e-03 0.2240989255 Elp3_TRCN0000039310 9.120305e-03 0.2240989255 Ppp2ca_TRCN0000081367 1.967930e-02 0.4231048988 Vti1b_TRCN0000070345 2.379805e-02 0.4548071127nor for the comparison between the two cell lines
> topTags(qlf) Coefficient: 1*genotypeMut.selectionSelected -1*genotypeWT.selectionSelected gene clone logFC logCPM F Eif1a_TRCN0000127031 Eif1a TRCN0000127031 -0.9626886 12.75681 8.960356 Esyt1_TRCN0000195831 Esyt1 TRCN0000195831 -0.9836939 12.51458 7.824886 Stat5a_TRCN0000012548 Prkcc TRCN0000361432 0.8031007 12.71289 6.289016 Actg1_TRCN0000090840 Actg1 TRCN0000090840 -0.7418338 13.40752 6.247534 G6pdx_TRCN0000041445 G6pdx TRCN0000041445 0.7678739 12.85149 6.119261 Anxa5_TRCN0000110703 Anxa5 TRCN0000110703 -0.9936413 12.04494 6.033830 Elp3_TRCN0000039312 Elp3 TRCN0000039312 -0.7492201 12.96512 5.982340 Ddx3y_TRCN0000103635 Ddx3y TRCN0000103635 0.7915068 12.54775 5.383761 Hsp90aa1_TRCN0000008490 Hsp90aa1 TRCN0000008490 0.8542422 12.30693 5.370018 PValue FDR Eif1a_TRCN0000127031 0.002815284 0.3585509 Esyt1_TRCN0000195831 0.005235240 0.3585509 Stat5a_TRCN0000012548 0.012279605 0.3585509 Actg1_TRCN0000090840 0.012568928 0.3585509 G6pdx_TRCN0000041445 0.013508780 0.3585509 Anxa5_TRCN0000110703 0.014174630 0.3585509 Elp3_TRCN0000039312 0.014592188 0.3585509 Ddx3y_TRCN0000103635 0.020491493 0.3897318 Hsp90aa1_TRCN0000008490 0.020652814 0.3897318For the comparison of the WT before against after, I get the best results (Which I didn't expect).
> topTags(lrtWT) Coefficient: genotypeWT.selectionSelected gene clone logFC logCPM F Alg2_TRCN0000110467 Alg2 TRCN0000110467 0.7415676 13.17845 13.874937 G6pdx_TRCN0000041446 G6pdx TRCN0000041446 0.7348983 13.48410 13.614341 Esyt1_TRCN0000195831 Esyt1 TRCN0000195831 0.8515671 12.51458 13.296532 Actg1_TRCN0000090840 Actg1 TRCN0000090840 0.6879894 13.40752 11.965458 Eif1a_TRCN0000127031 Eif1a TRCN0000127031 0.7234900 12.75681 11.301483 Actg1_TRCN0000090838 Actg1 TRCN0000090838 0.6364693 13.21519 10.210039 Prkacb_TRCN0000022815 Ppp2cb TRCN0000012414 -0.6652491 12.60112 8.898436 Stat5b_TRCN0000012553 Rplp0 TRCN0000104539 -0.7589073 12.15132 8.729890 Prkacb_TRCN0000022817 Ppp2cb TRCN0000012416 -0.6172584 12.66945 8.002400 PValue FDR Alg2_TRCN0000110467 0.0002044349 0.01589564 G6pdx_TRCN0000041446 0.0002344890 0.01589564 Esyt1_TRCN0000195831 0.0002772495 0.01589564 Actg1_TRCN0000090840 0.0005609017 0.02411878 Eif1a_TRCN0000127031 0.0007987634 0.02747746 Actg1_TRCN0000090838 0.0014330492 0.04108074 Prkacb_TRCN0000022815 0.0029116266 0.06861122 Stat5b_TRCN0000012553 0.0031912193 0.06861122 Prkacb_TRCN0000022817 0.0047490323 0.08371599How much weight should I put on the FDR values? Or should I in this case concentrate more on the logFC values, as i have only ~180 shRNA molecules for approx. 30 genes.?
Only the FDR (or perhaps more precisely, the adjusted p-value relative to a FDR threshold) can be trusted for decisions relating to the hypothesis of interest, i.e., is there actually an effect or not? The log-fold change is purely a descriptive statistic, as it doesn't capture the variability of the underlying observations.
Anyway, I don't see why you're unhappy about your results. Your interaction contrast (middle) indicates that no gene exhibits a significant difference in selection effects between genotypes at a FDR of 5%. As things go, this is a pretty clear-cut result. The most parsimonious explanation would be that the genotype just doesn't have a big effect on selection. Maybe you don't like that, but maybe that's the truth. You should only talk about the "goodness" of your results when you know the ground truth, and you haven't mentioned any positive controls in this experiment.
Note that a slight increase in power for the wild-type selection effect is expected as you have more samples for the wild-type (5 selection/input pairs) compared to the mutant (4 selection/input pairs + 1 input sample that is effectively ignored upon blocking). However, if you're not seeing any significant interaction effects, then it's hard to be interested in the selection effects by themselves, as they could just represent the general effect of time/passages/etc.
Of course, I could imagine all sorts of things going wrong when you have so few features, or when your replicates are highly variable. You'll have to look at diagnostic plots (e.g., MA plots for normalization, MDS plots, mean-dispersion plots) to judge that and modify the appropriate edgeR parameters in response to obvious issues, e.g., setting
robust=TRUEinglmQLFitwhen you see outliers on the QL mean-dispersion plot.yes I was thinking the same thing. I have already done the QC plots and they do not look like I have expected it to look like:
plotBCV:
plotMDS
and the Smear plots
plotSmear(lrtWT, de.tags = de.genes, , main="WT Smear plot")plotSmear(qlf, de.tags = de.genes, main="comparison between cell lines")It really does not look like I was expected. The shRNa molecules which shows a large CPM have very small logFC.
The shape of the MA plots are mostly expected. Large log-fold changes at low abundance for each plot are caused by the high variance relative to the mean at low counts. This is why you should rely on the p-values for identifying genes of interest rather than looking at the reported log-fold changes.
I am a bit surprised at the presence of a clear distinction between low- and high-abundance genes. I don't know why some shRNAs would be so readily sequenced and others not as much, even before any selection has taken place. I would guess your input shRNA composition might not be properly balanced. Note, low abundances are unlikely to be caused by a decrease in abundance upon selection, as we would expect to see all of them to have strong negative log-fold changes in that case.
The other major issue with having so few shRNAs is that normalization via
calcNormFactorsassumes that most shRNAs have no effect. However, the strength of this assumption requires negative control shRNAs to assess, so unless you have them, it's hard to say anything either way.Just for the sake of testing I have tried to analyse only the first group of samples from
WT(the first 10 samples).I have created a new
DEGListobject using only this subsetx.WT = new("DGEList") x.WT$counts = as.matrix(WT.table) x.WT$samples$sample <- as.factor(x.WT$samples$sample) x.WT$samples$condition <- as.factor(x.WT$samples$condition) x.WT$samples$group <- as.factor(x.WT$samples$group) x.WT$samples$lib.size <- colSums(x.WT$counts) x.WT$samples$norm.factors = 1 selection <- factor(rep(c("Input", "Selected"), each=5)) WT.design <- model.matrix(~selection)and run the DE analysis
Now I get (a bit) better results.
> topT$table[1:20,] gene clone logFC logCPM LR PValue Esyt1_TRCN0000195831 Esyt1 TRCN0000195831 0.8566637 12.767383 16.222104 5.633298e-05 G6pdx_TRCN0000041446 G6pdx TRCN0000041446 0.7397171 13.741256 15.886842 6.724462e-05 ... Elp3_TRCN0000039312 Elp3 TRCN0000039312 0.4948316 13.104579 7.034434 7.995714e-03 Esyt1_TRCN0000243846 Esyt1 TRCN0000243846 2.7831804 4.125457 6.794237 9.145258e-03 Pls3_TRCN0000090325 Pls3 TRCN0000090325 -0.4610140 12.890230 6.127389 1.331033e-02 FDR Esyt1_TRCN0000195831 0.005997144 G6pdx_TRCN0000041446 0.005997144 ... Elp3_TRCN0000039312 0.079512937 Esyt1_TRCN0000243846 0.086157961 Pls3_TRCN0000090325 0.113556912At least I have 19 shRNAs which show significant behavior, compared to 12 before that. I know it is not a lot, but, would it means, that it is better to run the analysis separately?
It seems that your definition of "better" is "more DE genes". This is incorrect unless you have positive controls. Why is it so hard to accept that there are not many (or any) DE genes?
And no, it is not "better" to run the analysis separately. (1) You don't get to share information about dispersion estimation across samples. (2) You can't compare between genotypes.
I hope you are not choosing your analysis parameters on the basis of what gives you the most DE genes. That would be p-hacking, which is Bad.