Question: shRNA-Seq analysis using edgeR
0
13 months ago by
Assa Yeroslaviz1.4k
Munich, Germany
Assa Yeroslaviz1.4k wrote:

We are doing a loss-of-function shRNA-Seq experiment with 185 shRNA for 37 different genes. We have two conditions (WT and MUT) with each two groups (before and after selection). For each we have five replicate (one only with four0.

we have the following sample data:

sample condition group
A1 WT Input
A2 WT Input
A3 WT Input
A4 WT Input
A5 WT Input
A6 WT after sel.
A7 WT after sel.
A8 WT after sel.
A9 WT after sel.
A10 WT after sel.
B1 MUT Input
B2 MUT Input
B3 MUT Input
B4 MUT Input
B5 MUT Input
B6 MUT after sel.
B8 MUT after sel.
B9 MUT after sel.
B10 MUT after sel.

after reading the paper/tutorial I was thinking on how to apply it to my data and would like to ask for advice.

We are interested in two things

1. first we would like to know genes/shRNAs which are changed within each of the conditions

2. but mainly we would like to know which genes are showing a dropout in the MUT compared to the WT, as to be able to amrk them as gene targets.

I have created the DGE object

$x.dge An object of class "DGEList"$counts
A1_R1 A2_R1 A3_R1 A4_R1 A5_R1 A6_R1 A7_R1 A8_R1 A9_R1 A10_R1 B1_R1 B2_R1 B3_R1
Abcf1_TRCN0000113400  1538  2064  1817  3197  2600   876  1057  1645   808   1165  1348  1741  1968
Abcf1_TRCN0000113401  1742  1982  2450  3221  4532  1023  1187  1411  1152   1361  2206  3111  3368
B4_R1 B5_R1 B6_R1 B8_R1 B9_R1 B10_R1
Abcf1_TRCN0000113400  1257  2743  1140  1202  1619    367
Abcf1_TRCN0000113401  3579  1573  1325  1781  2620   1204
167 more rows ...

$samples sample condition group lib.size norm.factors 1 A1 WT Input 226541 1 2 A2 WT Input 246536 1 3 A3 WT Input 246198 1 4 A4 WT Input 356525 1 5 A5 WT Input 317193 1 14 more rows ...$genes
gene    clone
[1,] "Abcf1" "TRCN0000113400"
[2,] "Abcf1" "TRCN0000113401"
[3,] "Abcf1" "TRCN0000113402"
[4,] "Abcf1" "TRCN0000113403"
[5,] "Abcf1" "TRCN0000113404"
174 more rows ...

For my first question I was wondering if it is possible to do it from within this experimental design, such as like

des = model.matrix(~0 + condition + group)

Would this gives me the pair-wise comparison for WT and MUT each separately? or should they better be separated?

For my second (main) question, I am not sure which way to go.

I was thinking about using the following design

des <- model.matrix(~condition+group)

or even

des <- model.matrix(~0 + condition + condition:group)

I would appreciate it, if you can suggest me the  (more) correct experimental design for the two questions i have here.

Assa

modified 13 months ago by Aaron Lun25k • written 13 months ago by Assa Yeroslaviz1.4k
1
13 months ago by
Aaron Lun25k
Cambridge, United Kingdom
Aaron Lun25k wrote:

genotype <- factor(rep(c("WT", "Mut"), each=10))[-17]
selection <- factor(rep(rep(c("Input", "Selected"), each=5), 2))[-17]

I also assume your before/after selection samples are paired in some manner? Which gives:

pair <- factor(c(rep(1:5, 2), rep(6:10, 2)))[-17]

If the above is correct, I would set up a design matrix like so:

design <- model.matrix(~0 + pair + genotype:selection)
design <- design[,!grepl("Input", colnames(design))] # get to full rank
colnames(design) <- make.names(colnames(design)) # make happy colnames


Each of the last two terms represents the effect of selection in the corresponding genotype, which should answer 1. You can also compare them to each other with makeContrasts, which will tell you the genotype-specific changes upon selection for 2.

Thanks for the reply and sorry for the late response.

when using the makeContrast function for the genotype-specific columns I took

contmat = makeContrasts(DiffGentypes = genotypeMut.selectionSelected - genotypeWT.selectionSelected, levels = design)

Does this mean that shRNAs with positive values are those which will be expressed stronger in the mutant. Is that true?

xglm = estimateDisp(DGE, design)
qlfMUT <- glmQLFit(xglm, coef=11) # to get the topTags of genotypeMut
lrtMUT = glmQLFTest(qlfMUT, 11)
topTags(lrtMUT)
#
qlfWT <- glmQLFit(xglm, coef=12) # to get the topTags of genotypeWT
lrtWT = glmQLFTest(qlfWT, 12)
topTags(lrtWT)
#
xglm <- glmQLFit(DGE, design)
qlf <- glmQLFTest(xglm, contrast=contmat[,"DiffGentypes"])
topTags(qlf)


To get the topTags I than need to (following your tutorial) estimateDisp, fit the negative GLM and calculate the LRT. Is that correct?

I am not quite sure what you did with setting the design matrix to full rank by deleting the two columns of genotypeMut.selectionInput and genotypeWT.selectionInput.

Can you please explain me why it was necessary?

Firstly, your contmat contrast won't work because the names aren't syntactically valid. You should replace : with ., which was the whole point of using make.names in my original answer.

The interpretation of the log-fold changes from your contmat contrast is somewhat complex. (I'll call these interaction log-fold changes to distinguish them from the selection log-fold changes within each genotype.) Positive interaction log-fold changes mean that selection log-fold change in mutants is greater than the selection log-fold change in the wild-type, in a literal mathematical sense. The biological interpretation depends on the signs of the selection log-fold changes:

• Selection log-FC in mutant > selection log-FC in wild-type > 0 means that the gene has a positive selection effect in both genotypes, but is greater in mutants compared to the wild-type.
• Selection log-FC in mutant > 0 > selection log-FC in wild-type means that the gene has a positive selection effect in mutants and a negative one in the wild-type.
• 0 > selection log-FC in mutant > selection log-FC in wild-type means that the gene has a negative selection effect in both genotypes, but is weaker in mutants compared to the wild-type.

All of these scenarios yield a positive interaction log-fold change, but can have different interpretations regarding the nature of the differences in the selection effect between genotypes. (Flip around the logic for negative interaction log-fold changes.) That's why it's important to also report the individual selection log-fold changes in your results for this comparison.

Regarding your code: I don't know why you're switching between the QL and non-QL functions. Just use the QL methods throughout, they are more accurate than the non-QL methods for most (well-behaved) data sets.

Regarding the rank: if you don't discard those two columns, you'll notice that the vector obtained by taking the row sums of the pair columns is the same as the vector obtained by taking the row sums of the remaining columns. This means that the columns of the design matrix are linearly dependent such that there is no unique solution for the coefficients, i.e., it is impossible to have a single best fit of the GLM. For example, let's say that I found a "best" fit where all pair coefficients were equal to 1 and all other coefficients were equal to zero. But in this case, I could obtain the exact same fit by setting all pair coefficients to zero and all other coefficients to 1; thus, there is no unique solution.

By removing some columns, I break the linear dependencies and allow the GLM to be fitted. I removed the Input columns so that the remaining Selected columns represent the selection effect (over input). If I removed the Selected columns, the remaining Input columns would represent the log-fold change of input over selection, which is somewhat more difficult to interpret. One could also conceivably remove some of the pairs columns instead, though again, this is difficult to interpret.

The contrasts matrix I have seen the problem already and changed it beforehand. the QL-function I haven't so thanks for that remark.

But i am not really happy with the results of the differential analysis.

I don't get really good results neither for the MUT comparison before against after

> topTags(lrtMUT)
Coefficient:  genotypeMut.selectionSelected
gene          clone      logFC   logCPM         F
Actg1_TRCN0000090838   Actg1 TRCN0000090838  1.0544953 13.21519 21.076957
Tec_TRCN0000023677    Stat5b TRCN0000012555  0.7259754 13.26116 10.572919
Eif1a_TRCN0000127032   Eif1a TRCN0000127032 -0.9475732 12.07639 10.297922
Ppp2cb_TRCN0000012414 Ppp2ca TRCN0000081364 -0.9073209 11.98588  9.113047
Cct5_TRCN0000120435     Cct5 TRCN0000120435 -1.0152988 11.19151  7.786791
Ddx3y_TRCN0000103635   Ddx3y TRCN0000103635  0.6715271 12.54775  6.821690
Elp3_TRCN0000039310     Elp3 TRCN0000039310  0.6833334 12.66146  6.821188
Ppp2ca_TRCN0000081367   Pls3 TRCN0000090327 -0.6298224 12.36050  5.454700
Vti1b_TRCN0000070345    Tnk2 TRCN0000023750  0.5169813 13.76148  5.122278
PValue          FDR
Actg1_TRCN0000090838  4.874608e-06 0.0008384325
Tec_TRCN0000023677    1.179363e-03 0.0783710614
Eif1a_TRCN0000127032  1.366937e-03 0.0783710614
Ppp2cb_TRCN0000012414 2.591325e-03 0.1114269770
Cct5_TRCN0000120435   5.346040e-03 0.1839037928
Ddx3y_TRCN0000103635  9.117758e-03 0.2240989255
Elp3_TRCN0000039310   9.120305e-03 0.2240989255
Ppp2ca_TRCN0000081367 1.967930e-02 0.4231048988
Vti1b_TRCN0000070345  2.379805e-02 0.4548071127

nor for the comparison between the two cell lines

> topTags(qlf)
Coefficient:  1*genotypeMut.selectionSelected -1*genotypeWT.selectionSelected
gene          clone      logFC   logCPM        F
Eif1a_TRCN0000127031       Eif1a TRCN0000127031 -0.9626886 12.75681 8.960356
Esyt1_TRCN0000195831       Esyt1 TRCN0000195831 -0.9836939 12.51458 7.824886
Stat5a_TRCN0000012548      Prkcc TRCN0000361432  0.8031007 12.71289 6.289016
Actg1_TRCN0000090840       Actg1 TRCN0000090840 -0.7418338 13.40752 6.247534
G6pdx_TRCN0000041445       G6pdx TRCN0000041445  0.7678739 12.85149 6.119261
Anxa5_TRCN0000110703       Anxa5 TRCN0000110703 -0.9936413 12.04494 6.033830
Elp3_TRCN0000039312         Elp3 TRCN0000039312 -0.7492201 12.96512 5.982340
Ddx3y_TRCN0000103635       Ddx3y TRCN0000103635  0.7915068 12.54775 5.383761
Hsp90aa1_TRCN0000008490 Hsp90aa1 TRCN0000008490  0.8542422 12.30693 5.370018
PValue       FDR
Eif1a_TRCN0000127031    0.002815284 0.3585509
Esyt1_TRCN0000195831    0.005235240 0.3585509
Stat5a_TRCN0000012548   0.012279605 0.3585509
Actg1_TRCN0000090840    0.012568928 0.3585509
G6pdx_TRCN0000041445    0.013508780 0.3585509
Anxa5_TRCN0000110703    0.014174630 0.3585509
Elp3_TRCN0000039312     0.014592188 0.3585509
Ddx3y_TRCN0000103635    0.020491493 0.3897318
Hsp90aa1_TRCN0000008490 0.020652814 0.3897318


For the comparison of the WT before against after, I get the best results (Which I didn't expect).

> topTags(lrtWT)
Coefficient:  genotypeWT.selectionSelected
gene          clone      logFC   logCPM         F
Alg2_TRCN0000110467         Alg2 TRCN0000110467  0.7415676 13.17845 13.874937
G6pdx_TRCN0000041446       G6pdx TRCN0000041446  0.7348983 13.48410 13.614341
Esyt1_TRCN0000195831       Esyt1 TRCN0000195831  0.8515671 12.51458 13.296532
Actg1_TRCN0000090840       Actg1 TRCN0000090840  0.6879894 13.40752 11.965458
Eif1a_TRCN0000127031       Eif1a TRCN0000127031  0.7234900 12.75681 11.301483
Actg1_TRCN0000090838       Actg1 TRCN0000090838  0.6364693 13.21519 10.210039
Prkacb_TRCN0000022815     Ppp2cb TRCN0000012414 -0.6652491 12.60112  8.898436
Stat5b_TRCN0000012553      Rplp0 TRCN0000104539 -0.7589073 12.15132  8.729890
Prkacb_TRCN0000022817     Ppp2cb TRCN0000012416 -0.6172584 12.66945  8.002400
PValue        FDR
Alg2_TRCN0000110467     0.0002044349 0.01589564
G6pdx_TRCN0000041446    0.0002344890 0.01589564
Esyt1_TRCN0000195831    0.0002772495 0.01589564
Actg1_TRCN0000090840    0.0005609017 0.02411878
Eif1a_TRCN0000127031    0.0007987634 0.02747746
Actg1_TRCN0000090838    0.0014330492 0.04108074
Prkacb_TRCN0000022815   0.0029116266 0.06861122
Stat5b_TRCN0000012553   0.0031912193 0.06861122
Prkacb_TRCN0000022817   0.0047490323 0.08371599

How much weight should I put on the FDR values? Or should I in this case concentrate more on the logFC values, as i have only ~180 shRNA molecules for approx. 30 genes.?

Only the FDR (or perhaps more precisely, the adjusted p-value relative to a FDR threshold) can be trusted for decisions relating to the hypothesis of interest, i.e., is there actually an effect or not? The log-fold change is purely a descriptive statistic, as it doesn't capture the variability of the underlying observations.

Anyway, I don't see why you're unhappy about your results. Your interaction contrast (middle) indicates that no gene exhibits a significant difference in selection effects between genotypes at a FDR of 5%. As things go, this is a pretty clear-cut result. The most parsimonious explanation would be that the genotype just doesn't have a big effect on selection. Maybe you don't like that, but maybe that's the truth. You should only talk about the "goodness" of your results when you know the ground truth, and you haven't mentioned any positive controls in this experiment.

Note that a slight increase in power for the wild-type selection effect is expected as you have more samples for the wild-type (5 selection/input pairs) compared to the mutant (4 selection/input pairs + 1 input sample that is effectively ignored upon blocking). However, if you're not seeing any significant interaction effects, then it's hard to be interested in the selection effects by themselves, as they could just represent the general effect of time/passages/etc.

Of course, I could imagine all sorts of things going wrong when you have so few features, or when your replicates are highly variable. You'll have to look at diagnostic plots (e.g., MA plots for normalization, MDS plots, mean-dispersion plots) to judge that and modify the appropriate edgeR parameters in response to obvious issues, e.g., setting robust=TRUE in glmQLFit when you see outliers on the QL mean-dispersion plot.

yes I was thinking the same thing. I have already done the QC plots and they do not look like I have expected it to look like:

plotBCV:

plotMDS

and the Smear plots

plotSmear(lrtMUT, de.tags = de.genes, main = "MUT smear plot")


plotSmear(lrtWT, de.tags = de.genes, , main="WT Smear plot")


plotSmear(qlf, de.tags = de.genes, main="comparison between cell lines")

It really does not look like I was expected. The shRNa molecules which shows a large CPM have very small logFC.

The shape of the MA plots are mostly expected. Large log-fold changes at low abundance for each plot are caused by the high variance relative to the mean at low counts. This is why you should rely on the p-values for identifying genes of interest rather than looking at the reported log-fold changes.

I am a bit surprised at the presence of a clear distinction between low- and high-abundance genes. I don't know why some shRNAs would be so readily sequenced and others not as much, even before any selection has taken place. I would guess your input shRNA composition might not be properly balanced. Note, low abundances are unlikely to be caused by a decrease in abundance upon selection, as we would expect to see all of them to have strong negative log-fold changes in that case.

The other major issue with having so few shRNAs is that normalization via calcNormFactors assumes that most shRNAs have no effect. However, the strength of this assumption requires negative control shRNAs to assess, so unless you have them, it's hard to say anything either way.

Just for the sake of testing I have tried to analyse only the first group of samples from WT (the first 10 samples).

I have created a new DEGList object using only this subset

x.WT = new("DGEList")
x.WT$counts = as.matrix(WT.table) x.WT$samples$sample <- as.factor(x.WT$samples$sample) x.WT$samples$condition <- as.factor(x.WT$samples$condition) x.WT$samples$group <- as.factor(x.WT$samples$group) x.WT$samples$lib.size <- colSums(x.WT$counts)
x.WT$samples$norm.factors = 1

selection <- factor(rep(c("Input", "Selected"), each=5))
WT.design <- model.matrix(~selection)

and run the DE analysis

xglm = estimateDisp(x.WT, WT.design)
plotBCV(xglm, main = "WT: BCV Plot")
fit = glmFit(xglm, WT.design)
lrt = glmLRT(fit, 2)
topT <- topTags(lrt, n = Inf)

Now I get (a bit) better results.

> topT\$table[1:20,]
gene          clone      logFC    logCPM        LR       PValue
Esyt1_TRCN0000195831       Esyt1 TRCN0000195831  0.8566637 12.767383 16.222104 5.633298e-05
G6pdx_TRCN0000041446       G6pdx TRCN0000041446  0.7397171 13.741256 15.886842 6.724462e-05
...
Elp3_TRCN0000039312         Elp3 TRCN0000039312  0.4948316 13.104579  7.034434 7.995714e-03
Esyt1_TRCN0000243846       Esyt1 TRCN0000243846  2.7831804  4.125457  6.794237 9.145258e-03
Pls3_TRCN0000090325         Pls3 TRCN0000090325 -0.4610140 12.890230  6.127389 1.331033e-02
FDR
Esyt1_TRCN0000195831    0.005997144
G6pdx_TRCN0000041446    0.005997144
...
Elp3_TRCN0000039312     0.079512937
Esyt1_TRCN0000243846    0.086157961
Pls3_TRCN0000090325     0.113556912

At least I have 19 shRNAs which show significant behavior, compared to 12 before that. I know it is not a lot, but, would it means, that it is better to run the analysis separately?

It seems that your definition of "better" is "more DE genes". This is incorrect unless you have positive controls. Why is it so hard to accept that there are not many (or any) DE genes?

And no, it is not "better" to run the analysis separately. (1) You don't get to share information about dispersion estimation across samples. (2) You can't compare between genotypes.

I hope you are not choosing your analysis parameters on the basis of what gives you the most DE genes. That would be p-hacking, which is Bad.