edgeR time series analysis with no replicates
5
1
Entering edit mode
sirintra ▴ 10
@sirintra-8831
Last seen 4.6 years ago
United Kingdom

Hi,

I have RNAseq data from an shRNA knockdown screening experiment and I 'm trying to perform differential test using EdgeR. What we are interesting in is the difference between control and treatment (Dox) at the last time point compared to the first time point. So my design matrix is as follow and I applied estimateGLMCommonDisp function with a robust option without giving a design matrix as suggested in edgeR manual. However, I was wondering is there anything specific I need to do for no-replicate dataset when applying  estimateGLMTrendedDisp and estimateGLMTagwiseDisp functions? Should I leave the design matrix out out keep it in ?

I was also wondering if this is the best approach for time series analysis with no replicate ?

In EdgeR manual also mentioned that a reduced model matrix could be used in conjuction with robust  estimateGLMCommonDisp. What reduced design model matrix would be appropriate in this situation ?

Here is my code so far:

design <- model.matrix(~sampleTreatment + sampleTime + sampleTreatment:sampleTime,
data = targets.skno.sc)

edger.counts <- counts(dds.skno.sc.noBaseline)

edger.counts.rowsum <- apply(edger.counts, 1 , sum)
edger.counts[edger.counts.rowsum <= 10,]
edger.counts.filt <- edger.counts[edger.counts.rowsum > 10,]

targets.skno.sc <- colData(dds.skno.sc.noBaseline)
group.skno.sc <- factor(paste(targets.skno.sc$sampleTreatment, targets.skno.sc$sampleTime,sep="."))

dger.filt <- DGEList(counts=skno.sc.edger.counts.filt,   group=group.skno.sc)

sc.edger.filt.uqn <- calcNormFactors(sc.edger.filt, method="upperquartile")

sc.edger.filt.uqn.robust <- estimateGLMCommonDisp(sc.edger.filt.uqn, method="deviance", robust=TRUE , subset=NULL,verbose = TRUE )

sc.edger.filt.uqn.robust <- estimateGLMTrendedDisp(sc.edger.filt.uqn.robust)

sc.edger.filt.uqn.robust <- estimateGLMTagwiseDisp(sc.edger.filt.uqn.robust)

sc.filt.uqn.robust.fit.8 <- glmLRT(sc.filt.uqn.robust.fit)
sc.filt.uqn.robust.fit.8.res <-  topTags(sc.filt.uqn.robust.fit.8, n=nrow(sc.filt.uqn.robust.fit.8))
 > sessionInfo() R version 3.2.1 (2015-06-18) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 14.04.3 LTS locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8 [6] LC_MESSAGES=en_GB.UTF-8 LC_PAPER=en_GB.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] lattice_0.20-33 genefilter_1.50.0 RColorBrewer_1.1-2 DESeq2_1.8.1 RcppArmadillo_0.5.500.2.0 [6] Rcpp_0.12.1 GenomicRanges_1.20.6 GenomeInfoDb_1.4.2 IRanges_2.2.7 S4Vectors_0.6.5 [11] BiocGenerics_0.14.0 edgeR_3.10.2 limma_3.24.15 loaded via a namespace (and not attached): [1] futile.logger_1.4.1 plyr_1.8.3 XVector_0.8.0 futile.options_1.0.0 tools_3.2.1 rpart_4.1-10 [7] digest_0.6.8 RSQLite_1.0.0 annotate_1.46.1 gtable_0.1.2 DBI_0.3.1 proto_0.3-10 [13] gridExtra_2.0.0 stringr_1.0.0 cluster_2.0.3 locfit_1.5-9.1 nnet_7.3-11 grid_3.2.1 [19] Biobase_2.28.0 AnnotationDbi_1.30.1 XML_3.98-1.3 survival_2.38-3 BiocParallel_1.2.21 foreign_0.8-66 [25] latticeExtra_0.6-26 Formula_1.2-1 geneplotter_1.46.0 ggplot2_1.0.1 reshape2_1.4.1 lambda.r_1.1.7 [31] magrittr_1.5 scales_0.3.0 Hmisc_3.16-0 MASS_7.3-44 splines_3.2.1 xtable_1.7-4 [37] colorspace_1.2-6 stringi_0.5-5 acepack_1.3-3.3 munsell_0.4.2  

Thank you

saranj



edger • 2.4k views
0
Entering edit mode

What are the details of your samples, e.g., how many timepoints do you have for each of your conditions?

5
Entering edit mode
Aaron Lun ★ 27k
@alun
Last seen 10 hours ago
The city by the bay

For time-series data, I would fit a spline with 2 degrees of freedom (d.f.) to the time points for each condition:

library(splines)
treatment <- rep(c(TRUE, FALSE), 4)
time <- rep(1:4, each=2)
X <- ns(time, df=2)
design <- model.matrix(~0 + treatment + treatment:X)

This allows for non-linear responses in gene expression with respect to time. It also allows estimation of the dispersion, as there are more samples than columns in this design matrix (i.e., residual d.f. is non-zero). Note that we can only use 2 d.f. in the spline; any higher and we would no longer have any residual d.f.

For your DE comparisons, the use of a spline means that it doesn't really make sense to test for DE at a single time point. This is because all time points become interdependent when they are used to fit a common spline. Instead, you want to look for differences in the spline fits between treated and untreated samples. I'll rename the column names to get rid of the colons, for convenience:

colnames(design) <- sub(":", ".", colnames(design))

If we look at these coefficient names, we get:

[1] "treatmentFALSE"    "treatmentTRUE"     "treatmentFALSE.X1"
[4] "treatmentTRUE.X1"  "treatmentFALSE.X2" "treatmentTRUE.X2"

The first, third and fifth coefficients represent the spline fit for the untreated samples, while the second, fourth and sixth coefficients represent the fit for the treated samples. Specifically, the first and second coefficients represent the intercept for their respective conditions. To test for any differences between the splines for each condition, you can do:

con <- makeContrasts(treatmentFALSE - treatmentTRUE,
treatmentFALSE.X1 - treatmentTRUE.X1,
treatmentFALSE.X2 - treatmentTRUE.X2, levels=design)

This is equivalent to the null hypothesis stating that the splines are equivalent between conditions. To test for differences in the shapes of the splines (i.e., ignoring any difference in the magnitude of expression), you can do:

con <- makeContrasts(treatmentFALSE.X1 - treatmentTRUE.X1,
treatmentFALSE.X2 - treatmentTRUE.X2, levels=design)

We don't involve the intercept coefficients for each spline, which means that the location of the splines is free to vary under the null hypothesis. Note that the log-fold changes of the spline coefficients aren't really interpretable, so just focus on the p-value instead.

0
Entering edit mode
sirintra ▴ 10
@sirintra-8831
Last seen 4.6 years ago
United Kingdom

I have in total of 8 samples for 4 timepoints. So two samples (one control and one treatment) for each time point. Let me know if you need more information. Thanks a lot.

0
Entering edit mode
sirintra ▴ 10
@sirintra-8831
Last seen 4.6 years ago
United Kingdom

Dear Aaron,

Thank you very much for the example code and excellent explanation. I have followed your suggestion and got a lot more genes with significant test results than my previous approach. There are a few more things regarding the results that I’d like to clarify and would appreciate your thoughts.

Regarding test results from the first contrast (also shown in the code below), am I right in thinking that this would yield a more significant result (lower p-value) if the two conditions have a much higher magnitude of difference at a later time point compared with the earlier time point?

And for genes where the null hypothesis cannot be rejected, it means that the difference between the conditions are quite stable or have a very small level of difference throughout the different time points ?

We are also interested in the magnitude (as well as the direction) of fold changes between treatment and control, particularly at the later or the last time point. Is there any way we can estimate the fold change magnitude?

If it still makes sense to still interpret the directions (+/-) of the log-fold changes of the spline coefficients, if not the magnitude ?

Thanks a lot for your help.

treatment <- rep(c("noDox", "Dox"), 4)
time <- rep(c(3,17,28,39), each=2)
X <- ns(time, df=2)
design <- model.matrix(~0 + treatment + treatment:X)
colnames(design) <- sub(":", ".", colnames(design))
colnames(design)
#[1] "treatmentDox"      "treatmentnoDox"    "treatmentDox.X1"   "treatmentnoDox.X1"
#[5] "treatmentDox.X2"   "treatmentnoDox.X2"

sc.toSpline <- DGEList(counts=sc.edger.counts.filt,
group=group)
sc.toSpline <- calcNormFactors(sc.toSpline)

sc.toSpline.voomed <- voom(sc.toSpline,
design,
plot=TRUE)

spline.fit <- lmFit(sc.toSpline.voomed, design)

#Test for any differences between the splines for each condition
con <- makeContrasts(treatmentDox - treatmentnoDox,
treatmentDox.X1 - treatmentnoDox.X1,
treatmentDox.X2 - treatmentnoDox.X2, levels=design)
spline.fit2 <- contrasts.fit(spline.fit, contrasts = con)
spline.fit2 <- eBayes(spline.fit2)
tt.spline.fit2 <- topTable(spline.fit2, number = nrow(spline.fit2))

treatmentDox...treatmentnoDox treatmentDox.X1...treatmentnoDox.X1
gene1                    0.05789551                          -4.0547351
gene2                   -0.20675258                          -2.0248539
treatmentDox.X2...treatmentnoDox.X2   AveExpr        F      P.Value    adj.P.Val
gene1                   -1.2876855 11.767900 92.80905 9.364534e-08 3.586617e-05
gene2                   -0.8202690 10.879899 54.52472 1.265582e-06 2.423590e-04
0
Entering edit mode

1. The relationship between significance and DE at each time point is difficult to predict, as it depends on how the DE affects the spline fit. Strong DE at early time points is just as likely to result in low p-values as DE at later time points, as it will provide strong evidence against the null hypothesis (see below). If most of your DE genes are those with differences at later time points, I suspect that's due to the biology of the system, i.e., it takes some time for the treatment effect to manifest.
2. If the null hypothesis cannot be rejected for this first contrast, it suggests that the spline fits are similar between treatment and control. In other words, fitting two separate splines for each condition doesn't do much better than fitting one spline for all samples. There is no evidence for DE at any time point. For the second contrast, if the null hypothesis cannot be rejected, that indicates that the shape of the spline is not significantly difference between conditions, i.e., DE can exist at each time point, but any difference between conditions is stable across time points.
3. Not of the spline coefficients. If you want log-fold changes, you'll have to compute them manually. The simplest way would be to run cpm(sc.toSpline, log=TRUE, prior=3) and subtract the control value from the treatment value at each time point for each gene. This would give you the shrunken log-fold change.
4. Not really. The actual log-fold change at each time point depends on the value of the covariate (i.e., the actual time), and how that interacts with all of the spline coefficients. Trying to interpret each spline coefficient in isolation would be meaningless.
0
Entering edit mode
owuorerick • 0
@owuorerick-22137
Last seen 23 months ago

Hi Saranja, I have almost the same data as yours. I am working with 6 samples and 3 time-points in four genotypes. My question is: how did you do voom without replicates without the following?

Error in approxfun(l, rule = 2) : need at least two non-NA values to interpolate

I am importing raw htseq-counts from STAR-Aligner mapping to edgeR.

Thanks, Mikwa.

0
Entering edit mode
owuorerick • 0
@owuorerick-22137
Last seen 23 months ago

Hi Saranja, I have almost the same data as yours. I am working with 6 samples and 3 time-points in four genotypes. My question is: how did you do voom without replicates without the following?

Error in approxfun(l, rule = 2) : need at least two non-NA values to interpolate

I am importing raw htseq-counts from STAR-Aligner mapping to edgeR.

Thanks, Mikwa.