Question

edgeR time series analysis with no replicates

1

Entering edit mode

sirintra ▴ 10

@sirintra-8831

Last seen 7.2 years ago

United Kingdom

Hi,

I have RNAseq data from an shRNA knockdown screening experiment and I 'm trying to perform differential test using EdgeR. What we are interesting in is the difference between control and treatment (Dox) at the last time point compared to the first time point. So my design matrix is as follow and I applied estimateGLMCommonDisp function with a robust option without giving a design matrix as suggested in edgeR manual. However, I was wondering is there anything specific I need to do for no-replicate dataset when applying estimateGLMTrendedDisp and estimateGLMTagwiseDisp functions? Should I leave the design matrix out out keep it in ?

I was also wondering if this is the best approach for time series analysis with no replicate ?

In EdgeR manual also mentioned that a reduced model matrix could be used in conjuction with robust estimateGLMCommonDisp. What reduced design model matrix would be appropriate in this situation ?

Here is my code so far:

design <- model.matrix(~sampleTreatment + sampleTime + sampleTreatment:sampleTime,
                       data = targets.skno.sc)

edger.counts <- counts(dds.skno.sc.noBaseline)

edger.counts.rowsum <- apply(edger.counts, 1 , sum)
edger.counts[edger.counts.rowsum <= 10,]
edger.counts.filt <- edger.counts[edger.counts.rowsum > 10,]


targets.skno.sc <- colData(dds.skno.sc.noBaseline)
group.skno.sc <- factor(paste(targets.skno.sc$sampleTreatment, targets.skno.sc$sampleTime,sep="."))


dger.filt <- DGEList(counts=skno.sc.edger.counts.filt,   group=group.skno.sc)

sc.edger.filt.uqn <- calcNormFactors(sc.edger.filt, method="upperquartile")

sc.edger.filt.uqn.robust <- estimateGLMCommonDisp(sc.edger.filt.uqn, method="deviance", robust=TRUE , subset=NULL,verbose = TRUE )

sc.edger.filt.uqn.robust <- estimateGLMTrendedDisp(sc.edger.filt.uqn.robust) 

sc.edger.filt.uqn.robust <- estimateGLMTagwiseDisp(sc.edger.filt.uqn.robust) 

sc.filt.uqn.robust.fit.8 <- glmLRT(sc.filt.uqn.robust.fit)
sc.filt.uqn.robust.fit.8.res <-  topTags(sc.filt.uqn.robust.fit.8, n=nrow(sc.filt.uqn.robust.fit.8))

> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8   
 [6] LC_MESSAGES=en_GB.UTF-8    LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lattice_0.20-33           genefilter_1.50.0         RColorBrewer_1.1-2        DESeq2_1.8.1              RcppArmadillo_0.5.500.2.0
 [6] Rcpp_0.12.1               GenomicRanges_1.20.6      GenomeInfoDb_1.4.2        IRanges_2.2.7             S4Vectors_0.6.5          
[11] BiocGenerics_0.14.0       edgeR_3.10.2              limma_3.24.15            

loaded via a namespace (and not attached):
 [1] futile.logger_1.4.1  plyr_1.8.3           XVector_0.8.0        futile.options_1.0.0 tools_3.2.1          rpart_4.1-10        
 [7] digest_0.6.8         RSQLite_1.0.0        annotate_1.46.1      gtable_0.1.2         DBI_0.3.1            proto_0.3-10        
[13] gridExtra_2.0.0      stringr_1.0.0        cluster_2.0.3        locfit_1.5-9.1       nnet_7.3-11          grid_3.2.1          
[19] Biobase_2.28.0       AnnotationDbi_1.30.1 XML_3.98-1.3         survival_2.38-3      BiocParallel_1.2.21  foreign_0.8-66      
[25] latticeExtra_0.6-26  Formula_1.2-1        geneplotter_1.46.0   ggplot2_1.0.1        reshape2_1.4.1       lambda.r_1.1.7      
[31] magrittr_1.5         scales_0.3.0         Hmisc_3.16-0         MASS_7.3-44          splines_3.2.1        xtable_1.7-4        
[37] colorspace_1.2-6     stringi_0.5-5        acepack_1.3-3.3      munsell_0.4.2

        </td>
    </tr>
    <tr>
        <td>&nbsp;</td>
    </tr>
    <tr>
        <td>
        <p>Thank you</p>

        <p>saranj</p>
        </td>
    </tr>
</tbody>

edger • 3.1k views

ADD COMMENT • link updated 4.5 years ago by owuorerick • 0 • written 8.6 years ago by sirintra ▴ 10

0

Entering edit mode

What are the details of your samples, e.g., how many timepoints do you have for each of your conditions?

ADD REPLY • link 8.6 years ago Aaron Lun ★ 28k

Steve Lianoglou · Answer 1 · 2015-09-17

For time-series data, I would fit a spline with 2 degrees of freedom (d.f.) to the time points for each condition:

library(splines)
treatment <- rep(c(TRUE, FALSE), 4)
time <- rep(1:4, each=2)
X <- ns(time, df=2)
design <- model.matrix(~0 + treatment + treatment:X)

This allows for non-linear responses in gene expression with respect to time. It also allows estimation of the dispersion, as there are more samples than columns in this design matrix (i.e., residual d.f. is non-zero). Note that we can only use 2 d.f. in the spline; any higher and we would no longer have any residual d.f.

For your DE comparisons, the use of a spline means that it doesn't really make sense to test for DE at a single time point. This is because all time points become interdependent when they are used to fit a common spline. Instead, you want to look for differences in the spline fits between treated and untreated samples. I'll rename the column names to get rid of the colons, for convenience:

colnames(design) <- sub(":", ".", colnames(design))

If we look at these coefficient names, we get:

[1] "treatmentFALSE"    "treatmentTRUE"     "treatmentFALSE.X1"
[4] "treatmentTRUE.X1"  "treatmentFALSE.X2" "treatmentTRUE.X2"

The first, third and fifth coefficients represent the spline fit for the untreated samples, while the second, fourth and sixth coefficients represent the fit for the treated samples. Specifically, the first and second coefficients represent the intercept for their respective conditions. To test for any differences between the splines for each condition, you can do:

con <- makeContrasts(treatmentFALSE - treatmentTRUE,
    treatmentFALSE.X1 - treatmentTRUE.X1,
    treatmentFALSE.X2 - treatmentTRUE.X2, levels=design)

This is equivalent to the null hypothesis stating that the splines are equivalent between conditions. To test for differences in the shapes of the splines (i.e., ignoring any difference in the magnitude of expression), you can do:

con <- makeContrasts(treatmentFALSE.X1 - treatmentTRUE.X1,
    treatmentFALSE.X2 - treatmentTRUE.X2, levels=design)

We don't involve the intercept coefficients for each spline, which means that the location of the splines is free to vary under the null hypothesis. Note that the log-fold changes of the spline coefficients aren't really interpretable, so just focus on the p-value instead.

score 0 · Answer 2 · 2015-09-17

0

Entering edit mode

sirintra ▴ 10

@sirintra-8831

Last seen 7.2 years ago

United Kingdom

I have in total of 8 samples for 4 timepoints. So two samples (one control and one treatment) for each time point. Let me know if you need more information. Thanks a lot.

ADD COMMENT • link 8.6 years ago sirintra ▴ 10

score 0 · Answer 3 · 2015-09-17

Dear Aaron,

Thank you very much for the example code and excellent explanation. I have followed your suggestion and got a lot more genes with significant test results than my previous approach. There are a few more things regarding the results that I’d like to clarify and would appreciate your thoughts.

Regarding test results from the first contrast (also shown in the code below), am I right in thinking that this would yield a more significant result (lower p-value) if the two conditions have a much higher magnitude of difference at a later time point compared with the earlier time point?

And for genes where the null hypothesis cannot be rejected, it means that the difference between the conditions are quite stable or have a very small level of difference throughout the different time points ?

We are also interested in the magnitude (as well as the direction) of fold changes between treatment and control, particularly at the later or the last time point. Is there any way we can estimate the fold change magnitude?

If it still makes sense to still interpret the directions (+/-) of the log-fold changes of the spline coefficients, if not the magnitude ?

Thanks a lot for your help.

treatment <- rep(c("noDox", "Dox"), 4)
time <- rep(c(3,17,28,39), each=2)
X <- ns(time, df=2)
design <- model.matrix(~0 + treatment + treatment:X)
colnames(design) <- sub(":", ".", colnames(design))
colnames(design)
#[1] "treatmentDox"      "treatmentnoDox"    "treatmentDox.X1"   "treatmentnoDox.X1"
#[5] "treatmentDox.X2"   "treatmentnoDox.X2"

sc.toSpline <- DGEList(counts=sc.edger.counts.filt, 
                                       group=group)
sc.toSpline <- calcNormFactors(sc.toSpline)

sc.toSpline.voomed <- voom(sc.toSpline,
                                           design,
                                           plot=TRUE)

spline.fit <- lmFit(sc.toSpline.voomed, design)

#Test for any differences between the splines for each condition
con <- makeContrasts(treatmentDox - treatmentnoDox,
                     treatmentDox.X1 - treatmentnoDox.X1,
                     treatmentDox.X2 - treatmentnoDox.X2, levels=design)
spline.fit2 <- contrasts.fit(spline.fit, contrasts = con)
spline.fit2 <- eBayes(spline.fit2)
tt.spline.fit2 <- topTable(spline.fit2, number = nrow(spline.fit2))

             treatmentDox...treatmentnoDox treatmentDox.X1...treatmentnoDox.X1
gene1                    0.05789551                          -4.0547351
gene2                   -0.20675258                          -2.0248539
             treatmentDox.X2...treatmentnoDox.X2   AveExpr        F      P.Value    adj.P.Val
gene1                   -1.2876855 11.767900 92.80905 9.364534e-08 3.586617e-05
gene2                   -0.8202690 10.879899 54.52472 1.265582e-06 2.423590e-04

score 0 · Answer 4 · 2019-10-15

Hi Saranja, I have almost the same data as yours. I am working with 6 samples and 3 time-points in four genotypes. My question is: how did you do voom without replicates without the following?

Error in approxfun(l, rule = 2) : need at least two non-NA values to interpolate

I am importing raw htseq-counts from STAR-Aligner mapping to edgeR.

Thanks, Mikwa.

score 0 · Answer 5 · 2019-10-15

Hi Saranja, I have almost the same data as yours. I am working with 6 samples and 3 time-points in four genotypes. My question is: how did you do voom without replicates without the following?

Error in approxfun(l, rule = 2) : need at least two non-NA values to interpolate

I am importing raw htseq-counts from STAR-Aligner mapping to edgeR.

Thanks, Mikwa.