Question: How to account for population variation within regions in DESeq2?
0
2.9 years ago by
swaegers.janne0 wrote:

Hi everyone,

I want to analyse expression data from samples collected from different populations at different latitudes (3 populations at both latitudes) which were subjected to a treatment.

I want to know the effect of latitude (lat), treatment (treat) and their interaction on the expression data.

My data looks like this:

samp. pop  lat  treat
1     1    N    1
2     1    N    1
3     1    N    2
4     1    N    2
5     2    N    1
6     2    N    1
7     2    N    2
8     2    N    2
9     3    N    1
10    3    N    1
11    3    N    2
12    3    N    2
13    4    S    1
14    4    S    1
15    4    S    2
16    4    S    2
17    5    S    1
18    5    S    1
19    5    S    2
20    5    S    2
21    6    S    1
22    6    S    1
23    6    S    2
24    6    S    2

When I want to fit a model using this design

design <- model.matrix(~lat+pop+treat+lat:treat)

the matrix is not of full rank because pop is nested in lat.

This can be done in EdgeR by manually change the design matrix (see the full response to my earlier question How to account for population variation within regions in EdgeR? - thank you Aaron)

pop <- factor(rep(1:6, each=4))
lat <- factor(rep(c("N", "S"), each=12))
treatment <- factor(rep(rep(1:2, each=2), 6))
design <- model.matrix(~0 + pop + lat:treatment)
design <- design[,-grep("treatment1", colnames(design))]
colnames(design) <- sub(":", "_", colnames(design))

[1] "pop1"            "pop2"            "pop3"            "pop4"
[5] "pop5"            "pop6"            "latN_treatment2" "latS_treatment2"

Yet, I actually have the same question regarding DESeq2. How can I implement the above method to account for populations into the design formula of DESeq2? I can not find a way to edit the design matrix as you cannot give a matrix as input in DESeq2. At least, I can't find a straightforward way to do that.

Any help is greatly appreciated.

Best,

Janne

modified 2.9 years ago • written 2.9 years ago by swaegers.janne0

I've set up a new deseq2 file with pop.n being the new variable.

ddsFullCountTable <- DESeqDataSetFromMatrix(
countData = countdata,
colData = names,
design = ~0 +lat + lat:pop.n + lat:treat)

I was able to detect genes that were differentiated for the contrast of latN.treat24 and latS.treat 24 (so the difference of the effect of treatment between N and S) by using

results<- results(dds , contrast=list("latN.treat24","latS.treat24"))

Within this model, is there also way to make a contrast to compare latN.treat20 and latN.treat24? So compare the treatment within the N samples?

Janne

(I moved this post from an 'Answer' to a 'Comment'.)

Can you post the resultsNames(dds) ? I just would like to make sure which terms are present.

Here it is:

> resultsNames(dds)
[1] "latN"         "latS"         "latN.pop.n2"  "latS.pop.n2"  "latN.pop.n3"  "latS.pop.n3"  "latN.treat24" "latS.treat24"

thanks for the help!

Great, so the important thing to see is that

results(dds, name="latN.treat24")

is the 24 vs 20 effect for lat=N (and likewise for latS.treat24). The reason is that 20 is the reference level for treatment. You don't need to do a contrast, only pull out this term by using this line of code.

A late thanks for the reply. I'm just wondering now, can I also compare latN.treat20 to latS.treat20 using this design? So compare the gene expression within the 20 treatment between the N and S latitude? How would you go about to test this? Thanks again in advance.

No, because this comparison is confounded with the different populations. You can't compare N vs S for treat 20 or treat 24. You can only make treatment comparisons within latitude, or you can make the comparison of how the treatment effects differ across latitude. Remember this comparison:

results(dds , contrast=list("latN.treat24","latS.treat24"))

Is for whether the treat 24 vs 20 effect is different across latitude. It is only possible because it compares the difference of the within-latitude differences.

Hi again,

I have a follow-up question. When I remove an outlier sample which is in the latN.treatment24 group, this also influences which and how many DEG I find for the effect of treatment within the latS samples (so for name="latS.treat24"). How is this possible? I also noticed the removal of the outlier changes the internal filtering:

without outlier removed: low counts [2]   : 138365, 78%

with outlier removed: low counts [2]   : 6907, 3.9%

How can it change so much when removing an outlier which is not even involved in the comparison of the counts?

Yes, here it is:

> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.5 (Mavericks)

locale:
[1] nl_BE.UTF-8/nl_BE.UTF-8/nl_BE.UTF-8/C/nl_BE.UTF-8/nl_BE.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] DESeq2_1.12.4              SummarizedExperiment_1.2.3 Biobase_2.32.0
[4] GenomicRanges_1.24.3       GenomeInfoDb_1.8.7         IRanges_2.6.1

loaded via a namespace (and not attached):
[1] nlme_3.1-128         bitops_1.0-6         pbkrtest_0.4-6       gmodels_2.16.2
[5] RColorBrewer_1.1-2   tools_3.3.0          R6_2.2.0             vegan_2.4-1
[9] rpart_4.1-10         Hmisc_3.17-4         DBI_0.5-1            mgcv_1.8-15
[13] colorspace_1.2-6     permute_0.9-4        nnet_7.3-12          sp_1.2-3
[17] gridExtra_2.2.1      chron_2.3-47         sandwich_2.3-4       scales_0.4.0
[21] mvtnorm_1.0-5        genefilter_1.54.2    stringr_1.1.0        digest_0.6.10
[25] foreign_0.8-67       minqa_1.2.4          XVector_0.12.1       htmltools_0.3.5
[29] lme4_1.1-12          RSQLite_1.0.0        shiny_0.14.1         zoo_1.7-13
[33] lsmeans_2.23-5       BiocParallel_1.6.6   gtools_3.5.0         acepack_1.3-3.3
[37] spdep_0.6-8          dplyr_0.5.0          RCurl_1.95-4.8       magrittr_1.5
[41] Formula_1.2-1        Matrix_1.2-7.1       Rcpp_0.12.7          munsell_0.4.3
[45] ape_3.5              stringi_1.1.2        multcomp_1.4-6       MASS_7.3-45
[49] zlibbioc_1.18.0      plyr_1.8.4           grid_3.3.0           gdata_2.17.0
[53] deldir_0.1-12        lattice_0.20-34      splines_3.3.0        annotate_1.50.1
[57] locfit_1.5-9.1       igraph_1.0.1         boot_1.3-18          seqinr_3.3-2
[61] estimability_1.1-1   geneplotter_1.50.0   reshape2_1.4.1       codetools_0.2-15
[65] LearnBayes_2.15      XML_3.98-1.4         latticeExtra_0.6-28  data.table_1.9.6
[69] nloptr_1.0.4         httpuv_1.3.3         gtable_0.2.0         assertthat_0.1
[73] ggplot2_2.1.0        mime_0.5             xtable_1.8-2         coda_0.18-1
[77] survival_2.39-5      tibble_1.2           AnnotationDbi_1.34.4 cluster_2.0.5
[81] TH.data_1.0-7

Thanks for posting the sessionInfo. You're up to date. Is it possible that you changed versions of Bioconductor between the previous results and now? Or can you still generate the 78% low counts removed with current version?

Re your question: "...which and how many DEG I find for the effect of treatment within the latS samples (so for name="latS.treat24"). How is this possible?...How can it change so much when removing an outlier which is not even involved in the comparison of the counts?"

It is possible, because all of the samples are used to calculate the dispersion. If you include a sample which fails QC it will raise the dispersion value, because the model has to accommodate the additional variability.

Thanks for the reply. I did both analyses (with and without outlier) again from scratch within the same session and get the exact same results. Without the latN outlier the comparison of the latS samples results in way less filtering...Could this somehow be related to the trick I used to account for the population structure?

I wouldn't worry too much about the changes in filtering. It's certainly possible to have big changes in inference if you remove a sample which is a big outlier on a PCA plot, e.g. failing QC (quality control). The important thing to focus on is whether or not to remove the sample based on evidence not related to final inference (PCA of samples, and other standard QC measures for RNA-seq, see the MultiQC software for example).

Answer: How to account for population variation within regions in DESeq2?
2
2.9 years ago by
Michael Love24k
United States
Michael Love24k wrote:

When you get the matrix is not full rank error, it should point you to go read the section of the DESeq2 vignette where we explain what to do in this case.

Basically we suggest similar advice as edgeR (we use a different trick to Aaron's, but it was also from the edgeR User Guide).

The section starts "Consider an experiment with grouped individuals, where we seek to test the group-specific effect of a treatment, while controlling for individual effects..."