Question

Custom model matrix for multi factordesign

0

Entering edit mode

carol.brunel • 0

@carolbrunel-17548

Last seen 5.6 years ago

Hi,

I am trying to figure out what is the correct model formula and analysis approach is with DESeq2 for my multifactor design in a context of microbiome sequencing. What I have is 42 samples coming from rhizospheric soils from 7 different plant species (that are considered as invasive or native) and that are infected or not by a plant parasite.

Here is the initial colDataset:

infection host_species host_origin s2-1 2 6 2 s2-2 2 6 2 s2-3 2 6 2 s2-4 1 6 2 s2-5 1 6 2 s2-6 1 6 2 s3-1 2 7 2 [...] s7-5 1 2 1 s7-6 1 2 1 s8-1 2 3 1 s8-2 2 3 1 s8-3 2 3 1 s8-4 1 3 1 s8-5 1 3 1 s8-6 1 3 1

What I would like is to observe the main effect of the treatment and wether the response to infection is modulated by the host_origin of the plant species. I have to consider host_species since I know it acts as a batch factor. Then, I would be interested in the LRT results with the following design:

design <- ~ host_species + host_origin*infection

However, host_species is nested in the host_origin, so I created species.nested column, and since my design is unbalanced I created a model matrix where I removed the column with all zero, as suggested in the vignette.

Here is the modified colDataset:

spec_nested infection host_species host_origin s2-1 1 2 6 2 s2-2 1 2 6 2 s2-3 1 2 6 2 s2-4 1 1 6 2 s2-5 1 1 6 2 s2-6 1 1 6 2 s3-1 2 2 7 2 [...] s6-1 4 2 5 2 s7-5 3 1 2 1 s7-6 3 1 2 1 s8-1 4 2 3 1 s8-2 4 2 3 1 s8-3 4 2 3 1 s8-4 4 1 3 1 s8-5 4 1 3 1 s8-6 4 1 3 1

Here line code for the full model matrix:

fm <- model.matrix(~ spec.nested:host_origin + infection*host_origin ,colData(ps_deseq)) idx <- which(colSums(fm == 0) == nrow(fm)) fm <- fm[,-idx]

The full model matrix:

> unname(fm) [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [1,] 1 1 0 1 0 0 0 0 0 0 [2,] 1 1 0 1 0 0 0 0 0 0 [3,] 1 1 0 1 0 0 0 0 0 0 [4,] 1 0 0 1 0 0 0 0 0 0 [5,] 1 0 0 1 0 0 0 0 0 0 [6,] 1 0 0 1 0 0 0 0 0 0 [7,] 1 1 0 0 1 0 0 0 0 0 [8,] 1 1 0 0 1 0 0 0 0 0 [9,] 1 1 0 0 1 0 0 0 0 0 [...] [32,] 1 1 0 0 0 0 0 0 1 1 [33,] 1 1 0 0 0 0 0 0 1 1 [34,] 1 0 0 0 0 0 0 0 1 0 [35,] 1 0 0 0 0 0 0 0 1 0 [36,] 1 0 0 0 0 0 0 0 1 0 [37,] 1 1 0 0 0 0 1 0 0 1 [38,] 1 1 0 0 0 0 1 0 0 1 [39,] 1 1 0 0 0 0 1 0 0 1 [40,] 1 0 0 0 0 0 1 0 0 0 [41,] 1 0 0 0 0 0 1 0 0 0 [42,] 1 0 0 0 0 0 1 0 0 0

rm1 <- model.matrix(~ spec_origin:host_origin + host_origin:infection ,colData(ps_deseq)) idx <- which(colSums(rm1 == 0) == nrow(rm1)) rm1 <- rm1[,-idx]

The reduced model matrix (I removed the pure effect of infection)

[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]

[1,] 1 0 1 0 0 0 0 0 1 0 [2,] 1 0 1 0 0 0 0 0 1 0 [3,] 1 0 1 0 0 0 0 0 1 0 [4,] 1 0 1 0 0 0 0 0 0 0 [5,] 1 0 1 0 0 0 0 0 0 0 [6,] 1 0 1 0 0 0 0 0 0 0 [7,] 1 0 0 1 0 0 0 0 1 0 [8,] 1 0 0 1 0 0 0 0 1 0 [9,] 1 0 0 1 0 0 0 0 1 0 [...]

[32,] 1 0 0 0 0 0 0 1 0 1

[33,] 1 0 0 0 0 0 0 1 0 1 [34,] 1 0 0 0 0 0 0 1 0 0 [35,] 1 0 0 0 0 0 0 1 0 0 [36,] 1 0 0 0 0 0 0 1 0 0 [37,] 1 0 0 0 0 1 0 0 0 1 [38,] 1 0 0 0 0 1 0 0 0 1 [39,] 1 0 0 0 0 1 0 0 0 1 [40,] 1 0 0 0 0 1 0 0 0 0 [41,] 1 0 0 0 0 1 0 0 0 0 [42,] 1 0 0 0 0 1 0 0 0 0

I understand I made something wrong in the designs of the full model or of the reduced model matrix since the dim of the two matrix is similar but I definitly can't determine what is my mistake.

Then I cannot run the LRT test I would like:

ps_deseq = DESeq(ps_deseq, test="LRT", full=mm, reduced=rm1)

I have two questions:

1) What would be the right syntax for my full and reduced model matrix?

2) My sample are microbiome samples, reason why every gene of my raw data contains at least one zero and I have to use the geometric mean as defined by :

gm_mean = function(x, na.rm=TRUE){ exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x))}

Since I am not using the design as defined in my DESeqDataSet, how can I manage estimateSizeFactors according to my full model.matrix?

I cannot not find a solution for this in the vignette or in other discussions in this forum. I'd greatly appreciate if some help can be provided. Thanks a lot.

Caroline

deseq2 LRT geometric mean estimatesizefactors model.matrix • 1.2k views

ADD COMMENT • link updated 5.6 years ago by Michael Love 41k • written 5.6 years ago by carol.brunel • 0

0

Entering edit mode

> sessionInfo() R version 3.5.0 (2018-04-23) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C [5] LC_TIME=English_United States.1252

attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages: [1] DESeq2_1.20.0 SummarizedExperiment_1.10.1 DelayedArray_0.6.1 [4] BiocParallel_1.14.1 matrixStats_0.53.1 Biobase_2.40.0 [7] GenomicRanges_1.32.3 GenomeInfoDb_1.16.0 IRanges_2.14.10 [10] S4Vectors_0.18.3 BiocGenerics_0.26.0 biomformat_1.8.0 [13] ape_5.1 scales_1.0.0 phyloseq_1.24.0 [16] RColorBrewer_1.1-2 wesanderson_0.3.6 manipulate_1.0.1 [19] stringr_1.3.1 tidyr_0.8.1 ggplot2_3.0.0 [22] fossil_0.3.7 shapefiles_0.7 foreign_0.8-70

ADD REPLY • link 5.6 years ago carol.brunel • 0

0

Entering edit mode

For a reproductible example, please find the dataset and the script here:

https://wetransfer.com/downloads/db45667b6d030720caa485fdfd9b459f20180927090701/983649e13f9e65d3ef3cadbcefc3874720180927090701/1be251

ADD REPLY • link 5.6 years ago carol.brunel • 0

score 0 · Answer 1 · 2018-09-27

0

Entering edit mode

Michael Love 41k

@mikelove

Last seen 11 hours ago

United States

You have 7 species, and for each you have infected or not infected samples.

The way I would approach this is that, each species needs a baseline, and then you also want to fit species-specific infection effects. Because of the per-species baseline, there is no need and it is not possible to control for origin, as species is nested within origin.

In the example in the vignette, individual is a nuisance, and treatment group is the level of interest at which the effects should be compared, but in your case, species is the level of interest.

I would use ~species + species:infected, and then you can test each of the species-specific infection effects with 'name' or compare them with 'contrast'.

With 7 species and no "reference species", there isn't a main effect of infection, you have 7 different infection effects. You could do something like this with a random or mixed effects model, where you say that each species has a random infection effect, but you can't do modeling like this within DESeq2.

ADD COMMENT • link 5.6 years ago Michael Love 41k

0

Entering edit mode

Thank you Michael for your quick reply,

Sorry, I may have been unclear; actually I consider individual (species) as a nuisance and origin is the level of interest. What I would like to observe is both the infection effect and the interaction between the infection and the origin. This is the reason why I followed the vignette instructions.

Would it then be possible to determine the differential expression within the DESeq using the Likelihood Ratio Test with this design? Or is the mixed effects model a more appropriate solution? Actually before focusing on the DESeq function I have tried mixed models but I had some difficulties to fit my models using negative binomial and the glmer function.

ADD REPLY • link 5.6 years ago carol.brunel • 0

0

Entering edit mode

Oops, I see now that you did mention this in your first post and I missed it: "modulated by the host_origin of the plant species".

Ok, so in that case, I'd recommend the following to hopefully get around the issues. Go ahead as you have constructing the full model matrix and removing the extra columns with all zeros. I like to write out the full formula, rather than putting "*" as it makes it clear which terms will be added, and provides a bit more control.

I would recommend to use the following for the full design: ~host_origin + host_origin:spec_nested + host_origin:infection.

This way you get an infection effect for each of the two host origin groups.

You can then extract each effect individually with 'name' or contrast them using a list-style contrast, where the first element of the list is the numerator and the second element of the list is the denominator.

Regarding (2), I'd recommend to use sfType="poscounts" when you run DESeq(), this takes care of the geometric means modification that was suggested from phyloseq. The size factor estimation doesn't depend in any way on the design.