Question

DESeq2 RNAseq analysis with "matrix not full rank" error corrected.

1

Entering edit mode

yifangt ▴ 20

@yifangt-9770

Last seen 5.6 years ago

Canada/NRC

Hello,

I'd like to get some help on my RNAseq data analysis with DESeq2. The experiment is balanced design with multiple factors with biological/technical replicates.

> coldata
          group variety treatment
BeC-1     T      Be   control
BeC-2     T      Be   control
BeT-1     T      Be   treated
BeT-2     T      Be   treated
LoC-1     S      Lo   control
LoC-2     S      Lo   control
LoT-1     S      Lo   treated
LoT-2     S      Lo   treated
NiC-1     T      Ni   control
NiC-2     T      Ni   control
NiT-1     T      Ni   treated
NiT-2     T      Ni   treated
SaC-1     T      Sa   control
SaC-2     T      Sa   control
SaT-1     T      Sa   treated
SaT-2     T      Sa   treated
SoC-1     S      So   control
SoC-2     S      So   control
SoT-1     S      So   treated
SoT-2     S      So   treated
ViC-1     S      Vi   control
ViC-2     S      Vi   control
ViT-1     S      Vi   treated
ViT-2     S      Vi   treated

To eliminate the "model matrix is not full rank" as suggested in the manual, I used:

> coldata$ind.n <- factor(rep(rep(1:2, each=2), 2)
> model.matrix(~ group + group:ind.n + group:variety, coldata)
> dds <- DESeq(ddsHTSeq)

Now I want to test the significance for factors:

1) between groups: T vs S;

2) among varieties,
3) between control vs treated for each variety: i.e. what are the genes differentially expressed for each variety Be, Lo, Ni, Sa, So and Vi.

I have tried 2) as following:

>results(dds, contrast=c("treatment", "treated", "control"))
>results(dds, contrast=c("group", "T", "S"))

But I am not sure for each individual variety, sth like this?

>results(dds, contrast=c("group$T.Be", "treated", "control"))
>results(dds, contrast=c("group$T.Lo", "treated", "control"))

Thanks a lot!

deseq2 multiple factor design • 3.4k views

ADD COMMENT • link 8.4 years ago yifangt ▴ 20

score 0 · Answer 1 · 2016-02-22

0

Entering edit mode

Michael Love 42k

@mikelove

Last seen 8 hours ago

United States

Can you explain what are the groups S and T biologically? Even vague descriptions would help.

And what are the varieties? Are these like species?

ADD COMMENT • link 8.4 years ago Michael Love 42k

score 0 · Answer 2 · 2016-02-22

0

Entering edit mode

yifangt ▴ 20

@yifangt-9770

Last seen 5.6 years ago

Canada/NRC

Thanks Michael!
My materials are plant/crops. S is susceptible vs T tolerant to stress. These are the 2 groups in trait/phenotype. Varieties can be called "genotypes" here, and they are different lines of the same plant/crop.

1) I am not sure my way to fix the "model matrix is not full rank" error is correct or not.

2) What is the syntax for each variety to test the significance between control and treated no matter it is in T or S group (i.e. after subtraction of the difference between S and T). I was comparing this with the dissection of variance in ANOVA analysis with multiple factors design, which may not be appropriate.

Thanks again.

ADD COMMENT • link 8.4 years ago yifangt ▴ 20

1

Entering edit mode

I'm going to 'Add Comment' here to begin a threaded discussion.

So there are some conceptual issues with testing T vs S, because you also have the different varieties, and want to make inference on these as well. It makes it not so clear what exactly you want to test. The varieties are not random samples within the groups. These are a specifically chosen set of varieties.

When you say you want to test T vs S, or across varieties, do you mean test the control samples? Or do you mean test the treatment vs control effect across T and S, or across varieties? In other words testing if the fold change is different across group or variety?

What I might recommend is an approach where you can test treatment vs control within a variety, and you can also test where treatment vs control differs between two varieties that you choose.

But I need a bit more precise information to help direct you.

ADD REPLY • link 8.4 years ago Michael Love 42k

score 0 · Answer 3 · 2016-02-22

0

Entering edit mode

yifangt ▴ 20

@yifangt-9770

Last seen 5.6 years ago

Canada/NRC

Thanks!

Several points are in my mind:

1) Test the treatment vs control effect across T and S;

2) Test the treatment vs control effect within T and S, respectively;

3) Test the treatment vs control effect across all varieties (this should NOT be surprising if any significance turns out);

4) Test the treatment vs control effect within each variety, so that all the varieties can be aligned side-by-side, to see the bias of group selection.

What I did first is like you suggested "test treatment vs control within a variety". My concern is the normalization with DESeq, as the total reads count each sample may be different, but within each sample they may be similar, or vice versa. Putting all the sample in one set is for normalization, at least it is what I expect.

The major purpose is to test the difference between treatment and control for each variety across the two groups (T & S). "...... there are some conceptual issues with testing T vs S, because you also have the different varieties, and want to make inference on these as well. It makes it not so clear what exactly you want to test." ------the reason for different varieties is to have biological replicates. It would not be surprising to see many genes are differentially expressed between treatment vs control no matter it is in T group or in S group. One big question behind: Is the difference due to the group (T) other than the treatment? This selection of group explains why "The varieties are not exactly random samples within the groups. These are a specifically chosen set of varieties."

So, a second step is to test the difference between treatment and control in each group (T, S separately). I was thinking integrating the group as the first factor may eliminate the bias of group selection.

Thank you for this discussion, and I appreciate any clarification for what I may have misunderstood.

ADD COMMENT • link 8.4 years ago yifangt ▴ 20

2

Entering edit mode

again, I'm adding a comment/reply, to create a threaded discussion.

To me it doesn't make sense to compare T and S here. An analogy is: I want to compare the weights of mammals and ~~fish~~ aquatic creatures. So I choose: dogs, mice, whales and sharks, and collect weights of individual animals within each species. It wouldn't make sense to make a comparison and say, therefore ~~fish~~ aquatic creatures weigh more than mammals.

I would recommend you look at variety specific treatment effects, like so:

Use DESeq2 version >= 1.10:

packageVersion("DESeq2")

And use a design of: ~variety + variety:treatment

You can test, e.g. treatment vs control in Be like so:

results(dds, name="varietyBe.treatmenttreated")

You can compare groups, to test if the effect is different in Lo vs Be, like so:

results(dds, contrast=list("varietyLo.treatmenttreated","varietyBe.treatmenttreated"))

I'd say if you have further questions, you may want to speak to someone local at your institute with a quantitative background. They may have further suggestions for you. Note that you can compare the average of a set of treatment effects by adding more terms and using the listValues argument of results:

results(dds, contrast = list( 
  c("varietyNi.treatmenttreated", "varietyBe.treatmenttreated", "varietySa.treatmenttreated")
  c("varietySo.treatmenttreated", "varietyLo.treatmenttreated", "varietyVi.treatmenttreated")
) , listValues=c(1/3, -1/3))

ADD REPLY • link 8.4 years ago Michael Love 42k

0

Entering edit mode

Sorry, can't resist...

An analogy is: I want to compare the weights of mammals and fish. So I choose: dogs, mice, whales and sharks, and collect weights of individual animals within each species. It wouldn't make sense to make a comparison and say, therefore fish weigh more than mammals.

It would not make sense for many reasons, one of them being that neither whales nor sharks are fish, in fact the whales are also mammals :)

ADD REPLY • link 8.4 years ago Peter Langfelder ★ 3.0k

0

Entering edit mode

Oh no, you got me! Typed much too fast. Let's say, "aquatic creatures". I also recently called coral "plants"...

ADD REPLY • link 8.4 years ago Michael Love 42k

0

Entering edit mode

Sharks aren't fish? Better tell Wikipedia! https://en.wikipedia.org/wiki/Shark :)

ADD REPLY • link 8.4 years ago Gordon Smyth 51k

0

Entering edit mode

I stand corrected. Where I went to school, sharks were not considered "proper" fish, but the meaning of the word was different than in English. Sorry!

ADD REPLY • link 8.4 years ago Peter Langfelder ★ 3.0k

0

Entering edit mode

First of all, thanks for the scripts that are what I was looking for!

I did not anticipate the discussion on the selection of T and S here, which seems another big topic on experiment design. Selection of T and S is basically the idea of "Biological replicates" which is adopted from microarray/QRT-PCR analysis for better experiment control ( Udvardi, M.,etc (2008). Eleven golden rules of quantitative RT-PCR. Plant Cell 20: 1736–1737). Biological replicates are not randomly chosen, or deliberately chosen as diversely spread ones, but the opposite. Among T/S group, the 3 lines may share genetic background, or very close descendants from selection. Definitely they are not like mice and dog within mammal, or whale & shark within aquatic creature. Also I like this statement from UCDavis group on RNAseq_experiment_design

"In the case of RNAseq analyses, we recommend favoring biological replicates over technical replicates. Without biological replicates, one cannot make the inference (at least, solely based on RNAseq results) that any amount of observed differential expression is due to the treatment. This is because expression differences could exist between individuals independent of the treatment effect. Furthermore, one cannot infer that any putative treatment effect generalizes to the larger population; other individuals may respond differently to the same treatment."

The comparison on T vs S is not the main purpose, but for "error/bias control". The idea shares with ANOVA on randomized complete block design, for example, to exclude the "block/replicate" effect from the overall variation, but the appropriateness may not be accurate here.

ADD REPLY • link 8.4 years ago yifangt ▴ 20

0

Entering edit mode

So I am doing the analysis separately for each variety to compare the treated vs control:

> coldata.Be <- data.frame(row.names = c("BeC-1","BeC-2","BeT-1","BeT-2"), treatment = rep(c("control","treated"),each = 2))
> coldata.Be
      treatment
BeC-1   control
BeC-2   control
BeT-1   treated
BeT-2   treated
> ddsHTSeq.Be <- DESeqDataSetFromMatrix(countData = countMatrix[, 1:4], colData = coldata.Be, design = ~treatment)
> dds.Be <- DESeqddsHTSeq.Be)

Is this valid? Thanks!

ADD REPLY • link 8.4 years ago yifangt ▴ 20

0

Entering edit mode

This is one approach, but this is not what I would recommend (the code I already posted being my recommendation).

ADD REPLY • link 8.4 years ago Michael Love 42k