How to account for population variation within regions in EdgeR?
3
0
Entering edit mode
@swaegersjanne-11229
Last seen 5.6 years ago

Hi everyone,

I am trying to fit a multi factor comparison using EdgeR

I want to analyse expression data from samples collected from different populations at different latitudes (3 populations at both latitudes) which were subjected to a treatment.

I want to know the effect of latitude (lat), treatment (treat) and their interaction on the expression data.

My data looks like this:

sample    pop    lat    treat
1    1    N    1
2    1    N    1
3    1    N    2
4    1    N    2
5    2    N    1
6    2    N    1
7    2    N    2
8    2    N    2
9    3    N    1
10    3    N    1
11    3    N    2
12    3    N    2
13    4    S    1
14    4    S    1
15    4    S    2
16    4    S    2
17    5    S    1
18    5    S    1
19    5    S    2
20    5    S    2
21    6    S    1
22    6    S    1
23    6    S    2
24    6    S    2

When I want to fit a model using this design

design <- model.matrix(~lat+pop+treat+lat:treat)

the matrix is not of full rank because pop is nested in lat.

I have read in another post (DESeq2 paired multifactor test) and in the Edge R user guide (section 3.5) that this can be solved by creating another population variable (pop.nested) like this:

sample    pop    lat    treat    pop.nested
1    1    N    1    1
2    1    N    1    1
3    1    N    2    1
4    1    N    2    1
5    2    N    1    2
6    2    N    1    2
7    2    N    2    2
8    2    N    2    2
9    3    N    1    3
10    3    N    1    3
11    3    N    2    3
12    3    N    2    3
13    4    S    1    1
14    4    S    1    1
15    4    S    2    1
16    4    S    2    1
17    5    S    1    2
18    5    S    1    2
19    5    S    2    2
20    5    S    2    2
21    6    S    1    3
22    6    S    1    3
23    6    S    2    3
24    6    S    2    3

Yet, how does EdgeR handle this data? It seems statistically not sound to me as now population 1 and 4 are assumed to be same in the model?

Could someone comment on this?
Thanks a lot!

 

 

 

edger multifactor nested design • 968 views
ADD COMMENT
1
Entering edit mode
Aaron Lun ★ 27k
@alun
Last seen 1 hour ago
The city by the bay

The best way to proceed would be something like this:

pop <- factor(rep(1:6, each=4))
lat <- factor(rep(c("N", "S"), each=12))
treatment <- factor(rep(rep(1:2, each=2), 6))
design <- model.matrix(~0 + pop + lat:treatment)
design <- design[,-grep("treatment1", colnames(design))]
colnames(design) <- sub(":", "_", colnames(design))

Now, looking at colnames(design), you get:

[1] "pop1"            "pop2"            "pop3"            "pop4"           
[5] "pop5"            "pop6"            "latN_treatment2" "latS_treatment2"

The first 6 coefficients represent blocking factors for the populations, while the last two represent latitude-specific treatment effects (of treatment 2 over treatment 1). You can drop them separately to test for the effect of treatment at either latitude; or you can compare them to each other, to test for differential treatment effects between latitudes:

con <- makeContrasts(latN_treatment2 - latS_treatment2, levels=design)

... or you can average them, to test for the average effect of treatment across both latitudes:

con <- makeContrasts((latN_treatment2 + latS_treatment2)/2, levels=design)

However, you cannot directly test for the effect of latitude because this is nested in population. Comparisons between populations are not sensible in this model as we've used all the population-level information to estimate the pop* coefficients. If you want to do this, you should consider using voom with duplicateCorrelation, which allows you to compare between levels of a blocking factor.

ADD COMMENT
0
Entering edit mode

Hi, I have another question related to this answer. Can I also compare latN.treat20 to latS.treat20 using this design? So compare the gene expression within the 20 treatment between the N and S latitude? How would you go about to test this? Thanks in advance!

ADD REPLY
0
Entering edit mode

Where did the 20 come from? Your post only has values of 1 or 2 for treat.

ADD REPLY
0
Entering edit mode

Sorry, I used the wrong numbers. So using the above-mentioned design, my question is: Can I also compare latN_treatment1 to latS.treatment1 using this design? So compare the gene expression within the treatment 1 between the N and S latitude? How would you go about to test this? Would this involve making a subset of the data? Thanks in advance!

ADD REPLY
0
Entering edit mode

Remember, latN_treatment2 and latS_treatment2 represent the log-fold change between treatments at each latitude; you can't have a separate coefficient like latN_treatment1 or latS_treatment1 in this design, because the log-fold change is already computed in treatment 2 relative to treatment 1.

More generally, if you want to directly compare expression between the N/1 and S/1 groups, you can't do this with this design because the different latitudes belong in different pop levels. You've already blocked on pop, so a direct comparison is simply not possible. Rather, you'll have to subset the data to only take the treatment 1 samples, and then pool the two treatment 1 samples at each pop level, such that there would be one pooled library per pop level. These can be treated as direct replicates and modelled with a simple one-way layout based on latitude.

ADD REPLY
0
Entering edit mode

Allright, thanks!

I'm still wondering though: could I also make the dummy variable "pop.nested" (as below) and then make a design like this: design = ~0 +lat + lat:pop.n

Would this be a valid alternative for pooling my samples within the populations?

sample    pop    lat    treat    pop.nested
1    1    N    1    1
2    1    N    1    1
5    2    N    1    2
6    2    N    1    2
9    3    N    1    3
10    3    N    1    3
13    4    S    1    1
14    4    S    1    1
17    5    S    1    2
18    5    S    1    2
21    6    S    1    3
22    6    S    1    3
ADD REPLY
0
Entering edit mode

No. This doesn't properly account for correlations for samples from the same pop. Consider a gene with no latitude effect; in this situation, your design implies that samples 1, 2, 13 and 14 should be independent replicates. However, 1 and 2 are from the same pop and will be correlated, as will 13 and 14. Failure to account for this means that the degrees of freedom in the model - and the confidence of downstream inferences - will be overstated. This is why I recommended pooling, to avoid worrying about these correlations.

ADD REPLY
0
Entering edit mode

Thanks for your reply! Do you have a suggestion on how to pool the data within populations? Just add up the counts? Or would you first do a normalisation procedure? In EdgeR would this mean adding up the rows of the samples within populatons after y <- calcNormFactors(y) ?

ADD REPLY
1
Entering edit mode

Run sumTechReps before calcNormFactors.

ADD REPLY
0
Entering edit mode
@swaegersjanne-11229
Last seen 5.6 years ago

Thanks for your clear and fast answer!

ADD COMMENT
0
Entering edit mode
@swaegersjanne-11229
Last seen 5.6 years ago

Hi again,

I actually have the same question regarding DESeq2. How can I implement the above method to account for populations into the design formula of DESeq2? I can not find a way to edit the design matrix as you can not give a matrix as input in DESeq2. At least, I can't find a straightforward way to do that.

Any help is greatly appreciated.

Best,

Janne

ADD COMMENT
0
Entering edit mode

Probably best to post this as a separate question with a DESeq2 tag.

ADD REPLY

Login before adding your answer.

Traffic: 272 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6