Question

edgeR - design matrix and contrasts for paired data problems

0

Entering edit mode

emelie.nilsson • 0

@emelienilsson-22825

Last seen 4.3 years ago

Hi there!

I am new at transcriptomic analysis and at posting questions (I usually manage by reading other peoples questions), so let me know if something is unclear (or way too long). I am trying to use edgeR to analyse my transcriptomic data, but I'm running into problems both with regards to design matrix (I can get it to work, but I don't understand it) and as a consequence, I also get problems with my contrasts.

My transcripts come from a bacteria-phage model, and what I am looking at now is the bacterial genes (i.e. how does the phage change the expression of host genes). I have also done the experiment in two different treatments, "full" and "low" (it reflects what kind of medium the bacteria were grown in), and I've sampled each set 6 times. In other words, I have: 3 x bacteria+phage in Full medium for times 0-5 (PXF, where x is a time in minutes - determined by phage progression) 3 x bacteria+water in Full medium for times 0-5 (CXF, x is the same as above, water is just as a control) 3 x bacteria+phage in Low medium for times 0-5 (PYL, where y is a time in minutes - determined by phage progression) 3 x bacteria+water in Low medium for times 0-5 (CYL, y is the same as above) (The values for x and y are not the same, but they are the same proportions of the phages progression, i.e. supposed to be "translatable".)

It totals to 72 samples. It is also important to know that the samples from Full medium comes from 3 sets of starter cultures, i.e. I grew bacteria in a big bottle over night and then I split it in two and treated one part with phage and one with water = pairs. This is the same for the samples in Low medium, they are basically 3 pairs. And to complicate it a bit further, after some crude visualisation of mapping against phage/bacteria genomes - I discovered that three successive (i.e. from time 3, time 4 and time 5 of the same culture) control samples had too much phage genes and therefore had been contaminated, and that one phage sample did not contain any phage genes when that is biologically impossible. Therefore I need to remove these four samples. Is it possible to keep the "corresponding" samples, or do I have to discard them as well? In the analysis I describe here, I've only removed these four samples (which I am very uncertain whether it's correct or not).

I have several hypotheses about the experiment, but first of all I want to compare phage infected bacteria at each time point against the control, for each kind of medium. Secondly, I want to compare the difference between the phage vs control between the two kinds of medium.

To achieve this I have loaded my data and produced a data.frame with sample information:

#reading the cleaned, mapped and "counted" data into R
count <- read_tsv("counts.tsv", col_types = cols(.default = col_character(), count = col_double()))
#parsing the count table to produce metadata for the samples in count
samples <- count %>% distinct(sample) %>%
  mutate(
    treatment = sub('^(.).*', '\\1', sample),
    time = sub('.*t([0-9]+)$', '\\1', sample),
    replicate = sub('.*([0-9])+[LF]t[0-9]+', '\\1', sample),
    medium = sub('.*[0-9]+([LF])t.*', '\\1', sample),
    time = case_when(
      medium == 'L' & time == '0' ~ 1,
      medium == 'L' & time == '1' ~ 16.25,
      medium == 'L' & time == '2' ~ 32.5,
      medium == 'L' & time == '3' ~ 48.75,
      medium == 'L' & time == '4' ~ 65,
      medium == 'L' & time == '5' ~ 113.75,
      medium == 'F' & time == '0' ~ 1,
      medium == 'F' & time == '1' ~ 12.5,
      medium == 'F' & time == '2' ~ 25,
      medium == 'F' & time == '3' ~ 37.5,
      medium == 'F' & time == '4' ~ 50,
      medium == 'F' & time == '5' ~ 87.5,
      TRUE                        ~ -1
     )
  )
#merging three coloumns from data set "samples" - in other words reducing the samples information to be a guide with sample and characteristics
united <- unite(samples, "tr_ti_med", c("treatment", "time", "medium"))
#sorting
united <- united[order(united$sample),]
head(united)
sample  tr_ti_med   replicate
C1Ft0   C_1_F       1       
C1Ft1   C_12.5_F    1       
C1Ft2   C_25_F      1       
C1Ft3   C_37.5_F    1       
C1Ft4   C_50_F      1       
C1Ft5   C_87.5_F    1

And so far, I think I can handle it. Also, creating a DGEList seems fine, I've done the following:

#create DGEList
count_wide <- count %>%
  spread(sample, count, fill = 0)
count_matrix <- as.matrix(count_wide %>% select(-geneid))
rownames(count_matrix) = count_wide$geneid
dgel_count <- DGEList(count_matrix, group=united$tr_ti_med)
#subset for only bacteria genes (phage genes start with "B", no bacterial genes does)
dgel_bact <- dgel_count[grep("B.*",rownames(dgel_count$counts), invert=TRUE), ]
#subsetting bacterial dataset to remove C3Lt3-5 (contaminated), and also P2Ft3 (biologically invalid)
dgel_bact <- dgel_bact[, grep("C3Lt[3-5]", rownames(dgel_bact$samples), invert=TRUE)]
dgel_bact <- dgel_bact[, grep("P2Ft3", rownames(dgel_bact$samples), invert=TRUE)]
#filtering out low counts, then checking the output, applying and calculate nomarlistaion factors
keep_b <- rowSums(cpm(dgel_bact) > 9) >= 3
dgel_bact <- dgel_bact[keep_b, , keep.lib.size=FALSE]
dgel_bact <- calcNormFactors(dgel_bact)

However, when I construct the design matrix I get a bit lost. I've been reading the edgeR's users guide and since I've got paired samples I've been looking at section 3.4.1 and decided I want to create the matrix with an additive formula, and to make it easier for me I'd like to exclude the intercept, I've therefore done the following:

treat <- as.factor(united$tr_ti_med) #contains information on phage/control, time and medium type
subject <- as.factor(united$replicate) #which subject the samples originate from

design <- model.matrix(
  ~0+subject+treat
)
#design needs to contain all samples of interest, therefore subsetting for bacteria,
#I've removed C3Lt3 (row 34), C3Lt4 (row 35), C3Lt5 (row 36) and P2Ft3 (row 52)
design_bact <- design[-c(34:36,52),]
design_bact

My first issue with the design matrix is that I'm lying. In my first attempts, I had split "subject/replicates" into 1F, 2F, 3F, 1L, 2L and 3L to indicate first/second/third pair in Full and in Low medium. However, I ran into "matrix not of full rank" and after trying to understand I read (e.g. here) that I can not have columns that add up to each other because of linear dependencies. Therefore I removed the F/L, and only use "1", "2", "3" - even though the "1" replicates in Full medium have nothing to do with the samples in Low medium. Is this an okay way of getting around the linear dependencies?

My second issue with the design matrix is what columns I get, e.g.:

> colnames(design_bact)
 [1] "subject1"        "subject2"        "subject3"        "treatC_1_L"      "treatC_113.75_L" "treatC_12.5_F"  
 [7] "treatC_16.25_L"  "treatC_25_F"     "treatC_32.5_L"   "treatC_37.5_F"   "treatC_48.75_L"  "treatC_50_F"    
[13] "treatC_65_L"     "treatC_87.5_F"   "treatP_1_F"      "treatP_1_L"      "treatP_113.75_L" "treatP_12.5_F"  
[19] "treatP_16.25_L"  "treatP_25_F"     "treatP_32.5_L"   "treatP_37.5_F"   "treatP_48.75_L"  "treatP_50_F"    
[25] "treatP_65_L"     "treatP_87.5_F"

Since I specified that I did not want an intercept, I get the three subject-columns as I want. But for the treatment, I'm missing "treatC1F". Here, I'm properly lost. For some reason, I think the design assumes that the first "factor" (not sure if that's the right word, here it's "treatC1F") is some kind of baseline, which in my case is not true. I do not have one treatment that could be baseline for all. I have tried forcing a matrix with all my levels, but then I get issues with "design matrix not of full rank". After a while I tried to accept that this is the correct way of doing the matrix and moved on to contrasts, which leads to my issues with contrasts. But I would like to know if I have set up my matrix in a proper way? And could someone try to explain why one of my levels is missing? And what that means? I can understand if that is too big to manage, but maybe there is some not too advanced literature that you could point me to (I'm not very good with statistics, I'm sad to say).

If I assume that the design matrix is fine, I then want to do some actual analysis and I've tried the following:

#getting dispersion
dgel_bact <- estimateDisp(dgel_bact, design_bact, robust=TRUE)
#the data should be fitted with quasi-likelihood
fit_bact <- glmQLFit(dgel_bact, design_bact, robust=TRUE)
#a contrast that works
lowPvsC_t0 <- makeContrasts(treatP_1_L-treatC_1_L, levels=design_bact)
res <- glmQLFTest(fit_bact, contrast=lowPvsC_t0)
topTags(res)
is.de <- decideTestsDGE(res)
summaryis.de)
#the contrast that does not work
fullPvsC_t0 <- makeContrasts(treatP_1_F-treatC_1_F, levels=design_bact)

So, regarding my problems with contrasts, I can compare phage affected bacteria vs control in the Low medium at whatever timepoint I want. But I cannot compare anything to the control bacteria at time 0 (which is "1" minute, therefore the "1" above) in Full medium. If I've done the design matrix in a proper way, how do I manage to "pull" out the information from samples that belong to "treatC1F"?

This post is potentially too long to keep anyones interest, but I hope not, and I would appreciate any help. I've put my questions in bold in an attempt to simplify the reading, but please let me know if I should shorten this.

Best wishes, Emelie.

EDIT: After finding someone I could discuss my experimental setup with, we realised that my "subjects" (the pairs/replicates) are 100% connected to the medium (L/F) that I used (it wouldn't be biologically logical to do it any other way), which means (if I've understood it correctly) that they are linear dependent and that I cannot include both because the model will be unable to decide which causes the variation. (I might get the words wrong, I apologise if so.)

edger • 879 views

ADD COMMENT • link 4.3 years ago emelie.nilsson • 0

0

Entering edit mode

Your post is potentially too long, but I think you'll find that people will still try to power through in order to help. You can help make our lives easier by enabling us to reconstitute your united data.frame, so that we can better understand the sample information we are working with. Would that be possible?

You can either post it somewhere that we can then download, or if that's not an option, you can always copy/paste the output of dput(united) so that we can use that result to rebuild the sample information into an R workspace and poke around a bit.

Also, consider appending the columns of united into the appropriate columns of the $samples data.frame of the DGEList that you constructed and are playing with ... it will almost certainly come in handy in the long run.

ADD REPLY • link 4.3 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Thank you very much for your reply! I could potentially upload united somewhere, but it's probably easier with you "dput"-suggestion.

dput(united)
structure(list(sample = c("C1Ft0", "C1Ft1", "C1Ft2", "C1Ft3", 
"C1Ft4", "C1Ft5", "C1Lt0", "C1Lt1", "C1Lt2", "C1Lt3", "C1Lt4", 
"C1Lt5", "C2Ft0", "C2Ft1", "C2Ft2", "C2Ft3", "C2Ft4", "C2Ft5", 
"C2Lt0", "C2Lt1", "C2Lt2", "C2Lt3", "C2Lt4", "C2Lt5", "C3Ft0", 
"C3Ft1", "C3Ft2", "C3Ft3", "C3Ft4", "C3Ft5", "C3Lt0", "C3Lt1", 
"C3Lt2", "C3Lt3", "C3Lt4", "C3Lt5", "P1Ft0", "P1Ft1", "P1Ft2", 
"P1Ft3", "P1Ft4", "P1Ft5", "P1Lt0", "P1Lt1", "P1Lt2", "P1Lt3", 
"P1Lt4", "P1Lt5", "P2Ft0", "P2Ft1", "P2Ft2", "P2Ft3", "P2Ft4", 
"P2Ft5", "P2Lt0", "P2Lt1", "P2Lt2", "P2Lt3", "P2Lt4", "P2Lt5", 
"P3Ft0", "P3Ft1", "P3Ft2", "P3Ft3", "P3Ft4", "P3Ft5", "P3Lt0", 
"P3Lt1", "P3Lt2", "P3Lt3", "P3Lt4", "P3Lt5"), tr_ti_med = c("C_1_F", 
"C_12.5_F", "C_25_F", "C_37.5_F", "C_50_F", "C_87.5_F", "C_1_L", 
"C_16.25_L", "C_32.5_L", "C_48.75_L", "C_65_L", "C_113.75_L", 
"C_1_F", "C_12.5_F", "C_25_F", "C_37.5_F", "C_50_F", "C_87.5_F", 
"C_1_L", "C_16.25_L", "C_32.5_L", "C_48.75_L", "C_65_L", "C_113.75_L", 
"C_1_F", "C_12.5_F", "C_25_F", "C_37.5_F", "C_50_F", "C_87.5_F", 
"C_1_L", "C_16.25_L", "C_32.5_L", "C_48.75_L", "C_65_L", "C_113.75_L", 
"P_1_F", "P_12.5_F", "P_25_F", "P_37.5_F", "P_50_F", "P_87.5_F", 
"P_1_L", "P_16.25_L", "P_32.5_L", "P_48.75_L", "P_65_L", "P_113.75_L", 
"P_1_F", "P_12.5_F", "P_25_F", "P_37.5_F", "P_50_F", "P_87.5_F", 
"P_1_L", "P_16.25_L", "P_32.5_L", "P_48.75_L", "P_65_L", "P_113.75_L", 
"P_1_F", "P_12.5_F", "P_25_F", "P_37.5_F", "P_50_F", "P_87.5_F", 
"P_1_L", "P_16.25_L", "P_32.5_L", "P_48.75_L", "P_65_L", "P_113.75_L"
), replicate = c("1", "1", "1", "1", "1", "1", "1", "1", "1", 
"1", "1", "1", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", 
"2", "2", "3", "3", "3", "3", "3", "3", "3", "3", "3", "3", "3", 
"3", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", 
"2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "3", 
"3", "3", "3", "3", "3", "3", "3", "3", "3", "3", "3")), row.names = c(NA, 
-72L), class = c("tbl_df", "tbl", "data.frame"))

(Sorry if I should have added this in the above post, but it's here at least.)

Thank you for your suggestion, the samples information in the DGEList does contain my "compromised" sample information, but you're probably correct about having all the information in there - I'll try to manage that!

ADD REPLY • link 4.3 years ago emelie.nilsson • 0

score 1 · Accepted Answer · 2020-02-03

1

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 2 hours ago

WEHI, Melbourne, Australia

Emelie, I don't have time to check your experimental design or to respond to all your questions, but I think your questions are probably all FAQ and you will find info on this forum if you do a Google search. I will say that if you simply reverse the factors in the linear model:

design <- model.matrix(~0+treat+subject)

then all your contrasts will work.

There is also no problem with the standard design matrix

design <- model.matrix(~ subject + treat)

In that case, you simply use topTags(fit, coef="treatP_1_F") to get the P_1_F vs C_1_F contrast, because all the treat coefficients are relative to C_1_F.

ADD COMMENT • link 4.3 years ago Gordon Smyth 50k

0

Entering edit mode

Thank you very much for your reply! I understand that you do not have time for that and I will go back to searching. I still don't understand how the design matrix works, but I thank you for helping me to get the contrasts to work - now I can at least go back to looking at the biology.

However, one of the things that I really don't get is why I would want "all the treat coefficients are relative to C_1_F"? Or to any other one sample? For example, if I do comparison between phage/no phage in the Low medium, then comparing them to C_1_F seems strange to me. Is it correct to do that?

ADD REPLY • link 4.3 years ago emelie.nilsson • 0

0

Entering edit mode

Yes, it is absolutely correct. The reason why it is not only correct but inevitable has been explained at least a dozen times on this forum. It is a basic statistical point that is implicit in any anova model but one that is not explained explicitly in statistics textbooks. Some related questions:

ADD REPLY • link 4.3 years ago Gordon Smyth 50k