Search
Question: Adjusting for procedural effect and deciding on statistical test
2
2.1 years ago by
elineverbon20
Utrecht University, the Netherlands
elineverbon20 wrote:

Hi,

We are currently analysing our RNA sequencing data and have decided on EdgeR for calling differentially expressed genes. We have two questions about the analysis. One concerns correcting for the effect of an experimental procedure, the other about the test used to call differentially expressed genes.

We have done RNA sequencing on samples that were either directed through a flowcytometer ('sorted') or not ('unsorted'). The 'sorted' samples were sorted for six different cell types. We know that in EdgeR, we can make a design matrix and thus determine differential gene expression per cell type (and per the unsorted sample). In addition, we could add the replicate number as a variable to take into account (since we always collected a mock and a treatment sample for each cell type at the same time).

We are now wondering whether we could also correct for the 'sorting' effect by adding the extra variable 'sorting' to the design matrix. The other solution to correct for the sorting effect that we were thinking of, is looking for genes that relatively contribute a lot to the clustering by sorted vs unsorted and removing those genes from the analysis. We hope we explained this clear enough. Which of these do you think is the better option?

Second, we were wondering whether we should test with GLM or QL. For us, it seems like the user's guide says the QL method is better in general, (page 21 'the QL F-test is preferred as it reflects the uncertainty in estimating the dispersion for each gene'), but that made us wonder why the GLM method is out there in the first place.

Any help would be greatly appreciated.

Eline Verbon and Ronnie de Jonge

modified 17 months ago • written 2.1 years ago by elineverbon20

Are only the sorted samples associated with a cell type? What cell types are the unsorted samples? It might be illustrative to post a minimal example of the targets table that represents your experimental design.

Dear Aaron, the unsorted samples are the whole organ (in our case: the whole root). The sorted samples are either also the whole root or one of the five major cell types in the root. We had to sort to isolate the cell types and also collected all the cells after going through the flowcytometer a few times, to get 3 replicates of the whole root sorted sample as a control. Of each we have both bacteria-exposed and bacteria non-exposed samples.

 Whole root unsorted Bacteria-exposed, 3 replicates Not exposed, 3 replicates Whole root sorted Bacteria-exposed, 3 replicates Not exposed, 3 replicates Cell type 1 (sorted) Bacteria-exposed, 3 replicates Not exposed, 3 replicates Cell type 2 (sorted) Bacteria-exposed, 3 replicates Not exposed, 3 replicates Cell type 3 (sorted) Bacteria-exposed, 3 replicates Not exposed, 3 replicates Cell type 4 (sorted) Bacteria-exposed, 3 replicates Not exposed, 3 replicates Cell type 5 (sorted) Bacteria-exposed, 3 replicates Not exposed, 3 replicates

Hope this helps, let me know if you have further questions.

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by elineverbon20
2
2.1 years ago by
Aaron Lun21k
Cambridge, United Kingdom
Aaron Lun21k wrote:

Let's set up an example below. I'm assuming that you have three batches of mock/treatment replicates, where all samples in a single batch (across all tissue types) were generated at the same time, from the same plant, etc.

tissue <- rep(c("root", "root", "type1", "type2", "type3", "type4", "type5"), each=6)
sorting <- rep(c("unsorted", rep("sorted", 6)), each=6)
exposure <- rep(rep(c("exposed", "unexposed"), each=3), 7)
batch <- rep(1:3, 14)

I would suggest a design matrix like this:

group <- factor(paste0(tissue, ".", sorting, ".", exposure))
batch <- factor(batch)
design <- model.matrix(~0 + group + batch)

The inclusion of the sorting factor in group will allow for differences between the whole root sorted and unsorted samples. You can determine the sorting effect by testing for DE between these two groups. However, you shouldn't compare between the sorted cell types and the whole root unsorted samples, because there's no way to disentangle cell type-specific changes from the sorting effect.

Obviously, you can also use this design to test for the treatment effect in each tissue, or for DE between cell types. I don't see a need to correct for the sorting effect here, because you're comparing sorted groups so any sorting effect should cancel out. If it doesn't for some reason, then you can just filter out those DE genes that were also found in the whole root sorted vs. unsorted comparison.

As for the QL vs GLM question, there are several reasons that I can think of/remember:

1. The QL framework is an extension of the GLM machinery, so it makes sense to keep the underlying functions as well as the higher-level extensions. For example, glmQLFTest depends on glmLRT.
2. The LRT still works for routine DE analyses, albeit not as well as the QL methods.
3. They are retained for compatibility with older scripts and packages.
4. In some cases, the LRT actually provides better performance, e.g., for single-cell data where there's plenty of samples (so loss of type I error control from uncertainty in the dispersions is less of an issue) but low, highly overdispersed counts (which compromises the accuracy of the saddlepoint approximation required by the QL methods).
ADD COMMENTlink modified 2.1 years ago • written 2.1 years ago by Aaron Lun21k

Thank you for your speedy replies!

I remade my design matrix according to your instructions. I changed 'batch' into replicate, though: all samples across the different tissue types could unfortunately not be generated at the same time or from the same plant. Thus, there is a replicate effect per cell type / whole root, not a batch effect overall.

I am now doing the tests as follows

y <- estimateDisp(y,design) #not sure whether I should do tagwise dispersions, is that only for GLM or also for QL?
fit <- glmQLFit(y,design)
ql <- glmQLFTest(fit, contrast=coeffi)

In 'coeffi', I have a vector of 17 numbers (14 sample types, plus the 3 replicate, or 'batch', columns). With this, I should be able to determine the sorting effect and the effect of the treatment per cell type.

First, I am wondering how to correct for a possible replicate effect. For now, I just put a '-1' in the xth value of the coeffi vector for the treated cell type sample that is present in column x of the design matrix and a '1' in the yth value of the coeffi vector for the mock cell type sample that is present in column y of the design matrix for the mock cell type sample to get the DE induced by the treatment for each sample. However, I feel like this way I am not taking the replicates into account. Should I add them somehow or are they already accounted for when fitting the model? If the latter is true, should I make separate replicate numbers for each sample? I now have R1, R2, R3 for each condition, however, R1 is not generated at the same time and fromt he same plants as R1 from a different cell type.

My other question is related to correcting for the sorting effect. By comparing the DE between the sorted and unsorted whole root I can determine the sorting effect (namely the difference in DE), but I am not sure how to correct for it. What I gather from your previous answer is that it does not make sense to add the sorting variable as a separate variable in the design (as we did with batch / replicate), since it does not make sense to compare sorted cell types with the unsorted whole root (we agree). Do you think instead we should subtract any changes found between the sorted and unsorted whole root from all the other comparisons?

I hope I formulated my questions clearly. Let me know if this is not the case.

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by elineverbon20

It's really hard to figure out what you did when you try to describe it in words. Put some code instead. Also, given how complicated your experimental design seems to be, showing all or part of the table from the targets file would also be useful (see an example in the post here: https://support.bioconductor.org/p/88716/).

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by Aaron Lun21k
0
2.1 years ago by
elineverbon20
Utrecht University, the Netherlands
elineverbon20 wrote:

#make design matrix
tissue <- s2c$cell_type #s2c is a table with all the information about the samples, which the order as the order of my sequencing files sorting <- s2c$sortedness
exposure <- s2c$treatment replicate <- s2c$replicate

group <- factor(paste0(tissue, ".", sorting, ".", exposure))
replicate <- factor(replicate)
design <- model.matrix(~0 + group + replicate)

#for design matrix see next post: not enough space in 1 comment to add it.

y <- estimateDisp(y,design)
fit <- glmQLFit(y,design)

#perform tests

for(x in 1:length(unique(s2c$cell_type))) { #get and save fold change and FDR for each gene in each sample coeffi <- c(rep(0,17)) #to indicate which which groups (=columns) to compare in the design matrix) #not sure what to do with the replicate effects #find the samples in the info table in which this cell_type is present samplesWithThisCellTypeWCS <- which(grepl(unique(s2c$cell_type)[x], s2c$cell_type) & grepl("WCS", s2c$treatment))
samplesWithThisCellTypeNB <- which(grepl(unique(s2c$cell_type)[x], s2c$cell_type) & grepl("NB", s2c$treatment)) #find which of the first 14 columns of the design matrix contains a '1' in the rows considering to the correct samples (the first occurrence is enough) --> group containing that cell type + treatment columnCellTypeWCS <- which(design[samplesWithThisCellTypeWCS[1],]==1) columnCellTypeNB <- which(design[samplesWithThisCellTypeNB[1],]==1) #make the coeffi 1 for the treatment column, -1 for the mock column coeffi <- replace(coeffi, columnCellTypeWCS, 1) coeffi <- replace(coeffi, columnCellTypeNB, -1) ql <- glmQLFTest(fit, contrast=coeffi) results <- topTags(ql, n=Inf) #there is more code here to save all results and the up- and down-regulated genes with a certain fold change and FDR } ADD COMMENTlink modified 2.1 years ago • written 2.1 years ago by elineverbon20 Design matrix:  groupCOB.sorted.NB groupCOB.sorted.WCS groupCOLsort.sorted.NB groupCOLsort.sorted.WCS groupCOLunsort.unsorted.NB 1 0 0 0 0 0 2 0 0 0 0 0 3 0 0 0 0 0 4 0 0 0 0 0 5 0 0 0 0 0 groupCOLunsort.unsorted.WCS groupCOR.sorted.NB groupCOR.sorted.WCS groupSCR.sorted.NB groupSCR.sorted.WCS 1 0 0 0 1 0 2 0 0 0 0 0 3 0 0 0 0 0 4 0 0 0 0 0 5 0 0 0 0 1 groupWER.sorted.NB groupWER.sorted.WCS groupWOL.sorted.NB groupWOL.sorted.WCS replicateR2 1 0 0 0 0 0 2 0 0 0 1 1 3 1 0 0 0 0 4 0 1 0 0 0 5 0 0 0 0 0 replicateR3 replicateR4 1 0 0 2 0 0 3 1 0 4 1 0 5 0 0 ADD REPLYlink written 2.1 years ago by elineverbon20 I don't know how you managed to get values of 10-50 in your design matrix, I can only assume that you've mixed the row names with the first column. And it's still not clear what your batches actually are - this would be easy to see if you just showed the targets file, but I'm just going to guess that your whole root batches are not the same as your cell type batches: tissue <- rep(c("root", "root", "type1", "type2", "type3", "type4", "type5"), each=6) sorting <- rep(c("unsorted", rep("sorted", 6)), each=6) exposure <- rep(rep(c("exposed", "unexposed"), each=3), 7) batch <- c(rep(1:3, 4), rep(4:6, 10)) In which case, the design I mentioned in the previous post still applies. Each group coefficient represents the average log-expression of the corresponding group in its first batch (batch 1 for the whole root samples, batch 4 for the cell type samples). Note that you'll have to prune out batch4 to get your design matrix to full rank: group <- factor(paste0(tissue, ".", sorting, ".", exposure)) batch <- factor(batch) design <- model.matrix(~0 + group + batch) design <- design[,-which(colnames(design)=="batch4")] # get ti fykk rabj, You can compare between groups using makeContrasts, which is easier than manually constructing contrasts: con <- makeContrasts(grouproot.sorted.exposed - grouproot.sorted.unexposed, levels=design) ... but don't compare between any of the whole root and cell types samples if they belong to different batches. ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by Aaron Lun21k 0 17 months ago by elineverbon20 Utrecht University, the Netherlands elineverbon20 wrote: Dear Aaron Lun (and others reading along), We have continued working on this since our discussion here on the web. Here I first explain how we used your method and run into problems when trying to validate it. Next, I explain how I tried to correct for sorting myself and the problems I run into then. We are probably not understanding some details of your (simpler) method and don't need our own method. However, I included it so show our reasoning, which might help you understand where we go wrong trying to use your method. It became quite a bit of text, I hope you will understand what I say. To correct for our sorting effect we followed your suggestion to make a design matrix.  #make design matrix, with a column for each of our conditions (celltype-sortedornot-treatment) #and '1's in the rows that correspond to a sample from that condition tissue <- s2c$cell_type sorting <- s2c$sortedness exposure <- s2c$treatment group <- factor(paste0(tissue, ".", sorting, ".", exposure)) group <- factor(paste0(tissue, ".", exposure)) design <- model.matrix(~0 + group) #in here, s2c is a table with our run_accession number and corresponding treatment, #cell type, and whether they are sorted or not.

Afterwards, we make the fit object and calculate the genes changed due to our treatment. (As you said, I could make the contrasts by using makeContrasts, but my, indeed more complex, method only works, so for now I have kept it like this. I will adjust later.)

 y <- estimateDisp(y,design) #perform test fit <- glmQLFit(y,design) for(x in 1:length(unique(s2c$cell_type))) { #we need to tell the program which conditions (= which columns) to compare coeffi <- c(rep(0,length(levels(group)))) #find the columns in the design matrix that contain the samples of this cell type currentCellType <- unique(s2c$cell_type)[x]     columnCellTypeWCS <- which(grepl(currentCellType,levels(group)) & grepl("WCS",levels(group)))     columnCellTypeNB <- which(grepl(currentCellType,levels(group)) & grepl("NB",levels(group)))          #make the coeffi 1 for the treatment column, -1 for the mock column      coeffi <- replace(coeffi, columnCellTypeWCS, 1)     coeffi <- replace(coeffi, columnCellTypeNB, -1)          ql <- glmQLFTest(fit, contrast=coeffi)      results <- topTags(ql, n=Inf)     write.csv(results, paste(saveDir,paste(currentCellType,'_WCSvsNB.csv', sep=""), sep="/"), row.names = T, na = "") }

The code works (no bugs and we end up getting fold changes per gene for each of our cell types and the controls), but we're still struggling to find out how to determine whether the correction works and adjusts the values in the sorted samples as needed. In other words, how we can convince ourselves and others that the sorting effect is adjusted for and our results show the effect of our treatment (almost) only.

We figured we could plot our fold changes of all the genes in the sorted control on the x-axis and the unsorted control on the y-axis (because as you mentioned, we cannot compare with the cell types directly, too many confounding variables). We did this both with correcting for sorting and without (in the latter case, removing 'sorting' from 'group'). We expected the plot without the correction to have points that fell less close to the identity line then the points in the plot with correction. However, we got two exactly the same graphs (with the same R squared). To us, this seems to indicate that the sorting correction did not do anything.

What do you think? Are we missing something here?

Ps. We collected 1 replicate of 2 cell types or controls per day, with and without bacteria. So the experiment was performed over many different days and we do not have batches we can adjust for (or very many of them). Hope that makes the batch-thing clear. I am not sure what you mean with a 'target file', but if you explain I would be happy to include it.

Well, you've defined group as:

group <- factor(paste0(tissue, ".", sorting, ".", exposure))
group <- factor(paste0(tissue, ".", exposure))

... so it doesn't seem like you're actually using the sorting level here.

Yes, you are right of course. I accidentally pasted the code that I used to remove the sorting adjustment to compare the two. In my original script I did not include the second line and did correct for the sorting.

So, I used the first line for my original analysis. I added the second line (and commented the first one out) to redo the analysis without 'sorting' as a factor to compare the results.

I am sorry for pasting the wrong code here.

I don't understand what your problem is. What's NB? What's WCS? Why are you using grep to construct the contrast matrices, and not makeContrasts? Your experimental design is too large and complicated to explain verbally; please provide some code that I can use to completely reconstruct your design matrix from scratch. For example:

grouping <- rep(LETTERS[1:5], each=5) # need you to define something like this
design <- model.matrix(~0 + grouping)

I would also suggest posting as a new question, it's becoming difficult to keep track of the information across your separate posts.

ADD REPLYlink modified 17 months ago • written 17 months ago by Aaron Lun21k

Dear Aaron,

I followed up on your advice and started a new post here: Adjusting for procedural effect

0
17 months ago by
elineverbon20
Utrecht University, the Netherlands
elineverbon20 wrote:

To correct for the sorting effect myself, I

1) took the counts per gene of the DGELists of the untreated sorted and unsorted control samples. I calculated the fold change due to sorting with EdgeR and saved this in a table.
2) divided the counts of all the sorted treated samples by the fold change calculated from the treated control samples
3) repeat step 1 and 2 for the treated control samples

 #get the counts from the DGEList (not the geneCounts --> non filtered, so not same length at the sortingEffect table)    #this table will be corrected for the sorting effect   sortingCorrectedCounts <- y$counts #compare NB and WCS samples for(x in c("WCS", "NB")) { ###check how sorting changed gene exp in NB and WCS samples### #make a vector to tell the program which conditions to compare coeffi <- c(rep(0,length(levels(group)))) #find the columns in the design matrix that contain the Col-0 WCS and NB samples (sorted and unsorted) col0sort <- which(grepl("COLsort",levels(group)) & grepl(x,levels(group))) col0unsort <- which(grepl("COLunsort",levels(group)) & grepl(x,levels(group))) #compare sorted vs unsorted #make the coeffi 1 for the treatment column, -1 for the mock column coeffi <- replace(coeffi, col0sort, 1) coeffi <- replace(coeffi, col0unsort, -1) ql <- glmQLFTest(fit, contrast=coeffi) #this gives a table with fold changes for each gene (used to correct for sorting effect later) results <- topTags(ql, n=Inf) #get table with the fold changes due to sorting per gene sortingEffect <- results$table     #order the table by row.names (= gene names) (default: by p-value or FDR)     sortingEffect <- sortingEffect[order(row.names(sortingEffect)),]          #in the new counts table, find the columns of current loop variable (so NB or WCS, or 'x')     #disregard the unsorted samples (containing the string "un"), we do not want to change those (no sorting effect there)     columns <- intersect(grep(x,colnames(sortingCorrectedCounts)),grep("un",colnames(sortingCorrectedCounts),invert=TRUE))     #go through the columns of this variable     for(i in columns) {       #divide the counts of each gene by the fold change in the corresponding row in sortingEffect table       sortingCorrectedCounts[, i] <- sortingCorrectedCounts[, i] / sortingEffect[, "logFC"]     }   }

The next step is to make a new DGEList to perform EdgeR on again (and normalize). So, something like this (this is the code I used to generate the DGEList all the way at the beginning of my script, which I would have to adapt):

 #get counts and lengths of the genes   geneCounts <- geneFromTranscripts$counts normMat <- geneFromTranscripts$length   #o is used to correct for changes to the average transcript length across samples   o <- log(calcNormFactors(geneCounts/normMat)) + log(colSums(geneCounts/normMat))   y <- DGEList(geneCounts) #create a DGEList, contains y$counts and y$samples    y\$offset <- t(t(log(normMat)) + o)      #filter   keep <- rowSums(cpm(y)>2) >= 3    y <- y[keep, , keep.lib.sizes=FALSE]      y <- calcNormFactors(y)

However, here I run into problems because I started with a DGEList to calculate the fold changes due to sorting with EdgeR. But that DGEList is already filtered, which leads to problems when trying to normalize the new DGEList. I could remove the filtering before correcting for the sorting effect, but then the changes due to sorting are not calculated as they should be.

0
17 months ago by
elineverbon20
Utrecht University, the Netherlands
elineverbon20 wrote:

(sorry for the long messages, tried to make it as clear as possible)