4 months ago by
Utrecht University, the Netherlands
Dear Aaron Lun (and others reading along),
We have continued working on this since our discussion here on the web. Here I first explain how we used your method and run into problems when trying to validate it. Next, I explain how I tried to correct for sorting myself and the problems I run into then. We are probably not understanding some details of your (simpler) method and don't need our own method. However, I included it so show our reasoning, which might help you understand where we go wrong trying to use your method.
It became quite a bit of text, I hope you will understand what I say.
To correct for our sorting effect we followed your suggestion to make a design matrix.
#make design matrix, with a column for each of our conditions (celltypesortedornottreatment)
#and '1's in the rows that correspond to a sample from that condition
tissue < s2c$cell_type
sorting < s2c$sortedness
exposure < s2c$treatment
group < factor(paste0(tissue, ".", sorting, ".", exposure))
group < factor(paste0(tissue, ".", exposure))
design < model.matrix(~0 + group)
#in here, s2c is a table with our run_accession number and corresponding treatment, #cell type, and whether they are sorted or not.

Afterwards, we make the fit object and calculate the genes changed due to our treatment. (As you said, I could make the contrasts by using makeContrasts, but my, indeed more complex, method only works, so for now I have kept it like this. I will adjust later.)
y < estimateDisp(y,design)
#perform test
fit < glmQLFit(y,design)
for(x in 1:length(unique(s2c$cell_type))) {
#we need to tell the program which conditions (= which columns) to compare
coeffi < c(rep(0,length(levels(group))))
#find the columns in the design matrix that contain the samples of this cell type
currentCellType < unique(s2c$cell_type)[x]
columnCellTypeWCS < which(grepl(currentCellType,levels(group)) & grepl("WCS",levels(group)))
columnCellTypeNB < which(grepl(currentCellType,levels(group)) & grepl("NB",levels(group)))
#make the coeffi 1 for the treatment column, 1 for the mock column
coeffi < replace(coeffi, columnCellTypeWCS, 1)
coeffi < replace(coeffi, columnCellTypeNB, 1)
ql < glmQLFTest(fit, contrast=coeffi)
results < topTags(ql, n=Inf)
write.csv(results, paste(saveDir,paste(currentCellType,'_WCSvsNB.csv', sep=""), sep="/"), row.names = T, na = "")
}

The code works (no bugs and we end up getting fold changes per gene for each of our cell types and the controls), but we're still struggling to find out how to determine whether the correction works and adjusts the values in the sorted samples as needed. In other words, how we can convince ourselves and others that the sorting effect is adjusted for and our results show the effect of our treatment (almost) only.
We figured we could plot our fold changes of all the genes in the sorted control on the xaxis and the unsorted control on the yaxis (because as you mentioned, we cannot compare with the cell types directly, too many confounding variables). We did this both with correcting for sorting and without (in the latter case, removing 'sorting' from 'group'). We expected the plot without the correction to have points that fell less close to the identity line then the points in the plot with correction. However, we got two exactly the same graphs (with the same R squared). To us, this seems to indicate that the sorting correction did not do anything.
What do you think? Are we missing something here?
Ps. We collected 1 replicate of 2 cell types or controls per day, with and without bacteria. So the experiment was performed over many different days and we do not have batches we can adjust for (or very many of them). Hope that makes the batchthing clear. I am not sure what you mean with a 'target file', but if you explain I would be happy to include it.
Are only the sorted samples associated with a cell type? What cell types are the unsorted samples? It might be illustrative to post a minimal example of the targets table that represents your experimental design.
Dear Aaron, the unsorted samples are the whole organ (in our case: the whole root). The sorted samples are either also the whole root or one of the five major cell types in the root. We had to sort to isolate the cell types and also collected all the cells after going through the flowcytometer a few times, to get 3 replicates of the whole root sorted sample as a control. Of each we have both bacteriaexposed and bacteria nonexposed samples.
Hope this helps, let me know if you have further questions.