Question

Adjusting for procedural effect and deciding on statistical test

2

Entering edit mode

elineverbon ▴ 20

@elineverbon-11768

Last seen 8.6 years ago

Utrecht University, the Netherlands

Hi,

We are currently analysing our RNA sequencing data and have decided on EdgeR for calling differentially expressed genes. We have two questions about the analysis. One concerns correcting for the effect of an experimental procedure, the other about the test used to call differentially expressed genes.

We have done RNA sequencing on samples that were either directed through a flowcytometer ('sorted') or not ('unsorted'). The 'sorted' samples were sorted for six different cell types. We know that in EdgeR, we can make a design matrix and thus determine differential gene expression per cell type (and per the unsorted sample). In addition, we could add the replicate number as a variable to take into account (since we always collected a mock and a treatment sample for each cell type at the same time).

We are now wondering whether we could also correct for the 'sorting' effect by adding the extra variable 'sorting' to the design matrix. The other solution to correct for the sorting effect that we were thinking of, is looking for genes that relatively contribute a lot to the clustering by sorted vs unsorted and removing those genes from the analysis. We hope we explained this clear enough. Which of these do you think is the better option?

Second, we were wondering whether we should test with GLM or QL. For us, it seems like the user's guide says the QL method is better in general, (page 21 'the QL F-test is preferred as it reflects the uncertainty in estimating the dispersion for each gene'), but that made us wonder why the GLM method is out there in the first place.

Any help would be greatly appreciated.

Eline Verbon and Ronnie de Jonge

EdgeR glmlrt() ql f-test • 2.8k views

ADD COMMENT • link 9.3 years ago • updated 8.6 years ago elineverbon ▴ 20

0

Entering edit mode

Are only the sorted samples associated with a cell type? What cell types are the unsorted samples? It might be illustrative to post a minimal example of the targets table that represents your experimental design.

ADD REPLY • link 9.3 years ago Aaron Lun ★ 29k

0

Entering edit mode

Dear Aaron, the unsorted samples are the whole organ (in our case: the whole root). The sorted samples are either also the whole root or one of the five major cell types in the root. We had to sort to isolate the cell types and also collected all the cells after going through the flowcytometer a few times, to get 3 replicates of the whole root sorted sample as a control. Of each we have both bacteria-exposed and bacteria non-exposed samples.

Whole root unsorted	Bacteria-exposed, 3 replicates	Not exposed, 3 replicates
Whole root sorted	Bacteria-exposed, 3 replicates	Not exposed, 3 replicates
Cell type 1 (sorted)	Bacteria-exposed, 3 replicates	Not exposed, 3 replicates
Cell type 2 (sorted)	Bacteria-exposed, 3 replicates	Not exposed, 3 replicates
Cell type 3 (sorted)	Bacteria-exposed, 3 replicates	Not exposed, 3 replicates
Cell type 4 (sorted)	Bacteria-exposed, 3 replicates	Not exposed, 3 replicates
Cell type 5 (sorted)	Bacteria-exposed, 3 replicates	Not exposed, 3 replicates

Hope this helps, let me know if you have further questions.

ADD REPLY • link 9.3 years ago elineverbon ▴ 20

score 2 · Answer 1 · 2016-11-01

Let's set up an example below. I'm assuming that you have three batches of mock/treatment replicates, where all samples in a single batch (across all tissue types) were generated at the same time, from the same plant, etc.

tissue <- rep(c("root", "root", "type1", "type2", "type3", "type4", "type5"), each=6)
sorting <- rep(c("unsorted", rep("sorted", 6)), each=6)
exposure <- rep(rep(c("exposed", "unexposed"), each=3), 7)
batch <- rep(1:3, 14)

I would suggest a design matrix like this:

group <- factor(paste0(tissue, ".", sorting, ".", exposure))
batch <- factor(batch)
design <- model.matrix(~0 + group + batch)

The inclusion of the sorting factor in group will allow for differences between the whole root sorted and unsorted samples. You can determine the sorting effect by testing for DE between these two groups. However, you shouldn't compare between the sorted cell types and the whole root unsorted samples, because there's no way to disentangle cell type-specific changes from the sorting effect.

Obviously, you can also use this design to test for the treatment effect in each tissue, or for DE between cell types. I don't see a need to correct for the sorting effect here, because you're comparing sorted groups so any sorting effect should cancel out. If it doesn't for some reason, then you can just filter out those DE genes that were also found in the whole root sorted vs. unsorted comparison.

As for the QL vs GLM question, there are several reasons that I can think of/remember:

The QL framework is an extension of the GLM machinery, so it makes sense to keep the underlying functions as well as the higher-level extensions. For example, glmQLFTest depends on glmLRT.
The LRT still works for routine DE analyses, albeit not as well as the QL methods.
They are retained for compatibility with older scripts and packages.
In some cases, the LRT actually provides better performance, e.g., for single-cell data where there's plenty of samples (so loss of type I error control from uncertainty in the dispersions is less of an issue) but low, highly overdispersed counts (which compromises the accuracy of the saddlepoint approximation required by the QL methods).

score 0 · Answer 2 · 2016-11-07

0

Entering edit mode

elineverbon ▴ 20

@elineverbon-11768

Last seen 8.6 years ago

Utrecht University, the Netherlands

#make design matrix
tissue <- s2c$cell_type #s2c is a table with all the information about the samples, which the order as the order of my sequencing files
sorting <- s2c$sortedness
exposure <- s2c$treatment
replicate <- s2c$replicate

group <- factor(paste0(tissue, ".", sorting, ".", exposure))
replicate <- factor(replicate)
design <- model.matrix(~0 + group + replicate)

#for design matrix see next post: not enough space in 1 comment to add it.

y <- estimateDisp(y,design)
fit <- glmQLFit(y,design)

#perform tests

for(x in 1:length(unique(s2c$cell_type))) {
#get and save fold change and FDR for each gene in each sample
coeffi <- c(rep(0,17)) #to indicate which which groups (=columns) to compare in the design matrix)
#not sure what to do with the replicate effects

#find the samples in the info table in which this cell_type is present
samplesWithThisCellTypeWCS <- which(grepl(unique(s2c$cell_type)[x], s2c$cell_type) & grepl("WCS", s2c$treatment))
samplesWithThisCellTypeNB <- which(grepl(unique(s2c$cell_type)[x], s2c$cell_type) & grepl("NB", s2c$treatment))

#find which of the first 14 columns of the design matrix contains a '1' in the rows considering to the correct samples (the first occurrence is enough) --> group containing that cell type + treatment
columnCellTypeWCS <- which(design[samplesWithThisCellTypeWCS[1],]==1)
columnCellTypeNB <- which(design[samplesWithThisCellTypeNB[1],]==1)

#make the coeffi 1 for the treatment column, -1 for the mock column
coeffi <- replace(coeffi, columnCellTypeWCS, 1)
coeffi <- replace(coeffi, columnCellTypeNB, -1)

ql <- glmQLFTest(fit, contrast=coeffi)
results <- topTags(ql, n=Inf)

#there is more code here to save all results and the up- and down-regulated genes with a certain fold change and FDR

}

ADD COMMENT • link 9.3 years ago elineverbon ▴ 20

0

Entering edit mode

Design matrix:

	groupCOB.sorted.NB	groupCOB.sorted.WCS	groupCOLsort.sorted.NB	groupCOLsort.sorted.WCS	groupCOLunsort.unsorted.NB
1	0	0	0	0	0
2	0	0	0	0	0
3	0	0	0	0	0
4	0	0	0	0	0
5	0	0	0	0	0

	groupCOLunsort.unsorted.WCS	groupCOR.sorted.NB	groupCOR.sorted.WCS	groupSCR.sorted.NB	groupSCR.sorted.WCS
1	0	0	0	1	0
2	0	0	0	0	0
3	0	0	0	0	0
4	0	0	0	0	0
5	0	0	0	0	1

	groupWER.sorted.NB	groupWER.sorted.WCS	groupWOL.sorted.NB	groupWOL.sorted.WCS	replicateR2
1	0	0	0	0	0
2	0	0	0	1	1
3	1	0	0	0	0
4	0	1	0	0	0
5	0	0	0	0	0

	replicateR3	replicateR4
1	0	0
2	0	0
3	1	0
4	1	0
5	0	0

ADD REPLY • link 9.3 years ago elineverbon ▴ 20

0

Entering edit mode

I don't know how you managed to get values of 10-50 in your design matrix, I can only assume that you've mixed the row names with the first column. And it's still not clear what your batches actually are - this would be easy to see if you just showed the targets file, but I'm just going to guess that your whole root batches are not the same as your cell type batches:

tissue <- rep(c("root", "root", "type1", "type2", "type3", "type4", "type5"), each=6)
sorting <- rep(c("unsorted", rep("sorted", 6)), each=6)
exposure <- rep(rep(c("exposed", "unexposed"), each=3), 7)
batch <- c(rep(1:3, 4), rep(4:6, 10))

In which case, the design I mentioned in the previous post still applies. Each group coefficient represents the average log-expression of the corresponding group in its first batch (batch 1 for the whole root samples, batch 4 for the cell type samples). Note that you'll have to prune out batch4 to get your design matrix to full rank:

group <- factor(paste0(tissue, ".", sorting, ".", exposure))
batch <- factor(batch)
design <- model.matrix(~0 + group + batch)
design <- design[,-which(colnames(design)=="batch4")] # get ti fykk rabj,

You can compare between groups using makeContrasts, which is easier than manually constructing contrasts:

con <- makeContrasts(grouproot.sorted.exposed - grouproot.sorted.unexposed, levels=design)

... but don't compare between any of the whole root and cell types samples if they belong to different batches.

ADD REPLY • link 9.3 years ago Aaron Lun ★ 29k

score 0 · Answer 3 · 2017-07-14

Dear Aaron Lun (and others reading along),

We have continued working on this since our discussion here on the web. Here I first explain how we used your method and run into problems when trying to validate it. Next, I explain how I tried to correct for sorting myself and the problems I run into then. We are probably not understanding some details of your (simpler) method and don't need our own method. However, I included it so show our reasoning, which might help you understand where we go wrong trying to use your method.

It became quite a bit of text, I hope you will understand what I say.

To correct for our sorting effect we followed your suggestion to make a design matrix.

#make design matrix, with a column for each of our conditions (celltype-sortedornot-treatment)
#and '1's in the rows that correspond to a sample from that condition

tissue <- s2c$cell_type
sorting <- s2c$sortedness
exposure <- s2c$treatment
group <- factor(paste0(tissue, ".", sorting, ".", exposure))
group <- factor(paste0(tissue, ".", exposure))
design <- model.matrix(~0 + group)
#in here, s2c is a table with our run_accession number and corresponding treatment, #cell type, and whether they are sorted or not.

Afterwards, we make the fit object and calculate the genes changed due to our treatment. (As you said, I could make the contrasts by using makeContrasts, but my, indeed more complex, method only works, so for now I have kept it like this. I will adjust later.)

y <- estimateDisp(y,design)
#perform test
fit <- glmQLFit(y,design)

for(x in 1:length(unique(s2c$cell_type))) {
#we need to tell the program which conditions (= which columns) to compare
coeffi <- c(rep(0,length(levels(group))))

#find the columns in the design matrix that contain the samples of this cell type
currentCellType <- unique(s2c$cell_type)[x]
columnCellTypeWCS <- which(grepl(currentCellType,levels(group)) & grepl("WCS",levels(group)))
columnCellTypeNB <- which(grepl(currentCellType,levels(group)) & grepl("NB",levels(group)))

#make the coeffi 1 for the treatment column, -1 for the mock column
coeffi <- replace(coeffi, columnCellTypeWCS, 1)
coeffi <- replace(coeffi, columnCellTypeNB, -1)

ql <- glmQLFTest(fit, contrast=coeffi)
results <- topTags(ql, n=Inf)
write.csv(results, paste(saveDir,paste(currentCellType,'_WCSvsNB.csv', sep=""), sep="/"), row.names = T, na = "")

}

The code works (no bugs and we end up getting fold changes per gene for each of our cell types and the controls), but we're still struggling to find out how to determine whether the correction works and adjusts the values in the sorted samples as needed. In other words, how we can convince ourselves and others that the sorting effect is adjusted for and our results show the effect of our treatment (almost) only.

We figured we could plot our fold changes of all the genes in the sorted control on the x-axis and the unsorted control on the y-axis (because as you mentioned, we cannot compare with the cell types directly, too many confounding variables). We did this both with correcting for sorting and without (in the latter case, removing 'sorting' from 'group'). We expected the plot without the correction to have points that fell less close to the identity line then the points in the plot with correction. However, we got two exactly the same graphs (with the same R squared). To us, this seems to indicate that the sorting correction did not do anything.

What do you think? Are we missing something here?

Ps. We collected 1 replicate of 2 cell types or controls per day, with and without bacteria. So the experiment was performed over many different days and we do not have batches we can adjust for (or very many of them). Hope that makes the batch-thing clear. I am not sure what you mean with a 'target file', but if you explain I would be happy to include it.

score 0 · Answer 4 · 2017-07-14

To correct for the sorting effect myself, I

1) took the counts per gene of the DGELists of the untreated sorted and unsorted control samples. I calculated the fold change due to sorting with EdgeR and saved this in a table.
2) divided the counts of all the sorted treated samples by the fold change calculated from the treated control samples
3) repeat step 1 and 2 for the treated control samples

#get the counts from the DGEList (not the geneCounts --> non filtered, so not same length at the sortingEffect table)
#this table will be corrected for the sorting effect
sortingCorrectedCounts <- y$counts

#compare NB and WCS samples
for(x in c("WCS", "NB")) {

###check how sorting changed gene exp in NB and WCS samples###

#make a vector to tell the program which conditions to compare
coeffi <- c(rep(0,length(levels(group))))

#find the columns in the design matrix that contain the Col-0 WCS and NB samples (sorted and unsorted)
col0sort <- which(grepl("COLsort",levels(group)) & grepl(x,levels(group)))
col0unsort <- which(grepl("COLunsort",levels(group)) & grepl(x,levels(group)))

#compare sorted vs unsorted
#make the coeffi 1 for the treatment column, -1 for the mock column
coeffi <- replace(coeffi, col0sort, 1)
coeffi <- replace(coeffi, col0unsort, -1)

ql <- glmQLFTest(fit, contrast=coeffi)
#this gives a table with fold changes for each gene (used to correct for sorting effect later)
results <- topTags(ql, n=Inf)

#get table with the fold changes due to sorting per gene
sortingEffect <- results$table
#order the table by row.names (= gene names) (default: by p-value or FDR)
sortingEffect <- sortingEffect[order(row.names(sortingEffect)),]

#in the new counts table, find the columns of current loop variable (so NB or WCS, or 'x')
#disregard the unsorted samples (containing the string "un"), we do not want to change those (no sorting effect there)
columns <- intersect(grep(x,colnames(sortingCorrectedCounts)),grep("un",colnames(sortingCorrectedCounts),invert=TRUE))

#go through the columns of this variable
for(i in columns) {

#divide the counts of each gene by the fold change in the corresponding row in sortingEffect table
sortingCorrectedCounts[, i] <- sortingCorrectedCounts[, i] / sortingEffect[, "logFC"]
}
}

The next step is to make a new DGEList to perform EdgeR on again (and normalize). So, something like this (this is the code I used to generate the DGEList all the way at the beginning of my script, which I would have to adapt):

#get counts and lengths of the genes
geneCounts <- geneFromTranscripts$counts
normMat <- geneFromTranscripts$length
#o is used to correct for changes to the average transcript length across samples
o <- log(calcNormFactors(geneCounts/normMat)) + log(colSums(geneCounts/normMat))
y <- DGEList(geneCounts) #create a DGEList, contains y$counts and y$samples
y$offset <- t(t(log(normMat)) + o)

#filter
keep <- rowSums(cpm(y)>2) >= 3
y <- y[keep, , keep.lib.sizes=FALSE]

y <- calcNormFactors(y)

However, here I run into problems because I started with a DGEList to calculate the fold changes due to sorting with EdgeR. But that DGEList is already filtered, which leads to problems when trying to normalize the new DGEList. I could remove the filtering before correcting for the sorting effect, but then the changes due to sorting are not calculated as they should be.

score 0 · Answer 5 · 2017-07-14

0

Entering edit mode

elineverbon ▴ 20

@elineverbon-11768

Last seen 8.6 years ago

Utrecht University, the Netherlands

(sorry for the long messages, tried to make it as clear as possible)

ADD COMMENT • link 8.6 years ago elineverbon ▴ 20