Question: Adjusting for procedural effect
gravatar for elineverbon
11 months ago by
Utrecht University, the Netherlands
elineverbon20 wrote:


This is a follow-up on a previous question (to be found here: C: Adjusting for procedural effect and deciding on statistical test )

I am trying to analyze an RNA seq dataset with EdgeR. I have 44 different samples, coming from two controls (both the full organ, namely a plant root) and five different cell types. I have 6 samples for all but one of the controls, for which I have 8 samples. Half of the samples is treated, the other is not. We are interested in the effect of this treatment. 

To get the individual cell types, we had to sort our protoplasts. This means that all the cell type samples are sorted. We also sorted one of the controls (the one with 8 samples), while we did not sort the other control. We hope to be able to use these controls to correct for the sorting effect when analyzing our data for treatment-induced effects with EdgeR. To summarize, this is our experimental design:


tissue <- c(rep("control_wholeRoot",6),rep("control_sorted", 8), rep(c("type1", "type2",  "type3", "type4", "type5"), each=6))
sorting <- c(rep(c("unsorted", rep("sorted", 6)), each=6), c("sorted", "sorted"))
treatment <- c(rep(c("exposed", "unexposed"), each = 3), rep(c("exposed",  "unexposed"), each = 4), rep(rep(c("exposed", "unexposed"), each = 3), 5))

group <- factor(paste0(tissue, ".", sorting, ".", treatment))
design <- model.matrix(~0 + group)


This design is used to construct the fit object from the digital gene expression list (DGEList) that I made from my counts table.


#make the DGEList
geneCounts <- geneFromTranscripts$counts
normMat <- geneFromTranscripts$length
o <- log(calcNormFactors(geneCounts/normMat)) +                  log(colSums(geneCounts/normMat))
y <- DGEList(geneCounts)
y$offset <- t(t(log(normMat)) + o)

#filter out lowly expressed genes
#smallest libraries are 461161, then 3 million. A cpm of >2 means >6 reads in all samples but the smallest one.
keep <- rowSums(cpm(y)>2) >= 3 
y <- y[keep, , keep.lib.sizes=FALSE]
y <- calcNormFactors(y)

#fit the data
y <- estimateDisp(y,design)
fit <- glmQLFit(y,design)


I then make contrasts to compare the treated vs the non-treated sample of each sample type (each of the two controls and each of the 5 cell types). This gives 7 lists of all genes with the FC in the specific sample due to the treatment.

We want to check whether including 'sorting' in our design matrix actually does anything. To check this, we reran the script, but without the 'sorting' in 'group'. We then made a scatterplot of FC of the genes in the unsorted control vs the FC of the same genes in the sorted control from both the resulting datasets.

We expected the scatterplot made from the dataset with 'sorting' in 'group' to show points that fall more along the identity line (and resulted in a larger R squared) then the scatterplot made from the dataset without 'sorting' in 'group'.

However, we got two graphs that are exactly the same. Are we missing something here and should we do a different check or is it indeed bad that the two graphs are the same?

Any help would be highly appreciated,
Eline Verbon and Ronnie de Jonge

ADD COMMENTlink modified 11 months ago • written 11 months ago by elineverbon20
gravatar for Aaron Lun
11 months ago by
Aaron Lun19k
Cambridge, United Kingdom
Aaron Lun19k wrote:

In your current setup, the sorting factor will have no effect whatsoever. This is because it is redundant with the tissue factor, i.e., for any given level of tissue, all samples are either sorted or unsorted. This means that the groups are unchanged (i.e., they are not split into subgroups) regardless of whether sorting is used.

Adding the sorting factor was only necessary in my answer to your previous question because I had assumed that the two control groups had the same "root" level, thus requiring an additional factor to separate samples from the sorted/unsorted groups. Here, the two groups already have distinct levels (control_wholeRoot and control_sorted) so there is no need for an extra factor to distinguish them.

P.S. I would throw out the smallest library if I were you. Having a sample that has an order of magnitude less coverage than every other sample is usually symptomatic of a problem in library preparation or sequencing.

ADD COMMENTlink modified 11 months ago • written 11 months ago by Aaron Lun19k
gravatar for elineverbon
11 months ago by
Utrecht University, the Netherlands
elineverbon20 wrote:

Ah, okay. That clears things up, thank you! Thank you for making me write out the design like this.

Does that mean that if I would want to correct for the sorting effect, I would have to do it by 'hand' - as in, by making a script for it myself? (As mentioned in the previous post, in that case I would determine how genes are affected by sorting by using EdgeR again, but this time not to compare treated vs not treated, but sorted vs not sorted. So, I would calculate the FC of the genes by comparing the sorted vs unsorted of the nontreated controls to correct all the nontreated cell type samples and the sorted vs unsorted treated controls to correct all the treated cell types.)

Or would it be enough to name all controls the same (eg simply 'control'). In that way the sorting factor would change the groups. However, I am now unsure whether that will also change anything for the cell types (which are also in the 'sorting' group) or whether it will only adjust things in the control group.

Because ultimately, I want the program to use the controls to determine the effect of sorting and to use that information to adjust counts / fold changes found in the cell type samples. 

Thanks also for the PS, it is a good point. That sample does have the same number of genes with a cpm > 2, though, that's why we kept it in. With that in mind, would you still throw it?

ADD COMMENTlink modified 11 months ago • written 11 months ago by elineverbon20

For future reference, replies to existing answers should be added using "add comment", rather than using "add your answer", unless you're actually answering your own question.

Anyway, there is nothing extra you need to do to correct for the sorting effect when testing for the treatment effect. This is because the sorting effect is implicitly encoded in the tissue factor. When you compare between treated and untreated samples within each tissue, you are already accounting for the sorting effect; any given level of tissue is either sorted or it isn't, there are no levels that have both sorted and unsorted samples.

I don't really understand what you're trying to do with the sorting comparisons. What exactly are you correcting for? You don't have unsorted cell type groups, so there's nothing to correct here. There is only one sorting effect you can measure, and that's between the sorted and unsorted control samples. Even if you did calculate the sorting-induced log-fold change - so what?

As for the low-depth sample; if it is a clear outlier on a MDS plot, I would definitely throw it out.

ADD REPLYlink written 11 months ago by Aaron Lun19k
gravatar for elineverbon
11 months ago by
Utrecht University, the Netherlands
elineverbon20 wrote:

- empty post, don't know how to remove it -

ADD COMMENTlink modified 11 months ago • written 11 months ago by elineverbon20
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 443 users visited in the last hour