Question

Adjusting for procedural effect

0

Entering edit mode

elineverbon ▴ 20

@elineverbon-11768

Last seen 8.5 years ago

Utrecht University, the Netherlands

Dear,

This is a follow-up on a previous question (to be found here: C: Adjusting for procedural effect and deciding on statistical test )

I am trying to analyze an RNA seq dataset with EdgeR. I have 44 different samples, coming from two controls (both the full organ, namely a plant root) and five different cell types. I have 6 samples for all but one of the controls, for which I have 8 samples. Half of the samples is treated, the other is not. We are interested in the effect of this treatment.

To get the individual cell types, we had to sort our protoplasts. This means that all the cell type samples are sorted. We also sorted one of the controls (the one with 8 samples), while we did not sort the other control. We hope to be able to use these controls to correct for the sorting effect when analyzing our data for treatment-induced effects with EdgeR. To summarize, this is our experimental design:

tissue <- c(rep("control_wholeRoot",6),rep("control_sorted", 8), rep(c("type1", "type2", "type3", "type4", "type5"), each=6))
sorting <- c(rep(c("unsorted", rep("sorted", 6)), each=6), c("sorted", "sorted"))
treatment <- c(rep(c("exposed", "unexposed"), each = 3), rep(c("exposed", "unexposed"), each = 4), rep(rep(c("exposed", "unexposed"), each = 3), 5))

group <- factor(paste0(tissue, ".", sorting, ".", treatment))
design <- model.matrix(~0 + group)

This design is used to construct the fit object from the digital gene expression list (DGEList) that I made from my counts table.

#make the DGEList
geneCounts <- geneFromTranscripts$counts
normMat <- geneFromTranscripts$length
o <- log(calcNormFactors(geneCounts/normMat)) + log(colSums(geneCounts/normMat))
y <- DGEList(geneCounts)
y$offset <- t(t(log(normMat)) + o)

#filter out lowly expressed genes
#smallest libraries are 461161, then 3 million. A cpm of >2 means >6 reads in all samples but the smallest one.
keep <- rowSums(cpm(y)>2) >= 3
y <- y[keep, , keep.lib.sizes=FALSE]

y <- calcNormFactors(y)

#fit the data
y <- estimateDisp(y,design)
fit <- glmQLFit(y,design)

I then make contrasts to compare the treated vs the non-treated sample of each sample type (each of the two controls and each of the 5 cell types). This gives 7 lists of all genes with the FC in the specific sample due to the treatment.

We want to check whether including 'sorting' in our design matrix actually does anything. To check this, we reran the script, but without the 'sorting' in 'group'. We then made a scatterplot of FC of the genes in the unsorted control vs the FC of the same genes in the sorted control from both the resulting datasets.

We expected the scatterplot made from the dataset with 'sorting' in 'group' to show points that fall more along the identity line (and resulted in a larger R squared) then the scatterplot made from the dataset without 'sorting' in 'group'.

However, we got two graphs that are exactly the same. Are we missing something here and should we do a different check or is it indeed bad that the two graphs are the same?

Any help would be highly appreciated,
Eline Verbon and Ronnie de Jonge

edger • 1.5k views

ADD COMMENT • link 8.5 years ago elineverbon ▴ 20

score 2 · Answer 1 · 2017-07-17

In your current setup, the sorting factor will have no effect whatsoever. This is because it is redundant with the tissue factor, i.e., for any given level of tissue, all samples are either sorted or unsorted. This means that the groups are unchanged (i.e., they are not split into subgroups) regardless of whether sorting is used.

Adding the sorting factor was only necessary in my answer to your previous question because I had assumed that the two control groups had the same "root" level, thus requiring an additional factor to separate samples from the sorted/unsorted groups. Here, the two groups already have distinct levels (control_wholeRoot and control_sorted) so there is no need for an extra factor to distinguish them.

P.S. I would throw out the smallest library if I were you. Having a sample that has an order of magnitude less coverage than every other sample is usually symptomatic of a problem in library preparation or sequencing.

score 0 · Answer 2 · 2017-07-17

Ah, okay. That clears things up, thank you! Thank you for making me write out the design like this.

Does that mean that if I would want to correct for the sorting effect, I would have to do it by 'hand' - as in, by making a script for it myself? (As mentioned in the previous post, in that case I would determine how genes are affected by sorting by using EdgeR again, but this time not to compare treated vs not treated, but sorted vs not sorted. So, I would calculate the FC of the genes by comparing the sorted vs unsorted of the nontreated controls to correct all the nontreated cell type samples and the sorted vs unsorted treated controls to correct all the treated cell types.)

Or would it be enough to name all controls the same (eg simply 'control'). In that way the sorting factor would change the groups. However, I am now unsure whether that will also change anything for the cell types (which are also in the 'sorting' group) or whether it will only adjust things in the control group.

Because ultimately, I want the program to use the controls to determine the effect of sorting and to use that information to adjust counts / fold changes found in the cell type samples.

Thanks also for the PS, it is a good point. That sample does have the same number of genes with a cpm > 2, though, that's why we kept it in. With that in mind, would you still throw it?

score 0 · Answer 3 · 2017-07-17

0