estimating common dispersion and tagwise dispersion.

Question

Removing unused samples when estimating dispersion with EdgeR?

0

Entering edit mode

elise.skottene • 0

@eliseskottene-23234

Last seen 4.5 years ago

Hello,

This is more of a statistics question rather than a code issue. I have performed differential analyses of RNA seq data using the GLM approach in EdgeR. I have the following setup in an experiment using small worms: 4 treatments of a 2 x 2 factorial design:

Treatment 1: high food + predation Treatment 2: high food - predation Treatment 3: low food + predation Treatment 4: low food - predation

All groups are sampled three times during the course of the experiment (20 days).

I also have a separate group that I use for a baseline comparison, called "the reference group". This group is like a control group, which was not physically included in the experiment. Let's say this is a group of nice clean worms that has been in hibernation for 6 months (haha).

In one aspect of my study, I compare all the treatment groups to the reference group. This is fine.

In the second aspect (in the same paper), I compare the treatments to each other, so the reference group is not used in this aspect. My question is: do I need to re-calculate the dispersion, EXCLUDING the reference group data, as I don't use the reference group in the second aspect?

I think that including the reference group can make sense because of the setup of the whole study (sorry for not explaining it in full here, hope it might be possible to understand regardless). I.e. the reference group was part of my study, removing available data does not make biological sense.

However, excluding the reference group (which probably explains A LOT of variation) will more clearly show what actually goes in in the experiment.

Is there a simple answer to this question? I am open to the possibility that this has more than one answer.

Below is the script with all samples included. Please note that this script works fine (though it can be written in a nicer way), and I'm just wondering about the stats here.

Best,

Elise S.

library(edgeR)

genecounts <- read.delim("Trinitygenesall.isoform.counts.matrix", header=T, row.names=1)

y <- DGEList(counts=genecounts)

filtering out lowly expressed genes

keep <- rowSums(cpm(y)>1) >=2 y <- y[keep, , keep.lib.sizes=FALSE]

calculating Normalization factors (performing TMM normalization)

y <- calcNormFactors(y)

the sample file (tab delimited)

targets <- read.table("samples_all.txt", header=T)

stage <- factor(targets$stage, levels=c("AD", "C4", "C5")) day <- factor(targets$day, levels=c("0","2","10","14")) treatment <- factor(targets$treatment, levels=c("HN","HP","LN", "LP", "REF")) dat<-paste(stage, treatment, day, sep=".") design <- model.matrix(~0+dat)

estimating common dispersion and tagwise dispersion.

y <- estimateDisp(y, design)

making contrasts

my.contrasts <- makeContrasts( A1= datC4.LN.2-datC5.REF.0, A2= datC4.LN.2-datC4.HP.2, (...) etc.... , levels=design)

res1 <- glmQLFTest(y, design, contrast=my.contrasts[,"A1"]) write.table((topTags(res1, n=Inf), file="C5_CT0vsRef.txt") (...)

edger • 304 views

ADD COMMENT • link 4.5 years ago elise.skottene • 0