Entering edit mode
gianfilippo coppola
▴
10
@gianfilippo-coppola-5764
Last seen 10.2 years ago
Hi,
I would be grateful for any comments/criticism/suggestions on my
approach to the following datasets.
I have RNA-seq data from four families, with incomplete genotype. One
member of each family is a proband. I am interested in Proband vs
Control kind on analysis. I am using the latest release of edgeR.
=====================================================
Samples structure as follow:
Family 1 (COMPLETE)
Father, Mother, Sibling (Male), Proband
Family 2 (COMPLETE)
Father, Mother, Proband
Family 3 (INCOMPLETE)
Sibling (Female), Proband
Family 4 (INCOMPLETE)
Mother, Proband
=====================================================
FIRST CASE
In the first batch of data, I have RNA-seq from three clones each
individual. That makes all together 33 samples. Family 1 and Family 2
were processed in two different flow cells. Most of family 3 and 4
were together on the same flow cell. Therefore there is overlap
between flow cells (or batches) and families. No females proband, so
likely some sex bias.
I added all the clones, and ended up with 11 samples, 7 normal
controls and 4 probands. Samples are obviously not independent.
Applied some filtering : keep <- rowSums( cpm(data) > CUTOFF ) >= 4
Family <- c(1,1,1,1,2,2,2,3,3,4,4)
Group <- c("C","C","C","P","C","C","P","C","P","C","P")
Defined a design matrix to account for family blocks:
design <- model.matrix(~Family+Group)
and ran the analysis
data <- estimateGLMCommonDisp(data, design)
d.glmfit <- glmFit(data,design,dispersion=data$common.dispersion)
de.MPtag <- glmLRT(d.glmfit)
data <- estimateGLMTrendedDisp(data, design)
d.glmfit <- glmFit(data,design,dispersion=data$trended.dispersion)
de.MPtag <- glmLRT(d.glmfit)
data <- estimateGLMTagwiseDisp(data, design)
d.glmfit <- glmFit(data,design,dispersion=data$tagwise.dispersion)
de.MPtag <- glmLRT(d.glmfit)
and got my diff expr calls for common, trended and tagwise dispersion
estimates.
=====================================================
SECOND CASE
In the first batch of data, I have RNA-seq from one clones each
individual. But cells have been taken at different days of culture:
15, 26, 46. I have incomplete data and family genotypes each day. Here
I would also be interested in day by day differences or sort of time
course, but I am describing only the Proband vs Control analysis.
Day 15
Family 1: Sibling, Proband
Family 2: Father, Mother, Proband
Day 26
Family 2: Father, Mother, Proband
Family 3: Mother, Proband
Family 4: Sibling (Female), Proband
Day 46
Family 2: Father, Mother, Proband
Family 3: Mother, Proband
Family 4: Sibling (Female), Proband
That makes all together 19 samples, 11 normal controld and 8 probands.
Samples are again not independent. Cannot comment on batches at this
point. No females proband, so likely some sex bias.
Applied some filtering : keep <- rowSums( cpm(data) > CUTOFF ) >= 8
Family <- c(1,1,1,1,2,2,2,3,3,4,4)
Day <- c(15,26,46)
Group <- c("C","P","C","C","P","C","C","P","C","P","C","P","C","C","P"
,"C","P","C","P")
Defined a design matrix to account for family and day blocks:
design <- model.matrix(~Family+Day+Group)
and ran the analysis
data <- estimateGLMCommonDisp(data, design)
d.glmfit <- glmFit(data,design,dispersion=data$common.dispersion)
de.MPtag <- glmLRT(d.glmfit)
data <- estimateGLMTrendedDisp(data, design)
d.glmfit <- glmFit(data,design,dispersion=data$trended.dispersion)
de.MPtag <- glmLRT(d.glmfit)
data <- estimateGLMTagwiseDisp(data, design)
d.glmfit <- glmFit(data,design,dispersion=data$tagwise.dispersion)
de.MPtag <- glmLRT(d.glmfit)
and got my diff expr calls for common, trended and tagwise dispersion
estimates.
NOTE
Here I need to add that I put all together, in this specific case, as
I am thinking I will have more power to get DEG. That said, given the
I am also expecting biological differences from day to day, I am
probably getting only what is common to the different stages, in spite
of the Day blocking.
=====================================================
QUESTIONS:
Is there anything fundamentally wrong in what I did ?
Is the design matrix reasonable ?
I used two values for CUTOFF: 0.01 and 1. I get quite more DEG with
CUTOFF=0.01 than with CUTOFF=1, while I would expect the opposite,
since in the first case I am processing about 28000 genes, while in
the second case I am processing about 18000 genes. Unless, the many
more lowly (or zero) expressed genes affect the dispersion estimate.
Is this the case ?
Thanks
Gianfilippo