edgeR GLM : family based experimental design
0
0
Entering edit mode
@gianfilippo-coppola-5764
Last seen 9.6 years ago
Hi, I would be grateful for any comments/criticism/suggestions on my approach to the following datasets. I have RNA-seq data from four families, with incomplete genotype. One member of each family is a proband. I am interested in Proband vs Control kind on analysis. I am using the latest release of edgeR. ===================================================== Samples structure as follow: Family 1 (COMPLETE) Father, Mother, Sibling (Male), Proband Family 2 (COMPLETE) Father, Mother, Proband Family 3 (INCOMPLETE) Sibling (Female), Proband Family 4 (INCOMPLETE) Mother, Proband ===================================================== FIRST CASE In the first batch of data, I have RNA-seq from three clones each individual. That makes all together 33 samples. Family 1 and Family 2 were processed in two different flow cells. Most of family 3 and 4 were together on the same flow cell. Therefore there is overlap between flow cells (or batches) and families. No females proband, so likely some sex bias. I added all the clones, and ended up with 11 samples, 7 normal controls and 4 probands. Samples are obviously not independent. Applied some filtering : keep <- rowSums( cpm(data) > CUTOFF ) >= 4 Family <- c(1,1,1,1,2,2,2,3,3,4,4) Group <- c("C","C","C","P","C","C","P","C","P","C","P") Defined a design matrix to account for family blocks: design <- model.matrix(~Family+Group) and ran the analysis data <- estimateGLMCommonDisp(data, design) d.glmfit <- glmFit(data,design,dispersion=data$common.dispersion) de.MPtag <- glmLRT(d.glmfit) data <- estimateGLMTrendedDisp(data, design) d.glmfit <- glmFit(data,design,dispersion=data$trended.dispersion) de.MPtag <- glmLRT(d.glmfit) data <- estimateGLMTagwiseDisp(data, design) d.glmfit <- glmFit(data,design,dispersion=data$tagwise.dispersion) de.MPtag <- glmLRT(d.glmfit) and got my diff expr calls for common, trended and tagwise dispersion estimates. ===================================================== SECOND CASE In the first batch of data, I have RNA-seq from one clones each individual. But cells have been taken at different days of culture: 15, 26, 46. I have incomplete data and family genotypes each day. Here I would also be interested in day by day differences or sort of time course, but I am describing only the Proband vs Control analysis. Day 15 Family 1: Sibling, Proband Family 2: Father, Mother, Proband Day 26 Family 2: Father, Mother, Proband Family 3: Mother, Proband Family 4: Sibling (Female), Proband Day 46 Family 2: Father, Mother, Proband Family 3: Mother, Proband Family 4: Sibling (Female), Proband That makes all together 19 samples, 11 normal controld and 8 probands. Samples are again not independent. Cannot comment on batches at this point. No females proband, so likely some sex bias. Applied some filtering : keep <- rowSums( cpm(data) > CUTOFF ) >= 8 Family <- c(1,1,1,1,2,2,2,3,3,4,4) Day <- c(15,26,46) Group <- c("C","P","C","C","P","C","C","P","C","P","C","P","C","C","P" ,"C","P","C","P") Defined a design matrix to account for family and day blocks: design <- model.matrix(~Family+Day+Group) and ran the analysis data <- estimateGLMCommonDisp(data, design) d.glmfit <- glmFit(data,design,dispersion=data$common.dispersion) de.MPtag <- glmLRT(d.glmfit) data <- estimateGLMTrendedDisp(data, design) d.glmfit <- glmFit(data,design,dispersion=data$trended.dispersion) de.MPtag <- glmLRT(d.glmfit) data <- estimateGLMTagwiseDisp(data, design) d.glmfit <- glmFit(data,design,dispersion=data$tagwise.dispersion) de.MPtag <- glmLRT(d.glmfit) and got my diff expr calls for common, trended and tagwise dispersion estimates. NOTE Here I need to add that I put all together, in this specific case, as I am thinking I will have more power to get DEG. That said, given the I am also expecting biological differences from day to day, I am probably getting only what is common to the different stages, in spite of the Day blocking. ===================================================== QUESTIONS: Is there anything fundamentally wrong in what I did ? Is the design matrix reasonable ? I used two values for CUTOFF: 0.01 and 1. I get quite more DEG with CUTOFF=0.01 than with CUTOFF=1, while I would expect the opposite, since in the first case I am processing about 28000 genes, while in the second case I am processing about 18000 genes. Unless, the many more lowly (or zero) expressed genes affect the dispersion estimate. Is this the case ? Thanks Gianfilippo
edgeR edgeR • 896 views
ADD COMMENT

Login before adding your answer.

Traffic: 712 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6