Question

edgeR GLM : family based experimental design

0

Entering edit mode

gianfilippo coppola ▴ 10

@gianfilippo-coppola-5764

Last seen 9.6 years ago

Hi, I would be grateful for any comments/criticism/suggestions on my approach to the following datasets. I have RNA-seq data from four families, with incomplete genotype. One member of each family is a proband. I am interested in Proband vs Control kind on analysis. I am using the latest release of edgeR. ===================================================== Samples structure as follow: Family 1 (COMPLETE) Father, Mother, Sibling (Male), Proband Family 2 (COMPLETE) Father, Mother, Proband Family 3 (INCOMPLETE) Sibling (Female), Proband Family 4 (INCOMPLETE) Mother, Proband ===================================================== FIRST CASE In the first batch of data, I have RNA-seq from three clones each individual. That makes all together 33 samples. Family 1 and Family 2 were processed in two different flow cells. Most of family 3 and 4 were together on the same flow cell. Therefore there is overlap between flow cells (or batches) and families. No females proband, so likely some sex bias. I added all the clones, and ended up with 11 samples, 7 normal controls and 4 probands. Samples are obviously not independent. Applied some filtering : keep <- rowSums( cpm(data) > CUTOFF ) >= 4 Family <- c(1,1,1,1,2,2,2,3,3,4,4) Group <- c("C","C","C","P","C","C","P","C","P","C","P") Defined a design matrix to account for family blocks: design <- model.matrix(~Family+Group) and ran the analysis data <- estimateGLMCommonDisp(data, design) d.glmfit <- glmFit(data,design,dispersion=data$common.dispersion) de.MPtag <- glmLRT(d.glmfit) data <- estimateGLMTrendedDisp(data, design) d.glmfit <- glmFit(data,design,dispersion=data$trended.dispersion) de.MPtag <- glmLRT(d.glmfit) data <- estimateGLMTagwiseDisp(data, design) d.glmfit <- glmFit(data,design,dispersion=data$tagwise.dispersion) de.MPtag <- glmLRT(d.glmfit) and got my diff expr calls for common, trended and tagwise dispersion estimates. ===================================================== SECOND CASE In the first batch of data, I have RNA-seq from one clones each individual. But cells have been taken at different days of culture: 15, 26, 46. I have incomplete data and family genotypes each day. Here I would also be interested in day by day differences or sort of time course, but I am describing only the Proband vs Control analysis. Day 15 Family 1: Sibling, Proband Family 2: Father, Mother, Proband Day 26 Family 2: Father, Mother, Proband Family 3: Mother, Proband Family 4: Sibling (Female), Proband Day 46 Family 2: Father, Mother, Proband Family 3: Mother, Proband Family 4: Sibling (Female), Proband That makes all together 19 samples, 11 normal controld and 8 probands. Samples are again not independent. Cannot comment on batches at this point. No females proband, so likely some sex bias. Applied some filtering : keep <- rowSums( cpm(data) > CUTOFF ) >= 8 Family <- c(1,1,1,1,2,2,2,3,3,4,4) Day <- c(15,26,46) Group <- c("C","P","C","C","P","C","C","P","C","P","C","P","C","C","P" ,"C","P","C","P") Defined a design matrix to account for family and day blocks: design <- model.matrix(~Family+Day+Group) and ran the analysis data <- estimateGLMCommonDisp(data, design) d.glmfit <- glmFit(data,design,dispersion=data$common.dispersion) de.MPtag <- glmLRT(d.glmfit) data <- estimateGLMTrendedDisp(data, design) d.glmfit <- glmFit(data,design,dispersion=data$trended.dispersion) de.MPtag <- glmLRT(d.glmfit) data <- estimateGLMTagwiseDisp(data, design) d.glmfit <- glmFit(data,design,dispersion=data$tagwise.dispersion) de.MPtag <- glmLRT(d.glmfit) and got my diff expr calls for common, trended and tagwise dispersion estimates. NOTE Here I need to add that I put all together, in this specific case, as I am thinking I will have more power to get DEG. That said, given the I am also expecting biological differences from day to day, I am probably getting only what is common to the different stages, in spite of the Day blocking. ===================================================== QUESTIONS: Is there anything fundamentally wrong in what I did ? Is the design matrix reasonable ? I used two values for CUTOFF: 0.01 and 1. I get quite more DEG with CUTOFF=0.01 than with CUTOFF=1, while I would expect the opposite, since in the first case I am processing about 28000 genes, while in the second case I am processing about 18000 genes. Unless, the many more lowly (or zero) expressed genes affect the dispersion estimate. Is this the case ? Thanks Gianfilippo

edgeR edgeR • 896 views

ADD COMMENT • link 11.2 years ago gianfilippo coppola ▴ 10