Hi, I have 1500 HTA 2.0 affy human microarray CEL files. These file are a combination of treatment, placebo and healthy controls at 4 different times (weeks). They are not paired samples. Hence I started out doing the following - 1- Perform batch QC using Affy power tools 2- Perform sample wise QC 3- Batch processing of arrays and normalization using rma.
eset_norm <- oligo::rma(raw_data, target = "core")
4- Combine all the batches of normalized eset and annotate using pd.hta.2.0
eset_norm_anno <- annotateEset(eset_norm ,pd.hta.2.0, columns = c("PROBEID", "ENTREZID", "SYMBOL", "GENENAME"), multivals = "list")
5- Filter probe id that are related to main gene expression
eset_norm_anno_main <- getMainProbes(eset_norm_anno )
6- Filtered probe ids using manual threshold in order to remove all the low expression probes across samples. 7- Filtered probe ids that were not annotated, i.e which had gene symbol /ID as "NA". Ended up with 34886 probe ids (Started with 70k probes after core rma norm). 8- set up design matrix - I would like to look at a list of differential expressed gene based on treatment and time effect. In order to figure out genes that are expressed deferentially per dosage at a given time point, I set up the following design matrix and develop contrast matrix that will compare DrugA.Dose1.WEEK0 and PLACEBO.WEEK0 and so forth.
design = model.matrix(~0+groups) colnames(design) # colnames(design) # "DrugA.Dose1.WEEK0" "DrugA.Dose1.WEEK6" "DrugA.Dose1.WEEK1" "DrugA.Dose1.WEEK8" # "DrugA.Dose1.WEEK2" "DrugA.Dose2.WEEK0" "DrugA.Dose2.WEEK6" "DrugA.Dose2.WEEK1" # "DrugA.Dose2.WEEK8" "DrugA.Dose2.WEEK2" "CONTROL.WEEK0.HEALTHY" "PLACEBO.WEEK0" # "PLACEBO.WEEK6" "PLACEBO.WEEK1" "PLACEBO.WEEK8" "PLACEBO.WEEK2"
9- Setting up analysis using Limma -
contrast = makeContrasts( Dose1week0=DrugA.Dose1.WEEK0-PLACEBO.WEEK0) data.fit = lmFit(eset_norm_anno_main_final,design) data.fit.con = contrasts.fit(fit = data.fit,contrasts = contrast) data.fit.eb = eBayes(fit = data.fit.con) results <- decideTests(data.fit.eb, adjust.method = "BH", lfc = 1)
Can you please tell me if the design and my process is right ? My ultimate list represents a list of probeids that are mapped to a gene id with statistics and p values associated. Do I have to aggregate the result based on Gene symbols to understand what Genes have been regulated or the result with transcripts just fine to present and compare ?
A quick check will help me very much.