Based on an initial analysis of a TCGA RNA-Seq dataset(processed like microarrays), i would like to formulate a specific design matrix with limma, in order to perform a 2-factor type analysis. In detail, a subset of my phenoData object:
sample_type age_at_initial_pathologic_diagnosis msi gender
TCGA.AA.3662 adjacentnormal 80 <NA> f
TCGA.A6.4105 tumor 79 MSS m
TCGA.F4.6463 tumor 51 <NA> m
TCGA.F4.6806 tumor 59 MSS f
TCGA.A6.6650 tumor 69 <NA> f
TCGA.AZ.6600 adjacentnormal 64 <NA> m
colon_type <- paste0(pData(TCGA.RNASeqV2_eset)$sample_type,"_",
adjacentnormal_MSI adjacentnormal_MSS adjacentnormal_NA tumor_MSI
1 2 11 8
My main contrast of interest, is the tumor_MSI versus tumor_MSS (30 vs 8):
however, as you can see from the above information, a lot of cancer samples are not available (NA marked) for this information, and also a very small number of normal samples has this information (which is anyway irrelevant as i would not like to utilize at all any normal samples).
So, any naive approach like the following, which is first to subset my initial expression set of 195 samples:
ind <- sampleNames(TCGA.RNASeqV2_eset)[TCGA.RNASeqV2_eset$msi%in%c("MSS","MSI")& TCGA.RNASeqV2_eset$sample_type%in%"tumor"]
eset.sel <- TCGA.RNASeqV2_eset[,ind]
mult_con <- factor(paste0(pData(eset.sel)$sample_type,"_",
 "tumor_MSI" "tumor_MSS"
and then proceed:
age <- pData(eset.sel)$age_at_initial_pathologic_diagnosis
sex <- factor(pData(eset.sel)$gender)
design.2 <- model.matrix(~mult_con+sex+age) # include also these covariates to adjust for age and gender
fit <- lmFit(eset.sel,design)
fit2 <- eBayes(fit,robust=TRUE,trend=TRUE)...
Or subsetting the initial eSet, is inappropriate, and another approach is less erroneous ?