Based on an initial analysis of a TCGA RNA-Seq dataset(processed like microarrays), i would like to formulate a specific design matrix with limma, in order to perform a 2-factor type analysis. In detail, a subset of my phenoData object:
head(pdat.sel)
sample_type age_at_initial_pathologic_diagnosis msi gender
TCGA.AA.3662 adjacentnormal 80 <NA> f
TCGA.A6.4105 tumor 79 MSS m
TCGA.F4.6463 tumor 51 <NA> m
TCGA.F4.6806 tumor 59 MSS f
TCGA.A6.6650 tumor 69 <NA> f
TCGA.AZ.6600 adjacentnormal 64 <NA> m
colon_type <- paste0(pData(TCGA.RNASeqV2_eset)$sample_type,"_",
pData(TCGA.RNASeqV2_eset)$msi)
table(colon_type)
colon_type
adjacentnormal_MSI adjacentnormal_MSS adjacentnormal_NA tumor_MSI
1 2 11 8
tumor_MSS tumor_NA
30 143
My main contrast of interest, is the tumor_MSI versus tumor_MSS (30 vs 8):
however, as you can see from the above information, a lot of cancer samples are not available (NA marked) for this information, and also a very small number of normal samples has this information (which is anyway irrelevant as i would not like to utilize at all any normal samples).
So, any naive approach like the following, which is first to subset my initial expression set of 195 samples:
ind <- sampleNames(TCGA.RNASeqV2_eset)[TCGA.RNASeqV2_eset$msi%in%c("MSS","MSI")& TCGA.RNASeqV2_eset$sample_type%in%"tumor"]
eset.sel <- TCGA.RNASeqV2_eset[,ind]
dim(eset.sel)
Features Samples
20502 38
mult_con <- factor(paste0(pData(eset.sel)$sample_type,"_",
pData(eset.sel)$msi))
levels(mult_con)
[1] "tumor_MSI" "tumor_MSS"
and then proceed:
age <- pData(eset.sel)$age_at_initial_pathologic_diagnosis
sex <- factor(pData(eset.sel)$gender)
design.2 <- model.matrix(~mult_con+sex+age) # include also these covariates to adjust for age and gender
fit <- lmFit(eset.sel,design)
fit2 <- eBayes(fit,robust=TRUE,trend=TRUE)...
Or subsetting the initial eSet, is inappropriate, and another approach is less erroneous ?
Thank you Gordon for the confirmation. I will probably stick and subset the initial dataset, in order to formulate the above mentioned comparison, than perhaps mess up things with adding arbitary values. Finally, do you believe that any non-specific intensity procedure for removing "unexpressed" genes, should be better performed after i subset my dataset, based on the final number of samples remained ?
Yes, you should subset your features after you have removed the samples that you drop.
To put it another way, the removal of "unexpressed" genes should be performed on the same exact dataset (ie. same samples) that you will be pushing through your differential expression pipeline.
Thank you Steve for your answer !!