Question: Appropriate design matrix set up in limma for 2-factor approches with NA values for specific levels
gravatar for svlachavas
15 months ago by
Greece/Athens/National Hellenic Research Foundation
svlachavas570 wrote:

Based on an initial analysis of a TCGA RNA-Seq dataset(processed like microarrays), i would like to formulate a specific design matrix with limma, in order to perform a 2-factor type analysis. In detail, a subset of my phenoData object:

                sample_type age_at_initial_pathologic_diagnosis  msi gender
TCGA.AA.3662 adjacentnormal                                  80 <NA>      f
TCGA.A6.4105          tumor                                  79  MSS      m
TCGA.F4.6463          tumor                                  51 <NA>      m
TCGA.F4.6806          tumor                                  59  MSS      f
TCGA.A6.6650          tumor                                  69 <NA>      f
TCGA.AZ.6600 adjacentnormal                                  64 <NA>      m


colon_type <- paste0(pData(TCGA.RNASeqV2_eset)$sample_type,"_",


adjacentnormal_MSI adjacentnormal_MSS  adjacentnormal_NA          tumor_MSI 
                 1                  2                 11                  8 
         tumor_MSS          tumor_NA 
                30                143

My main contrast of interest, is the tumor_MSI versus tumor_MSS (30 vs 8):

however, as you can see from the above information, a lot of cancer samples are not available (NA marked) for this information, and also a very small number of normal samples has this information (which is anyway irrelevant as i would not like to utilize at all any normal samples).

So, any naive approach like the following, which is first to subset my initial expression set of 195 samples:

ind <- sampleNames(TCGA.RNASeqV2_eset)[TCGA.RNASeqV2_eset$msi%in%c("MSS","MSI")& TCGA.RNASeqV2_eset$sample_type%in%"tumor"]

eset.sel <- TCGA.RNASeqV2_eset[,ind]

Features  Samples 
   20502       38 

mult_con <- factor(paste0(pData(eset.sel)$sample_type,"_",

[1] "tumor_MSI" "tumor_MSS"

and then proceed:

age <- pData(eset.sel)$age_at_initial_pathologic_diagnosis
sex <- factor(pData(eset.sel)$gender)

design.2 <- model.matrix(~mult_con+sex+age) # include also these covariates to adjust for age and gender

fit <- lmFit(eset.sel,design) 
fit2 <- eBayes(fit,robust=TRUE,trend=TRUE)...


Or subsetting the initial eSet, is inappropriate, and another approach is less erroneous ?


ADD COMMENTlink modified 15 months ago by Gordon Smyth33k • written 15 months ago by svlachavas570
gravatar for Gordon Smyth
15 months ago by
Gordon Smyth33k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth33k wrote:

As you probably already know, you can't have NAs in the columns of the design matrix. So you can either subset the data to those samples for which MSI status is known, or you can code the NA values of MSI status to some new arbitrary value. Either approach could be ok and neither is "erroneous". Subsetting is probably the easiest. It's up to you really.

ADD COMMENTlink modified 15 months ago • written 15 months ago by Gordon Smyth33k

Thank you Gordon for the confirmation. I will probably stick and subset the initial dataset, in order to formulate the above mentioned comparison, than perhaps mess up things with adding arbitary values. Finally, do you believe that any non-specific intensity procedure for removing "unexpressed" genes, should be better performed after i subset my dataset, based on the final number of samples remained ?

ADD REPLYlink modified 15 months ago • written 15 months ago by svlachavas570

Yes, you should subset your features after you have removed the samples that you drop.

To put it another way, the removal of "unexpressed" genes should be performed on the same exact dataset (ie. same samples) that you will be pushing through your differential expression pipeline.

ADD REPLYlink written 15 months ago by Steve Lianoglou12k

Thank you Steve for your answer !!

ADD REPLYlink written 15 months ago by svlachavas570
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 118 users visited in the last hour