Question

Appropriate design matrix set up in limma for 2-factor approches with NA values for specific levels

0

Entering edit mode

svlachavas ▴ 830

@svlachavas-7225

Last seen 13 months ago

Germany/Heidelberg/German Cancer Resear…

Based on an initial analysis of a TCGA RNA-Seq dataset(processed like microarrays), i would like to formulate a specific design matrix with limma, in order to perform a 2-factor type analysis. In detail, a subset of my phenoData object:

head(pdat.sel) sample_type age_at_initial_pathologic_diagnosis msi gender TCGA.AA.3662 adjacentnormal 80 <NA> f TCGA.A6.4105 tumor 79 MSS m TCGA.F4.6463 tumor 51 <NA> m TCGA.F4.6806 tumor 59 MSS f TCGA.A6.6650 tumor 69 <NA> f TCGA.AZ.6600 adjacentnormal 64 <NA> m

colon_type <- paste0(pData(TCGA.RNASeqV2_eset)$sample_type,"_",

pData(TCGA.RNASeqV2_eset)$msi) table(colon_type)

colon_type adjacentnormal_MSI adjacentnormal_MSS adjacentnormal_NA tumor_MSI 1 2 11 8 tumor_MSS tumor_NA 30 143

My main contrast of interest, is the tumor_MSI versus tumor_MSS (30 vs 8):

however, as you can see from the above information, a lot of cancer samples are not available (NA marked) for this information, and also a very small number of normal samples has this information (which is anyway irrelevant as i would not like to utilize at all any normal samples).

So, any naive approach like the following, which is first to subset my initial expression set of 195 samples:

ind <- sampleNames(TCGA.RNASeqV2_eset)[TCGA.RNASeqV2_eset$msi%in%c("MSS","MSI")& TCGA.RNASeqV2_eset$sample_type%in%"tumor"]

eset.sel <- TCGA.RNASeqV2_eset[,ind]

dim(eset.sel) Features Samples 20502 38

mult_con <- factor(paste0(pData(eset.sel)$sample_type,"_", pData(eset.sel)$msi))

levels(mult_con) [1] "tumor_MSI" "tumor_MSS"

and then proceed:

age <- pData(eset.sel)$age_at_initial_pathologic_diagnosis sex <- factor(pData(eset.sel)$gender)

design.2 <- model.matrix(~mult_con+sex+age) # include also these covariates to adjust for age and gender

fit <- lmFit(eset.sel,design) fit2 <- eBayes(fit,robust=TRUE,trend=TRUE)...

Or subsetting the initial eSet, is inappropriate, and another approach is less erroneous ?

limma multiple factor design NA values design and contrast matrix • 1.7k views

ADD COMMENT • link updated 7.8 years ago by Gordon Smyth 51k • written 7.8 years ago by svlachavas ▴ 830

score 2 · Answer 1 · 2017-02-16

2

Entering edit mode

Gordon Smyth 51k

@gordon-smyth

Last seen 3 hours ago

WEHI, Melbourne, Australia

As you probably already know, you can't have NAs in the columns of the design matrix. So you can either subset the data to those samples for which MSI status is known, or you can code the NA values of MSI status to some new arbitrary value. Either approach could be ok and neither is "erroneous". Subsetting is probably the easiest. It's up to you really.

ADD COMMENT • link 7.8 years ago Gordon Smyth 51k

0

Entering edit mode

Thank you Gordon for the confirmation. I will probably stick and subset the initial dataset, in order to formulate the above mentioned comparison, than perhaps mess up things with adding arbitary values. Finally, do you believe that any non-specific intensity procedure for removing "unexpressed" genes, should be better performed after i subset my dataset, based on the final number of samples remained ?

ADD REPLY • link 7.8 years ago svlachavas ▴ 830

1

Entering edit mode

Yes, you should subset your features after you have removed the samples that you drop.

To put it another way, the removal of "unexpressed" genes should be performed on the same exact dataset (ie. same samples) that you will be pushing through your differential expression pipeline.