Dear Bioconductor community,
i will just refer to one of my previous posts as it is the basis of my new questions (Merge different datasets and perform differential expression analysis in limma) -in summary, regarding the merged dataset i have created, combining two colorectal cancer microarray affymetrix datasets[refering to the comparison of the patients cancer samples vs adjucent controls], i recently acquired important phenotypic information, regarding the anatomic location of the tumor of each patient. In detail,
pData(eset_COMBAT)
Disease Location Meta_factor Study
St_1_WL57.CEL Normal sigmoid_colon 0 hgu133plus2
St_2_WL57.CEL Cancer sigmoid_colon 0 hgu133plus2
St_N_EC59.CEL Normal sigmoid_colon 0 hgu133plus2
St_T_EC59.CEL Cancer sigmoid_colon 0 hgu133plus2
St_N_EJ58.CEL Normal cecum 0 hgu133plus2
St_T_EJ58.CEL Cancer cecum 0 hgu133plus2
St_N_FD58.CEL Normal rectum 0 hgu133plus2
St_T_FD58.CEL Cancer rectum 0 hgu133plus2
St_N_FG39.CEL Normal cecum 1 hgu133plus2
St_T_FG39.CEL Cancer cecum 1 hgu133plus2
St_1_GG52.CEL Normal cecum 1 hgu133plus2
St_2_GG52.CEL Cancer cecum 1 hgu133plus2
St_N_HJ33.CEL Normal descending_colon 0 hgu133plus2
St_T_HJ33.CEL Cancer descending_colon 0 hgu133plus2
St_N_HK57.CEL Normal ascending_colon 0 hgu133plus2
St_T_HK57.CEL Cancer ascending_colon 0 hgu133plus2
St_N_KA49.CEL Normal descending_colon 0 hgu133plus2
St_T_KA49.CEL Cancer descending_colon 0 hgu133plus2
St_N_KW40.CEL Normal sigmoid_colon 0 hgu133plus2
St_T_KW40.CEL Cancer sigmoid_colon 0 hgu133plus2
St_1_LH53.CEL Normal descending_colon 0 hgu133plus2
St_2_LH53.CEL Cancer descending_colon 0 hgu133plus2
St_N_LW46.CEL Normal cecum 0 hgu133plus2
St_T_LW46.CEL Cancer cecum 0 hgu133plus2
St_N_NH45.CEL Normal rectosigmoid_colon 0 hgu133plus2
St_T_NH45.CEL Cancer rectosigmoid_colon 0 hgu133plus2
St_N_PH42.CEL Normal ascending_colon 0 hgu133plus2
St_T_PH42.CEL Cancer ascending_colon 0 hgu133plus2
St_1_SSch60.CEL Normal rectum 0 hgu133plus2
St_2_SSch60.CEL Cancer rectum 0 hgu133plus2
St_N_WH37.CEL Normal ascending_colon 0 hgu133plus2
St_T_WH37.CEL Cancer ascending_colon 0 hgu133plus2
St_N_WM60.CEL Normal right_colic_flexure 0 hgu133plus2
St_T_WM60.CEL Cancer right_colic_flexure 0 hgu133plus2
7_Tesch.CEL Normal cecum 0 hgu133a
8_Tesch.CEL Cancer cecum 0 hgu133a
0687_04_Blauth.CEL Normal rectosigmoid_colon 0 hgu133a
0687_04_Blauth_1.CEL Cancer rectosigmoid_colon 0 hgu133a
0863_03_Schmidt.CEL Normal ascending_colon 0 hgu133a
0863_03_Schmidt_1.CEL Cancer ascending_colon 0 hgu133a
0948_04_Leiber.CEL Normal ascending_colon 0 hgu133a
0948_04_Leiber_1.CEL Cancer ascending_colon 0 hgu133a
1043_04_Nagel.CEL Normal right_colic_flexure 0 hgu133a
1043_04_Nagel_1.CEL Cancer right_colic_flexure 0 hgu133a
1103_03_Braun.CEL Normal rectum 1 hgu133a
1103_03_Braun_1.CEL Cancer rectum 1 hgu133a
1234_06_Floersch.CEL Normal sigmoid_colon 0 hgu133a
1234_06_Floersch_1.CEL Cancer sigmoid_colon 0 hgu133a
1235_06_Hey.CEL Normal transverse_colon 0 hgu133a
1235_06_Hey_1.CEL Cancer transverse_colon 0 hgu133a
1236_06_Liebich.CEL Normal cecum 1 hgu133a
... (i dont include the rest in order to be below the word limit-in total 60 paired samples/30 patients)
So,my first important question is the below:
1. If i want just to perform a paired analysis for cancer vs control samples, adjusted for the tumor location, i believe the following is correct:
condition <- factor(eset.2$Disease, levels=c("Normal","Cancer"))
pairs <- factor(rep(1:30, each = 2))
location <- eset_COMBAT$Location
class(location)
[1] "factor"
location
St_1_WL57.CEL St_2_WL57.CEL St_N_EC59.CEL
sigmoid_colon sigmoid_colon sigmoid_colon
St_T_EC59.CEL St_N_EJ58.CEL St_T_EJ58.CEL
sigmoid_colon cecum cecum
St_N_FD58.CEL St_T_FD58.CEL St_N_FG39.CEL
rectum rectum cecum
....
9 Levels: ascending_colon cecum descending_colon ... transverse_colon
and then:
design <- model.matrix(~condition +pairs+location)
fit <- lmFit(eset_COMBAT,design)
but then i get the following message:
Coefficients not estimable: locationcecum locationdescending_colon locationleft_colic_flexure locationrectosigmoid_colon locationrectum locationright_colic_flexure locationsigmoid_colon locationtransverse_colon
Warning message:
Partial NA coefficients for 22277 probe(s)
Any help or suggestions would be essential !!
Dear Mr. Smyth,
firstly thank you for your recommendations-regarding the batch effect between the platforms, i used the guidelines from the inSilicoMerging R package when i merged the two datasets, based on their common probesets, because i wasnt sure about how to adjust for this effect-so your explanation gave me more confidence on this specific issue. On the other hand, regarding the issue with the location information, this leads me to my second very important question:
In detail, as in my current project we are highly motivated in implementing this information in our microarray analysis, as from both biology field and other perspectives, we are interested in identifying-if it is possible and regarding the results- any crusial and interesting gene expression alteration patterns between different anatomic locations of the patiens colorectal tumors involved in our analysis(i.e ascending vs descending colon, which could possibly underline fundamental functional and distinct pathways-groups of genes involved). Thus, i tried in the first place to incorporate in a wrong way this information.
On the other hand, as i searched from the limma vignette[in particular section 9.7, page 48], i thought maybe a similar implementation like the specific tutorial with the multifactor analysis would be correct, in order to perform two possible analyses:
1. To compare specific tumor (locations) vs their adjucent control samples(i.e ascending.tumor vs ascending.control)
2.If it is possible compare different tumor anatomic locations(i.e ascending vs descending tumor samples)
So, if my idea/implementation although naive is correct without any incorrect points, should i proceed like this (?):
combine <- factor(paste(eset_COMBAT$Disease, eset_COMBAT$Location, sep="."))
design <- model.matrix(~0 +combine)
colnames(design) <- levels(combine)
and then to estimate the correlation between the matched samples on the same patient:
corfit <- duplicateCorrelation(eset_COMBAT, design, block=pairs) ?
or regarding the block argument should i firstly incorporate this information in my phenoData object:
eset_COMBAT$Subjects <- pairs
and then corfit <- duplicateCorrelation(eset_COMBAT, design, block=eset_COMBAT$Subjects) ?
AND if the above are fine,
move with fit <- lmFit(eset_COMBAT,design,block=eset_COMBAT$Subjects,correlation=corfit$consensus)
and afterwards define the necessary contrasts with makeContrasts etc ??
To summarize, is my implementation regarding the use of the tumor location correct or it is different from the current tutorial i mentioned ??
Thank you for your patience on this matter
Any suggestions or evaluation regarding
my above answer would be beneficial for my analysis!
Best,
Efstathios