Dear Community,
i would like to run in R with edgeR, some multiple conditions/contrasts, regarding different cancer subtypes, versus the control healthy tissues group of samples. The format of my Summarized experiment and my phenodata object are the following:
coad_clear
class: RangedSummarizedExperiment
dim: 56963 497
metadata(1): data_release
assays(1): HTSeq - Counts
rownames(56963): ENSG00000000003 ENSG00000000005 ... ENSG00000281912
ENSG00000281920
rowData names(3): ensembl_gene_id external_gene_name
original_ensembl_gene_id
colnames(497): TCGA-3L-AA1B-01A-11R-A37K-07
TCGA-DM-A1D8-01A-11R-A155-07 ... TCGA-AZ-6601-11A-01R-1774-07
TCGA-AA-3511-11A-01R-1839-07
colData names(101): sample patient ...
subtype_vascular_invasion_present subtype_vital_status
pdat.sel <- SummarizedExperiment::colData(coad_clear)[,c("definition",
+ "subtype_expression_subtype")]
> head(pdat.sel)
DataFrame with 6 rows and 2 columns
definition subtype_expression_subtype
<character> <factor>
TCGA-3L-AA1B-01A-11R-A37K-07 Primary solid Tumor NA
TCGA-DM-A1D8-01A-11R-A155-07 Primary solid Tumor NA
TCGA-AU-6004-01A-11R-1723-07 Primary solid Tumor NA
TCGA-T9-A92H-01A-11R-A37K-07 Primary solid Tumor NA
TCGA-AA-A01T-01A-21R-A16W-07 Primary solid Tumor NA
TCGA-CK-5913-01A-11R-1653-07 Primary solid Tumor NA
y <- DGEList(counts=assay(coad_clear))
My main issue before moving for downstream analysis, is about the experimental conditions below:
plyr::count(pdat.sel$definition)
x freq
1 Primary solid Tumor 456
2 Solid Tissue Normal 41
plyr::count(pdat.sel$subtype_expression_subtype)
x freq
1 CIN 55
2 Invasive 36
3 MSI/CIMP 58
4 NA 49
5 <NA> 299
So except actual NA values, i have also "NA" strings for the factor concerning the molecular subtypes.
My goal, is essentially to compare the 3 differerent cancer subtypes (CIN,Invasive and MSI/CIMP) with the control samples.
How i could proceed and merge the 2 columns of information ?
Any suggestion or idea would be grateful !!

Dear Aaron,
nice to hear from you again and thank you for your detailed considerations on this matter !! Actually, i could not agree more, especially for the option 3 of your suggestions, concerning especially the cancer samples. I understand that any NA values in any levels in the design matrix would cause the sequences you have described.
Firstly, regarding the part of normal samples:
ind <- colnames(coad_clear)[coad_clear$definition%in%("Solid Tissue Normal")] length(ind) [1] 41 dd <- coad_clear[,ind] pdat.sel <- SummarizedExperiment::colData(dd)[,c("definition","subtype_expression_subtype")] head(pdat.sel) DataFrame with 6 rows and 2 columns definition subtype_expression_subtype <character> <factor> TCGA-AA-3663-11A-01R-1723-07 Solid Tissue Normal NA TCGA-AZ-6605-11A-01R-1839-07 Solid Tissue Normal NA TCGA-AA-3531-11A-01R-A32Z-07 Solid Tissue Normal NA TCGA-A6-2682-11A-01R-A32Z-07 Solid Tissue Normal NA TCGA-AA-3517-11A-01R-A32Z-07 Solid Tissue Normal NA TCGA-AA-3525-11A-01R-A32Z-07 Solid Tissue Normal NA plyr::count(pdat.sel$subtype_expression_subtype) x freq 1 <NA> 41Thus, all the solid tissue normal samples, have the "description" <NA> in the relative field of the molecular subtype.
Overall, in order to follow the third and most safe option for the subtype comparisons-which i also support due to the fact that the "NA" string in the cancer samples, as also the <NA> fields can't biologically explained, as also they have not any other information-, i initially proceeded with a small subsetting:
norm.ind <- colnames(coad_clear)[coad_clear$definition%in%("Solid Tissue Normal")] subtype.ind <- colnames(coad_clear)[coad_clear$subtype_expression_subtype%in%("CIN","MSI/CIMP","Invasive")] coad_subtype <- coad_clear[,c(norm.ind,subtype.ind)] pdat.sub <- SummarizedExperiment::colData(coad_subtype)[,c("definition", "subtype_expression_subtype")] plyr::count(pdat.sub$definition) x freq 1 Primary solid Tumor 149 2 Solid Tissue Normal 41 plyr::count(pdat.sub$subtype_expression_subtype) x freq 1 CIN 55 2 Invasive 36 3 MSI/CIMP 58 4 <NA> 41 head(pdat.sub) DataFrame with 6 rows and 2 columns definition subtype_expression_subtype <character> <factor> TCGA-AA-3663-11A-01R-1723-07 Solid Tissue Normal NA TCGA-AZ-6605-11A-01R-1839-07 Solid Tissue Normal NA TCGA-AA-3531-11A-01R-A32Z-07 Solid Tissue Normal NA TCGA-A6-2682-11A-01R-A32Z-07 Solid Tissue Normal NA TCGA-AA-3517-11A-01R-A32Z-07 Solid Tissue Normal NA TCGA-AA-3525-11A-01R-A32Z-07 Solid Tissue Normal NAFrom this point, i should proceed for creating a multifactor variable like the folllowing ? after then also renaming the normal samples to remove the NA abbreviations ?
comb.fac <- paste0(pdat.sub$definition,"_",pdat.sub$subtype_expression_subtype) table(comb.fac) comb.fac Primary solid Tumor_CIN Primary solid Tumor_Invasive 55 36 Primary solid Tumor_MSI/CIMP Solid Tissue Normal_NA 58 41 y2 <- DGEList(counts=assay(coad_subtype, group=comb.fac)....Best,
Efstathios
Yes, that seems sensible to me.