Dear Community,
i would like to run in R with edgeR, some multiple conditions/contrasts, regarding different cancer subtypes, versus the control healthy tissues group of samples. The format of my Summarized experiment and my phenodata object are the following:
coad_clear class: RangedSummarizedExperiment dim: 56963 497 metadata(1): data_release assays(1): HTSeq - Counts rownames(56963): ENSG00000000003 ENSG00000000005 ... ENSG00000281912 ENSG00000281920 rowData names(3): ensembl_gene_id external_gene_name original_ensembl_gene_id colnames(497): TCGA-3L-AA1B-01A-11R-A37K-07 TCGA-DM-A1D8-01A-11R-A155-07 ... TCGA-AZ-6601-11A-01R-1774-07 TCGA-AA-3511-11A-01R-1839-07 colData names(101): sample patient ... subtype_vascular_invasion_present subtype_vital_status pdat.sel <- SummarizedExperiment::colData(coad_clear)[,c("definition", + "subtype_expression_subtype")] > head(pdat.sel) DataFrame with 6 rows and 2 columns definition subtype_expression_subtype <character> <factor> TCGA-3L-AA1B-01A-11R-A37K-07 Primary solid Tumor NA TCGA-DM-A1D8-01A-11R-A155-07 Primary solid Tumor NA TCGA-AU-6004-01A-11R-1723-07 Primary solid Tumor NA TCGA-T9-A92H-01A-11R-A37K-07 Primary solid Tumor NA TCGA-AA-A01T-01A-21R-A16W-07 Primary solid Tumor NA TCGA-CK-5913-01A-11R-1653-07 Primary solid Tumor NA y <- DGEList(counts=assay(coad_clear)) My main issue before moving for downstream analysis, is about the experimental conditions below: plyr::count(pdat.sel$definition) x freq 1 Primary solid Tumor 456 2 Solid Tissue Normal 41 plyr::count(pdat.sel$subtype_expression_subtype) x freq 1 CIN 55 2 Invasive 36 3 MSI/CIMP 58 4 NA 49 5 <NA> 299
So except actual NA values, i have also "NA" strings for the factor concerning the molecular subtypes.
My goal, is essentially to compare the 3 differerent cancer subtypes (CIN,Invasive and MSI/CIMP) with the control samples.
How i could proceed and merge the 2 columns of information ?
Any suggestion or idea would be grateful !!
Dear Aaron,
nice to hear from you again and thank you for your detailed considerations on this matter !! Actually, i could not agree more, especially for the option 3 of your suggestions, concerning especially the cancer samples. I understand that any NA values in any levels in the design matrix would cause the sequences you have described.
Firstly, regarding the part of normal samples:
Thus, all the solid tissue normal samples, have the "description" <NA> in the relative field of the molecular subtype.
Overall, in order to follow the third and most safe option for the subtype comparisons-which i also support due to the fact that the "NA" string in the cancer samples, as also the <NA> fields can't biologically explained, as also they have not any other information-, i initially proceeded with a small subsetting:
From this point, i should proceed for creating a multifactor variable like the folllowing ? after then also renaming the normal samples to remove the NA abbreviations ?
Best,
Efstathios
Yes, that seems sensible to me.