Question: edgeR multifactorial analysis implementation with RNA-Seq data
0
19 months ago by
svlachavas740
Greece/Athens/National Hellenic Research Foundation
svlachavas740 wrote:

Dear Community,

i would like to run in R with edgeR, some multiple conditions/contrasts, regarding different cancer subtypes, versus the control healthy tissues group of samples. The format of my Summarized experiment and my phenodata object are the following:

 coad_clear
class: RangedSummarizedExperiment
dim: 56963 497
assays(1): HTSeq - Counts
rownames(56963): ENSG00000000003 ENSG00000000005 ... ENSG00000281912
ENSG00000281920
rowData names(3): ensembl_gene_id external_gene_name
original_ensembl_gene_id
colnames(497): TCGA-3L-AA1B-01A-11R-A37K-07
TCGA-DM-A1D8-01A-11R-A155-07 ... TCGA-AZ-6601-11A-01R-1774-07
TCGA-AA-3511-11A-01R-1839-07
colData names(101): sample patient ...
subtype_vascular_invasion_present subtype_vital_status

+ "subtype_expression_subtype")]
DataFrame with 6 rows and 2 columns
definition subtype_expression_subtype
<character>                   <factor>
TCGA-3L-AA1B-01A-11R-A37K-07 Primary solid Tumor                         NA
TCGA-DM-A1D8-01A-11R-A155-07 Primary solid Tumor                         NA
TCGA-AU-6004-01A-11R-1723-07 Primary solid Tumor                         NA
TCGA-T9-A92H-01A-11R-A37K-07 Primary solid Tumor                         NA
TCGA-AA-A01T-01A-21R-A16W-07 Primary solid Tumor                         NA
TCGA-CK-5913-01A-11R-1653-07 Primary solid Tumor                         NA

My main issue before moving for downstream analysis, is  about the experimental conditions below:

plyr::count(pdat.sel$definition) x freq 1 Primary solid Tumor 456 2 Solid Tissue Normal 41 plyr::count(pdat.sel$subtype_expression_subtype)

x freq
1      CIN   55
2 Invasive   36
3 MSI/CIMP   58
4       NA   49
5     <NA>  299

So except actual NA values, i have also "NA" strings for the factor concerning the molecular subtypes.

My goal, is essentially to compare the 3 differerent cancer subtypes (CIN,Invasive and MSI/CIMP) with the control samples.

How i could proceed and merge the 2 columns of information ?

Any suggestion or idea would be grateful !!

modified 19 months ago by Aaron Lun25k • written 19 months ago by svlachavas740
Answer: edgeR multifactorial analysis implementation with RNA-Seq data
2
19 months ago by
Aaron Lun25k
Cambridge, United Kingdom
Aaron Lun25k wrote:

In general, if you have missing values for a factor, the corresponding samples cannot be modelled. model.matrix, for example, just pretends that the samples with NA values don't exist:

A <- c(1,2,3, NA, 5, 6)
model.matrix(~A)
##   (Intercept) A
## 1           1 1
## 2           1 2
## 3           1 3
## 5           1 5
## 6           1 6
## attr(,"assign")
## [1] 0 1


... which is fair enough, because what do you expect it to do?

In your case, at least some of your NA values would probably correspond to the healthy samples, which is okay as there shouldn't be any sensible cancer subtype for healthy tissue. I would paste the definition and subtype_expression_subtype factors together and rename all the healthy samples to get rid of their NAs.

For the cancer samples, there are several options:

1. Treat all NA subtypes as a single group. This is probably unwise, though it depends on the context. I find it unlikely that all samples with missing labels originate from a single unnamed subtype - if that subtype was so common, it would have been named!
2. Same as (1), but run RUVSeq or sva to account for hidden factors of variation in the NA group. This would allow you to use the NA samples in the model without inflating the dispersion estimates. However, even if you can now use all of the samples, I would struggle to interpret DE analyses between the NA group and the other subtypes or the healthy controls, in the likely case that the NA group is an aggregation of lots of unknown subtypes.
3. Discard all samples with NA subtypes (and presumably also the "NA"s, unless this stands for something biologically relevant). This is probably the safest option. Yes, you are discarding a lot of data, but these samples are useless without proper labels.

Dear Aaron,

nice to hear from you again and thank you for your detailed considerations on this matter !! Actually, i could not agree more, especially for the option 3 of your suggestions, concerning especially the cancer samples. I understand that any NA values in any levels in the design matrix would cause the sequences you have described.

Firstly, regarding the part of normal samples:

ind <- colnames(coad_clear)[coad_clear$definition%in%("Solid Tissue Normal")] length(ind) [1] 41 dd <- coad_clear[,ind] pdat.sel <- SummarizedExperiment::colData(dd)[,c("definition","subtype_expression_subtype")] head(pdat.sel) DataFrame with 6 rows and 2 columns definition subtype_expression_subtype <character> <factor> TCGA-AA-3663-11A-01R-1723-07 Solid Tissue Normal NA TCGA-AZ-6605-11A-01R-1839-07 Solid Tissue Normal NA TCGA-AA-3531-11A-01R-A32Z-07 Solid Tissue Normal NA TCGA-A6-2682-11A-01R-A32Z-07 Solid Tissue Normal NA TCGA-AA-3517-11A-01R-A32Z-07 Solid Tissue Normal NA TCGA-AA-3525-11A-01R-A32Z-07 Solid Tissue Normal NA plyr::count(pdat.sel$subtype_expression_subtype)
x freq
1 <NA>   41

Thus, all the solid tissue normal samples, have the "description" <NA> in the relative field of the molecular subtype.

Overall, in order to follow the third and most safe option for the subtype comparisons-which i also support due to the fact that the "NA" string in the cancer samples, as also the <NA> fields can't biologically explained, as also they have not any other information-, i initially proceeded with a small subsetting:

 norm.ind <- colnames(coad_clear)[coad_clear$definition%in%("Solid Tissue Normal")] subtype.ind <- colnames(coad_clear)[coad_clear$subtype_expression_subtype%in%("CIN","MSI/CIMP","Invasive")]

plyr::count(pdat.sub$definition) x freq 1 Primary solid Tumor 149 2 Solid Tissue Normal 41 plyr::count(pdat.sub$subtype_expression_subtype)
x freq
1      CIN   55
2 Invasive   36
3 MSI/CIMP   58
4     <NA>   41

DataFrame with 6 rows and 2 columns
definition subtype_expression_subtype
<character>                   <factor>
TCGA-AA-3663-11A-01R-1723-07 Solid Tissue Normal                         NA
TCGA-AZ-6605-11A-01R-1839-07 Solid Tissue Normal                         NA
TCGA-AA-3531-11A-01R-A32Z-07 Solid Tissue Normal                         NA
TCGA-A6-2682-11A-01R-A32Z-07 Solid Tissue Normal                         NA
TCGA-AA-3517-11A-01R-A32Z-07 Solid Tissue Normal                         NA
TCGA-AA-3525-11A-01R-A32Z-07 Solid Tissue Normal                         NA

From this point,  i should proceed for creating a multifactor variable like the folllowing ? after then also renaming the normal samples to remove the NA abbreviations ?

comb.fac <- paste0(pdat.sub$definition,"_",pdat.sub$subtype_expression_subtype)
table(comb.fac)
comb.fac
Primary solid Tumor_CIN Primary solid Tumor_Invasive
55                           36
Primary solid Tumor_MSI/CIMP       Solid Tissue Normal_NA
58                           41

group=comb.fac)....

Best,

Efstathios

1

Yes, that seems sensible to me.