Question

edgeR multifactorial analysis implementation with RNA-Seq data

0

Entering edit mode

svlachavas ▴ 830

@svlachavas-7225

Last seen 6 months ago

Germany/Heidelberg/German Cancer Resear…

Dear Community,

i would like to run in R with edgeR, some multiple conditions/contrasts, regarding different cancer subtypes, versus the control healthy tissues group of samples. The format of my Summarized experiment and my phenodata object are the following:

 coad_clear
class: RangedSummarizedExperiment 
dim: 56963 497 
metadata(1): data_release
assays(1): HTSeq - Counts
rownames(56963): ENSG00000000003 ENSG00000000005 ... ENSG00000281912
  ENSG00000281920
rowData names(3): ensembl_gene_id external_gene_name
  original_ensembl_gene_id
colnames(497): TCGA-3L-AA1B-01A-11R-A37K-07
  TCGA-DM-A1D8-01A-11R-A155-07 ... TCGA-AZ-6601-11A-01R-1774-07
  TCGA-AA-3511-11A-01R-1839-07
colData names(101): sample patient ...
  subtype_vascular_invasion_present subtype_vital_status

 pdat.sel <- SummarizedExperiment::colData(coad_clear)[,c("definition",
+ "subtype_expression_subtype")]
> head(pdat.sel)
DataFrame with 6 rows and 2 columns
                                      definition subtype_expression_subtype
                                     <character>                   <factor>
TCGA-3L-AA1B-01A-11R-A37K-07 Primary solid Tumor                         NA
TCGA-DM-A1D8-01A-11R-A155-07 Primary solid Tumor                         NA
TCGA-AU-6004-01A-11R-1723-07 Primary solid Tumor                         NA
TCGA-T9-A92H-01A-11R-A37K-07 Primary solid Tumor                         NA
TCGA-AA-A01T-01A-21R-A16W-07 Primary solid Tumor                         NA
TCGA-CK-5913-01A-11R-1653-07 Primary solid Tumor                         NA

y <- DGEList(counts=assay(coad_clear))

My main issue before moving for downstream analysis, is  about the experimental conditions below:

 plyr::count(pdat.sel$definition)
                    x freq
1 Primary solid Tumor  456
2 Solid Tissue Normal   41

plyr::count(pdat.sel$subtype_expression_subtype)

         x freq
1      CIN   55
2 Invasive   36
3 MSI/CIMP   58
4       NA   49
5     <NA>  299

So except actual NA values, i have also "NA" strings for the factor concerning the molecular subtypes.

My goal, is essentially to compare the 3 differerent cancer subtypes (CIN,Invasive and MSI/CIMP) with the control samples.

How i could proceed and merge the 2 columns of information ?

Any suggestion or idea would be grateful !!

edger multifactorial design multiple comparisons differential gene expression • 862 views

ADD COMMENT • link updated 6.1 years ago by Aaron Lun ★ 28k • written 6.1 years ago by svlachavas ▴ 830

score 2 · Answer 1 · 2018-03-08

2

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 3 hours ago

The city by the bay

In general, if you have missing values for a factor, the corresponding samples cannot be modelled. model.matrix, for example, just pretends that the samples with NA values don't exist:

A <- c(1,2,3, NA, 5, 6)
model.matrix(~A)
##   (Intercept) A
## 1           1 1
## 2           1 2
## 3           1 3
## 5           1 5
## 6           1 6
## attr(,"assign")
## [1] 0 1

... which is fair enough, because what do you expect it to do?

In your case, at least some of your NA values would probably correspond to the healthy samples, which is okay as there shouldn't be any sensible cancer subtype for healthy tissue. I would paste the definition and subtype_expression_subtype factors together and rename all the healthy samples to get rid of their NAs.

For the cancer samples, there are several options:

Treat all NA subtypes as a single group. This is probably unwise, though it depends on the context. I find it unlikely that all samples with missing labels originate from a single unnamed subtype - if that subtype was so common, it would have been named!
Same as (1), but run RUVSeq or sva to account for hidden factors of variation in the NA group. This would allow you to use the NA samples in the model without inflating the dispersion estimates. However, even if you can now use all of the samples, I would struggle to interpret DE analyses between the NA group and the other subtypes or the healthy controls, in the likely case that the NA group is an aggregation of lots of unknown subtypes.
Discard all samples with NA subtypes (and presumably also the "NA"s, unless this stands for something biologically relevant). This is probably the safest option. Yes, you are discarding a lot of data, but these samples are useless without proper labels.

ADD COMMENT • link 6.1 years ago Aaron Lun ★ 28k

0

Entering edit mode

Dear Aaron,

nice to hear from you again and thank you for your detailed considerations on this matter !! Actually, i could not agree more, especially for the option 3 of your suggestions, concerning especially the cancer samples. I understand that any NA values in any levels in the design matrix would cause the sequences you have described.

Firstly, regarding the part of normal samples:

ind <- colnames(coad_clear)[coad_clear$definition%in%("Solid Tissue Normal")]

length(ind)
[1] 41

dd <- coad_clear[,ind]

pdat.sel <- SummarizedExperiment::colData(dd)[,c("definition","subtype_expression_subtype")]
head(pdat.sel)
DataFrame with 6 rows and 2 columns
                                      definition subtype_expression_subtype
                                     <character>                   <factor>
TCGA-AA-3663-11A-01R-1723-07 Solid Tissue Normal                         NA
TCGA-AZ-6605-11A-01R-1839-07 Solid Tissue Normal                         NA
TCGA-AA-3531-11A-01R-A32Z-07 Solid Tissue Normal                         NA
TCGA-A6-2682-11A-01R-A32Z-07 Solid Tissue Normal                         NA
TCGA-AA-3517-11A-01R-A32Z-07 Solid Tissue Normal                         NA
TCGA-AA-3525-11A-01R-A32Z-07 Solid Tissue Normal                         NA

plyr::count(pdat.sel$subtype_expression_subtype)
     x freq
1 <NA>   41

Thus, all the solid tissue normal samples, have the "description" <NA> in the relative field of the molecular subtype.

Overall, in order to follow the third and most safe option for the subtype comparisons-which i also support due to the fact that the "NA" string in the cancer samples, as also the <NA> fields can't biologically explained, as also they have not any other information-, i initially proceeded with a small subsetting:

 norm.ind <- colnames(coad_clear)[coad_clear$definition%in%("Solid Tissue Normal")]

 subtype.ind <-  colnames(coad_clear)[coad_clear$subtype_expression_subtype%in%("CIN","MSI/CIMP","Invasive")]

 coad_subtype <- coad_clear[,c(norm.ind,subtype.ind)]

 pdat.sub <- SummarizedExperiment::colData(coad_subtype)[,c("definition", "subtype_expression_subtype")]

 plyr::count(pdat.sub$definition)
                    x freq
1 Primary solid Tumor  149
2 Solid Tissue Normal   41

plyr::count(pdat.sub$subtype_expression_subtype)
         x freq
1      CIN   55
2 Invasive   36
3 MSI/CIMP   58
4     <NA>   41

head(pdat.sub)
DataFrame with 6 rows and 2 columns
                                      definition subtype_expression_subtype
                                     <character>                   <factor>
TCGA-AA-3663-11A-01R-1723-07 Solid Tissue Normal                         NA
TCGA-AZ-6605-11A-01R-1839-07 Solid Tissue Normal                         NA
TCGA-AA-3531-11A-01R-A32Z-07 Solid Tissue Normal                         NA
TCGA-A6-2682-11A-01R-A32Z-07 Solid Tissue Normal                         NA
TCGA-AA-3517-11A-01R-A32Z-07 Solid Tissue Normal                         NA
TCGA-AA-3525-11A-01R-A32Z-07 Solid Tissue Normal                         NA

From this point, i should proceed for creating a multifactor variable like the folllowing ? after then also renaming the normal samples to remove the NA abbreviations ?

comb.fac <- paste0(pdat.sub$definition,"_",pdat.sub$subtype_expression_subtype)
table(comb.fac)
comb.fac
     Primary solid Tumor_CIN Primary solid Tumor_Invasive 
                          55                           36 
Primary solid Tumor_MSI/CIMP       Solid Tissue Normal_NA 
                          58                           41 

y2 <- DGEList(counts=assay(coad_subtype,
group=comb.fac)....

Best,

Efstathios

ADD REPLY • link 6.1 years ago svlachavas ▴ 830

1

Entering edit mode

Yes, that seems sensible to me.

ADD REPLY • link 6.1 years ago Aaron Lun ★ 28k