Question

DESeq2 analysis with multiple biological parameters

Assa Yeroslaviz ★ 1.5k


Hi all,

I have a data set of ~360 human samples, which I would like to analyse for differential expression (DESeq2) as well as differential exon usage (DEXSeq).

As we have quite a complicated data set, I would like to know whether it is possible to include multiple factors in one design, and whether that makes sense in general.

Our data has three main categories as well as multiple sub-categories based on biological parameters.
The three main groups are classified as "favorable", "intermediate" and "poor"; the other biological characteristics are deletions (e.g. del(7q)/7q-), translocations (e.g. t(15;17)), inversions (e.g. inv(16)), etc. A sample can have one or more of these parameters.

The data set has no classic control samples, so we would like to compare the groups of biological characteristics against each other.
For the differential gene expression, we would like to find out which genes are strongly expressed in which biological constellation. For the exon usage, we are mainly interested in a few specific genes and their transcripts, and in how these are differentially used across the categories.

Here is a very small example of the data set with its biological parameters:

sampleName    TCGA.ID    riskGroup    biolI    biolII
61GAEAAXX_4    TCGA-AB-2803    Favorable    t(15;17)    IDH1 R172 Negative
61671AAXX_1    TCGA-AB-2803    Favorable    t(15;17)    IDH1 R132 Negative
61U20AAXX_2    TCGA-AB-2999    Favorable    FLT3 Mutation Positive    del (5q) / 5q-
700GFAAXX_7    TCGA-AB-2810    Favorable    t(15;17)    FLT3 Mutation Negative
62P29AAXX_1    TCGA-AB-2841    Favorable    t(15;17)    FLT3 Mutation Negative
700GEAAXX_7    TCGA-AB-2810    Favorable    t(15;17)    FLT3 Mutation Negative
62P29AAXX_2    TCGA-AB-2977    Intermediate    Activating RAS Negative    IDH1 R172 Negative
631WGAAXX_6    TCGA-AB-2977    Intermediate    Activating RAS Negative    21, 9(9;22)
6165JAAXX_7    TCGA-AB-2984    Intermediate    Activating RAS Negative    IDH1 R172 Negative
6165CAAXX_7    TCGA-AB-2984    Intermediate    del (5q) / 5q-    21, 9(9;22)
700GJAAXX_1    TCGA-AB-2986    Intermediate    del (5q) / 5q-    
631WGAAXX_4    TCGA-AB-2811    Intermediate    FLT3 Mutation Positive    
631TVAAXX_3    TCGA-AB-2811    Intermediate    FLT3 Mutation Positive    
61627AAXX_4    TCGA-AB-2816    Intermediate    FLT3 Mutation Positive    21, 9(9;22)
610W1AAXX_7    TCGA-AB-2808    Intermediate    FLT3 Mutation Negative    21, 9(9;22)
61627AAXX_3    TCGA-AB-2833    Intermediate    FLT3 Mutation Negative    21, 9(9;22)
61GAFAAXX_1    TCGA-AB-2854    Intermediate    FLT3 Mutation Negative    21
7008LAAXX_3    TCGA-AB-2856    Intermediate    FLT3 Mutation Negative    21
624YPAAXX_2    TCGA-AB-2990    Intermediate    FLT3 Mutation Negative    21
700J3AAXX_1    TCGA-AB-2986    Intermediate    IDH1 R140 Negative    21
700GFAAXX_3    TCGA-AB-2862    Intermediate    IDH1 R140 Negative    21
61671AAXX_7    TCGA-AB-2808    Intermediate    IDH1 R140 Negative    21
61671AAXX_4    TCGA-AB-2826    Intermediate    IDH1 R140 Negative    21
6165KAAXX_4    TCGA-AB-2883    Poor    BCR-ABL Negative    
6165CAAXX_3    TCGA-AB-2893    Poor    del (5q) / 5q-    IDH1 R132 Negative
700APAAXX_2    TCGA-AB-2861    Poor    del (5q) / 5q-    IDH1 R132 Negative
6165JAAXX_3    TCGA-AB-2893    Poor    del (5q) / 5q-    IDH1 R132 Negative
6165JAAXX_2    TCGA-AB-2878    Poor    del (5q) / 5q-    Normal|Complex…
631TVAAXX_2    TCGA-AB-2920    Poor    del (5q) / 5q-    Normal|Complex …

The columns with the biological parameters are mixed together.
Is it better to put each biological parameter in a separate column and fill the rest with 'NA'?

We would like not only to test two biological parameters against each other, but also to test combinations of parameters.
Is it possible at all to put all these parameters in one big design matrix, or do I need to construct it manually?

I was thinking in terms of something like this:

design(dds) <- formula(~ riskGroup + biolI + biolII + ...)

Or maybe even better, as explained in the vignette (3.3), add the combined factors as a new column (but is this possible when the columns are so mixed?):

dds$group <- factor(paste0(dds$riskGroup, dds$biolI, dds$biolII, ...))
design(dds) <- ~group

and then I would have all possible comparisons available in results().

I would appreciate any help in disentangling this design.

Thanks
Assa


updated 9.9 years ago by Michael Love 43k • written 9.9 years ago by Assa Yeroslaviz ★ 1.5k

score 1 · Answer 1 · 2016-03-16

I can offer a few answers, but I can't answer all your questions:

"Is it better to put each biological parameter in a separate column and fill the rest with 'NA'?"

You can't have NA in the column data for a variable in the design formula. You could have TRUE/FALSE, but no NA.
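As a rough sketch of the TRUE/FALSE idea, the mixed biolI/biolII columns could be turned into one indicator column per feature. The toy table, the helper `has_feature()` and the new column names below are made up for illustration. The sketch also assumes that a feature not listed for a sample counts as absent (FALSE), which may not be true if it simply was not tested.

```r
# Hypothetical toy sample table mimicking the mixed biolI/biolII columns
coldata <- data.frame(
  sampleName = c("s1", "s2", "s3", "s4"),
  riskGroup  = c("Favorable", "Favorable", "Intermediate", "Poor"),
  biolI      = c("t(15;17)", "FLT3 Mutation Positive",
                 "del (5q) / 5q-", "del (5q) / 5q-"),
  biolII     = c("FLT3 Mutation Negative", "del (5q) / 5q-",
                 "", "IDH1 R132 Negative"),
  stringsAsFactors = FALSE
)

# TRUE if the label appears in either mixed column for that sample.
# %in% never returns NA, so empty or missing entries become FALSE.
# (has_feature is an illustrative helper, not a DESeq2 function.)
has_feature <- function(df, label) {
  df$biolI %in% label | df$biolII %in% label
}

coldata$del5q     <- has_feature(coldata, "del (5q) / 5q-")
coldata$t15_17    <- has_feature(coldata, "t(15;17)")
coldata$riskGroup <- factor(coldata$riskGroup)

# No NA left in the columns that would go into a design formula
stopifnot(!anyNA(coldata[, c("riskGroup", "del5q", "t15_17")]))
print(coldata[, c("sampleName", "riskGroup", "del5q", "t15_17")])
```

Columns built this way could then appear in a design such as ~ riskGroup + del5q, provided each level has enough samples and the columns are not confounded with each other.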

"We would like not only to test two biological parameters against each other, but also to test combinations of parameters. Is it possible at all to put all these parameters in one big design matrix, or do I need to construct it manually?"

In my opinion, you would be better off doing exploratory analyses here than trying to fit your exploration into a null hypothesis testing framework. When you have such a complex dataset with many sample properties, no particular hypothesis to test, and a goal of finding patterns associated with many combinations of those properties, null hypothesis testing can be very misleading. You can end up with "significant" associations simply because many combinations were tried in pursuit of a pattern.
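To make the point about trying many combinations concrete, here is a small base R simulation. It is only an illustration on made-up numbers, not DESeq2 code: one pure-noise "gene" is tested against 200 random sample properties, all unrelated to it by construction. The sample size, number of properties and use of `t.test` are arbitrary choices for the sketch.

```r
set.seed(1)
n_samples  <- 60
n_features <- 200   # hypothetical number of property combinations tried

# Expression of one gene: pure noise, no real association with anything
expr <- rnorm(n_samples)

# Random TRUE/FALSE sample properties, unrelated to expr by construction
props <- matrix(rbinom(n_samples * n_features, 1, 0.5) == 1,
                nrow = n_samples)

# One two-group test per property
pvals <- apply(props, 2, function(g) t.test(expr[g], expr[!g])$p.value)

# Nominal "hits" versus hits after adjusting for all tests performed
n_nominal  <- sum(pvals < 0.05)
n_adjusted <- sum(p.adjust(pvals, method = "BH") < 0.05)

cat("nominal p < 0.05:", n_nominal, "of", n_features, "\n")
cat("BH-adjusted p < 0.05:", n_adjusted, "of", n_features, "\n")
```

With a 0.05 threshold, roughly 5% of such null tests are expected to look "significant" before adjustment, and every one of them is a false positive here.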

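I'd recommend just exploring the data by transforming the counts (e.g. with one of the transformations described in the DESeq2 vignette, such as the variance stabilizing transformation) and then looking at, e.g., a PCA of the transformed data. Using PCA, you can find which genes are associated with the largest directions of variance separating the samples.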
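A minimal sketch of that exploratory route, in base R on simulated counts. Everything here is an assumption for illustration: the count matrix, the two groups, and the use of log2(count + 1) as a simple stand-in transformation. With a real DESeqDataSet one would more likely take the transformed matrix from DESeq2's `vst()` and use its `plotPCA()` for the plot.

```r
set.seed(42)
n_genes   <- 500
n_samples <- 12
group <- rep(c("A", "B"), each = 6)   # hypothetical sample annotation

# Simulated counts (genes x samples); the first 20 genes are higher in group B
mu <- matrix(100, nrow = n_genes, ncol = n_samples)
mu[1:20, group == "B"] <- 800
counts <- matrix(
  rnbinom(n_genes * n_samples, mu = mu, size = 10),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)),
                  paste0("s", seq_len(n_samples)))
)

# Stand-in transformation (assumption: replaces a DESeq2 transformation)
logc <- log2(counts + 1)

# PCA with samples as rows and genes as variables
pca <- prcomp(t(logc), center = TRUE, scale. = FALSE)

# Proportion of variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Genes with the largest absolute loadings on PC1, i.e. the genes
# contributing most to the main direction of variance
top_genes <- names(sort(abs(pca$rotation[, "PC1"]), decreasing = TRUE))[1:20]

print(round(var_explained[1:3], 3))
print(top_genes)

# Sample scores on PC1/PC2, coloured by the annotation of interest
plot(pca$x[, 1], pca$x[, 2], col = factor(group), pch = 19,
     xlab = "PC1", ylab = "PC2")
```

Colouring the same plot by riskGroup, or by a TRUE/FALSE column for a given cytogenetic feature, shows which sample properties line up with the main axes of variation without running a test for each one.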