Hi all,
I have a data set of ~360 human samples, which I would like to analyse for differential expression (DESeq2) as well as differential exon usage (DEXSeq).
As we have quite a complicated data set, I would like to know whether it is possible to include multiple factors in one design, and whether that makes sense in general.
Our data has three main categories as well as multiple sub-categories based on biological parameters.
The three main groups are classified as "favorable", "intermediate" and "poor"; the other biological characteristics are deletions (e.g. del(7q)/7q-), translocations (e.g. t(15;17)), inversions (e.g. inv(16)), etc. A sample can have one or more of these characteristics.
The data set does not contain classic control samples; instead, we would like to compare the groups of biological characteristics against each other.
For the differential gene expression, we would like to find out which genes are strongly expressed in which biological constellation. For the exon usage, we are mainly interested in a few specific genes/transcripts and how their exons are used differentially across the categories.
Here is a very small example of the data set with its biological parameters:
sampleName   TCGA.ID       riskGroup     biolI                    biolII
61GAEAAXX_4  TCGA-AB-2803  Favorable     t(15;17)                 IDH1 R172 Negative
61671AAXX_1  TCGA-AB-2803  Favorable     t(15;17)                 IDH1 R132 Negative
61U20AAXX_2  TCGA-AB-2999  Favorable     FLT3 Mutation Positive   del (5q) / 5q-
700GFAAXX_7  TCGA-AB-2810  Favorable     t(15;17)                 FLT3 Mutation Negative
62P29AAXX_1  TCGA-AB-2841  Favorable     t(15;17)                 FLT3 Mutation Negative
700GEAAXX_7  TCGA-AB-2810  Favorable     t(15;17)                 FLT3 Mutation Negative
62P29AAXX_2  TCGA-AB-2977  Intermediate  Activating RAS Negative  IDH1 R172 Negative
631WGAAXX_6  TCGA-AB-2977  Intermediate  Activating RAS Negative  21, 9(9;22)
6165JAAXX_7  TCGA-AB-2984  Intermediate  Activating RAS Negative  IDH1 R172 Negative
6165CAAXX_7  TCGA-AB-2984  Intermediate  del (5q) / 5q-           21, 9(9;22)
700GJAAXX_1  TCGA-AB-2986  Intermediate  del (5q) / 5q-
631WGAAXX_4  TCGA-AB-2811  Intermediate  FLT3 Mutation Positive
631TVAAXX_3  TCGA-AB-2811  Intermediate  FLT3 Mutation Positive
61627AAXX_4  TCGA-AB-2816  Intermediate  FLT3 Mutation Positive   21, 9(9;22)
610W1AAXX_7  TCGA-AB-2808  Intermediate  FLT3 Mutation Negative   21, 9(9;22)
61627AAXX_3  TCGA-AB-2833  Intermediate  FLT3 Mutation Negative   21, 9(9;22)
61GAFAAXX_1  TCGA-AB-2854  Intermediate  FLT3 Mutation Negative   21
7008LAAXX_3  TCGA-AB-2856  Intermediate  FLT3 Mutation Negative   21
624YPAAXX_2  TCGA-AB-2990  Intermediate  FLT3 Mutation Negative   21
700J3AAXX_1  TCGA-AB-2986  Intermediate  IDH1 R140 Negative       21
700GFAAXX_3  TCGA-AB-2862  Intermediate  IDH1 R140 Negative       21
61671AAXX_7  TCGA-AB-2808  Intermediate  IDH1 R140 Negative       21
61671AAXX_4  TCGA-AB-2826  Intermediate  IDH1 R140 Negative       21
6165KAAXX_4  TCGA-AB-2883  Poor          BCR-ABL Negative
6165CAAXX_3  TCGA-AB-2893  Poor          del (5q) / 5q-           IDH1 R132 Negative
700APAAXX_2  TCGA-AB-2861  Poor          del (5q) / 5q-           IDH1 R132 Negative
6165JAAXX_3  TCGA-AB-2893  Poor          del (5q) / 5q-           IDH1 R132 Negative
6165JAAXX_2  TCGA-AB-2878  Poor          del (5q) / 5q-           Normal|Complex…
631TVAAXX_2  TCGA-AB-2920  Poor          del (5q) / 5q-           Normal|Complex …
As you can see, the different biological parameters are mixed together within the biolI and biolII columns.
Would it be better to put each biological parameter in a separate column and fill the remaining cells with 'NA'?
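To make the "one column per parameter" idea concrete, here is a minimal sketch in base R on a made-up three-sample table. The helper has_param() and the column names t15_17 and del5q are hypothetical, chosen only for illustration. The sketch codes absence as "no" instead of 'NA'; that is an assumption on my side, not something I have confirmed to be required.

```r
# Toy sample table mimicking the mixed biolI/biolII columns (made-up rows)
coldata <- data.frame(
  riskGroup = c("Favorable", "Intermediate", "Poor"),
  biolI     = c("t(15;17)", "del (5q) / 5q-", "del (5q) / 5q-"),
  biolII    = c("FLT3 Mutation Negative", "", "IDH1 R132 Negative"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: is a given parameter listed in either mixed column?
has_param <- function(df, label) {
  found <- df$biolI == label | df$biolII == label
  factor(ifelse(found, "yes", "no"), levels = c("no", "yes"))
}

# One yes/no factor per biological parameter
coldata$t15_17 <- has_param(coldata, "t(15;17)")
coldata$del5q  <- has_param(coldata, "del (5q) / 5q-")

print(coldata[, c("riskGroup", "t15_17", "del5q")])
```

With this layout, every parameter becomes its own two-level factor, regardless of which of the original columns it was recorded in.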
We would like to test not only two biological parameters against each other, but also combinations of parameters.
Is it possible at all to put all these parameters in one big design matrix or do I need to construct it manually?
I was thinking in terms of something like this:
design(dds) <- formula(~ riskGroup + biolI + biolII + ...)
Or maybe, as explained in the vignette (3.3), it would be better to combine the factors into a new column (but is this possible when the columns are so mixed?):
dds$group <- factor(paste0(dds$riskGroup, dds$biolI, dds$biolII, ...))
design(dds) <- ~ group
and then I would have all possible comparisons available in results().
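As a small illustration of the combined-factor idea, here is a sketch in base R on made-up data. The columns t15_17 and del5q are hypothetical yes/no indicators, and I use paste() with a separator rather than paste0() only to keep the level names readable. It shows which levels the group factor would actually contain: only the combinations that occur in the data, not every possible one.

```r
# Made-up sample table with one risk group and two hypothetical yes/no indicators
coldata <- data.frame(
  riskGroup = c("Favorable", "Favorable", "Poor", "Poor"),
  t15_17    = c("yes", "yes", "no", "no"),
  del5q     = c("no", "no", "yes", "yes")
)

# Combine all factors into a single grouping factor
coldata$group <- factor(
  paste(coldata$riskGroup, coldata$t15_17, coldata$del5q, sep = "_")
)

# Only combinations that are present in the data become levels
print(levels(coldata$group))

# Design matrix that a formula like ~ group would produce
mm <- model.matrix(~ group, data = coldata)
print(colnames(mm))
```

If I understand the vignette correctly, two such levels could then be compared with results(dds, contrast = c("group", levelA, levelB)), where levelA and levelB stand for two of the level names printed above.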
I would appreciate any help untangling this design.
Thanks
Assa