Hello everyone,
I'm Daniele, a junior researcher and former participant of the CSAMA event in 2015. After some month of waiting, I finally received my first RNA-Seq data, so I wanted to test DESeq2 to perform my DGE analysis.
What I'm trying to do is to find differentially expressed features at gene level on a series of tumor cell lines expressing two isoforms of a particular gene, and no isoforms (control). Moreover, I also have knock-out tumor cell lines for these isoforms. Finally, all these lines where exposed to a drug, and the relative controls (no treatment) where sampled. I have at least 2 biological replicate for each cell condition.
I am trying to compare all the combinations of these cell lines in order to obtain all possible information regarding the significant DEGenes in this experiment (e.g.: cells with isoform1 treated vs cell with isoform 1 untreated, all tumor cell treated vs all tumor cells untreated, all knockdown untreated cells for isoform 1 vs all knockdown untreated cells for isoform 2, and so on). So, since there are MUCH more factors than the ones the "contrast" function in the results() function allows to, I builded my DESeqDataset using a combination of these factors.
(Briefly I paste()d the columns specifying treatment, knockdown and so on and then used them to model the dataset. eg:
sampleinfo$contrasts <- as.factor(paste(sampleinfo$isoform, sampleinfo$knockout, sampleinfo$treatment ,sep = "_")
#create a factor like isoform1_noknockout_untreated) ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleinfo, directory = ".",design = ~contrasts)
And then I proceeded with my DESeq2 analysis following the RNA-seq workflow found in Bioconductor.
What I found is that little or no significant DEgenes where found using DeSeq2 with default parameters. Moreover, when looking at the adjusted po-valuue, ALL values are set to 1 Just to output some of the DeSeq2 results:
gene_name baseMean log2FoldChange lfcSE stat pvalue padj
A1BG 32.6061798000319 -0.233823005166891 0.233034876711654 -1.00338201931984 0.315676576201699 1
A1BG-AS1 0.701182193651673 -0.036636625854256 0.0819481333750818 -0.447070901378166 0.654823868155052 1
A1CF 1475.624054965 -0.0140027561993027 0.218784550709677 -0.0640024908243367 0.948968243485277 1
A2M 51012.7388481331 -0.0452381081136272 0.264470450676274 -0.171051654345313 0.864183149758649 1
So, several questions arises on how these findings should be interpreted and I am asking for your support in order to solve them. The questions are:
- does the choice of "pasting" factors influence the DEseq analysis?
- Should I use different parameters for my DESeq analysis?
- Should I use the LR method for statistical testing rather than Wald? what are the pros and cons of the two approaches?
- Are these findings maybe dependant from the low number of replicates of my samples? What would you suggest to restrict the use of DESeq to "supergroups" rather than these small comparisons?
I have to mention that this is my first RNA-SEq analysis so I apologise in advance for any mistake I could have committed.
Thanks a lot for your time!
Daniele