My question follows on from two related questions (https://support.bioconductor.org/p/79209/ and https://support.bioconductor.org/p/98476/).
I use DESeq2 to get normalized counts from raw TCGA HTSeq counts.
My method is simple: put the data through DESeq2 using design = ~ 1 (which gives a warning), as described in https://support.bioconductor.org/p/79209/
exptDesign_TCGA = data.frame(
  row.names = colnames(matrix_TCGA_data),
  condition = sample_tsv$Project.ID)
dds <- DESeqDataSetFromMatrix(
  countData = matrix_TCGA_data,
  colData = exptDesign_TCGA,
  design = ~ 1)
Then I do all of my analysis (clustering, survival, gene expression heat maps) on the normalised counts from DESeq2.
However I am running into problems with huge numbers of samples and computational power - directly related to the following post: https://support.bioconductor.org/p/98476/
I know Michael Love suggests using Limma-Voom, and I have tried that. When I run my analysis it essentially scales the counts by gene across all samples, which is not what I want at all, as I lose the relative expression of one gene to another gene in the same sample. I need each sample to be normalised by the DESeq2 method.
In his response, Michael Love has also suggested this approach - https://support.bioconductor.org/p/98476/
rowSums(counts(dds, normalized=TRUE) >= 10) >= 5
However, this does not make sense to me. Getting normalized counts is what is taking too long: as far as I can tell, in order to call counts() with normalized=TRUE you first have to run the DESeq() function (which estimates size factors and dispersions and is incredibly time consuming).
So, can I please have clarification on the following:
a) Was the above response meant to suggest that I take the raw counts, cut out all genes with low read counts, and then run DESeq2 on this slimmed-down data set?
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep, ]
# differential analysis, DESeq(), is then run on this object
analysisObject_TCGA <- DESeq(dds, parallel = TRUE)
b) Is there another way to do this? Should I just work with TCGA raw counts and not normalise other than scaling prior to clustering and survival analysis? - Is this wrong?
c) Is my use of Limma-Voom incorrect - as the results appear to be simply gene scaling. Should I transpose my counts matrix and re-run Limma-Voom?
Any help or advice would be much appreciated.
Kind Regards, Alex
This works perfectly, a bit silly I didn't realise it just needed size factors, many thanks. Was finished in under 5 minutes on ~10,000 samples.
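For anyone following along, here is a minimal sketch of why size factors alone are enough and why they are cheap to compute. It is a base R re-implementation of the median-of-ratios idea on a made-up toy matrix, not DESeq2's own code; the function name median_of_ratios and the toy numbers are hypothetical. With a real DESeqDataSet the corresponding DESeq2 calls would be estimateSizeFactors(dds) followed by counts(dds, normalized = TRUE), with no call to DESeq() needed.

```r
# Toy raw counts: genes in rows, samples in columns (made-up numbers).
# sample2 and sample3 are sample1 sequenced 2x and 4x as deeply.
raw_counts <- matrix(
  c( 10,  20,  40,
    100, 200, 400,
      5,  10,  20,
     50, 100, 200),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3))
)

# Hypothetical helper: median-of-ratios size factors.
# For each sample, take the median over genes of
# log(count) - log(geometric mean of that gene across samples).
median_of_ratios <- function(counts) {
  log_geo_means <- rowMeans(log(counts))
  usable <- is.finite(log_geo_means)  # drops genes with a zero in any sample
  apply(counts, 2, function(col) {
    exp(median((log(col) - log_geo_means)[usable & col > 0]))
  })
}

size_factors <- median_of_ratios(raw_counts)

# Divide each sample (column) by its size factor. Genes within a sample
# keep their relative expression because the whole column is scaled by
# one number.
norm_counts <- sweep(raw_counts, 2, size_factors, "/")

print(size_factors)
print(norm_counts)

# With DESeq2 loaded and a DESeqDataSet `dds`, the equivalent would be:
#   dds  <- estimateSizeFactors(dds)
#   norm <- counts(dds, normalized = TRUE)
```

Only one scaling factor per sample is estimated here, with no dispersion fitting, which is why this step stays fast even with thousands of samples.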
Thanks again, Alex