Hi all,

I am now running Deseq2 analysis. However, I am stuck at extracting the results with results(dds, contrast = ()) after deseq() from the large DESeqDataSet. The problem was, it took me extremely long to only extract one comparison (3 hours/comparison meanwhile, I have 265 comparison to extract). I am planning to make a subset of the dds large matrix (after running the deseq()) since I don't want to lose genes before normalization. However, I always got an error when running results(dds, contrast = ()) using the new subset matrix :

Error in counts(object) %*% whichSamples : non-conformable arguments

Then, I tried to relevel the variable containing the numerator and denumerator for contrast function and it gave me another error :

Error in cleanContrast(object, contrast, expanded = isExpanded, listValues = listValues

It seems like the contrast = could not recognize the numerator and denumerator from the new subset matrix. I also read that actually releveling the factor after running the deseq() isn't allowed as it makes the software confused. https://support.bioconductor.org/p/102317/

If there is a way to do solve this problem, please share with me. Thanks in advance.

in total I have 845 samples, each sample has 21111 genes. The samples per group vary depending on the group. Some, I have 6 only, but some I have 116. In total, I have 8 groups so that I plan to subset the matrix into 8 parts.

Two things: I personally use limma-voom for datasets with 100s of individuals in my lab, as I've said before on the support site.

Also, the development branch of DESeq2 (released in one month) is 10x faster for datasets with >100 samples.

That would be great to try the second option. Thanks a lot. However, is there anything worth trying to solve this issue? I mean, like any way to subset the dds large matrix. I can also consider to use the first option.

You can subset the dds and run

`DESeq()`

and`results()`

for pairs of groups.And if you wanted to share the dispersion estimate across all groups, you could do

`DESeq()`

on the whole dataset, then subset and run only`nbinomWaldTest()`

and`results()`

on pairs.Currently, I am trying to subset the outcome matrix of DESeq(). However, I encountered errors written above, What do you mean running results() on pairs? Perhaps, if I manage to subset the matrix, I can run nbinomWaldTest() and calculate the log 2 FC with results().

Yes the second is what I’m recommending.

By the way, is there any way I can use a function to spread the cores for results(). For instance, like deseq(), I can use BIOPARAM to spread the worker. This would be super helpful if I can do this stuff for results(). I tried to use mclapply for instance, bus still the results() was only performed by one worker, The results() is the one taking longest time.

Your first step should be looking up help for functions.

?results

See BPPARAM

Also if you want to keep going with DESeq2, go ahead and use the development branch which is 10x faster for this size of dataset. That’s more than you’ll get with parallelization.

Since the development branch of DESeq2 will be in 1 month, I will try to use the parallelization. I will keep an eye on the update. Thanks a lot for the helps!!

Hi, I have a question. So finally I manage to get my log2FC with your suggestions. However, when I assess the DEGs, I don't get many since my adj-pval is so high (even the p-val) even though the log2FC is around 5 for instance. When I checked the count per replicate, I didn't see much variation. This makes me think that the calculation of p-value with dseq() accounts the variation from the whole data set as the ifcSE value is very high and my data set has non-responding -- high responding compounds. My question is : 1. Is it correct if I subset my data set per compound and analyse it compound by compound? I normalized it with CPM then calculate the log2FC with dseq() function. 2. If I do so, is it fair to compare the expression of the same gene from the different compounds? My concern is because each compound normalized differently and the log2FC is calculated differently, it's not mathematically correct to compare them. Maybe this question is out of topic, but I appreciate your answers. Thanks a lot.

If you want to use a different method, e.g. LFC of CPM data, you should use limma and not DESeq2. You can't insert CPM normalized data into DESeq2.

So I use this chunk of the script. Before I process the DESeq(dds), I define the sizeFactor(dds) first with CPM normalization. I don't know it this s an appropriate manner to do log2FC calculation.

I do not recommend those size factors that you are using under NormMethod: CPM. Note, that is not what

`cpm`

gives you in edgeR/limma either.Ah, ok. I will have a look further for this. So that my question is again : do you recommend to normalize the dataset per compound + control separately using DESeq? And Is it still thereafter appropriate to compare the log2FC after DESeq() processing? As far as I am aware that DESeq is a bit tricky to be applied for the comparisson from separated dataset.

It’s up to you. In this thread I’ve given some valid options for analysis but the analysis is up to you.

Ok then. Thanks a lot!!!