Subset-ing Large DESeqDataSet for data extraction with contrast function
1
0
Entering edit mode
l.s.wijaya • 0
@lswijaya-21856
Last seen 4.6 years ago

Hi all,

I am now running Deseq2 analysis. However, I am stuck at extracting the results with results(dds, contrast = ()) after deseq() from the large DESeqDataSet. The problem was, it took me extremely long to only extract one comparison (3 hours/comparison meanwhile, I have 265 comparison to extract). I am planning to make a subset of the dds large matrix (after running the deseq()) since I don't want to lose genes before normalization. However, I always got an error when running results(dds, contrast = ()) using the new subset matrix :

Error in counts(object) %*% whichSamples : non-conformable arguments

Then, I tried to relevel the variable containing the numerator and denumerator for contrast function and it gave me another error :

Error in cleanContrast(object, contrast, expanded = isExpanded, listValues = listValues

It seems like the contrast = could not recognize the numerator and denumerator from the new subset matrix. I also read that actually releveling the factor after running the deseq() isn't allowed as it makes the software confused. https://support.bioconductor.org/p/102317/

If there is a way to do solve this problem, please share with me. Thanks in advance.

deseq2 • 3.5k views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 1 day ago
United States

How many samples do you have? Also, how many samples per group?

ADD COMMENT
0
Entering edit mode

in total I have 845 samples, each sample has 21111 genes. The samples per group vary depending on the group. Some, I have 6 only, but some I have 116. In total, I have 8 groups so that I plan to subset the matrix into 8 parts.

ADD REPLY
0
Entering edit mode

Two things: I personally use limma-voom for datasets with 100s of individuals in my lab, as I've said before on the support site.

Also, the development branch of DESeq2 (released in one month) is 10x faster for datasets with >100 samples.

ADD REPLY
0
Entering edit mode

That would be great to try the second option. Thanks a lot. However, is there anything worth trying to solve this issue? I mean, like any way to subset the dds large matrix. I can also consider to use the first option.

ADD REPLY
0
Entering edit mode

You can subset the dds and run DESeq() and results() for pairs of groups.

ADD REPLY
0
Entering edit mode

And if you wanted to share the dispersion estimate across all groups, you could do DESeq() on the whole dataset, then subset and run only nbinomWaldTest() and results() on pairs.

ADD REPLY
0
Entering edit mode

Currently, I am trying to subset the outcome matrix of DESeq(). However, I encountered errors written above, What do you mean running results() on pairs? Perhaps, if I manage to subset the matrix, I can run nbinomWaldTest() and calculate the log 2 FC with results().

ADD REPLY
0
Entering edit mode

Yes the second is what I’m recommending.

ADD REPLY
0
Entering edit mode

By the way, is there any way I can use a function to spread the cores for results(). For instance, like deseq(), I can use BIOPARAM to spread the worker. This would be super helpful if I can do this stuff for results(). I tried to use mclapply for instance, bus still the results() was only performed by one worker, The results() is the one taking longest time.

ADD REPLY
0
Entering edit mode

Your first step should be looking up help for functions.

?results

See BPPARAM

Also if you want to keep going with DESeq2, go ahead and use the development branch which is 10x faster for this size of dataset. That’s more than you’ll get with parallelization.

ADD REPLY
0
Entering edit mode

Since the development branch of DESeq2 will be in 1 month, I will try to use the parallelization. I will keep an eye on the update. Thanks a lot for the helps!!

ADD REPLY
0
Entering edit mode

Hi, I have a question. So finally I manage to get my log2FC with your suggestions. However, when I assess the DEGs, I don't get many since my adj-pval is so high (even the p-val) even though the log2FC is around 5 for instance. When I checked the count per replicate, I didn't see much variation. This makes me think that the calculation of p-value with dseq() accounts the variation from the whole data set as the ifcSE value is very high and my data set has non-responding -- high responding compounds. My question is : 1. Is it correct if I subset my data set per compound and analyse it compound by compound? I normalized it with CPM then calculate the log2FC with dseq() function. 2. If I do so, is it fair to compare the expression of the same gene from the different compounds? My concern is because each compound normalized differently and the log2FC is calculated differently, it's not mathematically correct to compare them. Maybe this question is out of topic, but I appreciate your answers. Thanks a lot.

ADD REPLY
0
Entering edit mode

If you want to use a different method, e.g. LFC of CPM data, you should use limma and not DESeq2. You can't insert CPM normalized data into DESeq2.

ADD REPLY
0
Entering edit mode
CPM <- apply(count_Filtered, 2, function(x) (x/sum(x))*1000000)

dds <- DESeqDataSetFromMatrix(countData = count_Filtered,
                              colData = meta_Filtered,
                              design = ~ meanID)

if (NormMethod == "DESeq2"){ 
  dds <- estimateSizeFactors(dds) 
} else if (NormMethod == "CPM"){
  sizeFactors(dds) <- colSums(count_Filtered)/1000000
} else { 
  warning("Incorrect input for 'NormMethod', please double check above!")
}

dds <- DESeq(dds)

So I use this chunk of the script. Before I process the DESeq(dds), I define the sizeFactor(dds) first with CPM normalization. I don't know it this s an appropriate manner to do log2FC calculation.

ADD REPLY
0
Entering edit mode

I do not recommend those size factors that you are using under NormMethod: CPM. Note, that is not what cpm gives you in edgeR/limma either.

ADD REPLY
0
Entering edit mode

Ah, ok. I will have a look further for this. So that my question is again : do you recommend to normalize the dataset per compound + control separately using DESeq? And Is it still thereafter appropriate to compare the log2FC after DESeq() processing? As far as I am aware that DESeq is a bit tricky to be applied for the comparisson from separated dataset.

ADD REPLY
0
Entering edit mode

It’s up to you. In this thread I’ve given some valid options for analysis but the analysis is up to you.

ADD REPLY
0
Entering edit mode

Ok then. Thanks a lot!!!

ADD REPLY

Login before adding your answer.

Traffic: 522 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6