Question

Statistics: edgeR or DESeq2 to assess differential binding using average scores

0

Bogdan ▴ 670

@bogdan-2367

Last seen 2.3 years ago

Palo Alto, CA, USA

Dear all,

I have a statistical question about using limma / edgeR / DESeq2 to assess differential binding (differential peaks) in condition1 versus condition2.

I only have bigWig / bedGraph files for two replicates of condition1 and two replicates of condition2 (I do not have access to the BAM / FASTQ files).

For these 4 samples, deepTools (multiBigwigSummary), or other tools in BioC, generates a file that contains the average score per region for each bigWig file.

The file format is :

   chr  start    end       s01       s02         s3        s4        s13       s15       s19       s21
1 chr1 858542 864405 0.1214228 0.1374998 0.09338397 1.2103991 0.09819990 0.1288492 0.2059920 0.2368644
2 chr1 867458 869835 0.1994169 0.1824069 0.11373949 0.6112617 0.11488137 0.1379892 0.2067585 0.2780747

From a statistical point of view, is it legitimate to use these average scores per region as input to edgeR / DESeq2 / limma in order to assess whether a region (chr1:858542-864405) is differentially bound in condition 1 (samples s01 and s02) versus condition 2 (samples s3 and s4)?

Although I expect the answer to my question to be a huge "no", I still want to ask on the BioC forum in case there are other suggestions or opinions.

How can I check whether these scores follow a negative binomial (NB) distribution or a normal distribution?

Could I mathematically transform these average scores so that they fit an NB distribution?

Is there any alternative to edgeR / limma / DESeq2 (or to a t-test) for answering this question?

Thanks a lot,

Bogdan

DESeq2 edgeR • 1.8k views

ADD COMMENT • link updated 3.1 years ago by chaco001 • 0 • written 3.3 years ago by Bogdan ▴ 670

score 0 · Answer 1 · 2022-10-20

0

Yunshun Chen ▴ 900

@yunshun-chen-5451

Last seen 14 months ago

Australia

It is not legitimate to use these average scores as input to any of the three packages you mentioned. The scores cannot be negative binomial distributed because they are not even integers. You may need to follow the deepTools pipeline and use those average scores for the downstream analysis it supports.

ADD COMMENT • link 3.3 years ago Yunshun Chen ▴ 900

0

Hi Yunshun, thank you for your reply. I guess I could multiply each element of the matrix by 1000 to turn those numerical values into integers?

ADD REPLY • link 3.3 years ago Bogdan ▴ 670

0

No, it doesn't make sense to do so. The distribution we assume the data would follow is completely lost during the calculation of the 'average scores'. You really need the raw counts in order to use these packages.
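
A minimal sketch of why rescaling does not help, using only Python's standard library. The counts below are made up for illustration, and the factor 1000 is the one suggested above. Count models tie the variance to the mean; multiplying by a constant c scales the mean by c but the variance by c squared, so the variance-to-mean ratio is inflated by a factor of c and no longer resembles that of genuine counts.

```python
from statistics import mean, variance

# Hypothetical raw counts for one region across samples (illustrative only).
counts = [8, 15, 9, 17, 10, 13]

def variance_to_mean_ratio(values):
    """Sample variance divided by the mean (close to 1 for Poisson-like counts)."""
    return variance(values) / mean(values)

scale = 1000  # the multiplier proposed in the thread
scaled = [scale * x for x in counts]

ratio_raw = variance_to_mean_ratio(counts)
ratio_scaled = variance_to_mean_ratio(scaled)

# The ratio grows by exactly the scale factor: the values are integers,
# but they no longer behave like counts.
print(f"raw counts:    variance/mean = {ratio_raw:.3f}")
print(f"scaled values: variance/mean = {ratio_scaled:.3f}")
```

The same argument applies to averaged scores: turning them into integers changes their type, not the process that generated them.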

ADD REPLY • link 3.3 years ago Yunshun Chen ▴ 900

0

OK, thank you. Would it make sense to use a t-test?

More generally, if I have a list of integers, how can I check whether they fit an NB distribution or a Gaussian distribution?

Which commands in R should I use?

Or how can I transform the data so that it follows an NB or a normal distribution (in which case I could use a t-test)?

ADD REPLY • link 3.3 years ago Bogdan ▴ 670

1

As mentioned before, it would be better to look at the deepTools pipeline where the 'average score' was introduced and see how the scores are used in its downstream DE analysis. Which test to use depends on how the scores were computed in the first place.

There is no short answer to your general questions (except that integers can't be Gaussian distributed).
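
As a rough starting point only, one common first look at integer counts is to compare the sample variance with the sample mean. For Poisson data the two are about equal, while the NB variance is mu + phi * mu^2, which gives the method-of-moments estimate phi = (variance - mean) / mean^2. The sketch below uses Python's standard library; the function name and both count vectors are hypothetical. It is a heuristic, not a formal goodness-of-fit test, and it is very noisy with only a handful of samples.

```python
from statistics import mean, variance

def moment_dispersion(counts):
    """Method-of-moments NB dispersion: (variance - mean) / mean**2.

    Values near 0 are consistent with Poisson-like variation; clearly
    positive values suggest the overdispersion that the NB models.
    """
    m = mean(counts)
    v = variance(counts)  # sample variance (n - 1 denominator)
    return (v - m) / m ** 2

# Hypothetical integer counts for one feature across samples.
poisson_like = [8, 15, 9, 17, 10, 13]
overdispersed = [2, 40, 7, 31, 1, 15]

print(f"Poisson-like dispersion estimate:  {moment_dispersion(poisson_like):.4f}")
print(f"overdispersed dispersion estimate: {moment_dispersion(overdispersed):.4f}")
```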

ADD REPLY • link 3.3 years ago Yunshun Chen ▴ 900

0

Hi Yunshun. When you get the chance, would it be possible to mention a few websites where I could learn how to check in R whether a list of numbers (which may represent gene expression levels, peak intensities, chromatin looping strength, etc.) fits an NB or a Gaussian distribution (or any other distribution: Poisson, exponential, etc.)? Thanks!

ADD REPLY • link 3.3 years ago Bogdan ▴ 670

0

Hi Yunshun and Bogdan, I think it is incorrect to say that limma cannot be used for this type of data. Only DESeq2 and edgeR expect counts (and therefore make the NB assumption). My understanding is that limma, having been built for (nearly) continuous microarray data, can handle continuous data just fine.
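
To make the continuous-data idea concrete, here is a hedged sketch in Python (standard library only) that log2-transforms the scores of the first region posted in the question and computes a plain Welch t statistic between the two conditions. This is not limma: limma's moderated statistics borrow information across all regions, which this per-region calculation does not. The offset of 0.01 and the function names are arbitrary choices made for illustration.

```python
import math
from statistics import mean, variance

# Average scores for region chr1:858542-864405, copied from the question.
condition1 = [0.1214228, 0.1374998]   # s01, s02
condition2 = [0.09338397, 1.2103991]  # s3, s4

def log2_scores(scores, offset=0.01):
    """log2-transform continuous scores; the offset guards against log(0)."""
    return [math.log2(s + offset) for s in scores]

def welch_t(a, b):
    """Welch's t statistic for two independent groups (unequal variances)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

t_stat = welch_t(log2_scores(condition1), log2_scores(condition2))
print(f"Welch t statistic on log2 scores: {t_stat:.3f}")
```

With two replicates per condition, a per-region test like this has very little power, which is one reason methods that share variance information across regions are usually preferred.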

ADD REPLY • link 3.1 years ago chaco001 • 0