statistics : edger or DESeq2 to asses the differential binding using the average scores
1
0
Entering edit mode
Bogdan ▴ 670
@bogdan-2367
Last seen 14 months ago
Palo Alto, CA, USA

Dear all,

I do have a statistical question about the use of limma /edgeR/ DEseq2 in order to assess the differential binding / differential peaks in condition1 versus condition2.

I only have bigwig / bedGraph files of two replicates (condition1) and two replicates (condition2) (I do not have access to bam / fastq files).

Considering these 4 samples, deeptools (MultiWigSummary) (or other tools in BioC) generates a file that contains the average scores per region for each bigWig file.

The file format is :

 chr  start    end   s01   s02    s3   s4    s13   s15   s19   s21
1 chr1 858542 864405 0.1214228 0.1374998 0.09338397 1.2103991 0.09819990 0.1288492 0.2059920 0.2368644
2 chr1 867458 869835 0.1994169 0.1824069 0.11373949 0.6112617 0.11488137 0.1379892 0.2067585 0.2780747 

From a statistical point of view, is it legitimate to use these average score per region as input into edgeR/ DESe2 /limma in order to assess if a region (chr1 : 858542 - 864405) is differentially bound in condition 1 (samples 01 and 02) and not in condition 2 (samples s3 and s4).

Although I expect that the answer to my question is a huge "no", I still want to ask the question on BioC forum in order to potentially receive any other suggestions / opinions.

How shall I check if these scores follow a negative binomial distribution (or a normal distribution) ?

Could I mathematically transform these average scores in such a way that it fits a NB distribution ?

Is there any other alternative to edgeR/ limma/ DESeq2 (or to T-test) in order to answer the question ?

Thanks a lot,

Bogdan

DESeq2 edgeR • 1.3k views
ADD COMMENT
0
Entering edit mode
Yunshun Chen ▴ 900
@yunshun-chen-5451
Last seen 5 weeks ago
Australia

It is not legitimate to use these average scores as input into any of the three packages you mentioned. These scores can't be negative binomial distributed as they are not even integers. You may need to follow the deeptools pipeline to use those average scores for all the downstream analysis.

ADD COMMENT
0
Entering edit mode

Hi Yunshun, thank you for your reply. I guess that I could multiply each element of the matrix by 1000 in order to transform those numerical values into integers ?

ADD REPLY
0
Entering edit mode

No, it doesn't make sense to do so. The distribution we assume the data would follow is completely lost during the calculation of the 'average scores'. You really need the raw counts in order to use these packages.

ADD REPLY
0
Entering edit mode

OK, thank you. Would it make sense to use a T-test ?

Generally speaking though, shall I have a list of integers, how shall I verify if those integers fit a NB distribution or a gaussian distribution ?

Which commands in R shall I use ?

Or how can I transform the data in order to make it follow a NB, or a normal distribution (in which case I can use a T-test) ?

ADD REPLY
1
Entering edit mode

As mentioned before, it would be better to have a look at the deeptools pipeline where the 'average score' was introduced, and see how the scores are utilized in the downstream DE analysis. Which test to use depends on how the scores were computed in the first place.

There is no short answer to your general questions (except that integers can't be gaussian distributed).

ADD REPLY
0
Entering edit mode

Hi Yunshun. When you get the chance, would it possible please to mention a few websites that I could use in order to learn how to verify in R if a list of numbers (that may represent gene expression levels, peak intensities, chromatin looping strength, etc) fit a NB distribution or a gaussian distribution (or any other distribution, Poisson, exponential, etc )? Thanks !

ADD REPLY
0
Entering edit mode

Hi Yunshun and Bogdan, I think that it is incorrect to say that limma cannot be used for this type of data. Only DESeq2 and edgeR expect counts (and therefore have the NB assumption). My understanding is that limma, having been built to use with (nearly) continuous micro-array data, can handle continuous data just fine.

ADD REPLY

Login before adding your answer.

Traffic: 743 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6