Question

how can I calculate FDR with no replicate

0

Entering edit mode

Bioinformatics ▴ 30

@bioinformatics-10931

Last seen 2.3 years ago

United States

Hi,

I have a representative data like this. the first column is control and second is treated

df<- structure(list(col1A = c(1.64, 0.03, 0, 4.202, 2.981, 0.055, 
0, 0.002, 0.005, 0, 0.002, 0.649, 2.55, 2.762, 6.402, 0.91, 0.037, 
0, 5.757, 3.916, 0.022, 0, 0, 0.003, 0, 0.262, 0.136, 2.874, 
3.466, 5.003), col2A = c(2.635, 0.019, 0, 5.638, 3.542, 0.793, 
0.259, 0, 0.046, 0.004, 0.017, 0.971, 3.81, 3.104, 5.849, 1.027, 
0.021, 0, 4.697, 2.832, 0.038, 0.032, 0.001, 0.003, 0, 0, 0.317, 
2.743, 3.187, 6.455)), row.names = c("A", "AA", "AAA", "Aab", 
"buy", "yuyn", "gff", "fgd", "kil", "lilk", "hhk", "lolo", "fdd", 
"vgfh", "nghg", "gdtd", "ayad", "terer", "quwte", "nshdg", "ahaf", 
"eqew", "tars", "nshdt", "andydv", "oalkd", "jayqgd", "nahdgd", 
"nagdd", "hdydy"), class = "data.frame")


group<-factor(c("C","T"))
design<-model.matrix(~group)
data<-DGEList(counts=df,group=group)
data<-estimateGLMCommonDisp(d,design,method="deviance",robust="TRUE",subset=NULL)
data$common.dispersion=0.4
res <- exactTest(data)

I am wondering if this is correct? can I reply somehow on the p value and FDR? let me know where I am making mistake or if there is other package that can be better ?

microarray edger • 1.4k views

ADD COMMENT • link updated 5.2 years ago by Gordon Smyth 50k • written 5.2 years ago by Bioinformatics ▴ 30

0

Entering edit mode

Can I clarify a few things?

What exactly are these numbers? They certainly aren't read counts. What does an expression value of 0.03 mean? edgeR is designed to work with counts or, at very least, expected counts.
How many genes do you have in total? Your representative data only 30 genes. Does your total dataset also have only a handful of genes?
You must already know that the code you give doesn't run. Please run and debug your code before posting it in a question.

ADD REPLY • link 5.2 years ago Gordon Smyth 50k

0

Entering edit mode

@Gordon Smyth these are normalized expression not count. so it means if people do not use raw count, it won't be trusted? I have read many papers they published in science with FMPK which is normalized count by length of gene analyzed by edger !! (I personally do not like FMPK but just as example.

-more than a 1000 genes - the code I give runs for me, which error do you get ? the only function which does not run is estimateGLMCommonDisp. because of lack of replicate I guess

I appreciate your input , THANKS

ADD REPLY • link 5.2 years ago Bioinformatics ▴ 30

1

Entering edit mode

I don't know what you mean by "normalized expression", but you must give counts to edgeR so your current analysis has no meaning. Inputing FPKM to edgeR would also be nonsense, as has been said many times before on this website.

I've never seen a Science paper inputing FPKM to edgeR. I can't say it's never happened, but perhaps you have mis-interpreted what was done.

ADD REPLY • link 5.2 years ago Gordon Smyth 50k

0

Entering edit mode

@Gordon Smyth Look at this manuscript somewhat got me thinking https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3287564/

ADD REPLY • link 5.2 years ago Bioinformatics ▴ 30

score 0 · Answer 1 · 2019-01-29

0

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 10 hours ago

The city by the bay

I assume you've read the relevant section of the edgeR user's guide (i.e., Section 2.11). If you don't have replicates, there is very little that you can do. This is a fundamental theoretical limitation that cannot be overcome. The p-values and FDR resulting from any analysis without replication should not be trusted. Also, the estimateGLMCommonDisp call above will fail for various reasons (no residual d.f., run on d instead of data, robust should be a logical scalar).

ADD COMMENT • link 5.2 years ago Aaron Lun ★ 28k

0

Entering edit mode

@ Aaron Lun I did read that part for sure, I cannot estimate the dispersion because I don't have many housekeeping genes so I set it as 0.4 which I think it is a good number between 0.1 and 0.6. Yes estimateGLMCommonDisp does not work due to problem you mentioned.

I am just wondering if there is other ways to do that . do you think any sort of permutation test can lead me to any expression analysis when I don't have replicate ?

ADD REPLY • link 5.2 years ago Bioinformatics ▴ 30

0

Entering edit mode

@Aaron Lun Look at this manuscript somewhat got me thinking https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3287564/

ADD REPLY • link 5.2 years ago Bioinformatics ▴ 30

score 0 · Answer 2 · 2019-01-29

You see to be trying to use a suggestion from Section 2.11 of the edgeR User's Guide, but your analysis is nevertheless still nonsense at the moment because edgeR is not designed to work with arbitrary "normalized expression values".

Trying to use permutation with only 2 samples also makes no sense. No informative permutations are possible with only 2 observations.

If you can obtain raw data in the form of meaningful counts, and you really want FDR values, then the two possible approaches are suggestions 2 and 3 in the edgeR User's Guide section on What to do if you have n replicates.

Suggestion 2 from the User's Guide is to use pre-specified dispersion value. For example you might use

group <- factor(c("C","T"))
d <- DGEList(counts=df, group=group)
keep <- filterByExpr(d)
d <- d[keep,,keep.lib.sizes=FALSE]
d <- calcNormFactors(d, method="TMMwzp")
res <- exactTest(d, dispersion=0.2)
topTags(res)

This analysis without replicates isn't ideal, but nevertheless is vastly better than the method proposed in the Chen et al (2011) paper that you cite, which fails to take biological variation into account at all. You not need to use estimateDisp, or any of its variants, which are only for replicated data.

The dispersion value of 0.4 that you use in your code is higher than I have ever seen for published mRNA-seq data, although a high value like that could perhaps be appropriate for single cell RNA-seq. The example in Section 2.11 of the edgeR User's uses dispersion = 0.4^2 = 0.16, which is a typical sort of value for independent human samples in my experience.

Suggestion 3 from the edgeR User's Guide suggests a rough way of trying to conservatively estimate the dispersion. That would go like this:

group <- factor(c("C","T"))
d <- DGEList(counts=df, group=group)
keep <- filterByExpr(d)
d <- d[keep,,keep.lib.sizes=FALSE]
d <- calcNormFactors(d, method="TMMwzp")
d <- estimateGLMCommonDisp(d, method="deviance", robust=TRUE, subset=NULL)
res <- exactTest(d)
topTags(res)

This method uses estimateGLMCommonDisp to fit a single-group model to both libraries, and to estimate the dispersion allow for a few genes to be outliers. This method does however need a large number of genes to work with, which you don't seem to have