Sample size estimation in R
Nana • 0
@16757986
Last seen 9 months ago
United States

Hello, I am trying to estimate the sample size for an RNAseq experiment I am planning, using the ssizeRNA package in R. This package uses average read counts, dispersion, the proportion of DEGs, and the total number of genes mapped to estimate sample size based on power. Here is a link to the vignette (https://cran.r-project.org/web/packages/ssizeRNA/vignettes/ssizeRNA.pdf). I used a publicly available dataset somewhat related to my topic/tissue of interest to estimate the parameters needed, similar to what they did in the last part of the vignette.

However, I am getting really high numbers per group (400+). I am not sure if I am doing it right, and not many people seem to estimate sample sizes before RNAseq experiments. I also noticed that RNAseq papers in humans use a relatively high number of samples per group, but nothing as high as what the analysis gave me. Has anyone used this package before? Are there any tips you can give? Or are there other tools/packages/websites that you can recommend for this? Thanks

ssize • 831 views
@james-w-macdonald-5106
Last seen 14 minutes ago
United States

I have used ssize quite extensively. It's actually really simple: you provide a vector of SD values per gene, and it tells you the N required for a given logFC. If you are getting 400+ samples, then you are likely either providing really large SD values or specifying too small a logFC that you want to identify. But unless you show code, all I can do is speculate.
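
How the required N depends on per-gene SD and logFC can be checked with a rough sketch in base R. This uses power.t.test() from the stats package as a stand-in, not the ssize API, and the SD values, logFC values, and significance level are hypothetical choices for illustration. The required n grows as the SD increases or as the logFC to be detected shrinks.

```r
# Base R only; this is NOT the ssize API, just the same idea via power.t.test().
# Hypothetical per-gene SDs on the log2 scale.
sds <- c(0.25, 0.5, 1.0)

# Samples per group needed to detect a given log2 fold change with 80% power.
# The significance level is an arbitrary illustrative choice and applies
# no formal multiple-testing adjustment.
n_per_group <- function(sd, logfc, alpha = 0.001, power = 0.8) {
  ceiling(power.t.test(delta = logfc, sd = sd,
                       sig.level = alpha, power = power)$n)
}

n_lfc1   <- sapply(sds, n_per_group, logfc = 1)     # two-fold change
n_lfc025 <- sapply(sds, n_per_group, logfc = 0.25)  # much smaller effect

print(data.frame(sd = sds, n_logFC_1 = n_lfc1, n_logFC_0.25 = n_lfc025))
```

Comparing the two columns row by row shows how quickly the per-group n climbs when the targeted effect is small relative to the SD.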


Hi James, thanks for your response. This is the code that I used:

library(ssizeRNA)
set.seed(2016)
# mu, dispersion, and fc are defined further down
size1 <- ssizeRNA_single(nGenes = 10000, pi0 = 0.8, m = 200, mu = mu,
                         disp = dispersion, fc = fc, fdr = 0.05,
                         power = 0.8, maxN = 20)
size1$ssize

For fold change I used the function provided in the vignette: fc <- function(x){exp(rnorm(x, log(2), 0.5*log(2)))}
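
To see what this fold-change function implies, here is a small sketch in base R. It draws from the same function and summarizes the result; the number of draws and the 1.5-fold cutoff are arbitrary choices made for illustration. The draws are log-normal and centred on a two-fold change, so a share of the simulated DE genes have fairly small effects, which may push the estimated sample size up.

```r
set.seed(2016)

# Fold-change generator quoted in the post (taken from the ssizeRNA vignette).
fc <- function(x) exp(rnorm(x, log(2), 0.5 * log(2)))

draws <- fc(10000)        # 10000 is an arbitrary number of draws
print(summary(draws))     # spread of the simulated fold changes
print(median(draws))      # sits near 2, the centre on the log scale

# Theoretical share of draws below a 1.5-fold change under this distribution.
p_below_1.5 <- pnorm(log(1.5), mean = log(2), sd = 0.5 * log(2))
print(p_below_1.5)
```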

For dispersion and mu, I calculated them from a publicly available dataset, GSE1285873, using this code:

library(edgeR)
# Sharpton2019: gene IDs in rows, samples in columns, raw counts
counts <- as.matrix(Sharpton2019)
# Make duplicated sample names unique
if (any(duplicated(colnames(counts)))) {
  colnames(counts) <- make.unique(colnames(counts), sep = ".")
}
dge <- DGEList(counts = counts)
dge <- calcNormFactors(dge)
# Estimate the common dispersion, then the tagwise (per-gene) dispersion
dge <- estimateCommonDisp(dge)
dge <- estimateTagwiseDisp(dge)
dispersion <- dge$tagwise.dispersion
# Per-gene mean of the raw counts
mu <- rowMeans(Sharpton2019)

where Sharpton2019 is an expression dataset I generated, with gene IDs as rows and the read counts for each sample as columns.
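
Before passing per-gene means and dispersions to a sample size tool, it can help to check that the two vectors line up and to look at their distributions. The sketch below is base R only and runs on simulated counts, not the dataset above. The simulated matrix, the low-count cutoff of 5, and the method-of-moments dispersion estimate are all assumptions made for illustration; the moment estimate is not edgeR's tagwise estimator.

```r
set.seed(1)

# Simulated stand-in for a genes x samples count matrix (hypothetical data).
n_genes   <- 500
n_samples <- 12
true_mu   <- rgamma(n_genes, shape = 1, rate = 0.01)  # per-gene means, average ~100
true_disp <- 0.2
counts <- matrix(rnbinom(n_genes * n_samples,
                         mu = rep(true_mu, n_samples),
                         size = 1 / true_disp),
                 nrow = n_genes)

mu_hat <- rowMeans(counts)
v      <- apply(counts, 1, var)

# Hypothetical low-count filter; very low means make dispersion estimates unstable.
keep  <- mu_hat >= 5
mu_in <- mu_hat[keep]

# Crude method-of-moments dispersion from Var = mu + disp * mu^2,
# truncated at zero. Illustration only, not edgeR's estimator.
disp_in <- pmax((v[keep] - mu_in) / mu_in^2, 0)

# Basic sanity checks on the inputs.
stopifnot(length(mu_in) == length(disp_in),
          all(is.finite(disp_in)),
          all(mu_in > 0))
print(summary(mu_in))
print(summary(disp_in))
```

If the dispersion summary is dominated by large values from weakly expressed genes, the required sample size will tend to be large.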

In your experience, is it necessary to use a preliminary dataset or is it okay to use the default values for dispersion and average read counts?


Oh wait. You used ssize as the tag, so I assumed you had a typo when you called it ssizeRNA. That's a CRAN package, so you are in the wrong place. This support site is meant for Bioconductor packages. You might try asking on biostars.org instead.


Alright. I have also posted on Biostars and am waiting for a response. I just looked up ssize, and it seems to be exclusively for microarray data. Do you know of a similar Bioconductor package that can be used for RNAseq sample size estimation?


If you are using the limma-voom pipeline, ssize is fine.


You could also try PROPER.
