Question

Sample size estimation in R

Nana • 0

@16757986

Last seen 15 months ago

United States

Hello, I am planning an RNAseq experiment and am trying to estimate the sample size with the ssizeRNA package in R. The package estimates the sample size needed for a given power from the average read counts, the dispersion, the proportion of DEGs, and the total number of genes mapped. Here is a link to the vignette (https://cran.r-project.org/web/packages/ssizeRNA/vignettes/ssizeRNA.pdf). I estimated the required parameters from a publicly available dataset somewhat related to my topic/tissue of interest, similar to what is done in the last part of the vignette.

However, I am getting really high numbers per group (400+). I am not sure I am doing it right, and not many people seem to estimate sample sizes before RNAseq experiments. I also noticed that RNAseq papers in humans use a relatively high number of samples per group, but nothing as high as what the analysis gave me. Has anyone used this package before, and are there any tips you can give? Are there other tools/packages/websites that you can recommend for this? Thanks.

ssize • 1.2k views

ADD COMMENT • link updated 15 months ago by James W. MacDonald 67k • written 15 months ago by Nana • 0

score 1 · Answer 1 · 2023-07-10

James W. MacDonald 67k

@james-w-macdonald-5106

Last seen 2 days ago

United States

I have used ssize quite extensively. It's actually really simple: you provide a vector of SD values per gene, and it tells you the N required to detect a given logFC. If you are getting 400+ samples, then you are likely either providing really large SD values or specifying a logFC that is too small to detect with fewer samples. But unless you show code, all I can do is speculate.
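To see why the SD and logFC inputs drive the result, here is a minimal sketch for a single gene using base R's `power.t.test` as a stand-in. It is not the ssize or ssizeRNA calculation, and the SD and log fold change values are made up for illustration. The point is only that the required n per group rises with the SD and falls as the targeted effect gets larger.

```r
# Stand-in illustration with stats::power.t.test (two-sample t-test),
# NOT the ssize/ssizeRNA method. All input values are hypothetical.
n_per_group <- function(log_fc, sd) {
  power.t.test(delta = log_fc, sd = sd, sig.level = 0.05, power = 0.8)$n
}

# Same SD, shrinking log2 fold change: required n grows quickly
n_large_effect <- n_per_group(log_fc = 1.0, sd = 0.7)
n_small_effect <- n_per_group(log_fc = 0.2, sd = 0.7)

# Same log2 fold change, growing SD: required n grows as well
n_low_sd  <- n_per_group(log_fc = 1.0, sd = 0.5)
n_high_sd <- n_per_group(log_fc = 1.0, sd = 1.5)

print(ceiling(c(large_effect = n_large_effect, small_effect = n_small_effect,
                low_sd = n_low_sd, high_sd = n_high_sd)))
```

A real RNAseq calculation also has to account for multiple testing across genes and for count-based dispersion, so the numbers from this sketch are not comparable to ssizeRNA output.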

ADD COMMENT • link 15 months ago James W. MacDonald 67k

Hi James, thanks for your response. This is the code that I used:

library(ssizeRNA)
set.seed(2016)
size1 <- ssizeRNA_single(nGenes = 10000, pi0 = 0.8, m = 200, mu = mu,
 disp = dispersion, fc = fc, fdr = 0.05, power = 0.8, maxN = 20)
size1$ssize

For the fold change I used fc <- function(x){exp(rnorm(x, log(2), 0.5*log(2)))}, as provided in the vignette.

For dispersion and mu, I calculated them from the publicly available dataset GSE1285873 using this code:
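To get a feel for what this fold-change function implies, the following sketch draws from it and checks how many simulated DE genes have only a modest fold change. It uses base R only; the 100,000 draws and the 1.5-fold cutoff are arbitrary choices for illustration. Because log(fc) is normal with mean log(2), the median fold change is 2, but a noticeable share of genes falls below 1.5-fold, and such small effects tend to push required sample sizes up.

```r
# Fold-change generator as given in the post (log-normal around 2-fold)
fc <- function(x) exp(rnorm(x, log(2), 0.5 * log(2)))

set.seed(2016)
fc_draws <- fc(1e5)  # 1e5 draws is an arbitrary illustrative choice

# Empirical median should sit close to 2
median_fc <- median(fc_draws)

# Share of draws below the cutoff (1.5-fold, chosen only for illustration)
cutoff <- 1.5
share_below_cutoff <- mean(fc_draws < cutoff)

# Exact value of the same share from the normal CDF on the log scale
exact_below_cutoff <- pnorm(log(cutoff), mean = log(2), sd = 0.5 * log(2))

print(c(median = median_fc,
        empirical = share_below_cutoff,
        exact = exact_below_cutoff))
```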

library(edgeR)

# Sharpton2019: raw counts, gene ids in rows and samples in columns
counts <- as.matrix(Sharpton2019)

# Make duplicated sample names unique
if (any(duplicated(colnames(counts)))) {
  colnames(counts) <- make.unique(colnames(counts), sep = ".")
}

# Estimate tagwise (per-gene) dispersions with edgeR
dge <- DGEList(counts = counts)
dge <- calcNormFactors(dge)
dge <- estimateCommonDisp(dge)
dge <- estimateTagwiseDisp(dge)
dispersion <- dge$tagwise.dispersion

# Mean raw count per gene across all samples
mu <- rowMeans(Sharpton2019)

where Sharpton2019 is an expression dataset I generated, with gene ids as rows and the number of counts for each sample as columns.
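One thing worth checking in a setup like this is which samples and genes feed into mu. The ssizeRNA documentation describes `mu` as the mean counts in the control group, whereas the code above averages over all samples and keeps every gene. The sketch below shows one way to filter all-zero genes and take control-group means. It runs on simulated counts with base R only; the matrix, the `group` labels, and all parameter values are hypothetical stand-ins, not the poster's data.

```r
# Simulated genes x samples count matrix (hypothetical stand-in data)
set.seed(1)
n_genes <- 500
n_samples <- 8
true_mu <- rgamma(n_genes, shape = 0.5, scale = 200)
counts <- matrix(
  rnbinom(n_genes * n_samples, mu = rep(true_mu, n_samples), size = 5),
  nrow = n_genes
)

# Hypothetical sample annotation: first half control, second half treated
group <- rep(c("control", "treated"), each = n_samples / 2)

# Drop genes with zero counts in every sample
keep <- rowSums(counts) > 0
counts_kept <- counts[keep, , drop = FALSE]

# Per-gene mean count in the control samples only
mu_control <- rowMeans(counts_kept[, group == "control", drop = FALSE])

# The mean vector must line up one-to-one with the retained genes
stopifnot(length(mu_control) == nrow(counts_kept))
print(summary(mu_control))
```

With real data, `group` would come from the sample annotation of the public dataset, and the dispersion vector would need to be subset with the same `keep` filter so that both inputs stay aligned gene by gene.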

In your experience, is it necessary to use a preliminary dataset or is it okay to use the default values for dispersion and average read counts?

ADD REPLY • link 15 months ago Nana • 0

Oh wait. You used ssize as the tag, so I assumed you had a typo when you called it ssizeRNA. That's a CRAN package, so you are in the wrong place. This support site is meant for Bioconductor packages. You might try asking on biostars.org instead.

ADD REPLY • link 15 months ago James W. MacDonald 67k

Alright. I have also posted on Biostars and am waiting for a response. I just looked up ssize, and it seems to be exclusive to microarray data. Do you know of a similar Bioconductor package that can be used for RNAseq sample size estimation?

ADD REPLY • link 15 months ago Nana • 0

If you are using the limma-voom pipeline, ssize is fine.

ADD REPLY • link 15 months ago James W. MacDonald 67k

You could also try PROPER.

ADD REPLY • link 15 months ago James W. MacDonald 67k