RNA-seq read simulation for benchmarking differential expression/splicing variance
1
0
Entering edit mode
@endre-sebestyen-2707
Last seen 14 months ago
Hungary, Budapest

I was wondering if anybody knows of a package or some method to simulate RNA-seq sequencing reads from samples where we have difference in the variance of gene expression or more importantly, splicing as opposed to the classical difference in mean gene expression or mean splicing values.

I was thinking about getting read counts for transcripts using some of the distributions from the extraDistr package, and feeding these read counts to polyester.

Something like:

library("extraDistr")
library("polyester")

# different read counts simulated somehow
count <- c(10, 50, 100)

# different alpha values depending on uniform random splicing or a dominant transcript
alphas <- c(1, 2, 10)

count_random <- rdirmnom(10, count, alphas)

# run polyester with the simulated counts
simulate_experiment_countmat(fasta = NULL, gtf = NULL, seqpath = NULL,
count_random, outdir = ".", paired = TRUE, seed = NULL, ...)


Anybody ever tried something similar? Are there tools for simulating reads with differential variance?

polyester simulation RNA-seq syntheticdata • 573 views
0
Entering edit mode

Hey, polyester author here -- that's exactly why we made the function simulate_experiment_countmat! It's for generating reads with whatever kind of differential expression you want -- if you can make a matrix of it, you can simulate RNA-seq with that pattern.

When polyester was written, there wasn't a specific tool for simulating any kind of differential expression (means or variances), but it's been like six years, so something new may have come along in that time! I will let others chime in if they know of a better approach, but can confirm this is the type of situation we were thinking about when we wrote simulate_experiment_countmat.

0
Entering edit mode
@gordon-smyth
Last seen 25 minutes ago
WEHI, Melbourne, Australia

You can simulate expression values (in the form of TPMs, not counts) from any distribution you like, then simulate sequence reads as if you were sequencing an RNA sample with transcripts in those abundances using Rsubread::simReads.