Question

RNA-seq read simulation for benchmarking differential expression/splicing variance

0

Entering edit mode

Endre Sebestyén ▴ 70

@endre-sebestyen-2707

Last seen 4 months ago

Hungary, Budapest

I was wondering if anybody knows of a package or some method to simulate RNA-seq sequencing reads from samples where we have difference in the variance of gene expression or more importantly, splicing as opposed to the classical difference in mean gene expression or mean splicing values.

I was thinking about getting read counts for transcripts using some of the distributions from the extraDistr package, and feeding these read counts to polyester.

Something like:

library("extraDistr")
library("polyester")

# different read counts simulated somehow
count <- c(10, 50, 100)

# different alpha values depending on uniform random splicing or a dominant transcript
alphas <- c(1, 2, 10)

# simulate read counts
count_random <- rdirmnom(10, count, alphas)

# run polyester with the simulated counts
simulate_experiment_countmat(fasta = NULL, gtf = NULL, seqpath = NULL,
  count_random, outdir = ".", paired = TRUE, seed = NULL, ...)

Anybody ever tried something similar? Are there tools for simulating reads with differential variance?

polyester simulation RNA-seq syntheticdata • 1.5k views

ADD COMMENT • link updated 3.4 years ago by Gordon Smyth 50k • written 3.4 years ago by Endre Sebestyén ▴ 70

0

Entering edit mode

Hey, polyester author here -- that's exactly why we made the function simulate_experiment_countmat! It's for generating reads with whatever kind of differential expression you want -- if you can make a matrix of it, you can simulate RNA-seq with that pattern.

When polyester was written, there wasn't a specific tool for simulating any kind of differential expression (means or variances), but it's been like six years, so something new may have come along in that time! I will let others chime in if they know of a better approach, but can confirm this is the type of situation we were thinking about when we wrote simulate_experiment_countmat.

ADD REPLY • link 3.4 years ago Alyssa Frazee ▴ 210

score 0 · Answer 1 · 2020-11-17

0

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 8 hours ago

WEHI, Melbourne, Australia

You can simulate expression values (in the form of TPMs, not counts) from any distribution you like, then simulate sequence reads as if you were sequencing an RNA sample with transcripts in those abundances using Rsubread::simReads.

ADD COMMENT • link 3.4 years ago Gordon Smyth 50k