Question

StringTie + Ballgown: handling biological replicates

1


bhawley1991 ▴ 10

@bhawley1991-9841

Last seen 9.7 years ago

Hi all,

I've been trying to analyse an RNA-seq dataset, and I decided to try the newer HISAT2>StringTie>Ballgown approach instead of Tophat2>Cufflinks>CummeRbund etc.

I'm having real trouble working out how to handle my biological replicates, as there doesn't seem to be much documentation or discussion on these newer tools. It seems like most people would use Cuffnorm, and it's easy to see why, as you can very easily specify which replicates belong to each sample. I'm sure there's a way to do this in Ballgown, but I'm far too inexperienced to spot it, so any help would be fantastic.

Thanks in advance.

ballgown stringtie • 6.7k views

ADD COMMENT • link updated 8.7 years ago by linda.boshans • 0 • written 9.8 years ago by bhawley1991 ▴ 10

score 1 · Answer 1 · 2016-03-04

1


Alyssa Frazee ▴ 210

@alyssa-frazee-6710

Last seen 5.0 years ago

San Francisco, CA, USA

Ballgown handles biological replicates. The idea is to run StringTie on each replicate (either biological or technical) separately using the -B option (for "ballgown"), which will give you a separate output directory for each replicate. The output directory structure should look like the one specified here: https://github.com/alyssafrazee/ballgown#loading-data-into-r. When the data is loaded into R from there, ballgown and the associated statistical tests (in "stattest") assume only that each sample (each separate output directory) is independent of the others. (So the samples can be either a set of technical replicates from one biological sample or a set of biological replicates.)
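As a minimal sketch of the loading step described above, the example below reads the 20 example sample directories that ship with the ballgown package (its "extdata" folder), following the package's own documentation. It assumes ballgown is installed from Bioconductor; for your own data, dataDir would be the folder holding one StringTie -B output directory per replicate, and samplePattern a pattern matching those directory names.

```r
library(ballgown)

# Folder containing one sub-directory per sample (here: the package's example data)
data_directory <- system.file("extdata", package = "ballgown")

# Each directory matching samplePattern is loaded as one independent sample
bg <- ballgown(dataDir = data_directory, samplePattern = "sample", meas = "all")

# One name per sample directory
sampleNames(bg)
```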

If you have both biological and technical replicates, one way to handle this with ballgown is to read in the data as you normally would (one directory per bio/tech rep), but include a column in "pData" denoting bio rep ID. Then you could combine expression values across tech reps (e.g. using average expression) to get a data set with one row per bio rep, and you could use that data set with the stattest function.
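One way to sketch the averaging step described above, using base R only. The matrix, sample names, and the "biorep" and "group" columns are hypothetical toy values; with real data the matrix would come from something like texpr(bg, "FPKM") and the phenotype table from pData(bg). Note that in this sketch samples are columns, so the result has one column (not row) per bio rep.

```r
# Toy expression matrix: rows are transcripts, columns are samples (tech reps)
expr <- matrix(
  c(10, 12, 20, 22,
     5,  7,  1,  3,
     0,  2,  8, 10),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("tx1", "tx2", "tx3"), c("s1", "s2", "s3", "s4"))
)

# Hypothetical phenotype table: "biorep" marks which tech reps share a bio rep
pd <- data.frame(
  id     = colnames(expr),
  biorep = c("A", "A", "B", "B"),
  group  = c(0, 0, 1, 1)
)

# Column indices of the tech reps belonging to each bio rep
cols_by_biorep <- split(seq_len(ncol(expr)), pd$biorep)

# Average expression across tech reps: one column per bio rep
avg_expr <- sapply(cols_by_biorep, function(idx) {
  rowMeans(expr[, idx, drop = FALSE])
})

# Matching phenotype table with one row per bio rep
avg_pd <- unique(pd[, c("biorep", "group")])

avg_expr
avg_pd
```

The averaging would happen after loading the data and setting pData, and before the statistical test. As far as I know, stattest can take a custom expression table and phenotype table through its gowntable and pData arguments instead of a ballgown object, but please check ?stattest for the exact requirements before relying on that.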

ADD COMMENT • link 9.8 years ago Alyssa Frazee ▴ 210

0


Hi Alyssa,

I have a similar question to what was posted here, except I have 6 biological replicates (2 samples, 3 replicates each) and 4 technical replicates per biological replicate (for a total of 24). I have done as you stated for denoting the replicates in pData. How do I go about combining the expression values and getting the average expression? And at what step of the analysis should I do that?

Thanks.

ADD REPLY • link 8.7 years ago linda.boshans • 0

0


Hi Alyssa, I'm new to R and ballgown. I have 16 samples run through hisat2 with --dta and stringtie with the -B option, and I made pheno_data and a ballgown directory containing the 16 sample directories with the .ctab and .gtf files for each sample. It ran in ballgown OK, and I made .csv files for genes and transcripts. What I need to do now is tell ballgown how to handle the 16 samples. There are 2 biological reps per sample, two treatment groups (ctr and bmp2), and 4 time points. Could you give me some help on how to make the pheno_data csv file? I need to deal with the variance in the biological reps first, then the stats of the difference between the ctr and bmp2 treatments, then the stats of the changes between time points and treatment. Thanks so much, enjoying the program. steveharris

ADD REPLY • link 7.4 years ago harris • 0

score 0 · Answer 2 · 2016-04-24

0


jnpitt • 0

@jnpitt-10172

Last seen 9.6 years ago

Alyssa, can you please demonstrate how you would add the bio rep ID to your built-in extdata, say, to treat your 20 provided samples as 10 independent biological replicates from 2 different treatments, and then use stattest to look at the statistically significant changes between the 2 treatments?

ADD COMMENT • link 9.6 years ago jnpitt • 0

score 0 · Answer 3 · 2016-04-24

0


jnpitt • 0

@jnpitt-10172

Last seen 9.6 years ago

Just to answer my own question, from the ballgown docs:

pData(bg) = data.frame(id=sampleNames(bg), group=rep(c(1,0), each=10))

Here group= assigns each sample to either group 1 or group 0. Subsequent stattest calls that use covariate='group' then compare groups 0 and 1.
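Put together with the package's example data, the full sequence might look like the sketch below. It follows the ballgown documentation's example (20 example samples, first 10 in group 1 and last 10 in group 0) and assumes ballgown is installed; the choice of feature='transcript' and meas='FPKM' is just one option.

```r
library(ballgown)

# Load the 20 example samples shipped with the package
data_directory <- system.file("extdata", package = "ballgown")
bg <- ballgown(dataDir = data_directory, samplePattern = "sample", meas = "all")

# One row per sample; "group" is the covariate that defines the two groups
pData(bg) <- data.frame(id = sampleNames(bg), group = rep(c(1, 0), each = 10))

# Test each transcript for differential expression between group 0 and group 1
stat_results <- stattest(bg, feature = "transcript", meas = "FPKM",
                         covariate = "group")
head(stat_results)
```

The result is a data frame with one row per transcript, including p-values and q-values for the group comparison.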

ADD COMMENT • link 9.6 years ago jnpitt • 0

score 0 · Answer 4 · 2016-04-26

0


Alyssa Frazee ▴ 210

@alyssa-frazee-6710

Last seen 5.0 years ago

San Francisco, CA, USA

Yep, the above is the correct answer. You can edit pData directly. Each column of the data frame is a covariate and each row is a sample; the group each sample belongs to should be denoted by a covariate (column) exactly as you wrote.

ADD COMMENT • link 9.6 years ago Alyssa Frazee ▴ 210

score 0 · Answer 5 · 2016-04-26

Another thing that wasn't clear is that ballgown also requires the sample IDs (the final directory names) to be unique. For example, a samples vector

filelist <- c("/data/wildtype/sample1", "/data/wildtype/sample2", "/data/wildtype/sample3", "/data/mutant/sample1", "/data/mutant/sample2", "/data/mutant/sample3")

when loaded into ballgown thus:

bg = ballgown(samples=filelist, meas='all')

will NOT be treated as independent samples. However, renaming the directories thus will:

filelist <- c("/data/wildtype/sample1", "/data/wildtype/sample2", "/data/wildtype/sample3", "/data/mutant/sample4", "/data/mutant/sample5", "/data/mutant/sample6")
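A quick way to catch this before loading is sketched below in base R. It assumes, as the example above suggests, that the sample ID is taken from the last component of each directory path; the paths are the ones from the post.

```r
filelist <- c("/data/wildtype/sample1", "/data/wildtype/sample2",
              "/data/wildtype/sample3", "/data/mutant/sample1",
              "/data/mutant/sample2", "/data/mutant/sample3")

# Last path component of each directory (assumed to become the sample ID)
sample_ids <- basename(filelist)

# IDs that appear more than once and would therefore collide
dup <- sample_ids[duplicated(sample_ids)]

if (length(dup) > 0) {
  message("Duplicated sample IDs: ", paste(unique(dup), collapse = ", "))
}
```

If any IDs are reported, rename the directories so that every final directory name is distinct before calling ballgown().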
