Question: StringTie + Ballgown: handling biological replicates
2.6 years ago by
bhawley199110
bhawley199110 wrote:

Hi all,

I've been trying to analyse an RNA-seq dataset, and I decided to try the newer HISAT2 > StringTie > Ballgown approach instead of TopHat2 > Cufflinks > CummeRbund etc.

I'm having real trouble working out how to handle my biological replicates, as there doesn't seem to be much documentation or discussion on these newer tools. It seems like most people would use Cuffnorm, and it's easy to see why: you can very easily specify which replicates belong to each sample. I'm sure there's a way to do this in Ballgown, but I'm far too inexperienced to spot it, so any help would be fantastic.

modified 19 months ago by linda.boshans0 • written 2.6 years ago by bhawley199110
2.6 years ago by
Alyssa Frazee200
San Francisco, CA, USA
Alyssa Frazee200 wrote:

Ballgown handles biological replicates. The idea is to run StringTie on each replicate (either biological or technical) separately using the -B option (for "ballgown"), constructing the output directory structure as specified in https://github.com/alyssafrazee/ballgown#loading-data-into-r. This will give you a separate output directory for each replicate. When the data is loaded into R from there, ballgown and the associated statistical tests (in "stattest") assume only that each sample (each separate output directory) is independent of the others. (So the samples can be either a set of technical replicates from one biological sample or a set of biological replicates.)
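Before loading, it can help to confirm that every per-replicate directory really contains the tables StringTie writes with -B. The sketch below is base R only and uses a toy layout in a temporary directory; the sample names rep1 and rep2 and the helper incomplete_samples are hypothetical, and the ballgown() call is left commented out because it needs real StringTie output.

```r
# The five tables StringTie writes next to the GTF when run with -B.
ctab_files <- c("e_data.ctab", "i_data.ctab", "t_data.ctab",
                "e2t.ctab", "i2t.ctab")

# Return the names of sample directories under data_dir that lack any table.
# (Hypothetical helper, not part of ballgown.)
incomplete_samples <- function(data_dir) {
  sample_dirs <- list.dirs(data_dir, recursive = FALSE)
  ok <- vapply(sample_dirs,
               function(d) all(file.exists(file.path(d, ctab_files))),
               logical(1))
  basename(sample_dirs[!ok])
}

# Toy layout: one directory per replicate (hypothetical names).
data_dir <- file.path(tempdir(), "ballgown_demo")
for (s in c("rep1", "rep2")) {
  dir.create(file.path(data_dir, s), recursive = TRUE, showWarnings = FALSE)
}
# Only rep1 gets its tables, so rep2 should be reported as incomplete.
file.create(file.path(data_dir, "rep1", ctab_files))
incomplete <- incomplete_samples(data_dir)
print(incomplete)

# With real StringTie output, loading would then look something like:
# library(ballgown)
# bg <- ballgown(dataDir = data_dir, samplePattern = "rep", meas = "all")
```

Each directory that passes the check becomes one sample in the resulting ballgown object.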

If you have both biological and technical replicates, one way to handle this with ballgown is to read in the data as you normally would (one directory per bio/tech rep), but include a column in "pData" denoting bio rep ID. Then you could combine expression values across tech reps (e.g. using average expression) to get a data set with one row per bio rep, and you could use that data set with the stattest function.
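A minimal sketch of the averaging step described above, in base R. The toy matrix stands in for what texpr(bg, meas = "FPKM") would return, and the column names (runA to runD) and the pData columns bio_rep and group are made up for illustration. The final stattest call is commented out and is an assumption about how the averaged table would be passed in (via the gowntable and pData arguments); check it against the stattest help page before relying on it.

```r
# Toy FPKM matrix: rows are transcripts, columns are sequencing runs.
# With ballgown this would come from texpr(bg, meas = "FPKM").
fpkm <- matrix(c(10, 12, 20, 22,
                  0,  2,  4,  6),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("tx1", "tx2"),
                               c("runA", "runB", "runC", "runD")))

# One row per run; bio_rep says which biological replicate each run came from.
pd <- data.frame(id      = colnames(fpkm),
                 bio_rep = c("b1", "b1", "b2", "b2"),
                 group   = c(0, 0, 1, 1),
                 stringsAsFactors = FALSE)

# The rows of pd must be in the same order as the columns of fpkm.
stopifnot(identical(pd$id, colnames(fpkm)))

# Average the technical replicates within each biological replicate.
bio_ids  <- unique(pd$bio_rep)
bio_fpkm <- do.call(cbind, lapply(bio_ids, function(b) {
  rowMeans(fpkm[, pd$bio_rep == b, drop = FALSE])
}))
colnames(bio_fpkm) <- bio_ids

# Collapse pd to one row per biological replicate, in the same order.
bio_pd <- pd[!duplicated(pd$bio_rep), c("bio_rep", "group")]
names(bio_pd)[1] <- "id"
rownames(bio_pd) <- NULL

print(bio_fpkm)

# Assumed usage with ballgown (not run here):
# res <- ballgown::stattest(gowntable = bio_fpkm, pData = bio_pd,
#                           feature = "transcript", covariate = "group")
```

The averaging happens after the data is loaded and before stattest is called, so the test sees one column per biological replicate.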

Hi Alyssa, I'm new to R and ballgown. I have 16 samples run through hisat2 with the --dta option and stringtie with the -B option. I made pheno_data and a ballgown directory containing the 16 sample directories, each with its .ctab and .gtf files. It ran in ballgown OK, and I made .csv files for genes and transcripts. What I need to do now is tell ballgown how to handle the 16 samples. There are 2 biological reps per sample, two treatment groups (ctr and bmp2), and 4 time points. Could you give me some help on how to make the pheno_data csv file? I need to deal with the variance in the biological reps first, then the stats of the difference between the ctr and bmp2 treatments, then the stats of the changes between time points and treatment. Thanks so much, enjoying the program. steveharris
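One way such a pheno_data table could be laid out is sketched below: one row per sample directory and one column per covariate. The sample IDs (e.g. ctr_t1_rep1) and the time labels 1 to 4 are placeholders, since the real directory names and time points are not given here. The ballgown calls are commented out and use stattest's covariate and adjustvars arguments as an assumed starting point, not a recommended analysis.

```r
# 2 treatments x 4 time points x 2 biological replicates = 16 samples.
design <- expand.grid(rep       = 1:2,
                      time      = 1:4,              # placeholder time labels
                      treatment = c("ctr", "bmp2"),
                      stringsAsFactors = FALSE)

# Placeholder IDs; in practice these must equal the sample directory names.
pheno_data <- data.frame(
  id        = paste(design$treatment,
                    paste0("t", design$time),
                    paste0("rep", design$rep), sep = "_"),
  treatment = design$treatment,
  time      = design$time,
  rep       = design$rep,
  stringsAsFactors = FALSE
)

# Write the table out as a CSV (to a temporary directory in this sketch).
csv_path <- file.path(tempdir(), "pheno_data.csv")
write.csv(pheno_data, csv_path, row.names = FALSE)
print(head(pheno_data))

# Assumed usage once bg has been loaded (not run here):
# pData(bg) <- pheno_data[match(sampleNames(bg), pheno_data$id), ]
# res <- stattest(bg, feature = "transcript", meas = "FPKM",
#                 covariate = "treatment", adjustvars = "time")
```

Because each directory is treated as an independent sample, the two biological replicates per condition need no separate preprocessing step; they supply the within-group variation that the test uses.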

2.5 years ago by
jnpitt0
jnpitt0 wrote:

Alyssa, can you please demonstrate how you would add the bio rep ID to your built-in extdata, say, to treat your 20 provided samples as 10 independent biological replicates from each of 2 different treatments, and then use stattest to look for statistically significant changes between the 2 treatments?

2.5 years ago by
jnpitt0
jnpitt0 wrote:

just to answer my own question from the ballgown docs:

pData(bg) = data.frame(id=sampleNames(bg), group=rep(c(1,0), each=10))

Here, group assigns each sample to group 1 or 0 (the first 10 samples to group 1, the last 10 to group 0); subsequent stattest calls compare groups 0 and 1.

2.5 years ago by
Alyssa Frazee200
San Francisco, CA, USA
Alyssa Frazee200 wrote:

Yep, the above is the correct answer. You can edit pData directly. Each column of the data frame is a covariate and each row is a sample; the group each sample belongs to should be denoted by a covariate (column) exactly as you wrote.
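Since each row of pData has to describe the right sample, it can be safer to build the group column by sample name than by position. The sketch below is base R only: sample_names stands in for sampleNames(bg) (the bundled example samples are assumed to be named sample01 to sample20), and the lookup table group_of is hypothetical. The ballgown calls are commented out.

```r
# Stand-in for sampleNames(bg); names assumed to be sample01..sample20.
sample_names <- sprintf("sample%02d", 1:20)

# Hypothetical lookup table from sample name to group, deliberately listed
# in a different order than the samples themselves.
group_of <- c(setNames(rep(0, 10), sprintf("sample%02d", 20:11)),
              setNames(rep(1, 10), sprintf("sample%02d", 10:1)))

# Every sample must have a group before the table is built.
stopifnot(all(sample_names %in% names(group_of)))

# Index the lookup by name so row i always describes sample i.
pd <- data.frame(id    = sample_names,
                 group = unname(group_of[sample_names]),
                 stringsAsFactors = FALSE)
print(table(pd$group))

# Assumed usage with ballgown (not run here):
# pData(bg) <- pd
# res <- stattest(bg, feature = "transcript", meas = "FPKM",
#                 covariate = "group")
```

With this lookup the result matches the one-liner above: the first 10 samples land in group 1 and the last 10 in group 0.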

2.5 years ago by
jnpitt0
jnpitt0 wrote:

Another thing that wasn't clear is that ballgown also requires the sample IDs (the final directory name in each path) to be unique. For example, a samples vector such as

filelist <-c("/data/wildtype/sample1", "/data/wildtype/sample2","/data/wildtype/sample3", "/data/mutant/sample1","/data/mutant/sample2", "/data/mutant/sample3")

bg = ballgown(samples= filelist,meas='all')

will NOT be treated as independent samples. However, renaming the directories as follows will:

filelist <-c("/data/wildtype/sample1", "/data/wildtype/sample2","/data/wildtype/sample3", "/data/mutant/sample4","/data/mutant/sample5", "/data/mutant/sample6")
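A quick base R check can catch this before calling ballgown(). It assumes, consistent with the behaviour described above, that the sample ID is the last component of each path; the helper name duplicated_ids is made up for this sketch.

```r
# Return the sample IDs (final path components) that occur more than once.
# (Hypothetical helper, not part of ballgown.)
duplicated_ids <- function(paths) {
  ids <- basename(paths)
  unique(ids[duplicated(ids)])
}

# First layout from the post: sample1..sample3 appear under both conditions.
clashing <- c("/data/wildtype/sample1", "/data/wildtype/sample2",
              "/data/wildtype/sample3", "/data/mutant/sample1",
              "/data/mutant/sample2", "/data/mutant/sample3")

# Renamed layout from the post: every final directory name is distinct.
renamed <- c("/data/wildtype/sample1", "/data/wildtype/sample2",
             "/data/wildtype/sample3", "/data/mutant/sample4",
             "/data/mutant/sample5", "/data/mutant/sample6")

clash_ids   <- duplicated_ids(clashing)
renamed_ids <- duplicated_ids(renamed)
print(clash_ids)

# Refuse to continue if any ID is repeated.
stopifnot(length(renamed_ids) == 0)
```

Running the check on the first vector reports the repeated names, while the renamed vector passes.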

19 months ago by
linda.boshans0 wrote:

Hi Alyssa,

I have a similar question to what was posted here, except I have 6 biological replicates (2 samples, 3 replicates each) and 4 technical replicates per biological replicates (for a total of 24). I have done as you stated for denoting the replicates in pData. How do I go about combining the expression values and getting the average expression? And at what step of the analysis do I do that for?

Thanks.