Hey guys,
After benefiting from this forum quite a lot while setting up my pipeline, I have finally come to a point where I have to ask a question myself... the fact that I got this far says something about how amazing and informative the forums are, so thanks :)
My Challenge is:
I have a dataset of 36 independent samples: a control and two treatments, four time points for each treatment/control, and three replicates for each time point of each treatment/control, i.e. 3 x 4 x 3 = 36. The genome I'm working with is sequenced, but not very well annotated, and fairly big (3 Gb). I have mapped my reads using TopHat (via the Galaxy server usegalaxy.org) and discovered that ca. 13% of my reads map to multiple locations. Furthermore, the group of genes I'm especially interested in shares a high degree of homology, so I would very much like to distinguish between two transcripts even if they are ~95% similar.

I used StringTie (via Galaxy) to assemble a transcriptome for each sample, setting the -M option to 1 (to get all possible transcripts, even if they are 95% similar). For transcript abundance estimation, Salmon (the successor to Sailfish) was suggested to me as a program that can adequately deal with multi-mapped reads. As far as I understand it, Salmon distributes reads among transcripts until all reads are assigned uniquely, which is more likely to reveal the true transcript abundances than either having one gene for a group that shares 95% homology (as StringTie would do in its default mode) or counting all multi-mapped reads multiple times (as StringTie presumably does when run with -M=1?). So I'd like to merge my transcriptome assemblies for all samples and the reference annotation using Cuffmerge, and then feed this master transcriptome, together with the TopHat output, into Salmon.
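Just to spell out the design in R terms (the group and time-point labels below are placeholders, not my actual conditions):

```r
## Sketch of the 36-sample layout: 3 groups x 4 time points x 3 replicates.
pheno <- expand.grid(replicate = 1:3,
                     time      = c("t1", "t2", "t3", "t4"),
                     group     = c("control", "treatment1", "treatment2"))
nrow(pheno)                      # 36 samples
table(pheno$group, pheno$time)   # 3 replicates per group/time combination
```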
After Salmon, however, I would like to use the Ballgown package to explore my data in R, formulate hypotheses and test them. As far as I know, Ballgown is so far the only tool that adequately deals with time courses (via the spline fit), so it seems a very good choice for my dataset.
Now comes the question (sorry for the lengthy intro): how can I use the Salmon output in Ballgown? According to the Salmon documentation, I would ideally work with the abundance estimates (transcripts per million, TPM) that Salmon calculates. However, can I use TPM instead of the read counts in Ballgown? To me, that seems unlikely, because Ballgown was written for read count data - correct me if I'm wrong?
But even if I use Salmon's NumReads and make a file resembling a typical Ballgown input (as specified at https://github.com/alyssafrazee/ballgown), how would I use them? Instead of rcount, ucount, or mrcount? And will Ballgown work at all if not all three of these are specified?
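For reference, this is roughly how I imagine pulling Salmon's per-sample output into R (the file path is just a placeholder, and I'm not certain the column names are identical across Salmon versions):

```r
## Read one sample's Salmon output; recent versions write a tab-delimited
## quant.sf with columns Name, Length, EffectiveLength, TPM and NumReads.
quant <- read.delim("salmon_out/sample01/quant.sf",
                    header = TRUE, stringsAsFactors = FALSE)
head(quant)

## The estimated counts (NumReads), keyed by transcript ID, would be what
## I'd try to reshape into a count-style Ballgown input.
est_counts <- setNames(quant$NumReads, quant$Name)
```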
Finally, any suggestions on an alternative to Salmon or Ballgown that adequately handles multi-mapped reads as well as time courses would also be highly appreciated. However, I would very much prefer to work in Windows using R as far as possible, because my Linux skills are virtually zero (I will have to somehow survive the Salmon part anyway).
Thanks for your time if you've read this whole novel of a post ;)
Friederike
I don't know much about either Salmon or Ballgown, so I can't answer most of your questions, but I know Ballgown handles time courses using natural splines, and you can do the same thing in any differential expression testing package that uses a design matrix, including limma, edgeR, and DESeq2.
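To sketch what that could look like (untested; `counts` stands in for your transcript-by-sample count matrix, and the time/treatment vectors for your own sample table), a natural-spline time effect per group with limma-voom might be set up like this:

```r
library(splines)
library(edgeR)
library(limma)

## Hypothetical sample annotation: 3 groups x 4 time points x 3 replicates.
time      <- rep(c(2, 6, 12, 24), times = 9)                  # placeholder time points
treatment <- factor(rep(c("control", "trtA", "trtB"), each = 12))

## Natural spline basis for time, interacting with treatment so each group
## gets its own smooth time profile.
X <- ns(time, df = 3)
design <- model.matrix(~ treatment * X)

## 'counts' is a placeholder for your transcript x 36-sample count matrix.
dge <- calcNormFactors(DGEList(counts = counts))
v   <- voom(dge, design)
fit <- eBayes(lmFit(v, design))

## Transcripts whose time profile differs between treatments and control:
topTable(fit, coef = grep(":", colnames(design)), number = 20)
```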