Hej Natalia!
For measuring expression:
(i) at the gene level, i.e. ignoring splice isoforms
Raw counts are "exact" counts, i.e. integer values: each count represents one mRNA fragment that was sequenced as one read. Counting these is very easy for unique regions, but becomes difficult when a read originating from an mRNA fragment can be mapped to several positions in the genome (these are known as multi-mapping reads; think e.g. of tandem repeat genes). You can only assign one count, as the read comes from a single mRNA fragment, but to which gene? Earlier tools simply discarded these reads (truth be told, multi-mapping reads affect very few loci in model organisms such as the fruit fly), but this is suboptimal, especially if you consider complex polyploid genomes or simply plants, which have undergone many whole-genome duplications (some very recent). To address this problem, methods were developed to statistically assign the reads to their most likely gene of origin; RSEM, for example, uses an expectation-maximisation approach. With such tools, you no longer obtain raw integer counts but estimates of abundance, i.e. non-integer values, each associated with a value representing the probability that the estimate is correct.
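To make the expectation-maximisation idea a bit more concrete, here is a toy sketch in Python. It is not how RSEM is implemented (RSEM also models things like transcript length and sequencing errors); the function name em_assign and the example reads are made up for illustration.

```python
def em_assign(reads, genes, n_iter=50):
    """Toy EM: 'reads' is a list of sets, each holding the genes a read maps to.

    Hypothetical helper for illustration only; it ignores gene length,
    mapping quality and everything else a real tool would model.
    """
    # Start from uniform relative abundances.
    theta = {g: 1.0 / len(genes) for g in genes}
    counts = {g: 0.0 for g in genes}
    for _ in range(n_iter):
        counts = {g: 0.0 for g in genes}
        # E-step: split each read across its candidate genes in
        # proportion to the current abundance estimates.
        for candidates in reads:
            total = sum(theta[g] for g in candidates)
            for g in candidates:
                counts[g] += theta[g] / total
        # M-step: re-estimate abundances from the fractional counts.
        n = sum(counts.values())
        theta = {g: counts[g] / n for g in genes}
    return counts


# 8 reads unique to geneA, 2 unique to geneB, 10 that map to both.
reads = [{"geneA"}] * 8 + [{"geneB"}] * 2 + [{"geneA", "geneB"}] * 10
expected = em_assign(reads, ["geneA", "geneB"])
print(expected)
```

In this made-up example the 10 ambiguous reads end up split in roughly the same 8:2 ratio as the unique reads, so the result is an expected count per gene rather than a tally of whole reads. With less tidy data those expected counts are generally non-integer.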
(ii) at the transcript level
The problem is the same, but possibly even more complex, as genes can have several isoforms that share multiple exons. Quantifying transcript expression then relies on the same principles I've described above and also gives you estimates of expression.
(iii) using pseudo-alignment methods
You may have heard of Kallisto or Salmon? These tools also provide estimates of expression, but do not rely on alignments. Instead, they process all the k-mers (substrings of a fixed length k) derived from your sequencing reads and look them up in an index created from all your transcripts, finding the most likely transcript the reads originated from.
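As a rough illustration of the k-mer lookup idea, here is a small Python sketch. It is only a toy: the real tools use far more sophisticated index structures and follow the lookup with a statistical abundance estimation step. The sequences, the value of k and the function names are all invented for the example.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return all substrings of length k in seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(transcripts, k):
    """Map every k-mer to the set of transcripts that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for km in kmers(seq, k):
            index[km].add(name)
    return index


def compatible_transcripts(read, index, k):
    """Intersect the transcript sets of all k-mers found in the read."""
    result = None
    for km in kmers(read, k):
        hits = index.get(km, set())
        result = hits if result is None else result & hits
    return result or set()


# Two made-up transcripts sharing their first 7 bases.
transcripts = {"tx1": "ACGTACGTTAGC", "tx2": "ACGTACGAAGGC"}
index = build_index(transcripts, k=5)

shared = compatible_transcripts("ACGTACG", index, k=5)    # in the shared part
unique = compatible_transcripts("TACGTTAG", index, k=5)   # only fits tx1
print(shared, unique)
```

The first read is compatible with both transcripts and the second only with tx1, all without ever aligning a read base by base. Ambiguous reads like the first one are then resolved statistically, much as described under (i).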
HTH,
Nico
Hi Nicolas,
Thank you so much for your kind explanation. You taught me a lot.
The reason I asked this question is that I only have the raw counts and I wanted to know whether EBseqHMM can handle them. According to your explanation, EBseqHMM requires estimates of gene expression (which are non-integer). Will it cause a problem if I use the raw counts? I think that since neither "estimates of gene expression" nor "raw counts" are normalized, EBseq-HMM shouldn't technically have issues with either of them. Right?
Hej Natalia!
How did you obtain your raw counts? This is important here. The reason is that raw counts and expression estimates have different properties, and while technically you can probably give raw counts as input to EBseqHMM and would probably get an output, the results may well be inaccurate, i.e. the assumptions the tool makes about the properties of its input data might be violated by the differences between raw counts and expression estimates. I'm not familiar at all with EBseqHMM, so I may be wrong. Let's hope someone familiar with the tool chimes in. Meanwhile, since EBseqHMM suggests using RSEM for calculating expression estimates, I think it would be good for you to compare how your raw counts were obtained with the pipeline suggested by RSEM (http://deweylab.github.io/RSEM/) and possibly regenerate expression estimates from your data.
Hey there,
I didn't generate the raw counts myself; they were available on NCBI. I just read the EBseq (not EBseq-HMM) vignette. Let's see what is written there:
"The object data should be a matrix containing the expression values for each gene and each sample, where G is the number of genes and S is the number of samples. These values should exhibit raw counts, without normalization across samples. Counts of this nature may be obtained from RSEM [4], Cufflinks [6], or a similar approach. "
It seems that the EBseq package accepts raw counts, while suggesting RSEM as a way to obtain them. I think maybe I can use raw counts for EBseqHMM too.
EBseqHMM also assumes that read counts follow a negative binomial distribution. If non-integer counts obtained from RSEM are used, they can't follow an NB distribution. Are RSEM data normalized somehow (for example, as FPKM values)? If they are not normalized and are only estimates of gene expression, then it could make sense to use raw counts instead.
I took a look at the sample data in the EBseq-HMM package and the values were integers. What do you think?
Hej Natalia!
It sounds like you can indeed use raw counts. What confuses me here is the reference to RSEM, which is a tool to calculate expression estimates, but possibly it also reports raw counts.
As a side note, I would advise you to look up how the raw counts that you retrieved from NCBI (I guess from the SRA or GEO?) were created. The pre-processing of RNA-Seq data has a number of pitfalls and caveats that you want to be aware of, as these could significantly affect your results. We published guidelines on this (http://www.epigenesys.eu/en/protocols/bio-informatics/1283-guidelines-for-rna-seq-data-analysis) 2 years ago, but they are still pretty accurate.
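If you want to double-check what kind of values a downloaded matrix actually contains, a quick sanity check could look like the Python sketch below. The two small matrices and the function name is_integer_matrix are made up for illustration; you would load your own table instead.

```python
def is_integer_matrix(rows, tol=1e-9):
    """Return True if every value in a list-of-lists matrix is a whole number."""
    return all(abs(v - round(v)) <= tol for row in rows for v in row)


# Hypothetical example values, genes in rows and samples in columns.
raw_counts = [[10, 0, 253], [7, 12, 4]]
estimates = [[10.0, 0.37, 252.63], [7.0, 12.0, 4.0]]

print(is_integer_matrix(raw_counts))  # whole numbers only
print(is_integer_matrix(estimates))   # contains fractional values
```

Whole numbers throughout suggest raw counts, whereas fractional values point towards estimated abundances or some form of normalization. It says nothing about how the counts were produced, though.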
Good luck!
Nico
Hi
Thanks a lot for the comprehensive info. That's so kind of you. I'm going to read the paper you sent me. The read counts are from GEO (from a published paper in which they analyzed DE genes).
Best wishes!