The input data for EBseq-HMM
@nataliafghf-11494

Hi everybody,

I'm reading the manual (vignette) for the EBseq-HMM package. For the input data, it says:

"The object Data should be a G-by-S matrix containing the expression values for each gene and each sample, where G is the number of genes and S is the number of samples. These values should exhibit estimates of gene expression, without normalization across samples. Counts of this nature may be obtained from RSEM (Li and Dewey (2011)), Cufflinks (Trapnell et al. (2012)), or a similar approach."

Since I'm new to RNA-seq data analysis, I have some difficulty understanding exactly what they mean. As far as I know, read counts are estimates of gene expression. What is the nature of the counts obtained from RSEM or Cufflinks? Are they raw counts or pre-normalized values such as FPKM?

To simplify my question: is it OK to use "raw counts" as the input?

Thanks

@nicolas-delhomme-6252
Sweden

Hej Natalia!

For measuring expression:

(i)  at the gene level, i.e. ignoring splice isoforms

Raw counts are "exact" counts, i.e. integer values. Each raw count represents one mRNA fragment that was sequenced as one read. Counting these is straightforward for unique regions, but it becomes difficult when a read originating from an mRNA fragment can be mapped to several positions in the genome (these are known as multi-mapping reads; think e.g. of tandemly repeated genes). You can only assign one count, as the read comes from a single mRNA fragment, but to which gene? Earlier tools simply discarded these reads (truth be told, multi-mapping reads affect very few loci in model organisms such as the fruit fly), but this is suboptimal, especially for complex polyploid genomes or for plants, which have undergone many whole-genome duplications (some very recent).

To address this problem, methods were developed to statistically assign reads to their most likely gene of origin; RSEM, for example, uses an expectation-maximisation approach. With such tools you no longer obtain raw integer counts but estimates of abundance, i.e. non-integer values associated with a value representing the probability that the estimate is correct.
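To make the idea of statistically assigning multi-mapping reads concrete, here is a minimal toy sketch of an expectation-maximisation loop in Python. It is not RSEM's actual model (which also accounts for transcript length, read quality, fragment size and more); the function name `em_expected_counts`, the gene names and the read set are all made up for illustration.

```python
def em_expected_counts(reads, genes, n_iter=100):
    """Toy EM: share each read among the genes it is compatible with.

    reads: list of tuples, each holding the genes one read maps to.
    Assumption: gene length and mapping quality are ignored.
    """
    # Start from equal relative abundances for all genes.
    theta = {g: 1.0 / len(genes) for g in genes}
    expected = {g: 0.0 for g in genes}
    for _ in range(n_iter):
        expected = {g: 0.0 for g in genes}
        # E-step: split each read in proportion to current abundances.
        for compatible in reads:
            total = sum(theta[g] for g in compatible)
            for g in compatible:
                expected[g] += theta[g] / total
        # M-step: re-estimate abundances from the expected counts.
        n_reads = sum(expected.values())
        theta = {g: expected[g] / n_reads for g in genes}
    return expected


# Hypothetical data: 6 reads unique to geneA, 3 unique to geneB,
# and 5 multi-mapping reads compatible with both.
reads = [("geneA",)] * 6 + [("geneB",)] * 3 + [("geneA", "geneB")] * 5
counts = em_expected_counts(reads, ["geneA", "geneB"])
print(counts)
```

With this toy input the 5 ambiguous reads end up shared roughly 2:1, giving about 9.33 expected reads for geneA and 4.67 for geneB: the totals still add up to 14 reads, but the per-gene values are no longer integers.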

(ii) at the transcript level

The problem is the same, but possibly even more complex, as genes have several isoforms that may share multiple exons. Quantifying transcript expression then relies on the same principles I've described above and also gives you estimates of expression.

(iii) using pseudo-alignment methods

You may have heard of Kallisto or Salmon? These tools also provide estimates of expression, but do not rely on alignments. Instead, they process all the k-mers (substrings of a fixed length) derived from your sequencing reads and look them up in an index created from all your transcripts, finding the most likely transcript each read originated from.
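The k-mer lookup can be sketched in a few lines of Python. This is only a toy illustration of the principle, not how Kallisto or Salmon are actually implemented (they use far more compact index structures and follow the lookup with a statistical abundance estimation step); the transcript names and sequences below are invented.

```python
from collections import defaultdict


def build_index(transcripts, k):
    """Map every k-mer to the set of transcripts that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def compatible_transcripts(read, index, k):
    """Intersect the transcript sets of all k-mers found in the read."""
    result = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k], set())
        result = set(hits) if result is None else result & hits
    return result or set()


# Hypothetical transcripts sharing their first few bases.
transcripts = {"t1": "ACGTACGGTCA", "t2": "ACGTACCTTGA"}
k = 4
index = build_index(transcripts, k)

shared = compatible_transcripts("ACGTAC", index, k)  # from the common prefix
unique = compatible_transcripts("TACGGT", index, k)  # only present in t1
print(shared, unique)
```

A read from the shared region stays compatible with both transcripts, which is exactly the ambiguity that the estimation step then has to resolve; a read covering a distinguishing region is compatible with a single transcript.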

HTH,

Nico


Hi Nicolas,

Thank you so much for your kind explanation. You taught me a lot.

The reason I asked is that I only have raw counts, and I wanted to know whether EBseqHMM can handle them. According to your explanation, EBseqHMM expects estimates of gene expression (which are non-integer). Will it cause a problem if I use raw counts? I think that since neither "estimates of gene expression" nor "raw counts" are normalized, EBseq-HMM shouldn't technically have issues with either of them. Right?


Hej Natalia!

How did you obtain your raw counts? This matters here, because raw counts and expression estimates have different properties. While you can technically give raw counts as input to EBseqHMM and would probably get an output, the results will very likely be inaccurate, i.e. the assumptions the tool makes about the properties of the data might be violated by the differences between raw counts and expression estimates. I'm not at all familiar with EBseqHMM, so I may be wrong. Let's hope someone familiar with the tool chimes in. Meanwhile, since EBseqHMM suggests using RSEM for calculating expression estimates, I think it would be good for you to compare how your raw counts were obtained with the pipeline suggested by RSEM (http://deweylab.github.io/RSEM/) and possibly regenerate expression estimates from your data.


Hey there,

I didn't generate the raw counts myself; they were available on NCBI. I just read the EBseq (not EBseq-HMM) vignette. Here is what it says:

"The object data should be a G-by-S matrix containing the expression values for each gene and each sample, where G is the number of genes and S is the number of samples. These values should exhibit raw counts, without normalization across samples. Counts of this nature may be obtained from RSEM [4], Cufflinks [6], or a similar approach."

It seems that the EBseq package accepts raw counts, while suggesting RSEM to obtain them. So I think maybe I can use raw counts for EBseqHMM too.

EBseqHMM also assumes that read counts follow a negative binomial distribution. If non-integer counts obtained from RSEM are used, they can't follow an NB distribution. Are RSEM data normalized somehow (for example, as FPKM values)? If they are not normalized and are only estimates of gene expression, then it would make sense to use raw counts instead.

I took a look at the sample data in the EBseq-HMM package and the values were integers. What do you think?
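For what it's worth, this kind of check is easy to automate. Below is a small Python sketch (the function name `looks_like_raw_counts` and both example matrices are made up) that tests whether every value in a genes-by-samples matrix is a non-negative whole number. Passing the check does not prove the data are un-normalized, and failing it does not say whether the values are expected counts or something like FPKM; it only flags non-integer input.

```python
def looks_like_raw_counts(matrix, tol=1e-9):
    """Return True if every value is a non-negative whole number.

    matrix: list of rows (genes), each a list of per-sample values.
    """
    return all(
        v >= 0 and abs(v - round(v)) <= tol
        for row in matrix
        for v in row
    )


# Hypothetical 2-gene x 3-sample matrices.
raw = [[10, 0, 3], [250, 198, 301]]
estimates = [[9.33, 0.0, 3.0], [250.5, 198.2, 300.9]]

print(looks_like_raw_counts(raw))        # integer-valued matrix
print(looks_like_raw_counts(estimates))  # contains fractional values
```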


Hej Natalia!

It sounds like you can indeed use raw counts. What confuses me here is the reference to RSEM, which is a tool for calculating expression estimates, but possibly it also reports raw counts.

As a side note, I would advise you to look up how the raw counts that you retrieved from NCBI (I guess from the SRA or GEO?) were created. The pre-processing of RNA-Seq data has a number of pitfalls and caveats that you want to be aware of, as these could significantly affect your results. We published guidelines on this (http://www.epigenesys.eu/en/protocols/bio-informatics/1283-guidelines-for-rna-seq-data-analysis) 2 years ago, but they are still pretty accurate.

Good luck!

Nico


Hi

Thanks a lot for the comprehensive info. That's so kind of you. I'm going to read the paper you sent me. The read counts are from GEO (from a published paper in which the authors analyzed DE genes).

Best wishes!
