Question: Using RPKM data in Bioconductor for gene expression analysis
2.1 years ago, AB (United States) wrote:

Hi,

I'm a master's student and I'm new to RNA-Seq analysis. I have an RNA-Seq dataset that I have been analyzing in Galaxy. I ran Cufflinks on Galaxy and now have the RPKM file. I understand that DESeq and edgeR are only suited to raw counts, not normalized values. How can I use the RPKM values in Bioconductor to perform differential gene expression analysis?

Thanks,

Apoorva

2.1 years ago, James W. MacDonald (United States) wrote:

My understanding of the RPKM measure is that it was intended to make the data amenable to analysis with conventional modeling methods. I don't know that this is actually true, and people like Lior Pachter, who were early proponents of this measure, seem to have decided that TPM is a more reasonable measure than RPKM or FPKM, so the data you have in hand may not be considered particularly useful these days.

So if that is all you have, then I think the conventional thing to do is just use something like limma, and pretend that RPKM are reasonable inputs. But do note that tools like cufflinks spend a lot of time trying to tease out differences in isoform expression. In other words, instead of giving you some measure of the expression of a gene, cufflinks is trying to say how much of each possible isoform of that gene is being expressed. That is sort of old tech these days as well, as aligners like salmon or kallisto will do an arguably better job at much faster speeds, and will also give you TPM, which you can then just round to the nearest integer and use with edgeR or DESeq2.

Or if you really just want to summarize at the gene level, ignoring transcriptional differences, you could use something like subread and featureCounts to get counts. But all of this assumes you have access to the BAM files.
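To make the RPKM versus TPM distinction concrete, here is a minimal Python sketch. The thread itself contains no code, so the function name and the toy values are made up for illustration. It rescales one sample's RPKM/FPKM values so that they sum to one million, which is how the two units relate when all features of the sample are included. This only changes the unit; it does not recover counts.

```python
def rpkm_to_tpm(rpkm_values):
    """Rescale one sample's RPKM/FPKM values so that they sum to 1e6 (TPM)."""
    total = sum(rpkm_values)
    if total <= 0:
        raise ValueError("RPKM values must have a positive sum")
    return [value / total * 1e6 for value in rpkm_values]


# Hypothetical RPKM values for the same four genes in two samples.
# The per-sample RPKM totals differ (100 vs 50).
sample_a = [10.0, 20.0, 30.0, 40.0]
sample_b = [5.0, 10.0, 15.0, 20.0]

tpm_a = rpkm_to_tpm(sample_a)
tpm_b = rpkm_to_tpm(sample_b)

print(sum(sample_a), sum(sample_b))  # RPKM totals are sample-specific
print(sum(tpm_a), sum(tpm_b))        # TPM totals are fixed by construction
```

In this toy case the two samples have the same relative abundances, which is visible in TPM but obscured by the differing RPKM totals.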


Just to add a caveat: you can't go from FPKM or TPM to a count-like thing. The count has information about the precision, while the TPM alone does not (note that software using or reporting TPM typically keeps track of precision internally). To give a concrete example of the problem with going from FPKM/TPM to counts: as you increase sequencing depth, NB methods become more confident in quantifying the difference across conditions on the log scale, because that is built into the statistical distributions (likewise for voom, where the method estimates the precision for weighting in the linear model). However, unlike the count, the TPM (in expectation) will stay flat with increasing sequencing depth.

Reply written 2.1 years ago by Michael Love
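The depth argument in the reply above can be sketched numerically. This is an illustration under stated assumptions, not anything taken from the tools discussed: the transcript lengths and read fractions are invented, the counts are expected values rather than simulated reads, and a Poisson model stands in for the negative binomial to keep the arithmetic simple.

```python
import math


def tpm_from_counts(counts, lengths):
    """Length-normalize counts, then rescale so the values sum to 1e6."""
    rates = [count / length for count, length in zip(counts, lengths)]
    total = sum(rates)
    return [rate / total * 1e6 for rate in rates]


def summarize(depth, read_fractions, lengths):
    """Expected counts, TPM, and Poisson coefficient of variation at one depth."""
    expected_counts = [depth * fraction for fraction in read_fractions]
    tpm = tpm_from_counts(expected_counts, lengths)
    # For a Poisson count with mean m, sd / mean = 1 / sqrt(m).
    poisson_cv = [1 / math.sqrt(count) for count in expected_counts]
    return expected_counts, tpm, poisson_cv


# Hypothetical transcript lengths (bp) and share of reads per transcript.
lengths = [1000, 2000, 500]
read_fractions = [0.5, 0.3, 0.2]

shallow = summarize(1_000, read_fractions, lengths)
deep = summarize(100_000, read_fractions, lengths)

for label, (counts, tpm, cv) in (("shallow", shallow), ("deep", deep)):
    print(label, counts, [round(t) for t in tpm], [round(c, 3) for c in cv])
```

The counts and their relative precision change with depth, while the TPM values are the same in both runs, so a TPM table alone cannot tell a count-based method how much to trust each value.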

Good point. I should have said that salmon (and possibly kallisto, although I haven't used it) will give you estimated counts/transcript that you could round and use.

Reply written 2.1 years ago by James W. MacDonald

Depending on the organism, I think it can be reasonable to use estimated gene counts. However, estimated transcript counts are highly correlated within a gene, so I'd use a special treatment for a transcript-level analysis. See the BitSeq and EBSeq papers for references.

Reply written 2.1 years ago by Michael Love

Thank you. However, I'm not trying to go from FPKM to counts. I'm just trying to figure out how to perform the differential expression analysis using only an xls sheet containing the FPKM values from 8 samples (6 normal and 2 defective). I tried using limma, but I'm getting this error:

rowMeans(y$exprs, na.rm = TRUE) : 'x' must be numeric

Can you please help me solve this?

Thank you

Reply written 2.1 years ago by AB

Sure. Make sure that whatever you are feeding to limma is actually numeric.
 

Reply written 2.1 years ago by James W. MacDonald
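A common cause of a "must be numeric" error is that the table still contains a text column (such as gene names) or cells that a spreadsheet exported as text. The sketch below is a language-agnostic way to find such columns before loading the data for analysis. It is written in Python purely for illustration; the column names and values are hypothetical, and the actual fix would still be made to the object passed to limma in R.

```python
import csv
import io


def non_numeric_columns(rows, header):
    """Return names of columns with at least one value that float() rejects."""
    bad = []
    for index, name in enumerate(header):
        for row in rows:
            try:
                float(row[index])
            except ValueError:
                bad.append(name)
                break
    return bad


# Hypothetical tab-delimited export of an FPKM spreadsheet.
# "1,204.3" carries a thousands separator, so it does not parse as a number.
table_text = (
    "gene\tnormal_1\tnormal_2\tdefective_1\n"
    "GeneA\t12.5\t10.1\t3.2\n"
    "GeneB\t0.0\t1,204.3\t7.7\n"
)

reader = csv.reader(io.StringIO(table_text), delimiter="\t")
header = next(reader)
rows = list(reader)

bad_columns = non_numeric_columns(rows, header)
print(bad_columns)
```

Any column reported here would need to be moved out of the expression matrix (for identifiers) or cleaned up (for text-formatted numbers) before model fitting.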