I'm a master's student and I'm new to RNA-Seq analysis. I have an RNA-Seq dataset which I was analyzing using Galaxy. I ran Cufflinks on Galaxy and I have the RPKM file. I understand that DESeq and edgeR are only suited to raw counts, not normalized values. How do I use the RPKM values in Bioconductor to perform differential gene expression analysis?
My understanding of the RPKM measure is that it was intended to make the data amenable to analysis with conventional modeling methods. I don't know that this is actually true, and people like Lior Pachter, who were early proponents of this measure, seem to have decided that TPM is a more reasonable measure than RPKM or FPKM, so the data you have in hand may not be considered particularly useful these days.
So if that is all you have, then I think the conventional thing to do is just use something like limma, and pretend that RPKM values are reasonable inputs. But do note that tools like Cufflinks spend a lot of time trying to tease out differences in isoform expression. In other words, instead of giving you some measure of the expression of a gene, Cufflinks is trying to say how much of each possible isoform of that gene is being expressed. That is sort of old tech these days as well, as tools like salmon or kallisto will do an arguably better job at much faster speeds, and will also give you TPM, which you can then just round to the nearest integer and use with edgeR or DESeq2.
Or if you really just want to summarize at the gene level, ignoring transcript-level differences, you could use something like Subread and featureCounts to get counts. But all of this assumes you have access to the BAM files.
Just to add a caveat: You can't go from FPKM or TPM to a count-like thing. The count carries information about precision, while the TPM alone does not (note that software using or reporting TPM typically keeps track of precision internally). To give a concrete example of the problem with going from FPKM/TPM to counts: as you increase sequencing depth, NB methods become more confident in quantifying the difference across conditions on the log scale, because that is built into the statistical distributions (likewise for voom, where the method estimates the precision and uses it for weighting in the linear model). However, unlike the count, the TPM (in expectation) will stay flat with increasing sequencing depth.
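To make the depth argument concrete, here is a minimal Python sketch. The transcript lengths and counts are made up for illustration, and the 1/sqrt(count) relative error assumes plain Poisson sampling, which is a simplification of the NB model mentioned above.

```python
import math

def tpm(counts, lengths_kb):
    """Transcripts per million from raw counts and transcript lengths in kb."""
    rates = [c / l for c, l in zip(counts, lengths_kb)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]

# Hypothetical values, not taken from any real dataset.
lengths_kb = [1.0, 2.0, 0.5]
shallow = [10, 40, 5]              # counts at low sequencing depth
deep = [c * 10 for c in shallow]   # same library sequenced 10x deeper

tpm_shallow = tpm(shallow, lengths_kb)
tpm_deep = tpm(deep, lengths_kb)

# Under a Poisson assumption the relative standard error of a count
# shrinks as 1 / sqrt(count), so deeper sequencing gives more precision.
rel_se_shallow = [1 / math.sqrt(c) for c in shallow]
rel_se_deep = [1 / math.sqrt(c) for c in deep]

print(tpm_shallow, tpm_deep)
print(rel_se_shallow, rel_se_deep)
```

The two TPM vectors agree even though the deeper library has ten times the reads, while the relative error of the counts drops. That lost precision is what cannot be recovered by rounding TPM.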
Good point. I should have said that salmon (and possibly kallisto, although I haven't used it) will give you estimated counts/transcript that you could round and use.
Depending on the organism, I think it can be reasonable to use estimated gene counts. However, estimated transcript counts are highly correlated within a gene, so I'd use a special treatment for a transcript-level analysis. See the BitSeq and EBSeq papers for references.
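As a rough illustration of what "estimated gene counts" means here, the sketch below sums per-transcript estimated counts up to the gene level and rounds once at the end. The transcript IDs, gene IDs and numbers are hypothetical, and this is only a toy version of the idea; in Bioconductor this kind of summarization is usually left to a dedicated import package.

```python
from collections import defaultdict

# Hypothetical transcript-to-gene map and estimated counts for one sample.
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
est_counts = {"tx1": 12.4, "tx2": 30.3, "tx3": 7.6}

def gene_counts(est_counts, tx2gene):
    """Sum estimated transcript counts per gene, then round to integers."""
    totals = defaultdict(float)
    for tx, n in est_counts.items():
        totals[tx2gene[tx]] += n
    # Round after summing so per-transcript rounding error does not accumulate.
    return {gene: round(n) for gene, n in totals.items()}

genes = gene_counts(est_counts, tx2gene)
print(genes)
```

Summing within a gene also sidesteps the within-gene correlation between transcript estimates, which is why the gene-level table is the easier one to hand to a count-based method.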
Thank you. However, I'm not trying to go from FPKM to counts. I'm just trying to figure out how to perform the differential expression analysis using only an xls sheet containing the FPKM values from 8 samples (6 normal and 2 defective). I tried using limma but I'm getting this error:
rowMeans(y$exprs, na.rm = TRUE) : 'x' must be numeric
Can you please help me solve this?
Thank you
Sure. Make sure that whatever you are feeding to limma is actually numeric.
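A common cause of that kind of error is a spreadsheet export in which some cells come through as text (an "NA" string, a thousands separator, a stray gene-name column). The sketch below, in Python and with made-up rows, shows the idea of locating such cells before modeling; in R the analogous first step would be to inspect the column classes with something like str() or sapply(df, class). The log2(FPKM + 1) transform at the end is an assumption about how one might prepare FPKM for a linear model, not something stated in this thread.

```python
import math

# Hypothetical rows as they might come out of a spreadsheet export.
rows = {
    "geneA": ["12.5", "0", "3.1"],
    "geneB": ["7.0", "NA", "1,204.3"],
}

def to_float(cell):
    """Return the cell as a float, or None if it is not a plain number."""
    try:
        return float(cell)
    except ValueError:
        return None

# Locate every cell that would not parse as a number.
bad = {
    (gene, i): value
    for gene, values in rows.items()
    for i, value in enumerate(values)
    if to_float(value) is None
}

# Keep only fully numeric rows and apply log2(FPKM + 1) (an assumed transform).
log_fpkm = {
    gene: [math.log2(float(v) + 1) for v in values]
    for gene, values in rows.items()
    if all(to_float(v) is not None for v in values)
}

print(bad)
print(log_fpkm)
```

Once the offending cells are identified, they can be fixed at the source or converted explicitly, so that the matrix handed to the modeling step contains numbers only.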