I would like to analyze the RNA-Seq transcriptomic data at the following link (GEO accession: GSE96058). I have searched the web extensively, but I couldn't find an applicable strategy for analyzing these data. I would very much appreciate any help on how to get started with this type of data.
The authors indicate that the data they made available comprise log2 FPKM expression units:
Gene expression data in FPKM were generated using cufflinks 2.2.1 (default parameters except --GTF, --frag-bias-correct GRCh38.fa, --multi-read-correct, --library-type fr-firststrand, --total-hits-norm, --max-bundle-frags 10000000). The resulting data was post-processed by collapsing on 30,865 unique gene symbols (sum of FPKM values of each matching transcript), adding to each expression measurement 0.1 FPKM, and performing a log2 transformation.
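To make the post-processing quoted above concrete, here is a minimal Python sketch of the three steps it describes (sum FPKM per gene symbol, add 0.1, take log2), plus the inverse that recovers approximate FPKM from the deposited values. The gene names and numbers are made up for illustration, and the function names are my own, not anything the authors used.

```python
import math
from collections import defaultdict

PSEUDOCOUNT = 0.1  # FPKM added before the log2 step, per the GEO description


def collapse_and_log2(transcript_fpkm):
    """Sum FPKM over transcripts sharing a gene symbol, add the pseudocount, take log2."""
    gene_fpkm = defaultdict(float)
    for gene, fpkm in transcript_fpkm:
        gene_fpkm[gene] += fpkm
    return {gene: math.log2(total + PSEUDOCOUNT) for gene, total in gene_fpkm.items()}


def back_transform(log2_value):
    """Recover FPKM from log2(FPKM + 0.1); clamp tiny negative rounding errors to 0."""
    return max(2.0 ** log2_value - PSEUDOCOUNT, 0.0)


# Hypothetical (gene symbol, transcript FPKM) pairs, for illustration only
transcripts = [("GENE_A", 3.0), ("GENE_A", 4.9), ("GENE_B", 0.0)]

logged = collapse_and_log2(transcripts)
for gene, value in logged.items():
    print(gene, round(value, 4), round(back_transform(value), 4))
```

Note that a gene with zero expression ends up at log2(0.1), roughly -3.32, so the deposited matrix has a hard floor at that value rather than at 0.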
In this case, one is quite limited in what one can do. It is undesirable to start a differential expression analysis from FPKM units or their log2-transformed equivalents alone. I would suggest following the limma-trend pipeline, though, taking advice from this post by my colleague and keeping in mind that a pseudocount of 0.1 has already been added by the authors of the data in question: A: Differential expression of RNA-seq data using limma and voom()
An (admittedly computationally and temporally expensive) alternative would be to get the raw data from SRA and then use something more modern than the Tuxedo Suite to get counts and then use edgeR or DESeq2 to analyze.
"An (admittedly computationally and temporally expensive) alternative would be to get the raw data from SRA and then use something more modern than the Tuxedo Suite to get counts and then use edgeR or DESeq2 to analyze."
Indeed, better to get the raw data.
Thank you so much for your reply. As I'm new to this field, could you please explain in more detail how you determined that the data consist of "log2 FPKM expression units"?
And, following the link that I provided, could you please tell me which one is the raw data?
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE96058
Thanks again for your time!
The samples are all listed there, and they all have links. If you click on one, you get this page, and under Data Processing there is a description of what they did. At the very bottom of that page is a link called BioSample, which will bring you here. And if you click the link that says something about all samples, it goes here.
Since there are about 3000 samples, you can't see them all at once. How you would get those data is not simple to explain, and it is certainly beyond the scope of this site, which is intended for Bioconductor package help. You would probably want to install NCBI's toolset and do some scripting to get all the sample IDs and then download them.
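As a rough idea of what that scripting could look like, here is a Python sketch. It assumes you have exported a run table from the SRA Run Selector as CSV with a `Run` column, and that the SRA Toolkit's `prefetch` and `fasterq-dump` are installed. The accessions below are placeholders, not real runs from this series, and the helper names are hypothetical. The script only builds and prints the commands; it doesn't run anything.

```python
import csv
import io


def parse_run_accessions(run_table_text, column="Run"):
    """Return the unique run accessions from a CSV run table, in file order."""
    reader = csv.DictReader(io.StringIO(run_table_text))
    seen, runs = set(), []
    for row in reader:
        accession = (row.get(column) or "").strip()
        if accession and accession not in seen:
            seen.add(accession)
            runs.append(accession)
    return runs


def build_commands(runs, outdir="fastq"):
    """Build one prefetch and one fasterq-dump command (as argument lists) per run."""
    commands = []
    for accession in runs:
        commands.append(["prefetch", accession])
        commands.append(["fasterq-dump", accession, "--split-files", "--outdir", outdir])
    return commands


# Placeholder run table; in practice, read the exported file's contents instead
example_table = "Run,Sample Name\nSRR0000001,sample_1\nSRR0000002,sample_2\n"

runs = parse_run_accessions(example_table)
commands = build_commands(runs)
for command in commands:
    print(" ".join(command))
```

To actually execute the commands you could pass each list to `subprocess.run(command, check=True)`, but with roughly 3000 samples you would want to check disk space and probably process them in batches first.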