Question

analyzing RNA-Seq transcriptomic data (GEO accession: GSE96058)

H. Z. Amini ▴ 10

Hello Everyone,

I would like to process and analyze the RNA-Seq transcriptomic data at the following link (GEO accession: GSE96058). I have searched the web extensively, but I couldn't find an applicable strategy for analyzing this data. I would very much appreciate any help on how to start analyzing this type of data.

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE96058

Thank you so much for your help in advance!

GEO RNASeqR Transcriptomics • 2.4k views

updated 3.2 years ago by James W. MacDonald 65k • written 3.2 years ago by H. Z. Amini ▴ 10

score 3 · Accepted Answer · 2021-02-23

Kevin Blighe ★ 3.9k

Hi,

The GEO record indicates that the data made available by the authors are in log2-transformed FPKM expression units:

Gene expression data in FPKM were generated using cufflinks 2.2.1 (default parameters except --GTF, --frag-bias-correct GRCh38.fa, --multi-read-correct, --library-type fr-firststrand, --total-hits-norm, --max-bundle-frags 10000000). The resulting data was post-processed by collapsing on 30,865 unique gene symbols (sum of FPKM values of each matching transcript), adding to each expression measurement 0.1 FPKM, and performing a log2 transformation.

In this case, one is very much limited in what one can do. It is undesirable to start a differential expression analysis from FPKM units or their log2-transformed equivalents. I would suggest following the limma-trend pipeline, though, taking advice from this post by my colleague and keeping in mind that the authors of the data in question have already added a pseudocount of 0.1: A: Differential expression of RNA-seq data using limma and voom()
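To make the quoted post-processing concrete, here is a minimal Python sketch of the three steps the GEO record describes: summing transcript FPKM per gene symbol, adding 0.1 FPKM, and taking log2. The function names and the toy input are hypothetical and are not taken from GSE96058 or from the authors' actual code; the sketch only illustrates why the pseudocount matters when reading the published values.

```python
import math
from collections import defaultdict

PSEUDOCOUNT = 0.1  # FPKM offset stated in the GEO processing description


def collapse_and_log2(transcript_fpkm):
    """Sum FPKM per gene symbol, add the pseudocount, and take log2.

    transcript_fpkm: iterable of (gene_symbol, fpkm) pairs, one per transcript.
    """
    per_gene = defaultdict(float)
    for symbol, fpkm in transcript_fpkm:
        per_gene[symbol] += fpkm
    return {s: math.log2(v + PSEUDOCOUNT) for s, v in per_gene.items()}


def back_to_fpkm(log2_value):
    """Invert the transform: recover the summed FPKM from a published value."""
    return 2.0 ** log2_value - PSEUDOCOUNT


# Hypothetical toy input: two transcripts of one gene, and one silent gene.
toy = [("GENE_A", 3.0), ("GENE_A", 4.9), ("GENE_B", 0.0)]
logged = collapse_and_log2(toy)
for symbol, value in logged.items():
    print(symbol, round(value, 3), round(back_to_fpkm(value), 3))
```

Under these assumptions, a gene with zero FPKM does not appear as zero in the published matrix but as log2(0.1), roughly -3.32, so that value acts as the floor of the data and is worth bearing in mind when filtering lowly expressed genes before limma-trend.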

Kevin


An (admittedly computationally and temporally expensive) alternative would be to get the raw data from SRA, use something more modern than the Tuxedo Suite to get counts, and then analyze those counts with edgeR or DESeq2.

Reply by James W. MacDonald 65k • 3.2 years ago

Indeed, better to get the raw data.

Reply by Kevin Blighe ★ 3.9k • 3.2 years ago

Thank you so much for your reply. As I'm new to this field, could you please explain in more detail how you determined that the data are in "log2 FPKM expression units"?

Also, on the page I linked, could you please let me know which item is the raw data?

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE96058

Thanks again for your time!

Reply by H. Z. Amini ▴ 10 • 3.2 years ago

The samples are all listed on that page, and each one has a link. If you click on one you get the page for that sample, and under Data Processing there is a description of what the authors did. At the very bottom of that page is a link called BioSample, which takes you to the BioSample record. From there, the link that says something about all samples takes you to the corresponding listing.

Since there are roughly 3000 samples, you can't see them all at once. How you would get those data is not a simple thing to explain, and it is certainly beyond the scope of this site, which is intended for Bioconductor package help. You would probably want to install NCBI's toolset and do some scripting to get all the sample IDs and then download the data.

Reply by James W. MacDonald 65k • 3.2 years ago
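As a rough illustration of the "do some scripting" step, the Python sketch below turns a plain-text list of run accessions into SRA Toolkit command lines. It rests on several assumptions that are not in the thread: that you have already exported the run accessions for the study to a text file, one per line (the SRA Run Selector can produce such a list), that `prefetch` and `fasterq-dump` from the SRA Toolkit are installed, and that paired reads should be split into separate files. The accessions shown are placeholders, not real runs from GSE96058, and the helper names are hypothetical.

```python
import shlex

RUN_PREFIXES = ("SRR", "ERR", "DRR")  # run accession prefixes used by SRA/ENA/DDBJ


def parse_accessions(lines):
    """Return unique run accessions in file order, skipping anything else."""
    seen = set()
    accessions = []
    for line in lines:
        token = line.strip()
        if token.startswith(RUN_PREFIXES) and token not in seen:
            seen.add(token)
            accessions.append(token)
    return accessions


def download_commands(accession, outdir="fastq"):
    """Build (but do not run) the SRA Toolkit commands for one run."""
    return [
        ["prefetch", accession],
        ["fasterq-dump", "--split-files", "--outdir", outdir, accession],
    ]


# Placeholder accessions standing in for the exported list (assumption).
example_lines = ["SRR0000001\n", "\n", "SRR0000002\n", "SRR0000001\n"]
for acc in parse_accessions(example_lines):
    for cmd in download_commands(acc):
        print(shlex.join(cmd))
```

The commands are only printed here so they can be inspected first; with around 3000 samples you would likely feed them to `subprocess.run` or a cluster scheduler in batches, and check the available disk space before starting.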