Question

How to Analyse Datasets from GEO

0

andrej.stoll.de ▴ 10

@andrejstollde-23490

Last seen 5.7 years ago

Hello,

I am not a bioinformatician, but I have done a lot of reading and experimenting with RNA-seq over the last 2 years, and I think I have developed a reasonable understanding of the necessary steps and possible pitfalls. Recently I found several interesting datasets on the Gene Expression Omnibus (GEO), and after downloading them I realized that the supplied measure of gene expression was TPM. This seems to be the case for a lot of datasets on GEO. As I understand it, TPM is not a good metric for differential expression analysis. DESeq2 also won't accept TPM as input, because the values are not integers. The only truly clean way I can think of to perform the analysis would be to download the raw files from SRA and do the whole QC, alignment and counting from scratch.
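To illustrate why TPM is awkward as DESeq2 input, here is a minimal Python sketch of the usual TPM definition (length-normalise each count, then rescale so the sample sums to one million). The gene lengths and counts are made-up toy values, and `counts_to_tpm` is a hypothetical helper name, not part of any package.

```python
def counts_to_tpm(counts, lengths_kb):
    """Convert raw read counts for one sample to TPM."""
    # Reads per kilobase: divide each gene's count by its length in kb.
    rpk = [c / l for c, l in zip(counts, lengths_kb)]
    # Rescale so that the values of the sample sum to one million.
    scale = sum(rpk) / 1_000_000
    return [r / scale for r in rpk]


# Toy data (assumed for illustration): three genes, two sequencing depths.
lengths_kb = [1.0, 2.0, 4.0]
shallow = [10, 40, 80]
deep = [100, 400, 800]  # same proportions, ten times more reads

tpm_shallow = counts_to_tpm(shallow, lengths_kb)
tpm_deep = counts_to_tpm(deep, lengths_kb)
```

Both samples end up with the same TPM values even though one has ten times as many reads. The sequencing depth that count-based tools rely on is no longer visible after the conversion, and it cannot be recovered from the TPM table alone.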

So my question is what would be an elegant/simple and clean way of analysing such GEO datasets?

geo rnaseq deseq2 tpm • 3.3k views

updated 5.8 years ago by Kevin Blighe ★ 4.0k • written 5.8 years ago by andrej.stoll.de ▴ 10

score 1 · Answer 1 · 2020-05-08

1

Kevin Blighe ★ 4.0k

@kevin

Last seen 6 weeks ago

The Cave, 181 Longwood Avenue, Boston, …

The tutorial by my Biostars colleague, ATpoint, is quite useful for downloading FASTQ data from ENA. I noticed recently, however, that even SRA is now hosting FASTQ files, although they can be difficult to obtain in an automated fashion. Note that a lot of studies have a record on ENA, GEO, and SRA. To find the ENA record, search by the study's 'BioProject ID'.
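As a rough sketch of the BioProject route (this is not the method from the tutorial mentioned above), the ENA Portal API has a `filereport` endpoint that lists the runs and FASTQ locations for a study. The accession `PRJNA000000`, the sample response text, and the helper names below are placeholders assumed for illustration; the code only builds the request URL and parses a response, so fetching it is left to you.

```python
import csv
import io
from urllib.parse import urlencode

ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def filereport_url(project_accession):
    """Build the ENA file report URL for one BioProject/study accession."""
    params = {
        "accession": project_accession,
        "result": "read_run",
        "fields": "run_accession,fastq_ftp",
        "format": "tsv",
    }
    return ENA_FILEREPORT + "?" + urlencode(params)


def parse_filereport(tsv_text):
    """Map each run accession to its list of FASTQ locations."""
    runs = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        # Paired-end runs list both files in one field, separated by ';'.
        runs[row["run_accession"]] = [u for u in row["fastq_ftp"].split(";") if u]
    return runs


# Made-up response in the tab-separated layout requested above.
example_tsv = (
    "run_accession\tfastq_ftp\n"
    "RUN0001\tftp.example.org/RUN0001_1.fastq.gz;ftp.example.org/RUN0001_2.fastq.gz\n"
    "RUN0002\tftp.example.org/RUN0002.fastq.gz\n"
)

url = filereport_url("PRJNA000000")  # placeholder accession
runs = parse_filereport(example_tsv)
```

Opening the generated URL in a browser, or fetching it with any download tool, should return a table like `example_tsv` for a real accession, which can then be fed to a downloader of your choice.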

I have come across others who are content to work with TPM by log-transforming the values with a pseudocount of 1, i.e., log10(TPM + 1); however, as you imply, obtaining the FASTQs will provide the ultimate flexibility in your analysis.
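For reference, a minimal Python sketch of that transformation, using made-up TPM values and a hypothetical helper name (`log10_tpm_plus_one`):

```python
import math


def log10_tpm_plus_one(tpm_values):
    """Apply log10(TPM + 1) to a list of TPM values."""
    # Adding 1 keeps genes with TPM = 0 defined (they map to 0).
    return [math.log10(v + 1.0) for v in tpm_values]


# Toy expression table (assumed values): gene -> TPM per sample.
tpm_table = {
    "geneA": [0.0, 9.0, 99.0],
    "geneB": [999.0, 0.0, 9.0],
}

log_table = {gene: log10_tpm_plus_one(vals) for gene, vals in tpm_table.items()}
```

The result is a continuous, log-scale matrix that is convenient for a first look (clustering, PCA, heatmaps). The values are still not counts, so they are not suitable as DESeq2 input.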

Kevin

5.8 years ago by Kevin Blighe ★ 4.0k

1

During my search I also came across the suggestion of using log10(TPM + 1). One could use this approach to get a first glimpse of the data and, depending on the result, decide whether it is worthwhile doing the analysis from scratch.

I don't have access to a lot of computing power, as I'm not a bioinformatician and mainly work on simple office hardware. This is why I am trying to avoid doing the whole alignment myself: it takes me about 1-2 h per 10 million reads.

Thanks for your answer!

5.8 years ago by andrej.stoll.de ▴ 10