How to analyze pre-processed RNA-Seq data from GEO (normalized with the DESeq R package) for filtering and DE analysis
svlachavas ▴ 780
@svlachavas-7225
Germany/Heidelberg/German Cancer Resear…

Dear Community,

As part of a validation project, I have downloaded some processed RNA-Seq data from GEO, because I would like to test quickly whether the genes of a specific signature are:

1) Expressed above a minimal threshold
2) Differentially expressed between cancer and normal samples

The link to the processed dataset is the following:

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60052

A minimal import into R:

# Read the tab-separated, log2-normalized matrix; the first column holds the gene symbols
dataset <- read.csv("GSE60052_79tumor.7normal.normalized.log2.data.Rda.tsv", sep = "\t", header = TRUE, row.names = 1, check.names = FALSE)

              11A       12A      13A       14A      15A       16A       17A       18A
5S_rRNA  0.000000  0.000000 0.000000  0.000000 0.000000  0.000000  0.000000  5.402481
7SK     10.828115 11.803973 9.608837 10.419233 9.859020 11.048513 10.533401 10.255479
A1BG     5.615586  7.337469 3.867909  1.772974 5.634852  5.493924  5.964345  0.000000
A1CF     0.000000  0.000000 0.000000  0.000000 0.000000  0.000000  0.000000  3.080553
A2LD1    1.952621  0.000000 3.924493  0.000000 0.000000  0.000000  4.716418  5.080553
A2M      6.827090  5.587447 7.393978  5.473414 6.300433  5.567925  5.964345  9.124947


The only description of the RNA-Seq analysis that I found in the associated paper was the following:

"For RNASeq data, the average read count per mate was 50 million. RNA reads were mapped to the human genome (UCSC hg19; Feb 2009 release; Genome Reference Consortium GRCh37) using TopHat2 (v2.0.9) and the human reference gtf annotation file (GRCh37.68). Transcript counts were calculated and normalized using htseq-count and DESeq (v1.12.1). The DESeq negative binomial distribution was used to calculate the p-value and fold changes between 48 lung tumor and 6 normal adjacent lung samples using adjusted p<0.05 and |fold change|>2 as a threshold"

My questions are the following:

1) Given that the data above were processed with DESeq, can I start with a simple expression filter based on a log2 expression cutoff, as is commonly done for microarrays? Or is this different with the newer DESeq2 versions?

2) Can I use the processed data directly for DE analysis? Or would it be more appropriate to start the analysis from the fastq files?
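
To make question 1 concrete, a log2 cutoff filter could look roughly like the sketch below. It uses a made-up toy matrix rather than the GEO file, and the cutoff (`min_log2`), the minimum number of samples (`min_samples`) and the gene names are hypothetical placeholders, not recommended values.

```r
# Toy log2-normalized matrix (genes x samples); values are invented for illustration.
log2_expr <- matrix(
  c( 0.0,  0.0, 0.0,  5.4,
    10.8, 11.8, 9.6, 10.4,
     5.6,  7.3, 3.9,  1.8),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2", "s3", "s4"))
)

min_log2 <- 1      # hypothetical expression cutoff on the log2 scale
min_samples <- 2   # hypothetical minimum number of samples above the cutoff

# Keep genes that exceed the cutoff in at least `min_samples` samples.
keep <- rowSums(log2_expr > min_log2) >= min_samples
filtered <- log2_expr[keep, , drop = FALSE]

# Hypothetical signature; report which of its members pass the filter.
signature_genes <- c("geneA", "geneC")
expressed_signature <- intersect(signature_genes, rownames(filtered))
print(expressed_signature)
```

Whether such a cutoff is meaningful depends on how the values were normalized and transformed, which the GEO record does not fully describe.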

Best,

Efstathios

deseq rna-seq filtering DE deseq2 • 401 views
ATpoint ▴ 990
@atpoint-13662
Germany

You could use the signed -log10(nominal p-value), that is, -log10(p) carrying the sign of the fold change, as the ranking metric and perform GSEA with your gene signature as the gene set. From what I understand, you have the statistics from the original DESeq output? On Bioconductor, the fgsea package (among many others) is helpful for this. This does not require any custom filtering, as GSEA takes the full expression profile as input (represented by the ranking metric).
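
As a rough sketch of that ranking metric, assuming a DE table with columns named `log2FoldChange` and `pvalue` (these names and all values are assumptions for illustration; the original DESeq output may label them differently):

```r
# Toy DE table; gene names and statistics are invented for illustration.
de <- data.frame(
  gene = c("g1", "g2", "g3", "g4"),
  log2FoldChange = c(2, -1.5, 0.3, -3),
  pvalue = c(1e-4, 1e-2, 0.5, 1e-6),
  stringsAsFactors = FALSE
)

# Signed -log10(p): magnitude from the p-value, direction from the fold change.
ranks <- sign(de$log2FoldChange) * -log10(de$pvalue)
names(ranks) <- de$gene
ranks <- sort(ranks, decreasing = TRUE)
print(ranks)

# The named vector could then serve as the `stats` argument of fgsea (not run here):
# fgsea::fgsea(pathways = list(signature = c("g1", "g4")), stats = ranks)
```

Genes with a p-value of exactly 0 would give an infinite score, so those would need separate handling before ranking.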

Dear ATpoint, thanks for your suggestion, I will check fgsea. However, my initial aim is to narrow down this gene signature further by checking which of its genes are expressed and/or differentially expressed in this dataset, and then to apply functional analysis.

@mikelove
United States

Use of DESeq2 would be on original counts, where the column sum equals the number of fragments aligned to the genes.
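
A quick way to check whether a matrix looks like original counts is sketched below. The helper `looks_like_counts` is a hypothetical name introduced here, the toy values are invented, and the check is only a heuristic.

```r
# Toy matrices; values are invented for illustration.
raw_counts <- matrix(
  c(10L, 0L, 250L, 3L, 12L, 300L), nrow = 3,
  dimnames = list(c("g1", "g2", "g3"), c("s1", "s2"))
)
log2_norm <- log2(raw_counts + 1)  # stand-in for a transformed matrix

# Heuristic: original counts are finite, non-negative whole numbers.
looks_like_counts <- function(m) {
  all(is.finite(m)) && all(m >= 0) && all(m == round(m))
}

print(looks_like_counts(raw_counts))
print(looks_like_counts(log2_norm))

# For original counts, each column sum is the number of fragments assigned to genes.
print(colSums(raw_counts))
```

A matrix of log2-normalized values like the one posted in the question would fail this check because of its fractional entries.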

Dear Michael,

thanks for pointing this out! Regarding the processing mentioned above, that is, log2 transformation and normalization by an older version of the original DESeq algorithm: do you think that any filtering or DE analysis could still be applied, for example using z-scores?
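
By z-scores I mean a per-gene standardization of the log2 values across samples, roughly as in this sketch (toy values, base R only; the matrix is made up and is not the GEO data):

```r
# Toy log2-normalized matrix (genes x samples); values are invented for illustration.
log2_expr <- matrix(
  c( 1, 2, 3,
     4, 4, 4,
    10, 8, 6),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("g1", "g2", "g3"), c("s1", "s2", "s3"))
)

# Per-gene z-score: subtract the row mean and divide by the row standard deviation.
sds <- apply(log2_expr, 1, sd)
z <- (log2_expr - rowMeans(log2_expr)) / sds

# Genes with zero variance have no defined z-score.
z[sds == 0, ] <- NA
print(z)
```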

This input data isn’t appropriate for DESeq2, which is what I provide user support for.