Federico Gaiti
Hi all,
I am using DESeq for a DGE analysis.
I have STRANDED RNA-Seq data for 4 developmental stages with no
replicates, but I know that a reliable DGE analysis needs replicates.
So I got (from another lab member) UNSTRANDED RNA-Seq data with 3
replicates per stage.
So my data situation at the moment is:
STAGE 1 stranded
STAGE 1.1 unstranded
STAGE 1.2 unstranded
STAGE 1.3 unstranded
STAGE 2 stranded
STAGE 2.1 unstranded
STAGE 2.2 unstranded
STAGE 2.3 unstranded
STAGE 3 stranded
STAGE 3.1 unstranded
STAGE 3.2 unstranded
STAGE 3.3 unstranded
STAGE 4 stranded
STAGE 4.1 unstranded
STAGE 4.2 unstranded
STAGE 4.3 unstranded
Before doing the DGE analysis, I wanted to check the correlation
between these samples, just to show that similar samples cluster
together. If they do, I plan to add the unstranded data to my DGE
analysis to reach a final number of 4 replicates per stage.
I mapped the raw reads to the genome using TopHat (v2.0.9)
(fr-unstranded for the unstranded data and fr-secondstrand for the
stranded data) and used htseq-count (HTSeq 0.5.4p5) to get the raw
read counts for both datasets. For the stranded data I used the option
-s yes and for the unstranded data I used -s no. I then used DESeq
(v1.14.0) to attach the metadata and normalize, and I removed the
genes that have a count of 0 in every sample. I then calculated the
correlation between samples, which was really low.
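To make the normalization and correlation step concrete, here is a minimal sketch of what I understand it to involve: median-of-ratios size factors (the approach DESeq's estimateSizeFactors is based on), removal of all-zero genes, and a Pearson correlation between two samples. It is plain Python rather than the actual DESeq code; the function names and toy counts are made up for illustration, and taking log2(x + 1) before correlating is an assumption, not something stated above.

```python
import math
from statistics import median

def size_factors(counts):
    """counts: dict sample -> list of raw counts, same gene order in every sample."""
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    log_ratios = {s: [] for s in samples}
    for g in range(n_genes):
        row = [counts[s][g] for s in samples]
        if min(row) == 0:
            continue  # geometric mean is zero, so the gene is not used
        log_geo_mean = sum(math.log(c) for c in row) / len(row)
        for s, c in zip(samples, row):
            log_ratios[s].append(math.log(c) - log_geo_mean)
    # size factor = median ratio of a sample's counts to the per-gene geometric mean
    return {s: math.exp(median(log_ratios[s])) for s in samples}

def normalized_log_counts(counts):
    """Drop genes that are zero in every sample, then return log2(normalized + 1)."""
    sf = size_factors(counts)
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    keep = [g for g in range(n_genes) if any(counts[s][g] > 0 for s in samples)]
    return {s: [math.log2(counts[s][g] / sf[s] + 1) for g in keep] for s in samples}

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Toy counts (hypothetical): sample B is sample A sequenced twice as deep.
toy = {"A": [10, 20, 0, 40], "B": [20, 40, 0, 80], "C": [5, 30, 0, 20]}
sf = size_factors(toy)
norm = normalized_log_counts(toy)
r_ab = pearson(norm["A"], norm["B"])
print(sf, r_ab)
```

In the toy data the size factor of B is twice that of A, so after normalization the two samples correlate perfectly; real replicates will of course not.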
I also tried the option -s reverse for the stranded data and still
got really low correlation. So I reran htseq-count on the stranded
data with the option -s no, and this way I got a very similar number
of total counts for the unstranded and stranded data, around 4-5M
counts per stage (whereas with both -s yes and -s reverse the stranded
totals were about double that).
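For comparing total counts across the -s settings, something like the following could sum the per-gene counts of each htseq-count output while skipping the summary rows at the end. This is only a sketch: the helper name total_assigned and the toy lines are made up, and it assumes the summary rows are called no_feature, ambiguous, too_low_aQual, not_aligned and alignment_not_unique, with or without a leading double underscore depending on the HTSeq version.

```python
# Summary rows that htseq-count appends after the per-gene counts
# (assumed names; some HTSeq versions prefix them with "__").
SPECIAL = {
    "no_feature",
    "ambiguous",
    "too_low_aQual",
    "not_aligned",
    "alignment_not_unique",
}

def total_assigned(lines):
    """Sum the counts over gene rows of a tab-separated htseq-count output."""
    total = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        feature, count = line.split("\t")
        name = feature[2:] if feature.startswith("__") else feature
        if name in SPECIAL:
            continue  # not a gene, skip the summary row
        total += int(count)
    return total

# Toy outputs (hypothetical numbers) for the same library under two settings.
toy_yes = ["geneA\t100", "geneB\t50", "no_feature\t30", "__ambiguous\t5"]
toy_no = ["geneA\t60", "geneB\t15", "no_feature\t10", "__ambiguous\t40"]

totals = {"-s yes": total_assigned(toy_yes), "-s no": total_assigned(toy_no)}
for mode, total in totals.items():
    print(mode, total)
```

With real data, an open file handle for each htseq-count output can be passed in place of the toy lists.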
I included the metadata
> Design
         condition
ADULT        ADULT
ADULT1       ADULT
ADULT2       ADULT
ADULT3       ADULT
JUV            JUV
JUV1           JUV
JUV2           JUV
JUV3           JUV
COMP          COMP
COMP1         COMP
COMP2         COMP
COMP3         COMP
PRECOMP    PRECOMP
PRECOMP1   PRECOMP
PRECOMP2   PRECOMP
PRECOMP3   PRECOMP
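As a quick sanity check of the design above, the condition can be derived from each sample name by stripping the trailing replicate number, then the replicates counted per stage. This is just an illustrative sketch; the helper condition_of is made up and is not part of DESeq.

```python
import re
from collections import Counter

# Sample names as they appear in the design table.
samples = [
    "ADULT", "ADULT1", "ADULT2", "ADULT3",
    "JUV", "JUV1", "JUV2", "JUV3",
    "COMP", "COMP1", "COMP2", "COMP3",
    "PRECOMP", "PRECOMP1", "PRECOMP2", "PRECOMP3",
]

def condition_of(sample):
    """Strip a trailing replicate number, e.g. 'JUV2' -> 'JUV'."""
    return re.sub(r"\d+$", "", sample)

design = {s: condition_of(s) for s in samples}
replicates = Counter(design.values())
print(replicates)
```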
and estimated the new size factors, normalized and calculated the new
correlation. The Pearson correlation was now good, which both a PCA
and a correlogram confirmed. So my initial idea was to do the DGE
analysis "treating" the stranded data as unstranded, which gives 4
replicates per stage. However, I'd still like to figure out a way to
use the stranded counts, since I am not sure whether I am losing
information by running htseq-count with -s no on the stranded data.
What I had in mind was to use the unstranded data to estimate the
level of variation (to get a threshold for DE detection) while still
using the stranded data for the expression values. I am not sure I can
do that, though, given that one dataset is stranded and the other is
not.
I would like to hear from you if you have any thoughts about this.
Let me know if you need any further details to better understand the
issue.
Thanks in advance,
Federico
Federico Gaiti
Ph.D. Candidate
School of Biological Sciences
University of Queensland
St Lucia QLD 4072
Australia
f.gaiti@uq.edu.au