I have been working with this RNA-Seq dataset on and off for 2 years now. I can't wrap my head around why the data looks so skewed. Please see attached figures. Need help with understanding the underlying reason and any suggestions what to do with it.
Basically I have ~7 samples in four groups, but to simplify I only present two groups tested against each other here. I've mapped paired reads with tophat against the reference genome. Analyzed differential gene expression over and over using different versions of different programs, edgeR, cufflinks, etc. Mostly I've used DESeq and DESeq2. I've tried so many angles. I've used all different normalization techniques in tophat, they all give similar results. Tried being strict and less strict. No matter what I do, I end up with this distribution. Get huge numbers of significant genes every time.
The samples were ordered in such a way that batch effects during sample preparation and sequencing is highly unlikely.
Please any suggestions on why it looks so skewed and if it's something I can adjust? Is it possible it is a biological difference?
Grateful for any help.
edit: More figures.
Scatter plot
> multidensity( log10(0.1 + counts_W[counts_W > 0,])
> split(round(colSums(counts(dds))/1e6), dds$condition)
$NB
N03B N17B N21B N32B N39B N41B N48B
14 14 10 5 6 11 10
$NW
N03W N17W N21W N32W N39W N41W N48W
9 17 7 12 11 13 12
$SB
S52B S69B S81B S82B S94B S96B
8 10 8 9 8 5
$SW
S52W S69W S81W S82W S94W S96W
3 12 3 10 9 7
(only the two conditions with W are presented in the figures)