Deleted:Genes with highest differential expression have very low read counts for most genes
1
0
Entering edit mode
Ben ▴ 10
@974411d6
Last seen 6 weeks ago
United Kingdom

tl;dr: The most differentially expressed genes from DESeq2 are genes with very low read counts for many of my samples, and extremely high read counts for a few, even with pre-filtering - is this correct?

Hi!

I have a dataset comprising over 700 unique cell lines, each with a sensitivity data point (AUC) representing the response to a drug (measured by CellTiter-Glo assay). I have sourced a counts matrix with corresponding basal gene expression for matching cell lines with over 50,000 protein coding and non-protein coding genes here. I would like to see if sensitivity of my cell lines to my drug corresponds with differences in basal gene expression. e.g. high/increasing expression of protein x is significantly associated with sensitivity to my drug.

Here is a histogram of my AUC data:

enter image description here

I have created a DESeq Dataset as follows:

deseq_ds = DESeqDataSetFromMatrix(countData = round(counts),
                       colData = viab,
                       design = ~ AUC)
`

Where viab is a cell line identifier as rowname and single column containing centered and scaled AUC values.

I started to run into issues (I think?!) when I was running the code without pre-filtering, and plotting counts of my most significantly differentially expressed genes (ordered by adjusted p-value):

plt = plotCounts(deseq_ds, gene=idx, intgroup = 'AUC', returnData = T)
## plot with ggplot

Plot of gene with lowest adj p-value

As you can see, we have one extreme outlier gene that is seemingly producing this significance. I added pre-filtering to exclude this, by filtering out genes with <= 10 samples with counts <= 50.

  keep = rowSums(counts(deseq_ds) >= 10) >= 50
  deseq_ds = deseq_ds[keep,]

I replotted the most significant gene (padj 6.87e-73) (this gene also has one of the highest absolute LFC (-2.979)):

enter image description here

Adding ylim 0 -> 1000 to zoom into interesting part:

enter image description here

Here is another example, with 3rd most significant gene, (padj 3.51e-47, log2FC 2.79):

cts

Here is an example where I feel a trend is more noticeable, and the results seem more realistic (padj 1.7e-15; log2FC -0.47):

cts

I am a bit confused, as pretty much all of the most significant plots don't look very convincing and from visual inspection I don't think I would say there is even a trend based on this data. They seem to being heavily skewed by a few samples with extremely high read count.

Furthermore, I have >9000 genes significant in my dataset, of a total of 26433 (following pre-filtering) which is approximately 1/3 of all my genes are being classified as DEGs. Is this quite high?

Am I doing something wrong here or making an assumption I shouldn't be? Is DESeq2 suited to this kind of analysis, or should I be using a different package? Or does everything look like it's working as intended? Thanks for any help and guidance!

Here is my PCA plot:

pca

Here is plot of my gene dispersion (it's very high, I believe?):

dispersion

Here is plot of my SD against ranked mean following VTS normalisation:

vst

Here is my MA plot:

maplot

RNASeq depmap DESeq2 • 825 views
ADD COMMENT
This thread is not open. No new answers may be added
Traffic: 477 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6