The data/experimental setup I’m working with:
A dataset obtained by sequencing blood vessels isolated from brain tissue (so not bulk tissue per se, but not single cell either; more like “single structure”) from two groups: cases and controls. The cases have a subtle phenotype (pertaining to a psychiatric disorder), so it’s reasonable to expect fewer genes picked up by sequencing as well as lower gene counts. It’s also reasonable to assume that differences in gene expression will not be drastic (but it’s still crucial that I detect them accurately). I’m learning R and RNA-seq analysis at the same time, and I’ve read myself into a stand-still regarding how best to tune the different parameters within the DESeq2 package for my particular data.
I really want to better understand the details of how DESeq2 processed my counts.

My questions:

1) What script would obtain the list of counts that were refitted/replaced with a new estimate during DESeq()? The following code gives back NULL: assays(dds)[["replaceCounts"]]
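From reading around, my current understanding (an assumption on my part, not something I’ve confirmed) is that NULL here simply means no counts were actually replaced, and that the per-gene replacement flag lives in mcols(dds)$replace:

```r
# assumption: mcols(dds)$replace is only populated when DESeq() actually
# performed replacement (minReplicatesForReplace was met AND outliers were
# found); if nothing was replaced, the "replaceCounts" assay stays NULL
table(mcols(dds)$replace)
replaced_genes <- rownames(dds)[which(mcols(dds)$replace)]
head(replaced_genes)
```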
2) What script would obtain a list of the data points flagged as outliers, per subject per gene?
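What I’ve pieced together from the vignette (so treat the cutoff formula as my assumption): flagging is driven by the Cook’s matrix in assays(dds)[["cooks"]], compared against the 0.99 quantile of an F(p, m - p) distribution:

```r
cooks_mat <- assays(dds)[["cooks"]]
m <- ncol(dds)                                     # number of samples
# number of model coefficients (assumption: rebuilt from the design formula)
p <- ncol(model.matrix(design(dds), data = as.data.frame(colData(dds))))
cutoff <- qf(0.99, p, m - p)                       # DESeq2's default Cook's cutoff
# gene/sample index pairs whose Cook's distance exceeds the cutoff
flagged <- which(cooks_mat > cutoff, arr.ind = TRUE)
head(flagged)
```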
3) What script would obtain a list of the genes whose p-value/adjusted p-value became NA?
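For this one I’ve just been subsetting the results table directly (plain base-R subsetting, nothing DESeq2-specific), on the understanding that the two NA cases have different causes:

```r
# padj is NA for genes removed by independent filtering OR flagged as
# Cook's outliers; pvalue itself is NA only for the Cook's-flagged outliers
na_padj_genes <- rownames(res)[is.na(res$padj)]
na_pval_genes <- rownames(res)[is.na(res$pvalue)]
length(na_padj_genes)
length(na_pval_genes)
```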
4) With 25/24 replicates per group, is it still best to keep cooksCutoff and minReplicatesForReplace on? My samples are from postmortem human brain which does demonstrate heterogeneity that perhaps I don’t want to refine. I find this to be quite a difficult call to make. Should cooksCutoff and minReplicatesForReplace be customized instead?
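In case it helps frame the question, this is how I understand the two switches would be turned off (parameter names taken from the man pages; the choice to disable them is exactly what I’m unsure about):

```r
# keep every sample's counts as-is: never replace outliers during fitting
dds_noreplace <- DESeq(dds, minReplicatesForReplace = Inf)
# keep every gene in the results: never set p-values to NA via Cook's cutoff
res_nocooks <- results(dds_noreplace,
                       name = "Group_SA_vs_CTRL",
                       cooksCutoff = FALSE)
```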
5) When plotting individual gene counts for both experimental groups, can I correctly assume that an obvious outlier has been excluded from statistical analysis even if it remains present in the plot?
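For context, this is how I’m inspecting individual genes (the gene ID here is just one of my own top hits, used as an example):

```r
# normalized counts per sample for one gene, returned as a data.frame
d <- plotCounts(dds, gene = "ENSG00000007908", intgroup = "Group",
                returnData = TRUE)
# cross-check: that same gene's per-sample Cook's distances
assays(dds)[["cooks"]]["ENSG00000007908", ]
```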
6) Should I leave independent filtering on its default settings?
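If it’s useful, I was planning to compare the default against independentFiltering = FALSE like this (the metadata accessor is from the results() man page):

```r
res_nofilt <- results(dds, name = "Group_SA_vs_CTRL", alpha = 0.1,
                      independentFiltering = FALSE)
c(default   = sum(res$padj < 0.1, na.rm = TRUE),
  no_filter = sum(res_nofilt$padj < 0.1, na.rm = TRUE))
# the baseMean threshold that independent filtering chose:
metadata(res)$filterThreshold
```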
As I said, there are going to be subtle differences between groups that I would like to detect (if they are indeed there), so any tweaking of the script that could improve the output, if there is any, would be much appreciated!
My script:
# read in the Subject*Counts (raw counts) matrix
cts <- as.matrix(read.csv("rawcounts.csv",
                          sep = ",",
                          header = TRUE,
                          row.names = 1,
                          stringsAsFactors = FALSE))
# read in the subject metadata file that has sample/experimental labels
colData <- read.csv("subject_metadata_deseq2.csv", row.names=NULL, fill=TRUE)
# now make Group a factor, as it is the variable we will use to define groups for the differential expression analysis
colData$Group <- factor(colData$Group, levels=c("CTRL", "SA"))
# create the DESeqDataSet object
dds <- DESeqDataSetFromMatrix(countData = round(cts),
                              colData = colData,
                              design = ~ Group)
# drop genes with fewer than 10 reads summed across all samples
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]
# run the DESeq2 analysis
dds <- DESeq(dds)
# get results for the DEG analysis by specifying the coefficient of interest
res <- results(dds,
               name = "Group_SA_vs_CTRL",
               alpha = 0.1)
# the top of the results table, ordered by adjusted p-value:
log2 fold change (MLE): Group SA vs CTRL
Wald test p-value: Group SA vs CTRL
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000007908 53.71782 -2.781780 0.5228620 -5.32029 1.03600e-07 0.00167422
ENSG00000185710 19.28648 -3.125167 0.5803412 -5.38505 7.24245e-08 0.00167422
ENSG00000169248 29.48800 -2.430210 0.4705398 -5.16473 2.40787e-07 0.00259416
ENSG00000086189 104.73989 -0.417687 0.0833796 -5.00947 5.45812e-07 0.00441030
ENSG00000169245 10.18849 -2.193207 0.4427281 -4.95385 7.27606e-07 0.00470339
ENSG00000226017 3.48526 1.440409 0.3039228 4.73939 2.14362e-06 0.01154735
# quick check for how many DEGs with significance @ 10% level in either FC direction
sum(res$padj < 0.1, na.rm=TRUE)
[1] 30
# gives you summary of DESeq2 results
summary(res)
out of 32961 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 12, 0.036%
LFC < 0 (down) : 18, 0.055%
outliers [1] : 0, 0%
low counts [2] : 643, 2%
(mean count < 0)
# boxplot of per-sample Cook's distances
par(mar = c(8, 5, 2, 2))
boxplot(log10(assays(dds)[["cooks"]]), range = 0, las = 2)
write.csv(as.data.frame(assays(dds)[["cooks"]]),
          file = "Cooks_Distances_Counts.csv")
[https://drive.google.com/file/d/1B5yXsaTgnoFUsdwzXqExHQbNzxbFYJ4_/view?usp=sharing]
# plotting maximum Cook's for each gene over the base mean to identify outlier genes
mcols(dds)$maxCooks <- apply(assays(dds)[["cooks"]], 1, max)
plot(mcols(dds)$baseMean, mcols(dds)$maxCooks)
[https://drive.google.com/file/d/1yIDFFAwRfzS7kQFYU_O5-sAMaCEkuxlc/view?usp=sharing]
# plotting maximum Cook's for each gene over the log fold change to identify outlier genes
stopifnot(all.equal(rownames(dds), rownames(res)))
plot(res$log2FoldChange, mcols(dds)$maxCooks)
[https://drive.google.com/file/d/1CGgsiPlO7kVFBhHW_0olr3__qswFfAd3/view?usp=sharing]
# I do want to control for age, so I’m also running the above code with the following object and comparing the outcomes of both:
dds <- DESeqDataSetFromMatrix(countData = round(cts),
                              colData = colData,
                              design = ~ Age + Group)
# when design is Age + Group
sum(res$padj < 0.1, na.rm=TRUE)
[1] 61
# when design is Age + Group
summary(res)
out of 32964 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 36, 0.11%
LFC < 0 (down) : 25, 0.076%
outliers [1] : 10, 0.03%
low counts [2] : 5113, 16%
(mean count < 1)