Hi All,
I have RNA-seq data from 7 healthy (Control) and 8 breast cancer carriers and using DESeq2 to find DEGs. So, every sample of mine is collected from an individual person (no real biological replicates and no experimental replicates). I put these samples into two different groups of Control (healthy) and Carrier and followed the DESeq2 analysis as suggested in in bioconductor and published paper titled: Differential analysis of count data – the DESeq2 package, 2016)
I ended up with no significant (padj<0.05) DEG. I did clustering analysis and some of my control samples were in the same cluster as my carriers. All my analysis suggested that the differences between my samples within a group are bigger than the differences between two groups (Control vs Carrier). So, based on the cluster, I created a new csv file with selected samples that they were not interfering with each other (4 control and 4 carriers) and I ended up with 60 DEG (padj<0.05). So, depending which samples I include or exclude the outcome is different (which does make sense to me).
However, I am doing this DESeq2 analysis for our group and they are insisting on doing the analysis using all samples. I also tried to do remove hidden batch effect using sva package (got two surrogate variables SV1 and SV2) and re-done the DESeq2 analysis as suggested in bioconductor and still did not get more statistically significant DEGs when using all cell lines (the command lines are at the end of this message). However, by removing the hidden batch analysis from my groups of 4 control and 4 carriers, the number of 60 significant DEGs decreased to 46 DEGs. Should not I have been getting more significantly DEGs?
So, my questions are:
1. Did I correctly grouped the samples? Status had two groups (Control =7 samples and Carrier=8 samples) and I did one factor analysis.
se <- summarizeOverlaps(features=ebg, reads=bamfiles, mode="Union", singleEnd=TRUE, ignore.strand=FALSE, fragments=FALSE ) se dim(se) assayNames(se) head(assay(se), 3) colSums(assay(se)) rowRanges(se) str(metadata(rowRanges(se))) colData(se) (colData(se) <- DataFrame(sampleTable)) se$Status se$Status <- relevel(se$Status, "Control") round( colSums(assay(se)) / 1e6, 1 ) colData(se) library("DESeq2") dds <- DESeqDataSet(se, design = ~ Status) countdata <- assay(se) head(countdata, 3) coldata <- colData(se) (ddsMat <- DESeqDataSetFromMatrix(countData = countdata, colData = coldata, design = ~ Status)) nrow(dds) dds <- dds[ rowSums(counts(dds)) > 1, ] nrow(dds) rld <- rlog(dds, blind=FALSE) par( mfrow = c( 1, 2 ) ) dds <- estimateSizeFactors(dds) dds <- DESeq(dds) (res <- results(dds)) mcols(res, use.names=TRUE) summary(res) res.05 <- results(dds, alpha=.05) table(res.05$padj < .05)
2. Why DESeq2 analysis did not result in more DEG after removing hidden batch effect and whether is this approach is correct or appropriate for my type of analysis.
3. I am not sure if using below command lines, I removed batch effects or just estimated the sabotage groups? sorry for the silly question, but as I mentioned these are all new to me and I am a bit confused by option "comBat" in sva package.
4. Is there any other approaches I can take to keep all the samples and still get rid of variations between the samples?Sorry for my long message. I am kind of stuck here and I really appreciate your advice.
My command lines for removing hidden batch effects were:
library("sva") dat <- counts(dds, normalized=TRUE) idx <- rowMeans(dat) > 1 dat <- dat[idx,] mod <- model.matrix(~ Status, colData(dds)) mod0 <- model.matrix(~ 1, colData(dds)) svseq <- svaseq(dat, mod, mod0, n.sv=2) par(mfrow=c(2,1),mar=c(3,5,3,1)) png(filename = "stripchart1.png") stripchart(svseq$sv[,1] ~ dds$CellLine,vertical=TRUE,main="SV1") abline(h=0) dev.off() png(filename = "stripchart2.png") stripchart(svseq$sv[,2] ~ dds$CellLine,vertical=TRUE,main="SV2") abline(h=0) dev.off() ddssva <- dds ddssva$SV1 <- svseq$sv[,1] ddssva$SV2 <- svseq$sv[,2] design(ddssva) <- ~ SV1 + SV2 + Status ddssva <- DESeq(ddssva) library("DESeq2") ddssva <- DESeq(ddssva) (res_sva <- results(ddssva)) mcols(res_sva, use.names=TRUE) summary(res_sva) res_sva.05 <- results(ddssva, alpha=.05) table(res_sva.05$padj < .05) library("AnnotationDbi") library("org.Hs.eg.db") columns(org.Hs.eg.db) res_sva$symbol <- mapIds(org.Hs.eg.db, keys=row.names(res_sva), column="SYMBOL", keytype="ENSEMBL", multiVals="first") res_sva$entrez <- mapIds(org.Hs.eg.db, keys=row.names(res_sva), column="ENTREZID", keytype="ENSEMBL", multiVals="first") res_svaOrdered <- res_sva[order(res_sva$padj),] head(res_svaOrdered) (res_svaGR <- results(ddssva, lfcThreshold=1, format="GRanges")) res_svaGR$symbol <- mapIds(org.Hs.eg.db, names(res_svaGR), "SYMBOL", "ENSEMBL") res_svaOrderedDF <- as.data.frame(res_svaOrdered)[1:1600,] write.csv(res_svaOrderedDF, file="results_sva.csv")
Thanks in advance, Leila
I'm not sure why you say you have no replicates. You have two groups of samples with every sample collected from a different person. That's exactly what biological replicates are.
Hi Ryan,
Thank you for the recommendations. I'll try this approach and hopefully I'll get a better outcome. I also looked into PCA plots as you suggested and the groups were separated in PC5 and PC6. I know that my samples can be considered as biological replicates but because the replicates are from different human samples, the variations between replicates are much larger that what we can expect between cell lines. These samples were also collected at different stages of cancer development, from different age group, different stages of treatment and one replicate in carrier group has not developed cancer yet (still healthy) and only has the carrier gene. And because every single of them were different in the group, I could not do the two factor DESeq2 analysis also.
Thanks again for your helpful suggestions, Leila