Getting different results with DESeq2
1
0
Entering edit mode
Lillyhchen ▴ 10
@lillyhchen-8385
Last seen 6.2 years ago
United States

Hi I'm having an issue with DESeq2 results.  I'm working with 2 cell lines from different patients and originally I created a .csv excel file with both cell lines in it and then contrasted the results from the DESEq Analysis:

setwd("C:/Users/lchen/Documents/transfer.lilly/TMZ.052215")
#Get the paths to the file
#Make sure the sample names in the sample.csv match up with the end path text
path=paste(getwd(),"/",sample$sample,".sorted.rmdup.bam",sep="") library(GenomicRanges) #Clump the bam files into a BamFileList with the same samples as listed in the .csv list=BamFileList(path, index=character()) library(GenomicFeatures) #Obtain Reference genome from UCSC Txdb=makeTxDbFromUCSC( genome="hg19", tablename="ensGene", transcript_ids=NULL, circ_seqs=DEFAULT_CIRC_SEQS, url="http://genome.ucsc.edu/cgi-bin/", goldenPath_url="http://hgdownload.cse.ucsc.edu/goldenPath", miRBaseBuild=NA) #Sort the exons of the Txdb by gene exons<-exonsBy(Txdb, by="gene") #SummarizeOverlaps to count unique reads library(BiocParallel) se <- summarizeOverlaps(features=exons, reads=list, mode="Union", singleEnd=FALSE, ignore.strand=FALSE, fragments=TRUE, BPPARAM = SerialParam()) #Add column as listed in sample.csv colData(se)$group= sample$group #Run DESeq2 with variables library(DESeq2) dds<-DESeqDataSet(se, design=~group) dds<-DESeq(dds) #Results that contrast the Control and Treated groups res<-results(dds, contrast=c("group", "HF2303.Control", "HF2303.TMZ")) #Match Gene Names to the Ensembl IDs from the Txdb res$ensembl <- sapply( strsplit( rownames(res), split="\\+" ), "[", 1 )
library( "biomaRt" )
ensembl = useMart( "ensembl", dataset = "hsapiens_gene_ensembl" )
genemap <- getBM( attributes = c("ensembl_gene_id", "entrezgene", "hgnc_symbol"),
filters = "ensembl_gene_id",
values = res$ensembl, mart = ensembl ) idx <- match( res$ensembl, genemap$ensembl_gene_id ) res$entrez <- genemap$entrezgene[ idx ] res$hgnc_symbol <- genemap$hgnc_symbol[ idx ] #Create a MAPlot and Dispersion Estimate Plot plotMA(res, ylim=c(-1, 1), main="DESeq2") plotDispEsts(dds) #Filter by subsetting results for padj<0.1 and log2FoldChange>=2 res=as.data.frame(res) sig=subset(res, padj<=0.1) sig=subset(res, abs(log2FoldChange)>=2) #Save sig and gene list as csv file write.csv(sig, "sig.HF2303.csv") gene=unique(sig$hgnc_symbol)
gene=gene[!is.na(gene)]
write.csv(gene, "gene.HF2303.csv")

The code runs and gives me about 81 significant genes

However, when I separate the 2 cell lines into separate 2 .csv excel files (one for each cell line) and run the same code, I only get about 5 significant genes.  At first I thought it would be better to work the cell lines separately, but I didn't think that it would make a huge difference if only the cell lines were contrasted in the results, so I'm confused why I am getting different results.

Here is a code of when I separated the 2 cell lines into 2 excel files.

setwd("C:/Users/lchen/Documents/transfer.lilly/TMZ.052215")
#Get the paths to the file
#Make sure the sample names in the sample.csv match up with the end path text
path=paste(getwd(),"/",sample$sample,".HF2927.sorted.rmdup.bam",sep="") library(GenomicRanges) #Clump the bam files into a BamFileList with the same samples as listed in the .csv list=BamFileList(path, index=character()) library(GenomicFeatures) #Obtain Reference genome from UCSC Txdb=makeTxDbFromUCSC( genome="hg19", tablename="ensGene", transcript_ids=NULL, circ_seqs=DEFAULT_CIRC_SEQS, url="http://genome.ucsc.edu/cgi-bin/", goldenPath_url="http://hgdownload.cse.ucsc.edu/goldenPath", miRBaseBuild=NA) #Sort the exons of the Txdb by gene exons<-exonsBy(Txdb, by="gene") #SummarizeOverlaps to count unique reads library(BiocParallel) se <- summarizeOverlaps(features=exons, reads=list, mode="Union", singleEnd=FALSE, ignore.strand=FALSE, fragments=TRUE, BPPARAM = SerialParam()) #Add column as listed in sample.csv colData(se)$group= sample$group #Run DESeq2 with variables library(DESeq2) dds<-DESeqDataSet(se, design=~group) dds<-DESeq(dds) #Results that contrast the Control and Treated groups res<-results(dds, contrast=c("group", "Control", "TMZ")) #Match Gene Names to the Ensembl IDs from the Txdb res$ensembl <- sapply( strsplit( rownames(res), split="\\+" ), "[", 1 )
library( "biomaRt" )
ensembl = useMart( "ensembl", dataset = "hsapiens_gene_ensembl" )
genemap <- getBM( attributes = c("ensembl_gene_id", "entrezgene", "hgnc_symbol"),
filters = "ensembl_gene_id",
values = res$ensembl, mart = ensembl ) idx <- match( res$ensembl, genemap$ensembl_gene_id ) res$entrez <- genemap$entrezgene[ idx ] res$hgnc_symbol <- genemap$hgnc_symbol[ idx ] #Create a MAPlot and Dispersion Estimate Plot plotMA(res, ylim=c(-1, 1), main="DESeq2") plotDispEsts(dds) #Filter by subsetting results for padj<0.1 and log2FoldChange>=2 res=as.data.frame(res) sig=subset(res, padj<=0.1) sig=subset(res, abs(log2FoldChange)>=2) #Save sig and gene list as csv file write.csv(sig, "sig.HF2927.csv") gene=unique(sig$hgnc_symbol)
gene=gene[!is.na(gene)]
write.csv(gene, "gene.HF2927.csv")

Any help or tips as to why would be very appreciated! Thank you!

deseq2 results rstudio • 1.5k views
0
Entering edit mode
@mikelove
Last seen 6 hours ago
United States

"At first I thought it would be better to work the cell lines separately, but I didn't think that it would make a huge difference if only the cell lines were contrasted in the results, so I'm confused why I am getting different results."

I can't tell exactly from your post, but just to make clear: one time you had four groups in the DESeqDataSet, and another time you had two groups in two separate DESeqDataSets?

Yes, it is expected that this will change the results.

The estimation steps for the model parameters look at all the samples in the DESeqDataSet. This often helps to have all the samples from all groups, because it improves the accuracy estimates (more residual degrees of freedom).

0
Entering edit mode

Sorry for the confusion, but yes that's what I'm working with. Basically, I have two tumor cell lines each from a different patient, and each one of the tumor cell lines has a control and treated group, so as you were saying that would be 4 groups all together when I created a .csv file with both tumor cell lines.  I'm trying to look at the changes in gene expression that occur in each tumor cell line between untreated and those treated with Treatment A.

That's when i tried to just work with the two tumor cell lines separately.  Is there a way to contrast the tumor cell lines when working with samples from all the groups?  I've been trying to edit the design when creating a DESeqDataSet but i've been getting an error saying that "-- standard model matrices are used for factors with two levels and an interaction,
where the main effects are for the reference level of other factors.
see the 'Interactions' section of the vignette for more details: vignette('DESeq2')"

Here's the code i've been working with:

> exons<-exonsBy(Txdb, by="gene")
> library(BiocParallel)
+                         mode="Union",
+                         singleEnd=FALSE,
+                         ignore.strand=FALSE,
+                         fragments=TRUE, BPPARAM = SerialParam())
> colData(se)=DataFrame(sample)
> library(DESeq2)
> dds<-DESeqDataSet(se, design=~Cell + Treat + Cell:Treat)
> #Relevel Treat column so that Control is first
> dds$Treat <- relevel(dds$Treat, "Cont")
> #Run DESeq pipeline
> dds<-DESeq(dds)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- standard model matrices are used for factors with two levels and an interaction,
where the main effects are for the reference level of other factors.
see the 'Interactions' section of the vignette for more details: vignette('DESeq2')

0
Entering edit mode

answer posted on the new post:

A: Running DESeqDataSet with 4 factors?