Question

Getting different results with DESeq2

0

Entering edit mode

Lillyhchen ▴ 10

@lillyhchen-8385

Last seen 10.6 years ago

United States

Hi I'm having an issue with DESeq2 results. I'm working with 2 cell lines from different patients and originally I created a .csv excel file with both cell lines in it and then contrasted the results from the DESEq Analysis:

setwd("C:/Users/lchen/Documents/transfer.lilly/TMZ.052215")
sample=read.csv("sample.csv", header=T)
#Get the paths to the file
#Make sure the sample names in the sample.csv match up with the end path text
path=paste(getwd(),"/",sample$sample,".sorted.rmdup.bam",sep="")
library(GenomicRanges)
#Clump the bam files into a BamFileList with the same samples as listed in the .csv
list=BamFileList(path, index=character())
library(GenomicFeatures)
#Obtain Reference genome from UCSC
Txdb=makeTxDbFromUCSC(
genome="hg19",
tablename="ensGene",
transcript_ids=NULL,
circ_seqs=DEFAULT_CIRC_SEQS,
url="http://genome.ucsc.edu/cgi-bin/",
goldenPath_url="http://hgdownload.cse.ucsc.edu/goldenPath",
miRBaseBuild=NA)
#Sort the exons of the Txdb by gene
exons<-exonsBy(Txdb, by="gene")
#SummarizeOverlaps to count unique reads
library(BiocParallel)
se <- summarizeOverlaps(features=exons, reads=list,
mode="Union",
singleEnd=FALSE,
ignore.strand=FALSE,
fragments=TRUE, BPPARAM = SerialParam())
#Add column as listed in sample.csv
colData(se)$group= sample$group
#Run DESeq2 with variables
library(DESeq2)
dds<-DESeqDataSet(se, design=~group)
dds<-DESeq(dds)
#Results that contrast the Control and Treated groups
res<-results(dds, contrast=c("group", "HF2303.Control", "HF2303.TMZ"))
#Match Gene Names to the Ensembl IDs from the Txdb
res$ensembl <- sapply( strsplit( rownames(res), split="\\+" ), "[", 1 )
library( "biomaRt" )
ensembl = useMart( "ensembl", dataset = "hsapiens_gene_ensembl" )
genemap <- getBM( attributes = c("ensembl_gene_id", "entrezgene", "hgnc_symbol"),
filters = "ensembl_gene_id",
values = res$ensembl,
mart = ensembl )
idx <- match( res$ensembl, genemap$ensembl_gene_id )
res$entrez <- genemap$entrezgene[ idx ]
res$hgnc_symbol <- genemap$hgnc_symbol[ idx ]
#Create a MAPlot and Dispersion Estimate Plot
plotMA(res, ylim=c(-1, 1), main="DESeq2")
plotDispEsts(dds)
#Filter by subsetting results for padj<0.1 and log2FoldChange>=2
res=as.data.frame(res)
sig=subset(res, padj<=0.1)
sig=subset(res, abs(log2FoldChange)>=2)
#Save sig and gene list as csv file
write.csv(sig, "sig.HF2303.csv")
gene=unique(sig$hgnc_symbol)
gene=gene[!is.na(gene)]
write.csv(gene, "gene.HF2303.csv")

The code runs and gives me about 81 significant genes

However, when I separate the 2 cell lines into separate 2 .csv excel files (one for each cell line) and run the same code, I only get about 5 significant genes. At first I thought it would be better to work the cell lines separately, but I didn't think that it would make a huge difference if only the cell lines were contrasted in the results, so I'm confused why I am getting different results.

Here is a code of when I separated the 2 cell lines into 2 excel files.

setwd("C:/Users/lchen/Documents/transfer.lilly/TMZ.052215")
sample=read.csv("HF2927.csv", header=T)
#Get the paths to the file
#Make sure the sample names in the sample.csv match up with the end path text
path=paste(getwd(),"/",sample$sample,".HF2927.sorted.rmdup.bam",sep="")
library(GenomicRanges)
#Clump the bam files into a BamFileList with the same samples as listed in the .csv
list=BamFileList(path, index=character())
library(GenomicFeatures)
#Obtain Reference genome from UCSC
Txdb=makeTxDbFromUCSC(
genome="hg19",
tablename="ensGene",
transcript_ids=NULL,
circ_seqs=DEFAULT_CIRC_SEQS,
url="http://genome.ucsc.edu/cgi-bin/",
goldenPath_url="http://hgdownload.cse.ucsc.edu/goldenPath",
miRBaseBuild=NA)
#Sort the exons of the Txdb by gene
exons<-exonsBy(Txdb, by="gene")
#SummarizeOverlaps to count unique reads
library(BiocParallel)
se <- summarizeOverlaps(features=exons, reads=list,
mode="Union",
singleEnd=FALSE,
ignore.strand=FALSE,
fragments=TRUE, BPPARAM = SerialParam())
#Add column as listed in sample.csv
colData(se)$group= sample$group
#Run DESeq2 with variables
library(DESeq2)
dds<-DESeqDataSet(se, design=~group)
dds<-DESeq(dds)
#Results that contrast the Control and Treated groups
res<-results(dds, contrast=c("group", "Control", "TMZ"))
#Match Gene Names to the Ensembl IDs from the Txdb
res$ensembl <- sapply( strsplit( rownames(res), split="\\+" ), "[", 1 )
library( "biomaRt" )
ensembl = useMart( "ensembl", dataset = "hsapiens_gene_ensembl" )
genemap <- getBM( attributes = c("ensembl_gene_id", "entrezgene", "hgnc_symbol"),
filters = "ensembl_gene_id",
values = res$ensembl,
mart = ensembl )
idx <- match( res$ensembl, genemap$ensembl_gene_id )
res$entrez <- genemap$entrezgene[ idx ]
res$hgnc_symbol <- genemap$hgnc_symbol[ idx ]
#Create a MAPlot and Dispersion Estimate Plot
plotMA(res, ylim=c(-1, 1), main="DESeq2")
plotDispEsts(dds)
#Filter by subsetting results for padj<0.1 and log2FoldChange>=2
res=as.data.frame(res)
sig=subset(res, padj<=0.1)
sig=subset(res, abs(log2FoldChange)>=2)
#Save sig and gene list as csv file
write.csv(sig, "sig.HF2927.csv")
gene=unique(sig$hgnc_symbol)
gene=gene[!is.na(gene)]
write.csv(gene, "gene.HF2927.csv")

Any help or tips as to why would be very appreciated! Thank you!

deseq2 results rstudio • 2.6k views

ADD COMMENT • link updated 10.5 years ago by Michael Love 43k • written 10.5 years ago by Lillyhchen ▴ 10

score 0 · Answer 1 · 2015-07-27

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 2 days ago

United States

"At first I thought it would be better to work the cell lines separately, but I didn't think that it would make a huge difference if only the cell lines were contrasted in the results, so I'm confused why I am getting different results."

I can't tell exactly from your post, but just to make clear: one time you had four groups in the DESeqDataSet, and another time you had two groups in two separate DESeqDataSets?

Yes, it is expected that this will change the results.

The estimation steps for the model parameters look at all the samples in the DESeqDataSet. This often helps to have all the samples from all groups, because it improves the accuracy estimates (more residual degrees of freedom).

ADD COMMENT • link 10.5 years ago Michael Love 43k

0

Entering edit mode

Sorry for the confusion, but yes that's what I'm working with. Basically, I have two tumor cell lines each from a different patient, and each one of the tumor cell lines has a control and treated group, so as you were saying that would be 4 groups all together when I created a .csv file with both tumor cell lines. I'm trying to look at the changes in gene expression that occur in each tumor cell line between untreated and those treated with Treatment A.

That's when i tried to just work with the two tumor cell lines separately. Is there a way to contrast the tumor cell lines when working with samples from all the groups? I've been trying to edit the design when creating a DESeqDataSet but i've been getting an error saying that "-- standard model matrices are used for factors with two levels and an interaction,
where the main effects are for the reference level of other factors.
see the 'Interactions' section of the vignette for more details: vignette('DESeq2')"

Here's the code i've been working with:

> exons<-exonsBy(Txdb, by="gene")
> library(BiocParallel)
> se <- summarizeOverlaps(features=exons, reads=list,
+ mode="Union",
+ singleEnd=FALSE,
+ ignore.strand=FALSE,
+ fragments=TRUE, BPPARAM = SerialParam())
> colData(se)=DataFrame(sample)
> library(DESeq2)
Loading required package: Rcpp
Loading required package: RcppArmadillo
> dds<-DESeqDataSet(se, design=~Cell + Treat + Cell:Treat)
> #Relevel Treat column so that Control is first
> dds$Treat <- relevel(dds$Treat, "Cont")
> #Run DESeq pipeline
> dds<-DESeq(dds)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- standard model matrices are used for factors with two levels and an interaction,
where the main effects are for the reference level of other factors.
see the 'Interactions' section of the vignette for more details: vignette('DESeq2')

ADD REPLY • link 10.5 years ago Lillyhchen ▴ 10

0

Entering edit mode

answer posted on the new post:

A: Running DESeqDataSet with 4 factors?

ADD REPLY • link 10.5 years ago Michael Love 43k