Search
Question: DESeq2: add annotations (from data frame) to DESeqDataSet
2
gravatar for Darya Vanichkina
3.6 years ago by
Australia/Centenary Institute University of Sydney
Darya Vanichkina90 wrote:

Hi! 

Sorry if this is a very basic question, but how do I add additional information from a data frame (i.e. gene symbols, chromosomal coordinates, etc) to a DESeqDataSet which I obtained from a count table with gene names as row names?

I.e. my count table looks like this: 

 head(counttable)
  A1 A2 A3 B1 B2 B3
ENSMUSG00000000001 2153 2549 2114 1142 1454 442
ENSMUSG00000000003 0 0 0 0 0 0
ENSMUSG00000000028 297 365 338 120 181 63
ENSMUSG00000000031 6628 5779 6986 10880 18988 9827
ENSMUSG00000000037 206 183 112 125 167 30
ENSMUSG00000000049 0 0 0 0 0 0

I'm using the following code to create the object:

groups <- c(rep("A", 3), rep("B", 3))
dds <- DESeqDataSetFromMatrix(countData = counttable, colData = as.data.frame(groups) ,design = ~ groups)

I've used biomaRt to query ensembl and obtain an annotation data frame which has MGI gene symbols and other information I'm interested in:

head(annotation)

     ensembl_gene_id mgi_symbol   gene_biotype
1 ENSMUSG00000064336      mt-Tf        Mt_tRNA
2 ENSMUSG00000064337    mt-Rnr1        Mt_rRNA
3 ENSMUSG00000064338      mt-Tv        Mt_tRNA
4 ENSMUSG00000064339    mt-Rnr2        Mt_rRNA
5 ENSMUSG00000064340     mt-Tl1        Mt_tRNA
6 ENSMUSG00000064341     mt-Nd1 protein_coding

What I am trying to do (in the end) is to add this information to the DESeqDataSet so that (for example)

- when I plot the heatmap my gene names are MGI symbols, not ensembl IDs

- when I use ReportingTools to generate the report, I have human - readable gene names and biotype classes in the output table as well

Thanks in advance !!!

 


R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] parallel  grid      stats     graphics  grDevices datasets  utils     methods  
[9] base     

other attached packages:
 [1] gplots_2.14.2             RColorBrewer_1.0-5        Biobase_2.24.0           
 [4] pasilla_0.4.0             BiocInstaller_1.14.3      DESeq2_1.4.5             
 [7] RcppArmadillo_0.4.450.1.0 Rcpp_0.11.3               GenomicRanges_1.16.4     
[10] GenomeInfoDb_1.0.2        IRanges_1.22.10           BiocGenerics_0.10.0      
[13] gridExtra_0.9.1           ggplot2_1.0.0             edgeR_3.6.8              
[16] limma_3.20.9              biomaRt_2.20.0           

loaded via a namespace (and not attached):
 [1] annotate_1.42.1      AnnotationDbi_1.26.1 bitops_1.0-6         caTools_1.17.1      
 [5] colorspace_1.2-4     DBI_0.3.1            DESeq_1.16.0         digest_0.6.4        
 [9] gdata_2.13.3         genefilter_1.46.1    geneplotter_1.42.0   gtable_0.1.2        
[13] gtools_3.4.1         KernSmooth_2.23-13   lattice_0.20-29      locfit_1.5-9.1      
[17] MASS_7.3-35          munsell_0.4.2        plyr_1.8.1           proto_0.3-10        
[21] RCurl_1.95-4.3       reshape2_1.4         RSQLite_1.0.0        scales_0.2.4        
[25] splines_3.1.1        stats4_3.1.1         stringr_0.6.2        survival_2.37-7     
[29] tools_3.1.1          XML_3.98-1.1         xtable_1.7-4         XVector_0.4.0       
ADD COMMENTlink modified 2.4 years ago by bioprog0 • written 3.6 years ago by Darya Vanichkina90
6
gravatar for Michael Love
3.6 years ago by
Michael Love18k
United States
Michael Love18k wrote:

hi Darya,

DESeqDataSet is a subclass of SummarizedExperiment, and you can add to the metadata columns of the rowData with code like

mcols(dds) <- cbind(mcols(dds), newMcols)

However, you have to make sure that these new columns line up row for row with the DESeqDataSet:

all(rownames(dds) == annotation$ensembl_gene_id)

If this is not the case you need to reorder the annotation data.frame using the match() function, so that the Ensembl IDs match up, then check again that everything is lined up:

annotation <- annotation[match(rownames(dds), annotation$ensembl_gene_id),]
all(rownames(dds) == annotation$ensembl_gene_id)

The heatmap.2 function looks for the rownames of the matrix you give it. So you could assign the mgi_symbols as the rownames of the DESeqDataSet, but you need to make sure you have unique mgi_symbols for every ensembl_gene_id.

sum(is.na(annotation$mgi_symbol))
sum(duplicated(annotation$mgi_symbol))

If these sums above are both 0, then:

rownames(dds) <- mcols(dds)$mgi_symbol

should work.

If these sums are not 0, you can manually supply the row labels to the heatmap.2() function using the "labRow" argument. See ?heatmap.2

You can add this mgi symbol column to the results table like so:

res <- results(dds)
res$mgi_symbol <- mcols(dds)$mgi_symbol

I don't know the details on how to tell ReportingTools which columns to output into the final report. The ReportingTools vignette might have some more information.

ADD COMMENTlink modified 3.6 years ago • written 3.6 years ago by Michael Love18k
1

Thanks for getting back to me so quickly!

Unfortunately, the annotation (gencode M3) is not as clean as one would like, so 

sum(duplicated(annotation$mgi_symbol)) # corrected your code here: did you mean this?
is non-zero.

To get around this I followed your advice (posting code here in case someone else has this question):

annotation <- annotation[match(rownames(dds), annotation$ensembl_gene_id),]
all(rownames(dds) == annotation$ensembl_gene_id)
TRUE

sum(is.na(annotation$mgi_symbol))
0

sum(duplicated(annotation$mgi_symbol)) 
2680

Many of these are have alternative transcripts annotated as separate genes:

ENSMUSG00000062078 and ENSMUSG00000044407 Qk

ENSMUSG00000067798, ENSMUSG00000040003, and ENSMUSG00000073174 - Magi2 PCG and ncRNA

So I shouldn't be ignoring them. For the heat map, therefor, I can use 

heatmap.2(assay(vsd)[select,], col = hmcol, Rowv = FALSE, Colv = FALSE, scale="none",dendrogram="none", trace="none", margin=c(10, 6), labRow = annotation[select,"mgi_symbol"])

For exporting to a report, I just merged my differential gene list (as a data frame) with the annotation data frame and exported the merged data frame instead of the DESeq2 object.

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by Darya Vanichkina90

Hi Michael,

I have the same question on the project I am involved in. I tried the line of code you mentioned: 

all(rownames(dds) == annotation$ensembl_gene_id)

But instead of'TRUE', I got 'NA ' as my answer. So I tried 

all(rownames(dds) == annotation$ensembl_gene_id,na.rm = TRUE)

which gave me TRUE as the answer. But when I tried

sum(is.na(rownames(dds))),

the answer is 0. I am not able to figure out what the difference is. 

Also, on a slightly tangential track, most of the genes in the biomaRt database are missing mgi_symbols. I am using the Rattus norvegicus annotation dataset from biomaRt. Is there an alternative to this?

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by krishnakarthik.vemuri0

Hi Michael,

I have a question about checking order of rows between the columns. My dds and lengthData objects both have the same number of rows (lengthData is a data frame with only one column named lengthData):

> nrow(dds)
[1] 16548
> nrow(lengthData)
[1] 16548

And I was pretty sure they had the same row order:

> head(rownames(dds))
[1] "ML000110a" "ML000111a" "ML000112a" "ML000113a" "ML000114a" "ML000115a"
> head(rownames(lengthData))
[1] "ML000110a" "ML000111a" "ML000112a" "ML000113a" "ML000114a" "ML000115a"
> tail(rownames(dds))
[1] "ML50541a" "ML50721a" "ML50771a" "ML50791a" "ML50792a" "ML50851a"
> tail(rownames(lengthData))
[1] "ML50541a" "ML50721a" "ML50771a" "ML50791a" "ML50792a" "ML50851a"

But this returns FALSE:

> all(rownames(dds) == lengthData$lengthData)
[1] FALSE

Still this works: mcols(dds) <- cbind(mcols(dds), lengthData) and I am able to use the fpkm() function after renaming lengthData column to basepairs.

How can I be sure that lengthData and dds do not have the same row order?

 

ADD REPLYlink written 2.0 years ago by Jon Bråte150

What about:

all.equal(rownames(dds), rownames(lengthData))
ADD REPLYlink written 2.0 years ago by Michael Love18k

TRUE!
Thanks for showing me `all.equal()`

Jon
 

ADD REPLYlink written 2.0 years ago by Jon Bråte150

HI Michael,

If your time allows, Can you please look into this problem? Add new columns to DESeqResults from text file by matching rownames.

ADD REPLYlink written 8 months ago by kirannbishwa010

Hi Micheal

I think the gene Symbol when we use DESeqDataSetFromMatrix, it is somehow easy compare the unclear way how to deal with the SummarizedExperiment object

when we construct the count matrix from featcherCounts we add the argument (GTF.attrType="gene_name")

I would like you to answer our questions in clear way how to deal with TxDb object and the GRangeList and get the genes symbol as rownames in the assay of the RangedSummarizedExperiment object.

 

I will write now a separate question hoping to get an answer or clearer vignette concerning SummarizedExperiment object.

 

Thanks

ADD REPLYlink written 15 months ago by layal87850
0
gravatar for bioprog
2.4 years ago by
bioprog0
bioprog0 wrote:

I tried the above steps but i cant get it to work to add my features. Here is what I tried:

dds <- DESeqDataSetFromMatrix(countData = countData, colData = colData, design = ~ condition)
dds$condition <- relevel(dds$condition, "WT")
dds <- DESeq(dds)

#Add Feature data
featureData <-read.delim(file="FeatureData.txt", sep="\t", header=T)
all(rownames(dds) ==featureData$GeneID) #to confirm that they are in the same order
-->TRUE
mcols(dds) <- cbind(mcols(dds), featureData)
names(mcols(dds)) #I see the new columns added at the end

res <- results(dos) #When i do this, i get the following error:

Error in results(dds) : 
  cannot find results columns in object, first call DESeq, nbinomWaldTest, or nbinomLRT

If I do NOT add the feature, i dont get an error. hence, how can I run this with the added features, and how can i output a table with the added features include in it?

Further, how can I output a matrix with the normalized counts of each sample along with the log2fold change and p-value results, + added features? 

 

 

 

ADD COMMENTlink written 2.4 years ago by bioprog0

hi, 

Can you confirm that you are using DESeq2 version 1.10?

packageVersion("DESeq2)
ADD REPLYlink written 2.4 years ago by Michael Love18k

No, i am using 1.6.3 apparently when checking by:

packageVersion("DESeq2")
[1] ‘1.6.3’

To try to get 1.10, i first removed DESeq2 using remove.packages('DESeq2')

then

source("http://bioconductor.org/biocLite.R")
biocLite('DESeq2')

then reloaded R and the DESeq2, but i still got 1.6.3

Here's my R version:

"R version 3.1.2 (2014-10-31)"

I then downloaded the .tar file and i think i got it as 1.10.1 

packageVersion("DESeq2")
[1] ‘1.10.1’

However, DESEq2 wouldnt work then. I get:

Error: package ‘SummarizedExperiment’ required by ‘DESeq2’ could not be found << this is not available for 3.1 version of R
In addition: Warning message:
package ‘DESeq2’ was built under R version 3.2.3 

Unfortunately, once I upgrade to 3.2.3, I lose most of my working R packages so ih ad to downgrade to 3.1

Is that the right package for the above steps to work?

The manual states the following step for features:

mcols(dds) <- DataFrame(mcols(dds), featureData), which didn't work for me. 

Why do you prefer cbind? And does the mcols(dds) step automatically merge data based on correct ID irrespective if it is in same order?

I hope you can also advise whether there is a way to output a whole dataframe/matrix with the counts for each sample, as well as the added features, log2 fold, p value and the adjusted p value. It would be quite useful for looking into specific counts.

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by bioprog0
Check this page: http://bioconductor.org/install You will need to upgrade to the latest version of R, and update all Bioconductor packages. This is best practice to stay up to date because you want to benefit from the bugs fixed (including this bug about adding to mcols) and features added to all the various Bioconductor packages since October 2014.
ADD REPLYlink written 2.4 years ago by Michael Love18k

thanks, i tried that and lost most of my packages unfortunately due to the fact it wont work with the upgraded version of R! Anyway, in my current version ,can i at least output a table iwht the normalized counts for each samples along with the log2 fold and p values?

ADD REPLYlink written 2.4 years ago by bioprog0
1

I think this should work, anyway, you can get the normalized counts matrix with the code below and combine with whatever other vectors you like:

res.mat <- cbind(counts(dds, normalized=TRUE), res$log2FoldChange, res$padj)
ADD REPLYlink written 2.4 years ago by Michael Love18k
0
gravatar for bioprog
2.4 years ago by
bioprog0
bioprog0 wrote:

 

 

.

 

ADD COMMENTlink modified 2.4 years ago • written 2.4 years ago by bioprog0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 373 users visited in the last hour