edgeR adding RefSeq IDs

0

Entering edit mode

Guest User ★ 13k

@guest-user-4897

Last seen 9.6 years ago

My pipeline for adding annotations suddenly didn't work, maybe because some packages are updated. Could you check if something is wrong with my pipeline, I loose all the rows when I add annotations: > data <- read.table("~/Documents/Jobb/mRNA_Ch12/TreatedControlCounts.csv", row.names=1, sep="", header=T) > head(data) T0h T0.25h T0.5h T1h T2h T3h T6h T12h T24h T48h C0h C0.25h C0.5h C1h C2h C3h NM_001001130 68 95 56 43 66 62 68 90 63 89 65 85 58 49 81 49 NM_001001144 0 1 4 0 1 1 1 4 1 2 1 3 1 0 6 3 NM_001001152 79 129 52 50 24 45 130 154 112 147 56 85 67 33 52 31 NM_001001160 1 1 2 0 0 1 0 0 0 1 0 1 2 4 2 1 NM_001001176 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NM_001001177 1 3 2 3 0 1 1 0 2 0 1 6 4 1 3 0 C6h C12h C24h C48h NM_001001130 76 73 48 77 NM_001001144 0 1 2 1 NM_001001152 93 77 65 109 NM_001001160 0 2 1 0 NM_001001176 0 0 0 0 NM_001001177 0 3 0 2 > y <- DGEList(counts=data[,1:20], genes=data[,0:1]) > library(org.Mm.eg.db) > idfound <- rownames(y$genes) %in% mappedRkeys(org.Mm.egREFSEQ) > y <- y[idfound,] > dim(y) [1] 0 20 > egREFSEQ <- toTable(org.Mm.egREFSEQ) > m <- match(rownames(y$genes), egREFSEQ$accession) > y$genes$EntrezGene <- egREFSEQ$gene_id[m] > y$genes$Symbol <- egSYMBOL$symbol[m] > m <- match(y$genes$EntrezGene, egSYMBOL$gene_id) > y$genes$Symbol <- egSYMBOL$symbol[m] > head(y$genes) [1] genes EntrezGene Symbol <0 rows> (or 0-length row.names) -- output of sessionInfo(): > sessionInfo() R version 3.0.2 (2013-09-25) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] edgeR_3.4.0 limma_3.18.0 GSEABase_1.24.0 [4] annotate_1.40.0 org.Mm.eg.db_2.10.1 AnnotationForge_1.4.0 [7] org.Hs.eg.db_2.10.1 GOstats_2.28.0 graph_1.40.0 [10] Category_2.28.0 GO.db_2.10.1 RSQLite_0.11.4 [13] DBI_0.2-7 Matrix_1.0-14 lattice_0.20-24 [16] AnnotationDbi_1.24.0 Biobase_2.22.0 BiocGenerics_0.8.0 loaded via a namespace (and not attached): [1] genefilter_1.44.0 grid_3.0.2 IRanges_1.20.0 RBGL_1.38.0 [5] splines_3.0.2 stats4_3.0.2 survival_2.37-4 tools_3.0.2 [9] XML_3.98-1.1 xtable_1.7-1 -- Sent via the guest posting facility at bioconductor.org.

GO GO • 1.4k views

ADD COMMENT • link updated 10.5 years ago by James W. MacDonald 65k • written 10.5 years ago by Guest User ★ 13k

0

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 10 hours ago

United States

Hi Robin, On Monday, October 21, 2013 9:23:34 AM, Robin [guest] wrote: > > My pipeline for adding annotations suddenly didn't work, maybe because some packages are updated. Could you check if something is wrong with my pipeline, I loose all the rows when I add annotations: > >> data <- read.table("~/Documents/Jobb/mRNA_Ch12/TreatedControlCounts.csv", row.names=1, sep="", header=T) >> head(data) > T0h T0.25h T0.5h T1h T2h T3h T6h T12h T24h T48h C0h C0.25h C0.5h C1h C2h C3h > NM_001001130 68 95 56 43 66 62 68 90 63 89 65 85 58 49 81 49 > NM_001001144 0 1 4 0 1 1 1 4 1 2 1 3 1 0 6 3 > NM_001001152 79 129 52 50 24 45 130 154 112 147 56 85 67 33 52 31 > NM_001001160 1 1 2 0 0 1 0 0 0 1 0 1 2 4 2 1 > NM_001001176 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 > NM_001001177 1 3 2 3 0 1 1 0 2 0 1 6 4 1 3 0 > C6h C12h C24h C48h > NM_001001130 76 73 48 77 > NM_001001144 0 1 2 1 > NM_001001152 93 77 65 109 > NM_001001160 0 2 1 0 > NM_001001176 0 0 0 0 > NM_001001177 0 3 0 2 >> y <- DGEList(counts=data[,1:20], genes=data[,0:1]) What are you expecting data[,0:1] to do? Note that R is not Perl, so counting starts at 1, so R is ignoring the zero. What you are in fact doing is putting the first column of your 'data' data.frame into the DGEList object, and since you are only using one column, R is dropping the dimension attribute, and thereby reducing to a vector. You later ask for the row.names() of this vector, and since it has no rows, it has no row.names either. As an example: > x <- data.frame(letters=1:26) > row.names(x) <- letters > x[,0:1] [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 [26] 26 > row.names(x[,0:1]) > Not that the call to row.names above didn't return anything, because there wasn't anything to return. In addition, you are going through a lot of unnecessary work below, trying to get mappings to Entrez Gene from RefSeq IDs. This is actually a simple one-liner. gns <- select(org.Hs.eg.db, row.names(data), c("ENTREZID","SYMBOL"), "ACCNUM") You may get a warning that you have some one-to-many mappings here. How you deal with that is up to you. I normally take the most naive approach possible and just do gns <- gns[!duplicated(gns[,1]),] After which you an check that all.equal(gns$ACCNUM, row.names(data)) and then you can do y <- DGEList(data, genes = gns) Best, Jim >> library(org.Mm.eg.db) >> idfound <- rownames(y$genes) %in% mappedRkeys(org.Mm.egREFSEQ) >> y <- y[idfound,] >> dim(y) > [1] 0 20 >> egREFSEQ <- toTable(org.Mm.egREFSEQ) >> m <- match(rownames(y$genes), egREFSEQ$accession) >> y$genes$EntrezGene <- egREFSEQ$gene_id[m] >> y$genes$Symbol <- egSYMBOL$symbol[m] >> m <- match(y$genes$EntrezGene, egSYMBOL$gene_id) >> y$genes$Symbol <- egSYMBOL$symbol[m] >> head(y$genes) > [1] genes EntrezGene Symbol > <0 rows> (or 0-length row.names) > > -- output of sessionInfo(): > >> sessionInfo() > R version 3.0.2 (2013-09-25) > Platform: x86_64-pc-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 > [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C > [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods base > > other attached packages: > [1] edgeR_3.4.0 limma_3.18.0 GSEABase_1.24.0 > [4] annotate_1.40.0 org.Mm.eg.db_2.10.1 AnnotationForge_1.4.0 > [7] org.Hs.eg.db_2.10.1 GOstats_2.28.0 graph_1.40.0 > [10] Category_2.28.0 GO.db_2.10.1 RSQLite_0.11.4 > [13] DBI_0.2-7 Matrix_1.0-14 lattice_0.20-24 > [16] AnnotationDbi_1.24.0 Biobase_2.22.0 BiocGenerics_0.8.0 > > loaded via a namespace (and not attached): > [1] genefilter_1.44.0 grid_3.0.2 IRanges_1.20.0 RBGL_1.38.0 > [5] splines_3.0.2 stats4_3.0.2 survival_2.37-4 tools_3.0.2 > [9] XML_3.98-1.1 xtable_1.7-1 > > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099

ADD COMMENT • link 10.5 years ago James W. MacDonald 65k

0

Entering edit mode

Small comment on Jim's script On Mon, Oct 21, 2013 at 10:50 AM, James W. MacDonald <jmacdon@uw.edu> wrote: > gns <- select(org.Hs.eg.db, row.names(data), c("ENTREZID","SYMBOL"), > "ACCNUM") > > You may get a warning that you have some one-to-many mappings here. How > you deal with that is up to you. I normally take the most naive approach > possible and just do > > gns <- gns[!duplicated(gns[,1]),] > Because duplicated is FALSE only for the second occurance of an object, ie > duplicated(c("a", "a")) [1] FALSE TRUE you are effectively resolving your one-to-many mappings by including only the first mapping in the object. This (of course) depends on the order of the mappings. I may be wrong, but I don't think there is any implied order of the mappings, and even if there is, I am not sure that the database interface respects this order in how it returns things. Unfortunately, the code above hides the fact that you are just picking a random mapping, so I would suggest something like tapply(gns, gns[,1], function(xx) { xx[ sample(1:nrow(xx), size = 1), ] }) which makes the random nature obvious, or, in case you want to discard stuff, gns[ ! gns[,1] %in% gns[duplicated(gns[,1]),1],] (code not fully tested) Best, Kasper [[alternative HTML version deleted]]

ADD REPLY • link 10.5 years ago Kasper Daniel Hansen ★ 6.5k

0

Entering edit mode

Thanks guys, it worked. My code also worked if I justed changed [,0:1] to [,1:1]. I will try your suggestion Kasper. On Mon, Oct 21, 2013 at 5:24 PM, Kasper Daniel Hansen < kasperdanielhansen@gmail.com> wrote: > Small comment on Jim's script > > > On Mon, Oct 21, 2013 at 10:50 AM, James W. MacDonald <jmacdon@uw.edu>wrote: > >> gns <- select(org.Hs.eg.db, row.names(data), c("ENTREZID","SYMBOL"), >> "ACCNUM") >> >> You may get a warning that you have some one-to-many mappings here. How >> you deal with that is up to you. I normally take the most naive approach >> possible and just do >> >> gns <- gns[!duplicated(gns[,1]),] >> > > Because duplicated is FALSE only for the second occurance of an object, ie > > duplicated(c("a", "a")) > [1] FALSE TRUE > you are effectively resolving your one-to-many mappings by including only > the first mapping in the object. This (of course) depends on the order of > the mappings. I may be wrong, but I don't think there is any implied order > of the mappings, and even if there is, I am not sure that the database > interface respects this order in how it returns things. Unfortunately, the > code above hides the fact that you are just picking a random mapping, so I > would suggest something like > > tapply(gns, gns[,1], function(xx) { > xx[ sample(1:nrow(xx), size = 1), ] > }) > > which makes the random nature obvious, or, in case you want to discard > stuff, > gns[ ! gns[,1] %in% gns[duplicated(gns[,1]),1],] > > (code not fully tested) > > Best, > Kasper > > [[alternative HTML version deleted]]

ADD REPLY • link 10.5 years ago R ▴ 40

Login before adding your answer.