edgeR adding RefSeq IDs
1
0
Entering edit mode
Guest User ★ 13k
@guest-user-4897
Last seen 10.3 years ago
My pipeline for adding annotations suddenly didn't work, maybe because some packages are updated. Could you check if something is wrong with my pipeline, I loose all the rows when I add annotations: > data <- read.table("~/Documents/Jobb/mRNA_Ch12/TreatedControlCounts.csv", row.names=1, sep="", header=T) > head(data) T0h T0.25h T0.5h T1h T2h T3h T6h T12h T24h T48h C0h C0.25h C0.5h C1h C2h C3h NM_001001130 68 95 56 43 66 62 68 90 63 89 65 85 58 49 81 49 NM_001001144 0 1 4 0 1 1 1 4 1 2 1 3 1 0 6 3 NM_001001152 79 129 52 50 24 45 130 154 112 147 56 85 67 33 52 31 NM_001001160 1 1 2 0 0 1 0 0 0 1 0 1 2 4 2 1 NM_001001176 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NM_001001177 1 3 2 3 0 1 1 0 2 0 1 6 4 1 3 0 C6h C12h C24h C48h NM_001001130 76 73 48 77 NM_001001144 0 1 2 1 NM_001001152 93 77 65 109 NM_001001160 0 2 1 0 NM_001001176 0 0 0 0 NM_001001177 0 3 0 2 > y <- DGEList(counts=data[,1:20], genes=data[,0:1]) > library(org.Mm.eg.db) > idfound <- rownames(y$genes) %in% mappedRkeys(org.Mm.egREFSEQ) > y <- y[idfound,] > dim(y) [1] 0 20 > egREFSEQ <- toTable(org.Mm.egREFSEQ) > m <- match(rownames(y$genes), egREFSEQ$accession) > y$genes$EntrezGene <- egREFSEQ$gene_id[m] > y$genes$Symbol <- egSYMBOL$symbol[m] > m <- match(y$genes$EntrezGene, egSYMBOL$gene_id) > y$genes$Symbol <- egSYMBOL$symbol[m] > head(y$genes) [1] genes EntrezGene Symbol <0 rows> (or 0-length row.names) -- output of sessionInfo(): > sessionInfo() R version 3.0.2 (2013-09-25) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] edgeR_3.4.0 limma_3.18.0 GSEABase_1.24.0 [4] annotate_1.40.0 org.Mm.eg.db_2.10.1 AnnotationForge_1.4.0 [7] org.Hs.eg.db_2.10.1 GOstats_2.28.0 graph_1.40.0 [10] Category_2.28.0 GO.db_2.10.1 RSQLite_0.11.4 [13] DBI_0.2-7 Matrix_1.0-14 lattice_0.20-24 [16] AnnotationDbi_1.24.0 Biobase_2.22.0 BiocGenerics_0.8.0 loaded via a namespace (and not attached): [1] genefilter_1.44.0 grid_3.0.2 IRanges_1.20.0 RBGL_1.38.0 [5] splines_3.0.2 stats4_3.0.2 survival_2.37-4 tools_3.0.2 [9] XML_3.98-1.1 xtable_1.7-1 -- Sent via the guest posting facility at bioconductor.org.
GO GO • 1.5k views
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 14 hours ago
United States
Hi Robin, On Monday, October 21, 2013 9:23:34 AM, Robin [guest] wrote: > > My pipeline for adding annotations suddenly didn't work, maybe because some packages are updated. Could you check if something is wrong with my pipeline, I loose all the rows when I add annotations: > >> data <- read.table("~/Documents/Jobb/mRNA_Ch12/TreatedControlCounts.csv", row.names=1, sep="", header=T) >> head(data) > T0h T0.25h T0.5h T1h T2h T3h T6h T12h T24h T48h C0h C0.25h C0.5h C1h C2h C3h > NM_001001130 68 95 56 43 66 62 68 90 63 89 65 85 58 49 81 49 > NM_001001144 0 1 4 0 1 1 1 4 1 2 1 3 1 0 6 3 > NM_001001152 79 129 52 50 24 45 130 154 112 147 56 85 67 33 52 31 > NM_001001160 1 1 2 0 0 1 0 0 0 1 0 1 2 4 2 1 > NM_001001176 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 > NM_001001177 1 3 2 3 0 1 1 0 2 0 1 6 4 1 3 0 > C6h C12h C24h C48h > NM_001001130 76 73 48 77 > NM_001001144 0 1 2 1 > NM_001001152 93 77 65 109 > NM_001001160 0 2 1 0 > NM_001001176 0 0 0 0 > NM_001001177 0 3 0 2 >> y <- DGEList(counts=data[,1:20], genes=data[,0:1]) What are you expecting data[,0:1] to do? Note that R is not Perl, so counting starts at 1, so R is ignoring the zero. What you are in fact doing is putting the first column of your 'data' data.frame into the DGEList object, and since you are only using one column, R is dropping the dimension attribute, and thereby reducing to a vector. You later ask for the row.names() of this vector, and since it has no rows, it has no row.names either. As an example: > x <- data.frame(letters=1:26) > row.names(x) <- letters > x[,0:1] [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 [26] 26 > row.names(x[,0:1]) > Not that the call to row.names above didn't return anything, because there wasn't anything to return. In addition, you are going through a lot of unnecessary work below, trying to get mappings to Entrez Gene from RefSeq IDs. This is actually a simple one-liner. gns <- select(org.Hs.eg.db, row.names(data), c("ENTREZID","SYMBOL"), "ACCNUM") You may get a warning that you have some one-to-many mappings here. How you deal with that is up to you. I normally take the most naive approach possible and just do gns <- gns[!duplicated(gns[,1]),] After which you an check that all.equal(gns$ACCNUM, row.names(data)) and then you can do y <- DGEList(data, genes = gns) Best, Jim >> library(org.Mm.eg.db) >> idfound <- rownames(y$genes) %in% mappedRkeys(org.Mm.egREFSEQ) >> y <- y[idfound,] >> dim(y) > [1] 0 20 >> egREFSEQ <- toTable(org.Mm.egREFSEQ) >> m <- match(rownames(y$genes), egREFSEQ$accession) >> y$genes$EntrezGene <- egREFSEQ$gene_id[m] >> y$genes$Symbol <- egSYMBOL$symbol[m] >> m <- match(y$genes$EntrezGene, egSYMBOL$gene_id) >> y$genes$Symbol <- egSYMBOL$symbol[m] >> head(y$genes) > [1] genes EntrezGene Symbol > <0 rows> (or 0-length row.names) > > -- output of sessionInfo(): > >> sessionInfo() > R version 3.0.2 (2013-09-25) > Platform: x86_64-pc-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 > [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C > [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods base > > other attached packages: > [1] edgeR_3.4.0 limma_3.18.0 GSEABase_1.24.0 > [4] annotate_1.40.0 org.Mm.eg.db_2.10.1 AnnotationForge_1.4.0 > [7] org.Hs.eg.db_2.10.1 GOstats_2.28.0 graph_1.40.0 > [10] Category_2.28.0 GO.db_2.10.1 RSQLite_0.11.4 > [13] DBI_0.2-7 Matrix_1.0-14 lattice_0.20-24 > [16] AnnotationDbi_1.24.0 Biobase_2.22.0 BiocGenerics_0.8.0 > > loaded via a namespace (and not attached): > [1] genefilter_1.44.0 grid_3.0.2 IRanges_1.20.0 RBGL_1.38.0 > [5] splines_3.0.2 stats4_3.0.2 survival_2.37-4 tools_3.0.2 > [9] XML_3.98-1.1 xtable_1.7-1 > > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099
ADD COMMENT
0
Entering edit mode
Small comment on Jim's script On Mon, Oct 21, 2013 at 10:50 AM, James W. MacDonald <jmacdon@uw.edu> wrote: > gns <- select(org.Hs.eg.db, row.names(data), c("ENTREZID","SYMBOL"), > "ACCNUM") > > You may get a warning that you have some one-to-many mappings here. How > you deal with that is up to you. I normally take the most naive approach > possible and just do > > gns <- gns[!duplicated(gns[,1]),] > Because duplicated is FALSE only for the second occurance of an object, ie > duplicated(c("a", "a")) [1] FALSE TRUE you are effectively resolving your one-to-many mappings by including only the first mapping in the object. This (of course) depends on the order of the mappings. I may be wrong, but I don't think there is any implied order of the mappings, and even if there is, I am not sure that the database interface respects this order in how it returns things. Unfortunately, the code above hides the fact that you are just picking a random mapping, so I would suggest something like tapply(gns, gns[,1], function(xx) { xx[ sample(1:nrow(xx), size = 1), ] }) which makes the random nature obvious, or, in case you want to discard stuff, gns[ ! gns[,1] %in% gns[duplicated(gns[,1]),1],] (code not fully tested) Best, Kasper [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
Thanks guys, it worked. My code also worked if I justed changed [,0:1] to [,1:1]. I will try your suggestion Kasper. On Mon, Oct 21, 2013 at 5:24 PM, Kasper Daniel Hansen < kasperdanielhansen@gmail.com> wrote: > Small comment on Jim's script > > > On Mon, Oct 21, 2013 at 10:50 AM, James W. MacDonald <jmacdon@uw.edu>wrote: > >> gns <- select(org.Hs.eg.db, row.names(data), c("ENTREZID","SYMBOL"), >> "ACCNUM") >> >> You may get a warning that you have some one-to-many mappings here. How >> you deal with that is up to you. I normally take the most naive approach >> possible and just do >> >> gns <- gns[!duplicated(gns[,1]),] >> > > Because duplicated is FALSE only for the second occurance of an object, ie > > duplicated(c("a", "a")) > [1] FALSE TRUE > you are effectively resolving your one-to-many mappings by including only > the first mapping in the object. This (of course) depends on the order of > the mappings. I may be wrong, but I don't think there is any implied order > of the mappings, and even if there is, I am not sure that the database > interface respects this order in how it returns things. Unfortunately, the > code above hides the fact that you are just picking a random mapping, so I > would suggest something like > > tapply(gns, gns[,1], function(xx) { > xx[ sample(1:nrow(xx), size = 1), ] > }) > > which makes the random nature obvious, or, in case you want to discard > stuff, > gns[ ! gns[,1] %in% gns[duplicated(gns[,1]),1],] > > (code not fully tested) > > Best, > Kasper > > [[alternative HTML version deleted]]
ADD REPLY

Login before adding your answer.

Traffic: 809 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6