Search
Question: How to process ids containing "." in BiomaRt?
0
gravatar for linda.c.dansereau
4 months ago by
linda.c.dansereau0 wrote:

Hello,

I am trying to annotate an SCEset using getBMFeatureAnnos where the filter column contains values such as "MTCE.31" and "MTCE.23". In making the SCEset these are recognized as different row names, however when running the getBM function the ".31" and ".23" are ignored and they are interpreted as duplicate row.names. Is there another way to format the column to get around this?

Thank you for any advice you may have.

#TestData loaded as a .csv file

TestData <- read.csv("testdata.csv", colClasses = c(list("character"), rep("numeric", 8)), row.names = 1)

TestData
#        X cell.1a cell.1b cell.1c cell.2a cell.2b cell.3a cell.3b cell.3c
#1 2RSSE.1     866    1404     898     129    1053     141      33      70
#2 2RSSE.2      58     171      65      17      70      36      11      17
#3 MTCE.23   14911   27132   10405   82033  117449   57775   11544   14426
#4 MTCE.25    1888    3615    1453    5891   40047    9144    2396    2947
#5 MTCE.31   20818   38746   12289  235235  211993  109575   19117   20580
#6   cct-6    1488    2236    1274     487    6430    1006    2311     381
#7   cct-8    1113    1679    1099     530    3727    1012    1135     130
#8   CD4.3      58      70      64      45     122      19      59      70
#9   CD4.7      34      37      27      56     400      11      53      88

sce <- newSCESet(countData = TestData)

sce <- getBMFeatureAnnos(sce, filters = "external_gene_name", attributes = c("wormbase_gene", "ensembl_gene_id","external_gene_name", "chromosome_name", "transcript_biotype", "go_id", "kegg_enzyme", "entrezgene"), feature_symbol = "external_gene_name", feature_id = "wormbase_gene", biomart = "ENSEMBL_MART_ENSEMBL", dataset = "celegans_gene_ensembl", host = "www.ensembl.org")

Error in `row.names<-.data.frame`(`*tmp*`, value = value) : 
  duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘2RSSE’, ‘CD4’, ‘MTCE’

sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] biomaRt_2.30.0       scater_1.2.0         ggplot2_2.2.1        Biobase_2.34.0      
 [5] BiocGenerics_0.20.0  gplots_3.0.1         RColorBrewer_1.1-2   edgeR_3.16.5        
 [9] limma_3.30.13        openxlsx_4.0.17      BiocInstaller_1.24.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.11         locfit_1.5-9.1       lattice_0.20-34      GO.db_3.4.0         
 [5] gtools_3.5.0         assertthat_0.2.0     digest_0.6.12        mime_0.5            
 [9] R6_2.2.2             plyr_1.8.4           stats4_3.3.2         RSQLite_2.0         
[13] zlibbioc_1.20.0      rlang_0.1.1          lazyeval_0.2.0       data.table_1.10.4   
[17] gdata_2.18.0         blob_1.1.0           S4Vectors_0.12.2     stringr_1.2.0       
[21] RCurl_1.95-4.8       bit_1.1-12           munsell_0.4.3        shiny_1.0.3         
[25] httpuv_1.3.5         vipor_0.4.5          pkgconfig_2.0.1      ggbeeswarm_0.5.3    
[29] htmltools_0.3.6      tximport_1.2.0       tibble_1.3.3         gridExtra_2.2.1     
[33] IRanges_2.8.2        matrixStats_0.52.2   XML_3.98-1.9         viridisLite_0.2.0   
[37] dplyr_0.7.1          bitops_1.0-6         grid_3.3.2           xtable_1.8-2        
[41] gtable_0.2.0         DBI_0.7              magrittr_1.5         scales_0.4.1        
[45] KernSmooth_2.23-15   stringi_1.1.5        reshape2_1.4.2       viridis_0.4.0       
[49] bindrcpp_0.2         org.Ce.eg.db_3.4.0   rjson_0.2.15         tools_3.3.2         
[53] bit64_0.9-7          glue_1.1.1           beeswarm_0.2.3       AnnotationDbi_1.36.2
[57] colorspace_1.3-2     rhdf5_2.18.0         caTools_1.17.1       shinydashboard_0.6.1
[61] memoise_1.1.0        bindr_0.1           
ADD COMMENTlink modified 4 months ago by Mike Smith2.1k • written 4 months ago by linda.c.dansereau0
0
gravatar for Mike Smith
4 months ago by
Mike Smith2.1k
EMBL Heidelberg / de.NBI
Mike Smith2.1k wrote:

Hi,

I don't think this is an issue with biomaRt, but rather with scater.  biomaRt will happily search for gene IDs with '.' in them.

However the function getBMFeatureAnnos() in scater has the following lines, which will strip any numbers after the '.' from your gene names.

## Remove transcript ID artifacts from runKallisto (eg. ENSMUST00000201087.11 -> ENSMUST00000201087)
feature_ids <- gsub(pattern = "\\.[0-9]+", replacement = "", x = feature_ids)

I guess this probably creates invalid gene names in your case, and it also produces duplicate values which are then used to set the row names later in the function.

rownames(feature_info_full) <- feature_ids

You probably want to contact the scater authors to see if this number stripping can be made optional.

ADD COMMENTlink written 4 months ago by Mike Smith2.1k

Ah, thanks for finding that.

ADD REPLYlink written 4 months ago by linda.c.dansereau0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 150 users visited in the last hour