Question

How to process ids containing "." in BiomaRt?

linda.c.dansereau • 0

Hello,

I am trying to annotate an SCESet using getBMFeatureAnnos, where the filter column contains values such as "MTCE.31" and "MTCE.23". When the SCESet is created these are recognized as distinct row names. However, when getBMFeatureAnnos runs, the ".31" and ".23" suffixes are dropped and the IDs are treated as duplicate row names. Is there another way to format the column to get around this?

Thank you for any advice you may have.

#TestData loaded as a .csv file

TestData <- read.csv("testdata.csv", colClasses = c("character", rep("numeric", 8)), row.names = 1)

TestData
#        X cell.1a cell.1b cell.1c cell.2a cell.2b cell.3a cell.3b cell.3c
#1 2RSSE.1     866    1404     898     129    1053     141      33      70
#2 2RSSE.2      58     171      65      17      70      36      11      17
#3 MTCE.23   14911   27132   10405   82033  117449   57775   11544   14426
#4 MTCE.25    1888    3615    1453    5891   40047    9144    2396    2947
#5 MTCE.31   20818   38746   12289  235235  211993  109575   19117   20580
#6   cct-6    1488    2236    1274     487    6430    1006    2311     381
#7   cct-8    1113    1679    1099     530    3727    1012    1135     130
#8   CD4.3      58      70      64      45     122      19      59      70
#9   CD4.7      34      37      27      56     400      11      53      88

sce <- newSCESet(countData = TestData)

sce <- getBMFeatureAnnos(sce,
                         filters = "external_gene_name",
                         attributes = c("wormbase_gene", "ensembl_gene_id", "external_gene_name",
                                        "chromosome_name", "transcript_biotype", "go_id",
                                        "kegg_enzyme", "entrezgene"),
                         feature_symbol = "external_gene_name",
                         feature_id = "wormbase_gene",
                         biomart = "ENSEMBL_MART_ENSEMBL",
                         dataset = "celegans_gene_ensembl",
                         host = "www.ensembl.org")

Error in `row.names<-.data.frame`(`*tmp*`, value = value) : 
  duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘2RSSE’, ‘CD4’, ‘MTCE’

sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] biomaRt_2.30.0       scater_1.2.0         ggplot2_2.2.1        Biobase_2.34.0      
 [5] BiocGenerics_0.20.0  gplots_3.0.1         RColorBrewer_1.1-2   edgeR_3.16.5        
 [9] limma_3.30.13        openxlsx_4.0.17      BiocInstaller_1.24.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.11         locfit_1.5-9.1       lattice_0.20-34      GO.db_3.4.0         
 [5] gtools_3.5.0         assertthat_0.2.0     digest_0.6.12        mime_0.5            
 [9] R6_2.2.2             plyr_1.8.4           stats4_3.3.2         RSQLite_2.0         
[13] zlibbioc_1.20.0      rlang_0.1.1          lazyeval_0.2.0       data.table_1.10.4   
[17] gdata_2.18.0         blob_1.1.0           S4Vectors_0.12.2     stringr_1.2.0       
[21] RCurl_1.95-4.8       bit_1.1-12           munsell_0.4.3        shiny_1.0.3         
[25] httpuv_1.3.5         vipor_0.4.5          pkgconfig_2.0.1      ggbeeswarm_0.5.3    
[29] htmltools_0.3.6      tximport_1.2.0       tibble_1.3.3         gridExtra_2.2.1     
[33] IRanges_2.8.2        matrixStats_0.52.2   XML_3.98-1.9         viridisLite_0.2.0   
[37] dplyr_0.7.1          bitops_1.0-6         grid_3.3.2           xtable_1.8-2        
[41] gtable_0.2.0         DBI_0.7              magrittr_1.5         scales_0.4.1        
[45] KernSmooth_2.23-15   stringi_1.1.5        reshape2_1.4.2       viridis_0.4.0       
[49] bindrcpp_0.2         org.Ce.eg.db_3.4.0   rjson_0.2.15         tools_3.3.2         
[53] bit64_0.9-7          glue_1.1.1           beeswarm_0.2.3       AnnotationDbi_1.36.2
[57] colorspace_1.3-2     rhdf5_2.18.0         caTools_1.17.1       shinydashboard_0.6.1
[61] memoise_1.1.0        bindr_0.1

sce biomart annotation

updated 6.8 years ago by Mike Smith • written 6.8 years ago by linda.c.dansereau

score 0 · Answer 1 · 2017-07-07

Hi,

I don't think this is an issue with biomaRt, but rather with scater. biomaRt will happily search for gene IDs with '.' in them.

However, the function getBMFeatureAnnos() in scater contains the following lines, which strip the '.' and any digits that follow it from your gene names.

## Remove transcript ID artifacts from runKallisto (eg. ENSMUST00000201087.11 -> ENSMUST00000201087)
feature_ids <- gsub(pattern = "\\.[0-9]+", replacement = "", x = feature_ids)

I guess this probably creates invalid gene names in your case, and it also produces duplicate values which are then used to set the row names later in the function.

rownames(feature_info_full) <- feature_ids
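
To see the effect outside of scater, here is a minimal base R sketch that applies the same gsub() call to the IDs from your test data. The data frame and the tryCatch() wrapper are only for illustration and are not taken from the scater source.

```r
# Feature IDs taken from the test data in the question
feature_ids <- c("2RSSE.1", "2RSSE.2", "MTCE.23", "MTCE.25", "MTCE.31",
                 "cct-6", "cct-8", "CD4.3", "CD4.7")

# The same substitution quoted above from getBMFeatureAnnos()
stripped <- gsub(pattern = "\\.[0-9]+", replacement = "", x = feature_ids)
print(stripped)

# IDs that collide once the ".<digits>" part has been removed
dup_ids <- unique(stripped[duplicated(stripped)])
print(dup_ids)

# Illustrative stand-in for feature_info_full: assigning the stripped IDs
# as row names of a data.frame raises an error because of the duplicates
feature_info_full <- data.frame(count = seq_along(stripped))
result <- tryCatch({
    rownames(feature_info_full) <- stripped
    "ok"
}, error = function(e) conditionMessage(e))
print(result)
```

The colliding IDs are the same three that appear in the warning message you posted.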

You probably want to contact the scater authors to see if this number stripping can be made optional.
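
As a rough idea of what a more selective version could look like, the sketch below only removes the suffix from IDs that look like versioned Ensembl stable IDs. The helper name strip_ensembl_version() is hypothetical and not part of scater, and the pattern assumes the usual "ENS" prefix followed by letters and 11 digits, so treat it as a starting point rather than a tested fix.

```r
# Hypothetical helper (not part of scater): remove a trailing ".<digits>"
# version suffix only from IDs that look like Ensembl stable IDs.
# Assumption: stable IDs are "ENS", optional letters, then 11 digits.
strip_ensembl_version <- function(ids) {
    sub(pattern = "^(ENS[A-Z]*[0-9]{11})\\.[0-9]+$",
        replacement = "\\1",
        x = ids)
}

ids <- c("ENSMUST00000201087.11", "MTCE.31", "MTCE.23", "CD4.3")
cleaned <- strip_ensembl_version(ids)
print(cleaned)
```

With a rule like this the kallisto-style transcript ID loses its version number, while names such as "MTCE.31" and "MTCE.23" are left alone and stay unique.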