Question

Accessing feature data of a Malus domestica Nimblegen microarray

theobroma22 ▴ 10

Dear administrators,

I'm analyzing an apple (Malus x domestica) fruit dataset that I downloaded from NCBI GEO, accession GSE24523. The experiment used a NimbleGen microarray, platform GPL11164 (NimbleGen design name: 080501_GDR_Malus_EST-V4_EXP; NimbleGen design ID: 7552).

Using the oligo and pdInfoBuilder packages, I created an eset after converting the pair files to xys files:

> eset
ExpressionSet (storageMode: lockedEnvironment)
assayData: 193586 features, 24 samples
element names: exprs
protocolData
rowNames: GSM618107_14418002_532.xys GSM618108_12742302_532.xys ... GSM618130_12782502_532.xys (24 total)
varLabels: exprs dates
varMetadata: labelDescription channel
phenoData
rowNames: GSM618107_14418002_532.xys GSM618108_12742302_532.xys ... GSM618130_12782502_532.xys (24 total)
varLabels: Timepoint
varMetadata: channel labelDescription
featureData: none
experimentData: use 'experimentData(object)'
Annotation: pd.080501.gdr.malus.est.v4.exp

I'm able to construct volcano plots and use limma to test for differential expression. However, I'm not able to obtain feature data (annotation) for the assay data features. I do have the pmSequence nucleotide strings, with width and sequence:

> head(pmSeqHC)
  A DNAStringSet instance of length 6
    width seq
[1]    65 GTGTGAAACATGTTTGGGCACCATCAAATCTCAGACTATTATCTTTAGATGATAACATGATTCAA
[2]    54 TCTTTTCGGAATGTGAGAAATAGTGGTGTTTTCACTTTCTCTCGCATTTCAACT
[3]    50 GCCGTAACAACCCTGTGTGAGGCGGATTATGAAACAAGTAAAATGAGTGT
[4]    51 ATCTCTAGAATCTCAAGGTGGCCCTCTCTCACTGCTGCTGTTGTCGCGAAG
[5]    57 CAAAGACTTGTCCTTCTTTATATTCTATTCAGCACTTTTGCTCTCGCCAGTGGCATT
[6]    55 GTCCACTGAGGAATTTATTTCTATGGAATTCCGTTTTACTTTAAGAGATAATGGA

As well as the feature names:

> head(ID)
[1] "add_Affy_dap_1_1031" "add_Affy_dap_1_109"  "add_Affy_dap_1_1096" "add_Affy_dap_1_1198" "add_Affy_dap_1_1249" "add_Affy_dap_1_1300"

And the limma result rownames:

> head(tab)
                                logFC  AveExpr         t      P.Value    adj.P.Val        B
Contig9003_2_r_204_2_696     2.476124 5.441725 13.398555 1.322361e-12 2.559905e-07 17.05329
Malus_CN941480_2_f_292_5_373 1.764380 5.897758 10.625992 1.558974e-10 1.508977e-05 13.27949
Contig23165_2_r_145_8_238    1.639928 9.028845  9.973012 5.425434e-10 3.011584e-05 12.24010
Contig20209_1_f_261_1_795    2.927348 6.719980  9.902833 6.222731e-10 3.011584e-05 12.12464
Contig16255_2_f_213_3_301    2.496227 7.413003  9.229728 2.391889e-09 9.260724e-05 10.97910
Contig16255_1_r_157_2_270    2.584128 7.658646  9.133751 2.911828e-09 9.394820e-05 10.81005

My current issue is that I can't find any feature data online when using biomaRt to annotate the result. I've tried plants.ensembl.org as well as plantdb.org. I also tried rosaceae.org, but it seems they don't have the feature data in their database, even though this group constructed the NimbleGen array. Below is one example of the many variations of code I tried in order to map the array features:

ID = featureNames(eset)
length(ID)
[1] 193586
plantbase <- useMart(biomart = "plants_mart", host = "plants.ensembl.org")
plantbase <- useDataset(mart = plantbase, dataset = "ptrichocarpa_eg_gene")
listDatasets(plantbase)
listAttributes(plantbase)
listFilters(plantbase)
getBM(attributes = c("ensembl_transcript_id", "uniprot_swissprot_accession"),
      filters="ensembl_transcript_id",
      values= ID[1:5],
      mart=plantbase)

This is the typical output of my query. It runs without errors but returns no rows:

> getBM(attributes = c("ensembl_transcript_id", "uniprot_swissprot_accession"), 
+       filters="ensembl_transcript_id", 
+       values= ID[1:5],
+   mart=plantbase)
[1] ensembl_transcript_id       uniprot_swissprot_accession
<0 rows> (or 0-length row.names)

Any help or insight is greatly appreciated. Thanks, Franklin

> sessionInfo()

R version 3.3.1 (2016-06-21)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] R6_2.2.0                             rentrez_1.0.4                        annotate_1.50.0                      XML_3.98-1.4                        
 [5] AnnotationDbi_1.34.4                 biomaRt_2.28.0                       genefilter_1.54.2                    pd.080501.gdr.malus.est.v4.exp_0.0.1
 [9] limma_3.28.18                        pdInfoBuilder_1.36.0                 affxparser_1.44.0                    RSQLite_1.0.0                       
[13] DBI_0.5                              oligo_1.36.1                         Biostrings_2.40.2                    XVector_0.12.1                      
[17] IRanges_2.6.1                        S4Vectors_0.10.3                     Biobase_2.32.0                       oligoClasses_1.34.0                 
[21] BiocGenerics_0.18.0                 

loaded via a namespace (and not attached):
 [1] bit_1.1-12                 codetools_0.2-14           preprocessCore_1.34.0      splines_3.3.1              curl_1.2                  
 [6] grid_3.3.1                 bitops_1.0-6               httr_1.2.1                 survival_2.39-4            zlibbioc_1.18.0           
[11] lattice_0.20-33            foreach_1.4.3              iterators_1.0.8            Matrix_1.2-6               KernSmooth_2.23-15        
[16] jsonlite_1.0               BiocInstaller_1.22.3       SummarizedExperiment_1.2.3 RCurl_1.95-4.8             tools_3.3.1               
[21] affyio_1.42.0              GenomeInfoDb_1.8.3         GenomicRanges_1.24.2       ff_2.2-13                  xtable_1.8-2              
>

nimblegen Biomart Uniprot • 1.1k views

ADD COMMENT • link updated 7.4 years ago by Gordon Smyth 50k • written 7.4 years ago by theobroma22 ▴ 10

First of all, I don't know anything about the organism you are investigating. However, a simple Google search turned up this site (@ rosaceae), https://www.rosaceae.org/species/malus/malus_spp/unigene_v4, which seems to contain (a little) more annotation info for the v4 release. Specifically, the download section offers BLAST results files that map GDR_ID (e.g. Malus_v4_Contig1) to, for example, SwissProt IDs. I think this is the kind of info you are after.
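To make the mapping idea above concrete, here is a minimal sketch in base R. It assumes that each probe name in the limma table has the form `<target id>_<n>_<f|r>_<n>_<n>_<n>`, which is only inferred from the six rownames shown in the question, and that a BLAST results file has already been read into a two-column table. The function `probe_to_target()`, the table `blast_map`, and the accession values in it are hypothetical placeholders, not real GDR content; the real GDR_ID may also carry a prefix such as `Malus_v4_` that would need to be added before matching.

```r
# Probe names copied from the limma table in the question.
probe_ids <- c("Contig9003_2_r_204_2_696",
               "Malus_CN941480_2_f_292_5_373",
               "Contig16255_2_f_213_3_301",
               "Contig16255_1_r_157_2_270")

# ASSUMPTION: a probe name is "<target id>_<n>_<f|r>_<n>_<n>_<n>".
# Names that do not fit this pattern are returned unchanged.
probe_to_target <- function(x) {
  sub("_[0-9]+_[fr]_[0-9]+_[0-9]+_[0-9]+$", "", x)
}

target_ids <- probe_to_target(probe_ids)

# HYPOTHETICAL lookup table standing in for a parsed BLAST results file;
# the accessions are placeholders, not real hits.
blast_map <- data.frame(
  target_id = c("Contig9003", "Contig16255"),
  swissprot = c("PLACEHOLDER_1", "PLACEHOLDER_2"),
  stringsAsFactors = FALSE
)

# One row per probe; targets without a hit get NA.
annot <- data.frame(
  probe_id  = probe_ids,
  target_id = target_ids,
  swissprot = blast_map$swissprot[match(target_ids, blast_map$target_id)],
  stringsAsFactors = FALSE
)
print(annot)
```

Using `match()` rather than `merge()` keeps the rows in the same order as the probes, which matters if the result is later attached to an ExpressionSet or a limma table.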

Having said this, I realize the info is rather outdated. Although a V5 annotation release seems to be available, it is not clear to me how it relates to the NimbleGen array you would like to analyze. Other than getting in touch with the people who maintain the GDR, I cannot think of a solution for this.

ADD REPLY • link 7.4 years ago Guido Hooiveld ★ 3.9k

This is always the problem with a non-model organism. You can use whatever annotations someone else did (and for which you have little knowledge of how they did the annotations), or you can 'roll your own'. It wouldn't take much time to blast those sequences against nt, for what that's worth.
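As a rough illustration of the first step of that route, the sketch below writes probe sequences to a FASTA file that could then be submitted to BLAST. It uses base R only; the helper `write_fasta()` and the `probe_1`, `probe_2`, ... names are hypothetical, and the three sequences are simply the first three shown in the question. With Biostrings loaded, `writeXStringSet(pmSeqHC, "probes.fasta")` should achieve the same for a named DNAStringSet.

```r
# First three probe sequences shown in the question (from pmSeqHC).
seqs <- c(
  "GTGTGAAACATGTTTGGGCACCATCAAATCTCAGACTATTATCTTTAGATGATAACATGATTCAA",
  "TCTTTTCGGAATGTGAGAAATAGTGGTGTTTTCACTTTCTCTCGCATTTCAACT",
  "GCCGTAACAACCCTGTGTGAGGCGGATTATGAAACAAGTAAAATGAGTGT"
)

# HYPOTHETICAL names; in practice use the matching probe identifiers.
names(seqs) <- paste0("probe_", seq_along(seqs))

# Write a named character vector as FASTA: one header line, one sequence line.
write_fasta <- function(seqs, path) {
  stopifnot(is.character(seqs), !is.null(names(seqs)))
  # rbind() interleaves headers and sequences column by column.
  lines <- as.vector(rbind(paste0(">", names(seqs)), unname(seqs)))
  writeLines(lines, path)
  invisible(path)
}

fasta_file <- tempfile(fileext = ".fasta")
write_fasta(seqs, fasta_file)
```

Since many probes target the same contig or EST, BLASTing the unique target sequences instead of all 193586 features would likely be much cheaper, if those sequences are available.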

ADD REPLY • link 7.4 years ago James W. MacDonald 65k

Thanks Guido. The MDP numbers don't match any database other than rosaceae-GDR. As per James's comment, I rolled my own and did the heavy lifting to get the Entrez Gene IDs. It was tedious, but I learned a lot while doing it. Thanks guys!!
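For anyone following the same route, here is one possible sketch of the last step: keeping the best hit per query from BLAST tabular output and attaching it to the limma table. This is not necessarily how the annotation above was done. It assumes the probes themselves were used as BLAST queries and that hits were saved with `-outfmt 6`, whose default columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue and bitscore. The `hits` values and `SUBJECT_*` identifiers are made up for illustration, and `best_hit()` is a hypothetical helper.

```r
# HYPOTHETICAL hits, reduced to the columns used here. A real file could be
# read with read.delim("hits.tsv", header = FALSE) and the 12 column names
# listed above assigned to it.
hits <- data.frame(
  qseqid   = c("Contig9003_2_r_204_2_696", "Contig9003_2_r_204_2_696",
               "Contig20209_1_f_261_1_795"),
  sseqid   = c("SUBJECT_A", "SUBJECT_B", "SUBJECT_C"),
  bitscore = c(110, 95, 102),
  stringsAsFactors = FALSE
)

# Keep the highest-scoring hit for each query.
best_hit <- function(hits) {
  hits <- hits[order(hits$qseqid, -hits$bitscore), ]
  hits[!duplicated(hits$qseqid), ]
}
best <- best_hit(hits)

# Stub of the limma table: rownames and logFC values taken from the question.
tab <- data.frame(
  logFC = c(2.476124, 1.639928, 2.927348),
  row.names = c("Contig9003_2_r_204_2_696",
                "Contig23165_2_r_145_8_238",
                "Contig20209_1_f_261_1_795")
)

# Attach the best hit by probe name; probes without a hit get NA.
tab$best_hit <- best$sseqid[match(rownames(tab), best$qseqid)]
print(tab)
```

The subject identifiers would still need to be translated to Entrez Gene IDs in a separate step, and a bit score or e-value cutoff is probably worth applying before trusting a "best" hit.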

ADD REPLY • link 7.3 years ago theobroma22 ▴ 10