Applying NetAffx biological annotation to Affymetrix gene expression dataset
1
0
Entering edit mode
lm795 ▴ 10
@lm795-17736
Last seen 18 months ago
United Kingdom

I am running an analysis of transcriptional changes in Drosophila using Affymetrix (Drogene-1_1-st). The issue is that even though I can retrieve the biological annotation for the different probes, I do not know how to then merge the annotation with the primary data.

How do I include the biological annotation from "featureData" in the data matrix or the final analysis when I try to identify differentially expressed genes?

In essence, how do I apply the featureData to the file containing the probe IDs so that additional columns, containing metadata such as gene names, are created?

#Load CEL files into R using the oligo package since the old affy package does not work with ST array. Make sure you run the commands inside the CELFiles folder

library(oligo)
list = list.files(full.names=TRUE)
ph = data@phenoData
ph@data[ ,1] = c("C_3d_01","LL_3d_01","LS_30d_01","C_3d_02","LL_3d_02","LS_30d_02","C_3d_03","LL_3d_03","LS_30d_03",
"C_3d_04","LL_3d_04","LS_30d_04","LS_3d_01","C_30d_01","LL_30d_01","LS_3d_02","C_30d_02","LL_30d_02",
"LS_3d_03","C_30d_03","LL_30d_03","LS_3d_04","C_30d_04","LL_30d_04")
Pset = fitProbeLevelModel(data)
data.rma_V02 = rma(data, target="core") #Summaries at the transcript level
featureData(data.rma_V02) <- getNetAffx(data.rma_V02, "transcript") # Retrieving NetAffx Biological Annotation
data.matrix02 = exprs(data.rma_V02)

#How do I apply the featureData to data.matrix02?

microarray annotation • 417 views
2
Entering edit mode
@james-w-macdonald-5106
Last seen 8 hours ago
United States

The ExpressionSet object that you are calling data.rma_V02 is designed to hold the summarized data, as well as the metadata associated with it. So by definition, the featureData is already 'applied to it'. Why are you extracting the data from an object meant to harmonize the data and associated metadata and then trying to recapitulate what you already had?

Put a different way, if you just take the ExpressionSet you have in hand, run that through limma to make comparisons, and then output the results using topTable, the output will include all the annotation data.

0
Entering edit mode

Many thanks James, I am obviously a novice. Unfortunately, all the features on the platform design pd.drogene.1.1.st) return as "NA". Any thoughts for a plan B?

> contrasts_30d[1:3,]
coefficients.LL_30d...C_30d coefficients.LS_30d...C_30d coefficients.LS_30d...LL_30d df.residual     sigma
18130001                  1.13176346                  -0.3617338                   -1.4934973          18 0.5123965
18130003                  0.73807082                   0.5492098                   -0.1888610          18 0.3922108
18130005                 -0.09149468                   0.2577900                    0.3492847          18 0.4303247
stdev.unscaled.LL_30d...C_30d stdev.unscaled.LS_30d...C_30d stdev.unscaled.LS_30d...LL_30d genes.transcriptclusterid
18130001                     0.7071068                     0.7071068                      0.7071068                  18130001
18130003                     0.7071068                     0.7071068                      0.7071068                  18130003
18130005                     0.7071068                     0.7071068                      0.7071068                  18130005
genes.probesetid genes.seqname genes.strand genes.start genes.stop genes.totalprobes genes.geneassignment
18130001         18130001          <NA>         <NA>          NA         NA                 5                 <NA>
18130003         18130003          <NA>         <NA>          NA         NA                 5                 <NA>
18130005         18130005          <NA>         <NA>          NA         NA                 4                 <NA>
genes.mrnaassignment genes.swissprot genes.unigene genes.gobiologicalprocess genes.gocellularcomponent
18130001                 <NA>            <NA>          <NA>                      <NA>                      <NA>
18130003                 <NA>            <NA>          <NA>                      <NA>                      <NA>
18130005                 <NA>            <NA>          <NA>                      <NA>                      <NA>
genes.gomolecularfunction genes.pathway genes.proteindomains genes.crosshybtype   genes.category    Amean   s2.post
18130001                      <NA>            NA                 <NA>                 NA normgene->intron 2.593790 0.2155920
18130003                      <NA>            NA                 <NA>                 NA normgene->intron 4.380437 0.1282866
18130005                      <NA>            NA                 <NA>                 NA normgene->intron 3.773452 0.1534614
t.LL_30d...C_30d t.LS_30d...C_30d t.LS_30d...LL_30d df.total p.value.LL_30d...C_30d p.value.LS_30d...C_30d
18130001         3.447104       -1.1017621        -4.5488658 22.41528            0.002252668             0.28225530
18130003         2.914220        2.1685159        -0.7457042 22.41528            0.007940430             0.04099417
18130005        -0.330302        0.9306394         1.2609414 22.41528            0.744239066             0.36195553
p.value.LS_30d...LL_30d lods.LL_30d...C_30d lods.LS_30d...C_30d lods.LS_30d...LL_30d          F    F.p.value
18130001            0.0001515378           -1.817578           -5.890931            0.8521341 11.2628612 0.0004113274
18130003            0.4635949253           -3.010193           -4.319430           -6.3338636  4.5837383 0.0214346606
18130005            0.2202968982           -6.650692           -6.060466           -5.8279230  0.8550541 0.4386705219

0
Entering edit mode

I would in general (being the author) use annotateEset in my affycoretools package to do that.

> library(affycoretools)
> data.rma_V02 <- annotateEset(data.rma_V02, pd.drogene.1.1.st)



and then go from there.

0
Entering edit mode

Many thanks James, I was trying the approach outlined initially (from a tutorial created by Benilton Carvalho) since the code you suggest above leads to an error message:

> data.rma_V02 <- annotateEset(data.rma_V02, pd.drogene.1.1.st)
Error: There appears to be a mismatch between the ExpressionSet and the annotation data.
Please ensure that the summarization level for the ExpressionSet and the 'type' argument are the same.

0
Entering edit mode

I have a simple test in annotateEset that compares the probeset IDs from the annotation data in the pdInfoPackage to the featureNames in the ExpressionSet, and I stupidly assumed that the two would be of comparable dimension. However, the annotation data for this array has about 3x more rows than there are probesets, for whatever reason, and so that test fails.

I'll fix it, but in the meantime you can get around it by doing something like

> debug(affycoretools:::.dataFromNetaffx)
> data.rma_V02 <- annotateEset(data.rma_V02, pd.drogene.1.1.st)
## step through the debugger until you see this:
Browse[2]>
debug: if (annotest < 0.95) stop(paste("There appears to be a mismatch between the ExpressionSet and",
"the annotation data.\nPlease ensure that the summarization level",
"for the ExpressionSet and the 'type' argument are the same.\n",
call. = FALSE)

## at which point you can do
Browse[2]> annotest <- .99
## and then hit 'c' to continue
Browse[2]> c
exiting from: .dataFromNetaffx(object, x, type, ...)


And then you will have the annotation in your fData slot

> fData(eset)[6000:6010,]
PROBEID          ID  SYMBOL
18162878 18162878 FBtr0071549 CG30291
18162883 18162883 FBtr0071544   Cht12
18162888 18162888        <NA>    <NA>
18162890 18162890        <NA>    <NA>
18162892 18162892 FBtr0087181 CG30324
18162895 18162895 FBtr0086821 CG30325
18162897 18162897        <NA>    <NA>
18162899 18162899 FBtr0305078 CG30334
18162902 18162902 FBtr0088534 CG30339
18162906 18162906 FBtr0088535 CG30340
18162910 18162910 FBtr0088726 CG30355
GENENAME
18162878 CG30291 gene product from transcript CG30291-RA
18162883 CG30293 gene product from transcript CG30293-RA
18162888                                            <NA>
18162890                                            <NA>
18162892 CG30324 gene product from transcript CG30324-RA
18162895 CG30325 gene product from transcript CG30325-RA
18162897                                            <NA>
18162899 CG30334 gene product from transcript CG30334-RB
18162902 CG30339 gene product from transcript CG30339-RA
18162906 CG30340 gene product from transcript CG30340-RA
18162910 CG30355 gene product from transcript CG30355-RA

0
Entering edit mode

Thanks for the extra help James. The debugging process you suggested bypasses the error. I can finally move forward with my analysis and leave Partek Software behind ;)

A few more points:

1. Is the reason for the failure in Benilton's approach the same as in the affycoretools, ie, the two datasets not having comparable dimensions?
2. Affycoretools returns 4 categorical variables (genes.PROBEID, genes.ID, genes.SYMBOL and genes.GENENAME) whereas the code in Benilton's tutorial returned at least 10, including gene ontology. Can annotateEset retrieve the other info as well, for example, "genes.gobiologicalprocess"?
0
Entering edit mode

Yes, hypothetically you can get the GO terms as well. But note that the GO terms are one-to-many, and it's difficult to deal with that as part of an annotation. In other words, say there are 10 GO terms that are annotated to a particular gene. How do you handle that? And what do you gain from having them? Generally you use GO terms only as part of a gene set test, in which case you don't need them as part of your annotation anyway.

0
Entering edit mode

Fair point James, I agree. I can always query BioMart to retrieve the extra annotation. Thanks again for your help!

0
Entering edit mode

It's probably not a good idea to use an Ensembl-based database to map things when you are starting with NCBI Gene IDs. You would be better served to use the org.Dm.eg.db package and the Gene IDs to do the mapping.

0
Entering edit mode

Thanks for the extra advice. I am not familiar with that package so will have a look. I used BioMart before, as I can use that DB for orthologue mapping. I sometimes want to query Ingenuity Pathway Analysis DB (QIAGEN) and they only hold mouse and human data. Whenever I load fly gene IDs I get fairly poor mapping %s using their database. I will look further at your suggestion.

0
Entering edit mode

Hi James, it's me again. I noticed that this feature is still not fixed. Are you planning to fix it anytime soon? I was going to add my script to GitHub but it would be more useful if annotateEset was working for these arrays...

All the best,

Miguel

0
Entering edit mode

Fixed in Devel (version 1.57.3), which should propagate to the download server within a day or two, or you can install from my GitHub page (username jmacdon, natch). It now just gives a warning, letting you know that things might be amiss:


> eset <- annotateEset(eset, pd.drogene.1.1.st)
Warning message:
There appears to be a mismatch between the ExpressionSet and the annotation data.
Please ensure that the summarization level for the ExpressionSet and the 'type' argument are the same.

There are 50525 probesets in the annotation data and 16322 probesets in the ExpressionSet; the overlap between the annotation and ExpressionSet is 32% and the overlap between the ExpressionSet and annotation is 100%. Proceed with caution.

0
Entering edit mode

Thanks for the speedy reply James, perhaps I am the only punter using pd.drogene.1.1.st but these are much cheaper than RNAseq. I might try GitHub once I learn how to install a library from there..

0
Entering edit mode
library(BiocManager)
BiocManager::install(version = "devel")
BiocManager::install("remotes")
BiocManager::install("jmacdon/affycoretools")



Should do the trick. You will be using devel versions, which you will need to upgrade to the upcoming release versions (upgrade R and BioC) on October 30.

0
Entering edit mode

Uh, check that. You will only need to upgrade to the new release of BioC on 30 October.

0
Entering edit mode

Thanks for that! I am right to conclude that I only need to upgrade BioC on the 30th (and not R). I am on R version 3.5.1

0
Entering edit mode

No, you need to upgrade R right now to R-3.6.1 because the version you are using is last year's model. You can't actually install the devel version of BioC on R-3.5.1

0
Entering edit mode

Thanks, that's useful. I wish there was a way to upgrade R from the command line, like "sudo apt-get update"...

0
Entering edit mode

You mean like this?

0
Entering edit mode

Something like that. I am on Macs so I guess I must use Homebrew

0
Entering edit mode

It is a one-click download https://cran.r-project.org/bin/macosx/ plus a couple more clicks to install... Your packages will need to be updated

install.packages("BiocManager")
BiocManager::install(version = "devel")
BiocManager::valid()

0
Entering edit mode

Thanks, I will follow your advice Martin. I run RStudio on Macs and keep my projects in Dropbox so will have to upgrade a couple of computers..

0
Entering edit mode

Hi again James, still unable to install affycoretools with the instructions above. The "remotes" package seems to be installed but I don't see it under "Packages" in RStudio. The message from attempting to install "affycoretools" is below. Many thanks for your help.

    > BiocManager::install("jmacdon/affycoretools")
Bioconductor version 3.10 (BiocManager 1.30.7), R 3.6.1 (2019-07-05)
Installing github package(s) 'jmacdon/affycoretools'
Installing 2 packages: data.table, htmlwidgets

There are binary versions available but the source versions are later:
binary source needs_compilation
data.table  1.12.2 1.12.4              TRUE
htmlwidgets    1.5  1.5.1             FALSE

Do you want to install from sources the package which needs compilation? (Yes/no/cancel) y
installing the source packages ‘data.table’, ‘htmlwidgets’

trying URL 'https://cran.rstudio.com/src/contrib/data.table_1.12.4.tar.gz'
Content type 'application/x-gzip' length 5010321 bytes (4.8 MB)
==================================================

trying URL 'https://cran.rstudio.com/src/contrib/htmlwidgets_1.5.1.tar.gz'
Content type 'application/x-gzip' length 867206 bytes (846 KB)
==================================================

* installing *source* package ‘data.table’ ...
** package ‘data.table’ successfully unpacked and MD5 sums checked
** using staged installation
** libs
clang -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG   -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -I/usr/local/include -fopenmp -fPIC  -Wall -g -O2  -c assign.c -o assign.o
clang: error: unsupported option '-fopenmp'
make: *** [assign.o] Error 1
ERROR: compilation failed for package ‘data.table’
* removing ‘/Library/Frameworks/R.framework/Versions/3.6/Resources/library/data.table’
Error: Failed to install 'affycoretools' from GitHub:
(converted from warning) installation of package ‘data.table’ had non-zero exit status

0
Entering edit mode

So the first thing I do when I see something I don't understand (say, for example clang: error: unsupported option '-fopenmp') is to Google the thing to see what the Googles say about it.

Which brings me to this page (fourth one down), which may be relevant? I don't use MacOS, so can't say for sure.

0
Entering edit mode

What happens when you answer 'no' to the question about installing from source?

0
Entering edit mode

It seems that answering no to that question bypasses the issue. I still had to answer yes to the questions downstream. Many thanks for that suggestion, it bypasses installing more stuff on top of OSX.

0
Entering edit mode

Great; for a little context, if the package (like affycoretools) does not contain C-level (or Fortran or C++) source code, then R can install the package 'from source' without additional tools. On the other hand, if there is C source code, then R needs access to a correctly configured compiler tool chain, and this will be an additional (and usually unnecessary) step.

Also, and mostly for anyone who is 'following along', it is MUCH BETTER for maintainers to push changes to the Bioconductor devel repository, and for users unable to wait for the next Bioconductor release (every six months) to use the 'devel' branch of Bioconductor (instructions here) where binaries will be installable as soon as they build successfully. Using the devel branch ensures that all packages are tested and, module extra-ordinary events, interoperable.