I am running an analysis of transcriptional changes in Drosophila using Affymetrix (Drogene-1_1-st). The issue is that even though I can retrieve the biological annotation for the different probes, I do not know how to then merge the annotation with the primary data.
How do I include the biological annotation from "featureData" in the data matrix or the final analysis when I try to identify differentially expressed genes?
In essence, how do I apply the featureData to the file containing the probe IDs so that additional columns, containing metadata such as gene names, are created?
#Load CEL files into R using the oligo package since the old affy package does not work with ST array. Make sure you run the commands inside the CELFiles folder
library(oligo)
list = list.files(full.names=TRUE)
data = oligo::read.celfiles(list)
ph = data@phenoData
ph@data[ ,1] = c("C_3d_01","LL_3d_01","LS_30d_01","C_3d_02","LL_3d_02","LS_30d_02","C_3d_03","LL_3d_03","LS_30d_03",
"C_3d_04","LL_3d_04","LS_30d_04","LS_3d_01","C_30d_01","LL_30d_01","LS_3d_02","C_30d_02","LL_30d_02",
"LS_3d_03","C_30d_03","LL_30d_03","LS_3d_04","C_30d_04","LL_30d_04")
Pset = fitProbeLevelModel(data)
librarypd.drogene.1.1.st) # Load the platform design info for Affymetrix DroGene-1_1-st
data.rma_V02 = rma(data, target="core") #Summaries at the transcript level
featureData(data.rma_V02) <- getNetAffx(data.rma_V02, "transcript") # Retrieving NetAffx Biological Annotation
data.matrix02 = exprs(data.rma_V02)
#How do I apply the featureData to data.matrix02?
Many thanks James, I am obviously a novice. Unfortunately, all the features on the platform design pd.drogene.1.1.st) return as "NA". Any thoughts for a plan B?
I would in general (being the author) use
annotateEset
in my affycoretools package to do that.and then go from there.
Many thanks James, I was trying the approach outlined initially (from a tutorial created by Benilton Carvalho) since the code you suggest above leads to an error message:
I have a simple test in
annotateEset
that compares the probeset IDs from the annotation data in the pdInfoPackage to thefeatureNames
in theExpressionSet
, and I stupidly assumed that the two would be of comparable dimension. However, the annotation data for this array has about 3x more rows than there are probesets, for whatever reason, and so that test fails.I'll fix it, but in the meantime you can get around it by doing something like
And then you will have the annotation in your
fData
slotThanks for the extra help James. The debugging process you suggested bypasses the error. I can finally move forward with my analysis and leave Partek Software behind ;)
A few more points:
Yes, hypothetically you can get the GO terms as well. But note that the GO terms are one-to-many, and it's difficult to deal with that as part of an annotation. In other words, say there are 10 GO terms that are annotated to a particular gene. How do you handle that? And what do you gain from having them? Generally you use GO terms only as part of a gene set test, in which case you don't need them as part of your annotation anyway.
Fair point James, I agree. I can always query BioMart to retrieve the extra annotation. Thanks again for your help!
It's probably not a good idea to use an Ensembl-based database to map things when you are starting with NCBI Gene IDs. You would be better served to use the org.Dm.eg.db package and the Gene IDs to do the mapping.
Thanks for the extra advice. I am not familiar with that package so will have a look. I used BioMart before, as I can use that DB for orthologue mapping. I sometimes want to query Ingenuity Pathway Analysis DB (QIAGEN) and they only hold mouse and human data. Whenever I load fly gene IDs I get fairly poor mapping %s using their database. I will look further at your suggestion.
Hi James, it's me again. I noticed that this feature is still not fixed. Are you planning to fix it anytime soon? I was going to add my script to GitHub but it would be more useful if annotateEset was working for these arrays...
All the best,
Miguel
Fixed in Devel (version 1.57.3), which should propagate to the download server within a day or two, or you can install from my GitHub page (username jmacdon, natch). It now just gives a warning, letting you know that things might be amiss:
Thanks for the speedy reply James, perhaps I am the only punter using pd.drogene.1.1.st but these are much cheaper than RNAseq. I might try GitHub once I learn how to install a library from there..
Should do the trick. You will be using devel versions, which you will need to upgrade to the upcoming release versions (upgrade R and BioC) on October 30.
Uh, check that. You will only need to upgrade to the new release of BioC on 30 October.
Thanks for that! I am right to conclude that I only need to upgrade BioC on the 30th (and not R). I am on R version 3.5.1
No, you need to upgrade R right now to R-3.6.1 because the version you are using is last year's model. You can't actually install the devel version of BioC on R-3.5.1
Thanks, that's useful. I wish there was a way to upgrade R from the command line, like "sudo apt-get update"...
You mean like this?
Something like that. I am on Macs so I guess I must use Homebrew
It is a one-click download https://cran.r-project.org/bin/macosx/ plus a couple more clicks to install... Your packages will need to be updated
Thanks, I will follow your advice Martin. I run RStudio on Macs and keep my projects in Dropbox so will have to upgrade a couple of computers..
Hi again James, still unable to install affycoretools with the instructions above. The "remotes" package seems to be installed but I don't see it under "Packages" in RStudio. The message from attempting to install "affycoretools" is below. Many thanks for your help.
So the first thing I do when I see something I don't understand (say, for example
clang: error: unsupported option '-fopenmp'
) is to Google the thing to see what the Googles say about it.Which brings me to this page (fourth one down), which may be relevant? I don't use MacOS, so can't say for sure.
What happens when you answer 'no' to the question about installing from source?
It seems that answering no to that question bypasses the issue. I still had to answer yes to the questions downstream. Many thanks for that suggestion, it bypasses installing more stuff on top of OSX.
Great; for a little context, if the package (like affycoretools) does not contain C-level (or Fortran or C++) source code, then R can install the package 'from source' without additional tools. On the other hand, if there is C source code, then R needs access to a correctly configured compiler tool chain, and this will be an additional (and usually unnecessary) step.
Also, and mostly for anyone who is 'following along', it is MUCH BETTER for maintainers to push changes to the Bioconductor devel repository, and for users unable to wait for the next Bioconductor release (every six months) to use the 'devel' branch of Bioconductor (instructions here) where binaries will be installable as soon as they build successfully. Using the devel branch ensures that all packages are tested and, module extra-ordinary events, interoperable.