Question: Install a custom CDF file
0
13 days ago by
Seymoo0
Oslo
Seymoo0 wrote:

I would like to access gene level expression data from GSE131418 study. However, since it is not possbile to do that with "GEOquery" for some reason, I want to use the CDF file "GPL15048HuRSTA2a520709.CDF.gz" provided under "GSE131418_RAW.tar" to perform RMA normalization and probe summarization in R. But I can not load and use this packge into R, similar to Brain-array CDF files.

Any thought would be appreciated.

Hossein

modified 12 days ago by James W. MacDonald50k • written 13 days ago by Seymoo0
Answer: Install a custom CDF file
0
12 days ago by
United States
James W. MacDonald50k wrote:

You need the makecdfenv package. The vignette is pretty clear, I think. However, here's how you would do it. The one trick is to know what to call the package, which won't involve the GPL number. I didn't bother to try to figure out what this array is for, so I just said the species is Homo sapiens. It doesn't really matter for you use, so you can use whatever.


> make.cdf.package("GPL15048_HuRSTA_2a520709.CDF.gz", "hursta2a520709cdf", species = "Homo sapiens", compress = TRUE)
Creating CDF environment
Creating package in C:/Users/jmacdon/Desktop/GSE131418/hursta2a520709cdf

A source package has now been produced in
C:/Users/jmacdon/Desktop/GSE131418/hursta2a520709cdf.
Before using this package it must be installed via 'R CMD INSTALL'
at a terminal prompt (or DOS command shell).
If you are using Windows, you will need to get set up to install packages.
See the 'R Installation and Administration' manual, specifically
Section 6 'Add-on Packages' as well as 'Appendix E: The Windows Toolset'

Alternatively, you could use make.cdf.env(), which will not require you to install a package.
However, this environment will only persist for the current R session
unless you save() it.

> install.packages("hursta2a520709cdf/", repos = NULL, type = "source")
Installing package into 'C:/Users/jmacdon/AppData/Roaming/R/win-library/3.5'
(as 'lib' is unspecified)
* installing *source* package 'hursta2a520709cdf' ...
** R
** data
** help
*** installing help indices
converting help for package 'hursta2a520709cdf'
geometry                                html
hursta2a520709cdf                       html
hursta2a520709dim                       html
** building package indices
** testing if installed package can be loaded
*** arch - i386
*** arch - x64
* DONE (hursta2a520709cdf)
In R CMD INSTALL

## Now test it

> dat <- ReadAffy(filenames = dir()[5:8])
> dat

AffyBatch object
size of arrays=1164x1164 features (19 kb)
cdf=HuRSTA-2a520709 (60607 affyids)
number of samples=4
number of genes=60607
annotation=hursta2a520709
notes=



Thanks a lot James! I did try your solution but I just replaced **ReadAffy** with **affy::justRMA** and used **cdfname = "hursta2a520709cdf, normalize = TRUE"** and now it seems that I have a log2 transformed expression matrix.However, I am not sure if the rows (n=60607) are converted to probe sets or still represent probes.

is there a way to convert these to gene symbol now?

Best,

Yes, the data you have are log2 transformed probeset intensities. You can get the annotations using getGEO from the GEOquery package:

> z <- getGEO("GPL15048")
ID GB_LIST EntrezGeneID GeneSymbol         SPOT_ID
1  AFFX-BioB-3_at                                  AFFX-BioB-3_at
2  AFFX-BioB-5_at                                  AFFX-BioB-5_at
3  AFFX-BioB-M_at                                  AFFX-BioB-M_at
4  AFFX-BioC-3_at                                  AFFX-BioC-3_at
5  AFFX-BioC-5_at                                  AFFX-BioC-5_at
6 AFFX-BioDn-3_at                                 AFFX-BioDn-3_at


And you can then use match with the featureNames of your ExpressionSet to get that ordered correctly, and ideally then put in the fData slot of your ExpressionSet, so your downstream applications (read limma) will output the annotations with your top genes.

Your homework is to figure out how to do what I just described, using things like the help pages (?match) and our good friend Google, and the various vignettes that come with the Biobase package, etc.

Great! I downloaded the annotation and converted it to a data frame z <-z @dataTable@table.

I did follow your steps and excluded rows in z without available EntrezGeneID. But now dim(z)=41090 and table(isUnique(z\$EntrezGeneID))shows 29109 non unique values. this is not usually the case when I use brain array CDF with justRMA .

So the question is that should merge these non unique values using median after I match rows in the ExpressionSet with z ?

Thanks a lot.

Why are you mucking about in the internals of an S4 object when I just showed you that the Table accessor gets you the data.frame directly? The whole idea of an accessor function is to remove the need for that sort of thing.

How you deal with duplicate measurements is something you have to decide on your own. If you want to assume that the probesets with duplicate Gene IDs are measuring the same thing, then I suppose taking the median is a thing you could do. There are other things you could do, like use findLargest from the genefilter package, or average the duplicates or just keep them all. There are likely good arguments for any of those choices, as well as good arguments against.

But as the analyst, it's your job to make decisions and have a good rationale to defend the decision. I won't help you make that decision, as it's not my analysis.