Question: Install a custom CDF file
0
gravatar for Seymoo
13 days ago by
Seymoo0
Oslo
Seymoo0 wrote:

I would like to access gene level expression data from GSE131418 study. However, since it is not possbile to do that with "GEOquery" for some reason, I want to use the CDF file "GPL15048HuRSTA2a520709.CDF.gz" provided under "GSE131418_RAW.tar" to perform RMA normalization and probe summarization in R. But I can not load and use this packge into R, similar to Brain-array CDF files.

Any thought would be appreciated.

Hossein

ADD COMMENTlink modified 12 days ago by James W. MacDonald50k • written 13 days ago by Seymoo0
Answer: Install a custom CDF file
0
gravatar for James W. MacDonald
12 days ago by
United States
James W. MacDonald50k wrote:

You need the makecdfenv package. The vignette is pretty clear, I think. However, here's how you would do it. The one trick is to know what to call the package, which won't involve the GPL number. I didn't bother to try to figure out what this array is for, so I just said the species is Homo sapiens. It doesn't really matter for you use, so you can use whatever.


> make.cdf.package("GPL15048_HuRSTA_2a520709.CDF.gz", "hursta2a520709cdf", species = "Homo sapiens", compress = TRUE)
Reading CDF file.
Creating CDF environment
Wait for about 606 dots..............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
Creating package in C:/Users/jmacdon/Desktop/GSE131418/hursta2a520709cdf 



README PLEASE:
A source package has now been produced in
C:/Users/jmacdon/Desktop/GSE131418/hursta2a520709cdf.
Before using this package it must be installed via 'R CMD INSTALL'
at a terminal prompt (or DOS command shell).
If you are using Windows, you will need to get set up to install packages.
See the 'R Installation and Administration' manual, specifically
Section 6 'Add-on Packages' as well as 'Appendix E: The Windows Toolset'
for more information.

Alternatively, you could use make.cdf.env(), which will not require you to install a package.
However, this environment will only persist for the current R session
unless you save() it.


> install.packages("hursta2a520709cdf/", repos = NULL, type = "source")
Installing package into 'C:/Users/jmacdon/AppData/Roaming/R/win-library/3.5'
(as 'lib' is unspecified)
* installing *source* package 'hursta2a520709cdf' ...
** R
** data
** byte-compile and prepare package for lazy loading
Warning: replacing previous import 'AnnotationDbi::tail' by 'utils::tail' when loading 'hursta2a520709cdf'
Warning: replacing previous import 'AnnotationDbi::head' by 'utils::head' when loading 'hursta2a520709cdf'
** help
*** installing help indices
  converting help for package 'hursta2a520709cdf'
    finding HTML links ... done
    geometry                                html  
    hursta2a520709cdf                       html  
    hursta2a520709dim                       html  
** building package indices
** testing if installed package can be loaded
*** arch - i386
Warning: replacing previous import 'AnnotationDbi::head' by 'utils::head' when loading 'hursta2a520709cdf'
Warning: replacing previous import 'AnnotationDbi::tail' by 'utils::tail' when loading 'hursta2a520709cdf'
*** arch - x64
Warning: replacing previous import 'AnnotationDbi::tail' by 'utils::tail' when loading 'hursta2a520709cdf'
Warning: replacing previous import 'AnnotationDbi::head' by 'utils::head' when loading 'hursta2a520709cdf'
* DONE (hursta2a520709cdf)
In R CMD INSTALL

## Now test it

> dat <- ReadAffy(filenames = dir()[5:8])
> dat

AffyBatch object
size of arrays=1164x1164 features (19 kb)
cdf=HuRSTA-2a520709 (60607 affyids)
number of samples=4
number of genes=60607
annotation=hursta2a520709
notes=

ADD COMMENTlink written 12 days ago by James W. MacDonald50k

Thanks a lot James! I did try your solution but I just replaced **ReadAffy** with **affy::justRMA** and used **cdfname = "hursta2a520709cdf, normalize = TRUE"** and now it seems that I have a log2 transformed expression matrix.However, I am not sure if the rows (n=60607) are converted to probe sets or still represent probes.

is there a way to convert these to gene symbol now?

Best,

ADD REPLYlink written 12 days ago by Seymoo0

Yes, the data you have are log2 transformed probeset intensities. You can get the annotations using getGEO from the GEOquery package:

> z <- getGEO("GPL15048")
> head(Table(z))
               ID GB_LIST EntrezGeneID GeneSymbol         SPOT_ID
1  AFFX-BioB-3_at                                  AFFX-BioB-3_at
2  AFFX-BioB-5_at                                  AFFX-BioB-5_at
3  AFFX-BioB-M_at                                  AFFX-BioB-M_at
4  AFFX-BioC-3_at                                  AFFX-BioC-3_at
5  AFFX-BioC-5_at                                  AFFX-BioC-5_at
6 AFFX-BioDn-3_at                                 AFFX-BioDn-3_at

And you can then use match with the featureNames of your ExpressionSet to get that ordered correctly, and ideally then put in the fData slot of your ExpressionSet, so your downstream applications (read limma) will output the annotations with your top genes.

Your homework is to figure out how to do what I just described, using things like the help pages (?match) and our good friend Google, and the various vignettes that come with the Biobase package, etc.

ADD REPLYlink written 11 days ago by James W. MacDonald50k

Great! I downloaded the annotation and converted it to a data frame z <-z @dataTable@table.

I did follow your steps and excluded rows in z without available EntrezGeneID. But now dim(z)=41090 and table(isUnique(z$EntrezGeneID))shows 29109 non unique values. this is not usually the case when I use brain array CDF with justRMA .

So the question is that should merge these non unique values using median after I match rows in the ExpressionSet with z ?

Thanks a lot.

ADD REPLYlink written 11 days ago by Seymoo0

Why are you mucking about in the internals of an S4 object when I just showed you that the Table accessor gets you the data.frame directly? The whole idea of an accessor function is to remove the need for that sort of thing.

How you deal with duplicate measurements is something you have to decide on your own. If you want to assume that the probesets with duplicate Gene IDs are measuring the same thing, then I suppose taking the median is a thing you could do. There are other things you could do, like use findLargest from the genefilter package, or average the duplicates or just keep them all. There are likely good arguments for any of those choices, as well as good arguments against.

But as the analyst, it's your job to make decisions and have a good rationale to defend the decision. I won't help you make that decision, as it's not my analysis.

ADD REPLYlink written 10 days ago by James W. MacDonald50k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 335 users visited in the last hour