Install a custom CDF file
1
0
Entering edit mode
Seymoo • 0
@seymoo-12522
Last seen 12 months ago
Oslo

I would like to access gene level expression data from GSE131418 study. However, since it is not possbile to do that with "GEOquery" for some reason, I want to use the CDF file "GPL15048HuRSTA2a520709.CDF.gz" provided under "GSE131418_RAW.tar" to perform RMA normalization and probe summarization in R. But I can not load and use this packge into R, similar to Brain-array CDF files.

Any thought would be appreciated.

Hossein

makecdf RMA annotation biobase GEOquery • 2.0k views
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 9 hours ago
United States

You need the makecdfenv package. The vignette is pretty clear, I think. However, here's how you would do it. The one trick is to know what to call the package, which won't involve the GPL number. I didn't bother to try to figure out what this array is for, so I just said the species is Homo sapiens. It doesn't really matter for you use, so you can use whatever.


> make.cdf.package("GPL15048_HuRSTA_2a520709.CDF.gz", "hursta2a520709cdf", species = "Homo sapiens", compress = TRUE)
Reading CDF file.
Creating CDF environment
Wait for about 606 dots..............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
Creating package in C:/Users/jmacdon/Desktop/GSE131418/hursta2a520709cdf 



README PLEASE:
A source package has now been produced in
C:/Users/jmacdon/Desktop/GSE131418/hursta2a520709cdf.
Before using this package it must be installed via 'R CMD INSTALL'
at a terminal prompt (or DOS command shell).
If you are using Windows, you will need to get set up to install packages.
See the 'R Installation and Administration' manual, specifically
Section 6 'Add-on Packages' as well as 'Appendix E: The Windows Toolset'
for more information.

Alternatively, you could use make.cdf.env(), which will not require you to install a package.
However, this environment will only persist for the current R session
unless you save() it.


> install.packages("hursta2a520709cdf/", repos = NULL, type = "source")
Installing package into 'C:/Users/jmacdon/AppData/Roaming/R/win-library/3.5'
(as 'lib' is unspecified)
* installing *source* package 'hursta2a520709cdf' ...
** R
** data
** byte-compile and prepare package for lazy loading
Warning: replacing previous import 'AnnotationDbi::tail' by 'utils::tail' when loading 'hursta2a520709cdf'
Warning: replacing previous import 'AnnotationDbi::head' by 'utils::head' when loading 'hursta2a520709cdf'
** help
*** installing help indices
  converting help for package 'hursta2a520709cdf'
    finding HTML links ... done
    geometry                                html  
    hursta2a520709cdf                       html  
    hursta2a520709dim                       html  
** building package indices
** testing if installed package can be loaded
*** arch - i386
Warning: replacing previous import 'AnnotationDbi::head' by 'utils::head' when loading 'hursta2a520709cdf'
Warning: replacing previous import 'AnnotationDbi::tail' by 'utils::tail' when loading 'hursta2a520709cdf'
*** arch - x64
Warning: replacing previous import 'AnnotationDbi::tail' by 'utils::tail' when loading 'hursta2a520709cdf'
Warning: replacing previous import 'AnnotationDbi::head' by 'utils::head' when loading 'hursta2a520709cdf'
* DONE (hursta2a520709cdf)
In R CMD INSTALL

## Now test it

> dat <- ReadAffy(filenames = dir()[5:8])
> dat

AffyBatch object
size of arrays=1164x1164 features (19 kb)
cdf=HuRSTA-2a520709 (60607 affyids)
number of samples=4
number of genes=60607
annotation=hursta2a520709
notes=

ADD COMMENT
0
Entering edit mode

Thanks a lot James! I did try your solution but I just replaced **ReadAffy** with **affy::justRMA** and used **cdfname = "hursta2a520709cdf, normalize = TRUE"** and now it seems that I have a log2 transformed expression matrix.However, I am not sure if the rows (n=60607) are converted to probe sets or still represent probes.

is there a way to convert these to gene symbol now?

Best,

ADD REPLY
0
Entering edit mode

Yes, the data you have are log2 transformed probeset intensities. You can get the annotations using getGEO from the GEOquery package:

> z <- getGEO("GPL15048")
> head(Table(z))
               ID GB_LIST EntrezGeneID GeneSymbol         SPOT_ID
1  AFFX-BioB-3_at                                  AFFX-BioB-3_at
2  AFFX-BioB-5_at                                  AFFX-BioB-5_at
3  AFFX-BioB-M_at                                  AFFX-BioB-M_at
4  AFFX-BioC-3_at                                  AFFX-BioC-3_at
5  AFFX-BioC-5_at                                  AFFX-BioC-5_at
6 AFFX-BioDn-3_at                                 AFFX-BioDn-3_at

And you can then use match with the featureNames of your ExpressionSet to get that ordered correctly, and ideally then put in the fData slot of your ExpressionSet, so your downstream applications (read limma) will output the annotations with your top genes.

Your homework is to figure out how to do what I just described, using things like the help pages (?match) and our good friend Google, and the various vignettes that come with the Biobase package, etc.

ADD REPLY
0
Entering edit mode

Great! I downloaded the annotation and converted it to a data frame z <-z @dataTable@table.

I did follow your steps and excluded rows in z without available EntrezGeneID. But now dim(z)=41090 and table(isUnique(z$EntrezGeneID))shows 29109 non unique values. this is not usually the case when I use brain array CDF with justRMA .

So the question is that should merge these non unique values using median after I match rows in the ExpressionSet with z ?

Thanks a lot.

ADD REPLY
0
Entering edit mode

Why are you mucking about in the internals of an S4 object when I just showed you that the Table accessor gets you the data.frame directly? The whole idea of an accessor function is to remove the need for that sort of thing.

How you deal with duplicate measurements is something you have to decide on your own. If you want to assume that the probesets with duplicate Gene IDs are measuring the same thing, then I suppose taking the median is a thing you could do. There are other things you could do, like use findLargest from the genefilter package, or average the duplicates or just keep them all. There are likely good arguments for any of those choices, as well as good arguments against.

But as the analyst, it's your job to make decisions and have a good rationale to defend the decision. I won't help you make that decision, as it's not my analysis.

ADD REPLY

Login before adding your answer.

Traffic: 497 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6