I would like to access gene level expression data from GSE131418 study.
However, since it is not possbile to do that with "GEOquery" for some reason, I want to use the CDF file "GPL15048HuRSTA2a520709.CDF.gz" provided under "GSE131418_RAW.tar" to perform RMA normalization and probe summarization in R.
But I can not load and use this packge into R, similar to Brain-array CDF files.
You need the makecdfenv package. The vignette is pretty clear, I think. However, here's how you would do it. The one trick is to know what to call the package, which won't involve the GPL number. I didn't bother to try to figure out what this array is for, so I just said the species is Homo sapiens. It doesn't really matter for you use, so you can use whatever.
> make.cdf.package("GPL15048_HuRSTA_2a520709.CDF.gz", "hursta2a520709cdf", species = "Homo sapiens", compress = TRUE)
Reading CDF file.
Creating CDF environment
Wait for about 606 dots..............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
Creating package in C:/Users/jmacdon/Desktop/GSE131418/hursta2a520709cdf
README PLEASE:
A source package has now been produced in
C:/Users/jmacdon/Desktop/GSE131418/hursta2a520709cdf.
Before using this package it must be installed via 'R CMD INSTALL'
at a terminal prompt (or DOS command shell).
If you are using Windows, you will need to get set up to install packages.
See the 'R Installation and Administration' manual, specifically
Section 6 'Add-on Packages' as well as 'Appendix E: The Windows Toolset'
for more information.
Alternatively, you could use make.cdf.env(), which will not require you to install a package.
However, this environment will only persist for the current R session
unless you save() it.
> install.packages("hursta2a520709cdf/", repos = NULL, type = "source")
Installing package into 'C:/Users/jmacdon/AppData/Roaming/R/win-library/3.5'
(as 'lib' is unspecified)
* installing *source* package 'hursta2a520709cdf' ...
** R
** data
** byte-compile and prepare package for lazy loading
Warning: replacing previous import 'AnnotationDbi::tail' by 'utils::tail' when loading 'hursta2a520709cdf'
Warning: replacing previous import 'AnnotationDbi::head' by 'utils::head' when loading 'hursta2a520709cdf'
** help
*** installing help indices
converting help for package 'hursta2a520709cdf'
finding HTML links ... done
geometry html
hursta2a520709cdf html
hursta2a520709dim html
** building package indices
** testing if installed package can be loaded
*** arch - i386
Warning: replacing previous import 'AnnotationDbi::head' by 'utils::head' when loading 'hursta2a520709cdf'
Warning: replacing previous import 'AnnotationDbi::tail' by 'utils::tail' when loading 'hursta2a520709cdf'
*** arch - x64
Warning: replacing previous import 'AnnotationDbi::tail' by 'utils::tail' when loading 'hursta2a520709cdf'
Warning: replacing previous import 'AnnotationDbi::head' by 'utils::head' when loading 'hursta2a520709cdf'
* DONE (hursta2a520709cdf)
In R CMD INSTALL
## Now test it
> dat <- ReadAffy(filenames = dir()[5:8])
> dat
AffyBatch object
size of arrays=1164x1164 features (19 kb)
cdf=HuRSTA-2a520709 (60607 affyids)
number of samples=4
number of genes=60607
annotation=hursta2a520709
notes=
Thanks a lot James! I did try your solution but I just replaced **ReadAffy** with **affy::justRMA** and used **cdfname = "hursta2a520709cdf, normalize = TRUE"** and now it seems that I have a log2 transformed expression matrix.However, I am not sure if the rows (n=60607) are converted to probe sets or still represent probes.
is there a way to convert these to gene symbol now?
And you can then use match with the featureNames of your ExpressionSet to get that ordered correctly, and ideally then put in the fData slot of your ExpressionSet, so your downstream applications (read limma) will output the annotations with your top genes.
Your homework is to figure out how to do what I just described, using things like the help pages (?match) and our good friend Google, and the various vignettes that come with the Biobase package, etc.
Great! I downloaded the annotation and converted it to a data frame z <-z @dataTable@table.
I did follow your steps and excluded rows in z without available EntrezGeneID. But now dim(z)=41090 and table(isUnique(z$EntrezGeneID))shows 29109 non unique values.
this is not usually the case when I use brain array CDF with justRMA .
So the question is that should merge these non unique values using median after I match rows in the ExpressionSet with z ?
Why are you mucking about in the internals of an S4 object when I just showed you that the Table accessor gets you the data.frame directly? The whole idea of an accessor function is to remove the need for that sort of thing.
How you deal with duplicate measurements is something you have to decide on your own. If you want to assume that the probesets with duplicate Gene IDs are measuring the same thing, then I suppose taking the median is a thing you could do. There are other things you could do, like use findLargest from the genefilter package, or average the duplicates or just keep them all. There are likely good arguments for any of those choices, as well as good arguments against.
But as the analyst, it's your job to make decisions and have a good rationale to defend the decision. I won't help you make that decision, as it's not my analysis.
Thanks a lot James! I did try your solution but I just replaced
**ReadAffy**
with**affy::justRMA**
and used**cdfname = "hursta2a520709cdf, normalize = TRUE"**
and now it seems that I have a log2 transformed expression matrix.However, I am not sure if the rows (n=60607) are converted to probe sets or still represent probes.is there a way to convert these to gene symbol now?
Best,
Yes, the data you have are log2 transformed probeset intensities. You can get the annotations using
getGEO
from the GEOquery package:And you can then use
match
with thefeatureNames
of yourExpressionSet
to get that ordered correctly, and ideally then put in thefData
slot of yourExpressionSet
, so your downstream applications (read limma) will output the annotations with your top genes.Your homework is to figure out how to do what I just described, using things like the help pages (?match) and our good friend Google, and the various vignettes that come with the Biobase package, etc.
Great! I downloaded the annotation and converted it to a data frame
z <-z @dataTable@table
.I did follow your steps and excluded rows in z without available EntrezGeneID. But now dim(z)=41090 and
table(isUnique(z$EntrezGeneID))
shows 29109 non unique values. this is not usually the case when I use brain array CDF withjustRMA
.So the question is that should merge these non unique values using
median
after I match rows in the ExpressionSet with z ?Thanks a lot.
Why are you mucking about in the internals of an S4 object when I just showed you that the
Table
accessor gets you thedata.frame
directly? The whole idea of an accessor function is to remove the need for that sort of thing.How you deal with duplicate measurements is something you have to decide on your own. If you want to assume that the probesets with duplicate Gene IDs are measuring the same thing, then I suppose taking the median is a thing you could do. There are other things you could do, like use
findLargest
from the genefilter package, or average the duplicates or just keep them all. There are likely good arguments for any of those choices, as well as good arguments against.But as the analyst, it's your job to make decisions and have a good rationale to defend the decision. I won't help you make that decision, as it's not my analysis.