I'm trying to link GEOquery and minfi.
Specifically I want to obtain beta values from the idat files on GEOquery. I was following this guide: https://kasperdanielhansen.github.io/genbioconductor/html/minfi.html, up until the preprocessing part. Meaning that I was able to obtain the RGset. Then I used my own preprocessing code to obtain the beta values. However, I cross checked them with the beta values that were on GEO and they weren't consistent.
For example, the accession number I used was GSE68777. So I went to that study on GEO and clicked on the first sample: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1681154. Then I scrolled down and clicked "Download full table" to download the samples and beta values in a text file.
Then I typed head(beta) and chose the first sample. Then I did command F in the text file for that sample and it's value there wasn't the same as the value from the beta data table. Hopefully you can help find the error.
Here is the code I'm using:
library(GEOquery) library(minfi) library("IlluminaHumanMethylation450kanno.ilmn12.hg19") library("IlluminaHumanMethylation450kmanifest") #Code copied from 450k Guide getGEOSuppFiles("GSE68777") untar("GSE68777/GSE68777_RAW.tar", exdir = "GSE68777/idat") head(list.files("GSE68777/idat", pattern = "idat")) idatFiles <- list.files("GSE68777/idat", pattern = "idat.gz$", full = TRUE) sapply(idatFiles, gunzip, overwrite = TRUE) rgSet <- read.metharray.exp("GSE68777/idat") geoMat <- getGEO("GSE68777") pD.all <- pData(geoMat[[1]]) pD <- pD.all[, c("title", "geo_accession", "characteristics_ch1.1", "characteristics_ch1.2")] names(pD)[c(3,4)] <- c("group", "sex") pD$group <- sub("^diagnosis: ", "", pD$group) pD$sex <- sub("^Sex: ", "", pD$sex) sampleNames(rgSet) <- sub(".*_5", "5", sampleNames(rgSet)) rownames(pD) <- pD$title pD <- pD[sampleNames(rgSet),] pData(rgSet) <- pD
Do you think the beta values I'm getting now are more accurate and that I should just continue using the code of the newer minfi version?
I have no idea. In this context, accurate means 'more closely represents the true underlying methylation that is being estimated'. Without knowing what the true underlying methylation is, how would anybody know if the new values are better than the old?
That said, some of the changes to minfi are intended to better control for known technical artifacts, and are intended to be improvements over previous code. So I think it would be a reasonable assumption that newer versions of minfi are better than older versions for various reasons, and would produce better results. But that is a non-specific 'better', as I don't know what exactly the submitters did to produce their data, and I haven't perused your code, so I can't say something like 'Well you used <some improved method goes here> and that should make your estimates better because of <some rationale that the improved method is better>.'
As the person analyzing the data, it's actually your job to know these things and have a reason for doing whatever it is that you are doing. Relying on some dude on a support site is not really a good strategy, as in an ideal world you would be able to tell your boss/collaborator/whatever what you did and why.