Hi Everyone,
I am trying to analyse a micro array data and wonder if anyone could check if I am doing something wrong.
I am analyzing the microarray data from (Hamilton et al. 2015: Cell Stem Cell. 2015 Oct 1;17(4):397-411).
http://www.sciencedirect.com/science/article/pii/S1934590915003562
my script is as below:
library("limma") library("plyr") targets <- data.frame(rbind( c("GSE60460_Pair-E-Samples9+10.gpr","3xTg","WT"), c("GSE60460_Pair-F-Samples11+12.gpr","3xTg","WT"), c("GSE60460_Pair-G-Samples13+14.gpr","3xTg","WT"), c("GSE60460_Pair-H-Samples15+16.gpr","3xTg","WT") )) colnames(targets) <- c("FileName","Cy5","Cy3") rownames(targets) <- targets$FileName # Required files can be downloaded from: # "GSE60460_Pair-E-Samples9+10.gpr" --> http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60460 # "GSE60460_Pair-F-Samples11+12.gpr" --> http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60460 # "GSE60460_Pair-G-Samples13+14.gpr" --> http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60460 # "GSE60460_Pair-H-Samples15+16.gpr" --> http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60460 # and IDs that have been used in Hamilton et al. 2015. (http://www.sciencedirect.com/science/article/pii/S1934590915003562) # "GSE60460_series_matrix.txt" --> ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE60nnn/GSE60460/matrix/ # "GPL10787-9758.txt" --> http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL10787 targets fF <- function(x) as.numeric(x$Flags > -99) calMeanDupIds <- function(rawMatrix, col4Mean, col4Values){ initialColID <- colnames(rawMatrix)[col4Mean] colnames(rawMatrix)[col4Mean] <- "col2Use" empIds <- rawMatrix$col2Use == "" # check for empty IDs! empIdsMatrix <- rawMatrix[empIds,] remainedIds <- rawMatrix[!empIds,] d <- duplicated(remainedIds$col2Use) duplicList <- remainedIds[d,] uniqueList <- remainedIds[!d,] tmp_duplicList <- uniqueList[(uniqueList$col2Use %in% duplicList$col2Use),] uniqueList <- uniqueList[!(uniqueList$col2Use %in% duplicList$col2Use),] duplicList <- rbind(duplicList,tmp_duplicList) duplicListIDs <- duplicList[,1:(col4Values-1)] duplicList <- ddply(duplicList[,c(col4Mean,col4Values:ncol(duplicList))],.(col2Use), colwise(mean)) duplicList <- merge(duplicListIDs,duplicList, by = "col2Use") duplicListIDs <- duplicListIDs[order(duplicListIDs$col2Use),] duplicList <- duplicList[order(duplicList$col2Use),] duplicList <- cbind(duplicListIDs,duplicList[,col4Values:ncol(duplicList)]) if(nrow(empIdsMatrix) != 0){ result <- rbind(empIdsMatrix,duplicList,uniqueList) }else{ result <- rbind(duplicList,uniqueList) } colnames(result)[col4Mean] <- initialColID return(result) } annotIDs <- read.delim("GPL10787-9758.txt",sep = "\t") annotIDs <- annotIDs[,1:16] iDs <- read.delim("GSE60460_series_matrix.txt",sep = "\t") annIds <- annotIDs[,c("ID","ENSEMBL_ID","GENE")] #targets <- readTargets("Targets.txt") targets <- readTargets() RG <- read.maimages(targets, source = "genepix.median", wt.fun = fF) RG <- backgroundCorrect(RG, method = "normexp", offset = 50) RG <- RG[RG$genes$ID %in% annotIDs$ID,] RG$genes <- merge(RG$genes,annIds, by = "ID") tmpValues <- cbind(RG$genes,RG$R,RG$G) col4Mean <- 1 col4Values <- 8 tmpValues <- calMeanDupIds(tmpValues, col4Mean, col4Values) RG$genes <- tmpValues[,1:(col4Values-1)] RG$R <- tmpValues[,col4Values:(col4Values+3)] RG$G <- tmpValues[,(col4Values+4):(col4Values+7)] RG <- RG[!(duplicated(RG$genes$ID)),] annotIDs <- annotIDs[(annotIDs$ID %in% iDs$ID_REF),] RG <- RG[RG$genes$ID %in% annotIDs$ID,] MA <- normalizeWithinArrays(RG , method = "loess") MA <- normalizeBetweenArrays(MA, method="quantile", "normalizeQuantiles") fit <- lmFit(MA) fit <- eBayes(fit) y <- topTable(fit, n = length(fit$genes$ID)) table(abs(y$logFC) >= 1.0) FALSE TRUE 32495 189 table(abs(y$logFC) >= 1.4 & y$P.Value < 0.005) FALSE TRUE 32632 52
table(y$logFC >= 1.4 & y$P.Value < 0.005) FALSE TRUE 32657 27 table(y$logFC <= -1.4 & y$P.Value < 0.005) FALSE TRUE 32659 25
In the original papers, 448 probes down regulated and 545probes up regulated (Figure-5C).
This is the summary of the data analysis that have been done in Hamilton et al. 2015:
"... Hybridizations were carried out using the SurePrint G3 Mouse GE 8x60K Microarray (Agilent Technologies) containing probes targeting 39,430 Entre Gene RNAs and 16,251 lincRNAs. Arrays were scanned at 5μm resolution using a GenePix4000B scanner (Molecular Devices). Data from scanned images were extracted using GenePix 6.1 (Axon). Chip annotations file was updated to a more recent version (20130207). Expression data was analyzed using Bioconductor packages (http://www.bioconductor.org/) and R statistical language www.r-project.org). Bioconductor limma (Smyth et al., 2005) package was used to background correct, log2-transform, aggregate and quantile-normalize the median probe intensities. Probe intensities were averaged when more than one probe was associated with a given probe identifier. The resulting matrix showing 55,681 probes as rows and 8 samples as columns was filtered to exclude lincRNA entries and probes in intensities below background in more than half of the samples. 32,684 probes were used as input for linear modeling using the method available in limma which uses a moderated t-statistic to assess significance. A 3xTg-AD vs WT in 7 months was evaluated and probes presenting a p-value lower than 0.005 and a foldchange greater than 1.4 or lower than –1.4 were considered for further analysis. ..."
This is the summary of the data analysis that have been done in GEO:
""Bioconductor limma26 package was used to background correct, log2-transform, aggregate and quantile-normalize the median probe intensities. Probe intensities were averaged when more than one probe was associated with a given probe identifier.""
best,
ilyas.
Hi Gordon,
I am really grateful and thank you very much for your help. Yes, definitely you are right. I automatically assumed that "fold change cut-off" as "log2 fold change"! Now, it makes sense.
avereps()
, I did not know this function! Really helpful and quicker!Thanks a lot!
ilyas.