Question: microarray data analsis with limma?
0
3.3 years ago by
Germany/Dresden/ CRTD - DZNE
Mehmet Ilyas Cosacak0 wrote:

Hi Everyone,

I am trying to analyse a micro array data and wonder if anyone could check if I am doing something wrong.

I am analyzing the microarray data from (Hamilton et al. 2015: Cell Stem Cell. 2015 Oct 1;17(4):397-411).

http://www.sciencedirect.com/science/article/pii/S1934590915003562

my script is as below:

library("limma")
library("plyr")
targets <- data.frame(rbind(
c("GSE60460_Pair-E-Samples9+10.gpr","3xTg","WT"),
c("GSE60460_Pair-F-Samples11+12.gpr","3xTg","WT"),
c("GSE60460_Pair-G-Samples13+14.gpr","3xTg","WT"),
c("GSE60460_Pair-H-Samples15+16.gpr","3xTg","WT")
))
colnames(targets) <- c("FileName","Cy5","Cy3")
rownames(targets) <- targets$FileName # Required files can be downloaded from: # "GSE60460_Pair-E-Samples9+10.gpr" --> http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60460 # "GSE60460_Pair-F-Samples11+12.gpr" --> http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60460 # "GSE60460_Pair-G-Samples13+14.gpr" --> http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60460 # "GSE60460_Pair-H-Samples15+16.gpr" --> http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60460 # and IDs that have been used in Hamilton et al. 2015. (http://www.sciencedirect.com/science/article/pii/S1934590915003562) # "GSE60460_series_matrix.txt" --> ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE60nnn/GSE60460/matrix/ # "GPL10787-9758.txt" --> http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL10787 targets fF <- function(x) as.numeric(x$Flags > -99)
calMeanDupIds <- function(rawMatrix, col4Mean, col4Values){
initialColID <- colnames(rawMatrix)[col4Mean]
colnames(rawMatrix)[col4Mean] <- "col2Use"
empIds <- rawMatrix$col2Use == "" # check for empty IDs! empIdsMatrix <- rawMatrix[empIds,] remainedIds <- rawMatrix[!empIds,] d <- duplicated(remainedIds$col2Use)
duplicList <- remainedIds[d,]
uniqueList <- remainedIds[!d,]
tmp_duplicList <- uniqueList[(uniqueList$col2Use %in% duplicList$col2Use),]
uniqueList <- uniqueList[!(uniqueList$col2Use %in% duplicList$col2Use),]
duplicList <- rbind(duplicList,tmp_duplicList)
duplicListIDs <- duplicList[,1:(col4Values-1)]
duplicList <- ddply(duplicList[,c(col4Mean,col4Values:ncol(duplicList))],.(col2Use), colwise(mean))
duplicList <- merge(duplicListIDs,duplicList, by = "col2Use")
duplicListIDs <- duplicListIDs[order(duplicListIDs$col2Use),] duplicList <- duplicList[order(duplicList$col2Use),]
duplicList <- cbind(duplicListIDs,duplicList[,col4Values:ncol(duplicList)])
if(nrow(empIdsMatrix) != 0){
result <- rbind(empIdsMatrix,duplicList,uniqueList)
}else{
result <- rbind(duplicList,uniqueList)
}
colnames(result)[col4Mean] <- initialColID
return(result)
}
annotIDs <- annotIDs[,1:16]
annIds <- annotIDs[,c("ID","ENSEMBL_ID","GENE")]
RG <- read.maimages(targets, source = "genepix.median", wt.fun = fF)
RG <- backgroundCorrect(RG, method = "normexp", offset = 50)
RG <- RG[RG$genes$ID %in% annotIDs$ID,] RG$genes <- merge(RG$genes,annIds, by = "ID") tmpValues <- cbind(RG$genes,RG$R,RG$G)
col4Mean <- 1
col4Values <- 8
tmpValues <- calMeanDupIds(tmpValues, col4Mean, col4Values)
RG$genes <- tmpValues[,1:(col4Values-1)] RG$R <- tmpValues[,col4Values:(col4Values+3)]
RG$G <- tmpValues[,(col4Values+4):(col4Values+7)] RG <- RG[!(duplicated(RG$genes$ID)),] annotIDs <- annotIDs[(annotIDs$ID %in% iDs$ID_REF),] RG <- RG[RG$genes$ID %in% annotIDs$ID,]
MA <- normalizeWithinArrays(RG , method = "loess")
MA <- normalizeBetweenArrays(MA, method="quantile", "normalizeQuantiles")
fit <- lmFit(MA)
fit <- eBayes(fit)
y <- topTable(fit, n = length(fit$genes$ID))
table(abs(y$logFC) >= 1.0) FALSE TRUE 32495 189 table(abs(y$logFC) >= 1.4 & y$P.Value < 0.005) FALSE TRUE 32632 52 table(y$logFC >= 1.4 & y$P.Value < 0.005) FALSE TRUE 32657 27 table(y$logFC <= -1.4 & y$P.Value < 0.005) FALSE TRUE 32659 25 In the original papers, 448 probes down regulated and 545probes up regulated (Figure-5C). This is the summary of the data analysis that have been done in Hamilton et al. 2015: "... Hybridizations were carried out using the SurePrint G3 Mouse GE 8x60K Microarray (Agilent Technologies) containing probes targeting 39,430 Entre Gene RNAs and 16,251 lincRNAs. Arrays were scanned at 5μm resolution using a GenePix4000B scanner (Molecular Devices). Data from scanned images were extracted using GenePix 6.1 (Axon). Chip annotations file was updated to a more recent version (20130207). Expression data was analyzed using Bioconductor packages (http://www.bioconductor.org/) and R statistical language www.r-project.org). Bioconductor limma (Smyth et al., 2005) package was used to background correct, log2-transform, aggregate and quantile-normalize the median probe intensities. Probe intensities were averaged when more than one probe was associated with a given probe identifier. The resulting matrix showing 55,681 probes as rows and 8 samples as columns was filtered to exclude lincRNA entries and probes in intensities below background in more than half of the samples. 32,684 probes were used as input for linear modeling using the method available in limma which uses a moderated t-statistic to assess significance. A 3xTg-AD vs WT in 7 months was evaluated and probes presenting a p-value lower than 0.005 and a foldchange greater than 1.4 or lower than –1.4 were considered for further analysis. ..." This is the summary of the data analysis that have been done in GEO: ""Bioconductor limma26 package was used to background correct, log2-transform, aggregate and quantile-normalize the median probe intensities. Probe intensities were averaged when more than one probe was associated with a given probe identifier."" best, ilyas. ADD COMMENTlink modified 3.3 years ago • written 3.3 years ago by Mehmet Ilyas Cosacak0 Hi Gordon, I am really grateful and thank you very much for your help. Yes, definitely you are right. I automatically assumed that "fold change cut-off" as "log2 fold change"! Now, it makes sense. avereps(), I did not know this function! Really helpful and quicker! Thanks a lot! ilyas. ADD REPLYlink written 3.3 years ago by Mehmet Ilyas Cosacak0 Answer: microarray data analsis with limma? 4 3.3 years ago by Gordon Smyth38k Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia Gordon Smyth38k wrote: Here are a few issues: First, the original paper used a fold change cutoff of 1.4, whereas you have used a logFC cutoff of 1.4, which corresponds to a fold change cutoff of 2^1.4 = 2.6. So naturally you have fewer DE genes. It is not correct to average over duplicate Ids before normalization. If you want to take average over duplicate (and I don't usually do it) then just run MA2 <- avereps(MA, ID) after normalization. This is no need to write your own function to do this sort of thing. avereps() will be very much quicker, and will take account of the weights, which your code is ignoring. It is also not correct to quantile normalize after loess normalization, so better to omit the normalizeBetweenArrays() step. The normalizeBetweenArrays call given in your post will give an error anyway. It would improve the analysis to filter low expressed probes or alternatively to use eBayes() with trend=TRUE. Using FDR control would be far better than repeating the kludgy ad hoc DE cutoff from the Cell Stem Cell paper. ADD COMMENTlink modified 3.3 years ago • written 3.3 years ago by Gordon Smyth38k Hi Gordon, First of all, sorry for my stupid questions! I am trying to understand clearly. What do you mean by FDR control? Shall I consider adj.P.val (adjusted p-value) for cut-off? or shall I do a multiple correction test with treat? by the way, this is the current status of the code based on your suggestion: library("limma") targets <- data.frame(rbind( c("GSE60460_Pair-E-Samples9+10.gpr","3xTg","WT"), c("GSE60460_Pair-F-Samples11+12.gpr","3xTg","WT"), c("GSE60460_Pair-G-Samples13+14.gpr","3xTg","WT"), c("GSE60460_Pair-H-Samples15+16.gpr","3xTg","WT"))) colnames(targets) <- c("FileName","Cy5","Cy3") rownames(targets) <- targets$FileName
# "GSE60460_Pair-E-Samples9+10.gpr"  --> http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60460
# "GSE60460_Pair-F-Samples11+12.gpr" --> http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60460
# "GSE60460_Pair-G-Samples13+14.gpr" --> http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60460
# "GSE60460_Pair-H-Samples15+16.gpr" --> http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60460
# and IDs that have been used in Hamilton et al. 2015. (http://www.sciencedirect.com/science/article/pii/S1934590915003562)
# "GSE60460_series_matrix.txt" -->  ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE60nnn/GSE60460/matrix/
# "GPL10787-9758.txt" --> http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL10787
fF <- function(x) as.numeric(x$Flags > -99) RG <- read.maimages(targets, source = "genepix.median", wt.fun = fF) RG <- backgroundCorrect(RG, method = "normexp", offset = 50) MA <- normalizeWithinArrays(RG, method = "median") MA <- normalizeBetweenArrays(MA, method="Aquantile","normalizeQuantiles") MA <- avereps(MA, ID = MA$genes$ID) fit <- lmFit(MA) fit <- eBayes(fit, trend = TRUE) y <- topTable(fit, n = row(fit$genes)

thank you very much in advance!

ilyas.

1

normalizeWithinArrays() with method="loess" (as you had before) was much better.

"normalizeQuantiles" isn't a valid argument for normalizeBetweenArrays(). No need for it.

You would do well to filter control probes (as always with Agilent arrays). Suggest you try this:

AnnCol <- c("ID","Name","ControlType","GeneName","TopHit","Description")
x <- read.maimages(targets,source="genepix.median",annotation=AnnCol)

Thank you very very much Gordon for your help!

Answer: microarray data analsis with limma?
1
3.3 years ago by
Aaron Lun24k
Cambridge, United Kingdom
Aaron Lun24k wrote:

Trying to reproduce someone else's analyses from a plain-English description is a thankless task. There's a lot of room for interpretation, e.g., what are the background correction options? Did they do loess normalization within each array? You've also done a lot of manipulation before running limma; did the authors do the same? I would suggest contacting the authors and asking them directly for the R script they used to do their analysis. If the script you're showing in your post is from the authors, and you still can't reproduce the results... well, something's wrong. Best to ask whether the authors can reproduce the results on their own system, using only public data. Obviously, you should try to match whatever package versions they used, though this is easier said than done.

Hi Aaron,

I asked the author (corresponding) for the R script they used and did not get any reply (more than 1 month). That is why actually I am posting. I will try again.

But at least, we could get similar expression pattern.

is the logic of my analysis correct? or What would you do to perform the micro array analysis. I will use the data to compare with our RNA-Seq data. That is why I am trying to reproduce at least similar expression pattern as done in the paper.

best,

ilyas.

For starters, I would have a look at the genes that were reported as being DE in the paper. If they're not DE (or at least in the same direction) in your analysis, then there's clearly something wrong. Also, for the sake of trying to reproduce things, I would suggest that you cut out a lot of your extra analysis code, just in case it changes things (or is wrong). Get rid of calMeanDupIds, etc.; only put in the minimal set of analysis steps corresponding to what was described in the paper.

calMeanDupIds is just used to calculate the mean of probe ids. If there are many probes with the same id, I calculated the mean of them. That is also written in the GEO. or you can use ddply directly on RG\$genes.

"Bioconductor limma26 package was used to background correct, log2-transform, aggregate and quantile-normalize the median probe intensities.  Probe intensities were averaged when more than one probe was associated with a given probe identifier."

They don't seem to have done multiple testing correction which I think is a bit dodgy.